./bin/hdata --reader READER_NAME -Rk1=v1 -Rk2=v2 --writer WRITER_NAME -Wk1=v1 -Wk2=v2
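READER_NAME and WRITER_NAME are the names of the reader and writer plugins, for example jdbc or hive. Reader plugin parameters are prefixed with -R, and writer plugin parameters with -W.
Example (MySQL -> Hive):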
./bin/hdata --reader jdbc -Rurl="jdbc:mysql://127.0.0.1:3306/testdb" -Rdriver="com.mysql.jdbc.Driver" -Rtable="testtable" -Rusername="username" -Rpassword="password" -Rparallelism=3 --writer hive -Wmetastore.uris="thrift://127.0.0.1:9083" -Whdfs.conf.path="/path/to/hdfs-site.xml" -Wdatabase="default" -Wtable="testtable" -Whadoop.user="hadoop" -Wparallelism=2
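The -R/-W prefix convention is easy to script. The following is a minimal sketch, not part of HData itself: `build_hdata_args` is a hypothetical helper that turns two option dictionaries into the argument list shown above, assuming the launcher is at ./bin/hdata.

```python
import shlex


def build_hdata_args(reader, reader_opts, writer, writer_opts):
    """Assemble the ./bin/hdata argument list (hypothetical helper)."""
    args = ["./bin/hdata", "--reader", reader]
    # Reader options carry the -R prefix, e.g. {"table": "t"} -> -Rtable=t
    args += [f"-R{key}={value}" for key, value in reader_opts.items()]
    args += ["--writer", writer]
    # Writer options carry the -W prefix
    args += [f"-W{key}={value}" for key, value in writer_opts.items()]
    return args


args = build_hdata_args(
    "jdbc",
    {"url": "jdbc:mysql://127.0.0.1:3306/testdb", "table": "testtable", "parallelism": 3},
    "hive",
    {"database": "default", "table": "testtable", "parallelism": 2},
)
# shlex.join quotes each argument for a POSIX shell where needed
print(shlex.join(args))
```

Passing the list straight to subprocess.run(args) avoids shell quoting problems with characters such as & in JDBC URLs.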
2. XML configuration
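job.xml
<?xml version="1.0" encoding="UTF-8"?>
<job id="job_example">
    <reader name="jdbc">
        <url>jdbc:mysql://127.0.0.1:3306/testdb</url>
        <driver>com.mysql.jdbc.Driver</driver>
        <table>testtable</table>
        <username>username</username>
        <password>password</password>
        <parallelism>3</parallelism>
    </reader>
    <writer name="hive">
        <metastore.uris>thrift://127.0.0.1:9083</metastore.uris>
        <hdfs.conf.path>/path/to/hdfs-site.xml</hdfs.conf.path>
        <database>default</database>
        <table>testtable</table>
        <hadoop.user>hadoop</hadoop.user>
        <parallelism>2</parallelism>
    </writer>
</job>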
Run it with:
./bin/hdata -f /path/to/job.xml
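A job file can also be generated rather than written by hand. The sketch below assumes the layout is a job root element with one reader and one writer child, each carrying a name attribute and one child element per parameter; `build_job_xml` is a hypothetical helper, and whether HData needs further attributes on the root element is not checked here.

```python
import xml.etree.ElementTree as ET


def build_job_xml(reader, reader_opts, writer, writer_opts):
    """Serialize reader/writer options into a job XML string (assumed layout)."""
    job = ET.Element("job")
    for role, name, opts in (("reader", reader, reader_opts),
                             ("writer", writer, writer_opts)):
        plugin = ET.SubElement(job, role, name=name)
        for key, value in opts.items():
            # One child element per parameter, named after the parameter key
            ET.SubElement(plugin, key).text = str(value)
    # ElementTree escapes & and < in text, so URLs with query strings stay valid XML
    return ET.tostring(job, encoding="unicode")


xml_text = build_job_xml(
    "jdbc", {"url": "jdbc:mysql://127.0.0.1:3306/testdb?a=1&b=2", "table": "testtable"},
    "hive", {"metastore.uris": "thrift://127.0.0.1:9083", "table": "testtable"},
)
print(xml_text)
```

Because the serializer escapes special characters, the generated file does not hit the entity-reference parse error that a hand-written URL containing & can cause.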
2. Hands-on usage
[Download from GitHub]
https://github.com/uptonking/hdata
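[Build environment]
JDK 1.7 (it must be 1.7)
Maven 3+
[Configure Maven]
It is best to use a separate settings.xml for this build so that other projects are not affected.
hdata-settings.xml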
D:/apache-maven-3.5.4/repository
*
mirror-all
http://mirrors.cloud.tencent.com/nexus/repository/maven-public/
custom
jdk-1.7
true
1.7
1.7
1.7
1.7
maven-home
central
https://repo1.maven.org/maven2
true
warn
false
central
http://repo1.maven.org/maven2
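Note: some dependencies cannot be downloaded from the central repository, and the aliyun mirror fails as well, so several mirrors are configured. Use whichever one downloads successfully; retry a few times and switch between them by hand.
[Build the package]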
mvn clean package -Pmake-package --settings D:\apache-maven-3.5.4\conf\hdata-settings.xml -Dmaven.test.skip=true
Packaging takes a while because a large number of dependencies have to be downloaded. Wait until the build completes.
....
....
.....
[INFO] Reading assembly descriptor: src/build/package.xml
[INFO] Building tar: D:\Workspaces\idea_2\HData-master\assembly\..\build\hdata-0.2.8.tar.gz
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] HData 0.2.8 ........................................ SUCCESS [ 18.604 s]
[INFO] hdata-api .......................................... SUCCESS [ 14.069 s]
[INFO] hdata-core ......................................... SUCCESS [ 4.442 s]
[INFO] hdata-console ...................................... SUCCESS [ 0.234 s]
[INFO] hdata-csv .......................................... SUCCESS [ 0.632 s]
[INFO] hdata-jdbc ......................................... SUCCESS [ 1.343 s]
[INFO] hdata-ftp .......................................... SUCCESS [ 0.753 s]
[INFO] hdata-http ......................................... SUCCESS [ 0.234 s]
[INFO] hdata-kafka ........................................ SUCCESS [ 6.452 s]
[INFO] hdata-hdfs ......................................... SUCCESS [ 20.008 s]
[INFO] hdata-hive ......................................... SUCCESS [ 30.733 s]
[INFO] hdata-hbase ........................................ SUCCESS [ 35.710 s]
[INFO] hdata-mongodb ...................................... SUCCESS [ 3.634 s]
[INFO] hdata-excel ........................................ SUCCESS [ 14.700 s]
[INFO] hdata-wit .......................................... SUCCESS [ 4.037 s]
[INFO] hdata-assembly 0.2.8 ............................... SUCCESS [ 17.239 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:53 min
[INFO] Finished at: 2020-06-16T12:12:33+08:00
[INFO] ------------------------------------------------------------------------
[Check the directory layout]
D:\Workspaces\idea_2\HData-master>dir
...
2020/06/16  12:12    <DIR>          .
2020/06/16  12:12    <DIR>          ..
2018/01/11  19:22                14 .gitignore
2020/06/16  15:45    <DIR>          .idea
2020/06/16  12:12    <DIR>          assembly
2020/06/15  13:49    <DIR>          bin
2020/06/16  14:06    <DIR>          build
2020/06/15  13:49    <DIR>          conf
2020/06/15  13:49    <DIR>          doc
2020/06/16  12:10    <DIR>          hdata-api
2020/06/16  12:10    <DIR>          hdata-console
2020/06/16  12:10    <DIR>          hdata-core
2020/06/16  12:10    <DIR>          hdata-csv
2020/06/16  12:12    <DIR>          hdata-excel
2020/06/16  12:10    <DIR>          hdata-ftp
2020/06/16  12:11    <DIR>          hdata-hbase
2020/06/16  12:10    <DIR>          hdata-hdfs
2020/06/16  12:11    <DIR>          hdata-hive
2020/06/16  12:10    <DIR>          hdata-http
2020/06/16  12:10    <DIR>          hdata-jdbc
2020/06/16  12:10    <DIR>          hdata-kafka
2020/06/16  12:11    <DIR>          hdata-mongodb
2020/06/16  12:12    <DIR>          hdata-wit
2020/06/15  13:52               574 hdata.iml
2020/06/16  11:44             5,337 pom.xml
2018/01/11  19:22            13,774 README.md
2020/06/16  12:10    <DIR>          target
...
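A new build directory has appeared; it holds the build output. Inside it is hdata-0.2.8.tar.gz, the packaged, runnable distribution. Extract it and run commands from its root directory.
[Run a sync]
Extract hdata-0.2.8.tar.gz and copy the result to D:/test/.
D:\test\hdata-0.2.8>dir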
....
2020/06/16  16:06    <DIR>          .
2020/06/16  16:06    <DIR>          ..
2020/06/16  14:06    <DIR>          bin
2020/06/16  16:17    <DIR>          conf
2020/06/16  14:06    <DIR>          lib
2020/06/16  14:06    <DIR>          plugins
...
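bin: the hdata and hdata.bat launch scripts, for Linux and Windows respectively;
conf: the hdata configuration, such as buffer size and thread wait strategy;
lib: the jars of the hdata core framework;
plugins: the hdata plugins and the jars they depend on.
- [MySQL to MySQL sync]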
java -Xss256k -Xms1G -Xmx1G -Xmn512M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:SoftRefLRUPolicyMSPerMB=0 -Dhdata.conf.dir="D:\test\hdata-0.2.8\conf" -Dlog4j.configurationFile=file:///D:\test\hdata-0.2.8\conf\log4j2.xml -classpath ".;D:\test\hdata-0.2.8\lib\*" "com.github.stuxuhai.hdata.CliDriver" -Dhttps.protocols=TLSv1.2 -Dfile.encoding="UTF-8" --reader jdbc -Rurl="jdbc:mysql://127.0.0.1:3306/gateway?characterEncoding=utf8&useSSL=false" -Rdriver="com.mysql.jdbc.Driver" -Rtable="client" -Rusername="root" -Rpassword="123456" --writer jdbc -Wurl="jdbc:mysql://192.168.1.35:3306/gateway?characterEncoding=utf8&useSSL=false" -Wdriver="com.mysql.jdbc.Driver" -Wtable="client" -Wusername="root" -Wpassword="root_it_123465"
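The long command above has four parts: JVM options, the classpath, the main class, and the hdata arguments. As a rough sketch of that structure, the hypothetical helper `build_java_command` below rebuilds a shortened version of it; the install directory and heap size are illustrative parameters, and most of the GC flags are left out for brevity.

```python
import os


def build_java_command(home, hdata_args, heap="1G"):
    """Build a trimmed-down java launch command for HData (illustrative only)."""
    conf_dir = os.path.join(home, "conf")
    # os.pathsep is ';' on Windows and ':' elsewhere, matching the java classpath rules
    classpath = os.pathsep.join([".", os.path.join(home, "lib", "*")])
    return [
        "java",
        f"-Xms{heap}", f"-Xmx{heap}",             # JVM options
        f"-Dhdata.conf.dir={conf_dir}",
        "-classpath", classpath,                  # core framework jars
        "com.github.stuxuhai.hdata.CliDriver",    # main class, as in the command above
        *hdata_args,                              # --reader/--writer arguments
    ]


cmd = build_java_command(
    os.path.join("D:", os.sep, "test", "hdata-0.2.8"),
    ["--reader", "jdbc", "-Rtable=client", "--writer", "jdbc", "-Wtable=client"],
)
print(cmd)
```

Keeping the command as a list means each argument, including URLs that contain &, reaches the JVM unchanged when passed to subprocess.run(cmd).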
Running with an XML configuration file:
hdata.bat -f D:/test/hdata-0.2.8/conf/mysqlToMysql.xml
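mysqlToMysql.xml
reader (jdbc): driver=com.mysql.jdbc.Driver, table=client, username=root, password=123456, parallelism=3
writer (jdbc): driver=com.mysql.jdbc.Driver, table=client, username=root, password=123456, parallelism=3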
- [MySQL to Excel sync]
java -Xss256k -Xms1G -Xmx1G -Xmn512M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:SoftRefLRUPolicyMSPerMB=0 -Dhdata.conf.dir="D:\test\hdata-0.2.8\conf" -Dlog4j.configurationFile=file:///D:\test\hdata-0.2.8\conf\log4j2.xml -classpath ".;D:\test\hdata-0.2.8\lib\*" "com.github.stuxuhai.hdata.CliDriver" -Dhttps.protocols=TLSv1.2 -Dfile.encoding="UTF-8" --reader jdbc -Rurl="jdbc:mysql://127.0.0.1:3306/gateway?characterEncoding=utf8&useSSL=false" -Rdriver="com.mysql.jdbc.Driver" -Rtable="client" -Rusername="root" -Rpassword="123456" --writer excel -Wpath="D://test//client.xlsx" -Winclude.column.names="true"
Running with an XML configuration file:
hdata.bat -f D:/test/hdata-0.2.8/conf/mysqlToExcel.xml
mysqlToExcel.xml
reader (jdbc): driver=com.mysql.jdbc.Driver, table=client, username=root, password=123456, parallelism=3
writer (excel): include.column.names=true
- [HTTP to Excel sync]
java -Xss256k -Xms1G -Xmx1G -Xmn512M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:SoftRefLRUPolicyMSPerMB=0 -Dhdata.conf.dir="D:\test\hdata-0.2.8\conf" -Dlog4j.configurationFile=file:///D:\test\hdata-0.2.8\conf\log4j2.xml -classpath ".;D:\test\hdata-0.2.8\lib\*" "com.github.stuxuhai.hdata.CliDriver" -Dhttps.protocols=TLSv1.2 -Dfile.encoding="UTF-8" --reader http -Rurl="https://www.baidu.com/" --writer excel -Wpath="D://test//html2.xlsx" -Winclude.column.names="true"
Running with an XML configuration file:
hdata.bat -f D:/test/hdata-0.2.8/conf/httpToExcel.xml
httpToExcel.xml
writer (excel): include.column.names=true
3. Viewing the code in IDEA
1. Import the project
2. Configure the launch parameters
Run Configuration settings:
VM Options: -Dhttps.protocols=TLSv1.2 -Dhdata.conf.dir="D:\Workspaces\idea_2\HData-master\conf"
Program arguments: --reader jdbc -Rurl="jdbc:mysql://127.0.0.1:3306/gateway?characterEncoding=utf8&useSSL=false" -Rdriver="com.mysql.jdbc.Driver" -Rtable="client" -Rusername="root" -Rpassword="123456" --writer jdbc -Wurl="jdbc:mysql://192.168.1.35:3306/gateway?characterEncoding=utf8&useSSL=false" -Wdriver="com.mysql.jdbc.Driver" -Wtable="client" -Wusername="root" -Wpassword="123456"
3. Run the CliDriver class in the hdata-core module directly.
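4. Problems
1. The reference to entity "useSSL" must end with the ';' delimiter.
[Fatal Error] mysqlToMysql.xml:5:73: The reference to entity "useSSL" must end with the ';' delimiter.
The XML content contains special characters, and parsing fails: the & in the JDBC URL is read as the start of an entity reference. Wrap the content in <![CDATA[ ]]> to fix it.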
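The failure and two ways around it can be reproduced outside HData. The sketch below uses Python's standard XML parser rather than the Java parser HData runs on, so the error text differs, but the cause is the same. Besides the CDATA wrapper, escaping & as &amp; is a standard XML alternative; the url element name is taken from the job files above.

```python
import xml.etree.ElementTree as ET

URL = "jdbc:mysql://127.0.0.1:3306/gateway?characterEncoding=utf8&useSSL=false"

raw = f"<url>{URL}</url>"                                 # bare & : not well-formed
cdata = f"<url><![CDATA[{URL}]]></url>"                   # wrapped in CDATA
escaped = "<url>" + URL.replace("&", "&amp;") + "</url>"  # & written as &amp;


def try_parse(xml_text):
    """Return the element text, or a short message if parsing fails."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError as err:
        return f"ParseError: {err}"


for label, text in (("raw", raw), ("cdata", cdata), ("escaped", escaped)):
    print(label, "->", try_parse(text))
```

Only the raw variant fails; the CDATA and escaped variants both give back the original URL unchanged.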