DataX is an offline data synchronization tool/platform widely used inside Alibaba Group. It provides efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, OTS, and ODPS.
The currently supported data sources are:

Comparison with Sqoop:

Open the datax project in IDEA and run the mvn command:
mvn -U clean package assembly:assembly -Dmaven.test.skip=true

After the build succeeds, the packaged tar file is under the target directory; upload it to the chosen directory on the Linux host.
Open the datax-web project in IDEA and run the mvn command:
mvn clean install

After the build succeeds, the package is under the build directory; upload it to the Linux host as well.
Extract datax.tar.gz, enter the datax directory, run the bundled test job, and check the output:
python bin/datax.py job/job.json
- 2023-09-13 15:27:28.851 [job-0] INFO StandAloneJobContainerCommunicator - Total 100000 records, 2600000 bytes | Speed 253.91KB/s, 10000 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.023s | All Task WaitReaderTime 0.030s | Percentage 100.00%
- 2023-09-13 15:27:28.851 [job-0] INFO JobContainer -
- Job start time           : 2023-09-13 15:27:18
- Job end time             : 2023-09-13 15:27:28
- Total elapsed time       : 10s
- Average throughput       : 253.91KB/s
- Record write speed       : 10000rec/s
- Total records read       : 100000
- Total read/write failures: 0
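The summary line in the log above follows a fixed `key value | key value` layout, so a post-run sanity check is easy to script. A minimal sketch (the regular expressions are our own, not part of DataX):

```python
import re

# Sample DataX summary line, taken from the log output above.
line = ("Total 100000 records, 2600000 bytes | Speed 253.91KB/s, 10000 records/s | "
        "Error 0 records, 0 bytes | All Task WaitWriterTime 0.023s | "
        "All Task WaitReaderTime 0.030s | Percentage 100.00%")

def parse_summary(line):
    """Pull the key counters out of a StandAloneJobContainerCommunicator line."""
    total = int(re.search(r"Total (\d+) records", line).group(1))
    errors = int(re.search(r"Error (\d+) records", line).group(1))
    pct = float(re.search(r"Percentage ([\d.]+)%", line).group(1))
    return {"total": total, "errors": errors, "percentage": pct}

stats = parse_summary(line)
print(stats)  # {'total': 100000, 'errors': 0, 'percentage': 100.0}
```

A check like `stats["errors"] == 0 and stats["percentage"] == 100.0` is a convenient success criterion when wrapping DataX in a scheduler.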
Extract datax-web-2.1.2.tar.gz, enter the datax-web/bin directory, and run the install script:
- ./install.sh
- 2023-09-13 15:31:33.797 [INFO] (101530) Creating directory: [/data/etl/datax-web-2.1.2/bin/../modules].
- 2023-09-13 15:31:33.806 [INFO] (101530) ####### Start To Uncompress Packages ######
- 2023-09-13 15:31:33.809 [INFO] (101530) Uncompressing....
- Do you want to decompress this package: [datax-admin_2.1.2_1.tar.gz]? (Y/N)y
- 2023-09-13 15:32:46.982 [INFO] (101530) Uncompress package: [datax-admin_2.1.2_1.tar.gz] to modules directory
- Do you want to decompress this package: [datax-executor_2.1.2_1.tar.gz]? (Y/N)y
- 2023-09-13 15:33:03.246 [INFO] (101530) Uncompress package: [datax-executor_2.1.2_1.tar.gz] to modules directory
- 2023-09-13 15:33:03.485 [INFO] (101530) ####### Finish To Umcompress Packages ######
- Scan modules directory: [/data/etl/datax-web-2.1.2/bin/../modules] to find server under dataxweb
- 2023-09-13 15:33:03.489 [INFO] (101530) ####### Start To Install Modules ######
- 2023-09-13 15:33:03.491 [INFO] (101530) Module servers could be installed:
- [datax-admin] [datax-executor]
- Do you want to confiugre and install [datax-admin]? (Y/N)y
- 2023-09-13 15:33:09.255 [INFO] (101530) Install module server: [datax-admin]
- Start to make directory
- 2023-09-13 15:33:09.276 [INFO] (105228) Start to build directory
- 2023-09-13 15:33:09.278 [INFO] (105228) Creating directory: [/data/etl/datax-web-2.1.2/modules/datax-admin/bin/../logs].
- 2023-09-13 15:33:09.336 [INFO] (105228) Directory or file: [/data/etl/datax-web-2.1.2/modules/datax-admin/bin/../conf] has been exist
- 2023-09-13 15:33:09.338 [INFO] (105228) Creating directory: [/data/etl/datax-web-2.1.2/modules/datax-admin/bin/../data].
- end to make directory
- Start to initalize database
- 2023-09-13 15:33:09.487 [INFO] (105228) Scan out mysql command, so begin to initalize the database
- Do you want to initalize database with sql: [/data/etl/datax-web-2.1.2/bin/db/datax_web.sql]? (Y/N)y
- Please input the db host(default: 127.0.0.1):
- Please input the db port(default: 3306):
- Please input the db username(default: root):
- Please input the db password(default: ): XXXXXX
- Please input the db name(default: dataxweb)
- mysql: [Warning] Using a password on the command line interface can be insecure.
- Do you want to confiugre and install [datax-executor]? (Y/N)y
- 2023-09-13 15:34:10.863 [INFO] (101530) Install module server: [datax-executor]
- 2023-09-13 15:34:10.884 [INFO] (107668) Start to build directory
- 2023-09-13 15:34:10.886 [INFO] (107668) Creating directory: [/data/etl/datax-web-2.1.2/modules/datax-executor/bin/../logs].
- 2023-09-13 15:34:10.917 [INFO] (107668) Directory or file: [/data/etl/datax-web-2.1.2/modules/datax-executor/bin/../conf] has been exist
- 2023-09-13 15:34:10.919 [INFO] (107668) Creating directory: [/data/etl/datax-web-2.1.2/modules/datax-executor/bin/../data].
- 2023-09-13 15:34:10.953 [INFO] (107668) Creating directory: [/data/etl/datax-web-2.1.2/modules/datax-executor/bin/../json].
- 2023-09-13 15:34:10.984 [INFO] (101530) ####### Finish To Install Modules ######
Edit datax-web-2.1.2/modules/datax-executor/bin/env.properties:
- DATAX_ADMIN_PORT=9527
-
- ## PYTHON script location
- #PYTHON_PATH=/home/hadoop/install/datax/bin/datax.py
- PYTHON_PATH=/data/etl/datax/bin/datax.py
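The env.properties file is plain `KEY=VALUE` lines with `#` comments, so it can be read and checked before starting the executor. A minimal sketch (the path in the usage comment is the one from this install; adjust it to yours):

```python
import os

def load_properties(path):
    """Tiny .properties reader: KEY=VALUE lines, '#' comments, blanks ignored."""
    props = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props

# Example usage:
# props = load_properties("/data/etl/datax-web-2.1.2/modules/datax-executor/bin/env.properties")
# assert os.path.isfile(props["PYTHON_PATH"]), "PYTHON_PATH must point at datax.py"
```

Verifying `PYTHON_PATH` up front avoids the executor starting cleanly but failing later when it tries to launch datax.py.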
./start-all.sh
- 2023-09-13 15:36:26.441 [INFO] (112931) Try To Start Modules In Order
- 2023-09-13 15:36:26.448 [INFO] (112939) ####### Begin To Start Module: [datax-admin] ######
- 2023-09-13 15:36:26.454 [INFO] (112947) load environment variables
- 2023-09-13 15:36:26.755 [INFO] (112947) /data/base/jdk1.8.0_211/bin/java
- 2023-09-13 15:36:26.757 [INFO] (112947) Waiting DATAX-ADMIN to start complete ...
- 2023-09-13 15:36:26.964 [INFO] (112947) DATAX-ADMIN start success
- 2023-09-13 15:36:26.991 [INFO] (113168) ####### Begin To Start Module: [datax-executor] ######
- 2023-09-13 15:36:26.997 [INFO] (113176) load environment variables
- 2023-09-13 15:36:27.390 [INFO] (113176) /data/base/jdk1.8.0_211/bin/java
- 2023-09-13 15:36:27.393 [INFO] (113176) Waiting DATAX-EXEXUTOR to start complete ...
- 2023-09-13 15:36:27.763 [INFO] (113176) DATAX-EXEXUTOR start success
Open the web UI and log in with admin/123456:
http://xx.16.60.31:9527/index.html

Open the API documentation page:
http://xx.16.60.31:9527/doc.html

Create the table in MySQL and insert sample data:
- CREATE TABLE `user_behavior_mysql` (
- `user_id` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
- `item_id` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
- `category_id` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
- `behavior_type` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
- `time_stamp` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
- PRIMARY KEY (`user_id`, `item_id`, `time_stamp`) USING BTREE
- ) ENGINE = InnoDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = DYNAMIC;
- INSERT INTO `user_behavior_mysql` VALUES ('1', '1531036', '2920476', 'pv', '1511733732');
- INSERT INTO `user_behavior_mysql` VALUES ('1', '2268318', '2520377', 'pv', '1511544070');
- INSERT INTO `user_behavior_mysql` VALUES ('1', '230380', '411153', 'pv', '1511644942');
- INSERT INTO `user_behavior_mysql` VALUES ('1', '2333346', '2520771', 'pv', '1511561733');
- INSERT INTO `user_behavior_mysql` VALUES ('1', '2576651', '149192', 'pv', '1511572885');
- INSERT INTO `user_behavior_mysql` VALUES ('1', '3745169', '2891509', 'pv', '1511725471');
- INSERT INTO `user_behavior_mysql` VALUES ('1', '3827899', '2920476', 'pv', '1511713473');
- INSERT INTO `user_behavior_mysql` VALUES ('1', '3830808', '4181361', 'pv', '1511593493');
- INSERT INTO `user_behavior_mysql` VALUES ('1', '4365585', '2520377', 'pv', '1511596146');
- INSERT INTO `user_behavior_mysql` VALUES ('1', '4606018', '2735466', 'pv', '1511616481');
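The ten single-row INSERTs above can also be generated from the raw tuples, or collapsed into one multi-row statement. A small sketch using the first few values shown (safe here only because the literals are our own; use parameterized queries for untrusted input):

```python
# A few of the sample rows from the statements above.
rows = [
    ("1", "1531036", "2920476", "pv", "1511733732"),
    ("1", "2268318", "2520377", "pv", "1511544070"),
    ("1", "230380",  "411153",  "pv", "1511644942"),
]

def multi_row_insert(table, rows):
    """Collapse row tuples into a single multi-row INSERT statement."""
    values = ",\n  ".join(
        "(" + ", ".join("'%s'" % v for v in row) + ")" for row in rows
    )
    return "INSERT INTO `%s` VALUES\n  %s;" % (table, values)

print(multi_row_insert("user_behavior_mysql", rows))
```

A multi-row INSERT is also noticeably faster than row-by-row statements when seeding larger test sets.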
Create the matching table in Hive:
- CREATE TABLE `user_behavior_mysql` (
- `user_id` string,
- `item_id` string,
- `category_id` string,
- `behavior_type` string,
- `time_stamp` string
- ) row format delimited fields terminated by ",";
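Because the Hive table is declared with fields terminated by "," and the hdfswriter job below uses the same fieldDelimiter, each record lands in HDFS as one comma-joined text line that Hive can parse back column by column. A sketch of that round trip (our own illustration, not DataX code):

```python
# One record as DataX hands it to hdfswriter (all columns typed string here).
record = {"user_id": "1", "item_id": "1531036", "category_id": "2920476",
          "behavior_type": "pv", "time_stamp": "1511733732"}
columns = ["user_id", "item_id", "category_id", "behavior_type", "time_stamp"]

# Serialize the way a text-format hdfswriter with fieldDelimiter "," would.
line = ",".join(record[c] for c in columns)
print(line)  # 1,1531036,2920476,pv,1511733732

# Hive's "fields terminated by ','" splits it back into the same columns.
parsed = dict(zip(columns, line.split(",")))
assert parsed == record
```

If the two delimiters (or the column order) ever disagree, every field in Hive shifts silently, so it is worth keeping the DDL and the writer config side by side.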

The generated job JSON is as follows:
- {
- "job": {
- "setting": {
- "speed": {
- "channel": 3
- },
- "errorLimit": {
- "record": 0,
- "percentage": 0.02
- }
- },
- "content": [
- {
- "reader": {
- "name": "mysqlreader",
- "parameter": {
- "username": "yRjwDFuoPKlqya9h9H2Amg==",
- "password": "XCYVpFosvZBBWobFzmLWvA==",
- "connection": [
- {
- "querySql": [
- "select * from user_behavior_mysql;"
- ],
- "jdbcUrl": [
- "jdbc:mysql://10.16.60.31:3306/datasets"
- ]
- }
- ]
- }
- },
- "writer": {
- "name": "hdfswriter",
- "parameter": {
- "defaultFS": "hdfs://10.16.60.31:8020",
- "fileType": "text",
- "path": "/user/hive/warehouse/datasets.db/user_behavior_mysql",
- "fileName": "user_behavior_mysql",
- "writeMode": "append",
- "fieldDelimiter": ",",
- "column": [
- {
- "name": "user_id",
- "type": "string"
- },
- {
- "name": "item_id",
- "type": "string"
- },
- {
- "name": "category_id",
- "type": "string"
- },
- {
- "name": "behavior_type",
- "type": "string"
- },
- {
- "name": "time_stamp",
- "type": "string"
- }
- ]
- }
- }
- }
- ]
- }
- }
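Before submitting a job file like the one above, a few structural checks catch the most common mistakes (wrong plugin names, writer column order drifting from the Hive DDL). A minimal pre-flight sketch with only the standard library (the file name in the usage comment is hypothetical):

```python
import json

def check_job(path):
    """Pre-flight sanity checks on a DataX job file before submitting it."""
    with open(path, encoding="utf-8") as fh:
        job = json.load(fh)["job"]
    content = job["content"][0]
    reader, writer = content["reader"], content["writer"]
    assert reader["name"].endswith("reader"), "reader plugin name looks wrong"
    assert writer["name"].endswith("writer"), "writer plugin name looks wrong"
    # Column order in hdfswriter must match the Hive DDL, or fields shift.
    cols = [c["name"] for c in writer["parameter"]["column"]]
    expected = ["user_id", "item_id", "category_id", "behavior_type", "time_stamp"]
    assert cols == expected, "column order differs from the Hive table"
    return cols

# Example usage:
# check_job("mysql2hive.json")
```

Running this in the scheduler right before `datax.py` turns a silent column shift into an immediate, readable failure.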
The data in Hive is as follows:

We also tested synchronizing roughly 100 million records from MySQL; the results are as follows:
