• ETL Visualization Tool DataX -- Installation and Deployment (2)


    Introduction

    DataX series articles:

    DataX private repositories:

    https://gitee.com/dazhong000/datax.git
    https://gitee.com/dazhong000/datax-web.git
    Local path: E:\soft\2023-08-datax

    2.1 DataX Installation

    Installation guide (GitHub): https://github.com/alibaba/DataX/blob/master/userGuid.md

    2.1.1 Install by Extracting the Archive

    • Method 1: download the pre-built DataX tool package directly:
    • Download address: https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/202308/datax.tar.gz
      After downloading, extract it to a local directory, change into the bin directory, and you can run a sync job:
    $ cd  {YOUR_DATAX_HOME}/bin
    $ python datax.py {YOUR_JOB.json}
    

    Self-check script:

    python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json
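Since every DataX invocation is just `python datax.py <job.json>`, the self-check can be wrapped in a small helper. A minimal sketch; the `selfcheck_cmd` helper, the `DATAX_HOME` environment variable, and the `/opt/datax` fallback are my own illustration, not part of the official docs:

```python
import os
import shlex
import subprocess

def selfcheck_cmd(datax_home):
    """Build the self-check command line for a DataX install directory."""
    return ["python", f"{datax_home}/bin/datax.py", f"{datax_home}/job/job.json"]

# DATAX_HOME and the /opt/datax fallback are assumptions for illustration
cmd = selfcheck_cmd(os.environ.get("DATAX_HOME", "/opt/datax"))
print(shlex.join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually run the self-check
```

The bundled `job/job.json` is a stream-to-stream job, so a successful run confirms the Python runtime and the DataX engine without touching any database.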
    
    • Method 2: download the DataX source code and build it yourself:

    (1) Download the DataX source code:

    $ git clone git@github.com:alibaba/DataX.git
    

    (2) Build the package with Maven:

    $ cd  {DataX_source_code_home}
    $ mvn -U clean package assembly:assembly -Dmaven.test.skip=true
    When the build succeeds, the log shows:
    [INFO] BUILD SUCCESS
    [INFO] -----------------------------------------------------------------
    [INFO] Total time: 08:12 min
    [INFO] Finished at: 2015-12-13T16:26:48+08:00
    [INFO] Final Memory: 133M/960M
    [INFO] -----------------------------------------------------------------
    

    After a successful build, the DataX package is located at {DataX_source_code_home}/target/datax/datax/ with the following structure:

    $ cd  {DataX_source_code_home}
    $ ls ./target/datax/datax/
    bin        conf        job        lib        log        log_perf    plugin        script        tmp
    

    2.1.2 Configuration Example: Read Data from a Stream and Print to the Console

    • Step 1: create the job configuration file (JSON format)

    You can view a configuration template with: python datax.py -r {YOUR_READER} -w {YOUR_WRITER}

    $ cd  {YOUR_DATAX_HOME}/bin
    $  python datax.py -r streamreader -w streamwriter
    DataX (UNKNOWN_DATAX_VERSION), From Alibaba !
    Copyright (C) 2010-2015, Alibaba Group. All Rights Reserved.
    Please refer to the streamreader document:
        https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md 
    
    Please refer to the streamwriter document:
         https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md 
     
    Please save the following configuration as a json file and  use
         python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json 
    to run the job.
    
    {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "streamreader", 
                        "parameter": {
                            "column": [], 
                            "sliceRecordCount": ""
                        }
                    }, 
                    "writer": {
                        "name": "streamwriter", 
                        "parameter": {
                            "encoding": "", 
                            "print": true
                        }
                    }
                }
            ], 
            "setting": {
                "speed": {
                    "channel": ""
                }
            }
        }
    }
    

    Based on the template, configure the JSON as follows:

    #stream2stream.json
    {
      "job": {
        "content": [
          {
            "reader": {
              "name": "streamreader",
              "parameter": {
                "sliceRecordCount": 10,
                "column": [
                  {
                    "type": "long",
                    "value": "10"
                  },
                  {
                    "type": "string",
                    "value": "hello,你好,世界-DataX"
                  }
                ]
              }
            },
            "writer": {
              "name": "streamwriter",
              "parameter": {
                "encoding": "UTF-8",
                "print": true
              }
            }
          }
        ],
        "setting": {
          "speed": {
            "channel": 5
           }
        }
      }
    }
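Before launching the job, it can save a round trip to confirm that stream2stream.json parses and actually names a reader and a writer. A minimal sketch; the `validate_job` helper is my own, not part of DataX:

```python
import json

def validate_job(cfg):
    """Minimal structural checks on a parsed DataX job; returns a list of problems."""
    problems = []
    job = cfg.get("job", {})
    if not job.get("content"):
        problems.append("job.content is empty")
    for i, item in enumerate(job.get("content", [])):
        for side in ("reader", "writer"):
            if "name" not in item.get(side, {}):
                problems.append(f"content[{i}].{side}.name missing")
    if "speed" not in job.get("setting", {}):
        problems.append("job.setting.speed missing")
    return problems

# On the real file you would use json.load(open("stream2stream.json", encoding="utf-8"));
# a small inline sample keeps this sketch self-contained.
sample = json.loads(
    '{"job": {"content": [{"reader": {"name": "streamreader"},'
    ' "writer": {"name": "streamwriter"}}],'
    ' "setting": {"speed": {"channel": 5}}}}'
)
print(validate_job(sample))  # → []
```

A `json.loads` failure here surfaces a syntax error immediately, instead of buried in the DataX job log.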
    

    Example: MySQL data synchronization configuration:

    {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "mysqlreader",      // reader plugin
                        "parameter": {
                            "username": "root",     // source database user
                            "password": "root",     // source database password
                            "column": [             // columns to sync (* means all columns)
                                "*"
                            ],
                            "connection": [
                                {
                                    "jdbcUrl": [    // source database connection
                                        "jdbc:mysql://127.0.0.3:3360/studysource?useUnicode=true&characterEncoding=utf8"
                                    ],
                                    "table": [      // source table
                                        "staff_info"
                                    ]
                                }
                            ]
                        }
                    },
                    "writer": {
                        "name": "mysqlwriter",      // writer plugin
                        "parameter": {
                            "username": "root",     // target database user
                            "password": "root",     // target database password
                            "connection": [
                                {
                                    "jdbcUrl": "jdbc:mysql://127.2.3.4:3360/studysync?useUnicode=true&characterEncoding=utf8",  // target database connection
                                    "table": [      // target table
                                        "staff_info"
                                    ]
                                }
                            ],
                            "preSql": [             // SQL to run before the sync starts
                                "TRUNCATE TABLE staff_info"
                            ],
                            "column": [             // columns to sync
                                "*"
                            ]
                        }
                    }
                }
            ],
            "setting": {
                "speed": {
                    "channel": "5"                  // number of concurrent channels
                }
            }
        }
    }
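The // annotations above are for readability only; strict JSON parsers (and DataX's job loader) expect plain JSON, so they must be removed before the file is submitted. A hedged sketch of one way to do that; the `strip_line_comments` helper is my own, and note it must skip `//` inside string literals, since every `jdbc:mysql://...` URL contains one:

```python
import json

def strip_line_comments(text):
    """Drop //-comments that sit outside JSON string literals."""
    out, in_str, i = [], False, 0
    while i < len(text):
        c = text[i]
        if in_str:
            out.append(c)
            if c == "\\" and i + 1 < len(text):   # keep escaped char verbatim
                out.append(text[i + 1])
                i += 1
            elif c == '"':
                in_str = False
        elif c == '"':
            in_str = True
            out.append(c)
        elif text[i:i + 2] == "//":               # comment: skip to end of line
            nl = text.find("\n", i)
            if nl == -1:
                break
            i = nl
            continue
        else:
            out.append(c)
        i += 1
    return "".join(out)

annotated = '{"jdbcUrl": "jdbc:mysql://127.0.0.1:3306/db"  // target connection\n}'
print(json.loads(strip_line_comments(annotated)))  # → {'jdbcUrl': 'jdbc:mysql://127.0.0.1:3306/db'}
```

Alternatively, simply keep the job file comment-free and document it elsewhere.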
    
    • Step 2: start DataX
    $ cd {YOUR_DATAX_DIR_BIN}
    $ python datax.py ./stream2stream.json 
    

    When the sync finishes, the log shows:

    ...
    2015-12-17 11:20:25.263 [job-0] INFO  JobContainer - 
    Task start time                 : 2015-12-17 11:20:15
    Task end time                   : 2015-12-17 11:20:25
    Total task time                 :                 10s
    Average task throughput         :              205B/s
    Record write speed              :              5rec/s
    Total records read              :                  50
    Total read/write failures       :                   0
    
  • Original article: https://blog.csdn.net/dazhong2012/article/details/139668893