Metadata Management: Building, Integrating, Deploying, and Testing Apache Atlas


    🍑一、Background


    Atlas collects Hive metadata through a Hive hook, with Kafka acting as the middleware that carries the metadata. Collecting directly (synchronously) through the hook would hurt the performance of the metadata source, so Kafka is used to relay the messages instead. This post deploys the embedded-mode build package from the previous article as-is (it is generic; only the config files need changes) and adds Kafka, Hive, HBase, and Solr on top. This is just a quick record for now; one day, if I give in and buy a workstation, I will write up these cluster deployments properly. Lately I have deployed myself sick, and so has the machine: a dead disk, blue screens, and so on…

    🍑二、Basic Components


    🍊2.1、hadoop


    Already deployed on the VM earlier (version 2.7.3), skipped!

    🍊2.2、Kafka


    kafka_2.13-3.2.0.tgz download link (fairly fast); if that does not work, you will have to download it from the official site.

    • config/server.properties
      Just extract the archive and point it at the ZooKeeper address; everything here is single-node, resources being what they are. A sketch of the relevant lines follows this list.
    • Start command
    nohup bin/kafka-server-start.sh  config/server.properties &
    
    • List topics
    ./kafka-topics.sh --bootstrap-server localhost:9092 --list
    
    • Inspect the contents of a specific topic (e.g. ATLAS_HOOK)
    ./kafka-console-consumer.sh  --bootstrap-server localhost:9092 --topic ATLAS_HOOK --from-beginning
    
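    The config/server.properties screenshot is not reproduced here; for a single-node broker the lines that matter are roughly the following sketch (broker id, listener address, log directory, and the /kafka ZooKeeper chroot are assumptions chosen to match the atlas.kafka.* settings later in this post):

    # config/server.properties -- minimal single-node sketch
    broker.id=0
    listeners=PLAINTEXT://192.168.38.10:9092
    log.dirs=/tmp/kafka-logs
    zookeeper.connect=192.168.38.10:2181/kafka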

    🍊2.3、hive


    Download it from the official site yourself; if you have CSDN points, two of them get you the fast download.

    • conf/hive-env.sh

    The main change here is adding the HIVE_AUX_JARS_PATH variable; this path comes up again when deploying the Atlas hook.

    HADOOP_HOME=/root/hadoop-2.7.3
    export HIVE_CONF_DIR=/root/apache-hive-3.1.3-bin/conf
    export HIVE_AUX_JARS_PATH=/home/atlas/apache-atlas-2.2.0/hook/hive
    
    • bin/hive

    Point it at the HBase installation (here the one bundled with the Atlas package, /home/atlas/hbase).

    • conf/hive-site.xml

    Add the hook (you also need to deploy MySQL for the metastore; or skip it and use the embedded Derby mode, which should not affect collection. The more I try, the more ignorant I feel.)

    <configuration>
        <property>
            <name>javax.jdo.option.ConnectionURL</name>
            <value>jdbc:mysql://192.168.38.10:3306/hive_metastore?createDatabaseIfNotExist=true&amp;useSSL=false&amp;allowPublicKeyRetrieval=true</value>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <value>com.mysql.cj.jdbc.Driver</value>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionUserName</name>
            <value>root</value>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionPassword</name>
            <value>Test2021@</value>
        </property>
        <property>
            <name>datanucleus.schema.autoCreateAll</name>
            <value>true</value>
        </property>
        <property>
            <name>hive.server2.thrift.bind.host</name>
            <value>192.168.38.10</value>
        </property>
        <property>
            <name>hive.metastore.warehouse.dir</name>
            <value>/user/hive/warehouse</value>
        </property>
        
        <!-- Hive metastore schema verification -->
        <property>
            <name>hive.metastore.schema.verification</name>
            <value>false</value>
        </property>
       
        <!-- Metastore event API authorization -->
        <property>
            <name>hive.metastore.event.db.notification.api.auth</name>
            <value>false</value>
        </property>
        <!-- Atlas hook: publishes metadata change notifications to the ATLAS_HOOK Kafka topic -->
        <property>
            <name>hive.exec.post.hooks</name>
            <value>org.apache.atlas.hive.hook.HiveHook</value>
        </property>
    </configuration>
    
    • conf/atlas-application.properties
      Copy it over from the atlas conf directory (a sketch of the command follows this list).
    • Initialize the metastore schema
    bin/schematool -dbType mysql -initSchema
    
    • Start HiveServer2
    nohup bin/hive --service hiveserver2 &
    
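    As referenced above, copying atlas-application.properties over is a one-liner; a sketch assuming the Atlas install path from hive-env.sh and the Hive conf dir used in this post:

    # make the Atlas client settings visible to the Hive hook
    cp /home/atlas/apache-atlas-2.2.0/conf/atlas-application.properties /root/apache-hive-3.1.3-bin/conf/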

    🍊2.4、zookeeper-3.4.1
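
    ZooKeeper runs standalone here (the Kafka, HBase, and Atlas configs in this post all point at 192.168.38.10:2181). Assuming a stock conf/zoo.cfg, starting and checking it is just:

    bin/zkServer.sh start
    bin/zkServer.sh status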


    🍊2.5、hbase and solr


    Download link.
    This post directly uses the hbase and solr bundled in the embedded-mode build package; just adjust the relevant configs.

    • hbase-env.sh

    ZooKeeper is deployed separately, so set this to false:

    export HBASE_MANAGES_ZK=false
    
    • hbase-site.xml
    <configuration>
      <!--
        The following properties are set for running HBase as a single process on a
        developer workstation. With this configuration, HBase is running in
        "stand-alone" mode and without a distributed file system. In this mode, and
        without further configuration, HBase and ZooKeeper data are stored on the
        local filesystem, in a path under the value configured for `hbase.tmp.dir`.
        This value is overridden from its default value of `/tmp` because many
        systems clean `/tmp` on a regular basis. Instead, it points to a path within
        this HBase installation directory.
        Running against the `LocalFileSystem`, as opposed to a distributed
        filesystem, runs the risk of data integrity issues and data loss. Normally
        HBase will refuse to run in such an environment. Setting
        `hbase.unsafe.stream.capability.enforce` to `false` overrides this behavior,
        permitting operation. This configuration is for the developer workstation
        only and __should not be used in production!__
      -->
      <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
      </property>
      <property>
        <name>hbase.tmp.dir</name>
        <value>./tmp</value>
      </property>
      <property>
        <name>hbase.unsafe.stream.capability.enforce</name>
        <value>false</value>
      </property>
    </configuration>
    
    • Start HBase
    bin/start-hbase.sh
    
    • Start/stop solr (see the note on the Atlas index collections after this list)
    bin/solr start
    
    bin/solr stop -p 8983
    
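    One note on Solr: atlas-application.properties below runs Solr in cloud mode against ZooKeeper, and Atlas expects the vertex_index, edge_index, and fulltext_index collections to exist. If the bundled Solr does not already ship them, creating them looks roughly like this sketch (single shard and replica, using the Solr config the Atlas package provides under conf/solr; paths and counts are assumptions for this single-node setup):

    bin/solr start -c -z 192.168.38.10:2181 -p 8983 -force
    bin/solr create -c vertex_index   -d /home/atlas/apache-atlas-2.2.0/conf/solr -shards 1 -replicationFactor 1
    bin/solr create -c edge_index     -d /home/atlas/apache-atlas-2.2.0/conf/solr -shards 1 -replicationFactor 1
    bin/solr create -c fulltext_index -d /home/atlas/apache-atlas-2.2.0/conf/solr -shards 1 -replicationFactor 1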

    🍑三、Apache Atlas


    🍊3、Configuration


    🍓3.1、conf/atlas-application.properties


    The built package already contains the full file; just change the addresses and paths.

    atlas.graph.storage.backend=hbase2
    atlas.graph.storage.hbase.table=apache_atlas_janus
    atlas.graph.storage.hostname=192.168.38.10
    atlas.graph.storage.hbase.regions-per-server=1
    atlas.EntityAuditRepository.impl=org.apache.atlas.repository.audit.HBaseBasedAuditRepository
    atlas.graph.index.search.backend=solr
    atlas.graph.index.search.solr.mode=cloud
    atlas.graph.index.search.solr.zookeeper-url=192.168.38.10:2181
    atlas.graph.index.search.solr.zookeeper-connect-timeout=60000
    atlas.graph.index.search.solr.zookeeper-session-timeout=60000
    atlas.graph.index.search.solr.wait-searcher=false
    atlas.graph.index.search.max-result-set-size=150
    atlas.notification.embedded=false
    atlas.kafka.data=/home/atlas/apache-atlas-2.2.0/data/kafka
    atlas.kafka.zookeeper.connect=192.168.38.10:2181/kafka
    atlas.kafka.bootstrap.servers=192.168.38.10:9092
    atlas.kafka.zookeeper.session.timeout.ms=400
    atlas.kafka.zookeeper.connection.timeout.ms=200
    atlas.kafka.zookeeper.sync.time.ms=20
    atlas.kafka.auto.commit.interval.ms=1000
    atlas.kafka.hook.group.id=atlas
    atlas.kafka.enable.auto.commit=false
    atlas.kafka.auto.offset.reset=earliest
    atlas.kafka.session.timeout.ms=30000
    atlas.kafka.offsets.topic.replication.factor=1
    atlas.kafka.poll.timeout.ms=1000
    atlas.notification.create.topics=true
    atlas.notification.replicas=1
    atlas.notification.topics=ATLAS_HOOK,ATLAS_ENTITIES
    atlas.notification.log.failed.messages=true
    atlas.notification.consumer.retry.interval=500
    atlas.notification.hook.retry.interval=1000
    atlas.enableTLS=false
    atlas.authentication.method.kerberos=false
    atlas.authentication.method.file=true
    atlas.authentication.method.ldap.type=none
    atlas.authentication.method.file.filename=${sys:atlas.home}/conf/users-credentials.properties
    atlas.rest.address=http://192.168.38.10:21000
    atlas.audit.hbase.tablename=apache_atlas_entity_audit
    atlas.audit.zookeeper.session.timeout.ms=1000
    atlas.audit.hbase.zookeeper.quorum=192.168.38.10:2181
    atlas.server.ha.enabled=false
    atlas.authorizer.impl=simple
    atlas.authorizer.simple.authz.policy.file=atlas-simple-authz-policy.json
    atlas.rest-csrf.enabled=true
    atlas.rest-csrf.browser-useragents-regex=^Mozilla.*,^Opera.*,^Chrome.*
    atlas.rest-csrf.methods-to-ignore=GET,OPTIONS,HEAD,TRACE
    atlas.rest-csrf.custom-header=X-XSRF-HEADER
    atlas.metric.query.cache.ttlInSecs=900
    atlas.search.gremlin.enable=false
    atlas.ui.default.version=v1
    atlas.hook.hive.synchronous=false
    atlas.hook.hive.numRetries=3
    atlas.hook.hive.queueSize=10000
    atlas.cluster.name=primary
    

    🍓3.2、conf/atlas-env.sh


    export MANAGE_EMBEDDED_CASSANDRA=false
    export MANAGE_LOCAL_ELASTICSEARCH=false
    export HBASE_CONF_DIR=/home/atlas/hbase/conf
    

    🍓3.3、apache-atlas-2.2.0-hive-hook.tar.gz


    Download link. Extract apache-atlas-2.2.0-hive-hook.tar.gz and copy its contents into the Atlas installation directory (sketched below).
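    A sketch of that step, assuming the tarball unpacks into apache-atlas-hive-hook-2.2.0 and Atlas is installed at /home/atlas/apache-atlas-2.2.0:

    tar -zxvf apache-atlas-2.2.0-hive-hook.tar.gz
    # merge the hook/ and hook-bin/ directories into the Atlas install dir
    cp -r apache-atlas-hive-hook-2.2.0/* /home/atlas/apache-atlas-2.2.0/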
    The resulting hook/hive directory is exactly what the HIVE_AUX_JARS_PATH variable added in hive-env.sh points to.
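
    With the configuration and hook package in place, Atlas itself can be brought up before testing. A minimal sketch (the URL matches atlas.rest.address above; admin/admin is the default account from users-credentials.properties, so change it in real deployments):

    bin/atlas_start.py
    # once startup finishes, the version endpoint should answer
    curl -u admin:admin http://192.168.38.10:21000/api/atlas/admin/version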

    🍑四、Testing


    Create a database and a table in Hive:

    [root@host1 bin]# ./beeline -u jdbc:hive2://192.168.38.10:10000 -n root
    Connecting to jdbc:hive2://192.168.38.10:10000
    Connected to: Apache Hive (version 3.1.3)
    Driver: Hive JDBC (version 3.1.3)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    Beeline version 3.1.3 by Apache Hive
    0: jdbc:hive2://192.168.38.10:10000> show databases;
    +----------------+
    | database_name  |
    +----------------+
    | default        |
    +----------------+
    1 row selected (3.811 seconds)
    0: jdbc:hive2://192.168.38.10:10000> create database testatlas;
    No rows affected (0.375 seconds)
    0: jdbc:hive2://192.168.38.10:10000> use testatlas;
    No rows affected (0.152 seconds)
    0: jdbc:hive2://192.168.38.10:10000> CREATE  TABLE  atlas_table_test(id int,name string);
    No rows affected (2.664 seconds)
    0: jdbc:hive2://192.168.38.10:10000> show tables;
    +-------------------+
    |     tab_name      |
    +-------------------+
    | atlas_table_test  |
    +-------------------+
    1 row selected (0.195 seconds)
    0: jdbc:hive2://192.168.38.10:10000> select * from atlas_table_test;
    +----------------------+------------------------+
    | atlas_table_test.id  | atlas_table_test.name  |
    +----------------------+------------------------+
    +----------------------+------------------------+
    No rows selected (2.983 seconds)
    0: jdbc:hive2://192.168.38.10:10000> 
    


    After a short while the new database and table show up in Atlas (pre-existing metadata is not synced by the hook; it has to be imported with hook-bin/import-hive.sh, sketched below).
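
    A sketch of that import, run from the hook-bin directory copied in section 3.3 (the script prompts for the Atlas username and password; the -d option to limit the import to one database is optional):

    cd /home/atlas/apache-atlas-2.2.0/hook-bin
    ./import-hive.sh                   # import all existing Hive metadata
    ./import-hive.sh -d testatlas      # or only the testatlas database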

    • Enter the HBase shell
    bin/hbase shell
    
    • List tables
    list
    
    • Scan the full audit table
    scan "apache_atlas_entity_audit"
    
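    To confirm the pipeline end to end, the notification topics can also be watched directly: the hook publishes to ATLAS_HOOK, and Atlas in turn publishes processed entities to ATLAS_ENTITIES (same consumer command as in the Kafka section):

    ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic ATLAS_ENTITIES --from-beginning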

