Atlas collects Hive metadata through the Hive hook, with Kafka acting as the messaging middleware. If the hook pushed metadata synchronously, it would put load on the metadata source, so the hook instead publishes notifications to Kafka and Atlas consumes them asynchronously. This post builds on the embedded-mode build package deployed in the previous article (generic; only the config files need changing) and adds Kafka, Hive, HBase and Solr. This is just a quick record for now; if I ever give in and buy a workstation, I'll write up these cluster deployments properly. Lately I've been deploying until both my machine and I are sick of it: a dead hard drive, blue screens, and so on…
Already deployed on the virtual machines in an earlier post (version 2.7.3), so skipped here.
Download link for kafka_2.13-3.2.0.tgz; it is reasonably fast. If it doesn't work, the official site is the fallback.
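For reference, a minimal download-and-extract sketch, assuming the Apache archive URL for this version:

# assumption: pulling from the Apache archive; use whatever mirror is faster for you
wget https://archive.apache.org/dist/kafka/3.2.0/kafka_2.13-3.2.0.tgz
tar -zxvf kafka_2.13-3.2.0.tgz
cd kafka_2.13-3.2.0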
config/server.properties
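The entries that usually need touching in this file are the listener address, the log directory and the ZooKeeper connection. The values below are assumptions chosen to line up with the atlas-application.properties further down (note the /kafka chroot):

broker.id=0
listeners=PLAINTEXT://192.168.38.10:9092
# log.dirs path is only an example; point it wherever you want Kafka data kept
log.dirs=/root/kafka_2.13-3.2.0/logs
# chroot must match atlas.kafka.zookeeper.connect in atlas-application.properties
zookeeper.connect=192.168.38.10:2181/kafka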
nohup bin/kafka-server-start.sh config/server.properties &
./kafka-topics.sh --bootstrap-server localhost:9092 --list
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic ATLAS_HOOK --from-beginning
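The two notification topics can also be created up front (optional here, since atlas.notification.create.topics=true is set later and Atlas will create them on startup):

./kafka-topics.sh --bootstrap-server localhost:9092 --create --topic ATLAS_HOOK --partitions 1 --replication-factor 1
./kafka-topics.sh --bootstrap-server localhost:9092 --create --topic ATLAS_ENTITIES --partitions 1 --replication-factor 1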
Download it from the official site yourself; if you have download points, two of them get you the fast download.
The main change this time is adding the HIVE_AUX_JARS_PATH variable; this path comes up again when deploying Atlas.
HADOOP_HOME=/root/hadoop-2.7.3
export HIVE_CONF_DIR=/root/apache-hive-3.1.3-bin/conf
export HIVE_AUX_JARS_PATH=/home/atlas/apache-atlas-2.2.0/hook/hive
Specify hbase

Add the hooks (MySQL also needs to be deployed for the metastore; alternatively you can skip it and use the embedded Derby mode, which presumably doesn't affect collection. The more I try, the more I realize how little I know.)
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://192.168.38.10:3306/hive_metastore?createDatabaseIfNotExist=true&amp;useSSL=false&amp;allowPublicKeyRetrieval=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.cj.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>Test2021@</value>
    </property>
    <property>
        <name>datanucleus.schema.autoCreateAll</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>192.168.38.10</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
    <!-- Hive metastore schema verification -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <!-- Metastore authorization -->
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.exec.post.hooks</name>
        <value>org.apache.atlas.hive.hook.HiveHook</value>
    </property>
</configuration>
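Before initializing the schema below, the MySQL JDBC driver also has to be on Hive's classpath; a minimal sketch, assuming a mysql-connector-java 8.x jar (the file name is only an example):

# jar version is an example; use whichever connector matches your MySQL
cp mysql-connector-java-8.0.28.jar /root/apache-hive-3.1.3-bin/lib/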
bin/schematool -dbType mysql -initSchema
nohup bin/hive --service hiveserver2 &
Skipped.
Download link
This post directly uses the hbase and solr bundled in the embedded-mode build package; only the relevant configs need changing.

ZooKeeper is deployed separately, so this is set to false here.
export HBASE_MANAGES_ZK=false
<configuration>
<!--
The following properties are set for running HBase as a single process on a
developer workstation. With this configuration, HBase is running in
"stand-alone" mode and without a distributed file system. In this mode, and
without further configuration, HBase and ZooKeeper data are stored on the
local filesystem, in a path under the value configured for `hbase.tmp.dir`.
This value is overridden from its default value of `/tmp` because many
systems clean `/tmp` on a regular basis. Instead, it points to a path within
this HBase installation directory.
Running against the `LocalFileSystem`, as opposed to a distributed
filesystem, runs the risk of data integrity issues and data loss. Normally
HBase will refuse to run in such an environment. Setting
`hbase.unsafe.stream.capability.enforce` to `false` overrides this behavior,
permitting operation. This configuration is for the developer workstation
only and __should not be used in production!__
-->
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.tmp.dir</name>
        <value>./tmp</value>
    </property>
    <property>
        <name>hbase.unsafe.stream.capability.enforce</name>
        <value>false</value>
    </property>
</configuration>
bin/start-hbase.sh
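A quick way to confirm HBase came up (assuming the usual process names):

jps                                  # HMaster and HRegionServer should be listed
echo "status" | bin/hbase shell -n   # non-interactive status check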
bin/solr start
bin/solr stop -p 8983
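One caveat: atlas-application.properties below uses Solr in cloud mode against ZooKeeper. If Atlas is not left to manage Solr itself, the Atlas install guide has Solr started in SolrCloud mode and the three index collections created first; a sketch assuming the ports and ZooKeeper address used in this setup (the guide also passes a -d conf directory, omitted here):

bin/solr start -c -z 192.168.38.10:2181 -p 8983 -force
bin/solr create -c vertex_index -shards 1 -replicationFactor 1 -force
bin/solr create -c edge_index -shards 1 -replicationFactor 1 -force
bin/solr create -c fulltext_index -shards 1 -replicationFactor 1 -force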
The compiled package already ships this file with all of these entries; only the addresses and paths need changing.
atlas.graph.storage.backend=hbase2
atlas.graph.storage.hbase.table=apache_atlas_janus
atlas.graph.storage.hostname=192.168.38.10
atlas.graph.storage.hbase.regions-per-server=1
atlas.EntityAuditRepository.impl=org.apache.atlas.repository.audit.HBaseBasedAuditRepository
atlas.graph.index.search.backend=solr
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=192.168.38.10:2181
atlas.graph.index.search.solr.zookeeper-connect-timeout=60000
atlas.graph.index.search.solr.zookeeper-session-timeout=60000
atlas.graph.index.search.solr.wait-searcher=false
atlas.graph.index.search.max-result-set-size=150
atlas.notification.embedded=false
atlas.kafka.data=/home/atlas/apache-atlas-2.2.0/data/kafka
atlas.kafka.zookeeper.connect=192.168.38.10:2181/kafka
atlas.kafka.bootstrap.servers=192.168.38.10:9092
atlas.kafka.zookeeper.session.timeout.ms=400
atlas.kafka.zookeeper.connection.timeout.ms=200
atlas.kafka.zookeeper.sync.time.ms=20
atlas.kafka.auto.commit.interval.ms=1000
atlas.kafka.hook.group.id=atlas
atlas.kafka.enable.auto.commit=false
atlas.kafka.auto.offset.reset=earliest
atlas.kafka.session.timeout.ms=30000
atlas.kafka.offsets.topic.replication.factor=1
atlas.kafka.poll.timeout.ms=1000
atlas.notification.create.topics=true
atlas.notification.replicas=1
atlas.notification.topics=ATLAS_HOOK,ATLAS_ENTITIES
atlas.notification.log.failed.messages=true
atlas.notification.consumer.retry.interval=500
atlas.notification.hook.retry.interval=1000
atlas.enableTLS=false
atlas.authentication.method.kerberos=false
atlas.authentication.method.file=true
atlas.authentication.method.ldap.type=none
atlas.authentication.method.file.filename=${sys:atlas.home}/conf/users-credentials.properties
atlas.rest.address=http://192.168.38.10:21000
atlas.audit.hbase.tablename=apache_atlas_entity_audit
atlas.audit.zookeeper.session.timeout.ms=1000
atlas.audit.hbase.zookeeper.quorum=192.168.38.10:2181
atlas.server.ha.enabled=false
atlas.authorizer.impl=simple
atlas.authorizer.simple.authz.policy.file=atlas-simple-authz-policy.json
atlas.rest-csrf.enabled=true
atlas.rest-csrf.browser-useragents-regex=^Mozilla.*,^Opera.*,^Chrome.*
atlas.rest-csrf.methods-to-ignore=GET,OPTIONS,HEAD,TRACE
atlas.rest-csrf.custom-header=X-XSRF-HEADER
atlas.metric.query.cache.ttlInSecs=900
atlas.search.gremlin.enable=false
atlas.ui.default.version=v1
atlas.hook.hive.synchronous=false
atlas.hook.hive.numRetries=3
atlas.hook.hive.queueSize=10000
atlas.cluster.name=primary
conf/atlas-env.sh
export MANAGE_EMBEDDED_CASSANDRA=false
export MANAGE_LOCAL_ELASTICSEARCH=false
export HBASE_CONF_DIR=/home/atlas/hbase/conf
Download link; extract apache-atlas-2.2.0-hive-hook.tar.gz and copy its contents into the Atlas installation directory.
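A sketch of that step (the extracted directory name is an assumption):

tar -zxvf apache-atlas-2.2.0-hive-hook.tar.gz
# copies hook/ and hook-bin/ into the Atlas install directory
cp -r apache-atlas-hive-hook-2.2.0/* /home/atlas/apache-atlas-2.2.0/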

This corresponds to the HIVE_AUX_JARS_PATH variable added to hive-env.sh above.
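The hook also needs to find the Atlas client settings; per the Atlas Hive hook documentation, atlas-application.properties is copied into the Hive conf directory:

cp /home/atlas/apache-atlas-2.2.0/conf/atlas-application.properties /root/apache-hive-3.1.3-bin/conf/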

Create a database and a table in Hive
[root@host1 bin]# ./beeline -u jdbc:hive2://192.168.38.10:10000 -n root
Connecting to jdbc:hive2://192.168.38.10:10000
Connected to: Apache Hive (version 3.1.3)
Driver: Hive JDBC (version 3.1.3)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.3 by Apache Hive
0: jdbc:hive2://192.168.38.10:10000> show databases;
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+
1 row selected (3.811 seconds)
0: jdbc:hive2://192.168.38.10:10000> create database testatlas;
No rows affected (0.375 seconds)
0: jdbc:hive2://192.168.38.10:10000> use testatlas;
No rows affected (0.152 seconds)
0: jdbc:hive2://192.168.38.10:10000> CREATE TABLE atlas_table_test(id int,name string);
No rows affected (2.664 seconds)
0: jdbc:hive2://192.168.38.10:10000> show tables;
+-------------------+
| tab_name          |
+-------------------+
| atlas_table_test  |
+-------------------+
1 row selected (0.195 seconds)
0: jdbc:hive2://192.168.38.10:10000> select * from atlas_table_test;
+----------------------+------------------------+
| atlas_table_test.id  | atlas_table_test.name  |
+----------------------+------------------------+
+----------------------+------------------------+
No rows selected (2.983 seconds)
0: jdbc:hive2://192.168.38.10:10000>

After a short while the table shows up in Atlas (pre-existing data is not synced by the hook; it has to be imported with hook-bin/import-hive.sh).
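A sketch of that import, run from the Atlas install directory (it prompts for the Atlas username/password; the -d option to limit it to one database is per the Atlas docs):

cd /home/atlas/apache-atlas-2.2.0
hook-bin/import-hive.sh
# or import only one database, e.g. the one created above:
hook-bin/import-hive.sh -d testatlas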

bin/hbase shell
list
scan "apache_atlas_entity_audit"
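The entity can also be checked through the Atlas REST API (assuming the default admin/admin login):

curl -u admin:admin "http://192.168.38.10:21000/api/atlas/v2/search/basic?typeName=hive_table&query=atlas_table_test"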
