Atlas collects Hive metadata through the Hive hook, with Kafka acting as the messaging middleware. If the hook pushed metadata synchronously, it would put load on the metadata source, so the hook instead publishes notifications to Kafka and Atlas consumes them asynchronously. This post builds on the embedded-mode build package deployed in the previous article (generic; only the config files need changing) and adds Kafka, Hive, HBase and Solr. This is just a quick record for now; if I ever give in and buy a workstation, I'll write up these cluster deployments properly. Lately I've been deploying until both my machine and I are sick of it: a dead hard drive, blue screens, and so on…
Already deployed on the virtual machines in an earlier post (version 2.7.3), so skipped here.
Download link for kafka_2.13-3.2.0.tgz; it is reasonably fast. If it doesn't work, the official site is the fallback.
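For reference, a minimal download-and-extract sketch, assuming the Apache archive URL for this version:

# assumption: pulling from the Apache archive; use whatever mirror is faster for you
wget https://archive.apache.org/dist/kafka/3.2.0/kafka_2.13-3.2.0.tgz
tar -zxvf kafka_2.13-3.2.0.tgz
cd kafka_2.13-3.2.0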
config/server.properties
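The entries that usually need touching in this file are the listener address, the log directory and the ZooKeeper connection. The values below are assumptions chosen to line up with the atlas-application.properties further down (note the /kafka chroot):

broker.id=0
listeners=PLAINTEXT://192.168.38.10:9092
# log.dirs path is only an example; point it wherever you want Kafka data kept
log.dirs=/root/kafka_2.13-3.2.0/logs
# chroot must match atlas.kafka.zookeeper.connect in atlas-application.properties
zookeeper.connect=192.168.38.10:2181/kafka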
nohup bin/kafka-server-start.sh config/server.properties &
./kafka-topics.sh --bootstrap-server localhost:9092 --list
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic ATLAS_HOOK --from-beginning
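The two notification topics can also be created up front (optional here, since atlas.notification.create.topics=true is set later and Atlas will create them on startup):

./kafka-topics.sh --bootstrap-server localhost:9092 --create --topic ATLAS_HOOK --partitions 1 --replication-factor 1
./kafka-topics.sh --bootstrap-server localhost:9092 --create --topic ATLAS_ENTITIES --partitions 1 --replication-factor 1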
Download it from the official site yourself; if you have download points, two of them get you the fast download.
The main change this time is adding the HIVE_AUX_JARS_PATH variable; this path comes up again when deploying Atlas.
HADOOP_HOME=/root/hadoop-2.7.3
export HIVE_CONF_DIR=/root/apache-hive-3.1.3-bin/conf
export HIVE_AUX_JARS_PATH=/home/atlas/apache-atlas-2.2.0/hook/hive
Specify hbase

Add the hooks (MySQL also needs to be deployed for the metastore; alternatively you can skip it and use the embedded Derby mode, which presumably doesn't affect collection. The more I try, the more I realize how little I know.)
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://192.168.38.10:3306/hive_metastore?createDatabaseIfNotExist=true&amp;useSSL=false&amp;allowPublicKeyRetrieval=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.cj.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>Test2021@</value>
    </property>
    <property>
        <name>datanucleus.schema.autoCreateAll</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>192.168.38.10</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
    <!-- Hive metastore schema verification -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <!-- Metastore authorization -->
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.exec.post.hooks</name>
        <value>org.apache.atlas.hive.hook.HiveHook</value>
    </property>
</configuration>
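Before initializing the schema below, the MySQL JDBC driver also has to be on Hive's classpath; a minimal sketch, assuming a mysql-connector-java 8.x jar (the file name is only an example):

# jar version is an example; use whichever connector matches your MySQL
cp mysql-connector-java-8.0.28.jar /root/apache-hive-3.1.3-bin/lib/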
bin/schematool -dbType mysql -initSchema
nohup bin/hive --service hiveserver2 &
Skipped.
Download link
This post directly uses the hbase and solr bundled in the embedded-mode build package; only the relevant configs need changing.

ZooKeeper is deployed separately, so this is set to false here.
export HBASE_MANAGES_ZK=false
<configuration>
<!--
The following properties are set for running HBase as a single process on a
developer workstation. With this configuration, HBase is running in
"stand-alone" mode and without a distributed file system. In this mode, and
without further configuration, HBase and ZooKeeper data are stored on the
local filesystem, in a path under the value configured for `hbase.tmp.dir`.
This value is overridden from its default value of `/tmp` because many
systems clean `/tmp` on a regular basis. Instead, it points to a path within
this HBase installation directory.
Running against the `LocalFileSystem`, as opposed to a distributed
filesystem, runs the risk of data integrity issues and data loss. Normally
HBase will refuse to run in such an environment. Setting
`hbase.unsafe.stream.capability.enforce` to `false` overrides this behavior,
permitting operation. This configuration is for the developer workstation
only and __should not be used in production!__
-->
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.tmp.dir</name>
        <value>./tmp</value>
    </property>
    <property>
        <name>hbase.unsafe.stream.capability.enforce</name>
        <value>false</value>
    </property>
</configuration>
bin/start-hbase.sh
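A quick way to confirm HBase came up (assuming the usual process names):

jps                                  # HMaster and HRegionServer should be listed
echo "status" | bin/hbase shell -n   # non-interactive status check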
bin/solr start
bin/solr stop -p 8983
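One caveat: atlas-application.properties below uses Solr in cloud mode against ZooKeeper. If Atlas is not left to manage Solr itself, the Atlas install guide has Solr started in SolrCloud mode and the three index collections created first; a sketch assuming the ports and ZooKeeper address used in this setup (the guide also passes a -d conf directory, omitted here):

bin/solr start -c -z 192.168.38.10:2181 -p 8983 -force
bin/solr create -c vertex_index -shards 1 -replicationFactor 1 -force
bin/solr create -c edge_index -shards 1 -replicationFactor 1 -force
bin/solr create -c fulltext_index -shards 1 -replicationFactor 1 -force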
The compiled package already ships this file with all of these entries; only the addresses and paths need changing.
atlas.graph.storage.backend=hbase2
atlas.graph.storage.hbase.table=apache_atlas_janus
atlas.graph.storage.hostname=192.168.38.10
atlas.graph.storage.hbase.regions-per-server=1
atlas.EntityAuditRepository.impl=org.apache.atlas.repository.audit.HBaseBasedAuditRepository
atlas.graph.index.search.backend=solr
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=192.168.38.10:2181
atlas.graph.index.search.solr.zookeeper-connect-timeout=60000
atlas.graph.index.search.solr.zookeeper-session-timeout=60000
atlas.graph.index.search.solr.wait-searcher=false
atlas.graph.index.search.max-result-set-size=150
atlas.notification.embedded=false
atlas.kafka.data=/home/atlas/apache-atlas-2.2.0/data/kafka
atlas.kafka.zookeeper.connect=192.168.38.10:2181/kafka
atlas.kafka.bootstrap.servers=192.168.38.10:9092
atlas.kafka.zookeeper.session.timeout.ms=400
atlas.kafka.zookeeper.connection.timeout.ms=200
atlas.kafka.zookeeper.sync.time.ms=20
atlas.kafka.auto.commit.interval.ms=1000
atlas.kafka.hook.group.id=atlas
atlas.kafka.enable.auto.commit=false
atlas.kafka.auto.offset.reset=earliest
atlas.kafka.session.timeout.ms=30000
atlas.kafka.offsets.topic.replication.factor=1
atlas.kafka.poll.timeout.ms=1000
atlas.notification.create.topics=true
atlas.notification.replicas=1
atlas.notification.topics=ATLAS_HOOK,ATLAS_ENTITIES
atlas.notification.log.failed.messages=true
atlas.notification.consumer.retry.interval=500
atlas.notification.hook.retry.interval=1000
atlas.enableTLS=false
atlas.authentication.method.kerberos=false
atlas.authentication.method.file=true
atlas.authentication.method.ldap.type=none
atlas.authentication.method.file.filename=${sys:atlas.home}/conf/users-credentials.properties
atlas.rest.address=http://192.168.38.10:21000
atlas.audit.hbase.tablename=apache_atlas_entity_audit
atlas.audit.zookeeper.session.timeout.ms=1000
atlas.audit.hbase.zookeeper.quorum=192.168.38.10:2181
atlas.server.ha.enabled=false
atlas.authorizer.impl=simple
atlas.authorizer.simple.authz.policy.file=atlas-simple-authz-policy.json
atlas.rest-csrf.enabled=true
atlas.rest-csrf.browser-useragents-regex=^Mozilla.*,^Opera.*,^Chrome.*
atlas.rest-csrf.methods-to-ignore=GET,OPTIONS,HEAD,TRACE
atlas.rest-csrf.custom-header=X-XSRF-HEADER
atlas.metric.query.cache.ttlInSecs=900
atlas.search.gremlin.enable=false
atlas.ui.default.version=v1
atlas.hook.hive.synchronous=false
atlas.hook.hive.numRetries=3
atlas.hook.hive.queueSize=10000
atlas.cluster.name=primary
conf/atlas-env.sh
export MANAGE_EMBEDDED_CASSANDRA=false
export MANAGE_LOCAL_ELASTICSEARCH=false
export HBASE_CONF_DIR=/home/atlas/hbase/conf
Download link; extract apache-atlas-2.2.0-hive-hook.tar.gz and copy its contents into the Atlas installation directory.
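A sketch of that step (the extracted directory name is an assumption):

tar -zxvf apache-atlas-2.2.0-hive-hook.tar.gz
# copies hook/ and hook-bin/ into the Atlas install directory
cp -r apache-atlas-hive-hook-2.2.0/* /home/atlas/apache-atlas-2.2.0/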

This corresponds to the HIVE_AUX_JARS_PATH variable added to hive-env.sh above.
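The hook also needs to find the Atlas client settings; per the Atlas Hive hook documentation, atlas-application.properties is copied into the Hive conf directory:

cp /home/atlas/apache-atlas-2.2.0/conf/atlas-application.properties /root/apache-hive-3.1.3-bin/conf/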

Create a database and a table in Hive
[root@host1 bin]# ./beeline -u jdbc:hive2://192.168.38.10:10000 -n root
Connecting to jdbc:hive2://192.168.38.10:10000
Connected to: Apache Hive (version 3.1.3)
Driver: Hive JDBC (version 3.1.3)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.3 by Apache Hive
0: jdbc:hive2://192.168.38.10:10000> show databases;
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+
1 row selected (3.811 seconds)
0: jdbc:hive2://192.168.38.10:10000> create database testatlas;
No rows affected (0.375 seconds)
0: jdbc:hive2://192.168.38.10:10000> use testatlas;
No rows affected (0.152 seconds)
0: jdbc:hive2://192.168.38.10:10000> CREATE TABLE atlas_table_test(id int,name string);
No rows affected (2.664 seconds)
0: jdbc:hive2://192.168.38.10:10000> show tables;
+-------------------+
| tab_name          |
+-------------------+
| atlas_table_test  |
+-------------------+
1 row selected (0.195 seconds)
0: jdbc:hive2://192.168.38.10:10000> select * from atlas_table_test;
+----------------------+------------------------+
| atlas_table_test.id  | atlas_table_test.name  |
+----------------------+------------------------+
+----------------------+------------------------+
No rows selected (2.983 seconds)
0: jdbc:hive2://192.168.38.10:10000>

After a short while the table shows up in Atlas (pre-existing data is not synced by the hook; it has to be imported with hook-bin/import-hive.sh).
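A sketch of that import, run from the Atlas install directory (it prompts for the Atlas username/password; the -d option to limit it to one database is per the Atlas docs):

cd /home/atlas/apache-atlas-2.2.0
hook-bin/import-hive.sh
# or import only one database, e.g. the one created above:
hook-bin/import-hive.sh -d testatlas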

bin/hbase shell
list
scan "apache_atlas_entity_audit"
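The entity can also be checked through the Atlas REST API (assuming the default admin/admin login):

curl -u admin:admin "http://192.168.38.10:21000/api/atlas/v2/search/basic?typeName=hive_table&query=atlas_table_test"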
