Atlas with Hive Installation Summary

Architecture Components

+-- Atlas --+
| HBase(ZK) |
| Kafka     | <----- Hive (hive hook), real-time metadata import -----> MySQL
| Solr      |                                                             |
| REST      | <----- import-hive.sh, batch metadata import <----- MetaStore
+-----------+

Atlas Standalone

Basic Installation

For details, see the Atlas build with integrated HBase and Solr described in Atlas系列/Atlas 源码编译.md
Reference: https://atlas.apache.org/#/Installation

# 1. Extract and install Atlas
tar -xf apache-atlas-2.1.0-server.tar.gz -C /opt/modules/
cd /opt/modules/ && mv apache-atlas-2.1.0 atlas-2.1.0-alone

# Remove the unneeded .cmd files
cd /opt/modules/atlas-2.1.0-alone
find . -name '*.cmd' -delete

# 2. Start Atlas
## Enable the embedded HBase and Solr
### export MANAGE_LOCAL_HBASE=true
### export MANAGE_LOCAL_SOLR=true
## conf/atlas-env.sh already contains both settings, so Atlas can be started directly
bin/atlas_start.py

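# 3. Start components individually
## 3.1 Start the embedded HBase on its own
hbase/bin/start-hbase.sh

## 3.2 Start the embedded Solr on its own
### -z sets the ZooKeeper address, -p sets the Solr port
solr/bin/solr start -c -z localhost:2181 -p 8983

## 3.3 Create the Solr indexes
#solr/bin/solr create -c vertex_index -d conf/solr/
#solr/bin/solr create -c edge_index -d conf/solr/
#solr/bin/solr create -c fulltext_index -d conf/solr/
### Atlas keeps the indexes for its graph data in Solr
### There are three collections: vertex_index (vertices), edge_index (links between vertices), fulltext_index (full-text search)
### In testing, standalone Atlas did not need the Solr collections created by hand; the existing ones can be listed at
### http://hadoop112:9838/solr/admin/collections?action=list
### or viewed in the Solr Web UI; the default port is configured in `bin/atlas_config.py`
### http://hadoop112:9838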

## Stop the embedded HBase
hbase/bin/stop-hbase.sh
## Stop the embedded Solr
solr/bin/solr stop

## Watch the Atlas startup log; startup takes roughly 10 minutes
tail -f logs/application.log
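
Since startup takes several minutes, polling the REST port can be handier than watching the log. The following is only a sketch: the function name `wait_for_http` and the `ATLAS_AUTH` variable are made up for illustration, and it assumes the stock `admin:admin` account and the `/api/atlas/admin/version` endpoint are reachable in your installation.

```shell
# Poll a URL until it answers with an HTTP success status or the attempts run out.
# Arguments: URL, max attempts (default 60), seconds between attempts (default 10).
wait_for_http() {
  url=$1
  attempts=${2:-60}
  delay=${3:-10}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP errors as failure, -s: silent, -u: basic auth.
    # ATLAS_AUTH is a hypothetical override; admin:admin is assumed as the default account.
    if curl -fs -u "${ATLAS_AUTH:-admin:admin}" -o /dev/null "$url"; then
      echo "up after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "gave up after $attempts attempt(s)" >&2
  return 1
}
```

Example call, once `bin/atlas_start.py` has returned: `wait_for_http http://localhost:21000/api/atlas/admin/version`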



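Additional Changes

Change the port of the ZooKeeper bundled with the standalone HBase to 2182
to avoid a conflict with the cluster's ZooKeeper (ZooKeeper's default port is 2181)
/opt/modules/atlas-2.1.0-alone/hbase/conf/hbase-site.xml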



<configuration>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2182</value>
  </property>

  <property>
    <name>hbase.rootdir</name>
    <value>file:///opt/modules/atlas-2.1.0-alone/data/hbase-root</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/modules/atlas-2.1.0-alone/data/hbase-zookeeper-data</value>
  </property>
  <property>
    <name>hbase.master.info.port</name>
    <value>61510</value>
  </property>
  <property>
    <name>hbase.regionserver.info.port</name>
    <value>61530</value>
  </property>
  <property>
    <name>hbase.master.port</name>
    <value>61500</value>
  </property>
  <property>
    <name>hbase.regionserver.port</name>
    <value>61520</value>
  </property>
</configuration>
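
To confirm the edited file really carries the new port before starting anything, the value can be read back from the XML. This is a rough sketch rather than a real XML parser: `get_xml_property` is a hypothetical helper, and it assumes each `<name>` and `<value>` element sits on a single line, as in the file above.

```shell
# Print the <value> that follows the <name> equal to KEY in a Hadoop-style XML file.
# Arguments: FILE, KEY. Prints nothing if the key is absent.
get_xml_property() {
  awk -v key="$2" '
    /<name>/ {
      n = $0
      sub(/.*<name>/, "", n)      # drop everything up to the opening tag
      sub(/<\/name>.*/, "", n)    # drop the closing tag and what follows
      found = (n == key)
    }
    /<value>/ && found {
      v = $0
      sub(/.*<value>/, "", v)
      sub(/<\/value>.*/, "", v)
      print v
      exit
    }
  ' "$1"
}
```

Example: `get_xml_property /opt/modules/atlas-2.1.0-alone/hbase/conf/hbase-site.xml hbase.zookeeper.property.clientPort` should print 2182 after the change.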

Update the matching ZooKeeper settings on the Atlas side; the complete configuration is as follows
/opt/modules/atlas-2.1.0-alone/conf/atlas-application.properties

atlas.graph.storage.backend=hbase2
atlas.graph.storage.hbase.table=apache_atlas_janus
# hostname must be localhost, otherwise atlas_start.py will not start HBase
## and without the local ZooKeeper, Solr and the embedded Kafka cannot start either
atlas.graph.storage.hostname=localhost
atlas.graph.storage.hbase.regions-per-server=1
atlas.graph.storage.lock.wait-time=10000
atlas.EntityAuditRepository.impl=org.apache.atlas.repository.audit.HBaseBasedAuditRepository
atlas.graph.index.search.backend=solr
atlas.graph.index.search.solr.mode=cloud
# solr.zookeeper-url changed to localhost:2182
atlas.graph.index.search.solr.zookeeper-url=localhost:2182
atlas.graph.index.search.solr.zookeeper-connect-timeout=60000
atlas.graph.index.search.solr.zookeeper-session-timeout=60000
atlas.graph.index.search.solr.wait-searcher=true
atlas.graph.index.search.max-result-set-size=150
atlas.notification.embedded=true
atlas.kafka.data=${sys:atlas.home}/data/kafka
# The Kafka addresses can be set to the host name so that Hive hooks on other nodes can reach them
atlas.kafka.zookeeper.connect=hadoop112:9026
atlas.kafka.bootstrap.servers=hadoop112:9027
atlas.kafka.zookeeper.session.timeout.ms=400
atlas.kafka.zookeeper.connection.timeout.ms=200
atlas.kafka.zookeeper.sync.time.ms=20
atlas.kafka.auto.commit.interval.ms=1000
atlas.kafka.hook.group.id=atlas
atlas.kafka.enable.auto.commit=false
atlas.kafka.auto.offset.reset=earliest
atlas.kafka.session.timeout.ms=30000
atlas.kafka.offsets.topic.replication.factor=1
atlas.kafka.poll.timeout.ms=1000
atlas.notification.create.topics=true
atlas.notification.replicas=1
atlas.notification.topics=ATLAS_HOOK,ATLAS_ENTITIES
atlas.notification.log.failed.messages=true
atlas.notification.consumer.retry.interval=500
atlas.notification.hook.retry.interval=1000
atlas.enableTLS=false
atlas.authentication.method.kerberos=false
atlas.authentication.method.file=true
atlas.authentication.method.ldap.type=none
atlas.authentication.method.file.filename=${sys:atlas.home}/conf/users-credentials.properties
# The Atlas REST address can be set to the host name so that other nodes can call it
atlas.rest.address=http://hadoop112:21000
atlas.audit.hbase.tablename=apache_atlas_entity_audit
atlas.audit.zookeeper.session.timeout.ms=1000
# hbase.zookeeper.quorum changed to localhost:2182
atlas.audit.hbase.zookeeper.quorum=localhost:2182
atlas.server.ha.enabled=false
atlas.authorizer.impl=simple
atlas.authorizer.simple.authz.policy.file=atlas-simple-authz-policy.json
atlas.rest-csrf.enabled=true
atlas.rest-csrf.browser-useragents-regex=^Mozilla.*,^Opera.*,^Chrome.*
atlas.rest-csrf.methods-to-ignore=GET,OPTIONS,HEAD,TRACE
atlas.rest-csrf.custom-header=X-XSRF-HEADER
atlas.metric.query.cache.ttlInSecs=900
atlas.search.gremlin.enable=false
atlas.ui.default.version=v1
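
The two ZooKeeper entries above have to agree with the port set in hbase-site.xml, and it is easy to update one and miss the other. A small check along these lines may help; `get_prop` and `check_atlas_zk` are hypothetical helpers written for this note, and the parsing only handles plain `key=value` lines.

```shell
# Print the value of KEY from a Java .properties FILE (the last definition wins).
# Arguments: FILE, KEY.
get_prop() {
  awk -v key="$2" '
    /^[ \t]*#/ { next }                                   # skip comment lines
    index($0, key "=") == 1 { v = substr($0, length(key) + 2) }
    END { if (v != "") print v }
  ' "$1"
}

# Succeed only if both ZooKeeper settings in FILE equal the expected HOST:PORT.
# Arguments: FILE, HOST:PORT. Mismatches are reported on stderr.
check_atlas_zk() {
  expected=$2
  status=0
  for key in atlas.graph.index.search.solr.zookeeper-url \
             atlas.audit.hbase.zookeeper.quorum; do
    actual=$(get_prop "$1" "$key")
    if [ "$actual" != "$expected" ]; then
      echo "$key=$actual (expected $expected)" >&2
      status=1
    fi
  done
  return "$status"
}
```

Example: `check_atlas_zk /opt/modules/atlas-2.1.0-alone/conf/atlas-application.properties localhost:2182 && echo consistent`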

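Atlas Cluster

Omitted for now.

Importing Hive Tables

Hive needs to be running the MetaStore service
Check the Hive configuration and make sure conf/hive-site.xml contains the MetaStore setting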

  
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop114:9083</value>
  </property>

Start the MetaStore service

cd /opt/modules/hive-3.1.2
# Start MetaStore in the background
nohup bin/hive --service metastore &> log/metastore.log &
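
Because the service is started in the background, it is worth checking that the thrift port is actually accepting connections before running the import. A minimal sketch follows; the helper names are invented for illustration, and `port_open` relies on bash's /dev/tcp pseudo-device, so it assumes the script runs under bash.

```shell
# Split a metastore URI such as thrift://host:9083 into "host port".
parse_thrift_uri() {
  hostport=${1#thrift://}
  echo "${hostport%%:*} ${hostport##*:}"
}

# Return 0 if something accepts TCP connections on HOST PORT (bash only).
# The connection is opened in a subshell, so it closes again immediately.
port_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}
```

Example: `set -- $(parse_thrift_uri thrift://hadoop114:9083); port_open "$1" "$2" && echo "MetaStore is listening"`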

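Addendum: if the metadata is stored in MySQL rather than the embedded Derby database, configuring and starting MetaStore is optional
Atlas will connect to MySQL directly using the user name and password in the Hive configuration file, but going through MetaStore is the safest option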

Extract the Hive hook

# Extract and rename
tar -xf apache-atlas-2.1.0-hive-hook.tar.gz -C /opt/modules/
cd /opt/modules/ && mv apache-atlas-hive-hook-2.1.0/ atlas-hive-hook-2.1.0/

Configure Hive to load the hook
Add the following to conf/hive-site.xml (reference: https://atlas.apache.org/#/HookHive)

  
  <property>
    <name>hive.exec.post.hooks</name>
    <value>org.apache.atlas.hive.hook.HiveHook</value>
  </property>

Add the Hive hook's dependency jars in conf/hive-env.sh

export HADOOP_HOME=/opt/modules/hadoop-3.1.3
# Add TEZ_HOME and the TEZ_JARS path
export TEZ_HOME=/opt/modules/tez-0.10
export TEZ_CONF_DIR=$TEZ_HOME/conf
export TEZ_JARS=$(find $TEZ_HOME -name '*.jar' | xargs echo | tr ' ' ':')
export HIVE_AUX_JARS_PATH=$HADOOP_HOME/share/hadoop/common/hadoop-lzo-0.4.20.jar:$TEZ_JARS
# Append the Atlas Hive hook's dependency jars to HIVE_AUX_JARS_PATH
export ATLAS_HIVE_HOOK_HOME=/opt/modules/atlas-hive-hook-2.1.0/hook/hive
export ATLAS_HIVE_HOOK_JARS=$(find $ATLAS_HIVE_HOOK_HOME -name '*.jar' | xargs echo | tr ' ' ':')
export HIVE_AUX_JARS_PATH=$HIVE_AUX_JARS_PATH:$ATLAS_HIVE_HOOK_JARS
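
The `find | xargs echo | tr ' ' ':'` idiom used above turns every space into a colon, so it would split any jar path that contains a space. If that is a concern, one possible alternative is to join on newlines instead. `jar_classpath` is a hypothetical helper name, and the sketch still assumes no path contains a newline or a colon.

```shell
# Print every *.jar under DIR as one colon-separated, sorted classpath string.
# Argument: DIR.
jar_classpath() {
  find "$1" -name '*.jar' | sort | paste -sd: -
}
```

It could then be used as `export ATLAS_HIVE_HOOK_JARS=$(jar_classpath "$ATLAS_HIVE_HOOK_HOME")`.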

Copy atlas-application.properties to /opt/modules/hive-3.1.2/conf

# Fetch the Atlas application configuration file
cd /opt/modules/hive-3.1.2/conf &&\
  sftp hadoop112 <<< "get /opt/modules/atlas-2.1.0-alone/conf/atlas-application.properties"

Run the Atlas program that imports Hive metadata

# Set HIVE_HOME inline for a single run
cd /opt/modules/atlas-hive-hook-2.1.0 &&\
  HIVE_HOME=/opt/modules/hive-3.1.2 hook-bin/import-hive.sh
# Or write HIVE_HOME into hook-bin/import-hive.sh
## so that later runs can call the script directly
cd /opt/modules/atlas-hive-hook-2.1.0 && hook-bin/import-hive.sh
## Hive Meta Data imported successfully!!!
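
After the success message, one way to double-check that tables really arrived is to ask Atlas for entities of type hive_table through its basic-search REST endpoint. The sketch below assumes the REST address configured earlier and the stock `admin:admin` account; `atlas_search_url`, `atlas_search` and `ATLAS_AUTH` are names invented for this example.

```shell
# Build the basic-search URL for an Atlas base URL, a type name and an optional limit.
# Arguments: BASE_URL, TYPE_NAME, LIMIT (default 10).
atlas_search_url() {
  echo "$1/api/atlas/v2/search/basic?typeName=$2&limit=${3:-10}"
}

# Run the search and print the raw JSON response.
# ATLAS_AUTH is a hypothetical override for the assumed admin:admin account.
atlas_search() {
  curl -fs -u "${ATLAS_AUTH:-admin:admin}" "$(atlas_search_url "$@")"
}
```

Example: `atlas_search http://hadoop112:21000 hive_table` should return JSON listing some of the imported tables; an empty result suggests the import did not reach this Atlas instance.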

Notes

If several Hive deployments keep their metadata in different places, imported metadata with the same name will be merged
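
A likely explanation for the merge, stated here as an assumption to verify against your Atlas version: Atlas identifies a Hive table by a qualifiedName of the form db.table@namespace, where the namespace comes from atlas.metadata.namespace (or atlas.cluster.name) in the hook's copy of atlas-application.properties and is assumed to default to primary. Two deployments that share the namespace therefore produce identical names. The helper below is purely illustrative and not part of Atlas.

```shell
# Compose the qualifiedName that a Hive table is assumed to get in Atlas.
# Arguments: DB, TABLE, NAMESPACE (assumed default: primary).
hive_table_qualified_name() {
  printf '%s.%s@%s\n' "$1" "$2" "${3:-primary}"
}

# Report whether the same table in two namespaces would collide.
# Arguments: DB, TABLE, NAMESPACE_A, NAMESPACE_B.
same_entity() {
  if [ "$(hive_table_qualified_name "$1" "$2" "$3")" = \
       "$(hive_table_qualified_name "$1" "$2" "$4")" ]; then
    echo "merged"
  else
    echo "distinct"
  fi
}
```

Under that assumption, `same_entity default orders primary primary` prints merged, while giving each deployment its own namespace value, for example `same_entity default orders prod_a prod_b`, prints distinct.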


Original article: https://blog.csdn.net/yoshubom/article/details/126106436