• hive-on-spark


    Modifying the Hive source code is not covered here, since I did not build it myself; I downloaded an already-patched build.

    This part only covers the errors encountered and the configuration.

    The versions used are Hive 3.1.2 and Spark 3.0.0.

    Modify the Hadoop configuration

    Since Spark will be installed, Hadoop's YARN needs to know about Spark, so add the Spark-related setting.

    Edit Hadoop's yarn-site.xml

    [root@hadoop102 hadoop]# cd /opt/module/hadoop-3.1.3/etc/hadoop
    [root@hadoop102 hadoop]# vim yarn-site.xml
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME,SPARK_HOME</value>
    </property>

    Full yarn-site.xml configuration

    <?xml version="1.0"?>
    <!--
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License. See accompanying LICENSE file.
    -->
    <configuration>
        <!-- MR shuffle -->
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>hadoop103</value>
        </property>
        <property>
            <name>yarn.nodemanager.env-whitelist</name>
            <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME,SPARK_HOME</value>
        </property>
        <!-- Enable log aggregation -->
        <property>
            <name>yarn.log-aggregation-enable</name>
            <value>true</value>
        </property>
        <!-- Log aggregation server address -->
        <property>
            <name>yarn.log.server.url</name>
            <value>http://hadoop102:19888/jobhistory/logs</value>
        </property>
        <!-- Keep aggregated logs for 7 days -->
        <property>
            <name>yarn.log-aggregation.retain-seconds</name>
            <value>604800</value>
        </property>
        <!-- Whether to count logical processors as CPU cores; default is false, i.e. use physical CPUs -->
        <property>
            <name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
            <value>false</value>
        </property>
        <!-- NodeManager memory; default is 8 GB, my machine has 3 GB, so set it to 2 GB -->
        <property>
            <name>yarn.nodemanager.resource.memory-mb</name>
            <value>2048</value>
        </property>
        <!-- NodeManager CPU cores; default is 8, my machine has 2, so set it to 2 -->
        <property>
            <name>yarn.nodemanager.resource.cpu-vcores</name>
            <value>2</value>
        </property>
        <!-- Minimum container memory, set to 1 GB -->
        <property>
            <name>yarn.scheduler.minimum-allocation-mb</name>
            <value>1024</value>
        </property>
        <!-- Maximum container memory; default is 8 GB, set to 2 GB -->
        <property>
            <name>yarn.scheduler.maximum-allocation-mb</name>
            <value>2048</value>
        </property>
        <!-- Minimum container vcores, set to 1 -->
        <property>
            <name>yarn.scheduler.minimum-allocation-vcores</name>
            <value>1</value>
        </property>
        <!-- Maximum container vcores; default is 4, set to 1 -->
        <property>
            <name>yarn.scheduler.maximum-allocation-vcores</name>
            <value>1</value>
        </property>
        <!-- Virtual memory check; enabled by default, disable it here -->
        <property>
            <name>yarn.nodemanager.vmem-check-enabled</name>
            <value>false</value>
        </property>
        <!-- Ratio of virtual to physical memory; default is 2.1 -->
        <property>
            <name>yarn.nodemanager.vmem-pmem-ratio</name>
            <value>2.1</value>
        </property>
        <property>
            <name>yarn.application.classpath</name>
            <value>/opt/module/hadoop-3.1.3/etc/hadoop:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/*:/opt/module/hadoop-3.1.3/share/hadoop/common/*:/opt/module/hadoop-3.1.3/share/hadoop/hdfs:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/*:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/*:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/lib/*:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/*:/opt/module/hadoop-3.1.3/share/hadoop/yarn:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/*:/opt/module/hadoop-3.1.3/share/hadoop/yarn/*</value>
        </property>
    </configuration>
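
    After editing, the same yarn-site.xml has to be present on every YARN node before restarting, otherwise the whitelist change will not take effect. A minimal sketch, assuming passwordless SSH; hadoop103 is the ResourceManager from the config above, and hadoop104 is an assumed additional worker (the original series uses a custom xsync script instead):

    # Hypothetical: push the edited yarn-site.xml to the other nodes
    for host in hadoop103 hadoop104; do
        rsync -av /opt/module/hadoop-3.1.3/etc/hadoop/yarn-site.xml \
            root@${host}:/opt/module/hadoop-3.1.3/etc/hadoop/
    done
    # Restart YARN from the ResourceManager node so the new settings are loaded
    ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh && /opt/module/hadoop-3.1.3/sbin/start-yarn.sh"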

    Spark configuration

    Configure the environment variables

    Download: Index of /dist/spark/spark-3.0.0

    # Extract the archive and rename the directory to spark
    [root@hadoop102 software]# tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/module/
    [root@hadoop102 software]# cd /opt/module/
    [root@hadoop102 module]# mv spark-3.0.0-bin-hadoop3.2 spark
    ########### Configure the environment variables
    [root@hadoop102 module]# vim /etc/profile.d/my_env.sh
    ##SPARK_HOME
    export SPARK_HOME=/opt/module/spark
    export PATH=$PATH:$SPARK_HOME/bin
    [root@hadoop102 module]# source /etc/profile.d/my_env.sh
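
    To confirm the environment variable is picked up, a quick check (sketch; the version output will vary with the build):

    # SPARK_HOME should resolve and spark-submit should print the Spark 3.0.0 banner
    [root@hadoop102 module]# echo $SPARK_HOME
    [root@hadoop102 module]# spark-submit --version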

    Hive configuration

    Create the Spark configuration file in Hive's conf directory

    [root@hadoop102 module]# vim /opt/module/hive/conf/spark-defaults.conf
    spark.master                 yarn
    spark.eventLog.enabled       true
    spark.eventLog.dir           hdfs://hadoop102:8020/spark-history
    spark.executor.memory        1g
    spark.driver.memory          1g

    Create the following paths on HDFS: one for storing the history logs, and one for the pure (without-hadoop) Spark jars to be uploaded

    [root@hadoop102 software]$ hadoop fs -mkdir /spark-history
    [root@hadoop102 software]$ tar -zxvf /opt/software/spark-3.0.0-bin-without-hadoop.tgz
    [root@hadoop102 software]$ hadoop fs -mkdir /spark-jars
    [root@hadoop102 software]$ hadoop fs -put spark-3.0.0-bin-without-hadoop/jars/* /spark-jars
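
    A quick sanity check that the upload worked (sketch; the exact file count depends on the Spark build):

    # The directory should now hold the jars from the without-hadoop build
    [root@hadoop102 software]$ hadoop fs -ls /spark-jars | head
    [root@hadoop102 software]$ hadoop fs -count /spark-jars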

    Modify the hive-site.xml file

    [root@hadoop102 ~]$ vim /opt/module/hive/conf/hive-site.xml
    Add the following content:
    <!-- Location of the Spark jars (note: port 8020 must match the NameNode's port) -->
    <property>
        <name>spark.yarn.jars</name>
        <value>hdfs://hadoop102:8020/spark-jars/*</value>
    </property>
    <!-- Hive execution engine -->
    <property>
        <name>hive.execution.engine</name>
        <value>spark</value>
    </property>
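
    If you are not sure which port the NameNode listens on, it can be read back from the Hadoop configuration (sketch using the standard hdfs getconf utility):

    # fs.defaultFS carries the NameNode RPC address, e.g. hdfs://hadoop102:8020
    [root@hadoop102 ~]$ hdfs getconf -confKey fs.defaultFS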

    Test

    [root@hadoop102 hive]$ bin/hive
    hive (default)> create table student(id int, name string);
    hive (default)> insert into table student values(1,'abc');
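
    If everything is wired up, the insert launches a Spark application on YARN instead of a MapReduce job. Two hedged checks (output will vary):

    # Inside the Hive CLI: should print hive.execution.engine=spark
    hive (default)> set hive.execution.engine;
    # From the shell: the Hive-on-Spark session shows up as a YARN application
    [root@hadoop102 hive]$ yarn application -list -appStates ALL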

    Modify the configuration to prevent small files from being combined

    Combining would merge the LZO files with their index files, so the data could no longer be split.

    Small-file combining only needs to be disabled when LZO is used.

    Run the set statement before executing the SQL; it is only effective in the current session.

    Default:
    hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
    Run in Hive:
    set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
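
    Put together, a hedged session sketch (the table name ods_log and its LZO storage are assumptions for illustration, not from the original):

    -- Disable file combining only for this session, then query the LZO-backed table
    hive (default)> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
    hive (default)> select count(*) from ods_log;
    -- Switch back to the default afterwards so other tables still benefit from combining
    hive (default)> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;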

  • Original article: https://blog.csdn.net/yangguangniubi/article/details/127687454