Modifying the source code is not covered here, because I don't know how to do it; I downloaded the already-patched source instead.
This post only covers the errors and the configuration.
The versions used are Hive 3.1.2 and Spark 3.0.0.
Since we are installing Spark, Hadoop's YARN needs to know about Spark, so we add SPARK_HOME to the YARN configuration.
Open Hadoop's yarn-site.xml:
- [root@hadoop102 hadoop]# cd /opt/module/hadoop-3.1.3/etc/hadoop
-
- [root@hadoop102 hadoop]# vim yarn-site.xml
-
-
- <property>
- <name>yarn.nodemanager.env-whitelist</name>
- <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME,SPARK_HOME</value>
- </property>
The complete yarn-site.xml configuration:
- <?xml version="1.0"?>
- <!--
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License. See accompanying LICENSE file.
- -->
- <configuration>
- <!-- MR shuffle service -->
- <property>
- <name>yarn.nodemanager.aux-services</name>
- <value>mapreduce_shuffle</value>
- </property>
- <property>
- <name>yarn.resourcemanager.hostname</name>
- <value>hadoop103</value>
- </property>
- <property>
- <name>yarn.nodemanager.env-whitelist</name>
- <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME,SPARK_HOME</value>
- </property>
- <!-- Enable log aggregation -->
- <property>
- <name>yarn.log-aggregation-enable</name>
- <value>true</value>
- </property>
- <!-- Log aggregation server URL -->
- <property>
- <name>yarn.log.server.url</name>
- <value>http://hadoop102:19888/jobhistory/logs</value>
- </property>
- <!-- Keep aggregated logs for 7 days -->
- <property>
- <name>yarn.log-aggregation.retain-seconds</name>
- <value>604800</value>
- </property>
- <!-- Whether to count logical processors as CPU cores; default is false, i.e. use physical CPUs -->
- <property>
- <name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
- <value>false</value>
- </property>
- <!-- Memory available to the NodeManager; the default is 8 GB, my machine has 3 GB, so set it to 2 GB -->
- <property>
- <name>yarn.nodemanager.resource.memory-mb</name>
- <value>2048</value>
- </property>
- <!-- Number of CPU vcores for the NodeManager; the default is 8, my machine has 2 cores, so set it to 2 -->
- <property>
- <name>yarn.nodemanager.resource.cpu-vcores</name>
- <value>2</value>
- </property>
- <!-- Minimum container memory, set to 1 GB -->
- <property>
- <name>yarn.scheduler.minimum-allocation-mb</name>
- <value>1024</value>
- </property>
- <!-- Maximum container memory, default 8 GB, set to 2 GB -->
- <property>
- <name>yarn.scheduler.maximum-allocation-mb</name>
- <value>2048</value>
- </property>
- <!-- Minimum container vcores, set to 1 -->
- <property>
- <name>yarn.scheduler.minimum-allocation-vcores</name>
- <value>1</value>
- </property>
- <!-- Maximum container vcores, default 4, set to 1 -->
- <property>
- <name>yarn.scheduler.maximum-allocation-vcores</name>
- <value>1</value>
- </property>
- <!-- Virtual memory check, enabled by default, turn it off -->
- <property>
- <name>yarn.nodemanager.vmem-check-enabled</name>
- <value>false</value>
- </property>
- <!-- Ratio of virtual to physical memory, default 2.1 -->
- <property>
- <name>yarn.nodemanager.vmem-pmem-ratio</name>
- <value>2.1</value>
- </property>
- <property>
- <name>yarn.application.classpath</name>
- <value>/opt/module/hadoop-3.1.3/etc/hadoop:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/*:/opt/module/hadoop-3.1.3/share/hadoop/common/*:/opt/module/hadoop-3.1.3/share/hadoop/hdfs:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/*:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/*:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/lib/*:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/*:/opt/module/hadoop-3.1.3/share/hadoop/yarn:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/*:/opt/module/hadoop-3.1.3/share/hadoop/yarn/*</value>
- </property>
- </configuration>
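After editing yarn-site.xml, the change has to reach every NodeManager, and YARN has to be restarted to pick it up. A minimal sketch, assuming the other nodes are hadoop103 and hadoop104 (adjust to your own cluster; a sync script such as xsync works just as well) and that YARN is started from the ResourceManager node hadoop103:
- # hadoop104 below is an assumed third node; replace with your actual hostnames
- [root@hadoop102 hadoop]# scp yarn-site.xml hadoop103:/opt/module/hadoop-3.1.3/etc/hadoop/
- [root@hadoop102 hadoop]# scp yarn-site.xml hadoop104:/opt/module/hadoop-3.1.3/etc/hadoop/
- [root@hadoop103 ~]# stop-yarn.sh
- [root@hadoop103 ~]# start-yarn.sh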

Download address: Index of /dist/spark/spark-3.0.0 (the Apache dist directory for Spark 3.0.0)
- Extract the archive and rename the directory to spark:
- [root@hadoop102 software]# tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/module/
- [root@hadoop102 software]# cd /opt/module/
-
- [root@hadoop102 module]# mv spark-3.0.0-bin-hadoop3.2 spark
- ########### Configure the environment variables
- [root@hadoop102 module]# vim /etc/profile.d/my_env.sh
-
- ##SPARK_HOME
- export SPARK_HOME=/opt/module/spark
- export PATH=$PATH:$SPARK_HOME/bin
- [root@hadoop102 module]# source /etc/profile.d/my_env.sh
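A quick optional sanity check that the new environment variable is picked up; spark-submit should print the Spark 3.0.0 version banner:
- [root@hadoop102 module]# spark-submit --version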
-
- [root@hadoop102 module]# vim /opt/module/hive/conf/spark-defaults.conf
-
-
- spark.master yarn
- spark.eventLog.enabled true
- spark.eventLog.dir hdfs://hadoop102:8020/spark-history
- spark.executor.memory 1g
- spark.driver.memory 1g
- [root@hadoop102 software]$ hadoop fs -mkdir /spark-history
- [root@hadoop102 software]$ tar -zxvf /opt/software/spark-3.0.0-bin-without-hadoop.tgz
- [root@hadoop102 software]$ hadoop fs -mkdir /spark-jars
- [root@hadoop102 software]$ hadoop fs -put spark-3.0.0-bin-without-hadoop/jars/* /spark-jars
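Before pointing Hive at the uploaded jars, it is worth confirming that they really landed in HDFS:
- [root@hadoop102 software]$ hadoop fs -ls /spark-jars | head
- [root@hadoop102 software]$ hadoop fs -du -s -h /spark-jars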
- [root@hadoop102 ~]$ vim /opt/module/hive/conf/hive-site.xml
- Add the following content:
- <!-- Location of the Spark dependency jars (note: the port 8020 must match the NameNode port) -->
- <property>
- <name>spark.yarn.jars</name>
- <value>hdfs://hadoop102:8020/spark-jars/*</value>
- </property>
-
- <!-- Hive execution engine -->
- <property>
- <name>hive.execution.engine</name>
- <value>spark</value>
- </property>
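Since spark.yarn.jars hard-codes hdfs://hadoop102:8020, double-check that 8020 really is your NameNode RPC port; it comes from fs.defaultFS in core-site.xml. A quick check (the expected output assumes the setup above):
- [root@hadoop102 ~]$ hdfs getconf -confKey fs.defaultFS
- hdfs://hadoop102:8020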
- [root@hadoop102 hive]$ bin/hive
-
- hive (default)> create table student(id int, name string);
-
- hive (default)> insert into table student values(1,'abc');
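If the configuration is correct, the insert runs as a Spark job on YARN rather than MapReduce; the very first statement is slow because the Spark session on YARN has to start up. You can query the row back and, optionally, watch the application in the YARN ResourceManager web UI (hadoop103 in the configuration above):
- hive (default)> select * from student;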
A note on LZO: small-file combining would merge the .lzo data files with their index files, so the input can no longer be split.
Small-file combining only needs to be turned off for LZO-compressed tables.
Run the following setting before executing your SQL; it only takes effect in the current session (window).
- Default:
- hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
- Run this in Hive before the SQL:
- set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
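For example, in a single Hive session (a sketch; student is just the test table created above):
- hive (default)> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
- hive (default)> select count(*) from student;
Opening a new Hive session restores the default CombineHiveInputFormat automatically.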