This is a WordCount example: developing a Spark job on Windows in IDEA and running it against HDFS.
Download Hadoop.
Unzip Hadoop to a directory of your choice.
Put hadoop.dll and winutils.exe into C:\Windows\System32.
Also put hadoop.dll and winutils.exe into the unzipped hadoop\bin directory.
Configure the environment variables (a sketch follows after these steps).
Open cmd and run hadoop version.
The version information should be displayed.
Restart IDEA.
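A minimal sketch of the environment variables, assuming Hadoop was unzipped to C:\hadoop (a placeholder path; adjust it to your own directory):

    HADOOP_HOME=C:\hadoop
    Path=%Path%;%HADOOP_HOME%\bin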
3. Scala environment setup
Reference: "Installing the Scala plugin in IDEA offline" (a CSDN blog post).
4. Spark dependencies
The versions must match exactly, or you will get errors: Spark artifacts carry the Scala binary version in their names (spark-core_2.11, spark-sql_2.11), so the Scala library must be a matching 2.11.x (2.11.12 here, with Spark 2.4.0).
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>

        <scala.version>2.11.12</scala.version>
        <spark.version>2.4.0</spark.version>
        <java.version>1.8</java.version>
    </properties>

    <groupId>org.example</groupId>
    <artifactId>DataPrepare</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
            <version>1.7.25</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>

        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <useFile>false</useFile>
                    <disableXmlReport>true</disableXmlReport>
                    <includes>
                        <include>**/*Test.*</include>
                        <include>**/*Suite.*</include>
                    </includes>
                </configuration>
            </plugin>

            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <!-- set this to your main class, e.g. WordCount, before packaging -->
                            <mainClass></mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
5. Problems you are very likely to run into
I forgot to take the screenshots. (On Windows these are typically errors about winutils.exe or hadoop.dll; the System32 and hadoop\bin steps above exist to prevent them.)
6. The code (the simplest part)
In the Maven project, create a scala directory, then right-click it -> Mark Directory as -> Sources Root.
The directory turns blue; once it is blue, you can create Scala sources in it.
Upload wordcount.txt to HDFS: first copy it to the server, then run
hadoop fs -put /wordcount.txt /
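For reference, a hypothetical wordcount.txt (made up for illustration; the program below splits lines on commas, so the words are comma-separated):

    hello,world,hello
    spark,hello

You can verify the upload with hadoop fs -cat /wordcount.txt.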
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {

  def main(args: Array[String]): Unit = {
    // local mode with 3 threads; remove setMaster when submitting to a cluster
    val conf: SparkConf = new SparkConf().setMaster("local[3]").setAppName("hdfsTest")
    val sc = new SparkContext(conf)

    // read the input file from HDFS (adjust the NameNode host/port to your cluster)
    val lines: RDD[String] = sc.textFile("hdfs://192.168.30.101:8020/wordcount.txt")

    // split each line on commas, map every word to (word, 1),
    // sum the counts per word, then collect and print the results on the driver
    lines.flatMap(line => line.split(","))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)

    sc.stop()
  }
}
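With the hypothetical input above (hello,world,hello / spark,hello), the pairs printed on the driver would be the per-word counts, in no guaranteed order:

    (hello,3)
    (spark,1)
    (world,1)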
7. Package the tested code into a jar
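A sketch of the packaging command, assuming the POM above; since the assembly plugin there is not bound to a lifecycle phase, its single goal is invoked explicitly (fill in mainClass first):

    mvn clean package assembly:single

Following the artifactId and version in the POM, this should produce target/DataPrepare-1.0-SNAPSHOT-jar-with-dependencies.jar.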
8. After packaging, run it on Linux
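A minimal sketch of the submit command, assuming Spark is installed on the server; the class name and jar name come from the code and POM above, and the jar path is a placeholder:

    spark-submit --class WordCount --master local[3] /path/to/DataPrepare-1.0-SNAPSHOT-jar-with-dependencies.jar

Note that the setMaster("local[3]") hard-coded in the code takes precedence over the --master flag, so remove it from the code if you want to run on YARN or a standalone cluster.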