现有一份汽车销售记录,销售记录月、市、区县、品牌、车辆类型、使用性质和数量7列
按步骤完成如下操作(建议在Spark-shell中完成):
使用SparkSQL完成
数据如下:
- 3,朔州区,朔城区,五菱,小型普通客车,非运营,1
- 3,晋城区,城区,东风,小型普通客车,非运营,1
- 12,长治市,长治城区,北京,小型普通客车,非运营,1
- 12,长治市,长治城区,东风,小型普通客车,非运营,1
- 12,长治市,长治城区,东风,小型普通客车,非运营,1
- 12,长治市,长治城区,东风,小型普通客车,非运营,1
- 12,长治市,长治城区,航天,小型普通客车,非运营,1
- 12,长治市,长治城区,航天,小型普通客车,非运营,1
- 12,长治市,长治城区,航天,小型普通客车,非运营,1
- 12,长治市,长治城区,航天,小型普通客车,非运营,1
- 12,长治市,长治城区,航天,小型普通客车,非运营,1
- 12,长治市,长治城区,航天,小型普通客车,非运营,1
- 12,长治市,长治城区,航天,小型普通客车,非运营,1
- 12,长治市,长治城区,航天,小型普通客车,非运营,1
- 12,长治市,长治城区,航天,小型普通客车,非运营,1
- 12,长治市,长治城区,航天,小型普通客车,非运营,1
- 12,长治市,长治城区,航天,小型普通客车,非运营,1
- 12,长治市,长治城区,航天,小型普通客车,非运营,1
- 12,长治市,长治城区,航天,小型普通客车,非运营,1
- 12,长治市,长治城区,航天,小型普通客车,非运营,1
代码:
- //创建文本
- vim car.txt
- //上传文本
- hdfs dfs -put car.txt /
- 1.读取rdd文件
- scala> val carRDD = sc.textFile("/car.txt")
- 2.查看是否读取成功
- scala> carRDD.collect.foreach(println)

- 3.表头信息
- scala> import org.apache.spark.sql.types._
-
- scala> val str = Array(StructField("month",IntegerType),StructField("city",StringType),StructField("county",StringType),StructField("brand",StringType),StructField("catType",StringType),StructField("nature",StringType),StructField("quantity",IntegerType))

- 4.表格信息
- scala> import org.apache.spark.sql._
-
- val carsRDD = carRDD.map{line => val fields = line.split(",");
- Row(fields(0).toInt,fields(1),fields(2),fields(3),fields(4),fields(5),fields(6).toInt)
- }
- 5.拼接
- scala> import spark.implicits._
-
- scala> val st = StructType(str)
-
- scala> val carDataFrame = spark.createDataFrame(carsRDD,st)
-
carDataFrame.write.format("json").save("/car.json")
- 上面构建DataFrame
- 直接创建临时表
- carDataFrame.createOrReplaceTempView("car")
spark.sql("select brand,sum(quantity) c from car group by brand order by c desc").show
spark.sql("select brand,month,sum(quantity) c from car group by brand,month").show

spark.sql("select city,sum(quantity) c from car group by city order by c").show
spark.sql("select city,catType,sum(quantity) c from car group by city,catType").show
- 1.先统计各城市各区县的汽车销量
- select city,county,sum(quantity) c from cars group by city,county
-
- 2.最大的区县
- spark.sql("select a.city,a.county,max(a.c) from (
- select city,county,sum(quantity) c from cars group by city,county
- )a group by a.city,a.county").show

- 1.1-12月非运营车辆中各个品牌的销量排序
- spark.sql("select brand,sum(quantity) c from car where month between 1 and 6 and nature = '非运营' group by brand order by c
-
- 2.求在前三名
- spark.sql("select a.brand,a.c from (select brand,sum(quantity) c from car where month between 1 and 12 and nature = '非运营' group by brand order by c desc) a limit 2").show
