• Hive高手之路:二、hive数据分析实验


    数据定义、数据插入、数据查询实验跳过

    目录

    一.ELT原始数据处理

    1.新建项目

    2.添加项目依赖

    3.文件内容 

    1)数据抽取、拆分工具类函数 

    2)继承Mapper

    3)运行Runner 

    4.打包

    ​编辑

     5.上传至hdfs

    1)启动环境

     2)上传数据文件与jar包

    3)数据上传至集群

    4)运行jar包

     二.准备工作

    1.准备hive环境

    2.创建表

    1)gulivideo_ori

    2)gulivideo_user_ori 

    3) gulivideo_orc 

     4)gulivideo_user_orc

    3.导入ELT后的数据

    4.向ORC表插入数据 

     5.查看导入的数据结果

     三、业务分析

    1.统计观看视频数Top10

    ​编辑 2.统计视频类热度

     3.统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数

     4.统计视频观看数Top50所关联视频的所属类别Rank

     5.统计每个类别中的视频热度Top10,以Music为例

    1)创建类别表

     2)向类别表中插入数据

     3)统计Music类别的Top10(也可以统计其他)

     6.统计每个类别中视频流量Top10,以Music例

    7.统计上传视频最多的用户Top10以及他们上传的观看次数在前20的视频

     8.统计每个类别视频观看数Top10


    一.ELT原始数据处理

    1.新建项目

    2.添加项目依赖

    3.文件内容 

    1)数据抽取、拆分工具类函数 

    2)继承Mapper

    3)运行Runner 

    4.打包

     5.上传至hdfs

    1)启动环境

     2)上传数据文件与jar包

    3)数据上传至集群

    [root@hadoop001 hive]#  hdfs dfs -put guiliVideo /

    4)运行jar包

    [root@hadoop001 hive]# yarn jar

     /root/hive/guli.video-1.0-SNAPSHOT.jar

    com.guli.mapper.VideoETLRunner /guiliVideo/video/2008/0222 /guliout

     二.准备工作

    1.准备hive环境

    hive> SHOW DATABASES;

    hive> SET hive.cli.print.current.db=true;

    hive (default)> USE text;

    hive (text)> set hive.cli.print.header=true;

    2.创建表

    1)gulivideo_ori

    hive (text)> create table gulivideo_ori(

               >     videoId string,

               >     uploader string,

               >     age int,

               >     category array,

               >     length int,

               >     views int,

               >     rate float,

               >     ratings int,

               >     comments int,

               >     relatedId array)

               > row format delimited fields terminated by "\t"

               > collection items terminated by "&"

               > stored as textfile;

    2)gulivideo_user_ori 

    hive (text)> create table gulivideo_user_ori(

               >     uploader string,

               >     videos int,

               >     friends int)

               > row format delimited fields terminated by "\t"

               > stored as textfile;

    3) gulivideo_orc 

    hive (text)> create table gulivideo_orc(

               >     videoId string,

               >     uploader string,

               >     age int,

               >     category array,

               >     length int,

               >     views int,

               >     rate float,

               >     ratings int,

               >     comments int,

               >     relatedId array)

               > clustered by (uploader) into 8 buckets

               > row format delimited fields terminated by "\t"

               > collection items terminated by "&"

               > stored as orc;

     4)gulivideo_user_orc

    hive (text)> create table gulivideo_user_orc(

               >     uploader string,

               >     videos int,

               >     friends int)

               > row format delimited fields terminated by "\t"

               > stored as orc;

    3.导入ELT后的数据

    hive (text)> load data inpath "/guliout" into table gulivideo_ori;

    hive (text)> load data inpath "/guiliVideo/user/2008/0903" into table gulivideo_user_ori;

    4.向ORC表插入数据 

    hive (text)> insert into table gulivideo_orc select * from gulivideo_ori;

    hive (text)> insert into table gulivideo_user_orc select * from gulivideo_user_ori;

     5.查看导入的数据结果

    hive (text)> select * from gulivideo_orc limit 2;

    hive (text)> select * from gulivideo_user_orc limit 2;

     三、业务分析

    1.统计观看视频数Top10

    hive (text)> select videoId,uploader,age,category,length,views,rate,ratings,comments from gulivideo_orc order by views desc limit 10;

     2.统计视频类热度

    hive (text)> select

               >     category_name as category,

               >     count(t1.videoId) as hot

               > from (

               >     select

               >         videoId,

               >         category_name

               >     from

               >         gulivideo_orc lateral view explode(category) t_catetory as

               > category_name) t1

               > group by

               >     t1.category_name

               > order by

               >     hot

               > desc limit

               >     10;

     3.统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数

    hive (text)> select

               >     category_name as category,

               >     count(t2.videoId) as hot_with_views

               > from (

               >     select

               >         videoId,

               >         category_name

               >     from (

               >         select

               >             *

               >         from

               >             gulivideo_orc

               >         order by

               >             views

               >             desc limit

               >                 20) t1 lateral view explode(category) t_catetory as

               > category_name) t2

               > group by

               >     category_name

               > order by

               >     hot_with_views

               > desc;

     4.统计视频观看数Top50所关联视频的所属类别Rank

    hive (text)> select

               >     category_name as category,

               >     count(t5.videoId) as hot

               > from (

               >     select

               >         videoId,

               >         category_name

               >     from (

               >         select

               >         distinct(t2.videoId),

               >         t3.category

               >         from (

               >             select

               >                 explode(relatedId) as videoId

               >             from (

               >                 select

               >                     *

               >                 from

               >                     gulivideo_orc

               >                 order by

               >                     views

               >                 desc limit

               >                     50) t1) t2

               >         inner join

               >             gulivideo_orc t3 on t2.videoId = t3.videoId) t4 lateral view

               > explode(category) t_catetory as category_name) t5

               > group by

               >     category_name

               > order by

               >     hot

               > desc;

     5.统计每个类别中的视频热度Top10,以Music为例

    1)创建类别表

    hive (text)> create table gulivideo_category(

               >     videoId string,

               >     uploader string,

               >     age int,

               >     categoryId string,

               >     length int,

               >     views int,

               >     rate float,

               >     ratings int,

               >     comments int,

               >     relatedId array)

               > row format delimited

               > fields terminated by "\t"

               > collection items terminated by "&"

               > stored as orc;

     2)向类别表中插入数据

    hive (text)> insert into table gulivideo_category

               >     select

               >         videoId,

               >         uploader,

               >         age,

               >         categoryId,

               >         length,

               >         views,

               >         rate,

               >         ratings,

               >         comments,

               >         relatedId

               >     from

               >         gulivideo_orc lateral view explode(category) catetory as

               > categoryId;

     3)统计Music类别的Top10(也可以统计其他)

    hive (text)> select

               >     videoId,

               >     views

               > from

               >     gulivideo_category

               > where

               >     categoryId = "Music"

               > order by

               >     views

               > desc limit

               >     10;

     6.统计每个类别中视频流量Top10,以Music例

    hive (text)> select

               >     videoId,

               >     views,

               >     ratings

               > from

               >     gulivideo_category

               > where

               >     categoryId = "Music"

               > order by

               >     ratings

               > desc limit

               >     10;

    7.统计上传视频最多的用户Top10以及他们上传的观看次数在前20的视频

    hive (text)> select

               >     t2.videoId,

               >     t2.views,

               >     t2.ratings,

               >     t1.videos,

               >     t1.friends

               > from (

               >     select

               >         *

               >     from

               >         gulivideo_user_orc

               >     order by

               >         videos desc

               >     limit

               >         10) t1

               >     join

               >         gulivideo_orc t2

               >     on

               >         t1.uploader = t2.uploader

               >     order by

               >         views desc

               >     limit

               >         20;

     8.统计每个类别视频观看数Top10

    hive (text)> select

               >     t1.*

               > from (

               >     select

               >         videoId,

               >         categoryId,

               >         views,

               > row_number() over(partition by categoryId order by views desc)

               > rank from gulivideo_category) t1

               > where

               >     rank <= 10;

  • 相关阅读:
    实训十八:RIP2邻居认证
    Feign远程调用
    【GCN-RS】MCL: Mixed-Centric Loss for Collaborative Filtering (WWW‘22)
    EventBus3 笔记
    For Further Reference
    北斗导航 | GNSS数据处理:预处理与参数估计模型
    UTF-8、UTF-16 和 UTF-32 字符编码之间的区别?[图文详解]
    【openEuler】内核模块的插入删除
    Docker的安装(以CentOS 7为例)
    详解闲鱼推荐系统(长文收藏)
  • 原文地址:https://blog.csdn.net/weixin_46490924/article/details/127313496