目录
- --方式一:正式建表
- CREATE TABLE `hdw_dim.dim_city_info_df`(
- city_id STRING COMMENT '城市id'
- ,city_name STRING COMMENT '城市名称'
- ,provice_id STRING COMMENT '省份id'
- ,provice_name STRING COMMENT '省份名称'
- )
- COMMENT '城市信息维表'
- PARTITIONED BY (
- `pdate` STRING COMMENT '天分区'
- )
- STORED AS orc
- TBLPROPERTIES (
- 'creator'='210XXXXX',
- 'orc.compress'='SNAPPY',
- 'ttl' = '30'
- );
- --方式二:仿照现有的表建表
- create table hdw_dim.dim_city_info_tmp like hdw_dim.dim_city_info_df;
- --方式三:根据查询结果自动创建表并且插入数据
- create table dw_dim.dim_city_info_tmp1 as dw_dim.dim_city_info_tmp
- CREATE EXTERNAL TABLE `hdw_dim.dim_city_info_df`(
- city_id STRING COMMENT '城市id'
- ,city_name STRING COMMENT '城市名称'
- ,provice_id STRING COMMENT '省份id'
- ,provice_name STRING COMMENT '省份名称'
- )
- COMMENT '城市信息维表'
- PARTITIONED BY (
- `pdate` STRING COMMENT '天分区'
- )
- STORED AS orc
- TBLPROPERTIES (
- 'creator'='210XXXXX',
- 'orc.compress'='SNAPPY',
- 'ttl' = '30'
- );
- --查看表类型
- desc formatted address
-
- --将内部表修改为外部表
- alter table address set tblproperties('EXTERNAL' = 'TRUE');
-
- --将外部表修改为内部表
- alter table address set tblproperties('EXTERNAL' = 'FALSE');
- --方式一:TEMPORARY
- create temporary table `hdw_dim.dim_city_info`(
- city_id STRING COMMENT '城市id'
- ,city_name STRING COMMENT '城市名称'
- ,provice_id STRING COMMENT '省份id'
- ,provice_name STRING COMMENT '省份名称'
- )
- COMMENT '城市信息维表'
- --方式二:with as
- with t1 as
- (
- select
- *
- from hdw_dim.dim_city_info
- where provice_name = '山东省'
- ) ,
- t2 as
- (
- select
- *
- from hdw_dim.dim_city_info
- where provice_name = '北京市'
- )
- --方式三:真实建表,用完手动删除
- create table hdw_tmp.tmp_city_info as
- select
- *
- from hdw_dim.dim_city_info;
- drop table if exists hdw_tmp.tmp_city_info

分区是HDFS上表目录的子目录,数据按照分区存储在子目录中。如果查询的where子句中包含分区条件,则直接从该分区查找,而不是扫描整个目录,合理的分区可以极大的提高查询速度和性能。
- CREATE EXTERNAL TABLE emp_partition
- (
- empno INT,
- ename STRING,
- job STRING,
- mgr INT,
- hiredate TIMESTAMP,
- sal DECIMAL(7,2),
- comm DECIMAL(7,2)
- )
- PARTITIONED BY (deptno INT) -- 按照部门编号进行分区
- ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
- LOCATION '/hive/emp_partition';
在数据仓库管理大规模数据集的时候都需要进行分区,比如将日志文件按天进行分区,从而保证数据细粒度的划分,使得查询性能得到提升。
查看分区目录
hadoop fs -ls hdfs://hadoop001:8020/hive/emp_partition/
分区提供了一个隔离数据和优化查询的可行性方案,但是并非所有的数据集都可以形成合理的分区,分区的数量也不是越多越好,过多的分区条件可能导致很多分区上没有数据。同时hive会限制动态分区可以创建最大的分区数,用来避免过多的文件对文件系统产生负担。
鉴于以上原因,Hive还提供了一种更细粒度的数据拆分方案:分桶表(bucket Table)。分桶表会将指定的列的值进行哈希散列,并对bucket(桶数量)取余,然后存储到对应的bucket中。
- CREATE EXTERNAL TABLE emp_bucket
- (
- empno INT,
- ename STRING,
- job STRING,
- mgr INT,
- hiredate TIMESTAMP,
- sal DECIMAL(7,2),
- comm DECIMAL(7,2),
- deptno INT
- )
- CLUSTERED BY(empno)
- SORTED BY(empno ASC) INTO 4 BUCKETS --按照员工编号散列到四个 bucket 中
- ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
- LOCATION '/hive/emp_bucket';
分区表和分桶表的本质都是将数据按照不同粒度进行拆分,从而使得在查询的时候不必扫描全表,只需要扫描对应的分区或分桶,从而提升查询的效率。两者可以结合使用,从而保证数据在不同粒度上都能得到合理的拆分,官方示例如下:
- CREATE TABLE page_view_bucketed
- (
- viewTime INT,
- userid BIGINT,
- page_url STRING,
- referrer_url STRING,
- ip STRING
- )
- PARTITIONED BY(dt STRING)
- CLUSTERED BY(userid)
- SORTED BY(viewTime) INTO 32 BUCKETS ROW FORMAT DELIMITED
- FIELDS TERMINATED BY '\001'
- COLLECTION ITEMS TERMINATED BY '\002'
- MAP KEYS TERMINATED BY '\003'
- STORED AS SEQUENCEFILE;