Hive基操_资讯_银财中国网

Hive基操

创始人

2024-05-26 12:33:09

0次

数据交换

//hive导出到hdfs /outstudentpt 目录
0: jdbc:hive2://guo146:10000> export table student_pt to '/outstudentpt';

//从hdfs导入到hive
0: jdbc:hive2://guo146:10000> import table studentpt from '/outstudentpt';

数据排序

Order by会对所给的全部数据进行全局排序，只启动一个reduce来处理。

Sort by是局部排序，它可以根据数据量的大小启动一到多个reducer来工作，并且在每个reduce中单独排序。

Distribute by 类似于mr中的partition，采用hash算法，在map端将查询结果中hash值相同的结果分发到对应的reduce中，结合sort by使用

Cluster by 可以看作是distribute by 和sort by的结合，当两者后面所跟的字段列名相同时，效果就等同于使用cluster by，但是cluster by最终的结果只能是降序，无法指定升序和降序

分桶及抽样查询

分桶的好处
1、基于分桶字段查询时，减少全表扫描
2、join时可以提高MR程序效率，减少笛卡尔积数量
3、分桶表数据进行高效抽样

// 任务数量两个,开启分桶设置
set map.reduce.tasks=2; --根据任务分配
set hive.enforce.bucketing=true;

//分桶表
首先创建一个内部表，分桶表需要通过insert into 插入数据
create table employee_id_buckets
(name         string,employee_id  int,work_place   array,gender_age   struct,skills_score map,depart_title map>
)clustered by (employee_id) into 2 buckets   -- 分成两个桶，分桶的字段一定是表中已经存在的字段row format delimited fields terminated by '|'collection items terminated by ','map keys terminated by ':';//使用insert+select 语法将数据加载到分桶表中		
insert into table employee_id_buckets select * from employee_id;

分桶规则：编号相同的数据会被分到同一个桶当中
Bucket number=hash_function(bucketing_colum) mod num_buckets
分桶编号 = 哈希方法（分桶字段）取模分桶个数

hash_function取决于分桶字段 bucketing_colum 的类型：
1、int类型 hash_function(int)==int
2、其他类型则是从该类型派生的某个数字

select * from employee_id_buckets tablesample(bucket x out of y);

y 必须是 table 总 bucket 数的倍数或者因子。hive 根据 y 的大小，决定抽样的比例。例
如，table 总共分了 4 份，当 y=2 时，抽取(4/2=)2 个 bucket 的数据，当 y=8 时，抽取(4/8=)1/2个 bucket 的数据。

x 表示从哪个 bucket 开始抽取，如果需要取多个分区，以后的分区号为当前分区号加上
y。例如，table 总 bucket 数为 4，tablesample(bucket 1 out of 2)，表示总共抽取（4/2=）2 个bucket 的数据，抽取第 1(x)个和第 3(x+y)个 bucket 的数据。

注意：x 的值必须小于等于 y 的值，否则
FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck

WordCount案例

create table if not exists word_count
(line string
);
//传参数
load data local inpath '/opt/stufile/wordcount.txt' overwrite into table word_count;with t1 as (select explode(split(line, ' ')) word from word_count)
select t1.word, count(1) num
from t1
group by t1.word
order by num desc

视图

视图是一种虚拟表，只保存定义，不实际存储数据

使用视图的好处
1、将真实表中特定的列数提供给客户，保护数据隐私
2、降低查询的复杂度，优化查询语句

-- 创建视图
create view  emp as select * from employee limit 2;-- 显示当前已有的视图 v2.2.0之后支持
show views;-- 查看视图定义
show create table emp;-- 删除视图
drop view emp-- 侧视图
select name,wps,gender_age.gender,gender_age.age,skill,score,depart,title
from bigdata.employee lateral view explode(work_place) work_pplace_single as wpslateral view explode(skills_score) sks as skill, scorelateral view explode(depart_title) dt as depart, title;

词库加载错误:未能找到文件“E:\highferrum_mysql\Configuration\Dict_Stopwords.txt”。

上一篇：操作系统（day13）-- 虚拟内存；页面分配策略

下一篇：秋招面试问题整理之机器学习篇

Hive基操

数据交换

数据排序

分桶及抽样查询

WordCount案例

视图

相关内容

热门资讯