A Work Log: From Hive to Pig
Views: 5768
Published: 2019-06-18


============Basics=====================

1. Load raw data: Hive's select * from XXX ===>

LOG = LOAD '/user/hive/....../$date' USING PigStorage('\t') AS (AA:int, BB:int, CC:chararray, DD:chararray, dt:chararray);

2. Filter: Hive's where x=XX and y!=YY ==>

FILTER_LOG = FILTER LOG BY AA == 0 AND CC != '';
3. Limit, for testing
LIMIT_LOG = LIMIT FILTER_LOG 10;
4. Dump to inspect
dump LIMIT_LOG;
5. Run a Pig script
pig -x mapreduce user_session_stat.pig
6. Invoke Python
DEFINE user_session_stat_pig `user_session_stat_pig.py` SHIP('user_session_stat_pig.py');
USER_SESSION_ROWS = STREAM LIMIT_LOG THROUGH user_session_stat_pig AS (fill in the returned schema); --10
dump USER_SESSION_ROWS;
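The body of the streaming script isn't shown in this log; as a hypothetical sketch, user_session_stat_pig.py would read tab-separated rows of LIMIT_LOG on stdin and emit tab-separated results on stdout, for example counting rows per session id (the schema here is an assumption, not from the log):

```python
import sys

def count_per_session(lines):
    """Count input rows per session id, assuming the session id is the
    first tab-separated field (hypothetical schema)."""
    counts = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        counts[fields[0]] = counts.get(fields[0], 0) + 1
    return counts

if __name__ == "__main__":
    # Pig's STREAM ... THROUGH pipes each row of the relation to stdin.
    for sessid, n in sorted(count_per_session(sys.stdin).items()):
        print("%s\t%d" % (sessid, n))
```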
DEBUG:
1. Inspect the failed job:
2. View the job's details from the web UI: http://hd09:50030/jobdetails.jsp?jobid=job_201209031356_11573&refresh=0
3. Read the subtasks' errors to locate the failure:
4. When hooking up Python, first store the final file to be processed, download it locally, then test locally:
for line in sys.stdin: ==> for line in open("test.csv"):
5. Once the test passes, run the Pig script again.
pig user_session_stat.pig
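The stdin-vs-file swap in debug step 4 can be kept out of the core logic by taking the input stream as a parameter, so the same code runs against stdin on the cluster and against a local file during testing. A sketch of that pattern (not from the original log):

```python
import sys

def process(stream):
    """Split each non-empty line into tab-separated fields; a stand-in
    for the real per-row logic of the streaming script."""
    rows = []
    for line in stream:
        line = line.rstrip("\n")
        if line:
            rows.append(line.split("\t"))
    return rows

if __name__ == "__main__":
    # On the cluster Pig feeds stdin; locally, pass a test file path.
    source = open(sys.argv[1]) if len(sys.argv) > 1 else sys.stdin
    for row in process(source):
        print("\t".join(row))
```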

==================Details + Tuning======================

1. PigStorage specifies the file's field delimiter
LOAD '/hadoop/spam_daily/$date' USING PigStorage('\t') AS (type:int, op:int, value:chararray, algo:chararray, dt:chararray);
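In Python terms, what PigStorage('\t') plus the AS schema does to each input line looks roughly like this sketch (casts simplified; real Pig also handles nulls and bytearrays):

```python
def parse_line(line, schema):
    """Split a tab-delimited line and cast each field per a (name, type)
    schema, roughly like PigStorage('\t') ... AS (...) does."""
    casts = {"int": int, "chararray": str}
    values = line.rstrip("\n").split("\t")
    return {name: casts[typ](val) for (name, typ), val in zip(schema, values)}

# Schema taken from the LOAD statement above.
SCHEMA = [("type", "int"), ("op", "int"), ("value", "chararray"),
          ("algo", "chararray"), ("dt", "chararray")]
row = parse_line("1\t2\tfoo\tbar\t2019-06-18", SCHEMA)
```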
2. A multi-field JOIN is inefficient; the 2 fields can be concatenated into a single key for the JOIN.
JOIN LOG_SMALL BY (class_name, method_name) LEFT OUTER, CMM BY (class, method) USING 'replicated'….

String concatenation: CONCAT(a, b)

JOIN LOG_SMALL BY CONCAT(class_name, method_name) LEFT OUTER, CMM BY CONCAT(class, method) USING 'replicated'….
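One caveat worth noting with the CONCAT trick (not from the original log): plain concatenation can collide, e.g. ('ab', 'c') and ('a', 'bc') both become 'abc', so joining a separator unlikely to appear in the data is safer. A minimal sketch:

```python
def make_key(*fields, sep="\x01"):
    """Build one join key from several fields; the separator keeps
    ('ab', 'c') and ('a', 'bc') distinct, unlike a bare CONCAT."""
    return sep.join(str(f) for f in fields)
```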

3. 'replicated' means the small relation is replicated. Per the documentation, putting the small table last improves efficiency, which is the opposite of Hive.

C = join A by $0, B by $1 using 'replicated'; -- fragment-replicate join: relation B's data is held in memory
Replicated Joins
Fragment replicate join is a special type of join that works well if one or more relations are small enough to fit into main memory. In such cases, Pig can perform a very efficient join because all of the hadoop work is done on the map side. In this type of join the large relation is followed by one or more small relations. The small relations must be small enough to fit into main memory; if they don't, the process fails and an error is generated.
Usage
Perform a replicated join with the USING clause (see inner joins and outer joins). In this example, a large relation is joined with two smaller relations. Note that the large relation comes first followed by the smaller relations; and, all small relations together must fit into main memory, otherwise an error is generated.
big = LOAD 'big_data' AS (b1,b2,b3);
tiny = LOAD 'tiny_data' AS (t1,t2,t3);
mini = LOAD 'mini_data' AS (m1,m2,m3);
C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
Conditions
Fragment replicate joins are experimental; we don't have a strong sense of how small the small relation must be to fit into memory. In our tests with a simple query that involves just a JOIN, a relation of up to 100 M can be used if the process overall gets 1 GB of memory. Please share your observations and experience with us.
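The quoted docs describe a map-side (fragment-replicate) join; in plain Python the idea looks like this sketch: the small relation is fully loaded into memory, and the big relation is streamed past it:

```python
def replicated_join(big, small, big_key=0, small_key=0):
    """Fragment-replicate join sketch: index the small relation in a
    dict (Pig ships it into every mapper's memory), then stream the big
    relation and emit each matching row pair."""
    index = {}
    for row in small:
        index.setdefault(row[small_key], []).append(row)
    joined = []
    for row in big:
        for match in index.get(row[big_key], []):
            joined.append(row + match)
    return joined
```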

4. The string operator matches uses Java regular expressions.

LOG_VALID = FOREACH (JOIN LOG_VALID_CMM BY sessid, VALID_SID BY sessid USING 'replicated') GENERATE sessid, user_id, visitip, uri, refer, agent, tag, hour, minute, second, is_valid, ((refer matches '^http://www.meilishuo.com.*') ? 1 : 0) AS is_meilishuo, ((SPAM::sessid is null) ? 1 : 0) AS is_spam:int;
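Pig's matches follows Java's String.matches, so the pattern must cover the whole string; a rough Python equivalent uses re.fullmatch rather than re.search (this is an adaptation with the dots escaped, not the source's exact expression):

```python
import re

# Mirrors ((refer matches '^http://www.meilishuo.com.*') ? 1 : 0).
MEILISHUO = re.compile(r"http://www\.meilishuo\.com.*")

def is_meilishuo(refer):
    """Return 1 if the referrer matches over its full length, else 0."""
    return 1 if refer and MEILISHUO.fullmatch(refer) else 0
```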
5. Qualified names like the following appear to work like SQL aliases, disambiguating fields after a JOIN:

LOG_VALID_CMM::LOG_SMALL::sessid

Reposted from: https://my.oschina.net/wangjiankui/blog/81189
