Pig Heaping on approx 185 gigs (202260828 records)


【Posted】: 2014-01-08 21:20:56

【Question】:

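I'm still fairly new to Pig, but I understand the basic concepts of map/reduce jobs. I'm trying to compute some per-user statistics from some simple logs. We have a utility that parses fields out of the logs, and I'm using DataFu to compute variance and quartiles.

My script is as follows: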

log = LOAD '$data' USING SieveLoader('node', 'uid', 'long_timestamp');
log_map = FILTER log BY $0 IS NOT NULL AND $0#'uid' IS NOT NULL;
--Find all users
SPLIT log_map INTO cloud IF $0#'node' MATCHES '.*[.]mis01.*', dev OTHERWISE;
--For the real cloud
cloud = FOREACH cloud GENERATE $0#'uid' AS uid, $0#'long_timestamp' AS long_timestamp:long, 'dev' AS domain, '192.168.0.231' AS ldap_server;
dev = FOREACH dev GENERATE $0#'uid' AS uid, $0#'long_timestamp' AS long_timestamp:long, 'dev' AS domain, '10.0.0.231' AS ldap_server;
modified_logs = UNION dev, cloud;

--Calculate user times
user_times = FOREACH modified_logs GENERATE *, ToDate((long)long_timestamp) as date;
--Based on weekday/weekend
aliased_user_times = FOREACH user_times GENERATE *, GetYear(date) AS year:int, GetMonth(date) AS month:int, GetDay(date) AS day:int, GetWeekOrWeekend(date) AS day_of_week, long_timestamp % (24*60*60*1000) AS miliseconds_into_day;
--Based on actual day of week
--aliased_user_times = FOREACH user_times GENERATE *, GetYear(date) AS year:int, GetMonth(date) AS month:int, GetDay(date) AS day:int, GetDayOfWeek(date) AS day_of_week, long_timestamp % (24*60*60*1000) AS miliseconds_into_day;

user_days = GROUP aliased_user_times BY (uid, ldap_server,domain, year, month, day, day_of_week);

some_times_by_day = FOREACH user_days GENERATE FLATTEN(group) AS (uid, ldap_server, domain, year, month, day, day_of_week), MAX(aliased_user_times.miliseconds_into_day) AS max, MIN(aliased_user_times.miliseconds_into_day) AS min;

times_by_day = FOREACH some_times_by_day GENERATE *, max-min AS time_on;

times_by_day_of_week = GROUP times_by_day BY (uid, ldap_server, domain, day_of_week);
STORE times_by_day_of_week INTO '/data/times_by_day_of_week';

--New calculation, mean, var, std_d, (min, 25th quartile, 50th (aka median), 75th quartile, max)
averages = FOREACH times_by_day_of_week GENERATE FLATTEN(group) AS (uid, ldap_server, domain, day_of_week), 'USER' as type, AVG(times_by_day.min) AS start_avg, VAR(times_by_day.min) AS start_var, SQRT(VAR(times_by_day.min)) AS start_std, Quartile(times_by_day.min) AS start_quartiles;
--AVG(times_by_day.max) AS end_avg, VAR(times_by_day.max) AS end_var, SQRT(VAR(times_by_day.max)) AS end_std, Quartile(times_by_day.max) AS end_quartiles, AVG(times_by_day.time_on) AS hours_avg, VAR(times_by_day.time_on) AS hours_var, SQRT(VAR(times_by_day.time_on)) AS hours_std, Quartile(times_by_day.time_on) AS hours_quartiles ;

STORE averages INTO '/data/averages';

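I've seen other people run into problems when DataFu computes several quantiles at once, so I'm only trying to compute one at a time. The custom loader reads one line at a time and passes it through a utility that converts it into a map, and a small UDF checks whether a date falls on a weekday or a weekend (originally we wanted statistics per day of the week, but loading enough data to get interesting quartiles was killing the map/reduce tasks).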
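The script does not show how `VAR` and `Quartile` are bound to DataFu classes. A minimal sketch of what those definitions might look like, assuming a placeholder jar path and assuming `Quartile` is backed by `datafu.pig.stats.StreamingQuantile` (the question does not say which class was actually used):

```pig
-- Assumption: the jar path is a placeholder; point it at your DataFu jar.
REGISTER '/path/to/datafu.jar';

-- Variance UDF from DataFu.
DEFINE VAR datafu.pig.stats.VAR();

-- Assumption: StreamingQuantile is used here because it estimates quantiles
-- without needing a sorted input bag. The exact datafu.pig.stats.Quantile
-- UDF expects the bag to be sorted first (e.g. with a nested ORDER).
DEFINE Quartile datafu.pig.stats.StreamingQuantile('0.0', '0.25', '0.5', '0.75', '1.0');
```

With these five quantile arguments the UDF returns the min, 25th, 50th and 75th percentiles, and max, matching the comment above the `averages` statement.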

Using Pig 0.11.

【Comments】:

【Answer 1】:

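It looks like my specific problem was caused by trying to compute the min and the max in a single Pig Latin statement. Splitting the work into two separate statements and then joining the results seems to have fixed my memory problem.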
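The answer does not include the revised code. One possible reading of "two separate statements, then join", reusing the relation and field names from the question's script (the intermediate names `max_by_day`, `min_by_day` and `joined_by_day` are made up for illustration):

```pig
-- Sketch only: replaces the single some_times_by_day statement.
-- Compute the daily maximum on its own.
max_by_day = FOREACH user_days GENERATE
    FLATTEN(group) AS (uid, ldap_server, domain, year, month, day, day_of_week),
    MAX(aliased_user_times.miliseconds_into_day) AS max;

-- Compute the daily minimum on its own.
min_by_day = FOREACH user_days GENERATE
    FLATTEN(group) AS (uid, ldap_server, domain, year, month, day, day_of_week),
    MIN(aliased_user_times.miliseconds_into_day) AS min;

-- Join the two results back together on the full grouping key.
joined_by_day = JOIN
    max_by_day BY (uid, ldap_server, domain, year, month, day, day_of_week),
    min_by_day BY (uid, ldap_server, domain, year, month, day, day_of_week);

-- Restore the original schema so the rest of the script is unchanged.
some_times_by_day = FOREACH joined_by_day GENERATE
    max_by_day::uid AS uid,
    max_by_day::ldap_server AS ldap_server,
    max_by_day::domain AS domain,
    max_by_day::year AS year,
    max_by_day::month AS month,
    max_by_day::day AS day,
    max_by_day::day_of_week AS day_of_week,
    max_by_day::max AS max,
    min_by_day::min AS min;
```

The later `times_by_day` statement can then stay as it is, since `some_times_by_day` keeps the same field names.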

【Comments】:
