Min/max group-wise and filter without join in Pig

Posted: 2015-07-28 17:59:03

[Question]:

I am trying to compute (max+min)/2 for each group. Here is my schema:

UrlXpathsCount: url: chararray,leafpathstr: chararray,urlpath_count: long

I group it by the url field:

byUrl = GROUP UrlXpathsCount by url;

I am trying to find (max+min)/2 in the following way:

midRangeByUrl = FOREACH byUrl {
    urls_desc = order UrlXpathsCount by urlpath_count desc;
    urls_max = limit urls_desc 1;
    urls_asc = order UrlXpathsCount by urlpath_count asc;
    urls_min = limit urls_asc 1;

    GENERATE FLATTEN(urls_max),FLATTEN(urls_min);
};

This is the schema of midRangeByUrl:

midRangeByUrl: urls_max::url: chararray,urls_max::leafpathstr: chararray,urls_max::urlpath_count: long,urls_min::url: chararray,urls_min::leafpathstr: chararray,urls_min::urlpath_count: long

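The problem I am facing is that generating FLATTEN(group), FLATTEN(urls_max), FLATTEN(urls_min) gives me many combinations that I do not want.

I want to get (max + min)/2 for each group.

To do that, I project the urlpath_count of the max and min rows as follows: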

computeMidRange = FOREACH midRangeByUrl generate urls_max::url as mid_url,((DOUBLE)urls_max::urlpath_count+(DOUBLE) urls_min::urlpath_count)/2 as midRange;

I then join the two relations as follows:

/* Join computeMidRange  and UrlXpathsCount */
midRangeJoin = join UrlXpathsCount by url , computeMidRange by mid_url using 'replicated';
midRangeOut = FOREACH midRangeJoin GENERATE UrlXpathsCount::url as url,UrlXpathsCount::leafpathstr as leafpathstr,
    UrlXpathsCount::urlpath_count as urlpath_count,computeMidRange::midRange as midRange;

Then I apply a filter:

templates = FILTER midRangeOut by urlpath_count > midRange;

I would like to avoid midRangeJoin by somehow computing midRangeByUrl so that it projects the fields url, urlpath_count, leafpathstr, and (min+max)/2 without a join.

Please help me solve this. Thanks.

[Answer 1]:

You can use the built-in MAX and MIN UDFs instead:

UrlXpathsCount = load 'your_data' using PigStorage(',') as (url: chararray,leafpathstr: chararray,urlpath_count: long);
B = GROUP UrlXpathsCount by url;
C = foreach B generate group as url, MAX(UrlXpathsCount.urlpath_count) as max_count, 
                                     MIN(UrlXpathsCount.urlpath_count) as min_count;
D = foreach C generate url, ((double)max_count + (double)min_count)/2 as val;

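This does exactly what you want, without a nested foreach or a join. I split the computation into C and D to avoid an extremely long line, but you could also do it in a single statement. Remember to cast the values to double: urlpath_count is a long, so without the cast the division is integer division and you lose the fractional part.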
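
If the original rows (url, leafpathstr, urlpath_count) are also needed next to the mid-range so that the final FILTER can run, one possible approach is to flatten the grouped bag in the same FOREACH that computes MAX and MIN. The following is only a sketch under that assumption, not part of the answer above; the alias withMidRange is a hypothetical name, and the other names come from the question.

```pig
-- Assumes UrlXpathsCount is already loaded with the schema from the question.
byUrl = GROUP UrlXpathsCount BY url;

-- FLATTEN emits one output row per tuple in the group's bag; the scalar
-- mid-range expression is repeated on each of those rows.
withMidRange = FOREACH byUrl GENERATE
    FLATTEN(UrlXpathsCount) AS (url:chararray, leafpathstr:chararray, urlpath_count:long),
    ((double)MAX(UrlXpathsCount.urlpath_count)
        + (double)MIN(UrlXpathsCount.urlpath_count)) / 2 AS midRange;

-- Keep only the rows whose count is above the mid-range of their own url.
templates = FILTER withMidRange BY urlpath_count > midRange;
```

This avoids the replicated join, but flattening the whole bag means every tuple of a group passes through the reducer, so it may be worth comparing against the join version on data with very large groups.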
