my_relation: entityId: chararray,attributeName: chararray,bytearray


The bytearray column can hold any number of value/timestamp pairs (even zero).

I want to transform this relation into the following (one row per entityId, attributeName, value, timestamp quadruple):


Alternatively, this would also be fine, since I'm not interested in rows that have no value/timestamp pairs:


Any ideas? Basically, I want to normalize the tuple of maps in the bytearray column so that the schema looks like this:

my_relation: entityId: chararray,
              attributeName: chararray, 
              value: float, 
              timestamp: int

I'm a Pig beginner, so apologies if this is obvious! Do I need a UDF to do this?

This question is similar, but it currently has no answers: How do I split in Pig a tuple of many maps into different rows

I'm running Apache Pig version 0.12.0-cdh5.1.2.

EDIT - adding details of what I've done so far.

Here is a Pig script snippet; its output is shown further below:

-- StateVectorFileStorage is a LoadStoreFunc and AttributeData is a UDF; both are written in Java.
ts_to_average = LOAD 'StateVector' USING StateVectorFileStorage();
ts_to_average = LIMIT ts_to_average 10;
ts_to_average = FOREACH ts_to_average GENERATE entityId, FLATTEN(AttributeData(*));
a = FOREACH ts_to_average GENERATE entityId, $1 as attributeName:chararray, $2#'value';
b = foreach a generate entityId, attributeName, FLATTEN($2);

c_no_flatten = foreach b generate
  $0 as entityId,
  $1 as attributeName,
  TOBAG($2 ..);

c = foreach b generate
  $0 as entityId,
  $1 as attributeName,
  FLATTEN(TOBAG($2 ..));

d = foreach c generate
  (float)$2#'value' as value,
  (int)$2#'timestamp' as timestamp;

dump a;
describe a;
dump b;
describe b;
dump c_no_flatten;
describe c_no_flatten;
dump c;
describe c;
dump d;
describe d;

The output is below. Note that in relation "c", the second value/timestamp pair [value#52.0,timestamp#1388683516000] has been lost.

a: entityId: chararray,attributeName: chararray,bytearray

b: entityId: chararray,attributeName: chararray,bytearray

c_no_flatten: entityId: chararray,attributeName: chararray,(bytearray)

c: entityId: chararray,attributeName: chararray,bytearray

b = foreach a generate entityId, attributeName, FLATTEN($2);


c = foreach b generate
  $0 as entityId,
  $1 as attributeName,
  FLATTEN(TOBAG($2 ..));


d = foreach c generate
  (float)$2#'value' as value,
  (int)$2#'timestamp' as timestamp;

Update: some other options for making a bag of maps from a tuple of maps:

DataFu's TransposeTupleToBag: http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/util/TransposeTupleToBag.html
The foo() Python UDF from this answer: Pig - how to iterate on a bag of maps
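
For the Python UDF route, the following is a minimal sketch of the idea, not the foo() UDF from the linked answer. It assumes the UDF receives the tuple of maps as a Python tuple of dicts with 'value' and 'timestamp' keys, as in the sample pair [value#52.0,timestamp#1388683516000]. The function names and the schema strings are hypothetical, and the guarded outputSchema definition assumes that Pig's Jython engine supplies that decorator on its own; outside Pig it falls back to a no-op so the file can be tried as plain Python.

```python
# Assumption: under Pig's Jython engine, outputSchema is already defined.
# When run as plain Python it is not, so fall back to a no-op decorator.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("maps:{t:(m:map[])}")
def tuple_of_maps_to_bag(maps):
    """Wrap each map of the input tuple in its own 1-tuple (a bag of maps)."""
    if maps is None:
        return []
    return [(m,) for m in maps if m is not None]


@outputSchema("pairs:{t:(value:double,timestamp:long)}")
def tuple_of_maps_to_rows(maps):
    """Turn a tuple of {'value', 'timestamp'} maps into (value, timestamp) rows.

    Maps missing either key are skipped, so an entity with no pairs
    yields an empty bag and disappears after FLATTEN.
    """
    rows = []
    for m in maps or ():
        if not m or 'value' not in m or 'timestamp' not in m:
            continue
        rows.append((float(m['value']), int(m['timestamp'])))
    return rows
```

If this were saved as, say, tuple_maps.py (a hypothetical file name), it could presumably be registered with REGISTER 'tuple_maps.py' USING jython AS tm; and applied as FLATTEN(tm.tuple_of_maps_to_rows(...)) on the tuple column. This has not been run against 0.12.0-cdh5.1.2. The sketch declares the timestamp as long because a millisecond value such as 1388683516000 does not fit in a 32-bit Pig int.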


