New to pig, how to transform JSON to another JSON with a subset of the key value pairs in pig?

Posted: 2014-07-11 03:58:08

Question:

Suppose I have the following JSON:

{
  "state": "VA",
  "fruit": [
    {
      "name": "Bannana",
      "color": "Yellow",
      "cost": 1.6
    },
    {
      "name": "Apple",
      "color": "Red",
      "cost": 1.4
    }
  ]
}

In Pig, how can I transform the above into the following:

{
  "state": "VA",
  "fruit": [
    {
      "name": "Bannana",
      "cost": 1.6
    },
    {
      "name": "Apple",
      "cost": 1.4
    }
  ]
}

I tried:

A = #load file
B = FOREACH A GENERATE
    state,
    fruit.name,
    fruit.cost;

as well as the following:

A = #load file
B = FOREACH A GENERATE
    state,
    fruit as (m:bagFruitInfo.(tuple(name:string, cost:double)));

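No matter what I do, I keep getting nested arrays. Is what I'm trying to do possible? I chose Pig for its ability to transform data. Note that the data is loaded using AvroStorage.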

Comments:

Take a look here: pig.apache.org/docs/r0.12.0/basic.html

Try using a UDF. I have posted working code as an answer. Is there any restriction against using a UDF?

Answer 1:

I doubt this can be done without a UDF.

Java UDF:

package com.example.exp1;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;

public class ReformatBag extends EvalFunc<DataBag> {

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        DataBag returnBag = BagFactory.getInstance().newDefaultBag();
        try {
            // Get the input bag (the "fruit" field)
            DataBag bag = DataType.toBag(input.get(0));
            // Iterate over every tuple in the bag
            for (Tuple t : bag) {
                // Create a new tuple of size 2 to hold the reformatted data
                Tuple reformatTuple = TupleFactory.getInstance().newTuple(2);
                // Copy the 1st field (name)
                reformatTuple.set(0, t.get(0));
                // Copy the 3rd field (cost), skipping the 2nd field (color)
                reformatTuple.set(1, t.get(2));
                // Add to the output bag and continue iterating
                returnBag.add(reformatTuple);
            }
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
        return returnBag;
    }

    @Override
    public Schema outputSchema(Schema input) {
        try {
            Schema outputSchema = new Schema();
            Schema bagSchema = new Schema();
            // Tuple schema which holds the name and cost fields
            Schema innerTuple = new Schema();
            innerTuple.add(new FieldSchema("name", DataType.CHARARRAY));
            innerTuple.add(new FieldSchema("cost", DataType.FLOAT));

            // Add the tuple schema holding name and cost to the bag schema
            bagSchema.add(new FieldSchema("t1", innerTuple, DataType.TUPLE));
            // Return type of the UDF: a bag named "fruit"
            outputSchema.add(new FieldSchema("fruit", bagSchema, DataType.BAG));
            return outputSchema;
        } catch (Exception e) {
            return null;
        }
    }
}

Pig code:

REGISTER /path/to/jar/exp1-1.0-SNAPSHOT.jar;

input_data = LOAD 'text.json'
        USING JsonLoader('state:chararray,fruit:{(name:chararray,color:chararray,cost:float)}');

reformat_data = FOREACH input_data
        GENERATE state, com.example.exp1.ReformatBag(fruit);

STORE reformat_data
        INTO 'output.json'
        USING JsonStorage();

Output:

{"state":"VA","fruit":[{"name":"Bannana","cost":1.6},{"name":"Apple","cost":1.4}]}
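
The per-tuple projection performed inside the UDF's loop can be tried without a Pig runtime. The following is only a minimal sketch: the class name ProjectionSketch and the use of plain java.util.List rows in place of Pig's Tuple and DataBag are assumptions made for illustration, not part of the answer's code. It keeps positions 0 (name) and 2 (cost) of each row, just as the UDF does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the UDF's loop: rows are plain lists, not Pig Tuples.
final class ProjectionSketch {

    // Keeps only the fields at positions 0 (name) and 2 (cost) of each row.
    static List<List<Object>> project(List<List<Object>> bag) {
        List<List<Object>> result = new ArrayList<>();
        for (List<Object> row : bag) {
            List<Object> projected = new ArrayList<>(2);
            projected.add(row.get(0)); // name
            projected.add(row.get(2)); // cost
            result.add(projected);
        }
        return result;
    }

    public static void main(String[] args) {
        // Same sample rows as in the question: (name, color, cost)
        List<List<Object>> fruit = List.of(
                List.<Object>of("Bannana", "Yellow", 1.6),
                List.<Object>of("Apple", "Red", 1.4));
        System.out.println(project(fruit));
    }
}
```

Because the projection is positional, it depends on the field order declared in the load schema (name, color, cost); if that order changes, the indices 0 and 2 would need to change with it.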
