New to Pig: how to transform JSON to another JSON with a subset of the key-value pairs in Pig?
【Posted】: 2014-07-11 03:58:08
【Question】: Suppose I have the following JSON:
{
  "state":"VA",
  "fruit":[
    {
      "name":"Bannana",
      "color":"Yellow",
      "cost":1.6
    },
    {
      "name":"Apple",
      "color":"Red",
      "cost":1.4
    }
  ]
}
In Pig, how do I transform the above into the following:
{
  "state":"VA",
  "fruit":[
    {
      "name":"Bannana",
      "cost":1.6
    },
    {
      "name":"Apple",
      "cost":1.4
    }
  ]
}
I tried:
A = #load file
B = FOREACH A GENERATE
state,
fruit.name,
fruit.cost;
as well as the following:
A = #load file
B = FOREACH A GENERATE
state,
fruit as (m:bagFruitInfo.(tuple(name:string, cost:double)));
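No matter what I do, I keep getting nested arrays. Is what I am trying to do possible? I chose Pig for its ability to transform data. Note that the data is loaded using AvroStorage.
【Comments】:
Take a look here: pig.apache.org/docs/r0.12.0/basic.html
Try using a UDF. I have posted working code as an answer. Is there any restriction against using a UDF?
【Answer 1】:
I doubt this can be done without a UDF.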
Java UDF:
package com.example.exp1;

import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;

public class ReformatBag extends EvalFunc<DataBag> {

    @Override
    public DataBag exec(Tuple input) throws IOException {
        DataBag returnBag = BagFactory.getInstance().newDefaultBag();
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        try {
            // Get the bag
            DataBag bag = DataType.toBag(input.get(0));
            // Iterate over the bag
            Iterator<Tuple> it = bag.iterator();
            while (it.hasNext()) {
                // Current tuple: (name, color, cost)
                Tuple t = it.next();
                // Create a new tuple of size 2 to hold the reformatted data
                Tuple reformatTuple = TupleFactory.getInstance().newTuple(2);
                // Add 1st field (name)
                reformatTuple.set(0, t.get(0));
                // Add 3rd field (cost)
                reformatTuple.set(1, t.get(2));
                // Add to the bag and continue iterating
                returnBag.add(reformatTuple);
            }
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
        return returnBag;
    }

    @Override
    public Schema outputSchema(Schema input) {
        try {
            Schema outputSchema = new Schema();
            Schema bagSchema = new Schema();
            // Tuple schema which holds the name and cost fields
            Schema innerTuple = new Schema();
            innerTuple.add(new FieldSchema("name", DataType.CHARARRAY));
            innerTuple.add(new FieldSchema("cost", DataType.FLOAT));
            // Add the tuple schema holding name & cost to the bag schema
            bagSchema.add(new FieldSchema("t1", innerTuple, DataType.TUPLE));
            // Return type of the UDF
            outputSchema.add(new FieldSchema("fruit", bagSchema, DataType.BAG));
            return outputSchema;
        } catch (Exception e) {
            return null;
        }
    }
}
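The core of the UDF is a positional projection: for every tuple in the bag, keep fields 0 and 2 and drop the rest. The following is only a minimal sketch of that logic in plain Java, with no Pig classes, so it can be run without a cluster. The class name BagProjectionSketch, the method project, and the use of nested lists to stand in for a bag of tuples are all assumptions made for illustration; they are not part of the answer or of the Pig API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the UDF logic: a "bag" is a list of "tuples",
// and each tuple is a list of field values.
public class BagProjectionSketch {

    // Keep only the fields at the given positions from every tuple in the bag.
    public static List<List<Object>> project(List<List<Object>> bag, int... positions) {
        List<List<Object>> result = new ArrayList<>();
        for (List<Object> tuple : bag) {
            List<Object> projected = new ArrayList<>(positions.length);
            for (int position : positions) {
                projected.add(tuple.get(position));
            }
            result.add(projected);
        }
        return result;
    }

    public static void main(String[] args) {
        // Tuples follow the (name, color, cost) order from the question.
        List<List<Object>> fruit = new ArrayList<>();
        fruit.add(Arrays.<Object>asList("Bannana", "Yellow", 1.6));
        fruit.add(Arrays.<Object>asList("Apple", "Red", 1.4));

        // Positions 0 and 2 are name and cost, mirroring t.get(0) and t.get(2).
        System.out.println(project(fruit, 0, 2));
        // prints [[Bannana, 1.6], [Apple, 1.4]]
    }
}
```

As in the UDF, the result depends on the field order of the input tuples: if the order were (color, name, cost), positions 0 and 2 would select color and cost instead.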
Pig code:
REGISTER /path/to/jar/exp1-1.0-SNAPSHOT.jar;

input_data = LOAD 'text.json'
    USING JsonLoader('state:chararray,fruit:{(name:chararray,color:chararray,cost:float)}');

reformat_data = FOREACH input_data
    GENERATE state, com.example.exp1.ReformatBag(fruit);

STORE reformat_data
    INTO 'output.json'
    USING JsonStorage();
Output:
{"state":"VA","fruit":[{"name":"Bannana","cost":1.6},{"name":"Apple","cost":1.4}]}
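The UDF picks fields by position (0 and 2), which ties it to the column order declared in the loader schema. One possible alternative is to select by key name instead. The sketch below illustrates that idea in plain Java on ordinary maps; it is not Pig code, and the class name KeySubsetSketch and the method keepKeys are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: each record is a map of key-value pairs,
// and only the requested keys are copied to the output.
public class KeySubsetSketch {

    // Return a copy of each record that contains only the requested keys,
    // in the order the keys were requested. Missing keys are skipped.
    public static List<Map<String, Object>> keepKeys(List<Map<String, Object>> records,
                                                     List<String> keys) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> record : records) {
            Map<String, Object> kept = new LinkedHashMap<>();
            for (String key : keys) {
                if (record.containsKey(key)) {
                    kept.put(key, record.get(key));
                }
            }
            result.add(kept);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> banana = new LinkedHashMap<>();
        banana.put("name", "Bannana");
        banana.put("color", "Yellow");
        banana.put("cost", 1.6);

        Map<String, Object> apple = new LinkedHashMap<>();
        apple.put("name", "Apple");
        apple.put("color", "Red");
        apple.put("cost", 1.4);

        // Keep only name and cost, dropping color.
        System.out.println(keepKeys(Arrays.asList(banana, apple), Arrays.asList("name", "cost")));
    }
}
```

Selecting by name keeps working if the source fields are reordered, at the cost of a lookup per key.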