How to query struct array with Hive (get_json_object)?
【Posted】2015-02-25 10:08:08
【Question】I have the following JSON object stored in a Hive table:
{
  "main_id": "qwert",
  "features": [
    {
      "scope": "scope1",
      "name": "foo",
      "value": "ab12345",
      "age": 50,
      "somelist": ["abcde", "fghij"]
    },
    {
      "scope": "scope2",
      "name": "bar",
      "value": "cd67890"
    },
    {
      "scope": "scope3",
      "name": "baz",
      "value": ["A", "B", "C"]
    }
  ]
}
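"features" is an array of variable length, i.e. all of its objects are optional. The objects can have arbitrary elements, but every one of them contains "scope", "name" and "value".
This is the Hive table I created: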
CREATE TABLE tbl(
main_id STRING,features array<struct<scope:STRING,name:STRING,value:array<STRING>,age:INT,somelist:array<STRING>>>
)
I need a Hive query that returns the main_id and the value of the struct named "baz", i.e.,
main_id baz_value
qwert ["A","B","C"]
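My problem is that the Hive UDF "get_json_object" supports only a limited version of JSONPath. It does not support a path like get_json_object(features, '$.features[?(@.name='baz')]').
How can I get the wanted result with a Hive query? Would it perhaps be easier with a different Hive table structure?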
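【Answer 1】I found a solution:
Use the Hive explode UDTF to explode the struct array, i.e. create a second (temporary) table with one record for each struct in the array "features".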
CREATE TABLE tbl_exploded AS
SELECT main_id,
       f.name AS f_name,
       f.value AS f_value
FROM tbl
LATERAL VIEW explode(features) exploded_table AS f
-- optionally filter here instead of in the 2nd query:
-- WHERE f.name = 'baz'
;
The result of this is:
qwert, foo, ab12345
qwert, bar, cd67890
qwert, baz, ["A","B","C"]
Now you can select the main_id and the value like this:
select main_id, f_value from tbl_exploded where f_name = 'baz';
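To make the row semantics of the explode step concrete, here is a small Hive-free sketch in plain Java. It only models what the two queries above do to the sample record; the class, the method names and the simplified `Feature` type are hypothetical and not part of Hive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hive-free model of LATERAL VIEW explode(features); all names are hypothetical.
class ExplodeSketch {

    // One element of the "features" array (only the fields the queries use).
    static final class Feature {
        final String name;
        final String value;
        Feature(String name, String value) { this.name = name; this.value = value; }
    }

    // One row of the source table: main_id plus its array of structs.
    static final class Row {
        final String mainId;
        final List<Feature> features;
        Row(String mainId, List<Feature> features) { this.mainId = mainId; this.features = features; }
    }

    // Emits one "main_id, f_name, f_value" line per array element, like tbl_exploded.
    static List<String> explode(List<Row> table) {
        List<String> out = new ArrayList<>();
        for (Row row : table) {
            for (Feature f : row.features) {
                out.add(row.mainId + ", " + f.name + ", " + f.value);
            }
        }
        return out;
    }

    // Models: select main_id, f_value from tbl_exploded where f_name = <wantedName>
    static List<String> valuesFor(List<Row> table, String wantedName) {
        List<String> out = new ArrayList<>();
        for (Row row : table) {
            for (Feature f : row.features) {
                if (wantedName.equals(f.name)) {
                    out.add(row.mainId + ", " + f.value);
                }
            }
        }
        return out;
    }

    // The sample record from the question, with values flattened to strings.
    static List<Row> sampleTable() {
        return Arrays.asList(new Row("qwert", Arrays.asList(
                new Feature("foo", "ab12345"),
                new Feature("bar", "cd67890"),
                new Feature("baz", "[\"A\",\"B\",\"C\"]"))));
    }

    public static void main(String[] args) {
        explode(sampleTable()).forEach(System.out::println);
        valuesFor(sampleTable(), "baz").forEach(System.out::println);
    }
}
```

As in this sketch, a plain LATERAL VIEW produces no output row for a record whose array is empty; Hive's LATERAL VIEW OUTER keeps such records with NULL in the exploded columns.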
【Answer 2】This UDF should do the job: ParseJsonWithPath
ADD JAR your-path/ParseJsonWithPath.jar;
CREATE TEMPORARY FUNCTION parseJsonWithPath AS 'com.ntc.hive.udf.ParseJsonWithPath';
SELECT parseJsonWithPath(jsonStr, xpath) FROM ....
The field to parse can be a JSON string (jsonStr); given the xpath, you can get what you want.
For example:
jsonStr
{
  "book": [
    {
      "category": "reference",
      "author": "Nigel Rees",
      "title": "Sayings of the Century",
      "price": 8.95
    },
    {
      "category": "fiction",
      "author": "Evelyn Waugh",
      "title": "Sword of Honour",
      "price": 12.99
    }
  ]
}
xpath
"$.book"
returns the inner JSON string [....]
"$.book[?(@.price < 10)]"
returns [8.95]
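To illustrate what a filter predicate such as [?(@.price < 10)] does, here is a rough sketch in plain Java: it walks the array and keeps the elements that satisfy the condition. It is not the ParseJsonWithPath implementation and does not model how that UDF shapes its return value; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the filter step in $.book[?(@.price < 10)]; names are hypothetical.
class PathFilterSketch {

    // Builds one "book" object as an ordered map.
    static Map<String, Object> book(String category, String author, String title, double price) {
        Map<String, Object> b = new LinkedHashMap<>();
        b.put("category", category);
        b.put("author", author);
        b.put("title", title);
        b.put("price", price);
        return b;
    }

    // The two books from the example above.
    static List<Map<String, Object>> sampleBooks() {
        return Arrays.asList(
                book("reference", "Nigel Rees", "Sayings of the Century", 8.95),
                book("fiction", "Evelyn Waugh", "Sword of Honour", 12.99));
    }

    // Keeps the array elements whose numeric "price" is below the limit.
    static List<Map<String, Object>> cheaperThan(List<Map<String, Object>> books, double limit) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> b : books) {
            Object price = b.get("price");
            if (price instanceof Number && ((Number) price).doubleValue() < limit) {
                out.add(b);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Map<String, Object> b : cheaperThan(sampleBooks(), 10.0)) {
            System.out.println(b.get("title") + " -> " + b.get("price"));
        }
    }
}
```

With the sample data only the first book passes the filter, which is the element whose price the answer reports.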
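【Answer 3】I think the UDF pasted below is close to what you need. It takes an array<struct>, a string and an integer. The string is the name of the field to look at (in your case "name"), and the third argument is the value to match. At the moment it expects an integer, but for your purposes it should be relatively easy to change that to a string/text.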
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.lazy.LazyString;
import org.apache.hadoop.hive.serde2.lazy.LazyLong;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.LongObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableConstantIntObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableConstantStringObjectInspector;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import java.util.ArrayList;
import org.apache.hadoop.hive.serde2.lazy.objectinspector.primitive.LazyLongObjectInspector;
// Note: the description text below talks about collecting "value" fields into an
// array<string>, but the code returns the first struct whose named field equals
// the third argument.
@Description(name = "extract_value",
    value = "_FUNC_( array< struct<value:string> > ) - Collect all \"value\" field values inside an array of struct(s), and return the results in an array<string>",
    extended = "Example:\n SELECT _FUNC_(array_of_structs_with_value_field)")
public class StructFromArrayStructDynamicInt extends GenericUDF {
    private ArrayList ret;
    private ListObjectInspector listOI;
    private StructObjectInspector structOI;
    private ObjectInspector indOI;
    private ObjectInspector valOI;
    private ObjectInspector arg1OI;
    private ObjectInspector arg2OI;
    private String indexName;
    WritableConstantStringObjectInspector element1OI;
    WritableConstantIntObjectInspector element2OI;

    @Override
    public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 3) {
            throw new UDFArgumentLengthException("The function extract_value() requires exactly three arguments.");
        }
        if (args[0].getCategory() != Category.LIST) {
            throw new UDFArgumentTypeException(0, "Type array<struct> is expected to be the argument for extract_value but " + args[0].getTypeName() + " is found instead");
        }
        if (args[1].getCategory() != Category.PRIMITIVE) {
            throw new UDFArgumentTypeException(1, "Second argument is expected to be primitive but " + args[1].getTypeName() + " is found instead");
        }
        if (args[2].getCategory() != Category.PRIMITIVE) {
            throw new UDFArgumentTypeException(2, "Third argument is expected to be primitive but " + args[2].getTypeName() + " is found instead");
        }
        listOI = ((ListObjectInspector) args[0]);
        structOI = ((StructObjectInspector) listOI.getListElementObjectInspector());
        arg1OI = (StringObjectInspector) args[1];
        arg2OI = args[2];
        // Both the field name and the value to match must be constants in the query
        this.element1OI = (WritableConstantStringObjectInspector) arg1OI;
        this.element2OI = (WritableConstantIntObjectInspector) arg2OI;
        indexName = element1OI.getWritableConstantValue().toString();
        // if (structOI.getAllStructFieldRefs().size() != 2)
        //     throw new UDFArgumentTypeException(0, "Incorrect number of fields in the struct, should be one");
        //
        // StructField valueField = structOI.getStructFieldRef("value");
        StructField indexField = structOI.getStructFieldRef(indexName);
        // If not, throw exception
        // if (valueField == null)
        //     throw new UDFArgumentTypeException(0, "NO \"value\" field in input structure");
        //
        if (indexField == null) {
            throw new UDFArgumentTypeException(0, "Index field not in input structure");
        }
        // Are they of the correct types?
        // We store these object inspectors for use in the evaluate() method
        // valOI = valueField.getFieldObjectInspector();
        indOI = indexField.getFieldObjectInspector();
        // First are they primitives
        // if (valOI.getCategory() != Category.PRIMITIVE)
        //     throw new UDFArgumentTypeException(0, "value field must be of primitive type");
        //
        if (indOI.getCategory() != Category.PRIMITIVE) {
            throw new UDFArgumentTypeException(0, "index field must be of primitive type");
        }
        if (arg1OI.getCategory() != Category.PRIMITIVE) {
            throw new UDFArgumentTypeException(1, "second argument must be primitive type");
        }
        if (arg2OI.getCategory() != Category.PRIMITIVE) {
            throw new UDFArgumentTypeException(2, "third argument must be primitive type");
        }
        // Are they of the correct primitives?
        // if (((PrimitiveObjectInspector)valOI).getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.STRING)
        //     throw new UDFArgumentTypeException(0, "value field must be of string type");
        //
        if (((PrimitiveObjectInspector) indOI).getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.LONG) {
            throw new UDFArgumentTypeException(0, "index field must be of long type");
        }
        if (((PrimitiveObjectInspector) arg1OI).getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
            throw new UDFArgumentTypeException(1, "second arg must be of string type");
        }
        if (((PrimitiveObjectInspector) arg2OI).getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.INT) {
            throw new UDFArgumentTypeException(2, "third arg must be of int type");
        }
        // ret = new ArrayList();
        // The UDF returns one element of the input array, i.e. a struct
        return listOI.getListElementObjectInspector();
        // return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
        // return ObjectInspectorFactory.getStandardListObjectInspector(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // ret.clear();
        if (arguments.length != 3) {
            return null;
        }
        if (arguments[0].get() == null) {
            return null;
        }
        int numElements = listOI.getListLength(arguments[0].get());
        // long xl = argOI.getPrimitiveJavaObject(arguments[1].get());
        // long xl = arguments[1].get(); //9;
        // The value to look for (third argument)
        long xl2 = element2OI.get(arguments[2].get());
        // String xl1 = element1OI.getPrimitiveJavaObject(arguments[2].get());
        // long xl = 9;
        for (int i = 0; i < numElements; i++) {
            // LazyString valDataObject = (LazyString) (structOI.getStructFieldData(listOI.getListElement(arguments[0].get(), i), structOI.getStructFieldRef("value")));
            long indValue = (Long) (structOI.getStructFieldData(listOI.getListElement(arguments[0].get(), i), structOI.getStructFieldRef(indexName)));
            // throw new HiveException("second arg must be of string type");
            // LazyString indDataObject = (LazyString) (structOI.getStructFieldData(listOI.getListElement(arguments[0].get(), i), structOI.getStructFieldRef("index")));
            // Text valueValue = ((StringObjectInspector) valOI).getPrimitiveWritableObject(valDataObject);
            // LongWritable indValue = ((LazyLongObjectInspector) indOI).getPrimitiveWritableObject(indDataObject);
            if (indValue == xl2) {
                // Return the whole struct of the first match
                return listOI.getListElement(arguments[0].get(), i);
            }
            // ret.add(valueValue);
        }
        // No element matched
        return null;
    }

    @Override
    public String getDisplayString(String[] strings) {
        assert (strings.length > 0);
        StringBuilder sb = new StringBuilder();
        sb.append("extract_value(");
        sb.append(strings[0]);
        sb.append(")");
        return sb.toString();
    }
}
This code is published together with several other working UDFs that operate on array<struct>.
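The answer suggests switching the comparison from an integer to a string. As a rough, Hive-free sketch of what the lookup loop would then do, the following plain Java returns the first struct whose named field equals a given string. Structs are modelled as maps here; the class and method names are hypothetical, and a real UDF would still need the ObjectInspector handling shown above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hive-free model of the UDF's lookup loop with a string comparison; names are hypothetical.
class FirstMatchSketch {

    // Returns the first struct whose field `fieldName` equals `wanted`, or null if none does.
    static Map<String, Object> firstMatch(List<Map<String, Object>> structs,
                                          String fieldName, String wanted) {
        if (structs == null) {
            return null; // mirrors the null check on the array argument
        }
        for (Map<String, Object> s : structs) {
            Object v = s.get(fieldName);
            if (v != null && wanted.equals(v.toString())) {
                return s; // whole struct, like the UDF's return value
            }
        }
        return null;
    }

    // Builds one element of the "features" array from the question.
    static Map<String, Object> feature(String scope, String name, Object value) {
        Map<String, Object> f = new LinkedHashMap<>();
        f.put("scope", scope);
        f.put("name", name);
        f.put("value", value);
        return f;
    }

    static List<Map<String, Object>> sampleFeatures() {
        return Arrays.asList(
                feature("scope1", "foo", "ab12345"),
                feature("scope2", "bar", "cd67890"),
                feature("scope3", "baz", Arrays.asList("A", "B", "C")));
    }

    public static void main(String[] args) {
        Map<String, Object> hit = firstMatch(sampleFeatures(), "name", "baz");
        System.out.println(hit == null ? "no match" : String.valueOf(hit.get("value")));
    }
}
```

Returning the whole struct rather than a single field keeps the caller free to pick any field afterwards, which is also what the UDF above does.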
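【Answer 4】Suppose you have a column my_data of type array (of structs with an id field, as the query implies):
select * from <table_name> where my_data[n].id = 10;
Here n is the index of the array element you want to check, which saves you the lateral-view explode query. This only works when the position of the element in the array is known in advance.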