Dataflow BigQuery read does not return correct datatype
【Posted】: 2018-05-20 09:58:50
【Question】: In Apache Beam/Dataflow, I read data into a collection with the following code:
// read the BigQuery data
PCollection<TableRow> bigQuerySource = p
.apply(BigQueryIO.readTableRows().fromQuery(bigQueryQuery).usingStandardSql().withTemplateCompatibility());
The query is "Select * from ..".
It queries a view, which in turn queries other views and tables.
In the next transform, I use the following:
..
public void processElement(ProcessContext c) {
    Set<Map.Entry<String, Object>> entries = c.element().entrySet();
    for (Map.Entry<String, Object> entry : entries) {
        Object value = entry.getValue();
        // Runtime class name of the value, e.g. java.lang.String
        String x = value.getClass().getName();
        ..
    }
}
The view contains several data types (String/Date/Integer/Boolean), but the only types reported in x are String and Boolean.
How can I get the original data types from the BigQuery schema?
【Answer 1】: If you obtain an instance of com.google.cloud.bigquery.BigQuery, you can get the column types from it. For example, to get the type of the first column:
BigQuery bigQuery = BigQueryOptions.newBuilder()
.setProjectId(projectId)
.setCredentials(...)
.build()
.getService();
bigQuery.getTable(id).getDefinition().getSchema().getFields().get(0).getType();
This gives you a LegacySQLTypeName. According to the source code, these are the values you can expect:
/** Variable-length binary data. */
public static final LegacySQLTypeName BYTES = type.createAndRegister("BYTES").setStandardType(StandardSQLTypeName.BYTES);
/** Variable-length character (Unicode) data. */
public static final LegacySQLTypeName STRING = type.createAndRegister("STRING").setStandardType(StandardSQLTypeName.STRING);
/** A 64-bit signed integer value. */
public static final LegacySQLTypeName INTEGER = type.createAndRegister("INTEGER").setStandardType(StandardSQLTypeName.INT64);
/** A 64-bit IEEE binary floating-point value. */
public static final LegacySQLTypeName FLOAT = type.createAndRegister("FLOAT").setStandardType(StandardSQLTypeName.FLOAT64);
/** A Boolean value (true or false). */
public static final LegacySQLTypeName BOOLEAN = type.createAndRegister("BOOLEAN").setStandardType(StandardSQLTypeName.BOOL);
/** Represents an absolute point in time, with microsecond precision. */
public static final LegacySQLTypeName TIMESTAMP = type.createAndRegister("TIMESTAMP").setStandardType(StandardSQLTypeName.TIMESTAMP);
/** Represents a logical calendar date. Note, support for this type is limited in legacy SQL. */
public static final LegacySQLTypeName DATE = type.createAndRegister("DATE").setStandardType(StandardSQLTypeName.DATE);
/**
* Represents a time, independent of a specific date, to microsecond precision. Note, support for
* this type is limited in legacy SQL.
*/
public static final LegacySQLTypeName TIME = type.createAndRegister("TIME").setStandardType(StandardSQLTypeName.TIME);
/**
* Represents a year, month, day, hour, minute, second, and subsecond (microsecond precision).
* Note, support for this type is limited in legacy SQL.
*/
public static final LegacySQLTypeName DATETIME = type.createAndRegister("DATETIME").setStandardType(StandardSQLTypeName.DATETIME);
/** A record type with a nested schema. */
public static final LegacySQLTypeName RECORD = type.createAndRegister("RECORD").setStandardType(StandardSQLTypeName.STRUCT);
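Once the column type names are known, they could be used inside a transform to turn the raw row values back into typed Java objects. The BigQueryIO documentation notes that integer values in a TableRow are encoded as strings to match BigQuery's exported JSON format, which is consistent with the String/Boolean classes seen in the question. The following is only a sketch and not part of the original answer: it uses the Java standard library alone, a plain Map as a stand-in for TableRow, and a hypothetical convert helper with a hard-coded column-to-type map that would in practice come from the table schema. Only a few of the types listed above are handled.

```java
import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.Map;

final class TableRowTypes {

    // Hypothetical helper: converts a raw row value to a typed Java object,
    // based on a BigQuery type name such as "INTEGER" or "DATE".
    static Object convert(String typeName, Object raw) {
        if (raw == null || typeName == null) {
            return raw; // nothing to convert, or the column type is unknown
        }
        String text = raw.toString();
        switch (typeName) {
            case "INTEGER":
                return Long.parseLong(text);
            case "FLOAT":
                return Double.parseDouble(text);
            case "BOOLEAN":
                return Boolean.parseBoolean(text);
            case "DATE":
                return LocalDate.parse(text); // expects ISO format, yyyy-MM-dd
            default:
                return text; // STRING and every type not handled in this sketch
        }
    }

    public static void main(String[] args) {
        // Assumed column name -> type name mapping (illustrative values only).
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("id", "INTEGER");
        schema.put("created", "DATE");
        schema.put("active", "BOOLEAN");
        schema.put("name", "STRING");

        // Stand-in for a TableRow: most values arrive as strings.
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", "42");
        row.put("created", "2018-05-20");
        row.put("active", Boolean.TRUE);
        row.put("name", "abc");

        for (Map.Entry<String, Object> entry : row.entrySet()) {
            Object typed = convert(schema.get(entry.getKey()), entry.getValue());
            System.out.println(entry.getKey() + " -> " + typed.getClass().getName());
        }
    }
}
```

Because the mapping is just column name to type name, it could be built once (for example when the pipeline is constructed) and passed to the transform, rather than being looked up per element.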
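【Comments】:
Thanks for your answer. The pipeline runs as a template, which limits my options. I am looking for an option within the context of the TableRow. At this point all fields are listed as unknown fields in the object, which does suggest that known fields exist. I don't know what I am missing here, but it seems odd to me that a TableRow does not know which data types it contains.
【Answer 2】: I just found a similar question, "how can I get a bigquery table schema in java", where the schema came back null; they fixed it by calling table.reload() first.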
Schema schema = table.getDefinition().getSchema();
You can also look at the corresponding class implementations and methods here: http://googlecloudplatform.github.io/google-cloud-java/google-cloud-clients/apidocs/?com/google/cloud/bigquery/package-summary.html