Apache Spark Java - how to iterate through row dataset and remove null fields
【Posted】: 2018-05-18 17:38:35 【Question】: I am trying to build a Spark application that reads data from a Hive table and writes the output as JSON.
In the code below, I need to iterate over the row dataset and remove any blank fields before writing the output.
This is the output I am expecting; please suggest how I can achieve this:
{"personId":"101","personName":"Sam","email":"Sam@gmail.com"}
{"personId":"102","personName":"Smith"} // as email is null or blank, it should not be included in the output
Here is my code:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import com.fdc.model.Person;

public class ExtractionExample {

    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("ExtractionExample")
                .config("spark.sql.warehouse.dir", "/user/hive/warehouse/")
                .enableHiveSupport().getOrCreate();

        Dataset<Row> sqlDF = spark.sql(
                "SELECT person_id as personId, person_name as personName, email_id as emailId FROM person");
        Dataset<Person> person = sqlDF.as(Encoders.bean(Person.class));

        /*
         * Attempt 1: iterate through all the columns, identify null values and drop them.
         * It looks like this would drop the column from the entire table, but when I
         * tried it, it didn't do anything.
         *
         * String[] columns = sqlDF.columns();
         * for (String column : columns) {
         *     String colValue = sqlDF.select(column).toString();
         *     System.out.println("printing the column: " + column + " colvalue:" + colValue);
         *     if (colValue != null && colValue.trim().isEmpty()) {
         *         System.out.println("dropping the null value");
         *         sqlDF = sqlDF.drop(column);
         *     }
         * }
         * sqlDF.write().json("/data/testdb/test/person_json");
         */

        /*
         * Attempt 2: unable to get to the bottom of this one. Also, collect() is a
         * heavy operation -- is there a better way to do this?
         *
         * List<Row> rowListDf = person.javaRDD().map(new Function<Row, Row>() {
         *     @Override
         *     public Row call(Row record) throws Exception {
         *         String[] fieldNames = record.schema().fieldNames();
         *         Row modifiedRecord = RowFactory.create();
         *         for (int i = 0; i < fieldNames.length; i++) {
         *             String value = record.getAs(i).toString();
         *             if (value != null && !value.trim().isEmpty()) {
         *                 // RowFactory.create(record.get(i)); ---> throwing this error
         *                 // return RowFactory object
         *             }
         *         }
         *         return null;
         *     }
         * }).collect();
         */

        person.write().json("/data/testdb/test/person_json");
    }
}
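For reference, the per-record filtering that the second commented-out attempt is reaching for boils down to: keep only the fields whose values are non-null and non-blank. A minimal, Spark-free sketch of that core logic (the `toJson` helper and the hard-coded maps below are illustrative, not part of the question's code; a real application would use a JSON library, or simply rely on Spark's JSON writer as the answer below suggests):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class DropBlankFields {

    // Build a JSON object string from the non-null, non-blank fields only.
    // Hand-rolled serializer for illustration; it does not escape values.
    static String toJson(Map<String, String> fields) {
        return fields.entrySet().stream()
                .filter(e -> e.getValue() != null && !e.getValue().trim().isEmpty())
                .map(e -> "\"" + e.getKey() + "\":\"" + e.getValue() + "\"")
                .collect(Collectors.joining(",", "{", "}"));
    }

    public static void main(String[] args) {
        Map<String, String> sam = new LinkedHashMap<>();
        sam.put("personId", "101");
        sam.put("personName", "Sam");
        sam.put("email", "Sam@gmail.com");

        Map<String, String> smith = new LinkedHashMap<>();
        smith.put("personId", "102");
        smith.put("personName", "Smith");
        smith.put("email", "");  // blank: should be dropped

        System.out.println(toJson(sam));
        // prints {"personId":"101","personName":"Sam","email":"Sam@gmail.com"}
        System.out.println(toJson(smith));
        // prints {"personId":"102","personName":"Smith"}
    }
}
```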
【Comments】:
There is nothing to do here. The JSON writer ignores NULL fields by default. If you have blank strings, you will have to convert them to NULL as well.
Thanks; I had assumed we would need to iterate through every row of the dataset and remove the null values.
Don't mention it.
【Answer 1】:
As user9613318 suggested, the JSON writer ignores NULL fields by default.
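If the Hive columns can contain blank strings rather than real NULLs, one approach (a sketch under the question's column names, not tested against the asker's cluster) is to normalize blanks to NULL before writing, using `when`/`otherwise` from `org.apache.spark.sql.functions`; the JSON writer then drops them along with the genuine NULLs:

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.trim;
import static org.apache.spark.sql.functions.when;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BlankToNullExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("BlankToNullExample").enableHiveSupport().getOrCreate();

        Dataset<Row> df = spark.sql(
                "SELECT person_id as personId, person_name as personName, email_id as emailId FROM person");

        // Replace empty or whitespace-only strings with NULL in every column.
        // Real NULLs fall through the when() condition to otherwise() unchanged.
        for (String c : df.columns()) {
            df = df.withColumn(c,
                    when(trim(col(c)).equalTo(""), (String) null).otherwise(col(c)));
        }

        // The JSON writer omits NULL fields, giving the desired output.
        df.write().json("/data/testdb/test/person_json");
    }
}
```

This avoids collecting the dataset to the driver, which the asker was rightly worried about: the blank-to-NULL rewrite runs as ordinary column expressions on the executors.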
【Discussion】: