In Apache Spark, I have a dataframe with a column containing date strings, but the leading zero is missing from single-digit months and days
Posted: 2020-08-02 05:47:34

【Question】:

import org.apache.spark.sql.functions.regexp_replace
val df = spark.createDataFrame(Seq(
(1, "9/11/2020"),
(2, "10/11/2020"),
(3, "1/1/2020"),
(4, "12/7/2020"))).toDF("Id", "x4")
val newDf = df
.withColumn("x4New", regexp_replace(df("x4"), "(?:(\\d{2}))/(?:(\\d{1}))/(?:(\\d{4}))", "$1/0$2/$3"))
val newDf1 = newDf
.withColumn("x4New1", regexp_replace(df("x4"), "(?:(\\d{1}))/(?:(\\d{1}))/(?:(\\d{4}))", "0$1/0$2/$3"))
.withColumn("x4New2", regexp_replace(df("x4"), "(?:(\\d{1}))/(?:(\\d{2}))/(?:(\\d{4}))", "0$1/$2/$3"))
newDf1.show
Current output:
+---+----------+----------+-----------+-----------+
| Id| x4| x4New| x4New1| x4New2|
+---+----------+----------+-----------+-----------+
| 1| 9/11/2020| 9/11/2020| 9/11/2020| 09/11/2020|
| 2|10/11/2020|10/11/2020| 10/11/2020|100/11/2020|
| 3| 1/1/2020| 1/1/2020| 01/01/2020| 1/1/2020|
| 4| 12/7/2020|12/07/2020|102/07/2020| 12/7/2020|
+---+----------+----------+-----------+-----------+
Desired output: add a leading zero wherever the day or month is a single digit. I don't want to use a UDF, for performance reasons.
+---+----------+----------+
| Id| x4| date |
+---+----------+----------+
| 1| 9/11/2020|09/11/2020|
| 2|10/11/2020|10/11/2020|
| 3| 1/1/2020|01/01/2020|
| 4| 12/7/2020|12/07/2020|
+---+----------+----------+
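One way to get exactly this output in a single pass (rather than several chained replacements) is to zero-pad any digit that stands alone between word boundaries. This is a sketch of an alternative, not code from the thread; it is shown with Python's `re` module so it can be checked outside Spark, and the same pattern should carry over to `regexp_replace` since Spark uses Java regex, where `\b` behaves the same way:

```python
import re

# Zero-pad any single digit that stands alone between word boundaries.
# "2020" is untouched because it is four digits; "10" because it is two.
def pad_date(s: str) -> str:
    return re.sub(r"\b(\d)\b", r"0\1", s)

for s in ["9/11/2020", "10/11/2020", "1/1/2020", "12/7/2020"]:
    print(pad_date(s))  # 09/11/2020, 10/11/2020, 01/01/2020, 12/07/2020
```

In Spark the equivalent would be `regexp_replace(col("x4"), "\\b(\\d)\\b", "0$1")`.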
【问题讨论】:
Use word boundaries on both sides of the regex, \b (escaped backslash: "\\b"). You can also drop the {1}.

That doesn't work either: .withColumn("x4New", regexp_replace(df("x4"), "(?:(\\b\\d{2}))/(?:(\\d))/(?:(\\d{4})\\b)", "$1/0$2/$3"))

Using the Spark 3.0 preview I still get NULLs, but on Databricks Community Edition your code works as you say, with no single-digit problem. Not sure what the issue is.
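The word-boundary suggestion in the comments matters because without \b a pattern can match in the middle of a longer date string. A quick check with Python's `re` (Java regex, which Spark uses, treats `\b` the same way):

```python
import re

pat_no_boundary = r"(\d{1})/(\d{1})/(\d{4})"
pat_boundary    = r"\b(\d{1})/(\d{1})/(\d{4})\b"

# Without \b the pattern matches the substring "2/7/2020" inside "12/7/2020",
# producing the mangled "102/07/2020" seen in the question's x4New1 column.
print(re.sub(pat_no_boundary, r"0\1/0\2/\3", "12/7/2020"))  # 102/07/2020

# With \b that partial match is rejected and the value is left untouched.
print(re.sub(pat_boundary, r"0\1/0\2/\3", "12/7/2020"))     # 12/7/2020
```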
【Answer 1】:
Use the built-in functions from_unixtime with unix_timestamp, (or) date_format with to_timestamp, (or) to_date.

Example (Spark 2.4):
import org.apache.spark.sql.functions._
//sample data
val df = spark.createDataFrame(Seq((1, "9/11/2020"),(2, "10/11/2020"),(3, "1/1/2020"), (4, "12/7/2020"))).toDF("Id", "x4")
//using from_unixtime
df.withColumn("date",from_unixtime(unix_timestamp(col("x4"),"MM/dd/yyyy"),"MM/dd/yyyy")).show()
//using date_format
df.withColumn("date",date_format(to_timestamp(col("x4"),"MM/dd/yyyy"),"MM/dd/yyyy")).show()
df.withColumn("date",date_format(to_date(col("x4"),"MM/dd/yyyy"),"MM/dd/yyyy")).show()
//+---+----------+----------+
//| Id| x4| date|
//+---+----------+----------+
//| 1| 9/11/2020|09/11/2020|
//| 2|10/11/2020|10/11/2020|
//| 3| 1/1/2020|01/01/2020|
//| 4| 12/7/2020|12/07/2020|
//+---+----------+----------+
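The reason this approach handles single digits is that Spark 2.4's SimpleDateFormat-based parser is lenient about field width, much like Python's strptime; the round-trip can be illustrated outside Spark. (The nulls some commenters see on the Spark 3.0 preview are plausibly due to its stricter java.time-based parser, for which setting spark.sql.legacy.timeParserPolicy to LEGACY is the commonly suggested workaround; that is an assumption here, not something confirmed in this thread.)

```python
from datetime import datetime

# Parse the lenient "M/D/YYYY" strings and re-emit them zero-padded --
# the same round-trip Spark 2.4 does with unix_timestamp + from_unixtime.
def normalize(s: str) -> str:
    return datetime.strptime(s, "%m/%d/%Y").strftime("%m/%d/%Y")

for s in ["9/11/2020", "10/11/2020", "1/1/2020", "12/7/2020"]:
    print(normalize(s))  # 09/11/2020, 10/11/2020, 01/01/2020, 12/07/2020
```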
【Comments】:
The code above does not give the desired output; it returns null for single-digit months and days. @yogesh, I cannot reproduce that scenario. Which version of Spark are you using? As you can see in my example output, it works even when the month and day are single digits!

On my laptop with the Apache Spark 3.0 preview your example code returns nulls, but on Databricks Community Edition it works. Not sure what is going on.

【Answer 2】:
Found a workaround; let me know if there is a better solution. It uses a single dataframe and no UDF.
import org.apache.spark.sql.functions.{regexp_replace, to_date}
val df = spark.createDataFrame(Seq(
(1, "9/11/2020"),
(2, "10/11/2020"),
(3, "1/1/2020"),
(4, "12/7/2020"))).toDF("Id", "x4")
val newDf = df.withColumn("x4New", regexp_replace(df("x4"), "(?:(\\b\\d{2}))/(?:(\\d))/(?:(\\d{4})\\b)", "$1/0$2/$3"))
val newDf1 = newDf.withColumn("x4New1", regexp_replace(newDf("x4New"), "(?:(\\b\\d{1}))/(?:(\\d))/(?:(\\d{4})\\b)", "0$1/$2/$3"))
val newDf2 = newDf1.withColumn("x4New2", regexp_replace(newDf1("x4New1"), "(?:(\\b\\d{1}))/(?:(\\d{2}))/(?:(\\d{4})\\b)", "0$1/$2/$3"))
val newDf3 = newDf2.withColumn("date", to_date(regexp_replace(newDf2("x4New2"), "(?:(\\b\\d{2}))/(?:(\\d{1}))/(?:(\\d{4})\\b)", "$1/0$2/$3"),"MM/dd/yyyy"))
val formatedDataDf = newDf3
.drop("x4New")
.drop("x4New1")
.drop("x4New2")
formatedDataDf.printSchema
formatedDataDf.show
The output looks as follows:
root
|-- Id: integer (nullable = false)
|-- x4: string (nullable = true)
|-- date: date (nullable = true)
+---+----------+----------+
| Id| x4| date|
+---+----------+----------+
| 1| 9/11/2020|2020-09-11|
| 2|10/11/2020|2020-10-11|
| 3| 1/1/2020|2020-01-01|
| 4| 12/7/2020|2020-12-07|
+---+----------+----------+
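The four chained replacements above can be sanity-checked outside Spark as well. Here is the same cascade expressed with Python's `re` module (the pattern semantics of `\b` and `\d{n}` carry over from Java regex; only `$1` becomes `\1`):

```python
import re

# The four regexp_replace passes from the workaround, applied in order.
PASSES = [
    (r"\b(\d{2})/(\d)/(\d{4})\b",    r"\1/0\2/\3"),  # 12/7/2020 -> 12/07/2020
    (r"\b(\d{1})/(\d)/(\d{4})\b",    r"0\1/\2/\3"),  # 1/1/2020  -> 01/1/2020
    (r"\b(\d{1})/(\d{2})/(\d{4})\b", r"0\1/\2/\3"),  # 9/11/2020 -> 09/11/2020
    (r"\b(\d{2})/(\d{1})/(\d{4})\b", r"\1/0\2/\3"),  # 01/1/2020 -> 01/01/2020
]

def pad(s: str) -> str:
    for pat, repl in PASSES:
        s = re.sub(pat, repl, s)
    return s

for s in ["9/11/2020", "10/11/2020", "1/1/2020", "12/7/2020"]:
    print(pad(s))  # 09/11/2020, 10/11/2020, 01/01/2020, 12/07/2020
```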
【Comments】: