Create new columns from a string column
【Title】: create new columns from string column 【Posted】: 2018-05-08 20:06:24 【Question】: I have a DataFrame with a string column
val df= Seq(("0003C32C-FC1D-482F-B543-3CBD7F0A0E36 0,8,1,799,300:3 0,6,1,330,300:1 2,6,1,15861:1 0,7,1,734,300:1 0,6,0,95,300:1 2,7,1,15861:1 0,8,0,134,300:3")).toDF("col_str")
+--------------------+
| col_str|
+--------------------+
|0003C32C-FC1D-482...|
+--------------------+
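The string column consists of character sequences separated by spaces. If a sequence starts with 0, I want to return the second number and the last number of that sequence. The second number can be any number from 0 to 8.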
Array("8,3", "6,1", "7,1", "6,1", "8,3")
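The extraction rule above can be sketched in plain Python, which the question accepts alongside Scala. This is a minimal sketch rather than Spark code: the function name `extract_pairs` is made up for illustration, and it assumes the first token is the ID and every later token has the form `a,b,...:z`.

```python
import re

def extract_pairs(col_str):
    """Return "second,last" for every space-separated token whose first field is 0."""
    tokens = col_str.split(" ")[1:]  # assumption: the first token is the ID, not a sequence
    pairs = []
    for token in tokens:
        fields = re.split(r"[,:]", token)  # split on commas and on the colon
        if fields[0] == "0":
            pairs.append(f"{fields[1]},{fields[-1]}")
    return pairs

sample = ("0003C32C-FC1D-482F-B543-3CBD7F0A0E36 0,8,1,799,300:3 0,6,1,330,300:1 "
          "2,6,1,15861:1 0,7,1,734,300:1 0,6,0,95,300:1 2,7,1,15861:1 0,8,0,134,300:3")
print(extract_pairs(sample))
```

On the sample string this yields five pairs, one per token whose first field is 0; tokens such as `2,7,1,15861:1` are skipped.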
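I then want to turn the array of pairs into 9 columns, with the first number of each pair as the column and the second number as the value. If a number is missing, its column gets the value 0. For example: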
val df = Seq(("0003C32C-FC1D-482F-B543-3CBD7F0A0E36 0,8,1,799,300:3 0,6,1,330,300:1 2,6,1,15861:1 0,7,1,734,300:1 0,6,0,95,300:1 2,7,1,15861:1 0,8,0,134,300:3", 0, 0, 0, 0, 0, 0, 1, 1, 3)).toDF("col_str", "col0", "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8")
+--------------------+----+----+----+----+----+----+----+----+----+
| col_str|col0|col1|col2|col3|col4|col5|col6|col7|col8|
+--------------------+----+----+----+----+----+----+----+----+----+
|0003C32C-FC1D-482...| 0| 0| 0| 0| 0| 0| 1| 1| 3|
+--------------------+----+----+----+----+----+----+----+----+----+
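The pair-to-column step can be sketched in plain Python as well. This is only an illustration under stated assumptions: `pairs_to_columns` is a hypothetical helper, keys are assumed to lie in the range 0 to 8 as the question states, and a repeated key is assumed to keep its last value.

```python
def pairs_to_columns(pairs, n_cols=9):
    """Map "key,value" pairs to col0..col8, leaving 0 where a key never occurs."""
    row = {f"col{i}": 0 for i in range(n_cols)}  # start with every column at 0
    for pair in pairs:
        key, value = pair.split(",")
        row[f"col{key}"] = int(value)  # assumption: a later pair overwrites an earlier one
    return row

row = pairs_to_columns(["8,3", "6,1", "7,1", "6,1", "8,3"])
print(row)
```

For these pairs, `col6` and `col7` end up as 1, `col8` as 3, and the remaining columns stay at 0, which matches the example row.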
I don't mind whether the solution uses Scala or Python.
【Answer 1】: You can do the following (commented for clarity)
//define the input string
val str = """0003C32C-FC1D-482F-B543-3CBD7F0A0E36 0,8,1,799,300:3 0,6,1,330,300:1 2,6,1,15861:1 0,7,1,734,300:1 0,6,0,95,300:1 2,7,1,15861:1 0,8,0,134,300:3"""
//split the string on spaces
val splittedStr = str.split(" ")
//parse the split string: for the elements starting with 0, use the second element as the key and the last element as the value
val parsedStr = (List("col_str" -> splittedStr.head) ++ splittedStr.tail.filter(_.startsWith("0")).map { value => val splittedValue = value.split("[,:]"); ("col" + splittedValue(1)) -> splittedValue.last }).toMap
//expected header names
val expectedHeader = Seq("col_str", "col0", "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8")
//populating 0 for the missing header names in the parsed string in above step
val missedHeaderWithValue = expectedHeader.diff(parsedStr.keys.toSeq).map((_->"0")).toMap
//combining both the maps
val expectedKeyValues = parsedStr ++ missedHeaderWithValue
//converting to a dataframe (outside spark-shell this needs import spark.implicits._)
Seq(expectedDF(
  expectedKeyValues(expectedHeader(0)), expectedKeyValues(expectedHeader(1)),
  expectedKeyValues(expectedHeader(2)), expectedKeyValues(expectedHeader(3)),
  expectedKeyValues(expectedHeader(4)), expectedKeyValues(expectedHeader(5)),
  expectedKeyValues(expectedHeader(6)), expectedKeyValues(expectedHeader(7)),
  expectedKeyValues(expectedHeader(8)), expectedKeyValues(expectedHeader(9))
)).toDF().show(false)
This should give you
+------------------------------------+----+----+----+----+----+----+----+----+----+
|col_str |col0|col1|col2|col3|col4|col5|col6|col7|col8|
+------------------------------------+----+----+----+----+----+----+----+----+----+
|0003C32C-FC1D-482F-B543-3CBD7F0A0E36|0 |0 |0 |0 |0 |0 |1 |1 |3 |
+------------------------------------+----+----+----+----+----+----+----+----+----+
Of course, you need the expectedDF case class defined somewhere outside the scope in which it is used:
case class expectedDF(col_str: String, col0: String, col1: String, col2: String, col3: String, col4: String, col5: String, col6: String, col7: String, col8: String)
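One detail of the answer worth spelling out: building a map from key/value pairs keeps only the last value for a repeated key, so if the same second number occurs more than once with different last numbers, the later one wins. The sketch below mirrors that behaviour with a Python dict; the helper name `to_map_last_wins` and the pair values are invented for illustration and are not taken from the question's data.

```python
def to_map_last_wins(pairs):
    """Build a dict from (key, value) pairs; a repeated key keeps its last value."""
    result = {}
    for key, value in pairs:
        result[key] = value  # overwrites any earlier value stored under the same key
    return result

# Invented data: col6 appears twice with different values.
pairs = [("col8", "3"), ("col6", "1"), ("col6", "2")]
merged = to_map_last_wins(pairs)
print(merged)  # {'col8': '3', 'col6': '2'}
```

In the question's sample the repeated keys carry identical values, so this choice does not affect the result there.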