How to build a data.frame from text contained in one single column?
[Posted]: 2021-01-10 05:44:18
[Question]:
I need to reformat single-column row data into an R data.frame with proper columns. The raw data is a text file with 2 header lines followed by N data lines. Each line is stored as one string in a single column and needs to be split into separate columns. In short: how do I split the text of each row into the columns of a new data.frame?
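Example input:
The input contains two lines holding 1) the header names and 2) the sub-header names, followed by N lines, each of which is a single column containing all the data, e.g. RD,I,01,027,0001,88101,1,7,105,120,19990103,00:00....
Expected result:
A new R data.frame with column headers (taken from the header lines) and columns parsed from the single-column data rows. For example, one single-column data row looks like this: RD,I,01,027,0001,88101,1,7,105,120,19990103. That single string needs to be split into separate data columns.
Expected output: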
RD Action.Code State.Code County.Code Site.ID Parameter POC Sample.Duration Unit
1 RC Action Code State Code County Code Site ID Parameter POC Unit Method
2 RD I 01 027 0001 88101 1 7 105
3 RD I 01 027 0001 88101 1 7 105
4 RD I 01 027 0001 88101 1 7 105
5 RD I 01 027 0001 88101 1 7 105
6 RD I 01 027 0001 88101 1 7 105
Method Date Start.Time Sample.Value Null.Data.Code
1 Year Period Number of Samples Composite Type Sample Value
2 120 19990103 00:00 AS
3 120 19990106 00:00 AS
4 120 19990109 00:00 AS
5 120 19990112 00:00 8.841
6 120 19990115 00:00 14.92
Sampling.Frequency Monitor.Protocol..MP..ID Qualifier...1 Qualifier...2
1 Monitor Protocol (MP) ID Qualifier - 1 Qualifier - 2 Qualifier - 3
2 3
3 3
4 3
5 3
6 3
Qualifier...3 Qualifier...4 Qualifier...5 Qualifier...6 Qualifier...7
1 Qualifier - 4 Qualifier - 5 Qualifier - 6 Qualifier - 7 Qualifier - 8
2
3
4
5
6
Qualifier...8 Qualifier...9 Qualifier...10
1 Qualifier - 9 Qualifier - 10 Alternate Method Detectable Limit
2
3
4
5
6
Alternate.Method.Detectable.Limit Uncertainty year
1 Uncertainty NA 1999
2 NA 1999
3 NA 1999
4 NA 1999
5 NA 1999
6 NA 1999
>
DPUT:
> dput(pm_1999[1:20])
c(X..RD = "# RD,Action Code,State Code,County Code,Site ID,Parameter,POC,Sample Duration,Unit,Method,Date,Start Time,Sample Value,Null Data Code,Sampling Frequency,Monitor Protocol (MP) ID,Qualifier - 1,Qualifier - 2,Qualifier - 3,Qualifier - 4,Qualifier - 5,Qualifier - 6,Qualifier - 7,Qualifier - 8,Qualifier - 9,Qualifier - 10,Alternate Method Detectable Limit,Uncertainty",
Action.Code = "# RC,Action Code,State Code,County Code,Site ID,Parameter,POC,Unit,Method,Year,Period,Number of Samples,Composite Type,Sample Value,Monitor Protocol (MP) ID,Qualifier - 1,Qualifier - 2,Qualifier - 3,Qualifier - 4,Qualifier - 5,Qualifier - 6,Qualifier - 7,Qualifier - 8,Qualifier - 9,Qualifier - 10,Alternate Method Detectable Limit,Uncertainty",
State.Code = "RD,I,01,027,0001,88101,1,7,105,120,19990103,00:00,,AS,3,,,,,,,,,,,,,",
County.Code = "RD,I,01,027,0001,88101,1,7,105,120,19990106,00:00,,AS,3,,,,,,,,,,,,,",
Site.ID = "RD,I,01,027,0001,88101,1,7,105,120,19990109,00:00,,AS,3,,,,,,,,,,,,,",
Parameter = "RD,I,01,027,0001,88101,1,7,105,120,19990112,00:00,8.841,,3,,,,,,,,,,,,,",
POC = "RD,I,01,027,0001,88101,1,7,105,120,19990115,00:00,14.92,,3,,,,,,,,,,,,,",
Sample.Duration = "RD,I,01,027,0001,88101,1,7,105,120,19990118,00:00,3.878,,3,,,,,,,,,,,,,",
Unit = "RD,I,01,027,0001,88101,1,7,105,120,19990121,00:00,9.042,,3,,,,,,,,,,,,,",
Method = "RD,I,01,027,0001,88101,1,7,105,120,19990124,00:00,5.464,,3,,,,,,,,,,,,,",
Date = "RD,I,01,027,0001,88101,1,7,105,120,19990127,00:00,20.17,,3,,,,,,,,,,,,,",
Start.Time = "RD,I,01,027,0001,88101,1,7,105,120,19990130,00:00,11.56,,3,,,,,,,,,,,,,",
Sample.Value = "RD,I,01,027,0001,88101,1,7,105,120,19990202,00:00,13.68,,3,,,,,,,,,,,,,",
Null.Data.Code = "RD,I,01,027,0001,88101,1,7,105,120,19990205,00:00,7.251,,3,,,,,,,,,,,,,",
Sampling.Frequency = "RD,I,01,027,0001,88101,1,7,105,120,19990208,00:00,11.47,,3,,,,,,,,,,,,,",
Monitor.Protocol..MP..ID = "RD,I,01,027,0001,88101,1,7,105,120,19990211,00:00,13.46,,3,,,,,,,,,,,,,",
Qualifier...1 = "RD,I,01,027,0001,88101,1,7,105,120,19990214,00:00,46.20,,3,,,,,,,,,,,,,",
Qualifier...2 = "RD,I,01,027,0001,88101,1,7,105,120,19990217,00:00,11.25,,3,,,,,,,,,,,,,",
Qualifier...3 = "RD,I,01,027,0001,88101,1,7,105,120,19990220,00:00,,AN,3,,,,,,,,,,,,,",
Qualifier...4 = "RD,I,01,027,0001,88101,1,7,105,120,19990223,00:00,,AN,3,,,,,,,,,,,,,"
)
>
A view of this object in the RStudio environment is shown in the image below.
[Comments]:
You can edit your question, add a data sample using dput(pm_1999[1:20]), and paste the output into your question!
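[Answer 1]:
Split each string on commas, then rbind the resulting list elements together as rows. We can assign the headers from the first row and then remove that row from the data frame.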
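# Split each string on commas (gives a list of character vectors),
# then bind the list elements together as the rows of a data frame
df <- do.call(rbind.data.frame, strsplit(pm_1999, ','))
# Assign headers from the 1st row of the data (as a plain character vector)
names(df) <- as.character(unlist(df[1, ]))
# Remove the 1st row from the data
df <- df[-1, ]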
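One thing to watch with the approach above: `strsplit()` drops a trailing empty field, so a data line that ends in a comma yields one element fewer than a header line that does not, and the rows can come out ragged. As an alternative sketch (not part of the original answer), the lines can be handed to `read.csv()` through its `text` argument, which keeps empty trailing fields. The shortened `pm_1999` below is a made-up stand-in for the vector in the question, and the choice to take names from the first header line and discard the second is an assumption.

```r
# Hypothetical, shortened stand-in for the question's pm_1999 vector:
# two "# "-prefixed header lines followed by comma-separated data lines.
pm_1999 <- c(
  "# RD,Action Code,State Code,Date,Sample Value,Null Data Code,Uncertainty",
  "# RC,Action Code,State Code,Year,Period,Sample Value",
  "RD,I,01,19990103,,AS,",
  "RD,I,01,19990112,8.841,,"
)

# Column names from the first header line, with the leading "# " removed
header <- strsplit(sub("^# ", "", pm_1999[1]), ",", fixed = TRUE)[[1]]

# Parse the remaining lines as CSV text.
# colClasses = "character" keeps codes such as "01" from losing leading zeros;
# read.csv() converts names like "Action Code" to "Action.Code" by default.
df <- read.csv(
  text       = pm_1999[-(1:2)],
  header     = FALSE,
  col.names  = header,
  colClasses = "character"
)

str(df)
```

Every column stays character here, so numeric columns such as `Sample.Value` would still need an explicit conversion afterwards.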