How to build a data.frame from text contained in a single column?

【Posted】2021-01-10 05:44:18 【Question】:

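The problem is to reformat single-column row data into an R data.frame with proper columns, so that each row's data is stored in separate columns. The raw data is a text file with 2 header lines followed by N data lines. Each data line sits in a single column and needs to be split into separate columns. In short: how do I split the text in that single column into the columns of a new data.frame?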

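Input example:

For example, the input consists of two lines holding 1) the header names and 2) the sub-header names, followed by N lines that each hold all of the data in a single column, e.g. RD,I,01,027,0001,88101,1,7,105,120,19990103,00:00....

Output expected

A new R data.frame whose column headers come from the (sub-)header line and whose columns are parsed from the single-column data rows. For example, one single-column data row looks like this: RD,I,01,027,0001,88101,1,7,105,120,19990103. This single-column data needs to be split into separate data columns.

Expected output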

  RD Action.Code State.Code County.Code Site.ID Parameter POC Sample.Duration   Unit
1 RC Action Code State Code County Code Site ID Parameter POC            Unit Method
2 RD           I         01         027    0001     88101   1               7    105
3 RD           I         01         027    0001     88101   1               7    105
4 RD           I         01         027    0001     88101   1               7    105
5 RD           I         01         027    0001     88101   1               7    105
6 RD           I         01         027    0001     88101   1               7    105
  Method     Date        Start.Time   Sample.Value Null.Data.Code
1   Year   Period Number of Samples Composite Type   Sample Value
2    120 19990103             00:00                            AS
3    120 19990106             00:00                            AS
4    120 19990109             00:00                            AS
5    120 19990112             00:00          8.841               
6    120 19990115             00:00          14.92               
        Sampling.Frequency Monitor.Protocol..MP..ID Qualifier...1 Qualifier...2
1 Monitor Protocol (MP) ID            Qualifier - 1 Qualifier - 2 Qualifier - 3
2                        3                                                     
3                        3                                                     
4                        3                                                     
5                        3                                                     
6                        3                                                     
  Qualifier...3 Qualifier...4 Qualifier...5 Qualifier...6 Qualifier...7
1 Qualifier - 4 Qualifier - 5 Qualifier - 6 Qualifier - 7 Qualifier - 8
2                                                                      
3                                                                      
4                                                                      
5                                                                      
6                                                                      
  Qualifier...8  Qualifier...9                    Qualifier...10
1 Qualifier - 9 Qualifier - 10 Alternate Method Detectable Limit
2                                                               
3                                                               
4                                                               
5                                                               
6                                                               
  Alternate.Method.Detectable.Limit Uncertainty year
1                       Uncertainty          NA 1999
2                                            NA 1999
3                                            NA 1999
4                                            NA 1999
5                                            NA 1999
6                                            NA 1999
> 

DPUT:

> dput(pm_1999[1:20])
c(X..RD = "# RD,Action Code,State Code,County Code,Site ID,Parameter,POC,Sample Duration,Unit,Method,Date,Start Time,Sample Value,Null Data Code,Sampling Frequency,Monitor Protocol (MP) ID,Qualifier - 1,Qualifier - 2,Qualifier - 3,Qualifier - 4,Qualifier - 5,Qualifier - 6,Qualifier - 7,Qualifier - 8,Qualifier - 9,Qualifier - 10,Alternate Method Detectable Limit,Uncertainty", 
Action.Code = "# RC,Action Code,State Code,County Code,Site ID,Parameter,POC,Unit,Method,Year,Period,Number of Samples,Composite Type,Sample Value,Monitor Protocol (MP) ID,Qualifier - 1,Qualifier - 2,Qualifier - 3,Qualifier - 4,Qualifier - 5,Qualifier - 6,Qualifier - 7,Qualifier - 8,Qualifier - 9,Qualifier - 10,Alternate Method Detectable Limit,Uncertainty", 
State.Code = "RD,I,01,027,0001,88101,1,7,105,120,19990103,00:00,,AS,3,,,,,,,,,,,,,", 
County.Code = "RD,I,01,027,0001,88101,1,7,105,120,19990106,00:00,,AS,3,,,,,,,,,,,,,", 
Site.ID = "RD,I,01,027,0001,88101,1,7,105,120,19990109,00:00,,AS,3,,,,,,,,,,,,,", 
Parameter = "RD,I,01,027,0001,88101,1,7,105,120,19990112,00:00,8.841,,3,,,,,,,,,,,,,", 
POC = "RD,I,01,027,0001,88101,1,7,105,120,19990115,00:00,14.92,,3,,,,,,,,,,,,,", 
Sample.Duration = "RD,I,01,027,0001,88101,1,7,105,120,19990118,00:00,3.878,,3,,,,,,,,,,,,,", 
Unit = "RD,I,01,027,0001,88101,1,7,105,120,19990121,00:00,9.042,,3,,,,,,,,,,,,,", 
Method = "RD,I,01,027,0001,88101,1,7,105,120,19990124,00:00,5.464,,3,,,,,,,,,,,,,", 
Date = "RD,I,01,027,0001,88101,1,7,105,120,19990127,00:00,20.17,,3,,,,,,,,,,,,,", 
Start.Time = "RD,I,01,027,0001,88101,1,7,105,120,19990130,00:00,11.56,,3,,,,,,,,,,,,,", 
Sample.Value = "RD,I,01,027,0001,88101,1,7,105,120,19990202,00:00,13.68,,3,,,,,,,,,,,,,", 
Null.Data.Code = "RD,I,01,027,0001,88101,1,7,105,120,19990205,00:00,7.251,,3,,,,,,,,,,,,,", 
Sampling.Frequency = "RD,I,01,027,0001,88101,1,7,105,120,19990208,00:00,11.47,,3,,,,,,,,,,,,,", 
Monitor.Protocol..MP..ID = "RD,I,01,027,0001,88101,1,7,105,120,19990211,00:00,13.46,,3,,,,,,,,,,,,,", 
Qualifier...1 = "RD,I,01,027,0001,88101,1,7,105,120,19990214,00:00,46.20,,3,,,,,,,,,,,,,", 
Qualifier...2 = "RD,I,01,027,0001,88101,1,7,105,120,19990217,00:00,11.25,,3,,,,,,,,,,,,,", 
Qualifier...3 = "RD,I,01,027,0001,88101,1,7,105,120,19990220,00:00,,AN,3,,,,,,,,,,,,,", 
Qualifier...4 = "RD,I,01,027,0001,88101,1,7,105,120,19990223,00:00,,AN,3,,,,,,,,,,,,,"
)
> 
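
To see which of the two header lines matches the data rows, one can count the comma-separated fields per line. This is a minimal sketch using three lines copied from the dput output above; the names `sample_lines`, `n_commas` and `n_fields` are made up for illustration and are not part of the original question.

```r
# Three lines copied from the dput() output above:
# the "# RD" header, the "# RC" header, and the first data row.
sample_lines <- c(
  "# RD,Action Code,State Code,County Code,Site ID,Parameter,POC,Sample Duration,Unit,Method,Date,Start Time,Sample Value,Null Data Code,Sampling Frequency,Monitor Protocol (MP) ID,Qualifier - 1,Qualifier - 2,Qualifier - 3,Qualifier - 4,Qualifier - 5,Qualifier - 6,Qualifier - 7,Qualifier - 8,Qualifier - 9,Qualifier - 10,Alternate Method Detectable Limit,Uncertainty",
  "# RC,Action Code,State Code,County Code,Site ID,Parameter,POC,Unit,Method,Year,Period,Number of Samples,Composite Type,Sample Value,Monitor Protocol (MP) ID,Qualifier - 1,Qualifier - 2,Qualifier - 3,Qualifier - 4,Qualifier - 5,Qualifier - 6,Qualifier - 7,Qualifier - 8,Qualifier - 9,Qualifier - 10,Alternate Method Detectable Limit,Uncertainty",
  "RD,I,01,027,0001,88101,1,7,105,120,19990103,00:00,,AS,3,,,,,,,,,,,,,"
)

# Number of fields = number of commas + 1.
# gregexpr() locates every comma, regmatches() extracts them,
# and lengths() counts the matches per line.
n_commas <- lengths(regmatches(sample_lines,
                               gregexpr(",", sample_lines, fixed = TRUE)))
n_fields <- n_commas + 1
print(n_fields)
```

If the "# RD" header and the data row report the same field count while the "# RC" header differs, the RD header is the one that describes these rows, and the RC line can be skipped when building the data.frame.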

The view of this object in the RStudio environment is shown in the image below.

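【Comments】:

You can edit your question, add a data sample with dput(pm_1999[1:20]), and paste the output into your question! 【Answer 1】:

Split each string on commas, then rbind the resulting list into rows. We can assign the headers from the first row and then drop that row from the data frame.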

#Split the data on comma and create a list
df <- do.call(rbind.data.frame, strsplit(pm_1999, ','))
#Assign headers from 1st row of the data
names(df) <- df[1, ]
#Remove the 1st row from the data
df <- df[-1, ]
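
One caveat with the approach above: `strsplit()` drops a trailing empty string, so rows that end in a comma can come out one field shorter than the header. As a hedged alternative (not the answerer's method), the lines can be handed to `read.csv()` through its `text` argument. The sketch below assumes the data rows follow the "# RD" header and uses a few lines copied from the question's dput output; the names `pm_lines`, `header_line`, `header` and `data_lines` are illustrative only.

```r
# A few lines copied from the question's dput() output:
# the "# RD" header, the "# RC" header, and two data rows.
pm_lines <- c(
  "# RD,Action Code,State Code,County Code,Site ID,Parameter,POC,Sample Duration,Unit,Method,Date,Start Time,Sample Value,Null Data Code,Sampling Frequency,Monitor Protocol (MP) ID,Qualifier - 1,Qualifier - 2,Qualifier - 3,Qualifier - 4,Qualifier - 5,Qualifier - 6,Qualifier - 7,Qualifier - 8,Qualifier - 9,Qualifier - 10,Alternate Method Detectable Limit,Uncertainty",
  "# RC,Action Code,State Code,County Code,Site ID,Parameter,POC,Unit,Method,Year,Period,Number of Samples,Composite Type,Sample Value,Monitor Protocol (MP) ID,Qualifier - 1,Qualifier - 2,Qualifier - 3,Qualifier - 4,Qualifier - 5,Qualifier - 6,Qualifier - 7,Qualifier - 8,Qualifier - 9,Qualifier - 10,Alternate Method Detectable Limit,Uncertainty",
  "RD,I,01,027,0001,88101,1,7,105,120,19990103,00:00,,AS,3,,,,,,,,,,,,,",
  "RD,I,01,027,0001,88101,1,7,105,120,19990112,00:00,8.841,,3,,,,,,,,,,,,,"
)

# Assumption: the "# RD" header line describes the data rows.
header_line <- pm_lines[startsWith(pm_lines, "# RD")][1]
# Strip the leading "# " and split the header into column names.
header <- strsplit(sub("^# ", "", header_line), ",", fixed = TRUE)[[1]]

# Data rows are the lines that do not start with "#".
data_lines <- pm_lines[!startsWith(pm_lines, "#")]

# Parse the data rows as CSV. colClasses = "character" keeps codes
# such as "01" and "027" intact instead of converting them to numbers.
df <- read.csv(
  text = data_lines,
  header = FALSE,
  col.names = make.names(header, unique = TRUE),
  colClasses = "character"
)

# Optional type conversion for the measurement and date columns;
# empty sample values become NA.
df$Sample.Value <- as.numeric(df$Sample.Value)
df$Date <- as.Date(df$Date, format = "%Y%m%d")

str(df[, c("State.Code", "Date", "Sample.Value", "Null.Data.Code")])
```

With the full data, `pm_lines` would be replaced by the character vector from the question (for example `unname(pm_1999)`), assuming every non-comment line is an RD row.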
