R: Identify and remove columns with invalid column names

【Title】R: Identify and remove columns with invalid column names 【Posted】2018-02-07 14:13:24 【Question】:

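Is there a way to identify invalid column names in R, perhaps with a regular expression or some other technique?

I am generating a DocumentTermMatrix (DTM) from a text column and then converting that DTM to a data frame. I end up with columns whose names are invalid. For example: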

"node" "CLASS" "️️️️" "️️️" "de" "des" " je devais" "夜" "她的眼睛" " cpas chaud" "郁郁葱葱的化妆品" " 我看到了"

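When I pass this dataset to mlr::makeClassifTask, I get the following error message:

Error in makeClassifTask(data = dat, target = "CLASS") : Assertion on 'data' failed: Columns must be named according to R's variable naming rules.

So I would like to identify and remove every column whose name is invalid. Something like: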

invalidColumnNames <- identify indexes of columns with invalid names
dat <- dat[,-invalidColumnNames]

Data for a reproducible example:

cols <- c("node", "CLASS", "️️️️", "️️️", " de", " des", 
" kmh", " points", " zéro", "\u2615️\u2615️", "\u2615️", 
"\u2693️\u2693️", "\u26f5️\u2693️", "\u2728\u2728\u2728\u2728\u2728", 
"aaliassime", "aaron", "abaixoassinado", "abandono", "abat", 
"abattu", "abiertamente", "abierto", "abit", "able", "abomination", 
"abonnements", "abonnés", "abonnez", "abraham", "absolutely", 
"abstract", "abused", "acaba", "acabar", "acabo", "acadiebathurst", 
"acaï", "acc", "accept", "accèsloisirs", "access", "accessible", 
"accessories", "accident", "accidentally", "acción", "acciones", 
"accommodationsreligious", "accompli", "accomplie", "accomplir", 
"accorde", "accordent", "account", "accounts", "accro", "accueil", 
"accueille", "accueillir", "accurate", "accusé", "accusent", 
"acérées", "acériculteur", "acha", "achat", "achei", "acheté", 
"acheter", "acho", "acidités", "acknowledge", "acontecem", "acordei", 
"acquis", "across", "action", "activité", "activités", "actresses", 
"actualité", "actuel", "adam", "adaptation", "adapter", "added", 
"addicive", "addicted", "addition", "additives", "addressed", 
"adds", "adeus", "adjoint", "adjointeadministrative", "adjust", 
"administratives", "adopción", "adopté", "adorable")

Desired result:

"node", "CLASS", " de", " des", 
" kmh", " points", " zéro", "aaliassime", "aaron", 
"abaixoassinado", "abandono", "abat", 
"abattu", "abiertamente", "abierto", "abit", "able", "abomination", 
"abonnements", "abonnés", "abonnez", "abraham", "absolutely", 
"abstract", "abused", "acaba", "acabar", "acabo", "acadiebathurst", 
"acaï", "acc", "accept", "accèsloisirs", "access", "accessible", 
"accessories", "accident", "accidentally", "acción", "acciones", 
"accommodationsreligious", "accompli", "accomplie", "accomplir", 
"accorde", "accordent", "account", "accounts", "accro", "accueil", 
"accueille", "accueillir", "accurate", "accusé", "accusent", 
"acérées", "acériculteur", "acha", "achat", "achei", "acheté", 
"acheter", "acho", "acidités", "acknowledge", "acontecem", "acordei", 
"acquis", "across", "action", "activité", "activités", "actresses", 
"actualité", "actuel", "adam", "adaptation", "adapter", "added", 
"addicive", "addicted", "addition", "additives", "addressed", 
"adds", "adeus", "adjoint", "adjointeadministrative", "adjust", 
"administratives", "adopción", "adopté", "adorable"

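Any help is much appreciated.

【Comments】:

Your column names all seem to work for me. See here for the restrictions on variable naming in R: ***.com/questions/9195718/…

【Answer 1】:

Maybe you can try this new package: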

library(janitor)
# clean_names() rewrites the column names into a consistent, syntactically valid form
newdataobject <- read.csv("yourcsvfilewithpath.csv", header = TRUE) %>%
  clean_names()
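If adding a package is not an option, a similar rename-instead-of-delete effect can be sketched in base R with make.names(). This is a swapped-in technique, not a description of what clean_names() does internally, and the data frame `dat` below is a made-up example rather than the asker's data.

```r
# Hypothetical data frame with some syntactically invalid column names
dat <- data.frame(1:2, 3:4, 5:6, 7:8)
names(dat) <- c("node", "a b", "1abc", "a.b")

# make.names() rewrites invalid names; unique = TRUE also de-duplicates
# names that collide after rewriting (here "a b" turns into "a.b")
names(dat) <- make.names(names(dat), unique = TRUE)
names(dat)
```

Renaming keeps every column, so it only fits when the columns with odd names still carry useful information; otherwise dropping them, as the question asks, may be the better choice.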

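【Comments】:

This is a nice workaround. I can see that the columns with symbols are being renamed. I hoped this would resolve the error in mlr::makeClassifTask, but for some reason the error persists.

【Answer 2】:

See ?make.names for this kind of situation. I would also strip the whitespace at the start and end of the names, so: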

cols <- trimws(cols)
cols[make.names(cols)==cols]

# [1] "node"  "CLASS"   "de"    "des"                    
# [5] "kmh"   "points"  "zéro"  "aaliassime" ...
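To apply the same idea to the data frame itself rather than to the vector of names, something along these lines may work. It is only a sketch: `drop_invalid_columns` and the small `dat` are hypothetical names introduced for illustration, and the validity test is simply "make.names() leaves the name unchanged".

```r
# Hypothetical helper: trim the names, then keep only syntactically valid columns
drop_invalid_columns <- function(dat) {
  # Strip leading/trailing whitespace first, as in the answer above
  cleaned <- trimws(names(dat))
  # A name counts as valid when make.names() returns it unchanged
  valid <- make.names(cleaned) == cleaned
  names(dat) <- cleaned
  # Logical indexing with drop = FALSE always returns a data frame
  dat[, valid, drop = FALSE]
}

# Made-up example data
dat <- data.frame(1:2, 3:4, 5:6, 7:8)
names(dat) <- c("node", " de", "1abc", "a b")
res <- drop_invalid_columns(dat)
names(res)
```

Logical indexing is used instead of `dat[, -invalidColumnNames]` because a negative index built from an empty vector selects no columns at all, which would silently empty the data frame when every name is already valid. Note that trimming can create duplicate names (for example " de" and "de"), which this sketch does not resolve.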

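【Comments】:

Unfortunately, I still get the error from mlr::makeClassifTask. Looking at the restrictions on valid names in the link Tim shared, I wonder whether I might have column names that are actually reserved words.

@zunman - points seems to be the only problematic one.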
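On the reserved-word question, a quick check is possible with the same make.names() comparison, since make.names() appends a "." to reserved words. The `candidates` vector below is invented for illustration and is not taken from the asker's data.

```r
# Hypothetical mix of reserved words and ordinary words
candidates <- c("if", "function", "TRUE", "points", "node")

# Reserved words come back with a trailing ".", so they no longer match the input
is_valid <- make.names(candidates) == candidates
reserved_like <- candidates[!is_valid]
reserved_like
```

Under this check an ordinary word such as `points` is left unchanged, so if a task constructor still rejects it, the cause is presumably something other than R's syntactic naming rules (for instance a stray invisible character in the actual column name).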
