Reading a delimited file where semicolons appear both as the delimiter and inside strings
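I am trying to read a file in which some rows contain extra semicolons inside the text strings (I don't know what caused this).
For example, here is a heavily simplified data set with the same problem: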
bad_data <- "100; Mc Donalds; Seattle; normal day
115; Starbucks; Boston; normal day
400; PF Chang; Chicago; busy day
400;; Texas; busy day
10; D;unkin Donuts; Washin;gton; lazy day"
The data has no header row, so I tried reading it with:
library(data.table)
fread(bad_data, sep = ";", header = FALSE, na.strings = c("", NA), strip.white = TRUE)
But no luck. The malformed rows are close to impossible to parse, so if there is no clean solution I would simply like to skip them.
Answer
If you just want to drop the rows that don't have the expected number of delimiters:
library(stringi)
library(magrittr)
bad_data <-
"100; Mc Donalds; Seattle; normal day
115; Starbucks; Boston; normal day
400; PF Chang; Chicago; busy day
400;; Texas; busy day
10; D;unkin Donuts; Washin;gton; lazy day"
# split into lines; you could also use readLines() if the data comes from a file
text_lines <- unlist(strsplit(bad_data, '\n'))
# which lines contain the expected number of semicolons?
good_lines <- sapply(text_lines, function(x) stri_count_fixed(x, ';') == 3)
# for those lines, split to vectors and (optional bonus) trim whitespace
good_vectors <- lapply(
text_lines[good_lines],
function(x) x %>% strsplit(';') %>% unlist %>% trimws)
# flatten to matrix (from which you can make a data.frame or whatever you want)
my_mat <- do.call(rbind, good_vectors)
Result:
> my_mat
     [,1]  [,2]         [,3]      [,4]
[1,] "100" "Mc Donalds" "Seattle" "normal day"
[2,] "115" "Starbucks"  "Boston"  "normal day"
[3,] "400" "PF Chang"   "Chicago" "busy day"
[4,] "400" ""           "Texas"   "busy day"
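The same filter can also be sketched in base R only, going straight to a data frame instead of a matrix. This is a minimal sketch, not the answer's own code: the names `expected_seps`, `n_seps`, `good_lines` and `clean` are made up for illustration, and it assumes that every well-formed row has exactly four fields.

```r
# Same toy data as in the question.
bad_data <- "100; Mc Donalds; Seattle; normal day
115; Starbucks; Boston; normal day
400; PF Chang; Chicago; busy day
400;; Texas; busy day
10; D;unkin Donuts; Washin;gton; lazy day"

# Assumption: a well-formed row has 4 fields, hence 3 semicolons.
expected_seps <- 3

text_lines <- strsplit(bad_data, "\n", fixed = TRUE)[[1]]

# Count the semicolons on each line (0 when a line has none).
n_seps <- lengths(regmatches(text_lines, gregexpr(";", text_lines, fixed = TRUE)))

# Keep only the lines with the expected number of delimiters.
good_lines <- text_lines[n_seps == expected_seps]

# Parse the surviving lines; quoting and comment handling are switched off
# so that stray quotes or '#' characters are read literally.
clean <- read.table(
  text = good_lines, sep = ";", header = FALSE,
  strip.white = TRUE, na.strings = "",
  quote = "", comment.char = "", stringsAsFactors = FALSE
)
```

Since the question uses data.table, the filtered lines could presumably be handed to `fread()` through its `text` argument instead of `read.table()`; that variant is not shown here.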
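Another answer
You could try removing every semicolon that sits inside a text string (this assumes that all unwanted semicolons are fully inside a string, i.e. have a non-whitespace character on both sides):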
gsub("(\\S);(\\S)", "\\1\\2", bad_data, perl=TRUE)
[1] "100; Mc Donalds; Seattle; normal day
115; Starbucks; Boston; normal day
400; PF Chang; Chicago; busy day
400; Texas; busy day
10; Dunkin Donuts; Washington; lazy day"
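Note that in the output above the row `400;; Texas; busy day` has lost its empty second field: `\S` also matches a semicolon, so the pair `;;` is treated as a semicolon inside a string. A possible variant, sketched here under the same assumption that real delimiters are always followed by a space or by another delimiter, uses lookarounds that exclude semicolons. The names `pattern` and `fixed_data` are illustrative only.

```r
# Same toy data as in the question.
bad_data <- "100; Mc Donalds; Seattle; normal day
115; Starbucks; Boston; normal day
400; PF Chang; Chicago; busy day
400;; Texas; busy day
10; D;unkin Donuts; Washin;gton; lazy day"

# Match a semicolon only when the characters on both sides are neither
# whitespace nor another semicolon. Lookarounds do not consume the
# neighbours, so the replacement is simply the empty string.
pattern <- "(?<=[^\\s;]);(?=[^\\s;])"

fixed_data <- gsub(pattern, "", bad_data, perl = TRUE)

# cat() prints the repaired text with real line breaks.
cat(fixed_data, "\n")
```

With this pattern the `;;` in the fourth row is left alone, so every row keeps three delimiters and can be read with `sep = ";"` as usual. It still cannot repair a stray semicolon that happens to be next to a space.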