Separation of letter-number values inside a column based on a letter
【Posted】: 2017-03-23 12:13:14
【Question】: I am having trouble separating values inside a column.
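The data in that column are codes for device locations, such as SE005, H0002 or MANA. S marks a mobile device, and the letter after it says where the device is used.
SE005 is the fifth mobile device at location E.
H0002 is fixed device number 2 at location H.
MANA is the device for a single place.
For my analysis in Power BI I need to know how many articles are scanned per device location, and I do not care exactly which device it was. Because Power BI cannot summarise per device location (the location is part of a combined value inside the column), I want to split the column up.
I would like it to look like this.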
                 v1   v2     v3
SE005  becomes   S    E      005   # 2 separations
H0002  becomes        H      005   # 1 separation and one deleted number
MANA                  MANA         # R should not change this, but it should be in the same column as E and H
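I have to apply this to 8 million rows, and I think it has to be done in two or three steps, starting with separating the letters from the digits. Note that there are more letters than shown in the preview, but the arrangement is the same. Any help is appreciated.
Edit
I only want to split the device column so that Power BI can use it.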
art <- c(1:100)
device <-c("SE05", "H005", "E003", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005",
"SE05", "H005", "E003", "MANA", "J012", "X01", "W007", "MANA", "SE02", "H005",
"SE05", "H007", "E003", "MANA", "J012", "X02", "W007", "MANA", "SE02", "H005",
"SE05", "H008", "E004", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005",
"SE05", "H005", "E003", "MANA", "J012", "X017", "W007", "MANA", "SE02", "H005",
"SE05", "H0010", "E008", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005",
"SE05", "H005", "E003", "MANA", "J012", "X009", "W007", "MANA", "SE02", "H005",
"SE05", "H0010", "E0010", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005",
"SE05", "H005", "E003", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005",
"SE05", "H009", "E003", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005")
ACCEPT <- as.data.frame(art)
ACCEPT$device <- device
head(BLABLA)
Article device V3 V4
1 52290 SE05 20170223 162756
2 52300 SE05 20170223 162758
3 10090 SE05 20170223 162831
4 10060 SE08 20170223 162834
5 10070 SE08 20170223 162839
6 10070 SE08 20170223 162859
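【Comments】:
Related: ***.com/questions/3003527/…
Thanks, but I have already looked at some of those. They all split on one specific character, such as a dot or a dash. If I want to use that, do I need to type in every letter that has to be separated?
What you need is sub and regular expressions. It is in the first answer. You want to put the digits in v3, the letters before the digits in v1 and v2, and letters without digits in v2 only.
The first answer uses \\ and dots and other things, and it does not work if I fill in letters instead. So I don't understand what you mean.
Have you considered looking at the help for sub and regexp?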
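The suggestion in the comments above (sub plus a regular expression with a capture group) can be sketched in base R as follows. This is only a minimal sketch: it assumes every code is made of upper-case letters optionally followed by digits, and the names codes, alpha_part and num_part are made up for illustration.

```r
# Illustrative codes taken from the examples in the question
codes <- c("SE005", "H0002", "MANA")

# Leading letters: capture everything before the first digit
alpha_part <- sub("^([A-Z]+)[0-9]*$", "\\1", codes)

# Trailing digits: codes without digits give an empty string
num_part <- sub("^[A-Z]+([0-9]*)$", "\\1", codes)

data.frame(code = codes, alpha = alpha_part, num = num_part,
           stringsAsFactors = FALSE)
```

The pattern describes the shape of a code (letters, then digits), so there is no need to list every location letter individually.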
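【Answer 1】:
Try this site to get a better understanding of regular expressions and how to apply them to your case. Without a reproducible example it is hard to understand your specific situation and the corner cases you might run into. Hopefully my example below helps you get started:
Edit: changed my answer to use your example dataset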
art <- c(1:100)
device <-c("SE05", "H005", "E003", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005",
"SE05", "H005", "E003", "MANA", "J012", "X01", "W007", "MANA", "SE02", "H005",
"SE05", "H007", "E003", "MANA", "J012", "X02", "W007", "MANA", "SE02", "H005",
"SE05", "H008", "E004", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005",
"SE05", "H005", "E003", "MANA", "J012", "X017", "W007", "MANA", "SE02", "H005",
"SE05", "H0010", "E008", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005",
"SE05", "H005", "E003", "MANA", "J012", "X009", "W007", "MANA", "SE02", "H005",
"SE05", "H0010", "E0010", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005",
"SE05", "H005", "E003", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005",
"SE05", "H009", "E003", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005")
ACCEPT <- as.data.frame(art)
ACCEPT$device <- device
library(tidyverse)
library(magrittr)  # provides the %<>% assignment pipe
library(stringr)
# Find mobile devices
# '^' for start of string
# '[\\D]' for any non-numeric character
# '{2}' for exactly two of them
# '[\\d]{1}' for one digit directly after them
ACCEPT %<>% mutate(mobile = str_detect(device, pattern = '^[\\D]{2}[\\d]{1}'))
# Now looking for exactly one letter at the start,
# followed by a number
ACCEPT %<>% mutate(immobile = str_detect(device, pattern = '^[\\D]{1}[\\d]{1}'))
# Finally, look for "no numbers"
# (alternatively, if all places have the same value, '== "MANA"' would do)
ACCEPT %<>% mutate(place = !str_detect(device, pattern = '\\d'))
# Split and process device types individually
bind_rows(ACCEPT %>%
            filter(mobile) %>%
            mutate(v1 = str_extract(device, pattern = '[^\\d]{1}'),
                   v2 = str_sub(device, start = 2, end = 2),
                   v3 = str_extract(device, pattern = '\\d{1,9}')),
          ACCEPT %>%
            filter(immobile) %>%
            mutate(v1 = '',
                   v2 = str_sub(device, start = 1, end = 1),
                   v3 = str_extract(device, pattern = '\\d{1,9}')),
          ACCEPT %>%
            filter(place) %>%
            mutate(v1 = '',
                   v2 = device,
                   v3 = '')) %>%
  arrange(art) %>%
  select(art, v1, v2, v3)
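Because the three groups are filtered separately and then stacked, any code that matches none of the patterns is silently dropped, and a code that matches more than one is duplicated. A minimal sketch of a check for this, written in base R on a small hypothetical vector device_codes (the value "12AB" is an invented code, added only to show a miss):

```r
# Hypothetical sample; "12AB" is invented to demonstrate an unmatched code
device_codes <- c("SE05", "H005", "MANA", "X01", "12AB")

# Same three categories as above, expressed with base R grepl()
mobile   <- grepl("^[^0-9]{2}[0-9]", device_codes)  # two non-digits, then a digit
immobile <- grepl("^[^0-9][0-9]", device_codes)     # one non-digit, then a digit
place    <- !grepl("[0-9]", device_codes)           # no digits at all

# Each code should fall into exactly one category
n_match <- mobile + immobile + place
device_codes[n_match != 1]
```

If the last line returns anything, those codes need their own rule before the split is applied to the full data.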
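【Comments】:
I don't know why, but the final result has fewer records with this code.
Oops, you're right! I had a bug in ACCEPT %<>% mutate(mobile = str_detect(device, pattern = '^[\\D]{2}')); it should be ACCEPT %<>% mutate(mobile = str_detect(device, pattern = '^[\\D]{2}[\\d]{1}')). Answer updated accordingly.
Perfect, with this code I no longer lose records. I was not planning to have the mobile, immobile and place columns, but if they are needed to get what I want, that is fine. I am still looking for the split of the full code SE02 into S, E and 02. Your first code did that, but I lost records.
I think precomputing the device type used for the split will scale better to larger datasets. You can drop the extra columns by adding select(art, v1, v2, v3) at the end. I have now added that to the code in my answer. Regarding splitting SE02 into S, E and 02: with the example above this works for me. What exactly is not the way you want it?
Does this do what you need now? If so, please accept it as the answer to your question :)
【Answer 2】: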
Here is a slightly shorter version that does not use dplyr.
# v2 gets all of 'device' so long as this is entirely alphabetical:
ACCEPT$v2 <- ifelse(grepl('^[A-Z]+$', ACCEPT$device), ACCEPT$device, NA)
# v3 gets the number, if there is one - we check by seeing if v2 is NA
ACCEPT$v3 <- ifelse(is.na(ACCEPT$v2), sub('\\D+(\\d+)', '\\1', ACCEPT$device), NA)
# now v1 and v2 will get the first two letters,
# but only if v2 hasn't already been filled out:
ACCEPT$v1[is.na(ACCEPT$v2)] <- substr(ACCEPT$device[is.na(ACCEPT$v2)], 1, 1)
ACCEPT$v2[is.na(ACCEPT$v2)] <- substr(ACCEPT$device[is.na(ACCEPT$v2)], 2, 2)
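【Comments】:
No data is lost, but the result is wrong, because only "S" means mobile and should be in a separate column, not "E". With your code E is split off and S ends up with the other, immobile places.
Ah, so V1 should only contain S? And H0002 becomes H 005? Where does that 005 come from? Or is it a typo in your question and you meant that it becomes H 002?
Yes, S should be V1 (it says that it is a mobile device, used at location x, which is given by the second letter of a code starting with S), all other letters (locations) go in V2, and the digits in V3. And yes, that is a typo in my question: H0002 should become H and 002.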
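The requirement as clarified in the last comment (v1 is only ever S, the location letter goes in v2, the digits in v3, and all-letter codes such as MANA go in v2 unchanged) could be sketched in base R as below. This is not either answerer's code but a sketch under assumptions: codes are upper-case, a mobile code is exactly S followed by one location letter and then digits, and the helper name split_device is hypothetical. The digits are kept as they are, so H0002 gives 0002; trimming to a fixed width would be a separate step.

```r
# Hypothetical helper: split device codes into v1 (mobile flag), v2 (location), v3 (digits)
split_device <- function(device) {
  is_mobile <- grepl("^S[A-Z][0-9]+$", device)  # e.g. "SE05"
  is_fixed  <- grepl("^[A-Z][0-9]+$", device)   # e.g. "H005"

  # v1 holds "S" for mobile devices only
  v1 <- ifelse(is_mobile, "S", "")
  # v2 holds the location letter, or the whole code when neither pattern applies
  v2 <- ifelse(is_mobile, substr(device, 2, 2),
               ifelse(is_fixed, substr(device, 1, 1), device))
  # v3 holds the digits that remain after stripping the leading letters
  v3 <- ifelse(is_mobile | is_fixed, sub("^[A-Z]+", "", device), "")

  data.frame(device, v1, v2, v3, stringsAsFactors = FALSE)
}

res <- split_device(c("SE05", "H0002", "MANA"))
res
```

All steps are vectorised, so the same function can be applied to the whole device column at once, for example cbind(ACCEPT, split_device(ACCEPT$device)[, c("v1", "v2", "v3")]). Note that a code such as S003 would be treated as a fixed device at location S, which may or may not be what is intended.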