Find city names within affiliations and add them with their corresponding countries in new columns of a dataframe
【Posted】: 2021-09-06 12:10:52 【Question】: I have a data frame "dfa" of affiliations that contain city names. The country is sometimes missing, for example in row 4 (Baghdad) and row 7 (Berlin):
dfa <- data.frame(affiliation=c("DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS",
"DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA.",
"DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES",
"COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD.",
"DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA.",
"LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY.",
"DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN",
"INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY.",
"DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND.",
"DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN",
"DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN.",
"LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA."))
I also have a second data frame "dfb" with a list of cities and their corresponding countries, some of which appear in "dfa":
dfb <- data.frame(city=c("AGRI","AMSTERDAM","ATHENS","AUCKLAND","BUENOS AIRES","BEIJING","BAGHDAD","BANGKOK","BERLIN","BUDAPEST"),
country=c("TURKEY","NETHERLANDS","GREECE","NEW ZEALAND","ARGENTINA","CHINA","IRAQ","THAILAND","GERMANY","HUNGARY"))
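How can I add the city and the corresponding country in two new columns, only for cities that appear in both "dfa" and "dfb" (even when the country is missing, as for Baghdad and Berlin)?
Note: the goal is to match complete city names, not parts of them. Row 7 below is an example of unwanted output: the Turkish city AGRI is wrongly associated with the Berlin affiliation because that row contains the word "AGRICULTURE".
Is there a simple way to do this, preferably with dplyr?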
affiliation city country
1 DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS AMSTERDAM NETHERLANDS
2 DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA. <NA> <NA>
3 DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES <NA> <NA>
4 COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD. BAGHDAD IRAQ
5 DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA. BEIJING CHINA
6 LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY. <NA> <NA>
7 DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN AGRI TURKEY
8 INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY. <NA> <NA>
9 DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND. BANGKOK THAILAND
10 DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN <NA> <NA>
11 DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN. <NA> <NA>
12 LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA. BEIJING CHINA
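【Comments】:
I would like to help you, but creating the data sets is too much work. Please copy and paste the first 10 rows of both data sets as text.
Images are not the right way to share data or code. Add them in a reproducible format that is easier to copy. Read how to give a reproducible example.
Indeed, here is the code. Thanks to Samuel and Ronak for the suggestions.
@demon, if one of the answers solved your problem, could you consider marking it as the accepted solution?
【Answer 1】:
A combination of str_extract() with a join, or with another str_extract(), is one option to get you there. str_extract() returns the first match it encounters, and paste0() collapses the cities into one long "or" pattern to check against.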
library(dplyr)
library(stringr)
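dfa %>%
  # \\b on both sides of every city, so AGRI cannot match inside AGRICULTURE
  mutate(city = str_extract(affiliation, paste0("\\b", dfb$city, "\\b", collapse = "|"))) %>%
  # look up the country of the extracted city
  left_join(dfb, by = "city")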
Edit: added word boundaries to the pattern built by paste0(), so that only whole city names are matched and partial matches are avoided.
affiliation city country
1 DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS AMSTERDAM NETHERLANDS
2 DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA. <NA> <NA>
3 DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES <NA> <NA>
4 COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD. BAGHDAD IRAQ
5 DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA. BEIJING CHINA
6 LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY. <NA> <NA>
7 DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN BERLIN GERMANY
8 INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY. <NA> <NA>
9 DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND. BANGKOK THAILAND
10 DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN <NA> <NA>
11 DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN. <NA> <NA>
12 LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA. BEIJING CHINA
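With a city list of several thousand names, two further details can matter: names that contain regex metacharacters (the dot in "ST. LOUIS") and short names that are a prefix of longer ones ("SAN" and "SAN JOSE"). The following is a minimal sketch of a pattern builder that handles both. build_city_pattern() and the sample cities are hypothetical and not part of the original answer or of stringr.

```r
library(stringr)

# Hypothetical helper: build one alternation pattern that matches
# whole city names only.
build_city_pattern <- function(cities) {
  # Longest names first, so "SAN JOSE" is tried before "SAN".
  cities <- cities[order(nchar(cities), decreasing = TRUE)]
  # Escape regex metacharacters so they are matched literally.
  escaped <- gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", cities, perl = TRUE)
  # A single pair of word boundaries around the whole group.
  paste0("\\b(?:", paste(escaped, collapse = "|"), ")\\b")
}

# Assumed sample data, for illustration only.
cities  <- c("AGRI", "BERLIN", "SAN", "SAN JOSE")
pattern <- build_city_pattern(cities)

# AGRI is not matched inside AGRICULTURE, so BERLIN is returned.
str_extract("DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN", pattern)
# The longer name is preferred over its prefix.
str_extract("SCHOOL OF MEDICINE, SAN JOSE STATE UNIVERSITY", pattern)
# A standalone AGRI is still found.
str_extract("STATE HOSPITAL, AGRI", pattern)
```

The resulting pattern can be passed to str_extract() inside mutate() in place of the paste0() call above.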
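【Comments】:
Thank you very much, phiver, it works well. However, there is one limitation that you could not have anticipated because I had not specified it: please read the note in the updated question above.
@demon, does your data really look like this, or is the affiliation row for "BERLIN INSTITUTE OF HEALTH, BERLIN" more like "BERLIN INSTITUTE OF HEALTH, , ,BERLIN,"? If the missing fields still had their commas, it would be easier, because you could use tidyr::separate instead of searching for matches.
Unfortunately not, the affiliations look like "BERLIN INSTITUTE OF HEALTH, BERLIN". Using "\w" in str_extract seems to match any word, but I do not know how to include it in your code (stringr.tidyverse.org/articles/regular-expressions.html). Thank you.
@demon, see the adjusted answer.
Thank you very much, phiver, this is very useful. I applied your code to a large "dfa" data frame (thousands of affiliations) crossed with a large "dfb" (thousands of cities), and it works well. The only unavoidable limitation is a city name that exists in several countries (for example Hamilton, which is a city in Canada, New Zealand, the United States and the United Kingdom). Fortunately, with Andy Eggers' code below, the rows are duplicated in these cases so that every possibility is listed. This makes it possible to filter the duplicated rows on other criteria and to remove the unsuitable ones by hand.
【Answer 2】:
This approach accounts for the possibility that one affiliation matches more than one city name.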
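library(tidyverse)
dfa %>%
  # try every city against every affiliation; word boundaries avoid partial matches such as AGRI in AGRICULTURE
  mutate(city = map(affiliation, ~ str_extract(.x, paste0("\\b", dfb$city, "\\b")))) %>%
  unnest(cols = c(city)) %>%
  group_by(affiliation) %>%
  # number of cities found in this affiliation
  mutate(nmatches = sum(!is.na(city))) %>%
  # keep the matched cities, or a single NA row when nothing matched
  filter((nmatches > 0 & !is.na(city)) | (nmatches == 0 & row_number() == 1)) %>%
  ungroup() %>%
  left_join(dfb, by = "city") %>%
  # does the affiliation also mention the country of the matched city?
  mutate(country_match = str_detect(affiliation, country))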
# A tibble: 12 x 5
affiliation city nmatches country country_match
<chr> <chr> <int> <chr> <lgl>
1 DEPARTMENT OF PHARMACY,… AMSTE… 1 NETHER… TRUE
2 DEPARTMENT OF BIOCHEMIS… NA 0 NA NA
3 DEPARTMENT OF PATHOLOGY… NA 0 NA NA
4 COLLEGE OF EDUCATION FO… BAGHD… 1 IRAQ FALSE
5 DEPARTMENT OF CLINICAL … BEIJI… 1 CHINA TRUE
6 LABORATORY OF MOLECULAR… NA 0 NA NA
7 BERLIN INSTITUTE OF HEA… BERLIN 1 GERMANY FALSE
8 INSTITUTE OF LABORATORY… NA 0 NA NA
9 DEPARTMENT OF CLINICAL … BANGK… 1 THAILA… TRUE
10 DEPARTMENT OF BIOLOGY, … NA 0 NA NA
11 DEPARTMENT OF MOLECULAR… NA 0 NA NA
12 LABORATORY OF CARDIOVAS… BEIJI… 1 CHINA TRUE
You can then double-check the cases that have 1 nmatches but country_match == FALSE. When there are 2 or more nmatches, you can use country_match == TRUE to keep one of them.
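The check described above can be sketched as a pair of filters. This is only an illustration under stated assumptions: the data frames aff and cities and the column n_candidates are hypothetical, and n_candidates counts candidate rows per affiliation after the join rather than reusing the nmatches column from the code above.

```r
library(dplyr)
library(stringr)

# Hypothetical data: HAMILTON exists in more than one country.
aff <- data.frame(affiliation = c(
  "DEPARTMENT OF MEDICINE, MCMASTER UNIVERSITY, HAMILTON, CANADA",
  "COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD.",
  "DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN"
))
cities <- data.frame(
  city    = c("HAMILTON", "HAMILTON", "BAGHDAD"),
  country = c("CANADA", "NEW ZEALAND", "IRAQ")
)

city_pattern <- paste0("\\b", unique(cities$city), "\\b", collapse = "|")

matched <- aff %>%
  mutate(city = str_extract(affiliation, city_pattern)) %>%
  # an ambiguous city yields one row per candidate country
  left_join(cities, by = "city") %>%
  mutate(country_match = str_detect(affiliation, country)) %>%
  group_by(affiliation) %>%
  mutate(n_candidates = n()) %>%
  ungroup()

# Keep unambiguous rows as they are; for ambiguous cities keep only the
# candidate whose country is named in the affiliation.
resolved <- matched %>%
  filter(n_candidates == 1 | country_match)

# Single candidate, but the country is not mentioned: worth a manual look.
needs_review <- matched %>%
  filter(n_candidates == 1 & !country_match)
```

An ambiguous city whose affiliation names none of the candidate countries is dropped from resolved by this filter, so such rows would still need separate handling.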
【Comments】:
Thank you very much, Andy, it works well. The "nmatches" column is very useful (see my comment to phiver above).
I elaborated on how to avoid the manual cleaning.