Split a string of names and transpose

【Title】Split a string of names and transpose 【Posted】2022-01-16 13:22:25 【Question】:

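I have a list of names (famous directors) in the format "First (possibly Middle) Last", and I need to rearrange them as "Last, First (possibly Middle)". I can't simply split every name at the first space, or even the second, because some last names are actually two words, and some people have a middle name and/or middle initial that should stay with the first name.

Here is the dput of the list I'm working with: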

> dput(directors.names)
c("Frank Darabont,", "Francis Ford Coppola,", "Francis Ford Coppola,", 
"Christopher Nolan,", "Sidney Lumet,", "Steven Spielberg,", "Peter Jackson,", 
"Quentin Tarantino,", "Sergio Leone,", "Peter Jackson,", "David Fincher,", 
"Robert Zemeckis,", "Christopher Nolan,", "Peter Jackson,", "Irvin Kershner,", 
"Lana Wachowski,", "Martin Scorsese,", "Milos Forman,", "Akira Kurosawa,", 
"David Fincher,", "Jonathan Demme,", "Fernando Meirelles,", "Roberto Benigni,", 
"Frank Capra,", "Steven Spielberg,", "George Lucas,", "Christopher Nolan,", 
"Hayao Miyazaki,", "Frank Darabont,", "Bong Joon Ho,", "Luc Besson,", 
"Masaki Kobayashi,", "Roman Polanski,", "James Cameron,", "Robert Zemeckis,", 
"Bryan Singer,", "Alfred Hitchcock,", "Roger Allers,", "Charles Chaplin,", 
"Tony Kaye,", "Isao Takahata,", "Charles Chaplin,", "Damien Chazelle,", 
"Ridley Scott,", "Martin Scorsese,", "Olivier Nakache,", "Christopher Nolan,", 
"Michael Curtiz,", "Sergio Leone,", "Alfred Hitchcock,", "Giuseppe Tornatore,", 
"Ridley Scott,", "Francis Ford Coppola,", "Christopher Nolan,", 
"Steven Spielberg,", "Charles Chaplin,", "Quentin Tarantino,", 
"Florian Henckel von Donnersmarck,", "Stanley Kubrick,", "Billy Wilder,", 
"Andrew Stanton,", "Anthony Russo,", "Billy Wilder,", "Stanley Kubrick,", 
"Bob Persichetti,", "Stanley Kubrick,", "Hayao Miyazaki,", "Park Chan-Wook,", 
"Todd Phillips,", "Makoto Shinkai,", "Lee Unkrich,", "Christopher Nolan,", 
"James Cameron,", "Sergio Leone,", "Anthony Russo,", "Nadine Labaki,", 
"Wolfgang Petersen,", "Akira Kurosawa,", "Rajkumar Hirani,", 
"John Lasseter,", "Sam Mendes,", "Milos Forman,", "Mel Gibson,", 
"Quentin Tarantino,", "Thomas Kail,", "Gus Van Sant,", "Richard Marquand,", 
"Stanley Kubrick,", "Quentin Tarantino,", "Elem Klimov,", "Fritz Lang,", 
"Aamir Khan,", "Alfred Hitchcock,", "Orson Welles,", "Thomas Vinterberg,", 
"Darren Aronofsky,", "Stanley Donen,", "Alfred Hitchcock,", "Michel Gondry,", 
"Akira Kurosawa,", "Vittorio De Sica,", "David Lean,", "Charles Chaplin,", 
"Stanley Kubrick,", "Nitesh Tiwari,", "Billy Wilder,", "Denis Villeneuve,", 
"Florian Zeller,", "Fritz Lang,", "Billy Wilder,", "Stanley Kubrick,", 
"Martin Scorsese,", "Asghar Farhadi,", "George Roy Hill,", "Brian De Palma,", 
"Satyajit Ray,", "Guy Ritchie,", "Sam Mendes,", "Jean-Pierre Jeunet,", 
"Robert Mulligan,", "Lee Unkrich,", "Sergio Leone,", "Pete Docter,", 
"Steven Spielberg,", "Michael Mann,", "Curtis Hanson,", "T.J. Gnanavel,", 
"Akira Kurosawa,", "John McTiernan,", "Akira Kurosawa,", "Akira Kurosawa,", 
"Peter Farrelly,", "Oliver Hirschbiegel,", "Terry Gilliam,", 
"Joseph L. Mankiewicz,", "Billy Wilder,", "Christopher Nolan,", 
"Clint Eastwood,", "Majid Majidi,", "Hayao Miyazaki,", "Martin Scorsese,", 
"Stanley Kramer,", "John Sturges,", "Paul Thomas Anderson,", 
"Martin Scorsese,", "John Huston,", "Guillermo del Toro,", "Ron Howard,", 
"Juan José Campanella,", "Martin Scorsese,", "Akira Kurosawa,", 
"Roman Polanski,", "Hayao Miyazaki,", "Guy Ritchie,", "Martin Scorsese,", 
"Ethan Coen,", "Charles Chaplin,", "Alfred Hitchcock,", "John Carpenter,", 
"Ingmar Bergman,", "Martin McDonagh,", "Sergio Pablos,", "David Lynch,", 
"M. Night Shyamalan,", "Ingmar Bergman,", "Peter Weir,", "Carol Reed,", 
"Steven Spielberg,", "Denis Villeneuve,", "Bong Joon Ho,", "James McTeigue,", 
"Ridley Scott,", "Danny Boyle,", "Pete Docter,", "David Lean,", 
"Joel Coen,", "Gavin O'Connor,", "Andrew Stanton,", "Quentin Tarantino,", 
"Victor Fleming,", "Yasujirô Ozu,", "Elia Kazan,", "Cagan Irmak,", 
"Damián Szifron,", "Andrei Tarkovsky,", "Michael Cimino,", "Denis Villeneuve,", 
"Costa-Gavras,", "Wes Anderson,", "Buster Keaton,", "Clyde Bruckman,", 
"Clint Eastwood,", "Ingmar Bergman,", "Richard Linklater,", "Adam Elliot,", 
"Steven Spielberg,", "Frank Capra,", "Jim Sheridan,", "Stanley Kubrick,", 
"Lenny Abrahamson,", "David Fincher,", "Mel Gibson,", "Carl Theodor Dreyer,", 
"Sriram Raghavan,", "James Mangold,", "Steve McQueen,", "Ernst Lubitsch,", 
"Joel Coen,", "Peter Weir,", "Ingmar Bergman,", "Dean DeBlois,", 
"George Miller,", "William Wyler,", "David Yates,", "Clint Eastwood,", 
"Henri-Georges Clouzot,", "Park Chan-Wook,", "Rob Reiner,", "Sidney Lumet,", 
"James Mangold,", "Anurag Kashyap,", "Stuart Rosenberg,", "Lasse Hallström,", 
"Mathieu Kassovitz,", "François Truffaut,", "Naoko Yamada,", 
"Oliver Stone,", "Tom McCarthy,", "Pete Docter,", "Alfred Hitchcock,", 
"Terry Jones,", "Terry George,", "Kar-Wai Wong,", "Yavuz Turgul,", 
"Ron Howard,", "Sean Penn,", "John G. Avildsen,", "Alejandro G. Iñárritu,", 
"Hayao Miyazaki,", "Andrei Tarkovsky,", "Frank Capra,", "Richard Linklater,", 
"Ingmar Bergman,", "Hideaki Anno,", "Gillo Pontecorvo,", "Federico Fellini,", 
"Rob Reiner,", "Wim Wenders,", "Krzysztof Kieslowski,", "Ram Kumar,"
)

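Some tricky examples: I need to split "John G. Avildsen" after the "G.", "Bong Joon Ho" after the first space, and, more notably, "Florian Henckel von Donnersmarck" after the second space (just to point out a few).

I added a comma to the end of every string so that I could transpose the strings and get them back in Last, First (possibly Middle) format.

I went through my list and found all the cases where the parts of the last name need to stay together, and tried splitting on those first. It doesn't split where I need it to, though; it just puts each string into its own index.

Here is what I tried most recently: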

directors.names <- paste0(directors.1, ",")
directors.names <- strsplit(directors.names, "[[:space:]]+('von'|'Ford'|'Joon'|'De'|'del'|'Van')[[:space:]]+", perl = TRUE)  

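Once these are split and transposed correctly, the duplicates need to be removed so that the result is a list that can be sorted alphabetically by last name, with each row showing Last name, First name (MI or middle name).

【Comments】:

Please share your data with dput().

I edited the original question to add the dput() of the list I'm working with.

【Answer 1】: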

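We know the patterns we want to extract (the first word, the last word, and sometimes a two-word last name), so an extract approach is probably a better fit than a split approach: we don't know how many words each name has, which makes it hard to split on the nth space.

We can define a pattern for the common two-word last names and then insert that pattern into str_extract_all with glue::glue.

In the following call to str_extract_all, we define 3 possible patterns to extract:

the first word: ^\\w+
a two-word last name: (({two_word_patterns})\\s+\\w+$)
an ordinary last name: \\w+$

These three should be collapsed with | as the separator (1), all inside a single regex string (no quotes in between).

After extracting the names, we can reverse their order with rev(), and finally paste them back together with toString.

toString is especially handy when we need to paste character elements with a ", " separator, as in this case.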
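
As a minimal illustration of the rev() and toString step on its own, here is a sketch that assumes one name has already been extracted into its pieces (the `parts` and `flipped` variables are made up for this example and are not part of the code further down):

```r
# A single name, already split into "first word" and "last name" pieces
parts <- c("Bong", "Joon Ho")

# rev() puts the last name first; toString() joins the pieces with ", "
flipped <- toString(rev(parts))
flipped
```

toString(x) behaves like paste(x, collapse = ", "), which is why no separator has to be passed explicitly.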

library(glue)
library(stringr)
library(purrr)

directors <- c("Fernando Meireles", "Bong Joon Ho", "Florian Henckel von Donnersmarck")

# First words of the known two-word last names; see note (1)
two_word_patterns <- '(von)|(Ford)|(Joon)|(De)|(del)|(Van)'

directors %>%
    # glue() substitutes {two_word_patterns} into the regex
    str_extract_all(pattern = glue('^\\w+|(({two_word_patterns})\\s+\\w+$)|\\w+$')) %>%
    map(rev) %>%
    map_chr(toString)

[1] "Meireles, Fernando"        "Joon Ho, Bong"             "von Donnersmarck, Florian"

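(1) If we have a vector with the first words of the two-word last names and want to build "two_word_patterns" programmatically, we can use:

two_words_2 <- c('von', 'Ford', 'Joon', 'De', 'del', 'Van')

# Wrap each word in parentheses, then join the groups with |
two_words_2_pattern <- map_chr(two_words_2, ~glue('({.x})')) %>%
    paste(collapse = '|')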

[1] "(von)|(Ford)|(Joon)|(De)|(del)|(Van)"
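
For comparison, the same string can probably be built without purrr or glue. This is only a sketch of a base R equivalent; the name `two_words_2_pattern_base` is invented here to avoid clashing with the variable above:

```r
two_words_2 <- c("von", "Ford", "Joon", "De", "del", "Van")

# paste0() is vectorised over the words; collapse = "|" joins the wrapped groups
two_words_2_pattern_base <- paste0("(", two_words_2, ")", collapse = "|")
two_words_2_pattern_base
```

Either version yields one alternation that can be dropped into the larger regex.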

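EDIT

- OP provided the data with dput()

If we really have to work with the names that carry a trailing comma (as in "Fernando Meirelles,"), we can start by stripping the commas with trimws before doing anything else, and then pipe the output of trimws into the same code as above. For clarity, I only use a subset of the data here:

# directors.names is the vector from the question's dput()
head(directors.names, 40) %>%
    trimws(whitespace = ',') %>%
    str_extract_all(pattern = glue('^\\w+|(({two_words_2_pattern})\\s+\\w+$)|\\w+$')) %>%
    map(rev) %>%
    map_chr(toString)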
 [1] "Darabont, Frank"       "Ford Coppola, Francis" "Ford Coppola, Francis" "Nolan, Christopher"    "Lumet, Sidney"         "Spielberg, Steven"    
 [7] "Jackson, Peter"        "Tarantino, Quentin"    "Leone, Sergio"         "Jackson, Peter"        "Fincher, David"        "Zemeckis, Robert"     
[13] "Nolan, Christopher"    "Jackson, Peter"        "Kershner, Irvin"       "Wachowski, Lana"       "Scorsese, Martin"      "Forman, Milos"        
[19] "Kurosawa, Akira"       "Fincher, David"        "Demme, Jonathan"       "Meirelles, Fernando"   "Benigni, Roberto"      "Capra, Frank"         
[25] "Spielberg, Steven"     "Lucas, George"         "Nolan, Christopher"    "Miyazaki, Hayao"       "Darabont, Frank"       "Joon Ho, Bong"        
[31] "Besson, Luc"           "Kobayashi, Masaki"     "Polanski, Roman"       "Cameron, James"        "Zemeckis, Robert"      "Singer, Bryan"        
[37] "Hitchcock, Alfred"     "Allers, Roger"         "Chaplin, Charles"      "Kaye, Tony"
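
The extraction above keeps only the first word and the last name, so a middle name or initial (the "G." in "John G. Avildsen", or "Henckel") is dropped, and the question's final steps (removing duplicates and sorting) are not covered. The following is a rough base R sketch of one way to handle both. It is not part of the original answer: `flip_name` is a hypothetical helper, and the prefix list is the same hand-picked set as above, so any other two-word last name would need to be added to it.

```r
# Hypothetical helper: move the last name to the front, keep everything else
flip_name <- function(x, prefixes = c("von", "Ford", "Joon", "De", "del", "Van")) {
  # Strip leading/trailing spaces and commas (whitespace= needs R >= 3.6.0)
  x <- trimws(x, whitespace = "[ ,]")
  prefix_alt <- paste(prefixes, collapse = "|")
  # Group 1: given names (lazy); group 2: optional prefix word plus the final word
  pattern <- paste0("^(.*?)\\s+((?:(?:", prefix_alt, ")\\s+)?\\S+)$")
  # Names without a space (e.g. "Costa-Gavras") do not match and are returned as-is
  sub(pattern, "\\2, \\1", x, perl = TRUE)
}

sample_names <- c("Frank Darabont,", "John G. Avildsen,", "Bong Joon Ho,",
                  "Florian Henckel von Donnersmarck,", "Frank Darabont,")

flipped_names <- flip_name(sample_names)

# Drop duplicates, then sort alphabetically (the order depends on the locale)
result <- sort(unique(flipped_names))
result
```

Because the last name now leads each string, a plain sort() orders the list by last name. Note that "Ford" is still treated as part of the last name here, just as in the output above; removing it from `prefixes` would give "Coppola, Francis Ford" instead.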
