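The previous sections introduced many of the lexical resources bundled with NLTK. These resources are very useful for text processing. As an example, suppose we want to find words that can be formed from the letters in egivrvonl. A letter may not be used in a word more often than it occurs in egivrvonl, and each word must be at least 6 letters long.
To implement this, we first call FreqDist to count how many times each letter occurs in the puzzle letters: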
import nltk
puzzle_letters = nltk.FreqDist('egivrvonl')
for k in puzzle_letters:
    print(k, puzzle_letters[k])
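The result is shown below. puzzle_letters is an iterable object that behaves like a dictionary: each key is a letter and each value is the number of times that letter occurs.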
e 1
g 1
i 1
v 2
r 1
o 1
n 1
l 1
Can FreqDist then be used to check whether the letters of one word are contained in another? Consider the following example, which compares two FreqDist objects:
print(nltk.FreqDist('eg') <= puzzle_letters)
print(nltk.FreqDist('ae') <= puzzle_letters)
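The output is shown below. The comparison returns True when puzzle_letters contains every letter of the left-hand object at least as many times. Both e and g occur in 'egivrvonl', so the first comparison is True. In 'ae', e occurs in 'egivrvonl' but a does not, so the second comparison is False.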
True
False
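To make the meaning of <= concrete, here is a minimal sketch of the same containment check using only the standard library. It assumes collections.Counter as a stand-in for FreqDist, and the helper name letters_within is made up for illustration; it is not part of NLTK.

```python
from collections import Counter


def letters_within(word, letters):
    """Return True if `word` uses each letter no more often than `letters` provides.

    Hypothetical helper: it mimics the FreqDist comparison with a plain Counter.
    """
    available = Counter(letters)
    needed = Counter(word)
    # A Counter returns 0 for a missing key, so absent letters fail the check.
    return all(available[ch] >= n for ch, n in needed.items())


print(letters_within('eg', 'egivrvonl'))   # e and g are both available
print(letters_within('ae', 'egivrvonl'))   # a is missing
print(letters_within('vvv', 'egivrvonl'))  # v is available only twice
```

The third call shows that the check is about counts, not just membership: v occurs twice in the puzzle letters, so a word needing three of them is rejected.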
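With FreqDist in hand, it is fairly clear how to implement the puzzle. We build two kinds of FreqDist objects: one from the letters egivrvonl, and one from each word in nltk.corpus.words.words(). Comparing them yields the words that satisfy the constraints: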
puzzle_letters = nltk.FreqDist('egivrvonl')
obligatory = 'r'
wordlist = nltk.corpus.words.words()
ret = [w for w in wordlist if len(w) >= 6 and obligatory in w and nltk.FreqDist(w) <= puzzle_letters]
print(ret)
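obligatory is the letter that every word must contain, here r. The list comprehension [w for w in wordlist if len(w) >= 6 and obligatory in w and nltk.FreqDist(w) <= puzzle_letters] then keeps the words that meet three conditions: (1) the word has at least 6 letters; (2) the word contains r; (3) every letter of w comes from 'egivrvonl' and is used no more often than it occurs there.
The result is: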
['glover', 'gorlin', 'govern', 'grovel', 'ignore', 'involver', 'lienor', 'linger', 'longer', 'lovering', 'noiler', 'overling', 'region', 'renvoi', 'revolving', 'ringle', 'roving', 'violer', 'virole']
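This works like a word puzzle game; FreqDist together with NLTK's word lists makes the answer easy to find.
Let us look at another task: finding names shared by men and women, that is, names from which the gender cannot be determined.
NLTK includes a names corpus that stores male and female names in two separate files. The code is as follows: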
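The puzzle logic can also be wrapped in a reusable function. The sketch below is one possible arrangement, not the only one: solve_puzzle and its parameters are names invented here, collections.Counter stands in for FreqDist, and sample_words is a small hand-picked list rather than the NLTK word corpus, so that the example runs without downloading any data.

```python
from collections import Counter


def solve_puzzle(letters, obligatory, wordlist, min_len=6):
    """Return the words in `wordlist` that can be built from `letters`.

    Hypothetical helper: a word qualifies if it has at least `min_len`
    letters, contains `obligatory`, and uses no letter more often than
    `letters` provides.
    """
    available = Counter(letters)
    result = []
    for w in wordlist:
        # Cheap checks first, so a Counter is built only for likely candidates.
        if len(w) < min_len or obligatory not in w:
            continue
        if all(available[ch] >= n for ch, n in Counter(w).items()):
            result.append(w)
    return result


# Hand-picked sample words (an assumption for illustration only).
sample_words = ['govern', 'region', 'revolving', 'ignore',
                'gone', 'livening', 'engine', 'roller']
print(solve_puzzle('egivrvonl', 'r', sample_words))
```

With the real corpus, the same function could be called with nltk.corpus.words.words() as the wordlist argument. Running the length and obligatory-letter tests before counting letters avoids building a counter for most of the words.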
name = nltk.corpus.names
print(name.fileids())
male_name = name.words('male.txt')
female_name = name.words('female.txt')
print([w for w in male_name if w in female_name])
The output is:
['female.txt', 'male.txt']
['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis', 'Alfie', 'Ali', 'Alix', 'Allie', 'Allyn', 'Andie', 'Andrea', 'Andy', 'Angel', 'Angie', 'Ariel', 'Ashley', 'Aubrey', 'Augustine', 'Austin', 'Averil', 'Barrie', 'Barry', 'Beau', 'Bennie', 'Benny', 'Bernie', 'Bert', 'Bertie', 'Bill', 'Billie', 'Billy', 'Blair', 'Blake', 'Bo', 'Bobbie', 'Bobby', 'Brandy', 'Brett', 'Britt', 'Brook', 'Brooke', 'Brooks', 'Bryn', 'Cal', 'Cam', 'Cammy', 'Carey', 'Carlie', 'Carlin', 'Carmine', 'Carroll', 'Cary', 'Caryl', 'Casey', 'Cass', 'Cat', 'Cecil', 'Chad', 'Chris', 'Chrissy', 'Christian', 'Christie', 'Christy', 'Clair', 'Claire', 'Clare', 'Claude', 'Clem', 'Clemmie', 'Cody', 'Connie', 'Constantine', 'Corey', 'Corrie', 'Cory', 'Courtney', 'Cris', 'Daffy', 'Dale', 'Dallas', 'Dana', 'Dani', 'Daniel', 'Dannie', 'Danny', 'Darby', 'Darcy', 'Darryl', 'Daryl', 'Deane', 'Del', 'Dell', 'Demetris', 'Dennie', 'Denny', 'Devin', 'Devon', 'Dion', 'Dionis', 'Dominique', 'Donnie', 'Donny', 'Dorian', 'Dory', 'Drew', 'Eddie', 'Eddy', 'Edie', 'Elisha', 'Emmy', 'Erin', 'Esme', 'Evelyn', 'Felice', 'Fran', 'Francis', 'Frank', 'Frankie', 'Franky', 'Fred', 'Freddie', 'Freddy', 'Gabriel', 'Gabriell', 'Gail', 'Gale', 'Gay', 'Gayle', 'Gene', 'George', 'Georgia', 'Georgie', 'Geri', 'Germaine', 'Gerri', 'Gerry', 'Gill', 'Ginger', 'Glen', 'Glenn', 'Grace', 'Gretchen', 'Gus', 'Haleigh', 'Haley', 'Hannibal', 'Harley', 'Hazel', 'Heath', 'Henrie', 'Hilary', 'Hillary', 'Holly', 'Ike', 'Ikey', 'Ira', 'Isa', 'Isador', 'Isadore', 'Jackie', 'Jaime', 'Jamie', 'Jan', 'Jean', 'Jere', 'Jermaine', 'Jerrie', 'Jerry', 'Jess', 'Jesse', 'Jessie', 'Jo', 'Jodi', 'Jodie', 'Jody', 'Joey', 'Jordan', 'Juanita', 'Jude', 'Judith', 'Judy', 'Julie', 'Justin', 'Karel', 'Kellen', 'Kelley', 'Kelly', 'Kelsey', 'Kerry', 'Kim', 'Kip', 'Kirby', 'Kit', 'Kris', 'Kyle', 'Lane', 'Lanny', 'Lauren', 'Laurie', 'Lee', 'Leigh', 'Leland', 'Lesley', 'Leslie', 'Lin', 'Lind', 'Lindsay', 'Lindsey', 'Lindy', 'Lonnie', 'Loren', 'Lorne',
'Lorrie', 'Lou', 'Luce', 'Lyn', 'Lynn', 'Maddie', 'Maddy', 'Marietta', 'Marion', 'Marlo', 'Martie', 'Marty', 'Mattie', 'Matty', 'Maurise', 'Max', 'Maxie', 'Mead', 'Meade', 'Mel', 'Meredith', 'Merle', 'Merrill', 'Merry', 'Meryl', 'Michal', 'Michel', 'Michele', 'Mickie', 'Micky', 'Millicent', 'Morgan', 'Morlee', 'Muffin', 'Nat', 'Nichole', 'Nickie', 'Nicky', 'Niki', 'Nikki', 'Noel', 'Ollie', 'Page', 'Paige', 'Pat', 'Patrice', 'Patsy', 'Pattie', 'Patty', 'Pen', 'Pennie', 'Penny', 'Perry', 'Phil', 'Pooh', 'Quentin', 'Quinn', 'Randi', 'Randie', 'Randy', 'Ray', 'Regan', 'Reggie', 'Rene', 'Rey', 'Ricki', 'Rickie', 'Ricky', 'Rikki', 'Robbie', 'Robin', 'Ronnie', 'Ronny', 'Rory', 'Ruby', 'Sal', 'Sam', 'Sammy', 'Sandy', 'Sascha', 'Sasha', 'Saundra', 'Sayre', 'Scotty', 'Sean', 'Shaine', 'Shane', 'Shannon', 'Shaun', 'Shawn', 'Shay', 'Shayne', 'Shea', 'Shelby', 'Shell', 'Shelley', 'Sibyl', 'Simone', 'Sonnie', 'Sonny', 'Stacy', 'Sunny', 'Sydney', 'Tabbie', 'Tabby', 'Tallie', 'Tally', 'Tammie', 'Tammy', 'Tate', 'Ted', 'Teddie', 'Teddy', 'Terri', 'Terry', 'Theo', 'Tim', 'Timmie', 'Timmy', 'Tobe', 'Tobie', 'Toby', 'Tommie', 'Tommy', 'Tony', 'Torey', 'Trace', 'Tracey', 'Tracie', 'Tracy', 'Val', 'Vale', 'Valentine', 'Van', 'Vin', 'Vinnie', 'Vinny', 'Virgie', 'Wallie', 'Wallis', 'Wally', 'Whitney', 'Willi', 'Willie', 'Willy', 'Winnie', 'Winny', 'Wynn']
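Testing membership against a list scans that list once for every name, which becomes slow as the lists grow. A common alternative, sketched below, is to convert one list to a set first. The function name shared_names and the two sample lists are made up for illustration; they are not the contents of the NLTK names corpus.

```python
def shared_names(male_names, female_names):
    """Return the names that occur in both lists, in the order of `male_names`.

    Hypothetical helper: set lookup is used instead of list lookup.
    """
    female_set = set(female_names)  # average constant-time membership tests
    return [name for name in male_names if name in female_set]


# Made-up sample data (an assumption), not taken from male.txt or female.txt.
male_sample = ['Adrian', 'Alex', 'Bill', 'Chris', 'David']
female_sample = ['Alex', 'Alice', 'Chris', 'Adrian', 'Zoe']
print(shared_names(male_sample, female_sample))
```

With the real corpus, the arguments would be name.words('male.txt') and name.words('female.txt'). Keeping the list comprehension over the male names preserves their original order, which a plain set intersection would not guarantee.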
If we want to supplement the names, we can also load files of our own, as follows:
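from nltk.corpus import PlaintextCorpusReader
corpus_root = '/home/zhf/word'  # directory that holds your own text files
wordlists = PlaintextCorpusReader(corpus_root, '.*')
print(wordlists.fileids())
for w in wordlists.words('filename'):  # replace 'filename' with one of the file IDs printed above
    print(w)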
Lexical tools:
When writing, we often replace a word with one of its synonyms. WordNet can help with this:
from nltk.corpus import wordnet as wn
lemma = wn.synsets('motorcar')
print(lemma)
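The output is shown below. motorcar has only one possible sense, car. car.n.01 is called a synset, or synonym set. Here car is the name of the sense, n is the part of speech (noun), and 01 is the sense number.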
[Synset('car.n.01')]
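The three-part naming scheme can be illustrated with a small string-splitting sketch. split_synset_name is a hypothetical helper written for this example, not an NLTK function; it only assumes that a synset name has the form name.pos.number described above.

```python
def split_synset_name(name):
    """Split a synset name such as 'car.n.01' into (name, pos, number).

    Hypothetical helper: splitting from the right keeps any dots inside the
    first part intact.
    """
    lemma, pos, number = name.rsplit('.', 2)
    return lemma, pos, int(number)


print(split_synset_name('car.n.01'))
print(split_synset_name('right_whale.n.01'))
```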
Calling wn.synset('car.n.01').lemma_names() returns all the synonyms in this synset:
['car', 'auto', 'automobile', 'machine', 'motorcar']
We can also get the definition of the synset and example sentences:
print(wn.synset('car.n.01').definition())
print(wn.synset('car.n.01').examples())
a motor vehicle with four wheels; usually propelled by an internal combustion engine
['he needs a car to get to work']
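Synsets in WordNet are linked by hypernym (more general) and hyponym (more specific) relations. Take car.n.01 again: there are many kinds of car, and these kinds are the hyponyms of car.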
motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()
print([lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas()])
The result lists the many different kinds of car:
['ambulance', 'beach_wagon', 'station_wagon', 'wagon', 'estate_car', 'beach_waggon', 'station_waggon', 'waggon', 'bus', 'jalopy', 'heap', 'cab', 'hack', 'taxi', 'taxicab', 'compact', 'compact_car', 'convertible', 'coupe', 'cruiser', 'police_cruiser', 'patrol_car', 'police_car', 'prowl_car', 'squad_car', 'electric', 'electric_automobile', 'electric_car', 'gas_guzzler', 'hardtop', 'hatchback', 'horseless_carriage', 'hot_rod', 'hot-rod', 'jeep', 'landrover', 'limousine', 'limo', 'loaner', 'minicar', 'minivan', 'Model_T', 'pace_car', 'racer', 'race_car', 'racing_car', 'roadster', 'runabout', 'two-seater', 'sedan', 'saloon', 'sport_utility', 'sport_utility_vehicle', 'S.U.V.', 'SUV', 'sports_car', 'sport_car', 'Stanley_Steamer', 'stock_car', 'subcompact', 'subcompact_car', 'touring_car', 'phaeton', 'tourer', 'used-car', 'secondhand_car']
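Hypernyms and hyponyms can be understood as an is-a relation, a hierarchy in which the more general concept includes the more specific one. Because of this, we can check whether several synsets share a common hypernym. If two synsets share a very specific hypernym, they must be closely related. For example: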
right = wn.synset('right_whale.n.01')  # right whale
orca = wn.synset('orca.n.01')  # orca (killer whale)
minke = wn.synset('minke_whale.n.01')  # minke whale
print(right.lowest_common_hypernyms(minke))
The output is:
[Synset('baleen_whale.n.01')]
These are three different kinds of whale. lowest_common_hypernyms finds the most specific hypernym shared by right and minke, which is the baleen whale.
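The idea behind a lowest common hypernym can be sketched on a toy hierarchy. The code below is only an illustration of the concept under stated assumptions: TOY_HYPERNYM is a hand-written, simplified tree with one parent per node, not WordNet's actual data, and lowest_common_hypernym is a hypothetical function, not the NLTK method. Real WordNet synsets can have more than one hypernym, which this sketch does not handle.

```python
# Hand-written, simplified is-a links (child -> parent). This is an assumed
# toy hierarchy for illustration, NOT data read from WordNet.
TOY_HYPERNYM = {
    'right_whale': 'baleen_whale',
    'minke_whale': 'rorqual',
    'rorqual': 'baleen_whale',
    'baleen_whale': 'whale',
    'orca': 'toothed_whale',
    'toothed_whale': 'whale',
    'whale': 'mammal',
}


def ancestors(node):
    """Return the node followed by its hypernyms, most specific first."""
    path = [node]
    while path[-1] in TOY_HYPERNYM:
        path.append(TOY_HYPERNYM[path[-1]])
    return path


def lowest_common_hypernym(a, b):
    """Return the most specific node shared by both hypernym paths, or None."""
    seen = set(ancestors(a))
    # Walk upward from b; the first node also above a is the lowest shared one.
    for node in ancestors(b):
        if node in seen:
            return node
    return None


print(lowest_common_hypernym('right_whale', 'minke_whale'))
print(lowest_common_hypernym('right_whale', 'orca'))
```

In this toy tree the right whale and the minke whale meet at baleen_whale, while the right whale and the orca only meet higher up at whale, which reflects the point above: the more specific the shared hypernym, the more closely related the two concepts.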