How can I count two-word phrase (bigram) frequencies in a short English text with Python?
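For example, the input string is "Do you hear the people sing, singing a song of angry men. It is the music of a people, who will not be slaves again, when the beating of your heart echoes the beating of the drums. There is a life about to start when tomorrow comes."
In this string, every two adjacent words form a phrase, for example "do you", "a song", "it is", "is the". The task is to count how many times each phrase occurs and then compute a value n for each phrase. For example, "a song" occurs once in the text and phrases starting with "a" occur three times (a song; a people; a life), so n = 1/3. All phrases are then sorted by n in descending order, and phrases with the same n are sorted alphabetically (for example, the three phrases starting with "a" are ordered a life; a people; a song). Finally, the phrases are printed together with their n values.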
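As a small illustration of the definition of n, the sketch below computes it for just the three "a" phrases listed in the question. The variable names are chosen here for illustration and do not come from the answers.

```python
from collections import Counter

# The three phrases starting with "a" that the question lists.
pairs = [("a", "song"), ("a", "people"), ("a", "life")]

# How often each phrase occurs.
pair_counts = Counter(pairs)
# How many phrases start with each first word.
first_counts = Counter(first for first, _ in pairs)

# n = (occurrences of the phrase) / (phrases sharing its first word)
n_values = {
    pair: pair_counts[pair] / first_counts[pair[0]]
    for pair in pair_counts
}

for (first, second), n in n_values.items():
    print(first, second, n)
```

Each of the three phrases occurs once and three phrases start with "a", so every n here is 1/3, as in the question.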
Reference answer A:
import re
from collections import Counter


def parserwords(sentence):
    # Return the adjacent word pairs found in one clause.
    preword = ''
    result = []
    for word in re.findall(r'\w+', sentence.lower()):
        if preword:
            result.append((preword, word))
        preword = word
    return result


context = """
Do you hear the people sing, singing a song of angry men.
It is the music of a people, who will not be slaves again,
when the beating of your heart echoes the beating of the drums.
There is a life about to start when tomorrow comes.
"""

words = []
# Split at commas and periods first so that no pair spans a punctuation mark.
for sentence in map(parserwords,
                    re.split(r'[,.]', context.lower())):
    words.extend(sentence)

# Number of pairs that start with each word.
prefixcounter = Counter([word[0] for word in words])
# Number of occurrences of each pair.
counter = Counter(words)

meter = {}
for pre, post in counter.keys():
    meter[(pre, post)] = 1. * counter[(pre, post)] / prefixcounter[pre]

# Sort by n descending, then by the word pair in alphabetical order.
result = sorted(meter.items(),
                key=lambda item: (-item[1], item[0])
                )
print(result[:5])

Reference answer B:
data = """Do you hear the people sing, singing a song of angry men. It is the music of a people, who will not be slaves again, when the beating of your heart echoes the beating of the drums. There is a life about to start when tomorrow comes."""
data = data.replace(',', '')
data = data.replace('.', '')
ws = data.split()
dic = {}  # count two words
ws2 = []  # two words
for i in range(len(ws) - 1):
    ws2.append(ws[i] + " " + ws[i + 1])
for w2 in ws2:
    if dic.get(w2) is None:
        dic[w2] = 1
    else:
        dic[w2] += 1
dic_first = {}  # count two words by first word
for w2 in ws2:
    (l, r) = w2.split()
    if dic_first.get(l) is None:
        dic_first[l] = 1
    else:
        dic_first[l] += 1
for w2 in ws2:  # output
    (l, r) = w2.split()
    print(w2, dic[w2], dic_first[l], dic[w2] / float(dic_first[l]))

Follow-up question:
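The final output still has some problems. It should print only the n value and the two-word phrase, ordered by n from largest to smallest and, where n is equal, alphabetically by phrase (a-z, with a first). How do I output just the first five?
Also, the words on either side of a period or comma must not form a phrase, so those phrases should be removed.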
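The ordering and the "first five" requested here can be sketched with a two-part sort key. The helper name `top_phrases` and the n values in `example` are made up for illustration only; they are not computed from the text.

```python
def top_phrases(n_values, k=5):
    """Return the k (phrase, n) pairs with the largest n; ties sorted a-z."""
    # Negating n sorts it in descending order, while the phrase itself
    # stays in ascending (alphabetical) order as the tie-breaker.
    ranked = sorted(n_values.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical n values, used only to show the resulting order.
example = {
    "of a": 0.25,
    "a song": 1 / 3,
    "do you": 1.0,
    "a life": 1 / 3,
    "it is": 1.0,
    "is the": 0.5,
}

for phrase, n in top_phrases(example):
    print(n, phrase)
```

Slicing after sorting keeps the helper simple; for a text this short there is little reason to use anything more elaborate.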
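Follow-up answer:
def count(ju, dic, dic_first):
    # Count the two-word phrases inside one clause, so that
    # no phrase spans a comma or a period.
    ws = ju.split()
    ws2 = []
    for i in range(len(ws) - 1):
        ws2.append(ws[i] + " " + ws[i + 1])
    for w2 in ws2:
        if dic.get(w2) is None:
            dic[w2] = 1
        else:
            dic[w2] += 1
    # Debugging pause; "a life" occurs only once in this text,
    # so this branch is never taken.
    if dic.get("a life") == 2:
        print(ws2)
        input()
    for w2 in ws2:
        (l, r) = w2.split()
        if dic_first.get(l) is None:
            dic_first[l] = 1
        else:
            dic_first[l] += 1


data = """Do you hear the people sing, singing a song of angry men. It is the music of a people, who will not be slaves again, when the beating of your heart echoes the beating of the drums. There is a life about to start when tomorrow comes."""
data = data.replace(',', '.')
jus = data.split('.')
dic = {}
dic_first = {}
ws2 = []
for ju in jus:
    count(ju, dic, dic_first)
out = []
for k in dic:  # output
    (l, r) = k.split()
    print(k, dic[k], dic_first[l])
    n = dic[k] / float(dic_first[l])
    out.append([n, k])
# Sort by n descending, then alphabetically by phrase.
out.sort(key=lambda o: (-o[0], o[1]))
for o in out:
    print(o[0], o[1])

This answer was accepted by the asker.

Reference answer C:
This doesn't seem very hard: 1. use split to cut the text into sentences, 2. use split to cut each sentence into words, 3. loop over the words to build the phrases and put them into a dict to count them.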
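The three steps above can be sketched as follows. This is one possible reading, not the answerer's own code: it assumes `re.split` for cutting at commas and periods in a single pass, and it lowercases the text so that "Do you" is counted as "do you", as in the question's examples.

```python
import re

text = ("Do you hear the people sing, singing a song of angry men. "
        "It is the music of a people, who will not be slaves again, "
        "when the beating of your heart echoes the beating of the drums. "
        "There is a life about to start when tomorrow comes.")

pair_counts = {}   # phrase -> number of occurrences
first_counts = {}  # first word -> number of phrases starting with it

# 1. Cut the text into clauses at commas and periods.
for clause in re.split(r"[,.]", text.lower()):
    # 2. Cut each clause into words.
    words = clause.split()
    # 3. Walk over adjacent words and count them in plain dicts.
    for first, second in zip(words, words[1:]):
        phrase = first + " " + second
        pair_counts[phrase] = pair_counts.get(phrase, 0) + 1
        first_counts[first] = first_counts.get(first, 0) + 1

# n for each phrase: its count divided by the count for its first word.
n_values = {
    phrase: count / first_counts[phrase.split()[0]]
    for phrase, count in pair_counts.items()
}

print(pair_counts["the beating"], n_values["a song"])
```

Because each clause is processed on its own, pairs such as "sing singing" that would straddle a comma never enter the counts.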