Machine Learning Algorithm Notes: Bayesian Algorithms, Spelling Correction Example, Spam Filtering Example

Posted by 豆子


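These are my own notes taken while watching video lectures. They are for personal use; I will remove them on request in case of infringement.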

(P(h): prior probability)

 

Implementing a Bayesian spell checker

In [1]:
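import re, collections

# Tokenize: lowercase the text and keep only runs of the letters a-z
def words(text): return re.findall('[a-z]+', text.lower())

# Count word frequencies; every count starts from 1 so that no word ends up with count 0
def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

# All strings that are exactly one edit away from word
def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion

# Known words that are two edits away from word
def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

# Keep only the words that occur in the corpus
def known(words): return set(w for w in words if w in NWORDS)

# Prefer the word itself if it is known, then known words at edit distance 1, then 2
def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    # .get avoids inserting an unknown word into NWORDS as a side effect of the lookup
    return max(candidates, key=lambda w: NWORDS.get(w, 1))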
In [6]: Try out the checker
# other inputs to try: appl, appla, learw, tess, morw
correct('knon')
Out[6]:
'know'
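To see where the candidates for 'knon' come from without needing big.txt, here is a small sketch that splits the four edit types apart. The helper name `edits1_parts` is hypothetical (it is not in the notes); it uses the same four list comprehensions as `edits1` above.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1_parts(word):
    """Hypothetical helper: the same four edit types as edits1, returned separately."""
    n = len(word)
    deletions = [word[0:i] + word[i+1:] for i in range(n)]
    transpositions = [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n - 1)]
    alterations = [word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet]
    insertions = [word[0:i] + c + word[i:] for i in range(n + 1) for c in alphabet]
    return deletions, transpositions, alterations, insertions

parts = edits1_parts('knon')

# For a word of length n: n + (n-1) + 26n + 26(n+1) = 54n + 25 raw strings
raw_total = sum(len(p) for p in parts)   # 4 + 3 + 104 + 130 = 241 for n = 4

# Putting everything in a set collapses duplicates (for example, 'knon' itself
# shows up once per position among the alterations)
candidates = set().union(*parts)

print(raw_total)
print('know' in candidates, 'known' in candidates, 'knot' in candidates)
```

Several real words ('know', 'known', 'knot') are one edit away from 'knon', so the choice between them has to come from the word counts in NWORDS.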

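Goal: argmax_c P(c|w) -> argmax_c P(w|c) P(c) / P(w)

Since P(w) is the same for every candidate c, it does not affect which c wins and can be dropped, leaving argmax_c P(w|c) P(c).

  • P(c): the probability that a correctly spelled word c appears in a text, i.e. how often c occurs in English writing
  • P(w|c): the probability that the user types w when they meant to type c, i.e. how likely it is that c gets mistyped as w
  • argmax_c: enumerate all possible c and pick the one with the highest probability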
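A minimal sketch of that argmax on toy numbers. Everything here is made up for illustration: the counts do not come from big.txt, and the error-model values in `p_w_given_c` are assumed (the checker in these notes does not estimate P(w|c) explicitly; it ranks candidates by edit distance first and by word count second).

```python
from collections import Counter

# Hypothetical corpus counts for three candidate corrections of w = 'knon'
counts = Counter({'know': 50, 'knot': 5, 'known': 20})
total = sum(counts.values())

# Hypothetical error model P(w|c): chance of typing 'knon' when c was intended
p_w_given_c = {'know': 0.01, 'knot': 0.01, 'known': 0.005}

def score(c):
    """P(w|c) * P(c); P(w) is omitted because it is identical for every c."""
    p_c = counts[c] / total
    return p_w_given_c[c] * p_c

best = max(counts, key=score)
print(best)
```

With these assumed numbers 'know' wins because it is both frequent and a plausible source of the typo.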
In [6]:
# Extract every word from the corpus, lowercase it, and strip special characters inside words
def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))
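A quick check of what the tokenizer does to punctuation and digits, on a made-up sentence (the input string is only an example, not from the corpus):

```python
import re

def words(text):
    # Same regex as in the notes: runs of the letters a-z after lowercasing
    return re.findall('[a-z]+', text.lower())

tokens = words("The Quick-brown fox's 2 dogs")
print(tokens)  # ['the', 'quick', 'brown', 'fox', 's', 'dogs']
```

Non-letter characters act as separators rather than being deleted in place, so "fox's" becomes two tokens and the digit disappears.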

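What happens when we meet a new word we have never seen? Suppose a word is spelled perfectly well but the corpus does not contain it, so it never appears in the training data. We would then report a probability of 0 for that word. That is a problem, because a probability of 0 means the event can never happen, whereas in our model we would rather assign such words a very small probability. This is what lambda: 1 is for: every count starts at 1, so a word that never occurs in the corpus still gets a small nonzero count, and a word seen k times is stored as k + 1.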
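The effect of the default count can be seen on a three-word toy corpus (the word list is made up for illustration):

```python
import collections

def train(features):
    # Same as in the notes: missing keys start at 1, then each sighting adds 1
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

model = train(['the', 'the', 'cat'])

seen_twice = model['the']            # 1 + 2 sightings = 3
seen_once = model['cat']             # 1 + 1 sighting  = 2

known_before = 'dog' in model        # membership test does not create the key
unseen = model['dog']                # default count 1 instead of 0
known_after = 'dog' in model         # indexing a defaultdict inserts the key

print(seen_twice, seen_once, unseen)
print(known_before, known_after)
```

One thing to watch: indexing a defaultdict with a missing key stores that key, so a lookup of an unknown word makes it look "known" afterwards. Using `model.get(word, 1)` reads the same default without modifying the dictionary.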

In [7]:
NWORDS
Out[7]:
defaultdict(<function __main__.train.<locals>.<lambda>>,
            {'counterorders': 2,
             'ureters': 3,
             'displeasure': 9,
             'omitted': 10,
             'sparrow': 5,
             'tubercle': 66,
             'curse': 7,
             'pauncefote': 2,
             'updated': 5,
             'gloomier': 4,
             'foremost': 17,
             'wabash': 2,
             'anarchists': 4,
             'intermediacy': 2,
             'threadbare': 2,
             'endeavouring': 9,
             'freeholders': 11,
             'irreproachably': 3,
             'ignominious': 3,
             'illuminated': 9,
             'galitsyn': 2,
             'struthers': 3,
             'shuya': 2,
             'futile': 16,
             'each': 412,
             'district': 38,
             'acquiesced': 2,
             'staircase': 14,
             'shamelessly': 2,
             'doubter': 2,
             'plumage': 3,
             'worming': 2,
             'militiamen': 30,
             'tombstones': 2,
             'presupposable': 2,
             'notable': 6,
             'louise': 5,
             'overtook': 17,
             'abstraction': 8,
             'displeased': 20,
             'ranchmen': 2,
             'instal': 2,
             'kashmir': 3,
             'nay': 4,
             'wired': 5,
             'pencil': 11,
             'mustache': 46,
             'breast': 87,
             'dioxide': 9,
             'disappointments': 4,
             'impassive': 6,
             'though': 651,
             'floridas': 7,
             'torban': 2,
             'combine': 11,
             'yawning': 7,
             'homeless': 4,
             'cinema': 2,
             'subjects': 68,
             'rib': 9,
             'bin': 3,
             'cylinders': 18,
             'bijou': 2,
             'acted': 38,
             'accepted': 88,
             'attainment': 11,
             'mustered': 8,
             'audacious': 2,
             'respectable': 15,
             'bilateral': 10,
             'coraco': 2,
             'stuffs': 2,
             'reheat': 2,
             'roberts': 3,
             'trenton': 6,
             'sharpening': 5,
             'component': 6,
             'pat': 4,
             'animation': 32,
             'coincidently': 5,
             'cy': 2,
             'smoker': 2,
             'manes': 3,
             'adelaide': 2,
             'prayer': 43,
             'industries': 65,
             'advantageously': 5,
             'dissolute': 3,
             'tendon': 130,
             'barton': 2,
             'ablest': 2,
             'episode': 12,
             'barges': 3,
             'sipping': 4,
             'inoperative': 2,
             'soap': 8,
             'padlocks': 2,
             'vagaries': 2,
             'potemkins': 3,
             'blackguard': 5,
             'smashed': 11,
             'bursitis': 17,
             'goes': 61,
             'prefix': 3,
             'shops': 23,
             'basketful': 2,
             'stepfather': 22,
             'veil': 17,
             'adorers': 2,
             'overhauled': 6,
             'liquors': 3,
             'bottoms': 3,
             'plastun': 2,
             'surest': 4,
             'carlton': 5,
             'friedland': 6,
             'alice': 14,
             'unhealthy': 15,
             'cannula': 9,
             'eleven': 22,
             'persuasions': 3,
             'cawolla': 2,
             'elephants': 2,
             'mechanicks': 2,
             'kitten': 8,
             'promotes': 2,
             'venae': 2,
             'matt': 2,
             'private': 94,
             'essential': 93,
             'creating': 25,
             'exclaiming': 5,
             'extent': 100,
             'oxidising': 2,
             'dessicans': 3,
             'uplands': 4,
             'tops': 4,
             'jerky': 6,
             'irregularity': 6,
             'recruitment': 3,
             'fringes': 17,
             'shopkeepers': 7,
             'tendencies': 16,
             'unconditionally': 3,
             'brandy': 16,
             'camberwell': 3,
             'statue': 9,
             'metatarsal': 9,
             'measurement': 3,
             'enclosures': 2,
             'suspecting': 4,
             'noses': 7,
             'standard': 55,
             'inspection': 19,
             'enterprising': 6,
             'freak': 4,
             'liberating': 2,
             'ordeal': 3,
             'pancras': 2,
             'luxury': 9,
             'livery': 3,
             'anconeus': 2,
             'polypus': 4,
             'leapt': 3,
             'liberally': 2,
             'finish': 50,
             'previously': 56,
             'mccarthy': 38,
             'mallet': 6,
             'bluestocking': 3,
             'conveyance': 8,
             'transformer': 2,
             'compel': 10,
             'blasphemies': 3,
             'suggest': 25,
             'shares': 4,
             'dishonoured': 4,
             'hen': 7,
             'vols': 28,
             'narcotisation': 2,
             'speranski': 80,
             'cherished': 15,
             'overcoat': 27,
             'malbrook': 2,
             'nephroma': 2,
             'habeus': 2,
             'coward': 9,
             'widower': 5,
             'extremely': 52,
             'resembling': 53,
             'understood': 223,
             'impetus': 10,
             'actinomyces': 10,
             'eosinophile': 4,
             'pronounce': 10,
             'arrangements': 30,
             'inevitably': 33,
             'hochgeboren': 2,
             'crusted': 3,
             'weeks': 118,
             'slightest': 26,
             'fords': 2,
             'stimulatingly': 2,
             'economically': 3,
             'thrice': 9,
             'peg': 5,
             'adventurous': 4,
             'mountainous': 3,
             'potch': 2,
             'adults': 27,
             'kindled': 11,
             'have': 3494,
             'sedate': 3,
             'democrats': 94,
             'vaginitis': 2,
             'foo': 2,
             'headgear': 2,
             'gape': 8,
             'reassigned': 2,
             'incompletely': 2,
             'pharmacopoeial': 2,
             'feelings': 79,
             'phone': 3,
             'anger': 60,
             'improvisations': 2,
             'dethrone': 2,
             'toothed': 2,
             'sweetish': 2,
             'tack': 4,
             'unwinding': 3,
             'pediculosis': 2,
             'overfed': 2,
             'rabble': 8,
             'opsonins': 4,
             'ver': 3,
             'postures': 3,
             'entertainment': 8,
             'unkind': 5,
             'lightest': 3,
             'undergone': 10,
             'persons': 120,