Learning Machine Learning: Naive Bayes and Text Classification
Posted by GHBD
This is the 27th article from GHBD, a series promoting the development of medical big data and artificial intelligence.
Source: Commonlounge
Naive Bayes
Naive Bayes is a widely used classification algorithm. It is a supervised learning algorithm based on Bayes' theorem. The word "naive" comes from the assumption of independence among features: if our input vector is (x1, x2, ..., xn), then the xi's are conditionally independent given y.
Deriving the Algorithm
Let's start with Bayes' theorem (for naive Bayes, x is the input and y is the output):

P(y | x) = P(x | y) P(y) / P(x)
When we have more than one feature, we can rewrite Bayes' theorem as:

P(y | x1, ..., xn) = P(x1, ..., xn | y) P(y) / P(x1, ..., xn)
Since we are assuming the xi's are conditionally independent given y, we can rewrite the above as:

P(y | x1, ..., xn) = P(y) P(x1 | y) P(x2 | y) ... P(xn | y) / P(x1, ..., xn)
But we also know that P(x1, x2, ..., xn) is a constant for a given input, so:

P(y | x1, ..., xn) ∝ P(y) P(x1 | y) P(x2 | y) ... P(xn | y)    (1)

We will refer to this equation as (1) below.
Notice that:

· The left-hand side is the term we are interested in: the probability distribution of the output y given the input x.
· P(y) can be estimated by counting the number of times each class y appears in our training data (this is called Maximum a Posteriori estimation).
· P(xi | y) can be estimated by counting the number of times each value of xi appears for each class y in our training data.
Pseudocode
Training
· Estimate P(y): P(y = t) = (number of times class t appears in the dataset) / (size of the dataset)
· Estimate P(xi | y): P(xi = k | y = t) = (number of times xi has value k and y has value t) / (number of data points of class t)
Predicting
· Estimate P(y | x1, ..., xn): use the estimated values of P(y) and P(xi | y) in equation (1). Thereafter, normalize the values.
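The training and prediction steps above can be sketched in plain Python for categorical features. This is a minimal illustration, not the original tutorial's code; the function and variable names (train, predict, X, y) are made up here.

```python
from collections import Counter, defaultdict

def train(X, y):
    """Estimate P(y) and P(xi | y) by counting, as in the pseudocode."""
    n = len(y)
    class_counts = Counter(y)                              # times each class t appears
    prior = {t: c / n for t, c in class_counts.items()}    # P(y = t)
    # cond[(i, k, t)] = number of times feature i has value k while y == t
    cond = defaultdict(int)
    for xs, t in zip(X, y):
        for i, k in enumerate(xs):
            cond[(i, k, t)] += 1
    likelihood = {key: c / class_counts[key[2]] for key, c in cond.items()}
    return prior, likelihood

def predict(xs, prior, likelihood):
    """Score each class with equation (1), then normalize."""
    scores = {}
    for t, p in prior.items():
        s = p
        for i, k in enumerate(xs):
            s *= likelihood.get((i, k, t), 0.0)            # P(xi = k | y = t)
        scores[t] = s
    total = sum(scores.values()) or 1.0
    return {t: s / total for t, s in scores.items()}       # normalized P(y | x)

# Toy data: features are (height band, weight band), classes "m"/"f"
X = [("tall", "heavy"), ("tall", "heavy"), ("short", "light"), ("short", "light")]
y = ["m", "m", "f", "f"]
prior, likelihood = train(X, y)
print(predict(("tall", "heavy"), prior, likelihood))
```

Note that a practical implementation would also smooth the counts (e.g. Laplace smoothing) so that an unseen feature value does not zero out a whole class.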
Variants
There are several variants of naive Bayes that use different probability distributions for P(xi | y), such as the Gaussian distribution (Gaussian naive Bayes), the multinomial distribution (multinomial naive Bayes), and the Bernoulli distribution (Bernoulli naive Bayes).
Scikit-learn implementation
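A minimal sketch of how one of these variants can be used via scikit-learn's GaussianNB; the height/weight toy data below is invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy (height cm, weight kg) data for two classes
X = np.array([[180, 80], [175, 77], [160, 55], [158, 52]])
y = np.array([1, 1, 0, 0])

model = GaussianNB()          # swap in MultinomialNB / BernoulliNB for other variants
model.fit(X, y)               # estimates P(y) and per-class Gaussian P(xi | y)
print(model.predict([[178, 79]]))        # predicted class for a new point
print(model.predict_proba([[178, 79]]))  # normalized P(y | x)
```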
Applications
Naive Bayes is one of the simplest yet most effective algorithms for:
· Text classification: for example, we have a number of news articles, and we want to learn to classify whether an article is about politics, health, technology, sports, or lifestyle.
· Spam filtering: we have a number of emails, and we want to learn to classify whether an email is spam or not.
· Gender classification: given features such as height, weight, etc., predict whether a person is male or female.
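As a sketch of the spam-filtering application, a multinomial naive Bayes model over word counts can be wired up with a CountVectorizer; the tiny document set below is invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus: two spam and two non-spam ("ham") messages
docs = ["win a free prize now", "free money win big",
        "meeting agenda for tomorrow", "project status report attached"]
labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer turns each document into word counts;
# MultinomialNB estimates P(word | class) from those counts
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(docs, labels)
print(clf.predict(["free prize money"]))
```

The same pipeline applies unchanged to the news-article example: only the documents and the class labels (politics, health, technology, ...) differ.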
Learning Machine Learning: From Beginner to Expert series
This series contains 25 tutorials on machine learning — think of it as a free online library.
You will learn core machine learning concepts, algorithms, and applications.
Everything is 100% free; you are welcome to follow along and discuss.