Naive Bayes algorithms in sklearn

Posted by bitcarmanlee

1. Overview

sklearn provides five naive Bayes variants in total. If you open the module's source file, you will find the following near the top:

__all__ = ['BernoulliNB', 'GaussianNB', 'MultinomialNB', 'ComplementNB',
           'CategoricalNB']

These five classes are the complete set of algorithms.

2. GaussianNB

As the name GaussianNB suggests, this variant is tied to the Gaussian distribution. If the raw features are continuous and roughly Gaussian, GaussianNB is a good choice; typical examples are data such as salaries, or people's heights and weights, which are approximately normally distributed.

class GaussianNB(_BaseNB):
    """
    Gaussian Naive Bayes (GaussianNB)

    Can perform online updates to model parameters via :meth:`partial_fit`.
    For details on algorithm used to update feature means and variance online,
    see Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque:

        http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf

    Read more in the :ref:`User Guide <gaussian_naive_bayes>`.

    Parameters
    ----------
    priors : array-like, shape (n_classes,)
        Prior probabilities of the classes. If specified the priors are not
        adjusted according to the data.

    var_smoothing : float, optional (default=1e-9)
        Portion of the largest variance of all features that is added to
        variances for calculation stability.
...............

    def __init__(self, priors=None, var_smoothing=1e-9):
        self.priors = priors
        self.var_smoothing = var_smoothing

GaussianNB has two parameters, or arguably just one: priors. priors specifies the prior probability of each class; if it is not given, the priors are estimated from the data. The other parameter, var_smoothing, simply adds a portion of the largest feature variance to all variances for numerical stability.
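As a quick sketch of how GaussianNB is used, the snippet below fits it on the iris data set, a stand-in here for continuous, roughly Gaussian features; the split ratio and random seed are arbitrary choices:

```python
# GaussianNB on continuous features: priors are estimated from the
# training data because the priors argument is left as None.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = GaussianNB()            # default var_smoothing=1e-9
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # held-out accuracy
print(clf.class_prior_)            # priors learned from the data
```

The learned `class_prior_` attribute shows the per-class priors that were computed from the training set.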

3. MultinomialNB

class MultinomialNB(_BaseDiscreteNB):
    """
    Naive Bayes classifier for multinomial models

    The multinomial Naive Bayes classifier is suitable for classification with
    discrete features (e.g., word counts for text classification). The
    multinomial distribution normally requires integer feature counts. However,
    in practice, fractional counts such as tf-idf may also work.

MultinomialNB suits discrete features that follow a multinomial distribution, such as word counts in text classification. The docstring also notes specifically that fractional counts such as tf-idf may work in practice as well.

    Parameters
    ----------
    alpha : float, optional (default=1.0)
        Additive (Laplace/Lidstone) smoothing parameter
        (0 for no smoothing).

    fit_prior : boolean, optional (default=True)
        Whether to learn class prior probabilities or not.
        If false, a uniform prior will be used.

    class_prior : array-like, size (n_classes,), optional (default=None)
        Prior probabilities of the classes. If specified the priors are not
        adjusted according to the data.
......

    def __init__(self, alpha=1.0, fit_prior=True, class_prior=None):
        self.alpha = alpha
        self.fit_prior = fit_prior
        self.class_prior = class_prior
.......
 

Three parameters stand out in the source: alpha, fit_prior, and class_prior. alpha is the additive (Laplace/Lidstone) smoothing applied when estimating probabilities, fit_prior controls whether the class priors are learned, and class_prior lets you supply the priors directly; if it is not given, they are computed from the data set.
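A minimal sketch of MultinomialNB on word counts follows; the tiny corpus and its spam/ham labels are made up purely for illustration:

```python
# MultinomialNB on discrete word-count features produced by
# CountVectorizer; alpha=1.0 is plain Laplace smoothing.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free prize money now", "meeting at noon tomorrow",
        "win free cash prize", "schedule the noon meeting"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = ham (invented labels)

vec = CountVectorizer()
X = vec.fit_transform(docs)         # sparse matrix of word counts
clf = MultinomialNB(alpha=1.0)      # fit_prior=True by default
clf.fit(X, labels)
print(clf.predict(vec.transform(["free cash now"])))
```

Note that the same fitted vectorizer must be reused to transform new documents, so that the feature columns line up.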

4. BernoulliNB

class BernoulliNB(_BaseDiscreteNB):
    """Naive Bayes classifier for multivariate Bernoulli models.

    Like MultinomialNB, this classifier is suitable for discrete data. The
    difference is that while MultinomialNB works with occurrence counts,
    BernoulliNB is designed for binary/boolean features.
 ......

    def __init__(self, alpha=1.0, binarize=.0, fit_prior=True,
                 class_prior=None):
        self.alpha = alpha
        self.binarize = binarize
        self.fit_prior = fit_prior
        self.class_prior = class_prior

From the docstring it is easy to see that the only difference between BernoulliNB and MultinomialNB is the kind of feature each expects: MultinomialNB works with occurrence counts such as word frequencies, while BernoulliNB uses boolean features, i.e. whether a word appears at all.
BernoulliNB can work well on short texts where a few highly discriminative keywords separate the classes.
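To sketch this, the snippet below feeds raw word counts to BernoulliNB and lets its binarize parameter turn them into presence/absence features; the short review corpus and labels are invented for the example:

```python
# BernoulliNB with binarize=0.0: any count greater than 0 becomes 1,
# so the model only sees whether each word occurred, not how often.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

docs = ["great great movie", "terrible plot",
        "great acting", "terrible terrible film"]
labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative (invented)

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = BernoulliNB(alpha=1.0, binarize=0.0)
clf.fit(X, labels)
print(clf.predict(X))
```

Unlike MultinomialNB, BernoulliNB also penalizes the absence of features, which is why it tends to behave differently on longer documents.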

5. ComplementNB

class ComplementNB(_BaseDiscreteNB):
    """The Complement Naive Bayes classifier described in Rennie et al. (2003).

    The Complement Naive Bayes classifier was designed to correct the "severe
    assumptions" made by the standard Multinomial Naive Bayes classifier. It is
    particularly suited for imbalanced data sets.
......

ComplementNB was designed to correct the "severe assumptions" made by the standard MultinomialNB, and the last sentence of the docstring states its main use case: it is particularly suited for imbalanced data sets.
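A minimal sketch on a deliberately imbalanced word-count problem follows; the corpus is made up, with the first class heavily over-represented:

```python
# ComplementNB estimates each class's parameters from the *complement*
# of that class, which makes it more robust when classes are imbalanced.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB

docs = ["ball game score", "ball team win", "game score team",
        "win score ball", "stock market crash"]
labels = [0, 0, 0, 0, 1]   # 4 sports docs vs. 1 finance doc (invented)

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = ComplementNB(alpha=1.0)
clf.fit(X, labels)
print(clf.predict(X))
```

Its interface mirrors MultinomialNB, so switching between the two to compare results on a skewed data set is a one-line change.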

6. CategoricalNB

class CategoricalNB(_BaseDiscreteNB):
    """Naive Bayes classifier for categorical features

    The categorical Naive Bayes classifier is suitable for classification with
    discrete features that are categorically distributed. The categories of
    each feature are drawn from a categorical distribution.
......

As the docstring says, CategoricalNB suits classification with discrete features that are categorically distributed, where the values of each feature are assumed to be drawn from a categorical distribution.
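As a final sketch, the snippet below fits CategoricalNB on made-up integer category codes (e.g. column 0 could encode color, column 1 shape); CategoricalNB expects each feature column to contain non-negative integer category indices, so real string-valued features would first go through something like OrdinalEncoder:

```python
# CategoricalNB on two categorical features encoded as integer codes.
import numpy as np
from sklearn.naive_bayes import CategoricalNB

X = np.array([[0, 1], [1, 2], [2, 0],
              [0, 2], [1, 1], [2, 2]])   # invented category codes
y = np.array([0, 1, 1, 0, 0, 1])

clf = CategoricalNB(alpha=1.0)
clf.fit(X, y)
print(clf.predict(np.array([[0, 1]])))
```

Unlike MultinomialNB, the integer codes here are treated as unordered category labels, not as counts.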
