[TODO] [scikit-learn translation] 4.2.3 Text feature extraction
Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length.
In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:

- tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.
- counting the occurrences of tokens in each document.
- normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.
In this scheme, features and samples are defined as follows:

- each individual token occurrence frequency (normalized or not) is treated as a feature.
- the vector of all the token frequencies for a given document is considered a multivariate sample.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total, while each document will use 100 to 1000 unique words individually.
In order to be able to store such a matrix in memory but also to speed up algebraic matrix / vector operations, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.
CountVectorizer implements both tokenization and occurrence counting in a single class:
This model has many parameters; however, the default values are quite reasonable (please see the reference documentation for the details):
Let's use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:
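A minimal sketch of this step. The exact sentences are illustrative; the toy corpus is chosen to match the description later in this section, where the first and last documents contain exactly the same words and the last one is a question:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer()
>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X  # a scipy.sparse matrix with one row per document and one column per unique word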
The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does this step can be requested explicitly:
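Continuing the sketch above, the analyzer can be obtained from the fitted vectorizer and applied to an (illustrative) string:

>>> analyze = vectorizer.build_analyzer()
>>> analyze("This is a text document to analyze.")  # the single-letter word "a" is dropped
['this', 'is', 'text', 'document', 'to', 'analyze']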
Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:
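For instance, continuing the sketch above (older scikit-learn releases expose this as get_feature_names() rather than get_feature_names_out()):

>>> vectorizer.get_feature_names_out()  # get_feature_names() in older releases
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this'], dtype=object)
>>> X.toarray()  # rows are documents, columns follow the vocabulary order above
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])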
The converse mapping from feature name to column index is stored in the vocabulary_ attribute of the vectorizer:
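For example, with the toy corpus sketched above:

>>> vectorizer.vocabulary_.get('document')
1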
Hence words that were not seen in the training corpus will be completely ignored in future calls to the transform method:
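A sketch, using an illustrative sentence made entirely of unseen words:

>>> vectorizer.transform(['Something completely new.']).toarray()  # every column is zero
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]])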
Note that in the previous corpus, the first and the last documents have exactly the same words, hence are encoded in equal vectors. In particular we lose the information that the last document is an interrogative form. To preserve some of the local ordering information we can extract 2-grams of words in addition to the 1-grams (individual words):
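A sketch of such a vectorizer; the token_pattern shown here also keeps single-character words, which the default pattern would drop:

>>> bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
...                                     token_pattern=r'\b\w+\b')
>>> analyze = bigram_vectorizer.build_analyzer()
>>> analyze('Bi-grams are cool!')
['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']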
The vocabulary extracted by this vectorizer is hence much bigger and can now resolve ambiguities encoded in local positioning patterns:
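Continuing the sketch, refitting on the same toy corpus:

>>> X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
>>> X_2.shape  # 21 unigram + bigram features instead of the 9 unigrams above
(4, 21)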
In particular the interrogative form "Is this" is only present in the last document:
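For instance, the column for the bigram "is this" can be inspected as follows (continuing the sketch above):

>>> feature_index = bigram_vectorizer.vocabulary_.get('is this')
>>> X_2[:, feature_index]  # non-zero only for the last (interrogative) document
array([0, 0, 0, 1])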
In a large text corpus, some words will be very frequent (e.g. "the", "a", "is" in English), hence carrying very little meaningful information about the actual contents of the document. If we were to feed the direct count data directly to a classifier, those very frequent terms would shadow the frequencies of rarer yet more interesting terms.
In order to re-weight the count features into floating point values suitable for usage by a classifier, it is very common to use the tf–idf transform.
Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency: $\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$.

Using the TfidfTransformer's default settings, TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False), the term frequency, the number of times a term occurs in a given document, is multiplied with the idf component, which is computed as

$\text{idf}(t) = \log{\frac{1 + n}{1 + \text{df}(t)}} + 1$,

where $n$ is the total number of documents, and $\text{df}(t)$ is the number of documents that contain term $t$. The resulting tf-idf vectors are then normalized by the Euclidean norm:

$v_{norm} = \frac{v}{\|v\|_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}$.
This was originally a term weighting scheme developed for information retrieval (as a ranking function for search engine results) that has also found good use in document classification and clustering.

The following sections contain further explanations and examples that illustrate how the tf-idfs are computed exactly and how the tf-idfs computed in scikit-learn's TfidfTransformer and TfidfVectorizer differ slightly from the standard textbook notation that defines the idf as

$\text{idf}(t) = \log{\frac{n}{1 + \text{df}(t)}}$.

In the TfidfTransformer and TfidfVectorizer with smooth_idf=False, the "1" count is added to the idf instead of the idf's denominator:

$\text{idf}(t) = \log{\frac{n}{\text{df}(t)}} + 1$.
This normalization is implemented by the TfidfTransformer class:
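A minimal sketch; smooth_idf=False is used here so that the transformer matches the non-smoothed idf formula just given:

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer(smooth_idf=False)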
Again please see the reference documentation for the details on all the parameters.
Let's take an example with the following counts. The first term is present 100% of the time, hence not very interesting. The two other features are present in less than 50% of the documents, hence probably more representative of the content of the documents:
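A sketch of such a count matrix and its transformation; the numbers are chosen to match the description above (the first term appears in every document, the other two in fewer than half) and the worked computation that follows:

>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf.toarray()  # 6 documents x 3 features, tf-idf weighted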
Each row is normalized to have unit Euclidean norm:

$v_{norm} = \frac{v}{\|v\|_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}$

For example, we can compute the tf-idf of the first term in the first document in the counts array as follows:

$n = 6$

$\text{df}(t)_{\text{term1}} = 6$

$\text{idf}(t)_{\text{term1}} = \log{\frac{n}{\text{df}(t)}} + 1 = \log(1) + 1 = 1$

$\text{tf-idf}_{\text{term1}} = \text{tf} \times \text{idf} = 3 \times 1 = 3$

Now, if we repeat this computation for the remaining 2 terms in the document, we get

$\text{tf-idf}_{\text{term2}} = 0 \times (\log(6/1) + 1) = 0$

$\text{tf-idf}_{\text{term3}} = 1 \times (\log(6/2) + 1) \approx 2.0986$

and the vector of raw tf-idfs:

$\text{tf-idf}_{\text{raw}} = [3, 0, 2.0986]$.

Then, applying the Euclidean (L2) norm, we obtain the following tf-idfs for document 1:

$\frac{[3, 0, 2.0986]}{\sqrt{3^2 + 0^2 + 2.0986^2}} = [0.819, 0, 0.573]$.

Furthermore, the default parameter smooth_idf=True adds "1" to the numerator and denominator as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions:

$\text{idf}(t) = \log{\frac{1 + n}{1 + \text{df}(t)}} + 1$

Using this modification, the tf-idf of the third term in document 1 changes to 1.8473:

$\text{tf-idf}_{\text{term3}} = 1 \times (\log(7/3) + 1) \approx 1.8473$

And the L2-normalized tf-idf vector for document 1 changes to

$\frac{[3, 0, 1.8473]}{\sqrt{3^2 + 0^2 + 1.8473^2}} = [0.8515, 0, 0.5243]$.
The weights of each feature computed by the fit method call are stored in a model attribute:
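For example, continuing the sketch above:

>>> transformer.idf_  # one learned idf weight per feature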
As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model:
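A minimal sketch of the combined vectorizer applied to the toy corpus from earlier:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf_vectorizer = TfidfVectorizer()
>>> tfidf_vectorizer.fit_transform(corpus)  # tokenization, counting and tf-idf weighting in one step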
While the tf–idf normalization is often very useful, there might be cases where the binary occurrence markers offer better features. This can be achieved by using the binary parameter of CountVectorizer. In particular, some estimators such as Bernoulli Naive Bayes explicitly model discrete boolean random variables. Also, very short texts are likely to have noisy tf–idf values while the binary occurrence info is more stable.
As usual, the best way to adjust the feature extraction parameters is to use a cross-validated grid search, for instance by pipelining the feature extractor with a classifier:
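A sketch of such a pipeline; the classifier, the parameter grid, and the train_docs / train_labels variables are illustrative placeholders, not prescribed by the text above:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.linear_model import SGDClassifier
>>> pipeline = Pipeline([
...     ('vect', TfidfVectorizer()),
...     ('clf', SGDClassifier()),
... ])
>>> parameters = {
...     'vect__ngram_range': [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
...     'vect__use_idf': (True, False),
...     'clf__alpha': (1e-2, 1e-3),
... }
>>> grid_search = GridSearchCV(pipeline, parameters, cv=5)
>>> # grid_search.fit(train_docs, train_labels)  # placeholders for a labelled text dataset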
Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding. To work with text files in Python, their bytes must be decoded to a character set called Unicode. Common encodings are ASCII, Latin-1 (Western Europe), KOI8-R (Russian) and the universal encodings UTF-8 and UTF-16. Many others exist.
Note
An encoding can also be called a ‘character set’, but this term is less accurate: several encodings can exist for a single character set.
The text feature extractors in scikit-learn know how to decode text files, but only if you tell them what encoding the files are in. The CountVectorizer takes an encoding parameter for this purpose. For modern text files, the correct encoding is probably UTF-8, which is therefore the default (encoding="utf-8").
If the text you are loading is not actually encoded with UTF-8, however, you will get a UnicodeDecodeError. The vectorizers can be told to be silent about decoding errors by setting the decode_error parameter to either "ignore" or "replace". See the documentation for the Python function bytes.decode for more details (type help(bytes.decode) at the Python prompt).
If you are having trouble decoding text, here are some things to try:
For example, the following snippet uses chardet (not shipped with scikit-learn, must be installed separately) to figure out the encoding of three texts. It then vectorizes the texts and prints the learned vocabulary. The output is not shown here.
>>> import chardet
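>>> # The byte strings below are illustrative placeholders, not the texts from the
>>> # original snippet: one Latin-1, one UTF-8 and one UTF-16 encoded sentence.
>>> text1 = "Ceci est un texte encodé en latin-1, avec des caractères accentués.".encode("latin-1")
>>> text2 = "Ceci est un texte encodé en utf-8, avec des caractères accentués.".encode("utf-8")
>>> text3 = "Ceci est un texte encodé en utf-16, avec des caractères accentués.".encode("utf-16")
>>> decoded = [x.decode(chardet.detect(x)['encoding'])
...            for x in (text1, text2, text3)]
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> v = CountVectorizer().fit(decoded).vocabulary_
>>> print(sorted(v))  # the learned vocabulary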
(Depending on the version of chardet , it might get the first one wrong.)
For an introduction to Unicode and character encodings in general, see Joel Spolsky’s Absolute Minimum Every Software Developer Must Know About Unicode .