Hands-on email classification (Getting Started with NLP, chap2: Your first NLP example)

Posted by 临风而眠

《Getting Started with NLP》chap2:Your first NLP example

This book seems like a good fit for a beginner like me. The notes below also double as English practice, so most of them are excerpted or summarized in English.

文章目录

  • This chapter covers
    • Implementing your first practical NLP application from scratch
    • Structuring an NLP project from beginning to end
    • Exploring NLP concepts, including tokenization and text normalization
    • Applying a machine learning algorithm to textual data

Yesterday I read, in 《机器学习实战》, the part that walks through the ML landscape using spam filtering as the running example, so today I'm following this chapter to put it into practice.

2.1 Introducing NLP in practice: Spam filtering

  • Spam filtering exemplifies a widespread family of tasks: text classification.

    exemplify: to be or give a typical example of something

classification

  • The author gives many examples drawn from everyday life to illustrate the usefulness of classification. To summarize 👇

    • Classification helps us reason about things and adjust our behavior based on them.
    • Classification allows us to group things into categories, making it easier to deal with individual instances.
  • We refer to the name of each class as a class label

    • When we are dealing with two labels only, it is called binary classification
    • Classification that implies more than two classes is called multiclass classification
  • How do we perform classification?

    • Classification is performed based on certain characteristics or features of the concepts being classified.
    • The specific characteristics or features used in classification depend on the task at hand.
    • In machine-learning terms, we call such characteristics “features”.

  • The author's summary:
    • Classification refers to the process of identifying which category or class among the set of categories (classes) an observation belongs to based on its properties. In machine learning, such properties are called features and the class names are called class labels. If you classify observations into two classes, you are dealing with binary classification; tasks with more than two classes are examples of multiclass classification.

A simple example

The classic if/else (or switch) exercise from the first weeks of freshman year, where you take a student's score as input and output a grade, is already a simple classification.

The author also gives a very simple program for this:

For example, you can make the machine print out a warning that water is hot based on a simple threshold of 45°C (113°F), as listing 2.1 suggests. In this code, you define a function print_warning, which takes water temperature as input and prints out water status. The if statement checks if the input temperature is above a predefined threshold and prints out a warning message if it is. In this case, since the temperature is above 45°C, the code prints out Caution: Hot water!

# Simple code to tell whether water is cold or hot
def print_warning(temperature):
    if temperature >= 45:
        print("Caution: Hot water!")
    else:
        print("You may use water as usual")

print_warning(46)

Nicely written; the author uses this example to transition into ML and supervised learning.

When there are multiple factors to consider in classification, it is better to let the machine learn the rules and patterns from data rather than hardcoding them. Machine learning involves providing a machine with examples and a general outline of a task, and allowing it to learn to solve the task independently. In supervised machine learning, the machine is given labeled data and learns to classify based on the provided features.

Supervised machine learning

Supervised machine learning refers to a family of machine-learning tasks in which the algorithm learns the correspondences between an input and an output based on the provided labeled examples. Classification is an example of a supervised machine learning task, where the algorithm tries to learn the mapping between the input data and the output class label.

Using the cold or hot water example, we can provide the machine with the samples of water labeled hot and samples of water labeled cold, tell it to use temperature as the predictive factor (feature), and this way let it learn independently from the provided data that the boundary between the two classes is around 45°C (113°F)
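
As a quick illustration of that idea (my own sketch, not from the book; it assumes scikit-learn is available and uses a DecisionTreeClassifier as the learner), we can hand the machine labeled temperature samples and let it work out the boundary on its own:

from sklearn.tree import DecisionTreeClassifier

# Each sample has a single feature: the water temperature in degrees Celsius
X = [[20], [30], [40], [44], [46], [50], [60], [70]]
# The labels we provide: everything below ~45°C is "cold", everything above is "hot"
y = ["cold", "cold", "cold", "cold", "hot", "hot", "hot", "hot"]

classifier = DecisionTreeClassifier().fit(X, y)
# The learned decision boundary lies between 44 and 46, i.e. around 45°C
print(classifier.predict([[43], [47]]))  # expected: ['cold' 'hot']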

2.2 Understanding the task

The author sets up a scenario (a description of possible actions or events in the future):

Consider the following scenario: you have a collection of spam and normal emails from the past. You are tasked with building a spam filter, which for any future incoming email can predict whether this email is spam or not. Consider these questions:

  • How can you use the provided data?
  • What characteristics of the emails might be particularly useful, and how will you extract them?
  • What will be the sequence of steps in this application?

In this section, we will discuss this scenario and look into the implementation steps. In total, the pipeline for this task will consist of five steps.

The pipeline👇

Step 1: Define the data and classes

“normal” emails are sometimes called “ham” in the spam-detection context

First, you need to ask yourself what format the email messages are delivered in for this task. For instance, in a real-life situation, you might need to extract the messages from the mail agent application. However, for simplicity, let’s assume that someone has extracted the emails for you and stored them in text format. The normal emails are stored in a separate folder—let’s call it Ham, and spam emails are stored in a Spam folder.

If someone has already predefined past spam and ham emails for you (e.g., by extracting these emails from the INBOX and SPAM box), you don’t need to bother with labeling them. However, you still need to point the machine-learning algorithm at the two folders by clearly defining which one is ham and which one is spam. This way, you will define the class labels and identify the number of classes for the algorithm. This should be the first step in your spam-detection pipeline (and in any text-classification pipeline), after which you can preprocess the data, extract the relevant information, and then train and test your algorithm (figure 2.5).

You can set step 1 of your algorithm as follows: Define which data represents “ham” class and which data represents “spam” class for the machine-learning algorithm.

Step 2: Split the text into words

  • Next, you will need to define the features for the machine to know what type of information, or what properties of the emails to pay attention to, but before you can do that, there is one more step to perform – split the text into words.

Why split into words?

  • The content of an email can be useful for identifying spam, but using the entire email as a single feature may not work well because even small changes can affect it. Instead, smaller chunks of text like individual words might be more effective features because they are more likely to carry spam-related information and are repetitive enough to appear in multiple emails.

How to split into words (exercise 2.2)

  • For a machine, the text comes in as a sequence of symbols, so the machine does not have an idea of what a word is.

    • How would you define what a word is from the human perspective?
    • How would you code this for a machine?

    For example, how will you split the following sentence into words? “Define which data represents each class for the machine learning algorithm.”

split text string into words by whitespaces

  • The first solution might be “Words are sequences of characters separated by whitespaces.”

  • simple code to split text string into words by whitespaces

    text = "Define which data represents each class for the machine learning algorithm"
    text.split(" ")
    
    ['Define',
     'which',
     'data',
     'represents',
     'each',
     'class',
     'for',
     'the',
     'machine',
     'learning',
     'algorithm']
    
  • So far, so good. However, what happens to this strategy when we have punctuation marks? For example: “Define which data represents “ham” class and which data represents “spam” class for the machine learning algorithm.”

    ['Define',
     'which',
     'data',
     'represents',
     '“ham”',
     'class',
     'and',
     'which',
     'data',
     'represents',
     '“spam”',
     'class',
     'for',
     'the',
     'machine',
     'learning',
     'algorithm.']
    

    In the list of words, are [“ham”], [“spam”], and [algorithm.] any different from [ham], [spam], and [algorithm]? That is, the same words but without the punctuation marks attached to them? The answer is, these words are exactly the same, but because you are only splitting by whitespaces at the moment, there is no way of taking the punctuation marks into account.

    However, each sentence will likely include one full stop (.), question (?), or exclamation mark (!) attached to the last word, and possibly more punctuation marks inside the sentence itself, so this is going to be a problem for properly extracting words from text. Ideally, you would like to be able to extract words and punctuation marks separately.

split text string into words by whitespaces and punctuation

  • So how do we take punctuation into account? The book gives a very detailed description of the algorithm:
  1. Store words list and a variable that keeps track of the current word—let’s call it current_word for simplicity.
  2. Read text character by character:
    • If a character is a whitespace, add the current_word to the words list and update the current_word variable to be ready to start a new word.
    • Else if a character is a punctuation mark:
      • If the previous character is not a whitespace, add the current_word to the words list; then add the punctuation mark as a separate word token, and update the current_word variable.
      • Else if the previous character is a whitespace, just add the punctuation mark as a separate word token.
    • Otherwise (the character is neither a whitespace nor a punctuation mark), add it to the current_word.

orz, the figure in the book illustrates this really well

  • Code to split text string into words by whitespaces and punctuation

    delimiter: a character that marks the boundary between parts of a text

    text = "Define which data represents “ham” class and which data represents “spam” class for the machine learning algorithm."
    delimiters = ['"', "."]
    words = []
    current_word = ""
    for char in text:
        if char==" ":
            if not current_word=="":
                words.append(current_word)
                current_word = ""
        elif char in delimiters:
            if current_word=="":  # current_word being empty means the previous character was a whitespace, so just add the punctuation mark to the list
                words.append(char)
            else:
                words.append(current_word)
                words.append(char)
                current_word = ""
        else:
            current_word += char
    
    print(words)
    
    ['Define', 'which', 'data', 'represents', '“ham”', 'class', 'and', 'which', 'data', 'represents', '“spam”', 'class', 'for', 'the', 'machine', 'learning', 'algorithm', '.']
    
  • Are we done at this point? Not quite. There are many abbreviations, such as e.g. and i.e., that should not be split, but the algorithm above would split i.e. into ['i', '.', 'e', '.'].

tokenizers

  • This is problematic, since if the algorithm splits these examples in this way, it will lose track of the correct interpretation of words like i.e. or U.S.A., which should be treated as one word token rather than a combination of characters. How can this be achieved?

  • This is where the NLP tools come in handy: the tool that helps you split the running string of characters into meaningful words is called tokenizer, and it takes care of the cases like the ones we’ve just discussed—that is, it can recognize that ham. needs to be split into [‘ham’, ‘.’] while U.S.A. needs to be kept as one word [‘U.S.A.’]. Normally, tokenizers rely on extensive and carefully designed lists of regular expressions, and some are trained using machine-learning approaches.

  • Here the author introduces NLTK 👇

    For an example of a regular-expression-based tokenizer, you can check the Natural Language Toolkit’s regexp_tokenize() to get a general idea of the types of rules that tokenizers take into account: see Section 3.7 on www.nltk.org/book/ch03.html. The lists of rules applied may differ from one tokenizer to another. (A small regexp_tokenize sketch, adapted from that section, appears at the end of this subsection.)

  • What other common problems can a tokenizer solve for us? Contractions.

    A few examples:

    • What’s the best way to cook a pizza?
      • The first bit, What’s, should also be split into two words: this is a contraction for what and is, and it is important that the classifier knows that these two are separate words. Therefore, the word list for this sentence will include [What, 's, the, best, way, to, cook, a, pizza, ?].
    • We’re going to use a baking stone.
      • we’re should be split into we and 're (“are”). Therefore, the full word list will be [We, 're, going, to, use, a, baking, stone, .].
    • I haven’t used a baking stone before.
      • [I, have, n’t, used, a, baking, stone, before, .]. Note that the contraction of have and not here results in an apostrophe inside the word not; however, you should still be able to recognize that the proper English words in this sequence are have and n’t (“not”) rather than haven and 't. This is what the tokenizer will automatically do for you. (Note that the tokenizers do not automatically map contracted forms like n’t and ’re to full form like not and are. Although such mapping would be useful in some cases, this is beyond the functionality of tokenizers.)

    How do we handle this? Regular expressions are a wonderful thing.

    import nltk
    import re
    # nltk.download('punkt')  # uncomment if the punkt tokenizer models are not installed yet
    
    text1 = "What’s the best way to cook a pizza?"
    text2 = "We’re going to use a baking stone."
    text3 = "I haven’t used a baking stone before."
    
    texts = [text1, text2, text3]
    
    tokens = []
    for text in texts:
      # Replace the curly apostrophe with a straight one so the tokenizer recognizes the contractions
      text = re.sub(r'[’]', "'", text)
      print(text)
      # text = re.sub(r'[^\w\s\.\?\']', ' ', text)
      # print(text)
      # Tokenize the text
      tokens.append(nltk.word_tokenize(text))
    
    for token in tokens:
      print(token)
    # print(tokens)
    
    
    What's the best way to cook a pizza?
    We're going to use a baking stone.
    I haven't used a baking stone before.
    ['What', "'s", 'the', 'best', 'way', 'to', 'cook', 'a', 'pizza', '?']
    ['We', "'re", 'going', 'to', 'use', 'a', 'baking', 'stone', '.']
    ['I', 'have', "n't", 'used', 'a', 'baking', 'stone', 'before', '.']
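
As a side note, here is a rough sketch of the regexp_tokenize approach mentioned above, adapted from the pattern in section 3.7 of the NLTK book; the exact token list may vary slightly across NLTK versions.

import nltk

text = "That U.S.A. poster-print costs $12.40..."
pattern = r'''(?x)        # set flag to allow verbose regexps
    (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*          # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                # ellipsis
  | [][.,;"'?():_`-]      # these are separate tokens; includes ], [
'''
print(nltk.regexp_tokenize(text, pattern))
# roughly: ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']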
    

Step 3: Extract and normalize the features

  • The author gives an example:

    • Suppose two emails use a different format:

      • one says Collect your lottery winnings
      • while another one says Collect Your Lottery Winnings

      Among the extracted words, lottery and Lottery differ in surface form, but they mean exactly the same thing.

    We need to get rid of such formatting issues.

  • The problem is that small formatting differences, such as the use of uppercase versus lowercase letters, can result in different word lists being generated. To address this issue, the passage suggests normalizing the extracted words by converting them to lowercase. This can help ensure that the meaning of the words is not affected by formatting differences and that the extracted words are more indicative of the spam-related content of the emails.
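
A minimal illustration of this normalization step (it assumes NLTK's word_tokenize, which is used in section 2.3 below): once the text is lowercased, differently formatted versions of the message yield identical word lists.

from nltk import word_tokenize

print(word_tokenize("Collect your lottery winnings".lower()))
print(word_tokenize("Collect Your Lottery Winnings".lower()))
# Both print ['collect', 'your', 'lottery', 'winnings']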

Step 4: Train a classifier

  • At this point, you will end up with two sets of data: one linked to the spam class and another linked to the ham class. Each set is preprocessed in the same way in steps 2 and 3, and the features are extracted.

  • Next, you need to let the machine use this data to build the connection between the set of features (properties) that describe each type of email (spam or ham) and the labels attached to each type. In step 4, a machine-learning algorithm tries to build a statistical model, a function, that helps it distinguish between the two classes. This is what happens during the learning (training) phase.

  • A refresher visualizing the training and test processes.

  • So, step 4 of the algorithm should be defined as follows: define a machine-learning model and train it on the data with the features predefined in the previous steps

  • Suppose that the algorithm has learned to map the features of each class of emails to the spam and ham labels, and has determined which features are more indicative of spam or ham. The final step in this process is to make sure the algorithm is doing such predictions well.

  • How will you do that? By splitting the dataset.

  • The data used for training and testing should be chosen randomly and should not overlap in order to avoid biasing the evaluation. The training set is typically larger, with a typical proportion being 80% for training and 20% for testing. The classifier should not be allowed to access the test set during the training phase, and it should only be used for evaluation at the final step.

shuffle: to mix things into a random order (as with a deck of cards)

Data splits for supervised machine learning

In supervised machine learning, the algorithm is trained on a subset of the labeled data called the training set. It uses this subset to learn the function mapping the input data to the output labels. The test set is the subset of the data, disjoint from the training set, on which the algorithm can then be evaluated. The typical data split is 80% for training and 20% for testing. Note that it is important that the two sets are not overlapping. If your algorithm is trained and tested on the same data, you won’t be able to tell whether it actually learned anything or simply memorized the examples.
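
A minimal sketch of such a split (my own illustration, not the book's code; it assumes the labeled data comes as a list of (email, label) pairs, like the all_emails list built in section 2.3):

import random

def split_data(labeled_data, train_proportion=0.8, seed=42):
    # Shuffle reproducibly, then hold out the last 20% for testing
    data = labeled_data[:]
    random.seed(seed)
    random.shuffle(data)
    train_size = int(len(data) * train_proportion)
    return data[:train_size], data[train_size:]

# Hypothetical toy data: (email text, label) pairs
emails = [("win a prize now", "spam"), ("meeting at noon", "ham"),
          ("cheap pills", "spam"), ("quarterly report attached", "ham"),
          ("lottery winner!", "spam")]
train_set, test_set = split_data(emails)
print(len(train_set), len(test_set))  # 4 1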

Step 5: Evaluate the classifier

  • Suppose you trained your classifier in step 4 and then applied it to the test data. How will you measure the performance?

  • One approach would be to check what proportion of the test emails the algorithm classifies correctly—that is, assigns the spam label to a spam email and classifies ham emails as ham. This proportion is called accuracy, and its calculation is pretty straightforward:
    $Accuracy = \frac{num(\text{correct predictions})}{num(\text{all test instances})}$ (a short worked example follows this list)

  • The book gives exercise 2.4 here:

    Let’s discuss the solutions to this exercise (note that you can also find more detailed explanations at the end of the chapter). The prediction of the classifier based on the distribution of classes that you came across in this exercise is called baseline. In an equal class distribution case, the baseline is 50%, and if your classifier yields an accuracy of 60%, it outperforms this baseline. In the case of 60:40 split, the baseline, which can also be called the majority class baseline, is 60%. This means that if a dummy “classifier” does no learning at all and simply predicts the ham label for all emails, it will not filter out any spam emails from the inbox, but its accuracy will also be 60%—just like your classifier that is actually trained and performs some classification! This makes the classifier in the second case in this exercise much less useful because it does not outperform the majority class baseline.

    The solutions given at the end of the book:

    1. An accuracy of 60% doesn’t seem very high, but how exactly can you interpret it? Note that the distribution of classes helps you to put the performance of your classifier in context because it tells you how challenging the problem itself is. For example, with the 50–50% split, there is no majority class in the data and the classifier’s random guess will be at 50%, so the classifier’s accuracy is higher than this random guess. In the second case, however, the classifier performs on a par with the majority class guesser: the 60% to 40% distribution of classes suggests that if some dummy “classifier” always selected the majority class, it would get 60% of the cases correctly—just like the classifier you trained.
    2. The single accuracy value of 60% does not tell you anything about the performance of the classifier on each class, so it is a bit hard to interpret. However, if you look into each class separately, you can tell that the classifier is better at classifying ham emails (it got 2/3 of those right) than at classifying spam emails (only 1/2 are correct).
  • In summary, accuracy is a good overall measure of performance, but you need to keep in mind

    • (1) the distribution of classes to have a comparison point for the classifier’s performance
    • (2) the performance on each class, which is hidden within a single accuracy value but might suggest what the strengths and weaknesses of your classifier are. Therefore, the final step, step 5, in your algorithm is applying your classifier to the test data and evaluating its performance
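
To make the accuracy formula and the majority class baseline concrete, here is a short sketch with made-up labels (my own illustration, not the book's code):

gold = ["ham", "ham", "ham", "spam", "spam"]        # true labels (60% ham, 40% spam)
predicted = ["ham", "ham", "spam", "spam", "ham"]   # classifier output

correct = sum(1 for g, p in zip(gold, predicted) if g == p)
accuracy = correct / len(gold)
print(accuracy)  # 0.6

# A dummy "classifier" that always predicts the majority class ("ham")
majority = max(set(gold), key=gold.count)
baseline = sum(1 for g in gold if g == majority) / len(gold)
print(baseline)  # also 0.6, so the trained classifier does not beat the majority class baseline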

2.3 Implementing your own spam filter

Step 1: Define the data and classes

  • Dataset: Enron Email Dataset (cmu.edu)
    • This is a dataset of emails, including both ham (extracted from the original Enron dataset using personal messages of three Enron employees) and spam emails. To make processing more manageable, we will use a subset of this large dataset.
    • All folders in the Enron dataset contain spam and ham emails in separate subfolders. Each email is stored as a text file in these subfolders. Let’s read in the contents of these text files in each subfolder, store the spam emails’ contents and the ham emails’ contents as two separate data structures, and point our algorithm at each, clearly defining which one is spam and which one is ham.

Unzipping the files

  • I'm working in a Codespace. I can think of three ways to unzip the archives: VS Code may have an extension for it, I can run unzip in the terminal, or I can use Python. I plan to try the latter two.

  • I opened the book's GitHub repository, ekochmar/Getting-Started-with-NLP: This repository accompanies the book “Getting Started with Natural Language Processing” (github.com), directly in the Codespace and created a folder called learningThrough. After checking where the target files live, and worried that the archive contents would spill straight out, I also made a subfolder called test.

    Then:

    unzip enron1.zip -d ./learningThrough/test
    unzip enron2.zip -d learningThrough/test
    

    The destination path can be written with a leading ./ or with nothing in front.

  • Now let's try it with Python.

    But I ran into a Bad pipe message 🤔

    Could it be because the try2zip directory I specified did not exist to begin with?

    Let me try again and just extract into the current directory.

    import zipfile
    
    # Extract both archives into the current directory
    for zip_file in ['../enron1.zip', '../enron2.zip']:
        # Create a ZipFile object and extract all the contents of the zip file;
        # the with-statement closes the zip file automatically
        with zipfile.ZipFile(zip_file, 'r') as zip_obj:
            zip_obj.extractall()
    

    Nice!

read in the contents of the files

Let’s define a function read_in that will take a folder as an input, read the files in this folder, and store their contents as a Python list data structure.

I have to say, the comments this book provides are really detailed... orz

import os
# helps with different text encodings
import codecs

def read_in(folder):
    # Using os functionality, list all the files in the specified folder
    files = os.listdir(folder)
    a_list = []
    # Iterate through the files in the folder
    for a_file in files:
        # Skip hidden files
        if not a_file.startswith("."):
            # Read the contents of each file
            f = codecs.open(folder + a_file, "r", encoding="ISO-8859-1", errors="ignore")
            # Add the content of each file to the list data structure
            a_list.append(f.read())
            # Close the file after you read the contents
            f.close()
    return a_list

In this code, you rely on Python’s os module functionality to list all the files in the specified folder, and then you iterate through them, skipping hidden files (such files can easily be identified because their names start with “.”) that are sometimes automatically created by the operating system.

Next, you read the contents of each file.

The encoding and errors arguments of codecs.open function will help you avoid errors in reading files that are related to text encoding.

What is codecs?

The codecs module in Python provides functions to encode and decode data using various codecs (encoding/decoding algorithms).

Here are some common codecs that you can use with the codecs module:

  • utf-8: A Unicode encoding that can handle any character in the Unicode standard. It’s the most widely used encoding for the Web.
  • ascii: An encoding for the ASCII character set, which consists of 128 characters.
  • latin-1: An encoding for the Latin-1 character set, which consists of 256 characters.
  • utf-16: A Unicode encoding that uses two bytes (16 bits) to represent each character.
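
As a tiny illustration (not from the book) of why the chosen codec matters, the same bytes decode into different text depending on the encoding you assume:

# "café" encoded as UTF-8 takes 5 bytes, because é needs two bytes
data = "café".encode("utf-8")
print(data.decode("utf-8"))    # café
print(data.decode("latin-1"))  # cafÃ© (mojibake: the bytes were read with the wrong codec)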

You add the content of each file to a list data structure, and in the end, you return the list that contains the contents of the files from the specified folder.

verify that the data is uploaded and read in correctly

  • Now we can define spam_list and ham_list, letting the machine know what data represents spam emails and what data represents ham emails.

  • Let’s check if the data is uploaded correctly; for example, you can print out the lengths of the lists or check any particular member of the list.

    Summary.txt lists the counts, so let's check whether the numbers match.

    the length of the spam_list should equal the number of spam emails in the enron1/spam/ folder, which should be 1,500, while the length of the ham_list should equal the number of emails in the enron1/ham/, or 3,672. If you get these numbers, your data is uploaded and read in correctly.

  • The code:

    spam_list = read_in("./enron1/spam/")
    print(len(spam_list))
    # print(spam_list[0])
    ham_list = read_in("./enron1/ham/")
    print(len(ham_list))
    print(ham_list[0])
    

    The counts match, nice!

combine the data into a single structure

Next, we need to preprocess the data (e.g., by splitting text strings into words) and extract the features. Wouldn’t it be easier if you could run all preprocessing steps over a single data structure rather than over two separate lists?

Rather than using a for loop, to attach the labels we pair each email with its label as a tuple and put them all into a new list, all_emails.

Verify that the data was read in correctly; the output is too long, so use slicing to inspect just part of it.

We also need to account for randomness when splitting the dataset.

We need to split the data randomly into the training and test sets. To that end, let’s shuffle the resulting list of emails with their labels, and make sure that the shuffle is reproducible by fixing the way in which the data is shuffled. For the shuffle to be reproducible, you need to define the seed for the random operator, which makes sure that all future runs will shuffle the data in exactly the same way.

# Python’s random module will help you shuffle the data randomly
import random
# Use list comprehensions to create the all_emails list that will keep all emails with their labels
all_emails = [(email_content, "spam") for email_content in spam_list]
all_emails += [(email_content, "ham") for email_content in ham_list]
# Fix the seed of the random operator to make sure that all future runs will shuffle the data in the same way
random.seed(42)
random.shuffle(all_emails)
# Check the size of the dataset (length of the list); it should be equal to 1,500 + 3,672
# This kind of string is called a formatted string literal or f-string, a feature introduced in Python 3.6
print(f"Dataset size = {str(len(all_emails))} emails")
Dataset size = 5172 emails

Step 2: Split the text into words

  • Remember that the email contents you’ve read in so far each come as a single string of symbols. The first step of text preprocessing involves splitting the running text into words.

  • We will use NLTK again.

    • One of the benefits of this toolkit is that it comes with thorough documentation and a description of its functionality.

    • NLTK :: nltk.tokenize package

    It takes running text as input and returns a list of words based on a number of customized regular expressions, which help to delimit the text by whitespaces and punctuation marks, keeping common words like U.S.A. unsplit.

run a tokenizer over text

  • code

    This code defines a tokenize function that takes a string as input and splits it into words.

    The for-loop within this function appends each identified word from the tokenized string to the output word list; alternatively, you can use list comprehensions for the same purpose. Finally, given the input, the function prints out a list of words. You can test your intuitions about the words and check your answers to previous exercises by changing the input to any string of your choice

    import nltk
    from nltk import word_tokenize 
    nltk.download('punkt')
    def tokenize(input): 
        word_list = []
        for word in word_tokenize(input):
            word_list.append(word) 
        return word_list
    input = "What's the best way to split a sentence into words?"
    print(tokenize(input))
    

    In addition to the toolkit itself, you need to install NLTK data as explained on www.nltk.org/data.html. Running nltk.download() will install all the data needed for text processing in one go; in addition, individual tools can be installed separately (e.g., nltk.download('punkt') installs NLTK’s sentence tokenizer)

    ['What', "'s", 'the', 'best', 'way', 'to', 'split', 'a', 'sentence', 'into', 'words', '?']
    

    If I use a list comprehension instead, it looks like this 👇

    import nltk
    from nltk import word_tokenize 
    def tokenize(input): 
        word_list = [word for word in word_tokenize(input)]
        return word_list
    input = "What's the best way to split a sentence into words?"
    print(tokenize(input))
    

Step 3: Extract and normalize the features

The author is very keen on list comprehensions here. Indeed, I should make more of a habit of using them.

  • Once the words are extracted from running text, you need to convert them into features. In particular, you need to put all words into lowercase to make your algorithm establish the connection between different formats like Lottery and lottery.

  • Putting all strings to lowercase can be achieved with Python’s string functionality. To extract the features (words) from the text, you need to iterate through the recognized words and put all words to lowercase. In fact, both tokenization and converting text to lowercase can be achieved using a single line of code with list comprehensions.

    word_list = [word for word in word_tokenize(text.lower())]

    Using list comprehensions, you can combine the two steps—tokenization and converting strings to lowercase—in one line. Here, you first normalize and then tokenize text, but the two steps are interchangeable.

  • We define a function get_features that extracts the features from the text of email passed in as input. Next, for each word in the email, you switch on the “flag” that the word is contained in this email by assigning it with a True value. The list data structure all_features keeps tuples containing the dictionary of features matched with the spam or ham label for each email.

Code to extract the features

import nltk
from nltk import word_tokenize

def get_features(text):
    features = {}
    word_list = [word for word in word_tokenize(text.lower())]
    for word in word_list:
        # For each word in the email, switch on the "flag" that this word is contained in the email
        features[word] = True
    return features

# all_features will keep tuples containing the dictionary of features matched with the label for each email
all_features = [(get_features(email), label) for (email, label) in all_emails]
# Check what features are extracted from an input text
print(get_features("Participate In Our New Lottery NOW!"))
print(len(all_features))
# Check what the all_features list data structure contains
print(len(all_features[0][0]))
print(len(all_features[99][0]))

In the end, the code shows how you can check what features are extracted from an input text and what all_features list data structure contains (e.g., by printing out its length and the number of features detected in the first or any other email in the set)

{'participate': True, 'in': True, 'our': True, 'new': True, 'lottery': True, 'now': True, '!': True}
5172
