A question on text classification with more than one level of category

Posted 2023-03-12


[Title] A question on text classification with more than one level of category [Posted] 2022-01-19 17:17:54 [Question]:

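I am trying to build a series of product classifiers from each product's text description. My DataFrame looks like the one below, though the real one is more complex. I am using Python and the sklearn library.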

import pandas as pd

# Column names use the capitalized form ('Level1', 'Level2') that the training code selects.
data = {'description': ['orange', 'apple', 'bean', 'carrot', 'pork', 'fish', 'beef'],
        'Level1': ['plant', 'plant', 'plant', 'plant', 'animal', 'animal', 'animal'],
        'Level2': ['fruit', 'fruit', 'vegetable', 'vegetable', 'livestock', 'seafood', 'livestock']}

# Create DataFrame
df = pd.DataFrame(data)

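"description" is the text data. Here it is a single word, but in the real data it is a longer sentence. "Level1" is the top-level category. "Level2" is a subcategory.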

I know how to use the sklearn library to train a classification model that assigns products to a level 1 category.

Here is what I did:

import pandas as pd
import numpy as np
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.metrics import roc_curve, auc, roc_auc_score
import pickle

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(df['description'],
                                                 df[['Level1','Level2']], test_size = 0.4, shuffle=True)

#use the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(use_idf=True)

#transforming the training data into tf-idf matrix
X_train_vectors_tfidf = tfidf_vectorizer.fit_transform(X_train)

#transforming testing data into tf-idf matrix
X_test_vectors_tfidf = tfidf_vectorizer.transform(X_test)

#Create and save model for level 1
naive_bayes_classifier = MultinomialNB()
model_level1 = naive_bayes_classifier.fit(X_train_vectors_tfidf, y_train['Level1'])
with open('model_level_1.pkl','wb') as f:
    pickle.dump(model_level1, f)

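What I don't know how to do is build, for each level 1 category, a classification model that predicts a product's level 2 category. For example, based on the dataset above, there should be one classification model for "plant" (predicting fruit or vegetable) and another for "animal" (predicting seafood or livestock). Do you have any ideas on how to do this and save the models using a loop?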

[Comments]:

[Answer 1]:

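Assuming you can get all the columns of the dataset, they will be a mix of feature columns and level columns, with the levels being the class labels. Along those lines: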

cols = ["abc", "Level1", "Level2", "Level3"]

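From here on, we consider only the level columns, since those are what we are interested in.

level_cols = [val for val in cols if val.startswith("Lev")]

The line above keeps only the column names that start with the three characters "Lev".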

Now the level columns are set. I think you could start with the following steps:

1. Iterate over the level columns only.
2. Take the trailing number of each column name: 1, 2, 3, 4, ..., n.
3. If the number from step 2 is divisible by 2 (that is, for all the even levels), run the prediction using the model saved for the previous level.
4. Otherwise, train a model on that level and save it.

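for level in level_cols:
    level_idx = int(level[-1])  # trailing digit of the column name, e.g. 2 for "Level2"
    if level_idx % 2 == 0:
        # Open the model saved for level_idx - 1
        # and perform the prediction with it.
        pass
    else:
        # x_train is the vectorized training text; y_train holds the level columns.
        model = naive_bayes_classifier.fit(x_train, y_train[level])
        with open(f"model-x-{level_idx}", "wb") as mf:
            pickle.dump(model, mf)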
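
For the specific case in the question (one level 2 model per level 1 category), a minimal sketch could look like the following. This is one possible approach and not something stated in the question: it groups the rows by Level1 with pandas and bundles the vectorizer and the classifier in a scikit-learn pipeline, so that each pickle can be applied to raw text later. The helper names make_model and predict_levels and the file naming scheme are hypothetical.

```python
import pickle

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data from the question; real descriptions would be longer sentences.
df = pd.DataFrame({
    "description": ["orange", "apple", "bean", "carrot", "pork", "fish", "beef"],
    "Level1": ["plant", "plant", "plant", "plant", "animal", "animal", "animal"],
    "Level2": ["fruit", "fruit", "vegetable", "vegetable",
               "livestock", "seafood", "livestock"],
})


def make_model():
    # Hypothetical helper: a fresh TF-IDF + naive Bayes pipeline per model,
    # so the fitted vocabulary is saved together with the classifier.
    return make_pipeline(TfidfVectorizer(), MultinomialNB())


# Level 1 model, trained on every row.
model_level1 = make_model().fit(df["description"], df["Level1"])
with open("model_level_1.pkl", "wb") as f:
    pickle.dump(model_level1, f)

# One Level 2 model per Level 1 category, trained only on that category's rows.
level2_models = {}
for category, group in df.groupby("Level1"):
    model = make_model().fit(group["description"], group["Level2"])
    level2_models[category] = model
    # Hypothetical file naming scheme, e.g. model_level_2_plant.pkl
    with open(f"model_level_2_{category}.pkl", "wb") as f:
        pickle.dump(model, f)


def predict_levels(texts):
    # Predict Level1 first, then route each text to the matching Level2 model.
    level1 = model_level1.predict(texts)
    level2 = [level2_models[c].predict([t])[0] for t, c in zip(texts, level1)]
    return list(zip(level1, level2))
```

Calling predict_levels(["apple", "fish"]) first picks a level 1 category for each text and then asks only that category's model for the level 2 label. For brevity the sketch trains on the full toy DataFrame; with real data, the same loop would run on the training split only.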

[Comments]:
