Text Classification
Posted by mrdoghead
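What is text classification
Text classification is one of the most common task families in NLP. The input is usually a piece of text, and the output is a predicted class label. Typical tasks include topic classification, sentiment analysis, authorship attribution, and authenticity detection. Many other problems can also be recast as classification and solved the same way.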
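Typical workflow
- Pick a task you are interested in
- Collect a suitable dataset
- Annotate the data
- Select features
- Choose a machine learning method
- Tune hyperparameters on a validation set
- Try several algorithms and parameter settings
- Train the final model
- Evaluate on the test set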
Machine learning algorithms
Here is a brief introduction to several basic machine learning algorithms.
1. Naive Bayes
Main idea: assume the features are conditionally independent given the class, then use Bayes' rule to pick the most probable class.
Pros: fast to train and classify; robust, low variance; good in low-data situations; the optimal classifier if the independence assumption holds; extremely simple to implement.
Cons: the independence assumption rarely holds; lower accuracy than comparable methods in most situations; smoothing is required for unseen class/feature combinations.
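As a rough illustration of the idea above, here is a minimal multinomial Naive Bayes sketch with add-alpha (Laplace) smoothing. The function names, the toy documents, and the `alpha` parameter are assumptions made for this example, not part of the original post.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate log priors and add-alpha smoothed log likelihoods."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Smoothing keeps unseen class/word pairs from getting probability 0.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict_naive_bayes(model, tokens):
    """Return the class with the highest posterior log score."""
    log_prior, log_likelihood, vocab = model
    scores = {}
    for c in log_prior:
        # Independence assumption: per-word log likelihoods are simply summed.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in vocab
        )
    return max(scores, key=scores.get)


# Toy, made-up training data.
docs = [
    "great film loved it".split(),
    "wonderful acting great plot".split(),
    "terrible film hated it".split(),
    "boring plot awful acting".split(),
]
labels = ["pos", "pos", "neg", "neg"]
model = train_naive_bayes(docs, labels)
prediction = predict_naive_bayes(model, "great acting".split())
```

Words that never appeared in training are skipped at prediction time, which is one common convention among several.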
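2. Logistic Regression
Logistic regression is a small modification of linear regression: a link function ties the linear score to a probability, so the model stays linear in the log-odds while its output lies between 0 and 1.
Training is similar to fitting a regression model: the weights are found by minimising a cost function, optionally with a regularisation term added as a penalty.
Pros: unlike Naïve Bayes, it is not confounded by diverse, correlated features.
Cons: high bias; slow to train; some feature-scaling issues; often needs a lot of data to work well; choosing the regularisation strength is a nuisance but important, since overfitting is a big problem.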
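The training procedure described above can be sketched with plain batch gradient descent on the log loss plus an L2 penalty. The learning rate, penalty strength, epoch count, and the two toy features (counts of positive and negative words) are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: maps a linear score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, l2=0.01, epochs=500):
    """Batch gradient descent on mean log loss with an L2 penalty on the weights."""
    n_features = len(X[0])
    n = len(X)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the linear score
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # The l2 * wj term is the regularisation penalty; the bias is not penalised.
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy features: [count of positive words, count of negative words].
X = [[2, 0], [1, 0], [0, 2], [0, 1]]
y = [1, 1, 0, 0]
w, b = train_logistic_regression(X, y)
p_pos = predict_proba(w, b, [2, 0])
p_neg = predict_proba(w, b, [0, 2])
```

Raising `l2` shrinks the weights toward zero, which is the trade-off between fit and overfitting mentioned in the cons.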
3. Support Vector Machines (SVM)
Main idea: find a hyperplane that separates the training data, then use it to classify unseen examples. The details are not covered here.
Pros: fast and accurate linear classifier; can handle non-linearity with the kernel trick; works well with huge feature sets.
Cons: multiclass classification is awkward; feature scaling can be tricky; deals poorly with class imbalance; uninterpretable.
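4. K-Nearest Neighbour (KNN)
Main idea: measure the distance between a new observation and the existing labelled data (for example Euclidean or cosine distance), and assign it the label that is most common among its k closest neighbours.
Pros: simple, effective; no training required; inherently multiclass; optimal with infinite data.
Cons: k has to be selected; issues with unbalanced classes; often slow (the k neighbours must be found for every query); features must be selected carefully.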
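A minimal KNN sketch, assuming documents have already been turned into small count vectors. The function names and the toy data are made up for this example; both distance options mentioned above are included.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; a zero vector is treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(train_X, train_y, query, k=3, distance=euclidean_distance):
    """Majority vote over the k training points closest to the query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy vectors: [count of sports words, count of politics words].
train_X = [[3, 0], [2, 1], [4, 1], [0, 3], [1, 2], [0, 4]]
train_y = ["sports", "sports", "sports", "politics", "politics", "politics"]

label_euclid = knn_predict(train_X, train_y, [3, 1], k=3)
label_cosine = knn_predict(train_X, train_y, [1, 3], k=3, distance=cosine_distance)
```

There is no training step: all the work happens in the sort at query time, which is why KNN tends to be slow on large datasets.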
5. Decision Tree
Main idea: build a tree from the feature information; each leaf node corresponds to a class.
Pros: in theory, very interpretable; fast to build and test; feature representation and scaling are irrelevant; good for small feature sets; handles non-linearly-separable problems.
Cons: in practice, often not that interpretable; highly redundant sub-trees; not competitive for large feature sets.
6. Random Forest
Main idea: an ensemble of decision trees whose predictions are combined by voting to choose the final label.
Pros: usually more accurate and more robust than a single decision tree; a great classifier for small to moderate-sized feature sets; training is easily parallelised.
Cons: same drawbacks as decision trees; too slow with large feature sets.
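7. Neural Network
Main idea: layers of nodes are connected to one another, and each node passes a weighted sum of the previous layer's outputs through an activation function and on to the next layer. The details are not covered here. Each individual node is essentially a linear model followed by a non-linearity.
Pros: extremely powerful; state-of-the-art accuracy on many tasks in natural language processing and vision.
Cons: not an off-the-shelf classifier; very difficult to choose good hyperparameters; slow to train; prone to overfitting.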
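To make the layer-to-layer flow concrete, here is a minimal sketch of a forward pass through a network with one hidden layer. Each node computes a weighted sum of the previous layer's outputs and applies an activation function. The weights below are hand-picked, hypothetical values; a real network would learn them, typically by backpropagation.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum per node, then the activation."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


# Hypothetical, hand-picked parameters (not learned).
hidden_weights = [[1.0, -1.0], [-1.0, 1.0]]
hidden_biases = [0.0, 0.0]
output_weights = [[2.0, -2.0]]
output_biases = [0.0]


def forward(features):
    """Return the predicted probability of the positive class."""
    hidden = dense(features, hidden_weights, hidden_biases, relu)
    return dense(hidden, output_weights, output_biases, sigmoid)[0]


prob_a = forward([3.0, 1.0])  # first feature dominates
prob_b = forward([1.0, 3.0])  # second feature dominates
```

Without the `relu` step the two layers would collapse into a single linear map, so the non-linearity is what lets stacked layers represent more than a linear model.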
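Hyperparameter tuning
After training a model on the training set, use a validation set to tune its hyperparameters. Common techniques are k-fold cross-validation and grid search.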
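A minimal sketch of combining the two techniques: every value in a small grid is scored by k-fold cross-validation and the best one is kept. The toy one-dimensional KNN model, the data, and the grid values are assumptions made only so there is something to tune. The data is assumed to be shuffled already.

```python
from collections import Counter


def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, validation_indices) once per fold."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def knn_1d(train, query, k):
    """Toy 1-D k-NN classifier, used here only as the model being tuned."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_val_accuracy(data, k_neighbours, n_folds=4):
    """Mean validation accuracy over the folds for one hyperparameter value."""
    scores = []
    for train_idx, val_idx in k_fold_splits(len(data), n_folds):
        train = [data[i] for i in train_idx]
        correct = sum(
            knn_1d(train, data[i][0], k_neighbours) == data[i][1] for i in val_idx
        )
        scores.append(correct / len(val_idx))
    return sum(scores) / len(scores)


# Toy, made-up data: (feature value, label).
data = [(0.1, "neg"), (0.9, "pos"), (0.2, "neg"), (0.8, "pos"),
        (0.3, "neg"), (0.7, "pos"), (0.4, "neg"), (0.6, "pos")]

grid = [1, 3, 5]  # candidate values for the number of neighbours
results = {k: cross_val_accuracy(data, k) for k in grid}
best_k = max(results, key=results.get)
```

With several hyperparameters, the grid becomes the Cartesian product of the candidate values, and each combination is scored the same way.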
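Evaluation
Common evaluation metrics (tp, fp, and fn are the counts of true positives, false positives, and false negatives):
- Accuracy = number correct / total
- Precision = tp / (tp + fp)
- Recall = tp / (tp + fn)
- F1-score = 2 * precision * recall / (precision + recall)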
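The four metrics can be computed directly from the true and predicted labels. The sketch below assumes a binary task; the function name, the `positive` parameter, and the toy labels are made up for illustration, and the metrics are defined as 0 when a denominator is zero, which is one common convention.

```python
def binary_metrics(y_true, y_pred, positive="pos"):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 2 true positives, 1 false positive, 1 false negative.
y_true = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "pos", "neg", "neg", "neg", "neg"]
metrics = binary_metrics(y_true, y_pred)
```

Here accuracy is 6/8 while precision, recall, and F1 are all 2/3, which shows how accuracy can look better than the positive-class metrics when negatives dominate.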
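There are also the macro F-score, which averages the per-class F-scores, and the micro F-score, which pools tp, fp, and fn over all classes before computing a single score.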
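A minimal sketch of the difference between the two averages on a multiclass problem. The helper names and the toy labels are assumptions for this example. It uses the identity F1 = 2·tp / (2·tp + fp + fn), which is algebraically the same as the precision/recall form.

```python
def per_class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)}, treating each label as the positive class in turn."""
    pairs = list(zip(y_true, y_pred))
    counts = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        counts[c] = (tp, fp, fn)
    return counts


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores: every class counts equally."""
    counts = per_class_counts(y_true, y_pred)
    return sum(f1_from_counts(*c) for c in counts.values()) / len(counts)


def micro_f1(y_true, y_pred):
    """F1 on counts pooled over all classes: every example counts equally."""
    counts = per_class_counts(y_true, y_pred).values()
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1_from_counts(tp, fp, fn)


# Toy, made-up labels with an unbalanced class distribution.
y_true = ["a", "a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "c", "c"]
macro = macro_f1(y_true, y_pred)
micro = micro_f1(y_true, y_pred)
```

In a single-label multiclass setting like this one, the micro F-score equals accuracy (5/7 here), while the macro F-score is pulled down by the weaker scores on the rare classes.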