Training data for sentiment analysis [closed]

Posted 2023-02-15


【Title】: Training data for sentiment analysis [closed] 【Posted】: 2011-11-24 23:04:04 【Question】:

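Where can I get a corpus of documents from the corporate domain that have already been classified as positive or negative in sentiment? I want a large corpus of documents that comment on companies, such as the company reviews provided by analysts and the media.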

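I have found corpora of product and movie reviews. Is there a corpus for the business domain, including reviews of companies, that matches the language of business?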

【Comments on the question】:

See also this related question: ***.com/questions/5570681/…

【Answer 1】:

I don't know of any such corpus that is freely available, but you could try an unsupervised method on an unlabeled dataset.

【Comments】:

【Answer 2】:

http://www.cs.cornell.edu/home/llee/data/

http://mpqa.cs.pitt.edu/corpora/mpqa_corpus

You can use Twitter with its emoticons as labels, as described here: http://web.archive.org/web/20111119181304/http://deepthoughtinc.com/wp-content/uploads/2011/01/Twitter-as-a-Corpus-for-Sentiment-Analysis-and-Opinion-Mining.pdf
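To illustrate the emoticon idea, here is a minimal sketch in plain Python. It is not the linked paper's exact procedure: the emoticon lists, the function name `label_by_emoticon`, and the rule of discarding tweets with mixed or missing emoticons are all assumptions made for this example.

```python
# Hypothetical emoticon lists; extend them to suit your data.
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(", "=(")


def label_by_emoticon(tweet):
    """Return (text_without_emoticons, label), or None if the tweet
    has no emoticon or mixes positive and negative ones."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        # Either no signal at all or a conflicting one: skip the tweet.
        return None
    text = tweet
    for emoticon in POSITIVE + NEGATIVE:
        # Remove the emoticon so a classifier cannot simply memorize it.
        text = text.replace(emoticon, "")
    label = "positive" if has_pos else "negative"
    return " ".join(text.split()), label


if __name__ == "__main__":
    for tweet in ["Great earnings call :)", "Guidance cut again :(", "Meeting at 3pm"]:
        print(tweet, "->", label_by_emoticon(tweet))
```

Labels obtained this way are noisy, so it is worth checking a sample by hand before training on them.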

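Hope that gets you started. If you are interested in specific subtasks such as negation or sentiment scope, there is more in the literature.

To focus on companies, you could pair a method with topic detection, or more cheaply just collect a large number of mentions of a given company. Or you could have Mechanical Turkers annotate your data.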

【Comments】:

FYI, the Pitt corpus has moved here: mpqa.cs.pitt.edu/corpora/mpqa_corpus

【Answer 3】:

Here are a few more:

http://inclass.kaggle.com/c/si650winter11

http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html

【Comments】:

The Kaggle link requires a university email address and password.

【Answer 4】:

If you have some resources (media channels, blogs, etc.) covering the domain you want to explore, you can create your own corpus. I do this in Python:

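1. Use Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) to parse the content I want to classify.
2. Separate out the sentences that express positive or negative opinions about companies.
3. Use NLTK to process these sentences: word tokenization, POS tagging, etc.
4. Use NLTK's PMI measure to find the bigrams or trigrams that occur most frequently in only one class.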
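The last step can be sketched as follows. This is a plain-Python stand-in written for illustration, not NLTK's collocation API and not necessarily how the answer's author implemented it: the helper names (`tokenize`, `bigram_pmi`, `class_specific_bigrams`), the regex tokenizer, and the `min_count` threshold are all assumptions. It keeps the bigrams that appear in only one class and ranks them by pointwise mutual information.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into simple word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def bigrams(tokens):
    """Return the adjacent word pairs of a token list."""
    return list(zip(tokens, tokens[1:]))


def bigram_pmi(token_lists):
    """Map each bigram to log2(P(w1, w2) / (P(w1) * P(w2)))."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in token_lists:
        unigram_counts.update(tokens)
        bigram_counts.update(bigrams(tokens))
    n_unigrams = sum(unigram_counts.values())
    n_bigrams = sum(bigram_counts.values())
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        p_pair = count / n_bigrams
        p_w1 = unigram_counts[w1] / n_unigrams
        p_w2 = unigram_counts[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


def class_specific_bigrams(pos_sentences, neg_sentences, min_count=2):
    """Return (positive_only, negative_only) bigrams, each sorted by PMI."""
    pos_tokens = [tokenize(s) for s in pos_sentences]
    neg_tokens = [tokenize(s) for s in neg_sentences]
    pos_counts = Counter(bg for t in pos_tokens for bg in bigrams(t))
    neg_counts = Counter(bg for t in neg_tokens for bg in bigrams(t))
    pmi = bigram_pmi(pos_tokens + neg_tokens)

    def exclusive(own, other):
        # Keep bigrams frequent in one class and absent from the other.
        found = [bg for bg, c in own.items() if c >= min_count and bg not in other]
        return sorted(found, key=lambda bg: pmi[bg], reverse=True)

    return exclusive(pos_counts, neg_counts), exclusive(neg_counts, pos_counts)


if __name__ == "__main__":
    # Made-up example sentences, for illustration only.
    positive = ["Profits beat expectations", "Strong growth beat expectations"]
    negative = ["Shares missed expectations badly", "Revenue missed expectations again"]
    print(class_specific_bigrams(positive, negative))
```

On a real corpus you would raise `min_count`, since PMI tends to overrate rare pairs.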

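Creating a corpus is hard work (preprocessing, checking, labeling, and so on), but the payoff is a model prepared for a specific domain, which often improves accuracy considerably. If you can get a corpus that is already prepared, just go ahead with the sentiment analysis ;)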

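【Comments】:

【Answer 5】:

You can get a large number of online reviews from Datafiniti. Most of the reviews come with rating data, which gives finer sentiment granularity than plain positive/negative. Here is a list of businesses with reviews, and here is a list of products with reviews.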
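As an illustration of how ratings can be turned into sentiment labels, here is a minimal sketch. It assumes a numeric rating on a 1-to-`scale_max` scale; the function name and the midpoint rule are hypothetical choices for this example and say nothing about the fields Datafiniti actually returns.

```python
def rating_to_label(rating, scale_max=5):
    """Map a numeric rating on a 1..scale_max scale to a coarse label."""
    if not 1 <= rating <= scale_max:
        raise ValueError(f"rating {rating!r} is outside 1..{scale_max}")
    midpoint = (1 + scale_max) / 2
    if rating > midpoint:
        return "positive"
    if rating < midpoint:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for stars in (1, 3, 5):
        print(stars, "->", rating_to_label(stars))
```

Keeping the raw rating alongside the label preserves the extra granularity for later use, for example as a regression target.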

【Comments】:

【Answer 6】:

Here is a list I wrote a few weeks ago, taken from my blog. Some of these datasets have recently been included in the NLTK Python platform.

Lexicons

Bing Liu's Opinion Lexicon

URL: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon
Paper: Mining and summarizing customer reviews
Notes: Included in the NLTK Python platform

MPQA Subjectivity Lexicon

URL: http://mpqa.cs.pitt.edu/#subj_lexicon
Paper: Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis (Theresa Wilson, Janyce Wiebe, and Paul Hoffmann, 2005)

SentiWordNet

URL: http://sentiwordnet.isti.cnr.it
Notes: Included in the NLTK Python platform

Harvard General Inquirer

URL: http://www.wjh.harvard.edu/~inquirer
Paper: The General Inquirer: A Computer Approach to Content Analysis (Stone, Philip J; Dexter C. Dunphy; Marshall S. Smith; and Daniel M. Ogilvie. 1966)

Linguistic Inquiry and Word Count (LIWC)

URL: http://www.liwc.net

VADER Lexicon

URL: https://github.com/cjhutto/vaderSentiment, http://comp.social.gatech.edu/papers
Paper: Vader: A parsimonious rule-based model for sentiment analysis of social media text (Hutto, Gilbert. 2014)

Datasets

MPQA Datasets

URL: http://mpqa.cs.pitt.edu
Notes: GNU Public License.
Contents: Political Debate data; Product Debate data; Subjectivity Sense Annotations

Sentiment140 (tweets)

URL: http://help.sentiment140.com/for-students
Paper: Twitter Sentiment Classification using Distant Supervision (Go, Alec, Richa Bhayani, and Lei Huang)
Other URLs: http://help.sentiment140.com, https://groups.google.com/forum/#!forum/sentiment140

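STS-Gold (tweets)

URL: http://www.tweenator.com/index.php?page_id=13
Paper: Evaluation datasets for twitter sentiment analysis (Saif, Fernandez, He, Alani)
Notes: Similar to Sentiment140, but the dataset is smaller and was labeled by human annotators. It comes with 3 files: tweets, entities (with their sentiment), and an aggregate set.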

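Customer Review Dataset (product reviews)

URL: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets
Paper: Mining and summarizing customer reviews
Notes: Review title, product features, positive/negative labels with opinion strength, other information (comparisons, pronoun resolution, etc.). Included in the NLTK Python platform.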

Pros and Cons Dataset (pros and cons sentences)

URL: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets
Paper: Mining Opinions in Comparative Sentences (Ganapathibhotla, Liu 2008)
Notes: A list of sentences tagged <pros> or <cons>. Included in the NLTK Python platform.

Comparative Sentences (reviews)

URL: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets
Papers: Identifying Comparative Sentences in Text Documents (Nitin Jindal and Bing Liu), Mining Opinion Features in Customer Reviews (Minqing Hu and Bing Liu)
Notes: Sentences, POS-tagged sentences, entities, comparison type (non-equal, equative, ***, non-gradable). Included in the NLTK Python platform.

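Sanders Analytics Twitter Sentiment Corpus (tweets)

URL: http://www.sananalytics.com/lab/twitter-sentiment
Notes: 5513 hand-classified tweets covering 4 different topics. Because of Twitter's ToS, a small Python script is included to download all of the tweets. The sentiment classifications themselves are provided free of charge and without restrictions. They may be used in commercial products, redistributed, and modified.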

Spanish tweets (tweets)

URL: http://www.daedalus.es/TASS2013/corpus.php

SemEval 2014 (tweets)

URL: http://alt.qcri.org/semeval2014/task9
Notes: You must not redistribute the tweets, the annotations, or the corpus obtained (from the readme file).

Various datasets (reviews)

URL: https://personalwebs.coloradocollege.edu/~mwhitehead/html/opinion_mining.html
Papers: Building a General Purpose Cross-Domain Sentiment Mining Model (Whitehead and Yaeger), Sentiment Mining Using Ensemble Classification Models (Whitehead and Yaeger)

Various datasets #2 (reviews)

URL: http://www.text-analytics101.com/2011/07/user-review-datasets_20.html

References:

Keenformatics - Sentiment Analysis lexicons and datasets (my blog)
Personal experience

【Comments】:

Nice answer. Thanks a lot, Kurt.
