Datasets for Natural Language Processing Tasks

Posted 2020-10-31 冯煜博

keywords: NLP, DataSet

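AI Challenger - English-Chinese Translation Evaluation

Applicable to: machine translation

Billed as the largest English-Chinese parallel dataset in the spoken-language domain. It provides more than 10 million English-Chinese sentence pairs. All sentence pairs have been manually checked, so the dataset is vetted for scale, relevance, and quality.

Training set: 10,000,000 sentences
Validation set (simultaneous interpretation): 934 sentences
Validation set (text translation): 8,000 sentences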

https://challenger.ai/datasets/translation

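UN Parallel Corpus (United Nations Parallel Corpus)

Applicable to: machine translation

The United Nations Parallel Corpus consists of official UN records and other parliamentary documents that are in the public domain. It covers content written and manually translated between 1990 and 2014, including sentence-aligned text.

The corpus is intended to provide multilingual language resources and to support research and progress in machine translation and other areas of natural language processing. For convenience, it also offers ready-made language-specific bitexts and a six-language parallel subcorpus.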
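
Sentence-aligned bitexts are commonly distributed as two plain-text files in which line i of one file is the translation of line i of the other. The following is a minimal sketch of reading such a pair, assuming that layout. The function name `read_bitext` and its parameters are illustrative and do not come from the corpus documentation.

```python
from itertools import islice


def read_bitext(src_path, tgt_path, limit=None):
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumption: line i of src_path is the translation of line i of tgt_path.
    zip() stops at the shorter file, so a length mismatch is silently ignored.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        # islice with limit=None iterates over every pair.
        for s, t in islice(zip(src, tgt), limit):
            yield s.rstrip("\n"), t.rstrip("\n")
```

Passing a small `limit` is a cheap way to inspect the first few pairs before committing to a full pass over millions of lines.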

Introduction: https://conferences.unite.un.org/UNCorpus/zh#introduction

Download: https://conferences.unite.un.org/UNCorpus/zh/DownloadOverview

(At the time of writing, the download has consistently failed for me.)

2nd International Chinese Word Segmentation Bakeoff

Applicable to: Chinese word segmentation

This directory contains the training, test, and gold-standard data
used in the 2nd International Chinese Word Segmentation Bakeoff.

http://sighan.cs.uchicago.edu/bakeoff2005/
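
Segmentation corpora of this kind are typically plain text with words separated by whitespace. A common way to train a sequence tagger on them is to convert each word into per-character B/M/E/S labels (begin, middle, end, single). The following is a minimal sketch, assuming whitespace-delimited training lines. The helper name `to_bmes` is illustrative, and the tag scheme is a widespread convention rather than something the bakeoff prescribes.

```python
def to_bmes(segmented_line):
    """Convert a whitespace-segmented sentence into (character, tag) pairs."""
    tagged = []
    for word in segmented_line.split():
        if len(word) == 1:
            # A one-character word is tagged as a single.
            tagged.append((word, "S"))
        else:
            tagged.append((word[0], "B"))
            tagged.extend((ch, "M") for ch in word[1:-1])
            tagged.append((word[-1], "E"))
    return tagged
```

For example, `to_bmes("我 喜欢 自然语言")` tags 我 as S, 喜欢 as B E, and 自然语言 as B M M E.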

20 Newsgroups

Applicable to: text classification

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

http://qwone.com/~jason/20Newsgroups/
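
The following is a minimal sketch of loading the collection from disk and checking how even the partition is. It assumes the downloaded archive has been unpacked into a directory with one subfolder per newsgroup and one file per post. The function names `load_newsgroups` and `class_counts` are illustrative.

```python
from collections import Counter
from pathlib import Path


def load_newsgroups(root):
    """Return (texts, labels) from a directory with one subfolder per newsgroup."""
    texts, labels = [], []
    for group_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for doc in sorted(p for p in group_dir.iterdir() if p.is_file()):
            # latin-1 can decode any byte sequence, so old Usenet posts
            # with unknown encodings never raise a decode error.
            texts.append(doc.read_text(encoding="latin-1"))
            labels.append(group_dir.name)
    return texts, labels


def class_counts(labels):
    """Count documents per newsgroup to check how even the partition is."""
    return Counter(labels)
```

Calling `class_counts(labels)` on the loaded labels gives the per-group document counts, which makes any imbalance between newsgroups visible before training a classifier.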

NLPCC 2017 News Headline Classification

Applicable to: text classification

http://tcci.ccf.org.cn/conference/2017/taskdata.php

Reuters-21578 Text Categorization Collection

Applicable to: text classification

This is a collection of documents that appeared on Reuters newswire in 1987. The documents were assembled and indexed with categories.

http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html

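Whole-Web News Data (SogouCA)

Applicable to: text classification, event detection and tracking, new word discovery, named entity recognition, automatic summarization

News data collected from a number of news sites from June to July 2012, covering 18 channels including domestic, international, sports, society, and entertainment. Each record provides the URL and the body text.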

http://www.sogou.com/labs/resource/ca.php

CMU World Wide Knowledge Base (Web->KB) project

Applicable to: knowledge extraction

The goal of the project is to develop a probabilistic, symbolic knowledge base that mirrors the content of the World Wide Web. If successful, this will make text information on the web available in computer-understandable form, enabling much more sophisticated information retrieval and problem solving.

http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/
