An Introduction to Natural Language Processing for Text
Posted by 数据算法之心
Some Examples
Cortana
Microsoft's operating systems ship with a virtual assistant called Cortana that can recognize natural speech. You can use it to set reminders, open apps, send email, play games, track flights and packages, check the weather, and more.
Siri
Siri is the virtual assistant for Apple Inc.'s iOS, watchOS, macOS, HomePod, and tvOS operating systems. Again, you can do a great deal with voice commands: start a call, text someone, send an email, set a timer, take a photo, open an app, set an alarm, use navigation, and so on.
Gmail
Gmail, the well-known email service developed by Google, uses spam detection to filter out junk mail.
The Real Reason NLP Is Hard
The process of reading and understanding language is far more complex than it looks at first glance. A lot goes into figuring out what a piece of text really means in the real world. For example:
"Steph Curry was on fire last night. He totally destroyed the other team"
To a human, the meaning of this sentence is probably obvious. We know Steph Curry is a basketball player; even if you don't, you can tell he plays on some kind of team, probably a sports team. When we see "on fire" and "destroyed", we know it means Curry played really well last night and beat the other team.
Computers tend to take things far too literally. Reading the sentence literally, the way a computer would, we see "Steph Curry" and, based on the capitalization, decide it is a person, a place, or something else important, which is great! But then we see that Steph Curry "was on fire"... A computer might tell you that someone literally set Curry on fire last night. Yikes! After that, it might conclude that Mr. Curry physically destroyed the other team, and that, as far as the computer is concerned, they no longer exist.
But it isn't all hopeless! Thanks to machine learning, we can actually do some very clever things to quickly extract and understand information from natural language. Let's see how to do that in a few lines of code with a couple of simple Python libraries.
Dependencies
First, we'll install some useful Python NLP libraries to help us analyze text.
### Install spaCy, a general-purpose Python NLP library
pip install spacy
### Download the English language model for spaCy
python -m spacy download en_core_web_lg
### Install textacy, a useful add-on to spaCy
pip3 install textacy
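Before moving on, it can be worth a quick sanity check that the model downloaded correctly. This is just a minimal, optional sketch (not part of the original walkthrough); it loads the model and parses a throwaway sentence:
# coding: utf-8
import spacy
### Load the model we just downloaded; this will raise an error if the download failed
nlp = spacy.load('en_core_web_lg')
### Parse a short sentence to confirm everything works end to end
print([token.text for token in nlp("spaCy is ready to go!")])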
Entity Analysis
Now that everything is installed, we can do a quick entity analysis of our text. Entity analysis goes through your text and identifies all of the important words, or "entities", in it. When we say "important", what we really mean is words that carry some kind of real-world semantic meaning or significance.
# coding: utf-8
import spacy
### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')
### The text we want to examine
text = "Amazon.com, Inc., doing business as Amazon, is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after Alibaba Group in terms of total sales. The amazon.com website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics - Kindle e-readers, Fire tablets, Fire TV, and Echo - and is the world's largest provider of cloud infrastructure services (IaaS and PaaS). Amazon also sells certain low-end products under its in-house brand AmazonBasics."
### Parse the text with spaCy
### Our 'document' variable now contains a parsed version of text.
document = nlp(text)
### print out all the named entities that were detected
for entity in document.ents:
    print(entity.text, entity.label_)
We first load spaCy's pre-trained ML model and initialize the text we want to process, then run the model over the text to extract the entities. When you run the code above, you get the following output:
Amazon.com, Inc. ORG
Amazon ORG
American NORP
Seattle GPE
Washington GPE
Jeff Bezos PERSON
July 5, 1994 DATE
second ORDINAL
Alibaba Group ORG
amazon.com ORG
Fire TV ORG
Echo - LOC
PaaS ORG
Amazon ORG
AmazonBasics ORG
The 3-letter codes next to the text are labels indicating the type of each detected entity. It looks like our model did a pretty good job! Jeff Bezos is indeed a person, the date is identified correctly, Amazon is an organization, and both Seattle and Washington are geopolitical entities (i.e. countries, cities, states, etc.). The only tricky parts are that things like Fire TV and Echo are actually products, not organizations. The model also missed the "video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry" that Amazon sells, probably because they appear in one big, uncapitalized list and therefore look relatively unimportant.
Overall, though, our model has done what we wanted. Imagine having a huge document with hundreds of pages of text; this NLP model could quickly give you an overview of what the document is about and what the key entities in it are.
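Incidentally, if you are unsure what one of those short label codes stands for, spaCy can tell you. Here is a minimal sketch, using the labels that appeared in the output above:
# coding: utf-8
import spacy
### Print a human-readable description of each entity label
for label in ["ORG", "NORP", "GPE", "PERSON", "DATE", "ORDINAL"]:
    print(label, "->", spacy.explain(label))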
Operating on Entities
Let's try something a bit more applicable. Say you have the same block of text as above, but for privacy reasons you want to automatically remove the names of all people and organizations. With spaCy we can write a handy scrub function that strips out any entity categories we don't want to keep. Here is the code:
# coding: utf-8
import spacy
### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')
### The text we want to examine
text = "Amazon.com, Inc., doing business as Amazon, is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after Alibaba Group in terms of total sales. The amazon.com website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics - Kindle e-readers, Fire tablets, Fire TV, and Echo - and is the world's largest provider of cloud infrastructure services (IaaS and PaaS). Amazon also sells certain low-end products under its in-house brand AmazonBasics."
### Replace a specific entity with the word "PRIVATE"
def replace_entity_with_placeholder(token):
    if token.ent_iob != 0 and (token.ent_type_ == "PERSON" or token.ent_type_ == "ORG"):
        return "[PRIVATE] "
    else:
        return token.string

### Loop through all the entities in a piece of text and apply entity replacement
### Note: ent.merge() and token.string are spaCy 2.x APIs; in spaCy 3 use
### doc.retokenize() and token.text_with_ws instead.
def scrub(text):
    doc = nlp(text)
    for ent in doc.ents:
        ent.merge()
    tokens = map(replace_entity_with_placeholder, doc)
    return "".join(tokens)
print(scrub(text))
[PRIVATE] , doing business as [PRIVATE] , is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by [PRIVATE] on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after [PRIVATE] in terms of total sales. The [PRIVATE] website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics - Kindle e-readers, Fire tablets, [PRIVATE] , and Echo - and is the world's largest provider of cloud infrastructure services (IaaS and [PRIVATE] ). [PRIVATE] also sells certain low-end products under its in-house brand [PRIVATE] .
This looks great! And it is actually a really powerful technique. People use Ctrl+F all the time to find and replace words in a document, but with NLP we can find and replace specific entities, taking their semantics into account rather than just their raw text.
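As a small extension, the scrub function above can be generalized so that the set of entity labels to redact becomes a parameter, letting the same code hide people, organizations, locations, or anything else. This is just a hedged sketch, not part of the original example: it assumes the nlp model and text variable from the snippet above, reuses the same spaCy 2.x calls (ent.merge() and token.string), and the label sets shown are only illustrative choices.
### A generalized scrub: redact any chosen set of entity labels (sketch)
### Assumes `nlp` and `text` are already defined as in the snippet above
def scrub_labels(text, labels):
    doc = nlp(text)
    for ent in doc.ents:
        ent.merge()  # merge each entity into a single token (spaCy 2.x API)
    return "".join(
        "[PRIVATE] " if token.ent_type_ in labels else token.string
        for token in doc
    )

### For example, hide people and locations but keep organization names
print(scrub_labels(text, {"PERSON", "GPE"}))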
Extracting Information from Text
The textacy library we installed earlier implements several common NLP information-extraction algorithms on top of spaCy. It lets us do a few more advanced things beyond the simple out-of-the-box functionality.
One of the algorithms it implements is called Semi-structured Statement Extraction. In essence, it builds on the information that spaCy's NLP model can already extract and, on top of that, lets us pull out more specific information about particular entities. In short, we can extract certain "facts" about an entity of our choosing.
Here is the code. We will build an entity summary from the text of Washington, D.C.'s Wikipedia page.
# coding: utf-8
import spacy
import textacy.extract
### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')
### The text we want to examine
text = """Washington, D.C., formally the District of Columbia and commonly referred to as Washington or D.C., is the capital of the United States of America.[4] Founded after the American Revolution as the seat of government of the newly independent country, Washington was named after George Washington, first President of the United States and Founding Father.[5] Washington is the principal city of the Washington metropolitan area, which has a population of 6,131,977.[6] As the seat of the United States federal government and several international organizations, the city is an important world political capital.[7] Washington is one of the most visited cities in the world, with more than 20 million annual tourists.[8][9]
The signing of the Residence Act on July 16, 1790, approved the creation of a capital district located along the Potomac River on the country's East Coast. The U.S. Constitution provided for a federal district under the exclusive jurisdiction of the Congress and the District is therefore not a part of any state. The states of Maryland and Virginia each donated land to form the federal district, which included the pre-existing settlements of Georgetown and Alexandria. Named in honor of President George Washington, the City of Washington was founded in 1791 to serve as the new national capital. In 1846, Congress returned the land originally ceded by Virginia; in 1871, it created a single municipal government for the remaining portion of the District.
Washington had an estimated population of 693,972 as of July 2017, making it the 20th largest American city by population. Commuters from the surrounding Maryland and Virginia suburbs raise the city's daytime population to more than one million during the workweek. The Washington metropolitan area, of which the District is the principal city, has a population of over 6 million, the sixth-largest metropolitan statistical area in the country.
All three branches of the U.S. federal government are centered in the District: U.S. Congress (legislative), President (executive), and the U.S. Supreme Court (judicial). Washington is home to many national monuments and museums, which are primarily situated on or around the National Mall. The city hosts 177 foreign embassies as well as the headquarters of many international organizations, trade unions, non-profit, lobbying groups, and professional associations, including the Organization of American States, AARP, the National Geographic Society, the Human Rights Campaign, the International Finance Corporation, and the American Red Cross.
A locally elected mayor and a 13‑member council have governed the District since 1973. However, Congress maintains supreme authority over the city and may overturn local laws. D.C. residents elect a non-voting, at-large congressional delegate to the House of Representatives, but the District has no representation in the Senate. The District receives three electoral votes in presidential elections as permitted by the Twenty-third Amendment to the United States Constitution, ratified in 1961."""
### Parse the text with spaCy
### Our 'document' variable now contains a parsed version of text.
document = nlp(text)
### Extracting semi-structured statements
statements = textacy.extract.semistructured_statements(document, "Washington")
print("**** Information from Washington's Wikipedia page ****")
count = 1
for statement in statements:
    subject, verb, fact = statement
    print(str(count) + " - Statement: ", statement)
    print(str(count) + " - Fact: ", fact)
    count += 1
**** Information from Washington's Wikipedia page ****
1 - Statement: (Washington, is, the capital of the United States of America.[4)
1 - Fact: the capital of the United States of America.[4
2 - Statement: (Washington, is, the principal city of the Washington metropolitan area, which has a population of 6,131,977.[6)
2 - Fact: the principal city of the Washington metropolitan area, which has a population of 6,131,977.[6
3 - Statement: (Washington, is, home to many national monuments and museums, which are primarily situated on or around the National Mall)
3 - Fact: home to many national monuments and museums, which are primarily situated on or around the National Mall
Our NLP model found three useful facts about Washington, D.C. in this text:
(1) Washington is the capital of the USA
(2) Washington’s population and the fact that it is metropolitan
(3) Many national monuments and museums
And the best part is that these are the most important pieces of information in that block of text!
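One nice property of this approach (a hedged aside, not part of the original walkthrough): the entity we ask about is just an argument, so the same call can be pointed at any other entity mentioned in the text. Whether it finds any statements depends entirely on the text and the model, so treat this as a sketch:
### Reuse the parsed document to ask about a different entity (sketch)
### Assumes `document` from the snippet above is still in scope
other_statements = textacy.extract.semistructured_statements(document, "District")
for subject, verb, fact in other_statements:
    print("Fact about the District:", fact)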
Going Deeper with NLP
That wraps up our quick introduction to NLP! We covered a lot, but it is only a small taste...
There are many more great applications of NLP, such as machine translation, chatbots, and more specific, complex analyses of text documents. Much of this work today is done with deep learning, in particular Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks.
If you want to play with more NLP yourself, the spaCy docs and textacy docs are a great place to start! You will find many examples of ways to work with parsed text and extract very useful information from it. Everything with spaCy is fast and easy, and you can get real value out of it. Once you have that down, it's time to do bigger and better things with deep learning!
For more of my personal notes, feel free to follow my Zhihu column.
References
https://towardsdatascience.com/an-easy-introduction-to-natural-language-processing-b1e2801291c1
https://towardsdatascience.com/introduction-to-natural-language-processing-for-text-df845750fb63