Parsing the origin city / destination city from a string
I have a pandas DataFrame in which one column is a string containing specific travel details. My goal is to parse each string to extract the origin city and the destination city (ultimately I'd like two new columns, "origin" and "destination").
The data:
df_col = [
'new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags'
]
This should result in:
Origin: New York, USA; Destination: Venice, Italy
Origin: Brussels, BEL; Destination: Bangkok, Thailand
Origin: Los Angeles, USA; Destination: Guadalajara, Mexico
Origin: Paris, France; Destination: Australia / New Zealand (this is a complicated case given two countries)
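Once a working parser exists, attaching the two new columns to the DataFrame is the easy part. Below is a minimal sketch of that plumbing; the `extract_trip` function here is a hypothetical placeholder (a naive split on " to " / " from "), standing in for whatever real extraction logic you end up with:

```python
import pandas as pd

df = pd.DataFrame({
    'description': [
        'new york to venice, italy for usd271',
        'return flights from brussels to bangkok with etihad from €407',
    ]
})

def extract_trip(text):
    """Hypothetical placeholder parser.

    Naive heuristic: in 'X to Y', X (after any 'from') is the origin
    and Y (before any ',' or 'with') is the destination.
    """
    left, _, right = text.partition(' to ')
    origin = left.split(' from ')[-1].strip()
    destination = right.split(',')[0].split(' with ')[0].strip()
    return pd.Series({'origin': origin, 'destination': destination})

# Applying the parser row-wise yields the two new columns.
df[['origin', 'destination']] = df['description'].apply(extract_trip)
print(df)
```

A heuristic like this breaks down quickly on real data (as the answer below shows), but the `apply` scaffolding stays the same regardless of how `extract_trip` is implemented.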
到目前为止,我已经尝试过:各种各样的NLTK方法,但是让我最接近的是使用nltk.pos_tag
方法标记字符串中的每个单词。结果是带有每个单词和相关标签的元组列表。这是一个例子...
[('Fly', 'NNP'), ('to', 'TO'), ('Australia', 'NNP'), ('&', 'CC'), ('New', 'NNP'), ('Zealand', 'NNP'), ('from', 'IN'), ('Paris', 'NNP'), ('from', 'IN'), ('€422', 'NNP'), ('return', 'NN'), ('including', 'VBG'), ('2', 'CD'), ('checked', 'VBD'), ('bags', 'NNS'), ('!', '.')]
I'm stuck at this stage and unsure how best to proceed. Can anyone point me in the right direction? Thanks.
TL;DR
At first glance, this looks nearly impossible, unless you have access to an API with some fairly complex components.
In detail
At first glance, it seems like you're asking for natural language magic. But let's break the problem down and scope it to something we can actually build.
First, to identify countries and cities, you need data that enumerates them, so let's try: https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json
At the top of the search results we find https://datahub.io/core/world-cities, which points to the world-cities.json file. Now let's load it into sets of countries and cities.
import requests
import json
cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))
countries = set([city['country'] for city in cities_json])
cities = set([city['name'] for city in cities_json])
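For reference, each record in world-cities.json has the fields `name`, `country`, `subcountry`, and `geonameid`; the two set comprehensions above just reduce the records to unique country and city names. A small offline sketch with illustrative values (the sample records below are assumptions, not the real file contents):

```python
# A few records in the shape of world-cities.json (illustrative values).
cities_json = [
    {'name': 'Venice', 'country': 'Italy',
     'subcountry': 'Veneto', 'geonameid': 3164603},
    {'name': 'Bangkok', 'country': 'Thailand',
     'subcountry': 'Bangkok', 'geonameid': 1609350},
    {'name': 'Brussels', 'country': 'Belgium',
     'subcountry': 'Brussels Capital', 'geonameid': 2800866},
]

# Deduplicate into the two vocabularies we will match against.
countries = {city['country'] for city in cities_json}
cities = {city['name'] for city in cities_json}

print(countries)
print(cities)
```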
Now that the data is in hand, let's try to build component ONE:
- Task: detect whether any substring of a text matches a city/country.
- Tool: https://github.com/vi3k6i5/flashtext (fast string search/match)
- Metric: the number of correctly identified cities/countries in the string
Let's put it together.
import requests
import json
from flashtext import KeywordProcessor
cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))
countries = set([city['country'] for city in cities_json])
cities = set([city['name'] for city in cities_json])
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))
texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']
keyword_processor.extract_keywords(texts[0])
[out]:
['York', 'Venice', 'Italy']
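To see why the extractor returns 'York' rather than 'New York', here is a tiny pure-Python sketch of the longest-match-wins scan that flashtext performs (greatly simplified; flashtext actually builds a trie). If the vocabulary contains only 'York', then scanning 'new york to venice...' can only ever match 'York':

```python
def extract_keywords(text, vocabulary):
    """Simplified longest-match keyword scan (flashtext-like behaviour)."""
    # Try longer keywords first so 'New York' beats 'York' when both exist.
    vocab = sorted(vocabulary, key=len, reverse=True)
    lowered = text.lower()
    found, i = [], 0
    while i < len(lowered):
        for kw in vocab:
            k = kw.lower()
            if lowered.startswith(k, i):
                # Require word boundaries on both sides of the match.
                before_ok = i == 0 or not lowered[i - 1].isalnum()
                end = i + len(k)
                after_ok = end == len(lowered) or not lowered[end].isalnum()
                if before_ok and after_ok:
                    found.append(kw)
                    i = end
                    break
        else:
            i += 1
    return found

vocab = {'York', 'Venice', 'Italy'}  # note: no 'New York'
print(extract_keywords('new york to venice, italy for usd271', vocab))
# -> ['York', 'Venice', 'Italy']
```

Add 'New York' to `vocab` and the same scan returns 'New York' instead, because the longer keyword is tried first. So the failure is purely a vocabulary problem, not a matching problem.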
Hey, what went wrong?!
Doing our due diligence, the first hunch is that "new york" is not in the data:
>>> "New York" in cities
False
What?! #$%^&* For sanity's sake, we check the following:
>>> len(countries)
244
>>> len(cities)
21940
Yup, you can't trust a single data source, so let's try to fetch them all.
From https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json, you find another link, https://github.com/dr5hn/countries-states-cities-database. Let's work with this one too...
import requests
import json
cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities1_json = json.loads(requests.get(cities_url).content.decode('utf8'))
countries1 = set([city['country'] for city in cities1_json])
cities1 = set([city['name'] for city in cities1_json])
dr5hn_cities_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/cities.json"
dr5hn_countries_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/countries.json"
cities2_json = json.loads(requests.get(dr5hn_cities_url).content.decode('utf8'))
countries2_json = json.loads(requests.get(dr5hn_countries_url).content.decode('utf8'))
countries2 = set([c['name'] for c in countries2_json])
cities2 = set([c['name'] for c in cities2_json])
countries = countries2.union(countries1)
cities = cities2.union(cities1)
Now that we're feeling paranoid, we do a sanity check.
>>> len(countries)
282
>>> len(cities)
127793
Whoa, that's a lot more cities than before.
Let's try the flashtext code again.
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))
texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']
keyword_processor.extract_keywords(texts[0])
[out]:
['York', 'Venice', 'Italy']
Seriously?! No New York?! $%^&*
Okay, for one more sanity check, let's just look for "york" in the list of cities.
>>> [c for c in cities if 'york' in c.lower()]
['Yorklyn',
'West York',
'West New York',
'Yorktown Heights',
'East Riding of Yorkshire',
'Yorke Peninsula',
'Yorke Hill',
'Yorktown',
'Jefferson Valley-Yorktown',
'New York Mills',
'City of York',
'Yorkville',
'Yorkton',
'New York County',
'East York',
'East New York',
'York Castle',
'York County',
'Yorketown',
'New York City',
'York Beach',
'Yorkshire',
'North Yorkshire',
'Yorkeys Knob',
'York',
'York Town',
'York Harbor',
'North York']
Eureka! It's because it's listed as "New York City", not "New York"!
You: What kind of prank is this?!
Linguist: Welcome to the world of natural language processing, where natural language is a social construct, subject to communal and idiolectal variants.
You: Cut the crap. Tell me how to solve this.
NLP practitioner (a real one who deals with noisy user-generated text): You just have to add them to the list. But before that, check your metric against the list you already have.
For every text in your sample "test set", you should provide some truth labels so that you can "measure your metric".
from itertools import zip_longest
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))
texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris'))]
# No. of correctly extracted terms.
true_positives = 0
false_positives = 0
total_truth = 0
for text, label in texts_labels:
    extracted = keyword_processor.extract_keywords(text)
    # We're making some assumptions here that the order of
    # extracted and the truth must be the same.
    true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
    false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
    total_truth += len(label)
    # Just visualization candies.
    print(text)
    print(extracted)
    print(label)
    print()
Actually, that doesn't look too bad. We get an accuracy of 90%:
>>> true_positives / total_truth
0.9
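To make the 90% concrete, here is the same computation with the extractions hard-coded (assumed to match the runs above). Only the first text is wrong, and only in one slot ('York' instead of 'New York'), so 9 of the 10 truth labels line up:

```python
from itertools import zip_longest

# Assumed extractor outputs for the four texts (per the runs above).
extractions = [
    ['York', 'Venice', 'Italy'],            # 'New York' missed -> 'York'
    ['Brussels', 'Bangkok'],
    ['Los Angeles', 'Guadalajara'],
    ['Australia', 'New Zealand', 'Paris'],
]
truths = [
    ('New York', 'Venice', 'Italy'),
    ('Brussels', 'Bangkok'),
    ('Los Angeles', 'Guadalajara'),
    ('Australia', 'New Zealand', 'Paris'),
]

# Position-wise comparison, same assumption as above: extraction
# order must match truth order.
true_positives = sum(
    1
    for extracted, truth in zip(extractions, truths)
    for e, t in zip_longest(extracted, truth)
    if e == t
)
total_truth = sum(len(t) for t in truths)
print(true_positives / total_truth)  # -> 0.9
```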
But I want %^&*(-ing 100% extraction!!
Alright, alright: look at the "only" error the approach above makes; it's simply that "New York" is not in the list of cities.
You: Why don't we just add "New York" to the city list? I.e.
keyword_processor.add_keyword('New York')
print(texts[0])
print(keyword_processor.extract_keywords(texts[0]))
[out]:
new york to venice, italy for usd271
['New York', 'Venice', 'Italy']
You: See, I did it!!! Now I deserve a beer. Linguist: What about 'I live in Marawi'?
>>> keyword_processor.extract_keywords('I live in Marawi')
[]
NLP practitioner (chiming in): What about 'I live in Jeju'?
>>> keyword_processor.extract_keywords('I live in Jeju')
[]
A Raymond Hettinger fan (from afar): "There must be a better way!"
Yes, there is. What if we just try something silly, like adding keywords for cities that end in "City" to the keyword_processor?
for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])
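Offline and without flashtext, the effect of this loop is just the following (a sketch using a plain set of illustrative names; `c[:-5]` strips the trailing " City"):

```python
# Illustrative vocabulary containing only the '... City' variants.
cities = {'New York City', 'Jeju City', 'Marawi City', 'Quezon City', 'Venice'}

extra_keywords = set()
for c in cities:
    # Strip the ' City' suffix so 'New York City' also matches 'new york'.
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            extra_keywords.add(c[:-5])

print(sorted(extra_keywords))  # -> ['Jeju', 'Marawi', 'New York', 'Quezon']
```

With these suffix-stripped keywords added, 'new york', 'Jeju', and 'Marawi' are all findable again, covering all three misses seen above without hand-adding each city.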