Parse the string 'name' from the input_url extracted from the url 'path'
【Posted】: 2022-01-03 20:11:27
【Question】: Hello, I am trying to parse the name "beer.master.121" from input_url with a regular expression, and I am looking for a better regex than the one I have now.
My current function and its result are shown below:
import re
import urllib
from urllib.parse import urlparse, urlsplit

input_url = 'https://www.pizza.com/beer.master.121/margaretha/98799csduu99003/'

def get_url_data(input_url):
    url_parts = urlsplit(input_url)
    query = dict(urllib.parse.parse_qsl(url_parts.query))
    path_ = url_parts.path
    if 'margaretha/' in input_url:
        # Raw string so that \w and \s reach the regex engine unchanged
        publisher = re.search(r'\w+(?=\s*/[^/])', path_).group(0)
        print(publisher)
        return publisher
When I run the code, I only get the last word:
get_url_data(input_url)
'121'
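A likely reason for the '121' result is that `\w` does not match `.`, so the name is split into dot-separated pieces and only the piece directly followed by `/` plus another segment satisfies the lookahead. The sketch below illustrates this on the first example path; the variable names `first` and `whole` are made up for the illustration, and the widened character class is only a diagnostic, since it does not yet anchor on 'margaretha'.

```python
import re

path = '/beer.master.121/margaretha/98799csduu99003/'

# \w matches letters, digits and underscore, but not '.', so every
# dot-separated piece of the name is a separate candidate.
print(re.findall(r'\w+', path))

# The lookahead only succeeds for a piece that is directly followed by
# '/' and the start of another segment, which first happens at '121'.
first = re.search(r'\w+(?=\s*/[^/])', path).group(0)
print(first)

# Allowing '.' inside the character class keeps the segment in one piece.
whole = re.search(r'[\w.]+(?=\s*/[^/])', path).group(0)
print(whole)
```

Note that the widened pattern still returns the first path segment, so it would give 'lovely' for URLs where another directory precedes the wanted name.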
Desired output:
input_url = 'https://www.pizza.com/beer.master.121/margaretha/98799csduu99003/'
get_url_data(input_url)
'beer.master.121'
input_url = 'https://www.pizza.com/beer.master/margaretha/98799csduuppP000/'
get_url_data(input_url)
'beer.master'
input_url = 'https://www.pizza.com/beer/margaretha/98799csduuppP000/'
get_url_data(input_url)
'beer'
input_url = 'https://www.pizza.com/lovely/10022648/margaretha/939520'
get_url_data(input_url)
'10022648'
input_url = 'https://www.pizza.com/lovely/jhonson.1002278/margaretha/939520'
get_url_data(input_url)
'jhonson.1002278'
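【Comments】:
Is the base URL always the same? Maybe you don't even need a regex.
No, the base URL tends to change, but the structure is similar and it will always contain a word like 'margaretha', though only a few such words.
Are you just looking for the first part of the path? Like website/XXX/margaretha/.... and you want XXX?
【Answer 1】: Another approach, which also prints some additional information.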
Output
url: https://www.pizza.com/beer.master.121/margaretha/98799csduu99003/
network location: www.pizza.com
directories: ['beer.master.121', 'margaretha', '98799csduu99003']
target: beer.master.121
url: https://www.pizza.com/beer.master/margaretha/98799csduuppP000/
network location: www.pizza.com
directories: ['beer.master', 'margaretha', '98799csduuppP000']
target: beer.master
url: https://www.pizza.com/beer/margaretha/98799csduuppP000/
network location: www.pizza.com
directories: ['beer', 'margaretha', '98799csduuppP000']
target: beer
url: https://www.pizza.com/lovely/10022648/margaretha/939520
network location: www.pizza.com
directories: ['lovely', '10022648', 'margaretha', '939520']
target: 10022648
url: https://www.pizza.com/lovely/jhonson.1002278/margaretha/939520
network location: www.pizza.com
directories: ['lovely', 'jhonson.1002278', 'margaretha', '939520']
target: jhonson.1002278
Code
from urllib.parse import urlparse

urls = [
    'https://www.pizza.com/beer.master.121/margaretha/98799csduu99003/',
    'https://www.pizza.com/beer.master/margaretha/98799csduuppP000/',
    'https://www.pizza.com/beer/margaretha/98799csduuppP000/',
    'https://www.pizza.com/lovely/10022648/margaretha/939520',
    'https://www.pizza.com/lovely/jhonson.1002278/margaretha/939520'
]

for url in urls:
    print()
    print(f'url: {url}')
    parts = urlparse(url)
    print(f'network location: {parts.netloc}')
    # Split the path into its directory names, ignoring leading/trailing '/'
    directories = parts.path.strip('/').split('/')
    print(f'directories: {directories}')
    # The target is the directory immediately before 'margaretha'
    margaretha_index = directories.index('margaretha')
    ret = directories[margaretha_index-1]
    print(f'target: {ret}')

def get_url_data(url):
    parts = urlparse(url)
    directories = parts.path.strip('/').split('/')
    margaretha_index = directories.index('margaretha')
    return directories[margaretha_index-1]
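Two edge cases may be worth guarding in the function above: `list.index` raises `ValueError` when 'margaretha' is not a path segment, and when 'margaretha' is the first segment the index `-1` silently wraps around to the last directory. A minimal sketch of a guarded variant follows; the name `get_segment_before` and the `marker` parameter are assumptions made for illustration and are not part of the answer.

```python
from urllib.parse import urlparse

def get_segment_before(url, marker='margaretha'):
    """Return the path segment right before `marker`, or None if there is none."""
    directories = urlparse(url).path.strip('/').split('/')
    try:
        marker_index = directories.index(marker)
    except ValueError:
        return None  # marker is not a path segment of this URL
    if marker_index == 0:
        return None  # nothing precedes the marker; avoid directories[-1]
    return directories[marker_index - 1]

print(get_segment_before('https://www.pizza.com/lovely/jhonson.1002278/margaretha/939520'))
print(get_segment_before('https://www.pizza.com/margaretha/939520'))
```

Returning None rather than raising is a design choice for this sketch; callers that prefer an exception could let the `ValueError` propagate instead.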
Reference
https://practicaldatascience.co.uk/data-science/how-to-parse-url-structures-using-python
【Answer 2】: Try this:
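from urllib.parse import urlsplit

def get_url_data(input_url):
    path = urlsplit(input_url).path
    try:
        # Position of 'margaretha' within the path string
        idx = path.index('margaretha')
    except ValueError:
        # str.index raises ValueError when the substring is absent
        return None
    # Drop the '/' before 'margaretha', then keep the last remaining segment
    return path[:idx - 1].rsplit('/', 1)[-1]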
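【Comments】:
Thanks Ricardo, but this is a refactoring of my old code, where I handled this url with split, and it failed in many cases. What I want is to get this with a regex, to be more certain of the result.
@TheDan Then please add more use cases to your question. If you want people to be able to help you, you need to be more specific.
I just added 2 examples.
@TheDan I updated my answer. Does it work now?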
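Since the asker asks for a regex specifically, here is one possible sketch, not taken from either answer. It captures the run of non-`/` characters directly before a `/margaretha` segment; the names `SEGMENT_BEFORE_MARKER` and `get_url_data_regex` are hypothetical, and the pattern assumes 'margaretha' always appears as a complete path segment.

```python
import re
from urllib.parse import urlsplit

# '/' + one full segment (captured) + '/margaretha', where 'margaretha'
# must itself end at a '/' or at the end of the path.
SEGMENT_BEFORE_MARKER = re.compile(r'/([^/]+)/margaretha(?=/|$)')

def get_url_data_regex(input_url):
    """Return the path segment preceding 'margaretha', or None if absent."""
    path = urlsplit(input_url).path
    match = SEGMENT_BEFORE_MARKER.search(path)
    return match.group(1) if match else None

print(get_url_data_regex('https://www.pizza.com/beer.master.121/margaretha/98799csduu99003/'))
# beer.master.121
print(get_url_data_regex('https://www.pizza.com/lovely/10022648/margaretha/939520'))
# 10022648
```

Using `[^/]+` instead of `\w+` means dots, digits and any other non-slash characters stay inside the captured segment, and anchoring on `/margaretha` makes the match independent of how many directories come before it.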