Parse the string 'name' from the input_url extracted from the url 'path'

Posted 2023-02-15


【Published】: 2022-01-03 20:11:27 【Question】:

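Hello, I am trying to parse the name 'beer.master.121' from input_url with a regular expression, and I am looking for a better regex than the one I have now.

My current function and its result are as follows: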

import re
import urllib
from urllib.parse import urlparse, urlsplit


input_url = 'https://www.pizza.com/beer.master.121/margaretha/98799csduu99003/'

def get_url_data(input_url):
    
    url_parts = urlsplit(input_url)
    query = dict(urllib.parse.parse_qsl(url_parts.query))
    path_ = url_parts.path
    
    if 'margaretha/' in input_url:
        publisher = re.search(r'\w+(?=\s*/[^/])', path_).group(0)
        print(publisher)
        return publisher

When I run the code, I only get the last word:

get_url_data(input_url)
'121'

Desired output:

input_url = 'https://www.pizza.com/beer.master.121/margaretha/98799csduu99003/'
get_url_data(input_url)
'beer.master.121'

input_url = 'https://www.pizza.com/beer.master/margaretha/98799csduuppP000/'
get_url_data(input_url)
'beer.master'

input_url = 'https://www.pizza.com/beer/margaretha/98799csduuppP000/'
get_url_data(input_url)
'beer'

input_url = 'https://www.pizza.com/lovely/10022648/margaretha/939520'
get_url_data(input_url)
'10022648'

input_url = 'https://www.pizza.com/lovely/jhonson.1002278/margaretha/939520'
get_url_data(input_url)
'jhonson.1002278'

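【Comments】:

Is the base URL always the same? Maybe you don't even need a regex.

No, the base URL tends to vary, but the structure is similar and will always contain a word like 'margaretha'; there are only a few such words.

Are you just looking for the first part of the path? Like website/XXX/margaretha/.... and you want XXX?

【Answer 1】:

Another approach, which also prints some additional information.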

Output

url: https://www.pizza.com/beer.master.121/margaretha/98799csduu99003/
network location: www.pizza.com
directories: ['beer.master.121', 'margaretha', '98799csduu99003']
target: beer.master.121

url: https://www.pizza.com/beer.master/margaretha/98799csduuppP000/
network location: www.pizza.com
directories: ['beer.master', 'margaretha', '98799csduuppP000']
target: beer.master

url: https://www.pizza.com/beer/margaretha/98799csduuppP000/
network location: www.pizza.com
directories: ['beer', 'margaretha', '98799csduuppP000']
target: beer

url: https://www.pizza.com/lovely/10022648/margaretha/939520
network location: www.pizza.com
directories: ['lovely', '10022648', 'margaretha', '939520']
target: 10022648

url: https://www.pizza.com/lovely/jhonson.1002278/margaretha/939520
network location: www.pizza.com
directories: ['lovely', 'jhonson.1002278', 'margaretha', '939520']
target: jhonson.1002278

Code

from urllib.parse import urlparse


urls = [
    'https://www.pizza.com/beer.master.121/margaretha/98799csduu99003/',
    'https://www.pizza.com/beer.master/margaretha/98799csduuppP000/',
    'https://www.pizza.com/beer/margaretha/98799csduuppP000/',
    'https://www.pizza.com/lovely/10022648/margaretha/939520',
    'https://www.pizza.com/lovely/jhonson.1002278/margaretha/939520'
]

for url in urls:
    print()
    print(f'url: {url}')

    parts = urlparse(url)
    print(f'network location: {parts.netloc}')

    directories = parts.path.strip('/').split('/')
    print(f'directories: {directories}')

    margaretha_index = directories.index('margaretha')
    ret = directories[margaretha_index-1]
    print(f'target: {ret}')

def get_url_data(url):
    parts = urlparse(url)
    directories = parts.path.strip('/').split('/')
    margaretha_index = directories.index('margaretha')
    return directories[margaretha_index-1]
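
Note that the function above raises ValueError when 'margaretha' is not a path segment, and when 'margaretha' is the first segment the index -1 wraps around to the last segment. A minimal sketch of a guarded variant follows; the name get_segment_before and the marker parameter are hypothetical and not part of the original answer, and returning None in both edge cases is an assumption about the desired behaviour.

```python
from urllib.parse import urlparse


def get_segment_before(url, marker='margaretha'):
    """Return the path segment directly before `marker`, or None."""
    # Drop empty segments produced by leading, trailing, or doubled slashes.
    segments = [s for s in urlparse(url).path.split('/') if s]
    try:
        marker_index = segments.index(marker)
    except ValueError:
        return None  # marker is not one of the path segments
    if marker_index == 0:
        return None  # nothing precedes the marker, so avoid index -1
    return segments[marker_index - 1]
```

For example, get_segment_before('https://www.pizza.com/lovely/10022648/margaretha/939520') returns '10022648', while a URL whose path starts with /margaretha/ gives None instead of the last segment.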

Reference

https://practicaldatascience.co.uk/data-science/how-to-parse-url-structures-using-python

【Comments】:

【Answer 2】:

Try this:

from urllib.parse import urlsplit

def get_url_data(input_url):
    path = urlsplit(input_url).path
    try:
        idx = path.index('margaretha')
    except ValueError:
        return None
    return path[:idx - 1].rsplit('/', 1)[-1]

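【Comments】:

Thanks Ricardo, but this is a refactoring of my old code, where I used split to handle this url, and it failed in many cases. What I want is to get this with a regex so that I can be more certain of the result.

@TheDan Then please add more use cases to your question. If you want people to be able to help you, you need to be more specific.

I just added 2 examples.

@TheDan I updated my answer. Does it work now?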
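
Since the question asks specifically for a regular expression, here is one possible sketch. It assumes the wanted name is always the path segment immediately before a segment that is exactly 'margaretha'; the names SEGMENT_BEFORE_MARGARETHA and get_url_data_regex are hypothetical and do not come from either answer.

```python
import re
from urllib.parse import urlsplit

# Capture one full segment (a run of non-slash characters) that is
# directly followed by a segment equal to "margaretha".
SEGMENT_BEFORE_MARGARETHA = re.compile(r'/([^/]+)/margaretha(?:/|$)')


def get_url_data_regex(input_url):
    """Return the segment before 'margaretha', or None if there is no match."""
    path = urlsplit(input_url).path
    match = SEGMENT_BEFORE_MARGARETHA.search(path)
    return match.group(1) if match else None
```

Unlike \w+, the class [^/]+ also matches the dots inside a segment, which is why 'beer.master.121' is captured whole rather than only '121'.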
