How do I hierarchically sort URLs in Python?

Posted: 2022-01-20 02:32:52

Given an initial list of URLs crawled from a site:

https://somesite.com/
https://somesite.com/advertise
https://somesite.com/articles
https://somesite.com/articles/read
https://somesite.com/articles/read/1154
https://somesite.com/articles/read/1155
https://somesite.com/articles/read/1156
https://somesite.com/articles/read/1157
https://somesite.com/articles/read/1158
https://somesite.com/blogs

I am trying to convert the list into a tab-indented tree hierarchy:

https://somesite.com
    /advertise
    /articles
        /read
            /1154
            /1155
            /1156
            /1157
            /1158
    /blogs

I have tried using lists, tuples, and dictionaries. So far I have only come up with two flawed ways of outputting the content.

Method 1 loses elements if they have the same name and position in the hierarchy:

Input:
https://somesite.com
https://somesite.com/missions
https://somesite.com/missions/playit
https://somesite.com/missions/playit/extbasic
https://somesite.com/missions/playit/extbasic/0
https://somesite.com/missions/playit/stego
https://somesite.com/missions/playit/stego/0
Output:
https://somesite.com/
    /missions
        /playit
            /extbasic
                /0
            /stego

----------------^ Missing expected output "/0"

Method 2 doesn't miss any elements, but it prints redundant content:

Input:
https://somesite.com
https://somesite.com/missions
https://somesite.com/missions/playit
https://somesite.com/missions/playit/extbasic
https://somesite.com/missions/playit/extbasic/0
https://somesite.com/missions/playit/stego
https://somesite.com/missions/playit/stego/0
Output:
https://somesite.com/
    /missions
        /playit
            /extbasic
                /0
    /missions       <- Redundant content
        /playit     <- Redundant content
            /stego      
                /0

I'm not sure how to do this correctly, and my googling has only turned up references to urllib, which doesn't seem to be what I need. Maybe there's a better approach, but I haven't been able to find it.
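
From what I can tell, urllib.parse only splits a single URL into its components rather than relating many URLs to each other, e.g.:

from urllib.parse import urlparse

parts = urlparse("https://somesite.com/articles/read/1154")
print(parts.scheme)  # https
print(parts.netloc)  # somesite.com
print(parts.path)    # /articles/read/1154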

My code that turns the content into a usable list:

#!/usr/bin/python3

import re

# Read the original list of URLs from file
with open("sitelist.raw", "r") as f:
    raw_site_list = f.readlines()

# Extract the prefix and domain from the first line
first_line = raw_site_list[0]
prefix, domain = re.match(r"(https?://)([^/]+)", first_line).group(1, 2)

# Remove the prefix and domain and trailing newlines, and drop any lines that are only a slash
clean_site_list = []
for line in raw_site_list:
    clean_line = line.strip()
    if clean_line.startswith(prefix + domain):
        clean_line = clean_line[len(prefix) + len(domain):]
    if clean_line and clean_line != "/" and not clean_line.endswith("/"):
        clean_site_list.append(clean_line)

# Split the resulting relative paths into their component parts and filter out empty strings
split_site_list = []
for site in clean_site_list:
    split_site_list += [list(filter(None, site.split("/")))]

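For the sample list at the top of the question, split_site_list ends up looking roughly like this (the bare root URL is dropped by the slash check):

[['advertise'],
 ['articles'],
 ['articles', 'read'],
 ['articles', 'read', '1154'],
 ['articles', 'read', '1155'],
 ['articles', 'read', '1156'],
 ['articles', 'read', '1157'],
 ['articles', 'read', '1158'],
 ['blogs']]
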
This gives me a list to work with, but I have run out of ideas for how to print it without either losing elements or printing redundant ones.

Thanks


Edit: here is the final working code I put together based on the accepted answer below:

# Read list of URLs from file
with open("sitelist.raw", "r") as f:
    urls = f.readlines()

# Remove trailing newlines
urls = [url.rstrip("\n") for url in urls]

# Remove any trailing slashes
urls = [url.rstrip("/") for url in urls]

# Remove duplicate lines
unique_urls = []
for url in urls:
    if url not in unique_urls:
        unique_urls += [url]

# Do the actual work (modified to use unique_urls and use tabs instead of 4x spaces, and to write to file)
base = unique_urls[0]
tabdepth = 0
tlen = len(base.split('/'))

final_urls = []
for url in unique_urls[1:]:
    t = url.split('/')
    lt = len(t)
    if lt != tlen:
        tabdepth += 1 if lt > tlen else -1
        tlen = lt
    pad = ''.join(['\t' for _ in range(tabdepth)])
    final_urls += [f'{pad}/{t[-1]}']

with open("sitelist.new", "wt") as f:
    f.write(base + "\n")
    for url in final_urls:
        f.write(url + "\n")

Comments:

Not an exact duplicate, but close: ***.com/questions/8484943. Show how you wrote the actual methods...

Answer 1:

This works with your sample data:

urls = ['https://somesite.com',
        'https://somesite.com/missions',
        'https://somesite.com/missions/playit',
        'https://somesite.com/missions/playit/extbasic',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/stego',
        'https://somesite.com/missions/playit/stego/0']


base = urls[0]
print(base)
tabdepth = 0
tlen = len(base.split('/'))

for url in urls[1:]:
    t = url.split('/')
    lt = len(t)
    if lt != tlen:
        tabdepth += 1 if lt > tlen else -1
        tlen = lt
    pad = ''.join(['    ' for _ in range(tabdepth)])
    print(f'{pad}/{t[-1]}')
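
Run against the urls list above, this should print:

https://somesite.com
    /missions
        /playit
            /extbasic
                /0
            /stego
                /0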

Comments:

Thanks for this concise and effective solution. I've edited my original post with the final code based on it. Much appreciated!

Answer 2:

This code will get the job done. I admit it may be a bit large and probably contains some redundant code and checks, but it builds a dictionary containing the URL hierarchy, which you can then use however you like: print it, store it, and so on.

What's more, this code will also parse different base URLs and build a separate tree for each of them (see the code and output).

Edit: this will also handle redundant URLs.

Code:

from json import dumps


def process_urls(urls: list):
    tree = {}

    for url in urls:
        url_components = url.split("/")
        # First three components will be the protocol
        # an empty entry
        # and the base domain 
        base_domain = url_components[:3]
        base_domain = base_domain[0] + "//" + "".join(base_domain[1:])
        # Add base domain to tree if not there.
        try:
            tree[base_domain]
        except:
            tree[base_domain] = {}

        structure = url_components[3:]
        
        for i in range(len(structure)):
            # add the first element
            if i == 0 :
                try:
                    tree[base_domain]["/"+structure[i]]
                except:
                    tree[base_domain]["/"+structure[i]] = {}
            else:
                base = tree[base_domain]["/"+structure[0]]
                for j in range(1, i):
                    base = base["/"+structure[j]]

                try:
                    base["/"+structure[i]]
                except:
                    base["/"+structure[i]] = {}

    return tree


def print_tree(tree: dict, depth=0):
    for key in tree.keys():
        print("\t"*depth+key)

        # redundant checks
        if type(tree[key]) == dict:
            
            # if dictionary is empty then do nothing
            # else call this function recursively
            # increase depth by 1
            if tree[key]:
                print_tree(tree[key], depth+1)


if __name__ == "__main__":
    urls = [
        'https://somesite.com',
        'https://somesite.com/missions',
        'https://somesite.com/missions/playit',
        'https://somesite.com/missions/playit/extbasic',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/stego',
        'https://somesite.com/missions/playit/stego/0',
        'https://somesite2.com/missions/playit',
        'https://somesite2.com/missions/playit/extbasic',
        'https://somesite2.com/missions/playit/extbasic/0',
        'https://somesite2.com/missions/playit/stego',
        'https://somesite2.com/missions/playit/stego/0'
    ]
    tree = process_urls(urls)
    print_tree(tree)

Output:

https://somesite.com
    /missions
            /playit
                    /extbasic
                            /0
                    /stego
                            /0
https://somesite2.com
    /missions
            /playit
                    /extbasic
                            /0
                    /stego
                            /0

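Since the tree is just nested dictionaries, the dumps imported at the top can also be used to pretty-print it for inspection, for example:

print(dumps(tree, indent=4))
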
Comments:

Thanks for this very nice solution. It's a bit too complex for my current project, but I'm saving it as an example in case my needs grow in the future, since I agree dicts would allow more general functionality if I needed it.
