Remove Duplicate Entries from JSON File - BeautifulSoup

Posted: 2018-10-14 02:24:10

[Question]:

I'm running a script that crawls a website for textbook information, and the script works. However, when it writes to the JSON file it gives me duplicate results, and I'm trying to figure out how to remove those duplicates. Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = ['https://open.bccampus.ca/find-open-textbooks/', 
'https://open.bccampus.ca/find-open-textbooks/?start=10']

data = []
#opening up connection and grabbing page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #grabs info for each textbook
    containers = page_soup.findAll("h4")

    for container in containers:
        item = {}
        item['type'] = "Textbook"
        item['title'] = container.parent.a.text
        item['author'] = container.nextSibling.findNextSibling(text=True)
        item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + container.parent.a["href"]
        item['source'] = "BC Campus"
        data.append(item) # add the item to the list

with open("./json/bc.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)

Here is a sample of the JSON output:


"type": "Textbook",
"title": "Exploring Movie Construction and Production",
"author": " John Reich, SUNY Genesee Community College",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
"source": "BC Campus"
, 
"type": "Textbook",
"title": "Exploring Movie Construction and Production",
"author": " John Reich, SUNY Genesee Community College",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
"source": "BC Campus"
, 
"type": "Textbook",
"title": "Project Management",
"author": " Adrienne Watt",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
"source": "BC Campus"
, 
"type": "Textbook",
"title": "Project Management",
"author": " Adrienne Watt",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
"source": "BC Campus"

[Comments]:

@brawlins4, your question is very interesting. Please mention the Python version (it is Python 3) and the dependency (beautifulsoup4) in the question, so that answerers/readers know the required environment before trying the code. I first tried it with Python 2.7 and it failed; after searching I found that this urllib import syntax belongs to Python 3, so I created a new conda environment with Python 3.6, activated it, and installed beautifulsoup4 with pip. Anyway, it was interesting because I spent some time getting it to work.

Dear @brawlins4, please also mention that a json folder needs to exist, because your code saves the list of dictionaries as JSON in a file named bc.json inside the ./json directory. If someone (like me) copies and runs the code without reading it first, it fails. It would be better to pass a path like ./bc.json to open(). Enough suggestions - you are the expert and know all of this already; I am only suggesting ways to strengthen your question. Thanks.

[Solution 1]:

Figured it out. Here is the solution for anyone else who runs into this problem:

textbook_list = []
for item in data:
    if item not in textbook_list:
        textbook_list.append(item)

with open("./json/bc.json", "w") as writeJSON:
    json.dump(textbook_list, writeJSON, ensure_ascii=False)
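
A slightly different de-duplication sketch (not from the original answer, and assuming every item carries a unique 'link' value) keys the seen-check on that field, so each membership test compares one string instead of a whole dictionary:

# Sketch: de-duplicate by a unique key (the 'link' field) instead of
# comparing whole dictionaries on every iteration.
seen_links = set()
textbook_list = []
for item in data:
    if item['link'] not in seen_links:
        seen_links.add(item['link'])
        textbook_list.append(item)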

[Comments]:

[Solution 2]:

You do not need to remove duplicates of any kind.

You only need to update your code.

Please read on - I have given a detailed description of the problem below. Also, don't forget to check the gist I wrote while debugging your code: https://gist.github.com/hygull/44cfdc1d4e703b70eb14f16fec14bf2c.

» Where is the problem?

I know you are asking this because you are getting duplicate dictionaries.

That happens because you select the containers as h4 elements, and on the given pages, https://open.bccampus.ca/find-open-textbooks/ and https://open.bccampus.ca/find-open-textbooks/?start=10, each book's details contain 2 h4 elements.

That is why, instead of getting a list of 20 items (10 per page) as the container list, you get double that: a list of 40 items, each of which is an h4 element.

You might expect a different value for each of these 40 items, but the problem arises when you select the parent: both h4 elements of a book give the same parent element, and therefore the same text.

Let's clarify the problem by assuming the following dummy markup.

Note: You can also visit and check https://gist.github.com/hygull/44cfdc1d4e703b70eb14f16fec14bf2c, as it contains the Python code I wrote to debug and solve this problem. It may give you some ideas.

<li> <!-- 1st book -->
    <h4>
        <a> Text 1 </a>
    </h4>
    <h4>
        <a> Text 2 </a>
    </h4>
</li>
<li> <!-- 2nd book -->
    <h4>
        <a> Text 3 </a>
    </h4>
    <h4>
        <a> Text 4 </a>
    </h4>
</li>
...
...
<li> <!-- 20th book -->
    <h4>
        <a> Text 39 </a>
    </h4>
    <h4>
        <a> Text 40 </a>
    </h4>
</li>

»» containers = page_soup.find_all("h4") will give the following list of h4 elements.

[
    <h4>
        <a> Text 1 </a>
    </h4>,
    <h4>
        <a> Text 2 </a>
    </h4>,
    <h4>
        <a> Text 3 </a>
    </h4>,
    <h4>
        <a> Text 4 </a>
    </h4>,
    ...
    ...
    ...
    <h4>
        <a> Text 39 </a>
    </h4>,
    <h4>
        <a> Text 40 </a>
    </h4>
]

»» In your code, the first iteration of the inner for loop assigns the following element to the container variable.

<h4>
    <a> Text 1 </a>
</h4>

»» The second iteration assigns the element below to the container variable (the second h4 of the same li).

<h4>
    <a> Text 2 </a>
</h4>

»» In both of the above (1st and 2nd) inner for-loop iterations, container.parent gives the following element.

<li> <!-- 1st book -->
    <h4>
        <a> Text 1 </a>
    </h4>
    <h4>
        <a> Text 2 </a>
    </h4>
</li>

»» And container.parent.a gives the following element.

<a> Text 1 </a>

»» Finally, container.parent.a.text gives the following text as the book title in both of these first two iterations.

Text 1

That is why we end up with duplicate dictionaries: the dynamically fetched title and author turn out to be the same both times.
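
To see this concretely, here is a minimal sketch (not part of the original answer) that runs the same parent lookup against the dummy markup above:

# Minimal sketch: both h4 elements share the same <li> parent, so
# container.parent.a always resolves to the first <a> inside that parent.
from bs4 import BeautifulSoup

html = """
<li>
    <h4><a> Text 1 </a></h4>
    <h4><a> Text 2 </a></h4>
</li>
"""
page_soup = BeautifulSoup(html, "html.parser")
for container in page_soup.find_all("h4"):
    print(container.parent.a.text)   # prints " Text 1 " twice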

Let's work through the fix step by step.

» Web page details:

    We have links to 2 web pages.

    Each web page has details for 10 textbooks.

    Each book's details contain 2 h4 elements.

    In total there are 2 x 10 x 2 = 40 h4 elements.

» Our goal:

    Our goal is to get a list of only 20 dictionaries, not 40.

    Therefore we need to step through the container list 2 items at a time, i.e. skip 1 item on each iteration.

» Modified working code:

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = [
  'https://open.bccampus.ca/find-open-textbooks/', 
  'https://open.bccampus.ca/find-open-textbooks/?start=10'
]

data = []

#opening up connection and grabbing page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #grabs info for each textbook
    containers = page_soup.find_all("h4")

    for index in range(0, len(containers), 2):
        item = {}
        item['type'] = "Textbook"
        item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + containers[index].parent.a["href"]
        item['source'] = "BC Campus"
        item['title'] = containers[index].parent.a.text
        item['authors'] = containers[index].nextSibling.findNextSibling(text=True)

        data.append(item) # add the item to the list

with open("./json/bc-modified-final.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)

» Output:

[
    {
        "type": "Textbook",
        "title": "Vital Sign Measurement Across the Lifespan - 1st Canadian edition",
        "authors": " Jennifer L. Lapum, Margaret Verkuyl, Wendy Garcia, Oona St-Amant, Andy Tan, Ryerson University",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=feacda80-4fc1-40a5-b713-d6be6a73abe4&contributor=&keyword=&subject=",
        "source": "BC Campus"
    },
    {
        "type": "Textbook",
        "title": "Exploring Movie Construction and Production",
        "authors": " John Reich, SUNY Genesee Community College",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
        "source": "BC Campus"
    },
    {
        "type": "Textbook",
        "title": "Project Management",
        "authors": " Adrienne Watt",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
        "source": "BC Campus"
    },
    ...
    ...
    ...
    {
        "type": "Textbook",
        "title": "Naming the Unnamable: An Approach to Poetry for New Generations",
        "authors": " Michelle Bonczek Evory. Western Michigan University",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8880b4d1-7f62-42fc-a912-3015f216f195&contributor=&keyword=&subject=",
        "source": "BC Campus"
    }
]

Finally, I tried modifying your code further and added more details (description, date, and categories) to each dictionary object.

Python version: 3.6

Dependency: pip install beautifulsoup4

» Modified working code (enhanced version):

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = [
    'https://open.bccampus.ca/find-open-textbooks/', 
    'https://open.bccampus.ca/find-open-textbooks/?start=10'
]

data = []

#opening up connection and grabbing page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #grabs info for each textbook
    containers = page_soup.find_all("h4")

    for index in range(0, len(containers), 2):
        item = {}

        # Store book's information as per given the web page (all 5 are dynamic)
        item['title'] = containers[index].parent.a.text
        item["catagories"] = [a_tag.text for a_tag in containers[index + 1].find_all('a')]
        item['authors'] = containers[index].nextSibling.findNextSibling(text=True).strip()
        item['date'] = containers[index].parent.find_all("strong")[1].findNextSibling(text=True).strip()
        item["description"] = containers[index].parent.p.text.strip()

        # Store extra information (1st is dynamic, last 2 are static)
        item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + containers[index].parent.a["href"]
        item['source'] = "BC Campus"
        item['type'] = "Textbook"

        data.append(item) # add the item to the list

with open("./json/bc-modified-final-my-own-version.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)

» Output (enhanced version):

[
    {
        "title": "Vital Sign Measurement Across the Lifespan - 1st Canadian edition",
        "catagories": [
            "Ancillary Resources"
        ],
        "authors": "Jennifer L. Lapum, Margaret Verkuyl, Wendy Garcia, Oona St-Amant, Andy Tan, Ryerson University",
        "date": "May 3, 2018",
        "description": "Description: The purpose of this textbook is to help learners develop best practices in vital sign measurement. Using a multi-media approach, it will provide opportunities to read about, observe, practice, and test vital sign measurement.",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=feacda80-4fc1-40a5-b713-d6be6a73abe4&contributor=&keyword=&subject=",
        "source": "BC Campus",
        "type": "Textbook"
    },
    {
        "title": "Exploring Movie Construction and Production",
        "catagories": [
            "Adopted"
        ],
        "authors": "John Reich, SUNY Genesee Community College",
        "date": "May 2, 2018",
        "description": "Description: Exploring Movie Construction and Production contains eight chapters of the major areas of film construction and production. The discussion covers theme, genre, narrative structure, character portrayal, story, plot, directing style, cinematography, and editing. Important terminology is defined and types of analysis are discussed and demonstrated. An extended example of how a movie description reflects the setting, narrative structure, or directing style is used throughout the book to illustrate ...[more]",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
        "source": "BC Campus",
        "type": "Textbook"
    },
    ...
    ...
    ...
    {
        "title": "Naming the Unnamable: An Approach to Poetry for New Generations",
        "catagories": [],
        "authors": "Michelle Bonczek Evory. Western Michigan University",
        "date": "Apr 27, 2018",
        "description": "Description: Informed by a writing philosophy that values both spontaneity and discipline, Michelle Bonczek Evory’s Naming the Unnameable: An Approach to Poetry for New Generations  offers practical advice and strategies for developing a writing process that is centered on play and supported by an understanding of America’s rich literary traditions. With consideration to the psychology of invention, Bonczek Evory provides students with exercises aimed to make writing in its early stages a form of play that ...[more]",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8880b4d1-7f62-42fc-a912-3015f216f195&contributor=&keyword=&subject=",
        "source": "BC Campus",
        "type": "Textbook"
    }
]

That's it. Thank you.

[Comments]:

[Solution 3]:

We would be better off using a set data structure instead of a list. A set does not preserve order, but unlike a list it does not store duplicates.

Change your code as follows:

# replace this:
data = []
# with this:
data = set()

# and replace this:
data.append(item)
# with this:
data.add(item)
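
Note that a plain dict is not hashable, so data.add(item) raises a TypeError for the item dictionaries built in the question. A sketch of one workaround (my own suggestion, assuming all values are hashable strings) is to store a tuple of the sorted key/value pairs instead:

# Sketch: convert each dictionary to a hashable tuple before adding it to
# the set, then convert back to dictionaries before dumping to JSON.
data = set()
item = {'type': 'Textbook', 'title': 'Project Management'}  # hypothetical item
data.add(tuple(sorted(item.items())))

unique_items = [dict(pairs) for pairs in data]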

[Comments]:
