How to organize data in a JSON file created through web scraping

Posted 2023-02-23

[Original title]: How to organize data in a json file created through webscraping
[Published]: 2019-05-28 14:11:43
[Question]:

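I am trying to get article titles from Yahoo news and organize them in a JSON file. When I dump the data to a JSON file, the result looks confusing. How should I organize the data, either after the dump or from the start?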

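This is for a web scraping project where I have to get the top news articles and their bodies and export them to a JSON file, which can then be sent to someone else's program. For now, I am working on getting the titles from the Yahoo Finance home page.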

import requests
import json
from bs4 import BeautifulSoup

#Getting webpage
page = requests.get("https://finance.yahoo.com/")
soup = BeautifulSoup(page.content, 'html.parser') #creating instance of class to parse the page
#Getting article title
title = soup.find_all(class_="Mb(5px)")
desc = soup.find_all(class_="Fz(14px) Lh(19px) Fz(13px)--sm1024 Lh(17px)--sm1024 LineClamp(3,57px) LineClamp(3,51px)--sm1024 M(0)")
#Getting article bodies
page2 = requests.get("https://finance.yahoo.com/news/warren-buffett-suggests-read-19th-204800450.html")
soup2 = BeautifulSoup(page2.content, 'html.parser')
body = soup.find_all(class_="canvas-atom canvas-text Mb(1.0em) Mb(0)--sm Mt(0.8em)--sm", id="15")


#Organizing data for export
data = {
    'title1': title[0].get_text(),
    'title2': title[1].get_text(),
    'title3': title[2].get_text(),
    'title4': title[3].get_text(),
    'title5': title[4].get_text(),
}

#Exporting the data to results.json
with open("results.json", "w") as write_file:
  json.dump(str(data), write_file)

This is what ends up being written to the JSON file (at the time of writing):

"'title1': 'These US taxpayers face higher payments thanks to new law', 
'title2': 'These 12 Stocks Are the Best Values in 2019, According to Pros 
Who\u2019ve Outsmarted the Market', '\\ntitle3': 'The Best Move You Can     
Make With Your Investments in 2019, According to 5 Market Professionals', 
'title4': 'The auto industry said goodbye to a lot of cars in 2018', 
'title5': '7 Stock Picks From Top-Rated Wall Street Analysts'"

I would like to write code that puts each article title on its own line and removes the stray "\" characters that show up in the middle.

[Comments]:

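JSON is relatively easy to read, but it is not a "pretty" output format. If you want pretty output, you would need to read the file back in and parse it for display. Since you say this is meant to be imported by another program, though, I am not sure why you are worried about this?

Try json.dump(data, write_file, indent=4)

@match I mainly want to remove the unnecessary '\' so that the next group can parse the file more easily.

[Answer 1]: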

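I ran your code but did not get the result you got. You defined 'title3' as a constant key, yet your output shows '\n' in front of it, which I did not see in my case. By the way, you get the "\" escapes because the file is not written with a proper encoding such as 'utf8' and ensure_ascii is not set to False. I would suggest two changes: use the 'lxml' parser instead of 'html.parser', and use this code snippet: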

with open("results.json", "w",encoding='utf8') as write_file:
    json.dump(str(data), write_file ,ensure_ascii=False)

This worked for me: the "\" escapes were gone and the ASCII problem was solved as well.
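To illustrate where the backslashes and the outer quotes come from, here is a minimal sketch using only the standard library. The sample dictionary is an assumption made up for illustration (its text is borrowed from the output in the question); the sketch compares dumping str(data) with dumping the dictionary itself, with ensure_ascii left on and turned off.

```python
import json

# Hypothetical sample; the real titles come from the scraped page.
data = {"title1": "Pros Who\u2019ve Outsmarted the Market"}

# Dumping str(data) serializes one big string, so the result is a quoted
# Python repr rather than a JSON object.
as_string = json.dumps(str(data))

# Dumping the dict itself yields a JSON object; by default non-ASCII
# characters are written as \uXXXX escapes.
escaped = json.dumps(data)

# ensure_ascii=False keeps characters such as the curly apostrophe as-is,
# and indent=4 puts each key on its own line.
readable = json.dumps(data, ensure_ascii=False, indent=4)

print(as_string)
print(escaped)
print(readable)
```

A program that loads the first form gets back a single string, while the other two load as a dictionary, which is probably what a downstream consumer expects.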

[Answer 2]:

import requests
import json
from bs4 import BeautifulSoup
#Getting webpage
page = requests.get("https://finance.yahoo.com/")
soup = BeautifulSoup(page.content, 'html.parser') #creating instance of class to parse the page
#Getting article title
title = soup.find_all(class_="Mb(5px)")
desc = soup.find_all(class_="Fz(14px) Lh(19px) Fz(13px)--sm1024 Lh(17px)--sm1024 LineClamp(3,57px) LineClamp(3,51px)--sm1024 M(0)")
#Getting article bodies
page2 = requests.get("https://finance.yahoo.com/news/warren-buffett-suggests-read-19th-204800450.html")
soup2 = BeautifulSoup(page2.content, 'html.parser')
body = soup.find_all(class_="canvas-atom canvas-text Mb(1.0em) Mb(0)--sm Mt(0.8em)--sm", id="15")
# Strip surrounding whitespace from each title
title = [x.get_text().strip() for x in title]
limit = len(title)  # change this to 5 if you need only the first 5
# Build a dict of the form {"title1": ..., "title2": ...}
data = {"title" + str(i + 1): title[i] for i in range(0, limit)}
with open("results.json", "w", encoding='utf-8') as write_file:
    write_file.write(json.dumps(data, ensure_ascii=False, indent=4))

results.json:

{
    "title1": "These 12 Stocks Are the Best Values in 2019, According to Pros Who’ve Outsmarted the Market",
    "title2": "These US taxpayers face higher payments thanks to new law",
    "title3": "The Best Move You Can Make With Your Investments in 2019, According to 5 Market Professionals",
    "title4": "Cramer Remix: Here's where your first $10,000 should be i...",
    "title5": "The auto industry said goodbye to a lot of cars in 2018",
    "title6": "Ocado Pips Adyen to Take Crown of 2018's Best European Stock",
    "title7": "7 Stock Picks From Top-Rated Wall Street Analysts",
    "title8": "Buy IBM Stock as It Begins 2019 as the Cheapest Dow Component",
    "title9": "$70 Oil Could Be Right Around The Corner",
    "title10": "What Is the Highest Credit Score and How Do You Get It?",
    "title11": "Silver Price Forecast – Silver markets stall on New Year’s Eve",
    "title12": "This Chart Says the S&P 500 Could Rebound in 2019",
    "title13": "Should You Buy Some Berkshire Hathaway Stock?",
    "title14": "How Much Does a Financial Advisor Cost?",
    "title15": "Here Are the World's Biggest Billionaire Winners and Losers of 2018",
    "title16": "Tax tips: What you need to know before you file your taxes in 2019",
    "title17": "Kevin O’Leary: Make This Your Top New Year’s Resolution",
    "title18": "Dakota Access pipeline developer slow to replace some trees",
    "title19": "Einhorn's Greenlight Extends Decline to 34% in Worst Year",
    "title20": "4 companies to watch in 2019",
    "title21": "What Is My Debt-to-Income Ratio?",
    "title22": "US recession unlikely, market volatility to continue in 2019, El-Erian says",
    "title23": "Fidelity: Ignore stock market turbulence and stick to long-term goals",
    "title24": "Tax season: How you can come out a winner",
    "title25": "IBD 50 Growth Stocks To Watch"
}

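If the goal is to see each title on its own line, the consuming side can load the file and print the values. The following is a minimal sketch, assuming results.json has the flat title1..titleN shape shown above; the two sample titles and the temporary directory are stand-ins that only keep the example self-contained.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the scraped titles (copied from the output above).
titles = [
    "These US taxpayers face higher payments thanks to new law",
    "Kevin O’Leary: Make This Your Top New Year’s Resolution",
]
data = {f"title{i}": t for i, t in enumerate(titles, start=1)}

# Written to a temporary directory so the sketch does not touch real files.
path = Path(tempfile.mkdtemp()) / "results.json"
with open(path, "w", encoding="utf-8") as write_file:
    json.dump(data, write_file, ensure_ascii=False, indent=4)

# Load the file back and print one title per line.
with open(path, encoding="utf-8") as read_file:
    loaded = json.load(read_file)

lines = [f"{key}: {value}" for key, value in loaded.items()]
print("\n".join(lines))
```

Because the dictionary itself is dumped (not its string form) and ensure_ascii is off, the file contains no backslash escapes for these titles.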