Web Scraping - Python; Writing to a CSV

Posted 2023-02-23

[Title]: Web Scraping - Python; Writing to a CSV
[Posted]: 2019-03-12 11:01:17
[Question]:

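I am trying to write data from a website to a file. The data is laid out in an HTML table: a '<tr>' tag starts each new entry in the ranking, and a '<td>' tag holds each descriptive item about that entry. The list is a ranking of the top 500 computers, shown 1-100 per page, with each entry (1, 2, 3, 4, and so on) in its own '<tr>' and each characteristic of the computer (its storage, max power, etc.) in a '<td>'.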

Here is my code:

# read the data from a URL
url = requests.get("https://www.top500.org/list/2018/06/")
url.status_code
url.content


# parse the URL using Beautiful Soup
soup = BeautifulSoup(url.content, 'html.parser')

filename = "computerRank10.csv"
f = open(filename,"w")

headers = "Rank, Site, System, Cores, RMax, RPeak, Power\n"
f.write(headers)

for record in soup.findAll('tr'):
# start building the record with an empty string
  tbltxt = ""
  tbltxt = tbltxt + data.text + ";"
  tbltxt = tbltxt.replace('\n', ' ')
  tbltxt = tbltxt.replace(',', '')
#    f.write(tbltxt[0:-1] + '\n')
f.write(tbltxt + '\n')

f.close()

I get nothing; my CSV file is always blank.

[Comments]:

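I think you should share your complete code with us; for example, we can't guess what the record variable is.

I edited the code to remove the undefined record variable. When I run the program now, the headers are written to the CSV file correctly, but every row shows up as "3615", and there are only 100 rows, all in the first column.

[Answer 1]: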

You should use the csv module from the Python standard library.

Here is a simpler solution:

import csv

import requests
from bs4 import BeautifulSoup as bs

response = requests.get("https://www.top500.org/list/2018/06")
soup = bs(response.content, 'html.parser')

filename = "computerRank10.csv"

# newline="" is what the csv module expects for files it writes to;
# utf-8 avoids encoding errors on non-ASCII names
with open(filename, 'w', newline="", encoding="utf-8") as csv_file:
    csv_writer = csv.writer(csv_file)

    for tr in soup.find_all("tr"):
        data = []
        # for headers (entered only once, the first time)
        for th in tr.find_all("th"):
            data.append(th.text)
        if data:
            print("Inserting headers: {}".format(','.join(data)))
            csv_writer.writerow(data)
            continue

        # for data rows: prefer the link text when a cell wraps an <a>
        for td in tr.find_all("td"):
            if td.a:
                data.append(td.a.text.strip())
            else:
                data.append(td.text.strip())
        if data:
            print("Inserting data: {}".format(','.join(data)))
            csv_writer.writerow(data)
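
To illustrate why the csv module is worth using here: the question's code strips commas from every cell by hand, whereas csv.writer quotes any field that contains one. The following is only a minimal sketch of that behaviour. The row values are made up, and an in-memory io.StringIO buffer stands in for the real output file.

```python
import csv
import io

# Hypothetical row, loosely shaped like the ranking table (values are made up).
row = ["1", "Example Lab, United States", "Example System", "1,000", "2.5"]

buffer = io.StringIO()  # in-memory stand-in for the CSV file
writer = csv.writer(buffer)
writer.writerow(["Rank", "Site", "System", "Cores", "Rmax"])
writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields the original fields, commas included.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed[1] == row)
```

Fields containing a comma are wrapped in double quotes in the output, so nothing has to be removed from the scraped text before writing it.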

[Comments]:

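I get an ASCII error when I try to run this code, and I don't know how to encode the string. The error message is: print("Inserting data: ".format(','.join(data))) UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 11: ordinal not in range(128)

You can remove that print statement; it is only informational and is beside the point here. Or you can encode the formatted string as utf-8, like this: print("Inserting data: ".format(','.join(data).encode('utf-8')))

[Answer 2]: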

Try the script below. It should fetch all the data for you and write it to a csv file:

import csv
import requests
from bs4 import BeautifulSoup

# {} is filled in with the page number by link.format(page) below
link = "https://www.top500.org/list/2018/06/?page={}"

def get_data(link, writer):
    # request pages 1 to 5 of the list
    for url in [link.format(page) for page in range(1, 6)]:
        res = requests.get(url)
        soup = BeautifulSoup(res.text, "lxml")

        # one CSV row per table row; header and data cells are handled alike
        for items in soup.select("table.table tr"):
            td = [item.get_text(strip=True) for item in items.select("th,td")]
            writer.writerow(td)

if __name__ == '__main__':
    # if an encoding issue comes up, use open("tabularitem.csv", "w", newline="", encoding="utf-8")
    with open("tabularitem.csv", "w", newline="") as outfile:
        writer = csv.writer(outfile)
        get_data(link, writer)
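
The encoding note in the script's comment can be tried in isolation. The sketch below is not part of the answer: it assumes Python 3, uses a made-up row containing "ö" (the u'\xf6' character from the error message quoted earlier in the thread), and writes to a temporary file whose name is hypothetical.

```python
import csv
import os
import tempfile

# Made-up row containing "ö" (u'\xf6'); not real data from the site.
row = ["1", "Example Höchstleistungsrechner", "Example System"]

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "encoding_demo.csv")

# Passing encoding="utf-8" explicitly avoids relying on the platform default.
with open(path, "w", newline="", encoding="utf-8") as outfile:
    csv.writer(outfile).writerow(row)

# Read the row back with the same encoding to confirm it survived.
with open(path, newline="", encoding="utf-8") as infile:
    read_back = next(csv.reader(infile))

print(read_back == row)
```

Without an explicit encoding, open() falls back to a platform-dependent default, which is where errors like the one above tend to come from.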
