Python - Web Scraping HTML table and printing to CSV

Posted 2023-02-23

[Title] Python - Web Scraping HTML table and printing to CSV [Posted] 2018-02-24 19:53:14 [Question]:

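I am very new to Python, but I want to build a web scraper that pulls data from an HTML table online and writes it to a CSV file in the same format.

Here is a sample of the HTML table (it is huge, so I am only providing a few rows).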

<div class="col-xs-12 tab-content">
        <div id="historical-data" class="tab-pane active">
          <div class="tab-header">
            <h2 class="pull-left bottom-margin-2x">Historical data for Bitcoin</h2>

            <div class="clear"></div>

            <div class="row">
              <div class="col-md-12">
                <div class="pull-left">
                  <small>Currency in USD</small>
                </div>
                <div id="reportrange" class="pull-right">
                    <i class="glyphicon glyphicon-calendar fa fa-calendar"></i>&nbsp;
                    <span>Aug 16, 2017 - Sep 15, 2017</span> <b class="caret"></b>
                </div>
              </div>
            </div>

            <table class="table">
              <thead>
              <tr>
                <th class="text-left">Date</th>
                <th class="text-right">Open</th>
                <th class="text-right">High</th>
                <th class="text-right">Low</th>
                <th class="text-right">Close</th>
                <th class="text-right">Volume</th>
                <th class="text-right">Market Cap</th>
              </tr>
              </thead>
              <tbody>

                <tr class="text-right">
                  <td class="text-left">Sep 14, 2017</td>
                  <td>3875.37</td>     
                  <td>3920.60</td>
                  <td>3153.86</td>
                  <td>3154.95</td>
                  <td>2,716,310,000</td>
                  <td>64,191,600,000</td>
                </tr>

                <tr class="text-right">
                  <td class="text-left">Sep 13, 2017</td>
                  <td>4131.98</td>     
                  <td>4131.98</td>
                  <td>3789.92</td>
                  <td>3882.59</td>
                  <td>2,219,410,000</td>
                  <td>68,432,200,000</td>
                </tr>

                <tr class="text-right">
                  <td class="text-left">Sep 12, 2017</td>
                  <td>4168.88</td>     
                  <td>4344.65</td>
                  <td>4085.22</td>
                  <td>4130.81</td>
                  <td>1,864,530,000</td>
                  <td>69,033,400,000</td>
                </tr>                
              </tbody>
            </table>
          </div>

        </div>
    </div>

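I am particularly interested in recreating the table with the same column headers it provides: "Date", "Open", "High", "Low", "Close", "Volume", "Market Cap". So far I have written a simple script that fetches the URL, downloads the HTML, parses it with BeautifulSoup, and then uses for loops to get the td elements. Below is a sample of my code (URL omitted) and the result: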

from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

url = "enterURLhere"
page = requests.get(url)
pagetext = page.text

pricetable = {
    "Date" : [],
    "Open" : [],
    "High" : [],
    "Low" : [],
    "Close" : [],
    "Volume" : [],
    "Market Cap" : []
}

soup = BeautifulSoup(pagetext, 'html.parser')

file = open("test.csv", 'w')

for row in soup.find_all('tr'):
    for col in row.find_all('td'):
        print(col.text)

sample output

Does anyone have any pointers on how to at least reformat the scraped data into a table? Thanks.

[Comments]:

Take a look at the csv module: docs.python.org/2/library/csv.html

[Answer 1]:

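Run the code below and you will get the data you want from that table. To try it out on this element, all you need to do is wrap the entire HTML element you pasted above in html=''' '''.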

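import csv
from bs4 import BeautifulSoup

# html is the table markup from the question, wrapped as html='''...'''
tree = BeautifulSoup(html, "lxml")
table_tag = tree.select("table")[0]  # first table in the document
# One list of cell texts per row; the th cells form the header row
tab_data = [[item.text for item in row_data.select("th,td")]
            for row_data in table_tag.select("tr")]

# The with block closes the file, so every row is flushed to disk
with open("table_data.csv", "w", newline='') as outfile:
    writer = csv.writer(outfile)
    for data in tab_data:
        writer.writerow(data)
        print(' '.join(data))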
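Several cells in this table contain commas themselves ("Sep 14, 2017", "2,716,310,000"), which is worth checking before trusting the CSV. The minimal sketch below uses only the standard library and rows copied from the sample table, and it writes to an in-memory buffer rather than a file, which is an assumption made purely for illustration. It suggests that csv.writer quotes such fields, so reading the file back still yields seven columns per row.

```python
import csv
import io

# Rows as the scraper would produce them (values copied from the sample table).
rows = [
    ["Date", "Open", "High", "Low", "Close", "Volume", "Market Cap"],
    ["Sep 14, 2017", "3875.37", "3920.60", "3153.86", "3154.95",
     "2,716,310,000", "64,191,600,000"],
]

# Write to an in-memory buffer instead of a file so the text can be inspected.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)
csv_text = buffer.getvalue()

# Fields containing commas are wrapped in double quotes by the writer.
print(csv_text)

# Reading the text back recovers the original rows, column for column.
round_tripped = list(csv.reader(io.StringIO(csv_text)))
print(round_tripped == rows)
```

Joining the cells by hand with ",".join(data) would not add those quotes, which is one reason to let the csv module do the writing.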

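I have tried breaking the code into separate steps to make it easier to follow. The list comprehension that builds tab_data above is a nested for loop. Here it is written out: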

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,"lxml")
table = soup.find('table')

list_of_rows = []
# Outer loop: one pass per table row
for row in table.find_all('tr'):
    list_of_cells = []
    # Inner loop: one pass per header or data cell in that row
    for cell in row.find_all(["th","td"]):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

for item in list_of_rows:
    print(' '.join(item))

Result:

Date Open High Low Close Volume Market Cap
Sep 14, 2017 3875.37 3920.60 3153.86 3154.95 2,716,310,000 64,191,600,000
Sep 13, 2017 4131.98 3789.92 3882.59 2,219,410,000 68,432,200,000
Sep 12, 2017 4168.88 4344.65 4085.22 4130.81 1,864,530,000 69,033,400,000
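The question sets up a pricetable dictionary keyed by column header but never fills it. A minimal sketch of one way to do that from rows shaped like list_of_rows is shown here; the rows are hard-coded from the sample table so the snippet runs on its own, and to_int is a hypothetical helper introduced only for illustration.

```python
# Rows in the shape built by the nested loop: header row first, then data rows.
# The values are copied from two rows of the sample table.
list_of_rows = [
    ["Date", "Open", "High", "Low", "Close", "Volume", "Market Cap"],
    ["Sep 14, 2017", "3875.37", "3920.60", "3153.86", "3154.95",
     "2,716,310,000", "64,191,600,000"],
    ["Sep 12, 2017", "4168.88", "4344.65", "4085.22", "4130.81",
     "1,864,530,000", "69,033,400,000"],
]

# Split the header row from the data rows.
header, *data_rows = list_of_rows

# zip(*data_rows) transposes rows into columns; pair each column with its header.
pricetable = {name: list(column)
              for name, column in zip(header, zip(*data_rows))}


def to_int(text):
    """Hypothetical helper: turn '2,716,310,000' into the integer 2716310000."""
    return int(text.replace(",", ""))


volumes = [to_int(value) for value in pricetable["Volume"]]
print(pricetable["Date"])
print(volumes)
```

Since the question already imports pandas, a dictionary of equal-length lists like this could be passed to pd.DataFrame(pricetable) if a DataFrame is preferred.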

[Discussion]:

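I had to make a few adjustments, but it works well. Thank you for sharing and for taking the time to help me. May I ask how you would describe the logic? My understanding is: 1. BeautifulSoup turns the HTML content into a readable format. 2. table_tag is defined as the first table found in the soup. 3. tab_data first takes the text selected from the column headers, and the rest of the row_data is pulled from table_tag as the script runs through it? 4. Could you explain how the for statement is written? Understanding the logic would be very helpful.

Thanks, Shahin - this is great. I actually noticed an interesting issue that I am now trying to solve. If you look at the output for Sep 13, 2017, you will see that a column is missing. That is because the first two values in that row are the same. Is there any way to stop Python from taking only unique values?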
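The adjusted script is not shown in the thread, so the cause of the missing column can only be guessed at. The code in the answer collects cells into lists, and lists keep repeated values, so it should emit all seven cells for that row. One plausible cause, and this is only an assumption, is that the adjusted version passes the cells through a set, dictionary keys, or some other de-duplicating step. The sketch below contrasts the two behaviours using the Sep 13, 2017 row from the sample table.

```python
# Cells of the Sep 13, 2017 row from the sample table.
# Open and High are both 4131.98.
cells = ["Sep 13, 2017", "4131.98", "4131.98", "3789.92", "3882.59",
         "2,219,410,000", "68,432,200,000"]

# A list keeps every cell in order, including the repeated value.
as_list = list(cells)

# De-duplicating (here via dict keys, which preserve insertion order) drops
# the repeat, so every later value shifts one column to the left.
deduplicated = list(dict.fromkeys(cells))

print(len(as_list))       # 7
print(len(deduplicated))  # 6
print(' '.join(deduplicated))
```

If that is what happened, keeping each row as a plain list from the scrape through to writer.writerow should restore the missing value.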
