BeautifulSoup HTML to CSV

Original title: beautifulSoup html csv
Posted: 2012-12-19 11:51:43

Question:

Good evening. I have used BeautifulSoup to extract some data from a website, as follows:

from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen

soup = BeautifulSoup(urlopen('http://www.fsa.gov.uk/about/media/facts/fines/2002'))

table = soup.findAll('table', attrs={"class": "table-horizontal-line"})

print table

This gives the following output:

[<table  class="table-horizontal-line">
<tr>
<th>Amount</th>
<th>Company or person fined</th>
<th>Date</th>
<th>What was the fine for?</th>
<th>Compensation</th>
</tr>
<tr>
<td><a name="_Hlk74714257" id="_Hlk74714257">&#160;</a>£4,000,000</td>
<td><a href="/pages/library/communication/pr/2002/124.shtml">Credit Suisse First Boston International </a></td>
<td>19/12/02</td>
<td>Attempting to mislead the Japanese regulatory and tax authorities</td>
<td>&#160;</td>
</tr>
<tr>
<td>£750,000</td>
<td><a href="/pages/library/communication/pr/2002/123.shtml">Royal Bank of Scotland plc</a></td>
<td>17/12/02</td>
<td>Breaches of money laundering rules</td>
<td>&#160;</td>
</tr>
<tr>
<td>£1,000,000</td>
<td><a href="/pages/library/communication/pr/2002/118.shtml">Abbey Life Assurance Company ltd</a></td>
<td>04/12/02</td>
<td>Mortgage endowment mis-selling and other failings</td>
<td>Compensation estimated to be between £120 and £160 million</td>
</tr>
<tr>
<td>£1,350,000</td>
<td><a href="/pages/library/communication/pr/2002/087.shtml">Royal &#38; Sun Alliance Group</a></td>
<td>27/08/02</td>
<td>Pension review failings</td>
<td>Redress exceeding £32 million</td>
</tr>
<tr>
<td>£4,000</td>
<td><a href="/pubs/final/ft-inv-ins_7aug02.pdf" target="_blank">F T Investment &#38; Insurance Consultants</a></td>
<td>07/08/02</td>
<td>Pensions review failings</td>
<td>&#160;</td>
</tr>
<tr>
<td>£75,000</td>
<td><a href="/pubs/final/spe_18jun02.pdf" target="_blank">Seymour Pierce Ellis ltd</a></td>
<td>18/06/02</td>
<td>Breaches of FSA Principles ("skill, care and diligence" and "internal organization")</td>
<td>&#160;</td>
</tr>
<tr>
<td>£120,000</td>
<td><a href="/pages/library/communication/pr/2002/051.shtml">Ward Consultancy plc</a></td>
<td>14/05/02</td>
<td>Pension review failings</td>
<td>&#160;</td>
</tr>
<tr>
<td>£140,000</td>
<td><a href="/pages/library/communication/pr/2002/036.shtml">Shawlands Financial Services ltd</a> - formerly Frizzell Life &#38; Financial Planning ltd)</td>
<td>11/04/02</td>
<td>Record keeping and associated compliance breaches</td>
<td>&#160;</td>
</tr>
<tr>
<td>£5,000</td>
<td><a href="/pubs/final/woodwards_4apr02.pdf" target="_blank">Woodward's Independent Financial Advisers</a></td>
<td>04/04/02</td>
<td>Pensions review failings</td>
<td>&#160;</td>
</tr>
</table>]

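I would like to export this to a CSV file while keeping the table structure shown on the website. Is this possible, and if so, how?

Thanks in advance for your help.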

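Comments:

You may want to look at this solution: sebsauvage.net/python/html2csv.py. I found it by googling "html to csv python" :)

Thanks, though that solution looks rather complicated. Since all of my data is fairly cleanly formatted, I was hoping there would be a simpler way... If not, I will try to follow this one :-)

Answer 1: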

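Here is a basic approach you can try. It assumes that the headers are all in <th> tags and that all subsequent data is in <td> tags. It works for the single case you provided, but adjustments will probably be needed for other cases :)

The general idea is as follows. Once you have found your table (here using find to pull the first match), we get the headers by iterating over all th elements and storing their text in a list. Then we create a rows list that holds one list per table row. Each of these is populated by finding all td elements under a tr tag, taking their text, and encoding it as UTF-8 (from Unicode). Finally, you open a CSV file and write the headers first, followed by all of the rows, using (row for row in rows if row) to drop any empty rows (the header row contains no td elements, so it yields an empty list):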

In [117]: import csv

In [118]: from bs4 import BeautifulSoup

In [119]: from urllib2 import urlopen

In [120]: soup = BeautifulSoup(urlopen('http://www.fsa.gov.uk/about/media/facts/fines/2002'))

In [121]: table = soup.find('table', attrs={"class": "table-horizontal-line"})

In [122]: headers = [header.text for header in table.find_all('th')]

In [123]: rows = []

In [124]: for row in table.find_all('tr'):
   .....:     rows.append([val.text.encode('utf8') for val in row.find_all('td')])
   .....: 

In [125]: with open('output_file.csv', 'wb') as f:
   .....:     writer = csv.writer(f)
   .....:     writer.writerow(headers)
   .....:     writer.writerows(row for row in rows if row)
   .....: 

In [126]: cat output_file.csv
Amount,Company or person fined,Date,What was the fine for?,Compensation
" £4,000,000",Credit Suisse First Boston International ,19/12/02,Attempting to mislead the Japanese regulatory and tax authorities, 
"£750,000",Royal Bank of Scotland plc,17/12/02,Breaches of money laundering rules, 
"£1,000,000",Abbey Life Assurance Company ltd,04/12/02,Mortgage endowment mis-selling and other failings,Compensation estimated to be between £120 and £160 million
"£1,350,000",Royal & Sun Alliance Group,27/08/02,Pension review failings,Redress exceeding £32 million
"£4,000",F T Investment & Insurance Consultants,07/08/02,Pensions review failings, 
"£75,000",Seymour Pierce Ellis ltd,18/06/02,"Breaches of FSA Principles (""skill, care and diligence"" and ""internal organization"")", 
"£120,000",Ward Consultancy plc,14/05/02,Pension review failings, 
"£140,000",Shawlands Financial Services ltd - formerly Frizzell Life & Financial Planning ltd),11/04/02,Record keeping and associated compliance breaches, 
"£5,000",Woodward's Independent Financial Advisers,04/04/02,Pensions review failings, 
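
The session above is Python 2 (urllib2, a file opened in 'wb' mode, and manual UTF-8 encoding). As a hedged sketch of the same idea in Python 3, the version below parses a small inline stand-in for the page so that it runs without network access. SAMPLE_HTML, table_to_rows and rows_to_csv are hypothetical names introduced for illustration, and get_text(strip=True) is a choice made here to trim the stray non-breaking spaces; the original answer keeps the text as-is.

```python
import csv
import io

from bs4 import BeautifulSoup

# Small stand-in for the page in the question, so the sketch runs offline.
SAMPLE_HTML = """
<table class="table-horizontal-line">
<tr><th>Amount</th><th>Company or person fined</th><th>Date</th></tr>
<tr><td>&#160;£4,000,000</td><td><a href="#">Credit Suisse First Boston International </a></td><td>19/12/02</td></tr>
<tr><td>£750,000</td><td><a href="#">Royal Bank of Scotland plc</a></td><td>17/12/02</td></tr>
</table>
"""


def table_to_rows(html, css_class="table-horizontal-line"):
    """Return (headers, rows) for the first table with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": css_class})
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped here
            rows.append(cells)
    return headers, rows


def rows_to_csv(headers, rows, fileobj):
    """Write the header line followed by the data rows as CSV."""
    writer = csv.writer(fileobj)
    writer.writerow(headers)
    writer.writerows(rows)


headers, rows = table_to_rows(SAMPLE_HTML)
buffer = io.StringIO()  # in-memory file; swap for a real file to save to disk
rows_to_csv(headers, rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened in text mode, for example open('output_file.csv', 'w', newline='', encoding='utf-8'). In Python 3 the csv module expects text rather than bytes, so the explicit .encode('utf8') step from the session above is not needed.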

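Comments:

Thanks, this looks like the perfect solution. However, I seem to get a SyntaxError on the line "cat output_file.csv"; it just says invalid syntax?

@merlin_1980 Oh, sorry, I should have mentioned that this is an IPython-specific thing (it is basically just showing the contents of the file). If you got to that point, the file should already be saved in that directory.

Thank you very much :-) I did not think to look in the directory and open the file manually!

@merlin_1980 No problem, I could have been clearer :) Good luck with everything!

When the headers are in a thead rather than in th cells: headers = [header.text for header in table.find('thead').find_all('td')]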
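
The thead variant from the last comment can be folded into the header lookup. The following is a minimal Python 3 sketch, assuming a table whose header cells are td elements inside thead and whose data rows sit inside tbody; extract_headers, extract_body_rows and HTML_WITH_THEAD are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical table whose header cells are <td> inside <thead>.
HTML_WITH_THEAD = """
<table>
<thead><tr><td>Amount</td><td>Date</td></tr></thead>
<tbody><tr><td>£5,000</td><td>04/04/02</td></tr></tbody>
</table>
"""


def extract_headers(table):
    """Prefer <th> cells; otherwise fall back to the <td> cells in <thead>."""
    th_cells = table.find_all("th")
    if th_cells:
        return [th.get_text(strip=True) for th in th_cells]
    thead = table.find("thead")
    if thead is None:
        return []
    return [td.get_text(strip=True) for td in thead.find_all("td")]


def extract_body_rows(table):
    """Collect data rows, reading only <tbody> when the table has one."""
    body = table.find("tbody")
    if body is None:
        body = table
    rows = []
    for tr in body.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip rows without <td> cells, such as a <th> header row
            rows.append(cells)
    return rows


table = BeautifulSoup(HTML_WITH_THEAD, "html.parser").find("table")
thead_headers = extract_headers(table)
thead_rows = extract_body_rows(table)
print(thead_headers, thead_rows)
```

Restricting the row scan to tbody keeps the thead cells from being written a second time as a data row.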
