Get content of table in BeautifulSoup
Posted: 2015-12-02 18:08:56
【Question】: I have the following table on a website which I am extracting with BeautifulSoup. Here is the url (I have also attached a picture).
Ideally I would like to have each company on one row in the csv, but instead every cell ends up on its own row. Please see the attached picture.
I would like the data laid out as in field "D", but I am getting it in A1, A2, A3, ...
Here is the code I am using to extract it:
import csv
import requests
from bs4 import BeautifulSoup

def _writeInCSV(text):
    print "Writing in CSV File"
    with open('sara.csv', 'wb') as csvfile:
        #spamwriter = csv.writer(csvfile, delimiter='\t', quotechar='\n', quoting=csv.QUOTE_MINIMAL)
        spamwriter = csv.writer(csvfile, delimiter='\t', quotechar="\n")
        for item in text:
            # Each cell is written as its own one-column row here.
            spamwriter.writerow([item])

read_list = []
initial_list = []
url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
#gdata_even = soup.find_all("td", {"class": "ms-rteTableEvenRow-3"})
gdata_even = soup.find_all("td", {"class": "ms-rteTable-default"})
for item in gdata_even:
    print item.text.encode("utf-8")
    initial_list.append(item.text.encode("utf-8"))
    print ""
_writeInCSV(initial_list)
Can someone help?
【Comments】:
It would be better if I could replicate the entire table in the csv, but I am struggling with how to do that.
【Answer 1】: Here is an idea:
- read the header cells from the table
- read all the other rows from the table
- zip every data row's cells with the headers, producing a list of dictionaries
- dump the result to csv with csv.DictWriter()
Implementation:
import csv
from pprint import pprint
from bs4 import BeautifulSoup
import requests

url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

rows = soup.select("table.ms-rteTable-default tr")
headers = [header.get_text(strip=True).encode("utf-8") for header in rows[0].find_all("td")]

data = [dict(zip(headers, [cell.get_text(strip=True).encode("utf-8") for cell in row.find_all("td")]))
        for row in rows[1:]]

# see what the data looks like at this point
pprint(data)

with open('sara.csv', 'wb') as csvfile:
    spamwriter = csv.DictWriter(csvfile, headers, delimiter='\t', quotechar="\n")
    for row in data:
        spamwriter.writerow(row)
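The snippet above is Python 2. For anyone on Python 3, here is a minimal sketch of the same idea (text mode with newline='' instead of 'wb', and no manual .encode() calls), assuming the page still serves the same table markup:

import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

rows = soup.select("table.ms-rteTable-default tr")
headers = [cell.get_text(strip=True) for cell in rows[0].find_all("td")]

# One dict per data row, keyed by the header cells.
data = [dict(zip(headers, (cell.get_text(strip=True) for cell in row.find_all("td"))))
        for row in rows[1:]]

# newline="" stops the csv module from inserting blank lines on Windows.
with open("sara.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, headers, delimiter="\t")
    writer.writeheader()
    writer.writerows(data)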
【Comments】:
【Answer 2】: Since @alecxe has already provided an amazing answer, here is another approach using the pandas library.
import pandas as pd
url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"
tables = pd.read_html(url)
tb1 = tables[0] # Get the first table.
tb1.columns = tb1.iloc[0] # Assign the first row as header.
tb1 = tb1.iloc[1:] # Drop the first row.
tb1.reset_index(drop=True, inplace=True) # Reset the index.
print tb1.head() # Print first 5 rows.
# tb1.to_csv("table1.csv") # Export to CSV file.
Result:
In [5]: runfile('C:/Users/.../.spyder2/temp.py', wdir='C:/Users/.../.spyder2')
0 Company Dividend Bonus Closure of Register \
0 Nigerian Breweries Plc N3.50 Nil 5th - 11th March 2015
1 Forte Oil Plc N2.50 1 for 5 1st – 7th April 2015
2 Nestle Nigeria N17.50 Nil 27th April 2015
3 Greif Nigeria Plc 60 kobo Nil 25th - 27th March 2015
4 Guaranty Bank Plc N1.50 (final) Nil 17th March 2015
0 AGM Date Payment Date
0 13th May 2015 14th May 2015
1 15th April 2015 22nd April 2015
2 11th May 2015 12th May 2015
3 28th April 2015 5th May 2015
4 31st March 2015 31st March 2015
In [6]:
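As a side note (not part of the original answer): pandas' read_html can also assign the header row directly, which collapses the three manual header/index steps above into one. A sketch, assuming the first table row holds the column names:

import pandas as pd

url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"

# header=0 tells read_html to use the first table row as column names,
# so the manual .iloc / reset_index steps are no longer needed.
tb1 = pd.read_html(url, header=0)[0]
tb1.to_csv("table1.csv", index=False)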
【Comments】:
I am getting the error: C:\Python27\python.exe C:/Users/Anant/XetraWebBot/Test/ReadCSV.py Traceback (most recent call last): File "C:/Users/Anant/XetraWebBot/Test/ReadCSV.py", line 4, in ... pandas
Or you don't have the html5lib module. Be forewarned: pandas simplifies table scraping, as you can see above, but it can be quite a hassle to set up unless you use a distribution like Anaconda (which is what I used above).
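A quick, illustrative way to see which of read_html's optional backends are missing (any module that prints MISSING can be installed with pip, e.g. pip install html5lib):

# Illustrative check for the optional parser backends pandas.read_html relies on.
for mod in ("pandas", "bs4", "html5lib", "lxml"):
    try:
        __import__(mod)
        print("%s: OK" % mod)
    except ImportError:
        print("%s: MISSING" % mod)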