Python - Web Scraping - BeautifulSoup & CSV
Posted: 2014-06-20 13:14:52

Question: I want to extract how the cost of living in one city changes relative to several other cities. I plan to list the cities I want to compare in a CSV file, and use this list to build the web links that take me to the pages containing the information I am looking for.
Here is a link to an example: http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city

Unfortunately, I have run into a few challenges. Any help with the following would be greatly appreciated!

1. The output only shows the percentages, but not whether it is more expensive or cheaper. For the example listed above, the output from my current code shows 48%, 129%, 63%, 43%, 42% and 42%. I tried to correct this by adding an "if statement" that appends a "+" sign when it is more expensive, or a "-" sign when it is cheaper. However, this "if statement" does not work correctly.
2. When I write the data to a CSV file, each percentage is written to a new row. I cannot seem to figure out how to write it all as one row.
3. (Related to item 2) When I write the data to a CSV file for the example listed above, the data is written in the format listed below. How do I correct the format and write the data in the preferred format listed further below (and without the percent signs)?
Current CSV format (note: the "if statement" does not work correctly):
City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,8,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,1,2,9,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,6,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%
Preferred CSV format:
City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city,48,129,63,43,42,42
Here is my current code:
import requests
import csv
from bs4 import BeautifulSoup

# Read the text file
Textfile = open("City.txt")
Textfilelist = Textfile.read()
Textfilelistsplit = Textfilelist.split("\n")

HomeCity = 'Phoenix'

i = 0
while i < len(Textfilelistsplit):
    url = "http://www.expatistan.com/cost-of-living/comparison/" + HomeCity + "/" + Textfilelistsplit[i]
    page = requests.get(url).text
    soup_expatistan = BeautifulSoup(page)

    # Prepare CSV writer.
    WriteResultsFile = csv.writer(open("Expatistan.csv", "w"))
    WriteResultsFile.writerow(["City", "Food", "Housing", "Clothes", "Transportation", "Personal Care", "Entertainment"])

    expatistan_table = soup_expatistan.find("table", class_="comparison")
    expatistan_titles = expatistan_table.find_all("tr", class_="expandable")
    for expatistan_title in expatistan_titles:
        percent_difference = expatistan_title.find("th", class_="percent")
        percent_difference_title = percent_difference.span['class']
        if percent_difference_title == "expensiver":
            WriteResultsFile.writerow(Textfilelistsplit[i] + '+' + percent_difference.span.string)
        else:
            WriteResultsFile.writerow(Textfilelistsplit[i] + '-' + percent_difference.span.string)
    i += 1
Comments:
Answer 1:

Problem 1: the class of the span is a list, so you need to check whether expensiver is in that list. In other words, replace:

if percent_difference_title == "expensiver"

with:

if "expensiver" in percent_difference.span['class']
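As a minimal sketch of why the equality check fails (the HTML fragment here is invented for illustration), note that Beautiful Soup treats class as a multi-valued attribute and returns a list:

```python
from bs4 import BeautifulSoup

# A made-up fragment shaped like the site's comparison cells.
html = '<th class="percent"><span class="expensiver big">48%</span></th>'
span = BeautifulSoup(html, "html.parser").th.span

# class is a multi-valued attribute, so Beautiful Soup returns a list.
print(span['class'])                  # ['expensiver', 'big']
print(span['class'] == "expensiver")  # False: a list never equals a string
print("expensiver" in span['class'])  # True: membership testing works
```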
Problems 2 and 3: you need to pass a list of column values to writerow(), not a string. And, since you only need one record per city, call writerow() outside of the loop (over the trs).
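The difference can be seen in a couple of lines (a small sketch, not part of the answer's code):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)

# A string is itself a sequence, so each character becomes its own column.
writer.writerow("new-york")
# A list of strings gives one column per element.
writer.writerow(["new-york", "+48%", "+129%"])

print(buf.getvalue())
# n,e,w,-,y,o,r,k
# new-york,+48%,+129%
```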
Other issues:

- open the csv file for writing before the loop
- use the with context manager when working with files
- try to follow the PEP8 style guide
Here is the modified code:
import requests
import csv
from bs4 import BeautifulSoup

BASE_URL = 'http://www.expatistan.com/cost-of-living/comparison/{home_city}/{city}'
home_city = 'Phoenix'

with open('City.txt') as input_file:
    with open("Expatistan.csv", "w") as output_file:
        writer = csv.writer(output_file)
        writer.writerow(["City", "Food", "Housing", "Clothes", "Transportation", "Personal Care", "Entertainment"])
        for line in input_file:
            city = line.strip()
            url = BASE_URL.format(home_city=home_city, city=city)
            soup = BeautifulSoup(requests.get(url).text)

            table = soup.find("table", class_="comparison")
            differences = []
            for title in table.find_all("tr", class_="expandable"):
                percent_difference = title.find("th", class_="percent")
                if "expensiver" in percent_difference.span['class']:
                    differences.append('+' + percent_difference.span.string)
                else:
                    differences.append('-' + percent_difference.span.string)

            writer.writerow([city] + differences)
For a City.txt containing just one line, new-york-city, it produces an Expatistan.csv with the following contents:
City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city,+48%,+129%,+63%,+43%,+42%,+42%
Make sure you understand the changes I have made. Let me know if you need further help.
Comments:
Answer 2:

csv.writer.writerow() accepts a sequence and makes each element a column; normally you would give it a list with the columns, but you are passing in a string instead, which adds the individual characters as columns.

Just build up a list first, then write that list to the CSV file.

Also, open the CSV file just once, rather than once for each individual city; every time you open the file for writing it is truncated.
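A minimal sketch of the truncation problem (the file name is arbitrary):

```python
import os

path = "truncation_demo.csv"

# Re-opening in "w" mode truncates the file, so each iteration
# wipes out whatever the previous iteration wrote.
for city in ["new-york-city", "london"]:
    with open(path, "w") as f:
        f.write(city + "\n")

with open(path) as f:
    content = f.read()
print(content)  # only "london" survives

os.remove(path)  # clean up the demo file
```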
import requests
import csv
from bs4 import BeautifulSoup

HomeCity = 'Phoenix'

with open("City.txt") as cities, open("Expatistan.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["City", "Food", "Housing", "Clothes",
                     "Transportation", "Personal Care", "Entertainment"])
    for line in cities:
        city = line.strip()
        url = "http://www.expatistan.com/cost-of-living/comparison/{}/{}".format(
            HomeCity, city)
        resp = requests.get(url)
        soup = BeautifulSoup(resp.content, from_encoding=resp.encoding)
        titles = soup.select("table.comparison tr.expandable")
        row = [city]
        for title in titles:
            percent_difference = title.find("th", class_="percent")
            changeclass = percent_difference.span['class']
            change = percent_difference.span.string
            if "expensiver" in changeclass:
                change = '+' + change
            else:
                change = '-' + change
            row.append(change)
        writer.writerow(row)
Comments:

TIL: selectors in BeautifulSoup.

Answer 3:

So, first of all, the writerow method is passed an iterable, and each object in that iterable gets separated by commas. So if you give it a string, every character gets separated:
WriteResultsFile.writerow('hello there')

writes

h,e,l,l,o, ,t,h,e,r,e

but

WriteResultsFile.writerow(['hello', 'there'])

writes

hello,there

That is why you are getting results like

n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,8,%
The rest of your problems are bugs in your web scraping. First of all, when I scraped the site, searching for the table with the CSS class "comparison" gave me None, so I had to use

expatistan_table = soup_expatistan.find("table", "comparison")

Now, the reason your "if statement is broken" is that

percent_difference.span['class']

returns a list. If we modify that to

percent_difference.span['class'][0]

things will work the way you expect.

Now, your real issue is that in the innermost loop you find the percentage price change of each individual item. You want those as entries in the row of price differences, not as individual rows. So, I declare an empty list, items, append percent_difference.span.string to it, and then write the row outside of the innermost loop, like this:
items = []
for expatistan_title in expatistan_titles:
    percent_difference = expatistan_title.find("th", "percent")
    percent_difference_title = percent_difference.span["class"][0]
    print percent_difference_title
    if percent_difference_title == "expensiver":
        items.append('+' + percent_difference.span.string)
    else:
        items.append('-' + percent_difference.span.string)
row = [Textfilelistsplit[i]]
row.extend(items)
WriteResultsFile.writerow(row)
The last error is that inside the while loop you re-open the csv file, overwriting everything, so you end up with only the final city. Taking all of these errors into account (many of which you should have been able to find without help) leaves us with:
# Prepare CSV writer (opened once, before the loop, with the header
# written once so it is not repeated for every city).
WriteResultsFile = csv.writer(open("Expatistan.csv", "w"))
WriteResultsFile.writerow(["City", "Food", "Housing", "Clothes", "Transportation", "Personal Care", "Entertainment"])

i = 0
while i < len(Textfilelistsplit):
    url = "http://www.expatistan.com/cost-of-living/comparison/" + HomeCity + "/" + Textfilelistsplit[i]
    page = requests.get(url).text
    print url
    soup_expatistan = BeautifulSoup(page)

    expatistan_table = soup_expatistan.find("table", "comparison")
    expatistan_titles = expatistan_table.find_all("tr", "expandable")
    items = []
    for expatistan_title in expatistan_titles:
        percent_difference = expatistan_title.find("th", "percent")
        percent_difference_title = percent_difference.span["class"][0]
        print percent_difference_title
        if percent_difference_title == "expensiver":
            items.append('+' + percent_difference.span.string)
        else:
            items.append('-' + percent_difference.span.string)
    row = [Textfilelistsplit[i]]
    row.extend(items)
    WriteResultsFile.writerow(row)
    i += 1
Comments:
Answer 4:

YAA - Yet Another Answer.

Unlike the other answers, this one treats the data as a series of key/value pairs; i.e. a list of dictionaries, which is then written out to CSV. The csv writer (DictWriter) is provided with a list of the required fields; it discards additional information (beyond the specified fields) and blanks out missing information. Also, should the order of the information on the original page change, this solution is unaffected.

I also assume you are going to open the CSV file in something like Excel. Additional parameters need to be given to the csv writer for that to happen nicely (see the dialect parameter). Given that we are not sanitising the returned data, we should explicitly delimit it with unconditional quoting (see the quoting parameter).
import csv
import requests
from bs4 import BeautifulSoup

# Read text file
with open("City.txt") as cities_h:
    cities = cities_h.readlines()

home_city = "Phoenix"
city_data = []
for city in cities:
    city = city.strip()  # drop the trailing newline left by readlines()
    url = "http://www.expatistan.com/cost-of-living/comparison/%s/%s" % (home_city, city)
    resp = requests.get(url)
    soup = BeautifulSoup(resp.content, from_encoding=resp.encoding)
    titles = soup.select("table.comparison tr.expandable")
    if titles:
        data = {}
        for title in titles:
            name = title.find("th", class_="clickable")
            diff = title.find("th", class_="percent")
            exp = bool(diff.find("span", class_="expensiver"))
            data[name.text] = ("+" if exp else "-") + diff.span.text
        data["City"] = soup.find("strong", class_="city-2").text
        city_data.append(data)

with open("Expatistan.csv", "w") as csv_h:
    fields = \
    [
        "City",
        "Food",
        "Housing",
        "Clothes",
        "Transportation",
        "Personal Care",
        "Entertainment"
    ]

    # Prepare CSV writer.
    writer = csv.DictWriter\
    (
        csv_h,
        fields,
        quoting = csv.QUOTE_ALL,
        extrasaction = "ignore",
        dialect = "excel",
        lineterminator = "\n",
    )

    writer.writeheader()
    writer.writerows(city_data)
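The DictWriter behaviour described above (extra keys dropped, missing fields blanked, every value quoted) can be sketched in isolation; the row values here are invented:

```python
import csv
import io

fields = ["City", "Food", "Housing"]
buf = io.StringIO()
writer = csv.DictWriter(buf, fields, quoting=csv.QUOTE_ALL,
                        extrasaction="ignore", lineterminator="\n")
writer.writeheader()

# "Clothes" is not in fields, so it is discarded; "Housing" is missing
# from the row, so it is written as an empty (but still quoted) column.
writer.writerow({"City": "new-york-city", "Food": "+48%", "Clothes": "+63%"})

print(buf.getvalue())
# "City","Food","Housing"
# "new-york-city","+48%",""
```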
Comments: