Scraping a CSV list of URLs and outputting the results to a different CSV
I'm trying to pull the URLs from the 'YP_LA_Remodel_urls.csv' file (I've included a couple below), scrape them, and then export the results to 'Yp_LA_Remodel_Info.csv'.
If I take a single URL (not from the CSV) and scrape it, it works fine. It's only when I try to do this at scale that I get hung up. I've already put together the list of information I need to extract.
I'm reusing a script I built for another scrape, and it doesn't seem to work for this one. I'm a Python noob, so go easy on me.
Any help and/or advice is appreciated.
Sample URLs:
https://www.yellowpages.com/search?search_terms=remodeling&geo_location_terms=Los%20Angeles%2C%20CA&page=1
https://www.yellowpages.com/search?search_terms=remodeling&geo_location_terms=Los%20Angeles%2C%20CA&page=2
Script:
import csv
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup
from email import encoders
import time
import os
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.base import MIMEBase
import requests
def license_exists(soup):
    contents = []
    with open('YP_LA_Remodel_urls.csv', 'r') as csvf:
        urls = csv.reader(csvf)
        for url in urls:
            if soup(class_="next ajax-page"):
                return True
            else:
                return False

records = []

with open('YP_LA_Remodel_urls.csv') as f_input, open('Yp_LA_Remodel_Info.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv_output_to_csv(f_output, fieldnames=[name for name, result in records])
    csv_output.writeheader()

    for url in csv_input:
        r = requests.get(url[0])  # Assume the URL is in the first column
        soup = BeautifulSoup(r.text, "html.parser")
        results = soup.find_all('div', attrs={'class':'info'})
        csv_output.to_csv('f_output', index=False, encoding='utf-8')

        for result in results:
            biz_name = result.find('span', attrs={'itemprop':'name'}).text if result.find('span', attrs={'itemprop':'name'}) is not None else ''
            biz_phone = result.find('div', attrs={'itemprop':'telephone'}).text if result.find('span', attrs={'itemprop':'telephone'}) is not None else ''
            biz_address = result.find('span', attrs={'itemprop':'streetAddress'}).text if result.find('span', attrs={'itemprop':'streetAddress'}) is not None else ''
            biz_city = result.find('span', attrs={'itemprop':'addressLocality'}).text if result.find('span', attrs={'itemprop':'addressLocality'}) is not None else ''
            biz_zip = result.find('span', attrs={'itemprop':'postalCode'}).text if result.find('span', attrs={'itemprop':'postalCode'}) is not None else ''
            records.append((biz_name, biz_phone, biz_address, biz_city, biz_zip))

df = pd.DataFrame(records, columns=['biz_name', 'biz_phone', 'biz_address', 'biz_city', 'biz_zip'])
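A note on the script above: csv_output_to_csv is never defined anywhere, and csv_output.to_csv('f_output', ...) mixes the csv module's writer with pandas' DataFrame method, so no rows ever reach the output file. Below is a minimal sketch of the writing path the script appears to be aiming for, using the standard library's csv.DictWriter. The field names, the div class 'info' and the itemprop lookups are taken from the code above; everything else, including the guess that the URL sits in the first CSV column, is an assumption.

import csv
import requests
from bs4 import BeautifulSoup

FIELDNAMES = ['biz_name', 'biz_phone', 'biz_address', 'biz_city', 'biz_zip']

with open('YP_LA_Remodel_urls.csv') as f_input, \
        open('Yp_LA_Remodel_Info.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.DictWriter(f_output, fieldnames=FIELDNAMES)
    csv_output.writeheader()

    for row in csv_input:
        if not row:                      # skip blank lines in the input CSV
            continue
        r = requests.get(row[0])         # URL assumed to be in the first column
        soup = BeautifulSoup(r.text, 'html.parser')

        for result in soup.find_all('div', attrs={'class': 'info'}):
            record = {}
            for field, tag, prop in [('biz_name', 'span', 'name'),
                                     ('biz_phone', 'div', 'telephone'),
                                     ('biz_address', 'span', 'streetAddress'),
                                     ('biz_city', 'span', 'addressLocality'),
                                     ('biz_zip', 'span', 'postalCode')]:
                node = result.find(tag, attrs={'itemprop': prop})
                record[field] = node.text if node is not None else ''
            csv_output.writerow(record)  # one business listing per output row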
Answer
This works for the two URLs.. adapt it for the 10000.
import pandas as pd
import requests
from bs4 import BeautifulSoup

links = ['https://www.yellowpages.com/search?search_terms=remodeling&geo_location_terms=Los%20Angeles%2C%20CA&page=1',
         'https://www.yellowpages.com/search?search_terms=remodeling&geo_location_terms=Los%20Angeles%2C%20CA&page=2']

container = pd.DataFrame(columns=['biz_name', 'biz_phone', 'biz_address', 'biz_city', 'biz_zip'])
pos = 0

for l in links:
    soup_data = BeautifulSoup(requests.get(l).content)
    results = soup_data.find_all('div', attrs={'class':'info'})
    records = []
    for result in results:
        records = []
        biz_name = result.find('span', attrs={'itemprop':'name'}).text if result.find('span', attrs={'itemprop':'name'}) is not None else ''
        biz_phone = result.find('div', attrs={'itemprop':'telephone'}).text if result.find('span', attrs={'itemprop':'telephone'}) is not None else ''
        biz_address = result.find('span', attrs={'itemprop':'streetAddress'}).text if result.find('span', attrs={'itemprop':'streetAddress'}) is not None else ''
        biz_city = result.find('span', attrs={'itemprop':'addressLocality'}).text if result.find('span', attrs={'itemprop':'addressLocality'}) is not None else ''
        biz_zip = result.find('span', attrs={'itemprop':'postalCode'}).text if result.find('span', attrs={'itemprop':'postalCode'}) is not None else ''
        records.append(biz_name)
        records.append(biz_phone)
        records.append(biz_address)
        records.append(biz_city)
        records.append(biz_zip)
        container.loc[pos] = records
        pos += 1
Output
                   biz_name  biz_phone              biz_address      biz_city  biz_zip
0
1
2  Washington Construction             2874 W 8th St            Los Angeles,    90005
3        Os Remodeling Inc.            220 N Avenue 53 Apt 202  Los Angeles,    90042
4   A A Allied Construction            1212 S Longwood Ave      Los Angeles,    90019
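To read the URLs from 'YP_LA_Remodel_urls.csv' instead of hard-coding them, and to end up with 'Yp_LA_Remodel_Info.csv' as the question asks, the same loop can be packaged as in the sketch below. The column names, the 'info' class and the itemprop lookups mirror the answer above; the input-file layout (one URL per row, no header) and the 'html.parser' choice are assumptions, so treat this as a sketch rather than part of the original answer.

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Read every URL from the first column of the input file
# (assumes one URL per row and no header row).
links = pd.read_csv('YP_LA_Remodel_urls.csv', header=None)[0].dropna().tolist()

columns = ['biz_name', 'biz_phone', 'biz_address', 'biz_city', 'biz_zip']
rows = []

for l in links:
    soup_data = BeautifulSoup(requests.get(l).content, 'html.parser')
    for result in soup_data.find_all('div', attrs={'class': 'info'}):
        row = []
        # Same itemprop lookups as the answer above, collected into one row.
        for tag, prop in [('span', 'name'), ('div', 'telephone'),
                          ('span', 'streetAddress'), ('span', 'addressLocality'),
                          ('span', 'postalCode')]:
            node = result.find(tag, attrs={'itemprop': prop})
            row.append(node.text if node is not None else '')
        rows.append(row)

# Build the DataFrame once and write the output file the question asked for.
container = pd.DataFrame(rows, columns=columns)
container.to_csv('Yp_LA_Remodel_Info.csv', index=False, encoding='utf-8')

Appending plain lists and building the DataFrame once at the end is also noticeably faster over thousands of pages than assigning container.loc[pos] row by row.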
Hope this helps!!