Is there a better way to scrape this data?
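【Posted】: 2019-05-14 08:24:12
【Question】:
For work, I was asked to create a spreadsheet containing the names and addresses of all allopathic medical schools in the United States. Being new to Python, I thought this would be a perfect case for trying web scraping. Although I eventually wrote a program that returns the data I need, I know there is a better way to do it, because I had to go into Excel and manually remove some extraneous characters (for example: ", ], [). I'm just wondering whether there is a better way to write this code so that I get what I need, minus the extraneous characters.
Edit: I have also attached an image of the CSV file that was created, to show the extraneous characters I am talking about.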
from bs4 import BeautifulSoup
import requests
import csv

link = "https://members.aamc.org/eweb/DynamicPage.aspx?site=AAMC&webcode=AAMCOrgSearchResult&orgtype=Medical%20School"  # noqa
# link to the site we want to scrape from
page_response = requests.get(link)
# fetching the content using the requests library
soup = BeautifulSoup(page_response.text, "html.parser")
# Calling BeautifulSoup in order to parse our document
data = []
# Empty list for the first scrape. We only get one column with many rows.
# We still have the line break tags here </br>
for tr in soup.find_all('tr', {'valign': 'top'}):
    values = [td.get_text('</b>', strip=True) for td in tr.find_all('td')]
    data.append(values)
data2 = []
# New list that we'll use to have name on index i, address on index i+1
for i in data:
    test = list(str(i).split('</b>'))
    # Using the line breaks to our advantage.
    name = test[0].strip("['")
    # Here we are saying that the name of the school is the first element
    # before the first line break
    addy = test[1:]
    # The address is what comes after this first line break
    data2.append(name)
    data2.append(addy)
    # Append the name of the school and address to our new list.
school_name = data2[::2]
# Making a new list that consists of the school name
school_address = data2[1::2]
# Another list that consists of the school's address.
with open("Medschooltest.csv", 'w', encoding='utf-8') as toWrite:
    writer = csv.writer(toWrite)
    writer.writerows(zip(school_name, school_address))
    # Zip the two together making a 2 column table with the school's name and
    # its address
print("CSV Completed!")
[Image: the created CSV file]
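【Comments】:
The best solution for structured data is not to need the data; the next best is to start from structured data. If neither is feasible, consider discarding context by normalizing the markup, that is, filtering out anything that is not in [a-zA-Z0-9 ]. That said, this is generally too broad a question. Do you have a specific question about how to filter the extraneous characters, or a more specific question that is not primarily opinion-based? The "E" in ETL, especially from the web, is a messy business.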
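The filtering idea in the comment above can be sketched roughly as follows. This is only an illustration using the standard library `re` module; the helper name `strip_extraneous` and the sample string are made up for the example and do not come from the thread.

```python
import re

# Matches every character that is NOT an ASCII letter, digit or space,
# i.e. everything outside the [a-zA-Z0-9 ] set mentioned in the comment.
_NOT_ALLOWED = re.compile(r"[^a-zA-Z0-9 ]")


def strip_extraneous(text: str) -> str:
    """Drop characters outside [a-zA-Z0-9 ] and collapse repeated spaces.

    Hypothetical helper, written only to illustrate the comment.
    """
    cleaned = _NOT_ALLOWED.sub("", text)
    # Removing characters can leave double spaces behind, so re-join on one.
    return " ".join(cleaned.split())


# Made-up sample resembling a stringified list element.
sample = "['University of Example School of Medicine\"]"
print(strip_extraneous(sample))
```

Note that a whitelist this strict also removes commas, periods and hyphens, which real addresses usually need, so the character class would probably have to be widened before applying it to the address column.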
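【Answer 1】:
Applying a conditional together with some string manipulation seems to do the trick. I think the script below should get you really close to what you want.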
from bs4 import BeautifulSoup
import requests
import csv

link = "https://members.aamc.org/eweb/DynamicPage.aspx?site=AAMC&webcode=AAMCOrgSearchResult&orgtype=Medical%20School"  # noqa
res = requests.get(link)
soup = BeautifulSoup(res.text, "html.parser")
# newline="" stops the csv module from writing blank rows on Windows
with open("membersInfo.csv", "w", newline="") as infile:
    writer = csv.writer(infile)
    writer.writerow(["Name", "Address"])
    for tr in soup.find_all('table', class_='bodyTXT'):
        # Join the text pieces of the first cell, skipping bare newlines and empty nodes
        items = ', '.join([item.string for item in tr.select_one('td') if item.string != "\n" and item.string is not None])
        # The school name is everything before the first comma
        name = items.split(",")[0].strip()
        # The address is whatever follows the name
        address = items.split(name)[1].strip(",")
        writer.writerow([name, address])
【Comments】:
Edited to use .strip() at the end of this line:
name = items.split(",")[0].strip()
so that all the names show up properly in the CSV file.
【Answer 2】:
If you know SQL and the data is structured in this way, extracting it into a database would be the best solution.
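As a rough sketch of what that could look like, the snippet below loads name/address pairs into SQLite using Python's built-in `sqlite3` module. The answer does not name a database, so SQLite, the `schools` table, the `load_schools` helper and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical rows standing in for the scraped (name, address) pairs.
rows = [
    ("Example School of Medicine", "1 Example Way, Sometown, NY 10001"),
    ("Sample College of Medicine", "200 Sample Ave, Othertown, CA 90001"),
]


def load_schools(conn, rows):
    """Create the (assumed) schools table if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schools (name TEXT PRIMARY KEY, address TEXT)"
    )
    # Parameter placeholders let sqlite3 handle quoting of commas and quotes.
    conn.executemany(
        "INSERT OR REPLACE INTO schools (name, address) VALUES (?, ?)", rows
    )
    conn.commit()


# An in-memory database keeps the example self-contained;
# a file path could be passed to sqlite3.connect() instead.
conn = sqlite3.connect(":memory:")
load_schools(conn, rows)
for name, address in conn.execute("SELECT name, address FROM schools ORDER BY name"):
    print(name, "|", address)
```

Storing each field in its own column means no list brackets or quote characters end up in the data, and the table can still be exported to a spreadsheet later.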
【Comments】: