在会话中发出后续 POST 请求不起作用 - 网络抓取
Posted
技术标签:
【中文标题】在会话中发出后续 POST 请求不起作用 - 网络抓取【英文标题】:Making subsequent POST request in session doesn't work - web scraping 【发布时间】:2016-08-11 23:34:30 【问题描述】:这就是我想要做的:转到here,然后点击“搜索”。抓取数据,然后点击“下一步”,然后继续点击下一步,直到页面用完。 一切都达到“下一个”的作品。这是我的代码。 r.content 的格式在我打印两次时完全不同,这表明 GET 和 POST 请求之间发生了一些不同的事情,即使我想要非常相似的行为。为什么会发生这种情况?
我觉得奇怪的是,即使在似乎返回错误内容的 POST 请求之后,我仍然可以解析我需要的 url - 只是不能解析 __EVENTVALIDATION 输入字段。
错误信息(代码末尾)表示内容不包含我需要进行后续请求的这些数据,但是导航到页面显示它确实有该数据,并且格式非常类似于第一页。
编辑:我让它根据它解析的 html 打开网页,但肯定有问题。运行下面的代码将打开这些页面。
GET 为我提供了一个包含如下数据的网站:
<input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="4424DBE6">
<input type="hidden" name="__VIEWSTATEENCRYPTED" id="__VIEWSTATEENCRYPTED" value="">
<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="TlIgNH
虽然 POST 会在页面底部以纯文本形式生成一个包含所有这些数据的站点,如下所示:
|0|hiddenField|__EVENTTARGET||0|hiddenField|__EVENTARGUMENT||0|hiddenField|_
Bad r.content
Good r.content
import requests
from lxml import html
from bs4 import BeautifulSoup
page = requests.get('http://search.cpsa.ca/physiciansearch')
print('got page!')
d = "ctl00$ctl13": "ctl00$ctl13|ctl00$MainContent$physicianSearchView$btnSearch",
"ctl00$MainContent$physicianSearchView$txtLastName": "",
'ctl00$MainContent$physicianSearchView$txtFirstName': "",
'ctl00$MainContent$physicianSearchView$txtCity': "",
"__VIEWSTATEENCRYPTED":"",
'ctl00$MainContent$physicianSearchView$txtPostalCode': "",
'ctl00$MainContent$physicianSearchView$rblPractice': "",
'ctl00$MainContent$physicianSearchView$ddDiscipline': "",
'ctl00$MainContent$physicianSearchView$rblGender': "",
'ctl00$MainContent$physicianSearchView$txtPracticeInterests': "",
'ctl00$MainContent$physicianSearchView$ddApprovals': "",
'ctl00$MainContent$physicianSearchView$ddLanguage': "",
"__EVENTTARGET": "ctl00$MainContent$physicianSearchView$btnSearch",
"__EVENTARGUMENT": "",
'ctl00$MainContent$physicianSearchView$hfPrefetchUrl': "http://service.cpsa.ca/OnlineService/OnlineService.svc/Services/GetAlbertaCities?name=",
'ctl00$MainContent$physicianSearchView$hfRemoveUrl': "http://service.cpsa.ca/OnlineService/OnlineService.svc/Services/GetAlbertaCities?name=%QUERY",
'__ASYNCPOST': 'true'
h = "X-MicrosoftAjax":"Delta = true",
"X-Requested-With":"XMLHttpRequest",
"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
urls = []
with requests.session() as s:
r = s.get("http://search.cpsa.ca/PhysicianSearch",headers=h)
soup = BeautifulSoup(r.content, "lxml")
tree = html.fromstring(r.content)
html.open_in_browser(tree)
ev = soup.select("#__EVENTVALIDATION" )[0]["value"]
vs = soup.select("#__VIEWSTATE")[0]["value"]
vsg = soup.select("#__VIEWSTATEGENERATOR")[0]["value"]
d["__EVENTVALIDATION"] = ev
d["__VIEWSTATEGENERATOR"] = vsg
d["__VIEWSTATE"] = vs
r = s.post('http://search.cpsa.ca/PhysicianSearch', data=d,headers=h)
print('opening in browser')
retrievedUrls = tree.xpath('//*[@id="MainContent_physicianSearchView_gvResults"]/tr/td[2]/a/@href')
print(retrievedUrls)
for url in retrievedUrls:
urls.append(url)
endSearch = False
while endSearch == False:
tree = html.fromstring(r.content)
html.open_in_browser(tree)
soup = BeautifulSoup(r.content, "lxml")
print('soup2:')
## BREAKS HERE
ev = soup.select("#__EVENTVALIDATION" )[0]["value"]
## BREAKS HERE,
vs = soup.select("#__VIEWSTATE")[0]["value"]
vsg = soup.select("#__VIEWSTATEGENERATOR")[0]["value"]
d["ctl00$ctl13"] = "ctl00$MainContent$physicianSearchView$ResultsPanel|ctl00$MainContent$physicianSearchView$gvResults$ctl01$btnNextPage"
d["__EVENTVALIDATION"] = ev
d["__EVENTTARGET"] = ""
d["__VIEWSTATEGENERATOR"] = vsg
d["__VIEWSTATE"] = vs
d["ctl00$MainContent$physicianSearchView$gvResults$ctl01$ddlPager"] = 1
d["ctl00$MainContent$physicianSearchView$gvResults$ctl01$ddlPager"] = 1
d["ctl00$MainContent$physicianSearchView$gvResults$ctl01$btnNextPage"] = "Next"
r = requests.post('http://search.cpsa.ca/PhysicianSearch', data=d,headers=h)
tree = html.fromstring(r.content)
tree = html.fromstring(r.content)
retrievedUrls = tree.xpath('//*[@id="MainContent_physicianSearchView_gvResults"]/tr/td[2]/a/@href')
print(urls)
print(retrievedUrls)
endSearch = True
...
Traceback (most recent call last):
File "C:\Users\daniel.bak\workspace\Alberta Physician Scraper\main\main.py", line 63, in <module>
ev = soup.select("#__EVENTVALIDATION" )[0]["value"]
IndexError: list index out of range
【问题讨论】:
【参考方案1】:好吧,这几乎让我发疯了,但它终于起作用了,你必须发出一个 get 请求才能为每个帖子获取一个新的 __EVENTVALIDATION
令牌:
import requests
from bs4 import BeautifulSoup
h = "X-MicrosoftAjax": "Delta = true",
"X-Requested-With": "XMLHttpRequest",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
"ctl00$ctl13 | ctl00$MainContent$physicianSearchView$btnSearch"
d =
"ctl00$ctl13": "ctl00$MainContent$physicianSearchView$btnSearch",
"__EVENTTARGET": "ctl00$MainContent$physicianSearchView$btnSearch",
'ctl00$MainContent$physicianSearchView$hfPrefetchUrl': "http://service.cpsa.ca/OnlineService/OnlineService.svc/Services/GetAlbertaCities?name=",
'ctl00$MainContent$physicianSearchView$hfRemoveUrl': "http://service.cpsa.ca/OnlineService/OnlineService.svc/Services/GetAlbertaCities?name=%QUERY",
'__ASYNCPOST': 'true'
nxt_d =
"ctl00$ctl13": "ctl00$MainContent$physicianSearchView$ResultsPanel|ctl00$MainContent$physicianSearchView$gvResults$ctl14$ddlPager",
"ctl00$MainContent$physicianSearchView$gvResults$ctl01$ddlPager": "2",
"ctl00$MainContent$physicianSearchView$gvResults$ctl14$ddlPager": "1",
"__ASYNCPOST": "true",
"__EVENTTARGET": "ctl00$MainContent$physicianSearchView$gvResults$ctl14$ddlPager"
url = "http://search.cpsa.ca/PhysicianSearch"
with requests.session() as s:
r = s.get(url, headers=h)
soup = BeautifulSoup(r.content, "lxml")
ev = soup.select("#__EVENTVALIDATION")[0]["value"]
vs = soup.select("#__VIEWSTATE")[0]["value"]
d["__EVENTVALIDATION"] = ev
d["__VIEWSTATE"] = vs
r = s.post(url, data=d, headers=h)
soup = BeautifulSoup(s.get("http://search.cpsa.ca/PhysicianSearch").content, "lxml")
ev = soup.select("#__EVENTVALIDATION")[0]["value"]
vs = soup.select("#__VIEWSTATE")[0]["value"]
nxt_d["__EVENTVALIDATION"] = ev
nxt_d["__VIEWSTATE"] = vs
r = s.post(url, data=nxt_d, headers=h)
如果你打开上一篇文章的源代码,你会看到你点击了第 2 页。我们需要添加更多逻辑来浏览所有页面,我会稍微添加它。
参数:
"ctl00$MainContent$physicianSearchView$gvResults$ctl01$ddlPager": "2",
"ctl00$MainContent$physicianSearchView$gvResults$ctl14$ddlPager": "1"
是要转到的页面和您来自的页面,因此在获取后应该是所有需要更改的。
这将获取所有页面,以编程方式提取大部分值,您可能会拉得更多,尤其是在正则表达式的帮助下,但它在没有硬编码值的情况下拉得最多:
from lxml.html import fromstring
import requests
class Crawler(object):
def __init__(self, ua, url):
self.user_agent = ua
self.post_header = "X-MicrosoftAjax": "Delta = true", "X-Requested-With": "XMLHttpRequest", "user-agent": ua
self.post_data2 = '__ASYNCPOST': 'true',
"ctl00$ctl13": "ctl00$MainContent$physicianSearchView$ResultsPanel|ctl00$MainContent$physicianSearchView$gvResults$ctl14$ddlPager"
self.url = url
self.post_data1 = '__ASYNCPOST': 'true'
def populate(self, xml):
"""Pulls form post data keys and values for initial post."""
k1 = xml.xpath("//*[@id='hfPrefetchUrl']")[0]
k2 = xml.xpath("//*[@id='hfRemoveUrl']")[0]
self.post_data1[k1.get("name")] = k1.get("value")
self.post_data1[k2.get("name")] = k2.get("value")
self.post_data1["ctl00$ctl13"] = xml.xpath("//input[@value='Search']/@name")[0]
self.post_data1["__EVENTTARGET"] = self.post_data1["ctl00$ctl13"]
def populate2(self, xml):
"""Pulls form post data keys and values,
for all subsequent posts,
setting initial page number values.
"""
data = xml.xpath("//*[@id='MainContent_physicianSearchView_gvResults_ddlPager']/@name")
self.pge = data[0]
self.ev = data[1]
self.post_data2["__EVENTTARGET"] = self.ev
self.post_data2[self.ev] = "1"
self.post_data2[self.pge] = "2"
@staticmethod
def put_validation(xml, d):
"""Need to request new __EVENTVALIDATION for each post.
"""
ev = xml.xpath("//*[@id='__EVENTVALIDATION']/@value")[0]
vs = xml.xpath("//*[@id='__VIEWSTATE']/@value")[0]
d["__EVENTVALIDATION"] = ev
d["__VIEWSTATE"] = vs
def next_page(self, d):
"""Increments the page number by one per iteration."""
e = self.post_data2[self.ev]
v = self.post_data2[self.pge]
self.post_data2[self.pge] = str(int(v) + 1)
self.post_data2[self.ev] = str(int(e) + 1)
def start(self):
with requests.session() as s:
# get initial page to pull __EVENTVALIDATION etc..
req = s.get(self.url, headers="user-agent": self.user_agent).content
# add __EVENTVALIDATION" to post data.
self.put_validation(fromstring(req), self.post_data1)
xml = fromstring(req)
# populate the rest of the post data.
self.populate(xml)
resp = fromstring(s.post(self.url, data=self.post_data1, headers=self.post_header).content)
# yield first page results.
yield resp
# fill post data for next pages.
self.populate2(resp)
# when this is an empty list, we will have hit the last page.
nxt = xml.xpath("//*[@id='MainContent_physicianSearchView_gvResults_btnNextPage']/@disabled")
while not nxt:
# update __EVENTVALIDATION token and _VIEWSTATE.
self.put_validation(fromstring(s.get(self.url).content), self.post_data2)
# post to get next page of results.
yield fromstring(s.post(url, data=self.post_data2, headers=self.post_header).content)
nxt = xml.xpath("//*[@id='MainContent_physicianSearchView_gvResults_btnNextPage']/@disabled")
self.next_page(nxt_d)
ua = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
url = "http://search.cpsa.ca/PhysicianSearch"
c = Crawler(ua, url)
for tree in c.start():
# use tree
【讨论】:
这个程序让你精神焕发,所以-1 我也快疯了。非常感谢您的帮助-现在验证它是否有效。实际上,在您取消隐藏之前一分钟,我就想出了一个解决方法,哈哈。 这花了你多长时间? @DanielPaczuskiBak,花了我一个小时,我无法理解为什么没有生成 _EVENTVAIDATION 令牌。我在前 10 页上测试了代码,它工作正常,它会一直循环直到下一个按钮被禁用。如果你愿意,你可以用 lxml 替换 bs4 代码,它应该更快, 为什么不在get中包含标题?以上是关于在会话中发出后续 POST 请求不起作用 - 网络抓取的主要内容,如果未能解决你的问题,请参考以下文章
SilverStripe 功能测试是不是在发出 post 请求后将数据保存在会话中
Volley JsonObjectRequest Post 请求不起作用
在 swift2 中使用 Alamofire pod 时 POST 请求不起作用?