Web Scraping: Scraping Taobao Food (美食) Data, Following Cui Qingcai's Approach
Posted by meloncodezhang
# TODO: selenium has already been detected by Taobao
import random
import re
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()


def search():
    """Search for "美食" and return the pager text that holds the total page count.

    After the search is submitted, Taobao redirects to the login page, so log in
    manually. The waits can easily time out, so the timeout is caught and retried.
    """
    try:
        driver.get("https://www.taobao.com/")
        # Locate the search box
        search_box = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "#q"))
        )
        # Locate the search button. TODO: why not use an id selector?
        submit = WebDriverWait(driver, 15).until(
            EC.element_to_be_clickable(
                (By.CSS_SELECTOR, "#J_TSearchForm > div.search-button > button")
            )
        )
        search_box.send_keys("美食")
        time.sleep(1)
        submit.click()
        # Read the total page count
        total = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > div.total")
            )
        )
        return total.text
    except TimeoutException:
        # WebDriverWait raises selenium's TimeoutException, not the built-in TimeoutError
        return search()


def next_page(page):
    try:
        # Locate the page-number input box
        page_input = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > div.form > input")
            )
        )
        # Locate the confirm button
        submit = WebDriverWait(driver, 15).until(
            EC.element_to_be_clickable(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > div.form > span.btn.J_Submit")
            )
        )
        page_input.clear()
        page_input.send_keys(str(page))
        # time.sleep(1)
        # Workaround 1: with a fixed delay, a slider CAPTCHA appeared at page 34;
        # the site detected the automation, so the delay must not be fixed.
        # With a random delay, manual verification was required on the very first page turn.
        # TODO: how does Taobao detect selenium?
        second = random.randint(1, 6)
        time.sleep(second)
        submit.click()
        # Success condition: the highlighted page number matches the requested page
        WebDriverWait(driver, 15).until(
            EC.text_to_be_present_in_element(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > ul > li.item.active > span"),
                str(page),
            )
        )
    except TimeoutException:
        return next_page(page)


if __name__ == "__main__":
    total = search()  # e.g. "共 100 页,"
    total_page = int(re.search(r"(\d+)", total).group(1))
    for i in range(2, total_page + 1):
        next_page(i)
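Both `search()` and `next_page()` call themselves again on every timeout, so a page that never loads would recurse without limit. One possible alternative is a bounded retry with a random pause between attempts. The sketch below is only an illustration and not part of the original script: `parse_total_pages` and `retry` are hypothetical helper names, and the default delay range of 1 to 6 seconds simply mirrors the `random.randint(1, 6)` used above.

```python
import random
import re
import time


def parse_total_pages(text):
    """Return the first integer found in pager text such as "共 100 页,"."""
    match = re.search(r"\d+", text)
    if match is None:
        raise ValueError(f"no page count found in {text!r}")
    return int(match.group())


def retry(action, retryable, attempts=3, min_delay=1.0, max_delay=6.0, sleep=time.sleep):
    """Call action() up to `attempts` times (hypothetical helper).

    Only exceptions of type `retryable` trigger another attempt. Between
    attempts the helper sleeps for a random delay; once the attempts are
    used up, the last exception is re-raised to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retryable:
            if attempt == attempts:
                raise
            sleep(random.uniform(min_delay, max_delay))
```

Assuming `next_page` were changed to let the timeout propagate instead of catching it, the main loop could call something like `retry(lambda: next_page(i), TimeoutException)`, where `TimeoutException` is the class from `selenium.common.exceptions`. Note that a random delay alone did not stop Taobao from detecting the automation in the experiment above; the helper only bounds the retries.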