Python web scraping for absolute beginners

Posted 2021-01-28 chingo


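Learning goals:

The basic idea behind a web crawler
Common Python packages for scraping: official documentation, purpose, installation, and frequently used methods
A simple crawler example: scraping the text of the C tutorial from W3Cschool

Python environment: Anaconda3, Spyder, Windows 10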

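I. Basic idea

A crawler fetches the content you want from web pages. The work falls into three main steps: analyze, request, and parse.
First, analyze the target page carefully so that you know which HTML tags hold the content you want (text, images, video). Then have the crawler send a request to the server and receive the HTML page in response. Finally, parse the target content out of that HTML.

Analyze the target page --> send a request to the server --> receive the HTML page --> parse the content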
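
The steps above can be sketched with the standard library alone. This is a minimal illustration, not a full crawler: the page is a hard-coded string standing in for a server response, and the class name ParagraphCollector and the sample HTML are assumptions made for the example.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text found inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data


# Analysis step: we decided the text we want lives in <p> tags.
# Request/response steps: replaced here by a hard-coded page (hypothetical content).
html = "<html><body><h1>C Tutorial</h1><p>Data types</p><p>Variables</p></body></html>"

# Parsing step: keep only the target content.
collector = ParagraphCollector()
collector.feed(html)
print(collector.paragraphs)
```

In a real crawler the hard-coded string would be replaced by the body returned from the server, which is what the packages in the next section are for.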

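II. Common Python packages for scraping

1. urllib [official documentation (English)]

Purpose: urllib is a package for working with URLs. It consists of several modules and is mainly used to send requests to a server and read the response.

urllib.request opens and reads network resources.

Installation:
urllib is part of the Python standard library, so it does not need to be installed with pip or conda.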

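Common methods
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

url: the URL of the page to fetch (a string or a Request object).

data: the request body, as bytes, to send when making a POST request.

timeout: the time limit for the request, in seconds. If it is omitted, the global default timeout is used.

Return value: a response object with methods such as geturl(), info(), getcode(), and read().

See the official documentation for the remaining parameters. Note that cafile, capath, and cadefault were removed in Python 3.12; use context instead.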
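
The data, timeout, and return-value descriptions above can be tried without network access. The sketch below is illustrative only: the form fields and the page text are made up, and a data: URL stands in for a real server so that urlopen has something to read.

```python
import urllib.parse
import urllib.request

# data: the POST body must be bytes (these form fields are hypothetical).
form = {"key": "value", "lang": "python"}
body = urllib.parse.urlencode(form).encode("utf-8")
request = urllib.request.Request("http://httpbin.org/post", data=body)
print(request.get_method())  # a request that carries data is sent as POST

# Return value: read and decode a response. The data: URL below embeds a
# made-up page, so nothing is fetched from the network.
page = "<p>你好, crawler</p>"
data_url = "data:text/html;charset=utf-8," + urllib.parse.quote(page)

with urllib.request.urlopen(data_url, timeout=5) as response:
    raw = response.read()  # bytes, not str
    charset = response.info().get_content_charset() or "utf-8"
    text = raw.decode(charset)  # decode with the charset the response declares

print(type(raw).__name__, charset)
print(text)
```

Reading the charset from the response headers, with utf-8 as a fallback, is one reasonable choice; pages that declare no charset or declare the wrong one still need to be handled case by case.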

Example

>>> import urllib.request
>>> response = urllib.request.urlopen('http://www.python.org/')
>>> # urlopen returns the raw bytes sent by the HTTP server, so decode them to avoid garbled text.
>>> print(response.read(100).decode('utf-8'))
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtm

2. requests [official documentation (English)] [official documentation (Chinese)]

Purpose: like urllib, it sends requests to a server and retrieves the response.

Installation:
On Windows, open Run, start cmd, and enter the following at the command prompt:
pip install requests
or: conda install requests

Common methods:
(one function for each HTTP method)

response = requests.get('https://httpbin.org/get')
which is equivalent to: response = requests.request('GET', 'https://httpbin.org/get')
response = requests.put('http://httpbin.org/put', data={'key': 'value'})
response = requests.delete('http://httpbin.org/delete')
response = requests.head('http://httpbin.org/get')
response = requests.options('http://httpbin.org/get')

Each call returns a response object.

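Get the body as text:
>>> response.text
Get non-text content such as audio or images as bytes:
>>> response.content

Parse a JSON response body

>>> import requests
>>> response = requests.get('https://api.github.com/events')
>>> response.json()

Custom request headers
Some sites reject requests that do not look like they come from a browser, so you may need to add HTTP headers (typically User-Agent) to the request:

>>> url = 'https://api.github.com/some/endpoint'
>>> headers = {'user-agent': 'my-app/0.0.1'}
>>> response = requests.get(url, headers=headers)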

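Get the response status code:

>>> response = requests.get('http://httpbin.org/get')
>>> response.status_code
200

Get cookies: response.cookies (an attribute, not a method)

Redirects and request history: response.history (an attribute holding the list of responses that were followed to reach the final one)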
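
To check what a request with custom headers will actually look like before it is sent, requests lets you build and prepare it first. The sketch below assumes the same placeholder endpoint and user-agent string used above; the fetch helper is a hypothetical wrapper that is defined but not called, so nothing touches the network.

```python
import requests

url = "https://api.github.com/some/endpoint"
headers = {"user-agent": "my-app/0.0.1"}

# Build the request without sending it, so the final headers can be inspected.
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])  # header lookup is case-insensitive


def fetch(session, prepared_request, timeout=10):
    """Send a prepared request; return its status code, redirect history, and cookies."""
    response = session.send(prepared_request, timeout=timeout)
    return response.status_code, response.history, response.cookies
```

With network access, fetch(requests.Session(), prepared) would send the request and return the three response attributes described above.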

Example

>>> import requests
>>> response = requests.get('https://api.github.com/events')
>>> response.text
'[{"repository":{"open_issues":0,"url":"https://github.com/...

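3. BeautifulSoup [official documentation (Chinese)]

Purpose: parsing HTML content.

Installation:
On Windows, open Run, start cmd, and enter the following at the command prompt:
pip install beautifulsoup4
or: conda install beautifulsoup4

Common methods:

find_all(name, attrs, recursive, string, **kwargs)

find_all() searches all descendants of the current tag and returns those that match the filter.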
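
The find_all() snippets in this section use a soup object that is never defined in the text. Their output matches the "three sisters" sample document from the Beautiful Soup documentation, so a setup along the following lines is assumed; the HTML here is a shortened version of that sample, and html.parser is chosen only because it ships with Python.

```python
import re

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

# html.parser is part of the standard library, so no extra parser is required.
soup = BeautifulSoup(html_doc, "html.parser")

titles = soup.find_all("title")                        # filter by tag name
sisters = soup.find_all("a")                           # every <a> tag
elsie_links = soup.find_all(href=re.compile("elsie"))  # filter by attribute pattern

print(titles)
print([a["id"] for a in sisters])
print([a.get_text() for a in elsie_links])
```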

The name parameter
Passing name finds every tag with that name; a string argument is treated as a tag name.
A simple example:

soup.find_all("title")
# [<title>The Dormouse's story</title>]

Keyword arguments
If you pass id, Beautiful Soup searches each tag's "id" attribute.

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass href, Beautiful Soup searches each tag's "href" attribute:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

The following example finds every tag that has an id attribute, whatever its value:

soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Several keyword arguments can be combined to filter on multiple attributes at once:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Some attribute names cannot be used as keyword arguments, for example the HTML5 data-* attributes:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

You can still search for such attributes by passing a dictionary through the attrs parameter of find_all():

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]

Searching by CSS class is very useful, but class is a reserved word in Python, so using class as a keyword argument is a syntax error. Starting with Beautiful Soup 4.1.1, you can search by CSS class with the class_ parameter:

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Like other keyword arguments, class_ accepts different kinds of filter: a string, a regular expression, a function, or True:

soup.find_all(class_=re.compile("itl"))  # regular expression filter
# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)  # function filter
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

A tag's class attribute is multi-valued. When you search by CSS class, each of the tag's classes is matched separately:

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.find_all("p", class_="strikeout")  # string filter
# [<p class="body strikeout"></p>]

css_soup.find_all("p", class_="body")  # string filter
# [<p class="body strikeout"></p>]

css_soup.find_all("p", class_="body strikeout")  # string filter on the exact attribute value
# [<p class="body strikeout"></p>]

# The class_ search can also be written with attrs:
soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

find(name, attrs, recursive, string, **kwargs)

find() returns the first match.
When nothing matches, find_all() returns an empty list and find() returns None.
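
The difference in the "nothing found" case matters in practice, because calling a method on the None returned by find() raises an AttributeError. A small sketch, using a made-up HTML fragment and html.parser:

```python
from bs4 import BeautifulSoup

html = '<ul><li class="item">one</li><li class="item">two</li></ul>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li", class_="item")      # first match only
every = soup.find_all("li", class_="item")  # all matches, as a list

missing_one = soup.find("table")      # no <table> in the fragment
missing_all = soup.find_all("table")

print(first.get_text(), len(every))
print(missing_one, missing_all)

# Guard against None before using the result of find().
table_text = missing_one.get_text() if missing_one is not None else ""
```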

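4. selenium [official documentation (English)]

Purpose: selenium is a browser automation and testing framework. Many sites require a login before their pages can be scraped; selenium can execute JavaScript, simulate logging in, fill in forms, click buttons, scroll, and so on. The webdriver module is the part used most.

Installation:
On Windows, open Run, start cmd, and enter the following at the command prompt:
pip install selenium
or: conda install selenium
The browser driver that webdriver controls must be downloaded separately, and each browser needs its own driver.

Download links for the drivers of the major browsers: https://selenium-python.readthedocs.io/installation.html

Installation instructions and documentation for the Chrome driver: https://sites.google.com/a/chromium.org/chromedriver/getting-started

After downloading and unzipping the driver, place it in the Chrome installation directory.

If you do not know where Chrome is installed, you can look it up in Chrome itself (for example, the chrome://version page shows the executable path).

Finally, configure the environment variable for the driver by adding the directory that contains it to PATH.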

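Common methods (these are the Selenium 3 helper names; Selenium 4 uses calls of the form find_element(By.ID, ...) instead):
Finding a single element: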

find_element_by_id

find_element_by_name

find_element_by_xpath

find_element_by_link_text

find_element_by_partial_link_text

find_element_by_tag_name

find_element_by_class_name

find_element_by_css_selector

Finding multiple elements (these return a list):

find_elements_by_name

find_elements_by_xpath

find_elements_by_link_text

find_elements_by_partial_link_text

find_elements_by_tag_name

find_elements_by_class_name

find_elements_by_css_selector

Typing text into an element:
element.send_keys("some text")

Executing JavaScript:

driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
driver.execute_script('alert("To Bottom")')

Getting an element's text:

driver.get(url)
element = driver.find_element_by_class_name('p')  # 'p' is the class name to look for
print(element.text)

Example:

from selenium import webdriver

chromedriver = "C:/Users/Administrator/AppData/Local/Google/Chrome/Application/chromedriver.exe"
driver = webdriver.Chrome(chromedriver)

# URL for the browser to open
url = "http://www.baidu.com"
driver.get(url)

# Type the keyword "python" into the Baidu search box
driver.find_element_by_id("kw").send_keys("python")
# Click the search button
driver.find_element_by_id("su").click()
driver.quit()  # close the browser

When you run the program, it opens the browser, types the keyword python, clicks the button, moves to the search results page, and then closes the browser.

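III. Example: scraping the text of the C tutorial from W3Cschool

Example:
webdriver can also request pages and retrieve their content, so this example does not use urllib or requests.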

import time

from bs4 import BeautifulSoup
from selenium import webdriver


class CrawText():
    def __init__(self):
        # Request header that imitates a browser (kept for reference; webdriver drives a real browser and does not use it)
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'}
        self.web_url = 'https://www.w3cschool.cn/c/c-data-types.html'  # starting URL of the W3Cschool C tutorial
        self.folder_path = 'W3Cschool-C.txt'  # path of the output text file

    def get_text(self):
        with open(self.folder_path, 'a', encoding='utf-8') as f:
            driver = webdriver.Chrome('C:/Users/Administrator/AppData/Local/Google/Chrome/Application/chromedriver.exe')
            # Loop 21 times: scrape the current page, then click through to the next one
            for _ in range(0, 21):
                driver.get(self.web_url)  # load the page
                element = driver.find_element_by_class_name('content-bg')  # locate the content container
                soup = BeautifulSoup(element.get_attribute('innerHTML'), 'lxml')  # parse the container's HTML
                f.write(soup.get_text())  # write the text to the file
                driver.find_element_by_class_name('next-link').click()  # simulate a click on the "next" link
                time.sleep(1)  # give the next page a moment to load
                self.web_url = driver.current_url  # URL of the page reached by the click
            driver.quit()


crawtext = CrawText()  # create an instance of the class
crawtext.get_text()    # run the crawl
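
The parsing step of the crawler can be tried on its own, without a browser. The following sketch is an assumption-laden illustration: extract_content and the sample HTML are made up for the example, only the content-bg class name comes from the crawler above, and html.parser is used so that lxml does not have to be installed.

```python
from bs4 import BeautifulSoup


def extract_content(html, container_class="content-bg"):
    """Return the text of the first element with the given class, or '' if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(class_=container_class)
    if container is None:
        return ""
    # One line per text node, with surrounding whitespace stripped.
    return container.get_text(separator="\n", strip=True)


# Hypothetical page fragment: navigation plus the tutorial container.
sample = """
<div class="nav">skip me</div>
<div class="content-bg"><h1>C Data Types</h1><p>int, char, float</p></div>
"""

print(extract_content(sample))
```

Keeping the extraction in a plain function like this makes it easy to check against saved pages before running the full browser-driven crawl.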
