Python crawler: scraping Codeforces ratings

Posted 2021-09-25 Keep--Silent

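Crawling process:

Have Python pose as a browser and fetch the full HTML of the page
Parse the HTML with BeautifulSoup (bs)
Locate the data we need
Extract the data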

from bs4 import BeautifulSoup
import urllib.request  # send a request to the given URL and read the page
import urllib.error    # exceptions raised when the request fails


def getData(baseurl):
    # Fetch the page at baseurl and return its HTML as a string ('Error' on failure)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 SLBrowser/7.0.0.6241 SLBChan/103'
    }
    req = urllib.request.Request(baseurl, headers=headers)
    try:
        response = urllib.request.urlopen(req)
        data = response.read().decode("utf-8")
        # print(data)
        return data
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
        return 'Error'


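def myre(s):
    # Return the text between the first '">' (the end of an opening tag whose
    # last attribute value is quoted) and the next '</'; None if not found.
    flag = 0
    ans = ""
    for i in range(1, len(s) - 1):
        # Compare against a single '"' character; a two-character string
        # such as '\\"' can never equal s[i - 1].
        if s[i] == '>' and s[i - 1] == '"':
            flag = 1
        elif flag == 1:
            if s[i] == '<' and s[i + 1] == '/':
                return ans
            else:
                ans += s[i]
    return None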


def get_rating(name):
    baseurl = "http://codeforces.com/profile/" + name
    data = getData(baseurl)
    bs = BeautifulSoup(data, "html.parser")
    # First <li> of the user info box, which is expected to hold the rating
    temp = bs.select('#pageContent > div:nth-child(3) > div.userbox > div.info > ul > li:nth-child(1)')
    s = str(temp)
    rating = myre(s)
    # Always return a str: "None" when no rating could be extracted
    if rating is None:
        return "None"
    return rating


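if __name__ == '__main__':
    # Start with a sample handle, then keep reading handles from stdin
    name = "tourist"
    while True:
        rating = get_rating(name)
        print(rating)
        name = input()
#   get_rating returns a str:
#   the rating if the user name exists, otherwise "None"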
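
To see why the User-Agent header in getData matters, the following minimal sketch compares the header set on a Request object with the one urllib sends by default. It makes no network call. The shortened User-Agent string and the variable names are chosen here for illustration only.

```python
import urllib.request

# Illustrative, shortened User-Agent; any browser-like string would do.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}
req = urllib.request.Request("http://codeforces.com/profile/tourist", headers=headers)

# Request stores header names capitalized as "User-agent".
custom_agent = req.get_header("User-agent")

# Without a custom header, the default opener announces itself as Python.
default_agent = dict(urllib.request.build_opener().addheaders)["User-agent"]

print(custom_agent)   # the browser-like string set above
print(default_agent)  # starts with "Python-urllib/"
```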

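1. First, getData fetches the HTML of the target page. To pose as a browser, the request needs a User-Agent header; otherwise the request openly announces "I am Python", which is unlikely to be accepted.
2. bs (BeautifulSoup) parses the HTML into a tree structure, which makes the data easier to search.
3. bs.select filters out the part we need.
4. Finally, a regular expression would normally extract the needed part (I don't know regular expressions, so I wrote my own myre instead).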
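
For step 4, here is a minimal sketch of what a regular-expression version of myre could look like, using only the standard re module. The function name extract_rating and the sample string are hypothetical; the sample only imitates the kind of text that str(bs.select(...)) might produce, and the real Codeforces markup may differ.

```python
import re

# Hypothetical fragment; the rating value and class name are made up.
SAMPLE = '[<li>Contest rating: <span class="rating" style="font-weight:bold;">1234</span></li>]'


def extract_rating(s):
    """Return the text between the first '">' and the next '</', or None."""
    # Non-greedy group stops at the first closing tag; DOTALL lets it span lines.
    match = re.search(r'">(.*?)</', s, re.DOTALL)
    return match.group(1) if match else None


print(extract_rating(SAMPLE))  # 1234
print(extract_rating("[]"))    # None
```

Like myre, this returns None when nothing matches, so get_rating could call it in the same way.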

Appendix: lookup methods of bs.select
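
As a small illustration of how bs.select looks things up, the sketch below runs CSS selectors against a hypothetical, heavily simplified fragment. The markup, class names, and values are assumptions for the example, not the real profile page, whose structure is larger and may change.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for illustration only.
HTML = """
<div id="pageContent">
  <div class="info">
    <ul>
      <li>Contest rating: <span class="rating">1234</span></li>
      <li>Contribution: <span>+10</span></li>
    </ul>
  </div>
</div>
"""

bs = BeautifulSoup(HTML, "html.parser")

# select() returns a list of all elements matching the CSS selector:
# '#id' matches by id, '.name' by class, '>' means direct child,
# and ':nth-child(1)' picks the first child of its parent.
items = bs.select("#pageContent > div.info > ul > li:nth-child(1)")
print(len(items))

# select_one() returns the first match, or None when nothing matches.
span = bs.select_one("#pageContent div.info li:nth-child(1) > span")
rating = span.get_text(strip=True) if span else None
print(rating)
```

Reading the text with get_text avoids converting the result to a string and scanning it by hand.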
