Simple data scraping with Python

Posted 2020-09-21


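This attempt failed. Even with headers and parameters identical to what Firefox shows, it does not work: the request returns a web page, but the page contains no data. I tried disabling JavaScript and then saw a cookie (with JavaScript enabled there is no cookie), but scraping with that cookie still fails. When I checked again some time later, the cookie's contents had not changed either. It is a bit discouraging, but I am posting this anyway as a small to-do for myself.

If any experts pass by, I would be grateful for advice.

The scripts for the other two websites both work. I will post that code shortly and skip the walkthrough.

Target website: http://www.geomag.bgs.ac.uk/data_service/models_compass/igrf_form.shtml

First, handle the conversion from a date to decimal years. This is a rough conversion, accurate to the day only.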

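def date2dy(year, month, day):
    # Days in each month of a common year; February is adjusted below.
    months = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    oneyear = 365
    # Gregorian rule: divisible by 4, except century years not divisible by 400.
    if year % 4 == 0 and (year % 100 != 0 or year % 400 == 0):
        months[1] = 29
        oneyear = 366

    # Whole days elapsed since 1 January: all full months before `month`
    # plus the days already passed in the current month.
    days = sum(months[:month - 1]) + day - 1
    # Divide by the real length of this year, not a fixed 366.
    return year + days / oneyear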
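
As a cross-check on the conversion, here is a minimal sketch that leaves the calendar arithmetic to the standard library's datetime module. The name date_to_decimal_year is illustrative and not part of the original script; it assumes the same day-level resolution.

```python
from datetime import date


def date_to_decimal_year(year, month, day):
    """Return the date as a decimal year, resolved to whole days."""
    start = date(year, 1, 1)
    # Days elapsed since 1 January of the same year (0 on 1 January itself).
    elapsed = (date(year, month, day) - start).days
    # Length of this particular year: 365, or 366 in a leap year.
    year_length = (date(year + 1, 1, 1) - start).days
    return year + elapsed / year_length


# 1 December 2016, formatted to two decimals for a form field.
print(f"{date_to_decimal_year(2016, 12, 1):.2f}")  # prints 2016.92
```

Formatting to two decimals gives 2016.92 for 1 December 2016, which matches the date value used in the request parameters further down.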

The first small goal is to fetch the data for 2016-12-01.

Open the Firefox developer tools with F12 and switch to the Network tab.


Submit the form, and the request shows up in the list.


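The useful pieces are the request headers, the request URL, and the parameters. Copy them into a program and try it.

I spent more than a day on this part and could not get the data out. (I'm so bad at this.jpg)

Here is the code for now. If any experts pass by, I would be grateful for advice.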

#!/usr/bin/python

import os
import sys

import requests

web_url = 'http://www.geomag.bgs.ac.uk/data_service/models_compass/igrf_form.shtml'
request_url = 'http://www.geomag.bgs.ac.uk/cgi-bin/igrfsynth'
# Save the raw response next to the script.
filepath = os.path.join(sys.path[0], 'data_igrf_raw_.html')
# Headers copied from the Firefox request.
# Content-Length is left out: requests sets it from the encoded payload.
headers = {
    'Host': 'www.geomag.bgs.ac.uk',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:53.0) Gecko/20100101 Firefox/53.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://www.geomag.bgs.ac.uk/data_service/models_compass/igrf_form.shtml',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}
payload = {
    'name': '-',  # your name and email address
    'coord': '1',  # '1': Geodetic, '2': Geocentric
    'date': '2016.92',  # decimal years
    'alt': '150',  # Altitude
    'place': '',
    'degmin': 'y',  # Position Coordinates: 'y': In Degrees and Minutes, 'n': In Decimal Degrees
    'latd': '60',  # latitude degrees (degrees negative for south)
    'latm': '0',  # latitude minutes
    'lond': '120',  # longitude degrees (degrees negative for west)
    'lonm': '0',  # longitude minutes
    'tot': 'y',  # Total Intensity (F)
    'dec': 'y',  # Declination (D)
    'inc': 'y',  # Inclination (I)
    'hor': 'y',  # Horizontal Intensity (H)
    'nor': 'y',  # North Component (X)
    'eas': 'y',  # East Component (Y)
    'ver': 'y',  # Vertical Component (Z)
    'map': '0',  # Include a Map of the Location: '0': NO, '1': YES
    'sv': 'n'
}
# If Secular Variation (rate of change) is needed, set 'sv': 'y'.
r = requests.post(request_url, data=payload, headers=headers)
# Write the returned page to disk for inspection.
with open(filepath, 'w', encoding='utf-8') as fid:
    fid.write(r.text)
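
One thing that may be worth checking when a replayed request behaves differently from the browser is whether the form body the script sends really matches the one Firefox shows. The sketch below uses only the standard library, so it runs offline. encode_form and sample are illustrative names, and sample is just a subset of the parameters above. For simple ASCII values the result should match what requests produces from a dict passed as data=, but treat that as an assumption and compare it against the request body in the Network tab.

```python
from urllib.parse import urlencode


def encode_form(fields):
    """Encode a dict as an application/x-www-form-urlencoded body (bytes)."""
    return urlencode(fields).encode("ascii")


# Hypothetical subset of the form fields, for illustration only.
sample = {"name": "-", "coord": "1", "date": "2016.92", "place": ""}

body = encode_form(sample)
print(body.decode("ascii"))
# The Content-Length of a POST is the byte length of this body,
# so it changes whenever a field value changes.
print("Content-Length:", len(body))
```

Because the length depends on the field values, a Content-Length copied from one browser request is only valid for that exact set of values.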
