Reading Notes on 《Python网络数据采集》 (Web Scraping with Python)
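1. Parsing JSON data
Python's json library converts a JSON object into a dictionary, a JSON array into a list, and a JSON string into a Python string.
The following example uses Python's JSON parsing library to handle the different data types that can appear in a JSON string: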
>>> import json
>>> jsonString = '{"arrayOfNums":[{"number":0},{"number":1},{"number":2}],"arrayOfFruits":[{"fruit":"apple"},{"fruit":"banana"},{"fruit":"pear"}]}'
>>> jsonObj = json.loads(jsonString)
>>> print(jsonObj.get("arrayOfNums"))
[{'number': 0}, {'number': 1}, {'number': 2}]
>>> print(jsonObj.get("arrayOfNums")[1])
{'number': 1}
>>> print(jsonObj.get("arrayOfNums")[1].get("number")+jsonObj.get("arrayOfNums")[2].get("number"))
3
>>> print(jsonObj.get("arrayOfFruits")[2].get("fruit"))
pear
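The first line of output is a list of dictionaries, the second is a single dictionary, the third is an integer (the sum of the "number" values at indexes 1 and 2 of that list), and the fourth is a string.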
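To make the type mapping concrete, here is a minimal sketch that decodes one JSON document and prints the Python type of each value. The document and its field names are made up for illustration; numbers, booleans, and null are included as well, since the json module maps those to int or float, bool, and None.

```python
import json

# Hypothetical document covering the common JSON value types.
raw = '{"name": "pear", "count": 3, "price": 1.5, "tags": ["a", "b"], "ripe": true, "origin": null}'

doc = json.loads(raw)

# Map each key to the name of the Python type that json.loads produced.
type_names = {key: type(value).__name__ for key, value in doc.items()}

for key, name in type_names.items():
    print(key, "->", name)
```

The top-level JSON object itself becomes a dict, which is why the example in this section can call .get() on it.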
Decoding a web API response with Python's JSON parsing function, we can print the country code for the IP address 50.78.253.58.
# -*- coding: utf-8 -*-
import json
from urllib.request import urlopen


def getCountry(ipAddress):
    # Request the JSON record for this IP address and decode the response body
    response = urlopen("http://freegeoip.net/json/"+ipAddress).read().decode('utf-8')
    responseJson = json.loads(response)
    return responseJson.get("country_code")


print(getCountry("50.78.253.58"))

Output:
US
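The parsing step can be tried without any network access. The sketch below is only an illustration: the sample response body is hypothetical (it assumes the service returns a JSON object with a country_code field, as the code above expects), and parse_country_code is a helper name introduced here, not part of the book's code.

```python
import json

# Hypothetical response body; the field names are assumed from the code above.
sample_response = '{"ip": "50.78.253.58", "country_code": "US", "country_name": "United States"}'


def parse_country_code(response_text):
    """Return the country_code field, or None if the body is not usable JSON."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Treat an empty string the same as a missing field.
    return data.get("country_code") or None


print(parse_country_code(sample_response))       # US
print(parse_country_code("<html>error</html>"))  # None
```

Returning None for an unusable body lets the caller skip an address instead of crashing in the middle of a long crawl.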
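2. Edit history pages of Wikipedia articles
This is a basic Wikipedia crawler: it finds each article's edit history page, extracts the IP addresses recorded in the edit history, and looks up the country code for each IP address.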
# -*- coding: utf-8 -*-
import re
import datetime
import random
import json
from urllib.request import urlopen
from urllib.error import HTTPError  # needed by the except clause in getCountry
from bs4 import BeautifulSoup

# Seed with a float timestamp; recent Python versions reject a datetime object as a seed
random.seed(datetime.datetime.now().timestamp())


def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "lxml")
    # Keep only article links: they start with /wiki/ and contain no colon
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))


def getHistoryIPs(pageUrl):
    # The edit history page URL has the form:
    # http://en.wikipedia.org/w/index.php?title=Title_in_URL&action=history
    pageUrl = pageUrl.replace("/wiki/", "")
    historyUrl = "http://en.wikipedia.org/w/index.php?title="+pageUrl+"&action=history"
    print("history url is: "+historyUrl)
    html = urlopen(historyUrl)
    bsObj = BeautifulSoup(html, "lxml")
    # Find the links whose class attribute is "mw-anonuserlink";
    # they show an IP address instead of a username
    ipAddresses = bsObj.findAll("a", {"class":"mw-anonuserlink"})
    addressList = set()
    for ipAddress in ipAddresses:
        addressList.add(ipAddress.get_text())
    return addressList


def getCountry(ipAddress):
    try:
        response = urlopen("http://freegeoip.net/json/"+ipAddress).read().decode('utf-8')
    except HTTPError:
        return None
    responseJson = json.loads(response)
    return responseJson.get("country_code")


links = getLinks("/wiki/Python_(programming_language)")

while(len(links) > 0):
    for link in links:
        print("-------------------")
        historyIPs = getHistoryIPs(link.attrs["href"])
        for historyIP in historyIPs:
            #print(historyIP)
            country = getCountry(historyIP)
            if country is not None:
                print(historyIP+" is from "+country)

    # Pick one of the linked articles at random as the next starting point
    newLink = links[random.randint(0, len(links)-1)].attrs["href"]
    links = getLinks(newLink)
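The program first fetches the edit history of every article linked from the starting article (the Python programming language article in this example) and looks up the country or region of each editor's IP address. It then picks one of those linked articles at random as the new starting point and does the same for every article linked from that page. This repeats until it reaches a page with no links to other Wikipedia articles.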
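The control flow of that loop can be sketched offline. In the illustration below, LINK_GRAPH, get_links, and random_walk are hypothetical names, and a small in-memory dictionary stands in for Wikipedia; the max_steps guard is an addition of this sketch, not something the original script has.

```python
import random

# Hypothetical link graph standing in for Wikipedia: article -> linked articles.
LINK_GRAPH = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/C"],
    "/wiki/C": [],  # no outgoing article links: the walk stops here
}


def get_links(article_url):
    """Stand-in for getLinks: look the links up instead of fetching a page."""
    return LINK_GRAPH.get(article_url, [])


def random_walk(start_url, max_steps=10, rng=None):
    """Return the starting points visited by the walk, in order."""
    rng = rng or random.Random()
    visited = [start_url]
    links = get_links(start_url)
    steps = 0
    while links and steps < max_steps:
        # The real script would fetch the edit history of every link here.
        next_url = rng.choice(links)
        visited.append(next_url)
        links = get_links(next_url)
        steps += 1
    return visited


path = random_walk("/wiki/A", rng=random.Random(0))
print(path)
```

On the real site a page almost always links to other articles, so a step limit like max_steps is a practical way to make such a crawl stop.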
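The function getHistoryIPs finds every link with the class mw-anonuserlink (these show an anonymous editor's IP address instead of a username) and returns the link texts, that is, the IP addresses, as a set.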
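The extraction step can be illustrated without a network request. The sketch below swaps BeautifulSoup for the standard library's html.parser so that it has no third-party dependency; the HTML fragment is invented for illustration and is not copied from a real history page, and AnonUserLinkParser is a hypothetical helper name.

```python
from html.parser import HTMLParser


class AnonUserLinkParser(HTMLParser):
    """Collect the text of <a> elements whose class list has mw-anonuserlink."""

    def __init__(self):
        super().__init__()
        self.addresses = set()  # a set drops duplicate IP addresses
        self._in_anon_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            classes = (dict(attrs).get("class") or "").split()
            self._in_anon_link = "mw-anonuserlink" in classes

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anon_link = False

    def handle_data(self, data):
        if self._in_anon_link and data.strip():
            self.addresses.add(data.strip())


# Hypothetical fragment of a history page, made up for illustration.
sample_html = """
<ul>
  <li><a class="mw-userlink mw-anonuserlink" href="#">223.104.186.241</a></li>
  <li><a class="mw-userlink" href="#">SomeRegisteredUser</a></li>
  <li><a class="mw-userlink mw-anonuserlink" href="#">223.104.186.241</a></li>
  <li><a class="mw-userlink mw-anonuserlink" href="#">31.203.136.191</a></li>
</ul>
"""

parser = AnonUserLinkParser()
parser.feed(sample_html)
anon_ips = parser.addresses
print(sorted(anon_ips))
```

Because the result is a set, an address that edited the page several times is looked up only once.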
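With the IP addresses from the edit history in hand, the program passes each one to the getCountry function from the previous section to look up its country or region.
Here is part of the output: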
-------------------
history url is: http://en.wikipedia.org/w/index.php?title=Programming_paradigm&action=history
168.216.130.133 is from US
223.104.186.241 is from CN
31.203.136.191 is from KW
192.117.105.47 is from IL
193.80.242.220 is from AT
223.230.96.108 is from IN
39.36.182.41 is from PK
68.151.180.83 is from CA
218.17.157.55 is from CN
110.55.67.15 is from PH
42.111.56.168 is from IN
92.115.222.143 is from MD
197.255.127.246 is from GH
2605:6000:ec0f:c800:edfd:179f:b648:b4b9 is from US
2a02:c7d:a492:f200:e126:2b36:53ca:513a is from GB
-------------------
history url is: http://en.wikipedia.org/w/index.php?title=Object-oriented_programming&action=history
103.74.23.139 is from PK
217.225.8.24 is from DE
223.230.215.145 is from IN
162.204.116.16 is from US
170.142.177.246 is from US
205.251.185.250 is from US
117.239.185.50 is from IN
119.152.87.84 is from PK
93.136.125.208 is from HR
113.199.249.237 is from NP
112.200.199.62 is from PH
103.241.244.36 is from IN
27.251.109.234 is from IN
103.16.68.215 is from IN
121.58.212.157 is from PH
2605:a601:474:600:2088:fbde:7512:53b2 is from US
-------------------
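Output like this is easier to read once it is tallied by country. The sketch below is one possible follow-up step, not part of the original program: it takes the lines printed for the Programming_paradigm article above and counts them with collections.Counter. The variable names are chosen for illustration.

```python
from collections import Counter

# Lines copied from the Programming_paradigm block of the output above.
output_lines = [
    "168.216.130.133 is from US",
    "223.104.186.241 is from CN",
    "31.203.136.191 is from KW",
    "192.117.105.47 is from IL",
    "193.80.242.220 is from AT",
    "223.230.96.108 is from IN",
    "39.36.182.41 is from PK",
    "68.151.180.83 is from CA",
    "218.17.157.55 is from CN",
    "110.55.67.15 is from PH",
    "42.111.56.168 is from IN",
    "92.115.222.143 is from MD",
    "197.255.127.246 is from GH",
    "2605:6000:ec0f:c800:edfd:179f:b648:b4b9 is from US",
    "2a02:c7d:a492:f200:e126:2b36:53ca:513a is from GB",
]

# Split each line into the IP address and the country code, then count codes.
country_counts = Counter(
    line.rsplit(" is from ", 1)[1] for line in output_lines
)

for country, count in sorted(country_counts.items()):
    print(country, count)
```

For this block, US, CN, and IN each appear twice and every other country code appears once.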