Scraping school names from Renren

Posted 2020-10-15 笑看人世冷暖


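Today my boss asked me to collect the school names for every region from the profile page on Renren.

Approach:

1. Get the university data

URL: http://s.xnimg.cn/a44177/allunivlist.js (identified as the relevant file by inspecting the page's requests)

Inspecting the file by hand, together with the pop-up dialog on the page, shows that this JS file contains country, city, and school information. The file is "non-standard" JSON: none of the keys are wrapped in double quotes. For example, the university entries look like this: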

{id:1,univs:[{id:1001,name:"\u6e05\u534e\u5927\u5b66"}

So the first step is to normalize the data file into standard JSON.
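The normalization step could look roughly like the sketch below. It is not the code from the repository. The helper names quote_keys and load_univ_list and the variable name allUnivList in the sample are made up for illustration. It assumes that keys are bare identifiers, that string values are double-quoted, and that the data is the outermost [...] in the file.

```python
import json
import re

# Matches either a complete double-quoted string (left untouched) or a bare
# key that follows "{" or "," and precedes ":".
_TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|([{,]\s*)([A-Za-z_]\w*)(\s*:)')


def quote_keys(text):
    """Wrap unquoted object keys in double quotes, skipping string literals."""
    def repl(match):
        if match.group(2) is None:
            return match.group(0)  # a string literal: keep as-is
        return '%s"%s"%s' % (match.group(1), match.group(2), match.group(3))
    return _TOKEN.sub(repl, text)


def load_univ_list(js_text):
    """Cut out the outermost [...] and parse it as JSON (assumed file layout)."""
    start = js_text.index('[')
    end = js_text.rindex(']')
    return json.loads(quote_keys(js_text[start:end + 1]))


# Illustrative input shaped like the fragment quoted above.
sample = r'var allUnivList = [{id:1,univs:[{id:1001,name:"\u6e05\u534e\u5927\u5b66"}]}];'
data = load_univ_list(sample)
print(data[0]["univs"][0]["name"])  # 清华大学
```

Matching string literals first means a school name that happens to contain text like ",abc:" is not mistaken for a key.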

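Next, look at the fields in the file in detail.

To speed up loading, allunivlist.js is minified onto a single line. A tip for analyzing it quickly: save the file locally, open it in vim, and use vim's bracket matching (shift+5, i.e. the % key) to work inward from the outermost bracket. (For more on vim, see 《Vimtutor拾遗》.)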

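The analysis shows that the overall structure of the file is:

[{country 1},{country 2},{country 3}....]

Country: {id:xxx, univs:xxxx, name:xxxx, provs:xxxx, country_id:xxx} (provs is the list of provinces)

provs: [{province 1},{province 2},{province 3}....]

Province: {id:xxx, univs:xxx, country_id:xxxx, name:xxxx} (univs is the list of universities)

univs: [{university 1},{university 2},{university 3}....]

University: {id:xxx, name:xxxx}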

From this file we can obtain the university information for every province in China.
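Once the file has been parsed, the nested structure described above can be walked with a small generator. This is a minimal sketch: the function name iter_universities and the sample values (other than id 1001 and its name, which come from the fragment quoted earlier) are made up for illustration. Whether the country-level univs list is actually populated for China is not something the file description settles, so the sketch handles both levels.

```python
def iter_universities(countries):
    """Yield (country name, province name or None, university name) triples."""
    for country in countries:
        # Universities listed directly under the country, if any.
        for univ in country.get("univs") or []:
            yield country.get("name"), None, univ["name"]
        # Universities listed under each province.
        for prov in country.get("provs") or []:
            for univ in prov.get("univs") or []:
                yield country.get("name"), prov.get("name"), univ["name"]


# Hypothetical sample following the country / provs / univs layout.
sample_countries = [
    {
        "id": 0,
        "name": "China",
        "country_id": 0,
        "univs": [],
        "provs": [
            {
                "id": 1,
                "name": "Beijing",
                "country_id": 0,
                "univs": [{"id": 1001, "name": "\u6e05\u534e\u5927\u5b66"}],
            }
        ],
    }
]

rows = list(iter_universities(sample_countries))
for row in rows:
    print(row)
```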

The information is saved to a MySQL database. Here is part of the code:

    # Module-level imports this method relies on (bsp is assumed to be an
    # alias for bs4.BeautifulSoup; urllib2 means this is Python 2 code):
    #   import re, time, urllib2
    #   from bs4 import BeautifulSoup as bsp
    def addJuinorSchools(self, cities):
        'Insert junior high school data'
        starttime = time.time()
        insert = 0
        mycursor = self.__mydb.cursor()
        try:
            for city in cities:
                # Each entry has the form "<city number>:<city name>".
                citynumber, cityname = city.split(':')
                rqtApi = self.rrjuniorApi + citynumber + '.html'
                try:
                    htmlhandle = urllib2.urlopen(rqtApi)
                except Exception as e:
                    self.log.write(time.asctime() + u'Failed to request junior high school page: ' + str(e) + '\n')
                else:
                    print('---Downloaded data for %s---' % (cityname))
                    htmldoc = htmlhandle.read().decode('utf-8')
                    htmlhandle.close()
                    btsp = bsp(htmldoc, 'html.parser')
                    # County links; the onclick attribute carries the numeric county code.
                    countieshtml = btsp.find_all('a', href="#highschool_anchor")
                    counties = []
                    for countyhtml in countieshtml:
                        counties.append([countyhtml.string.strip(), re.search(r'[0-9]{4,}', countyhtml['onclick']).group()])
                    mycursor.execute(self.queryCityIdSql, (cityname,))
                    cityid = mycursor.fetchone()['id']
                    for county in counties:
                        mycursor.execute(self.queryCountyIdSql, (county[0], cityid))
                        try:
                            countyid = mycursor.fetchone()['id']
                        except Exception as e:
                            self.log.write('No id found for %s-->%s\n' % (cityname, county[0]))
                        else:
                            # The school list of a county is the <ul> whose id ends with the county code.
                            # The value is quoted because it starts with a digit.
                            juniorshtml = btsp.select('ul[id$="' + county[1] + '"]')
                            juniorshtml = juniorshtml[0].find_all('a') if len(juniorshtml) else []
                            for junior in juniorshtml:
                                # .string is None when the tag has nested markup, so guard before stripping.
                                juniorname = (junior.string or '').strip()
                                if juniorname:
                                    # Pass query parameters as a tuple, as the DB-API expects.
                                    mycursor.execute(self.queryJuniorSql, (juniorname,))
                                    if mycursor.fetchone()['num'] == 0:
                                        insert += 1
                                        print('Inserting junior high school %s--%s--%s' % (cityname, county[0], juniorname))
                                        mycursor.execute(self.insertJuniorSql, (juniorname, countyid))
                            self.__mydb.commit()
        except Exception as e:
            self.log.write(time.asctime() + str(e) + '\n')
        mycursor.execute(self.countJuniorSql)
        countnum = mycursor.fetchone()['countnum']
        endtime = time.time()
        self.printExeResult(insert, endtime - starttime, countnum, 'junior high school')
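The method above checks whether a school name is already stored and inserts it only if it is not. That pattern can be tried in isolation with the sketch below, which uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The table name junior_school, its columns, and the helper insert_if_missing are hypothetical. The actual SQL statements behind queryJuniorSql and insertJuniorSql are not shown in the post.

```python
import sqlite3


def insert_if_missing(conn, name, county_id):
    """Insert a school unless a row with the same name exists; return True if inserted."""
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM junior_school WHERE name = ?", (name,))
    if cur.fetchone()[0] == 0:
        cur.execute(
            "INSERT INTO junior_school (name, county_id) VALUES (?, ?)",
            (name, county_id),
        )
        return True
    return False


# In-memory database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE junior_school (id INTEGER PRIMARY KEY, name TEXT, county_id INTEGER)"
)

inserted = [insert_if_missing(conn, n, 1) for n in ("School A", "School B", "School A")]
conn.commit()
total = conn.execute("SELECT COUNT(*) FROM junior_school").fetchone()[0]
print(inserted, total)
```

Checking by name alone, as the original call with a single parameter suggests, treats schools with the same name in different counties as duplicates. A lookup on both name and county id would avoid that if it matters for the data.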

For the full details, see the code on my GitHub: https://github.com/zhangxux/renren_spider
