python: Analyzing the Common Crawl index - http://index.commoncrawl.org/

Editor's note: this post introduces an example of analyzing the Common Crawl index at http://index.commoncrawl.org/ with Python; hopefully it serves as a useful reference.
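
The same index can also be queried record by record over HTTP instead of downloading the raw shard files, which is handy for spot checks. The sketch below is not part of the original program; the endpoint name (CC-MAIN-2015-14-index) and the url / output=json query parameters are assumptions about the public CDX server interface and should be verified against its documentation.

# Hypothetical helper for querying http://index.commoncrawl.org/ directly.
# Endpoint name and query parameters are assumptions, not from the program below.
import json
import requests

CDX_API = 'http://index.commoncrawl.org/CC-MAIN-2015-14-index'

def cdx_lookup(url_pattern):
    # output=json is assumed to return one JSON object per line
    response = requests.get(CDX_API,
                            params={'url': url_pattern, 'output': 'json'})
    response.raise_for_status()
    for line in response.text.splitlines():
        if line.strip():
            yield json.loads(line)

for record in cdx_lookup('commoncrawl.org/*'):
    print record.get('url'), record.get('mime')

Each returned record carries the same url and mime metadata fields that the program below extracts from the downloaded cdx files.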

# -*- coding: utf-8 -*-
"""
common-crawl-cdx.py

A simple example program to analyze the Common Crawl index.

This is implemented as a single-stream job which accesses S3 via HTTP,
so that it can easily be run from any laptop, but it could easily be
converted to an EMR job that processes the 300 index files in parallel.

If you are only interested in a certain set of TLDs or PLDs, the program
could be enhanced to use cluster.idx to figure out the correct subset of
the 300 shards to process. e.g. the .us TLD is entirely contained in 
cdx-00298 & cdx-00299. A rough sketch of this idea appears after the program.

Created on Wed Apr 22 15:05:54 2015

@author: Tom Morris <tfmorris@gmail.com>
"""

from collections import Counter
import json
import requests
import zlib


BASEURL = 'https://aws-publicdatasets.s3.amazonaws.com/'
INDEX1 = 'common-crawl/cc-index/collections/CC-MAIN-2015-11/indexes/'
INDEX2 = 'common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/'
SPLITS = 300


def process_index(index):
    total_length = 0
    total_processed = 0
    total_urls = 0
    mime_types = Counter()
    for i in range(SPLITS):
        unconsumed_text = ''
        filename = 'cdx-%05d.gz' % i
        url = BASEURL + index + filename
        response = requests.get(url, stream=True)
        length = int(response.headers['content-length'].strip())
        decompressor = zlib.decompressobj(16+zlib.MAX_WBITS)
        total = 0

        for chunk in response.iter_content(chunk_size=2048):
            total += len(chunk)
            if len(decompressor.unused_data) > 0:
                # reached the end of a gzip member; restart the decompressor
                # and feed it the leftover bytes plus the new chunk
                to_decompress = decompressor.unused_data + chunk
                decompressor = zlib.decompressobj(16+zlib.MAX_WBITS)
            else:
                to_decompress = decompressor.unconsumed_tail + chunk
            s = unconsumed_text + decompressor.decompress(to_decompress)
            unconsumed_text = ''

            if len(s) == 0:
                # Not sure why this happens, but doesn't seem to affect things
                print 'Decompressed nothing %2.2f%%' % (total*100.0/length),\
                    length, total, len(chunk), filename
            for l in s.split('\n'):
                pieces = l.split(' ')
                if len(pieces) < 3 or l[-1] != '}':
                    unconsumed_text = l
                else:
                    json_string = ' '.join(pieces[2:])
                    try:
                        metadata = json.loads(json_string)
                    except ValueError:
                        # malformed record; report context and re-raise
                        print 'JSON load failed: ', total, l
                        raise
                    page_url = metadata['url']  # avoid shadowing the shard url above
                    if 'mime' in metadata:
                        mime_types[metadata['mime']] += 1
                    else:
                        mime_types['<none>'] += 1
                        # print 'No mime type for ', page_url
                    total_urls += 1
                    # print page_url
        print 'Done with ', filename
        total_length += length
        total_processed += total
    print 'Processed %2.2f %% of %d bytes (compressed).  Found %d URLs' %\
        ((total_processed * 100.0 / total_length), total_length, total_urls)
    print "Mime types:"
    for k, v in mime_types.most_common():
        print '%5d %s' % (v, k)

# Only the newest collection (INDEX2) is processed; add INDEX1 to the list
# to analyze both.
for index in [INDEX2]:
    print 'Processing index: ', index
    process_index(index)
    print 'Done processing index: ', index
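
The docstring above notes that cluster.idx could be used to restrict the job to the shards that actually contain a given TLD. Below is a rough sketch of that idea; the tab-separated layout of cluster.idx (a SURT key plus timestamp, then the shard filename, offset, and length) is an assumption and should be checked against the file itself.

# Sketch: find which cdx-*.gz shards cover a single TLD by scanning cluster.idx.
# The field layout is assumed, not taken from the program above.
import requests

CLUSTER_URL = ('https://aws-publicdatasets.s3.amazonaws.com/'
               'common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/'
               'cluster.idx')

def shards_for_tld(tld):
    # SURT-form keys reverse the host name, so .us URLs are keyed as 'us,...'
    prefix = tld + ','
    shards = set()
    response = requests.get(CLUSTER_URL, stream=True)
    for line in response.iter_lines():
        if not line:
            continue
        fields = line.split('\t')
        if len(fields) < 2:
            continue
        if fields[0].startswith(prefix):
            shards.add(fields[1])
    return sorted(shards)

print shards_for_tld('us')  # per the docstring, expect cdx-00298.gz and cdx-00299.gz

The resulting list could replace range(SPLITS) in process_index() so that shards which cannot contain the TLD of interest are never downloaded.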
