Python: analyzing the Common Crawl index - http://index.commoncrawl.org/
# -*- coding: utf-8 -*-
"""
common-crawl-cdx.py

A simple example program to analyze the Common Crawl index.

This is implemented as a single stream job which accesses S3 via HTTP,
so that it can easily be run from any laptop, but it could easily be
converted to an EMR job which processes the 300 index files in parallel.

If you are only interested in a certain set of TLDs or PLDs, the program
could be enhanced to use cluster.idx to figure out the correct subset of
the 300 shards to process (see the sketch after the listing); e.g. the
.us TLD is entirely contained in cdx-00298 & cdx-00299.

Created on Wed Apr 22 15:05:54 2015

@author: Tom Morris <tfmorris@gmail.com>
"""
from collections import Counter
import json
import requests
import zlib

BASEURL = 'https://aws-publicdatasets.s3.amazonaws.com/'
INDEX1 = 'common-crawl/cc-index/collections/CC-MAIN-2015-11/indexes/'
INDEX2 = 'common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/'
SPLITS = 300
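# Note: the aws-publicdatasets bucket has since been retired; Common Crawl
# now serves its data from https://data.commoncrawl.org/, so BASEURL (and
# possibly the key prefixes above) need updating to run this today.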

def process_index(index):
    total_length = 0
    total_processed = 0
    total_urls = 0
    mime_types = Counter()
    for i in range(SPLITS):
        unconsumed_text = ''
        filename = 'cdx-%05d.gz' % i
        url = BASEURL + index + filename
        response = requests.get(url, stream=True)
        length = int(response.headers['content-length'].strip())
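        # 16 + MAX_WBITS selects gzip (header + trailer) decoding rather
        # than a raw zlib stream.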
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        total = 0
        for chunk in response.iter_content(chunk_size=2048):
            total += len(chunk)
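            # The cdx-*.gz shards are multi-member gzip files; unused_data
            # signals that the current member ended inside this chunk, so a
            # fresh decompressor has to be started on the leftover bytes.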
            if len(decompressor.unused_data) > 0:
                # restart decompressor if end of a chunk
                to_decompress = decompressor.unused_data + chunk
                decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
            else:
                to_decompress = decompressor.unconsumed_tail + chunk
            s = unconsumed_text + decompressor.decompress(to_decompress)
            unconsumed_text = ''
            if len(s) == 0:
                # Not sure why this happens, but doesn't seem to affect things
                print 'Decompressed nothing %2.2f%%' % (total * 100.0 / length), \
                    length, total, len(chunk), filename
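            # Each index line is '<SURT key> <timestamp> <JSON metadata>'; a
            # line that doesn't end in '}' was cut off at the chunk boundary,
            # so save it and prepend it to the next decompressed block.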
            for l in s.split('\n'):
                pieces = l.split(' ')
                if len(pieces) < 3 or l[-1] != '}':
                    unconsumed_text = l
                else:
                    json_string = ' '.join(pieces[2:])
                    try:
                        metadata = json.loads(json_string)
                    except ValueError:
                        print 'JSON load failed: ', total, l
                        raise
                    url = metadata['url']
                    if 'mime' in metadata:
                        mime_types[metadata['mime']] += 1
                    else:
                        mime_types['<none>'] += 1
                        # print 'No mime type for ', url
                    total_urls += 1
                    # print url
        print 'Done with ', filename
        total_length += length
        total_processed += total
        print 'Processed %2.2f %% of %d bytes (compressed). Found %d URLs' % \
            ((total_processed * 100.0 / total_length), total_length, total_urls)

    print "Mime types:"
    for k, v in mime_types.most_common():
        print '%5d %s' % (v, k)

for index in [INDEX2]:
    print 'Processing index: ', index
    process_index(index)
    print 'Done processing index: ', index
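As the docstring suggests, cluster.idx can be used to narrow the scan to the shards that actually contain a given TLD. Below is a minimal sketch of that enhancement; it assumes the usual cluster.idx layout (a SURT-ordered key plus timestamp, followed by tab-separated shard filename, offset, length, and sequence number), and shards_for_tld is a hypothetical helper that is not part of the original program.

def shards_for_tld(index, tld):
    """Return the set of cdx-*.gz shard filenames that can contain the TLD."""
    shards = set()
    prefix = tld + ','  # SURT keys for e.g. example.us start with 'us,'
    previous = None
    response = requests.get(BASEURL + index + 'cluster.idx', stream=True)
    for line in response.iter_lines():
        if not line:
            continue
        key, filename = line.split('\t')[:2]
        if key.startswith(prefix):
            # cluster.idx only samples every n-th index entry, so the TLD's
            # range may begin inside the shard just before the first match.
            if not shards and previous is not None:
                shards.add(previous)
            shards.add(filename)
        previous = filename
    return shards

With this helper, the range(SPLITS) loop in process_index could iterate over sorted(shards_for_tld(index, 'us')) instead of all 300 filenames; per the docstring, that should come back as just cdx-00298.gz and cdx-00299.gz.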