Processing the Sogou news dataset with Python (1.4 million records)
Posted by esc_ai
1. File processing
gzip -d SogouCA.tar.gz
tar -xvf SogouCA.tar
cat *.txt > SogouCA.txt
cat SogouCA.txt | iconv -f gbk -t utf-8 -c > SougouCA_UTF8.txt
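Where iconv is not available, the same conversion step can be sketched in Python. This is a minimal sketch, not the author's method: the function name gbk_to_utf8 and the chunk size are hypothetical, and errors='ignore' is assumed to be an acceptable stand-in for iconv's -c flag (both discard bytes that cannot be converted).

```python
import io


def gbk_to_utf8(src_path, dst_path, chunk_chars=1 << 20):
    """Re-encode a GBK text file as UTF-8 (hypothetical helper name)."""
    # errors='ignore' silently drops undecodable bytes, similar in spirit to `iconv -c`.
    # newline='' keeps the original line endings untouched.
    with io.open(src_path, 'r', encoding='gbk', errors='ignore', newline='') as src, \
         io.open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        while True:
            # Read in chunks so the merged file never has to fit in memory
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)
```

Calling gbk_to_utf8('SogouCA.txt', 'SougouCA_UTF8.txt') would produce the file that the loading step reads.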
2. Data cleaning and loading into the database
Create the table:
CREATE TABLE `news` (
`id` int(10) NOT NULL AUTO_INCREMENT,
`docno` varchar(100) NOT NULL,
`url` varchar(255) DEFAULT NULL,
`contenttitle` varchar(255) DEFAULT NULL,
`content` text,
PRIMARY KEY (`id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1498017 DEFAULT CHARSET=utf8;
Load the data:
#!/usr/bin/python
# -*- coding: utf8 -*-
import io
import re
import MySQLdb  # the module name is case-sensitive; the mysqlclient package provides it

if __name__ == '__main__':
    count = 0
    # One pattern per field, in the order the lines appear inside each <doc> block
    p1 = re.compile(r'(?<=<url>)(.*?)(?=</url>)')
    p2 = re.compile(r'(?<=<docno>)(.*?)(?=</docno>)')
    p3 = re.compile(r'(?<=<contenttitle>)(.*?)(?=</contenttitle>)')
    p4 = re.compile(r'(?<=<content>)(.*?)(?=</content>)')
    parr = [p1, p2, p3, p4]
    # Connect to MySQL
    db = MySQLdb.connect("127.0.0.1", "root", "Node2019!", "sg_news",
                         charset='utf8')
    # Get a cursor
    cursor = db.cursor()
    # Parameterized INSERT: the driver quotes and escapes the values,
    # so titles or content containing quote characters no longer break the statement
    sql = """
        INSERT INTO news(url, docno, contenttitle, content)
        VALUES (%s, %s, %s, %s)
    """
    news = []
    with io.open('SougouCA_UTF8.txt', 'r', encoding='utf-8') as f:
        # Iterate over the file lazily instead of calling f.readlines(),
        # so the whole file is never held in memory
        for line in f:
            line = line.strip()
            if '<doc>' in line:
                continue
            if count < 4:
                found = parr[count].findall(line)
                # Use a single space as a placeholder for an empty or missing field
                news.append(found[0] if found and found[0] else ' ')
            if '</doc>' in line:
                count = 0
                try:
                    cursor.execute(sql, news)
                    # Commit the insert
                    db.commit()
                except Exception:
                    # Roll back if the insert fails for any reason
                    db.rollback()
                news = []
                continue
            count += 1
    db.close()
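The parsing half of the loader can be tried without a database. The sketch below is an alternative formulation rather than the author's code: it matches each field by tag name instead of by line position, and the names FIELDS, PATTERNS, and parse_docs as well as the sample record are made up for illustration. It assumes that each tag pair sits on a single line, as the loader above also does.

```python
import re

# Field tags in the column order used by the INSERT statement (assumed from the loader)
FIELDS = ('url', 'docno', 'contenttitle', 'content')
PATTERNS = {name: re.compile(r'<{0}>(.*?)</{0}>'.format(name)) for name in FIELDS}


def parse_docs(lines):
    """Yield one (url, docno, contenttitle, content) tuple per <doc> block."""
    record = {}
    for line in lines:
        line = line.strip()
        if line == '<doc>':
            record = {}
        elif line == '</doc>':
            # Missing or empty fields fall back to a single space, as in the loader
            yield tuple(record.get(name) or ' ' for name in FIELDS)
        else:
            for name, pattern in PATTERNS.items():
                match = pattern.search(line)
                if match:
                    record[name] = match.group(1)
                    break


# Made-up sample record, not taken from the dataset
sample = [
    '<doc>',
    '<url>http://example.com/a.html</url>',
    '<docno>0001</docno>',
    '<contenttitle>Title A</contenttitle>',
    '<content></content>',
    '</doc>',
]
rows = list(parse_docs(sample))
print(rows)
```

Because parse_docs is a generator, it can be fed an open file object directly. One possible design choice for 1.4 million records is to collect the tuples into batches of a few thousand and pass each batch to cursor.executemany together with an INSERT that uses %s placeholders, committing once per batch instead of once per row.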