Python: extracting only the abstracts from raw PubMed data

Posted 2021-05-10


__author__ = 'sean'

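from bs4 import BeautifulSoup
import os
import pickle  # Python 3: the old cPickle module was merged into pickle

path = '/Users/sean/ml/dataset/pubmed-bioinfo-abstracts/paperAbstracts/'
filenames = os.listdir(path)

txt_corpus = []
for thefile in filenames:
    # skip the .DS_Store file that macOS drops into folders
    if thefile == ".DS_Store":
        continue
    print(thefile)
    with open(os.path.join(path, thefile), "rb") as f:
        # name the parser explicitly; "html.parser" ships with Python
        soup = BeautifulSoup(f.read(), "html.parser")
    # reset per file so a page without an abstract cannot reuse the previous one
    abstract = None
    for hit in soup.find_all(attrs={'class': 'abstract_text'}):
        # the abstract text is in the second child node of the matched block
        abstract = hit.contents[1].text
    if abstract is not None:
        txt_corpus.append(abstract)
print('done')
with open('pubmed_abstract.pkl', 'wb') as dicpkl:
    pickle.dump(txt_corpus, dicpkl)
print('pickle saved')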
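
The script stops once the pickle is written, so a natural follow-up is to read the corpus back and check that it looks sane. The snippet below is only a minimal sketch of that step: the helper names `load_corpus` and `summarize` are hypothetical and not part of the original script, and it assumes the file holds a plain list of strings, as produced by `pickle.dump(txt_corpus, dicpkl)`.

```python
import pickle


def load_corpus(pkl_path):
    """Return the object stored in the pickle file (expected: a list of abstracts)."""
    with open(pkl_path, 'rb') as f:
        return pickle.load(f)


def summarize(corpus):
    """Return (number of abstracts, average abstract length in characters)."""
    if not corpus:
        return 0, 0.0
    total_chars = sum(len(abstract) for abstract in corpus)
    return len(corpus), total_chars / len(corpus)
```

With the file from the script in place, `summarize(load_corpus('pubmed_abstract.pkl'))` gives a quick count and average length. Only unpickle files you created yourself, because `pickle.load` can execute arbitrary code from an untrusted file.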
