Extracting plain/text and html body from MBOX file to a list
[Posted]: 2019-10-21 21:45:54
[Question]: I am trying to extract the email bodies from an mbox file (previously converted from PST format).
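I took the basic function from another Stack Overflow question (Extracting the body of an email from mbox file, decoding it to plain text regardless of Charset and Content Transfer Encoding). It works for extracting the 'text/plain' body content, but I would also like to extract the 'html' content.
The function that extracts the body is called from the last part of the code, which I tried to modify so that the text and HTML strings are stored in separate lists.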
import mailbox

def getcharsets(msg):
    charsets = set()
    for c in msg.get_charsets():
        if c is not None:
            charsets.update([c])
    return charsets

def handleerror(errmsg, emailmsg, cs):
    print()
    print(errmsg)
    print("This error occurred while decoding with ",cs," charset.")
    print("These charsets were found in the one email.",getcharsets(emailmsg))
    print("This is the subject:",emailmsg['subject'])
    print("This is the sender:",emailmsg['From'])

def getbodyfromemail(msg):
    body = 'no_text'
    body_html = 'no_html'
    #Walk through the parts of the email to find the text body.
    if msg.is_multipart():
        for part in msg.walk():
            # If part is multipart, walk through the subparts.
            if part.is_multipart():
                for subpart in part.walk():
                    if subpart.get_content_type() == 'text/plain':
                        # Get the subpart payload (i.e the message body)
                        body = subpart.get_payload(decode=True)
                        #charset = subpart.get_charset()
                    elif subpart.get_content_type() == 'html':
                        body_html = subpart.get_payload(decode=True)
                        #body_html = subpart.get_payload(decode=True)
            # Part isn't multipart so get the email body
            elif part.get_content_type() == 'text/plain':
                body = part.get_payload(decode=True)
                #charset = part.get_charset()
    # If this isn't a multi-part message then get the payload (i.e the message body)
    elif msg.get_content_type() == 'text/plain':
        body = msg.get_payload(decode=True)
    # No checking done to match the charset with the correct part.
    for charset in getcharsets(msg):
        try:
            body = body.decode(charset)
        except UnicodeDecodeError:
            handleerror("UnicodeDecodeError: encountered.",msg,charset)
        except AttributeError:
            handleerror("AttributeError: encountered" ,msg,charset)
    return body, body_html

mboxfile = 'Bandeja de entrada'
body = []
body_html = []
for thisemail in mailbox.mbox(mboxfile):
    body = body.append(getbodyfromemail(thisemail)[0])
    body_html = body_html.append(getbodyfromemail(thisemail)[1])
print(body_html)
But now it gives me an error: AttributeError: 'NoneType' object has no attribute 'append'
The output I expect:
body = [string, string, string]
body_html = [html, html, html]
[Answer 1]: Your code works for me, except that you should replace the list appends with the following:
for thisemail in mailbox.mbox(mboxfile):
    body.append(getbodyfromemail(thisemail)[0])
    body_html.append(getbodyfromemail(thisemail)[1])
print(body_html)
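Python's list append works in place, so it returns None. Assigning that result back to body rebinds the name to None, and the next call to body.append fails. You can also replace the list append with, for example: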
body = body + [getbodyfromemail(thisemail)[0]]
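As a further illustration, here is a minimal sketch of the same loop that calls the extractor once per message and appends in place. It is not the asker's function: extract_bodies and the in-memory sample message are hypothetical stand-ins so the example runs without an mbox file, and the 'utf-8' fallback charset is an assumption. Note that get_content_type() returns the full 'type/subtype' string, so the sketch compares against 'text/html'; a comparison against 'html' alone, as in the question's code, would probably never match.

```python
from email.message import EmailMessage


def extract_bodies(msg):
    """Return (plain_text, html) for one message, with placeholders if absent."""
    text, html = 'no_text', 'no_html'
    # walk() visits the message itself and all nested parts, so one loop suffices.
    for part in msg.walk():
        if part.is_multipart():
            continue  # container parts carry no body of their own
        ctype = part.get_content_type()  # full type, e.g. 'text/plain' or 'text/html'
        if ctype not in ('text/plain', 'text/html'):
            continue
        payload = part.get_payload(decode=True)  # bytes after transfer decoding
        # Decode with this part's own charset; 'utf-8' is an assumed fallback.
        charset = part.get_content_charset() or 'utf-8'
        decoded = payload.decode(charset, errors='replace')
        if ctype == 'text/plain':
            text = decoded
        else:
            html = decoded
    return text, html


# Hypothetical in-memory message standing in for one entry of mailbox.mbox(mboxfile).
sample = EmailMessage()
sample['Subject'] = 'demo'
sample.set_content('hello')
sample.add_alternative('<p>hello</p>', subtype='html')

body, body_html = [], []
for msg in [sample]:
    text, html = extract_bodies(msg)  # one call per message, unpacked into two names
    body.append(text)                 # append mutates the list; do not reassign
    body_html.append(html)

# What the original assignment stored: the return value of append, which is None.
result = [].append(1)
```

With a real file, the loop would iterate over mailbox.mbox(mboxfile) instead of the one-element list. A charset name that Python does not recognise would still raise LookupError in decode, so real mailboxes may need extra handling there.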