BeautifulSoup and XML parsing

Posted 2023-02-16

Question (asked 2021-12-16 01:35:46):

I'm struggling with BeautifulSoup. I have a TEI-XML file and want to capture only the content of the <p> and <said> tags.

So, given this input:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
...

<body>
               ...
               <p n="10">I think a quarter of an hour had elapsed , when I rose to depart , and then , to my surprise , I noticed the half-franc still on the table , but the sous piece was gone .</p>
               <p n="11">
                  I beckoned to a waiter , and said :
                  <said who="#the_English">“ One of you came to me a little while ago demanding payment . I think he was somewhat hasty in pressing for it ; however , I set the money down , and the fellow has taken the tip , and has neglected the charge for the coffee . ”</said>
               </p>
...
</TEI>

This is the output I want, with the tags captured in CoNLL format:

I 0
think 0
a 0
quarter 0
of 0
...
...
...
...
and 0
said 0
: 0
“ B-said
One I-said
of I-said
you I-said
came I-said
to I-said
...
...
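
The tagging scheme in the desired output can be sketched independently of any XML parsing. The block below is a minimal illustration, not code from the thread: `bio_tag` and `to_conll` are hypothetical helper names, and the outside tag is written as "0" only because the desired output above uses that character.

```python
def bio_tag(tokens, label=None):
    """Pair each token with a tag.

    Without a label every token gets "0" (outside any span).
    With a label the first token gets B-<label> and the rest I-<label>.
    """
    if label is None:
        return [(tok, "0") for tok in tokens]
    return [
        (tok, f"{'B' if i == 0 else 'I'}-{label}")
        for i, tok in enumerate(tokens)
    ]


def to_conll(pairs):
    """Render (token, tag) pairs as one 'token tag' line each."""
    return "\n".join(f"{tok} {tag}" for tok, tag in pairs)


# Tokens taken from the sample paragraph: narration first, then the quoted speech.
narration = "and said :".split()
speech = "“ One of you".split()
print(to_conll(bio_tag(narration) + bio_tag(speech, "said")))
```

Keeping the tagging separate from the parsing means the same helper can serve both the plain <p> case and the mixed <p>/<said> case.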

I tried this code:

# Import BeautifulSoup
from bs4 import BeautifulSoup as bs

# Read the XML file into a single string
with open("speakers/ABookofGhostsbySSabineBaringGould36638.xml", "r") as file:
    content = file.read()
bs_content = bs(content, "lxml")

all_txt = []
for result in bs_content.find_all("p"):
    said = result.find("said")
    if said is None:
        # No <said> in this paragraph: every token is tagged 0
        conll = [f"{token}\t0" for token in result.get_text().split()]
        all_txt.append(conll)
    else:
        ...

I can handle a <p> that contains no <said> tag (the if branch), but when I get a paragraph like this:

<p n="11">
                  I beckoned to a waiter , and said :
                  <said who="#the_English">“ One of you came to me a little while ago demanding payment . I think he was somewhat hasty in pressing for it ; however , I set the money down , and the fellow has taken the tip , and has neglected the charge for the coffee . ”</said>
               </p>

I can't work out how to use BeautifulSoup to get the desired output (the else branch).

Could you help me write the Python code for this with BeautifulSoup?

Thanks a lot!

Comments:

Could you provide some of the code you have already written? See how to create a minimal reproducible example. Thanks.

Answer 1:

This is how I solved the problem:

# Import BeautifulSoup
from bs4 import BeautifulSoup as bs

# Read the XML file into a single string
with open("speakers/ABookofGhostsbySSabineBaringGould36638.xml", "r") as file:
    content = file.read()
bs_content = bs(content, "lxml")

all_txt = []
all_txt.append("sentenceID,word,tag")
counter = 1  # sentence ID: one per <p>
for result in bs_content.find_all("p"):
    said = result.find("said")
    if said is None:
        # No <said> in this paragraph: every token is tagged 0
        for token in result.get_text().split():
            all_txt.append(f"{counter},{token.replace(',', '')},0")
    else:
        said_list = list(said.strings)
        for each_result in result.strings:
            if each_result not in said_list:
                # Text outside <said>
                for token_result in each_result.split():
                    all_txt.append(f'{counter},{token_result.replace(",", "")},0')
            else:
                # Text inside <said>: first token is B-said, the rest I-said
                for each_said in said_list:
                    first = True
                    for token_said in each_said.split():
                        if first:
                            first = False
                            all_txt.append(f'{counter},{token_said.replace(",", "")},B-said')
                        else:
                            all_txt.append(f'{counter},{token_said.replace(",", "")},I-said')
    counter = counter + 1

# Write one CSV row per line
with open("your_file.csv", "w") as f:
    for item in all_txt:
        f.write("%s\n" % item)

I don't know whether this is the best solution, but it works.
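
A different way to handle the mixed content is to walk the direct children of each <p> instead of comparing strings. This is a rough sketch under several assumptions, not the answerer's method: `tag_paragraph` and `to_csv` are hypothetical names, the sample is a shortened version of the input from the question, and "html.parser" is swapped in for "lxml" only so the sketch runs without an extra dependency. It also assumes every <said> is a direct child of its <p>. Stripping commas turns a "," token into an empty string, so the sketch leaves tokens untouched and lets `csv.writer` quote them.

```python
import csv
import io

from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened sample based on the input shown in the question.
SAMPLE = """<body>
<p n="10">I think a quarter of an hour had elapsed .</p>
<p n="11">I beckoned to a waiter , and said :
<said who="#the_English">“ One of you came to me . ”</said></p>
</body>"""


def tag_paragraph(p):
    """Yield (token, tag) pairs for one <p>, in document order."""
    for child in p.children:
        if isinstance(child, Tag) and child.name == "said":
            # All text inside one <said> forms a single B-/I- span.
            for i, token in enumerate(child.get_text().split()):
                yield token, "B-said" if i == 0 else "I-said"
        else:
            # Plain text, or some other element, outside <said>.
            text = child.get_text() if isinstance(child, Tag) else str(child)
            for token in text.split():
                yield token, "0"


def to_csv(soup):
    """Return CSV text with one row per token and one sentence ID per <p>."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["sentenceID", "word", "tag"])
    for sentence_id, p in enumerate(soup.find_all("p"), start=1):
        for token, tag in tag_paragraph(p):
            writer.writerow([sentence_id, token, tag])
    return out.getvalue()


soup = BeautifulSoup(SAMPLE, "html.parser")
csv_text = to_csv(soup)
print(csv_text)
```

Because each <said> child is handled on its own, a paragraph with several <said> elements gets one B-said per element, and text between them is tagged 0.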
