BeautifulSoup4 - Identifying information by strong tag value only works for some values of the tag


I'm working with the following "chunk" of HTML:

<div class="marketing-directories-results">
    <ul>
        <li>
            <div class="contact-details">
                <h2>
                    A I I Insurance Brokerage of Massachusetts Inc
                </h2>
                <br/>
                <address>
                    183 Davis St
                    <br/>
                    East Douglas
                    <br/>
                    Massachusetts
                    <br/>
                    U S A
                    <br/>
                    MA 01516-113
                </address>
                <p>
                    <a href="http://www.agencyint.com">
                        www.agencyint.com
                    </a>
                </p>
            </div>
            <span data-toggle=".info-cov-0">
                Additional trading information
                <i class="icon plus">
                </i>
            </span>
            <ul class="result-info info-cov-0 cc">
                <li>
                    <strong>
                        Accepts Business From:
                    </strong>
                    <ul class="cc">
                        <li>
                            U.S.A
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Classes of business
                    </strong>
                    <ul class="cc">
                        <li>
                            Engineering
                        </li>
                        <li>
                            NM General Liability (US direct)
                        </li>
                        <li>
                            Property D&amp;F (US binder)
                        </li>
                        <li>
                            Terrorism
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Disclaimer:
                    </strong>
                    <p>
                        Please note that while coverholders may have been approved by Lloyd's to accept business from the regions shown:
                    </p>
                    <p>
                        it is the responsibility of the parties, including the coverholder and any Lloyd's managing agent appointing them to ensure that the coverholder complies with all local regulatory and legal requirements; and
                    </p>
                    <p>
                        the coverholder may not provide cover for all classes they are approved to underwrite in all territories where they have approval.
                    </p>
                </li>
            </ul>
        </li>
        <li>
            <div class="contact-details">
                <h2>
                    ABCO Insurance Underwriters Inc
                </h2>
                <br/>
                <address>
                    ABCO Building, 350 Sevilla Avenue, Suite 201
                    <br/>
                    Coral Gables
                    <br/>
                    Florida
                    <br/>
                    U S A
                    <br/>
                    33134
                </address>
                <p>
                    <a href="http://www.abcoins.com">
                        www.abcoins.com
                    </a>
                </p>
            </div>
            <span data-toggle=".info-cov-1">
                Additional trading information
                <i class="icon plus">
                </i>
            </span>
            <ul class="result-info info-cov-1 cc">
                <li>
                    <strong>
                        Accepts Business From:
                    </strong>
                    <ul class="cc">
                        <li>
                            U.S.A
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Classes of business
                    </strong>
                    <ul class="cc">
                        <li>
                            Property D&amp;F (US binder)
                        </li>
                        <li>
                            Terrorism
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Disclaimer:
                    </strong>
                    <p>
                        Please note that while coverholders may have been approved by Lloyd's to accept business from the regions shown:
                    </p>
                    <p>
                        it is the responsibility of the parties, including the coverholder and any Lloyd's managing agent appointing them to ensure that the coverholder complies with all local regulatory and legal requirements; and
                    </p>
                    <p>
                        the coverholder may not provide cover for all classes they are approved to underwrite in all territories where they have approval.
                    </p>
                </li>
            </ul>
        </li>
    </ul>
</div>

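I'm scraping several data points from this HTML. The ones giving me trouble are the values for "Accepts Business From:" and "Classes of business". I can get the value for "Accepts Business From:" regardless of the order in which it appears: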

try:
    li_area = company.find('ul', class_='result-info info-cov-' + 
                                  str(company_counter) + ' cc')
    li_stuff = li_area.find_all('li')
    for li in li_stuff:
        if li.strong.text.strip() == 'Accepts Business From:':
            business_final = li.find('li').text.strip()
except AttributeError:
    pass

Note: the "company" variable is a BeautifulSoup object containing the HTML I pasted above.

Note: the class name changes for each record on the page. I only included a short excerpt in the HTML example to keep things reasonably concise.

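When I try the same code block, this time replacing 'Accepts Business From:' with 'Classes of business' in the li.strong.text.strip() == ... comparison, the code doesn't seem to detect that strong tag, only 'Accepts Business From:'. Is my for loop wrong, so that it isn't actually iterating over each <li> tag that contains these different strong tags? Or is the real value of that strong tag something other than 'Classes of business'? (I did copy the value directly from the site's HTML.)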

Any help you can provide is greatly appreciated.

Answer

The reason you get the text for 'Accepts Business From:' but not for 'Classes of business' is that your try-except is in the wrong place.

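In the second iteration of the for li in li_stuff: loop, li becomes <li>U.S.A</li>, because find_all('li') also returns the nested <li> tags. That tag has no <strong> child, so li.strong is None and accessing li.strong.text raises an AttributeError. With your current try-except, the error is caught outside the for loop and passed, which ends the loop. It therefore never reaches the third iteration, where it would fetch the text for 'Classes of business'.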
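To make the failure concrete, here is a minimal sketch that reproduces it on a trimmed-down copy of one record. The html string is a shortened stand-in for the question's markup, and has_strong and second are names introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for one record from the question's HTML.
html = """
<ul class="result-info info-cov-0 cc">
  <li><strong>Accepts Business From:</strong>
    <ul class="cc"><li>U.S.A</li></ul>
  </li>
  <li><strong>Classes of business</strong>
    <ul class="cc"><li>Engineering</li><li>Terrorism</li></ul>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
li_area = soup.find("ul", class_="result-info")

# find_all() searches recursively, so the nested <li> tags are returned
# as well, in document order.
li_stuff = li_area.find_all("li")
has_strong = [li.strong is not None for li in li_stuff]
print(has_strong)  # [True, False, True, False, False]

# The second item is the nested <li>U.S.A</li>. It has no <strong>
# descendant, so li.strong is None and li.strong.text raises.
second = li_stuff[1]
print(second.get_text(strip=True))  # U.S.A
try:
    second.strong.text
except AttributeError as exc:
    print("AttributeError:", exc)
```

The list of booleans shows that only the first and third items carry a <strong> label, which is why an exception raised on the second item must not be allowed to end the loop.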

To keep the loop going after the error is caught, use:

for li in li_stuff:
    try:
        if li.strong.text.strip() == 'Accepts Business From:':
            business_final = li.find('li').text.strip()
            print('Accepts Business From:', business_final)
        if li.strong.text.strip() == 'Classes of business':
            business_final = li.find('li').text.strip()
            print('Classes of business:', business_final)
    except AttributeError:
        pass  # or you can use 'continue' too.

Output:

Accepts Business From: U.S.A
Classes of business: Engineering

However, since 'Classes of business' has several values, you can change the code to this to get all of them:

if li.strong.text.strip() == 'Classes of business':
    business_final = ', '.join([x.text.strip() for x in li.find_all('li')])
    print('Classes of business:', business_final)

Output:

Accepts Business From: U.S.A
Classes of business: Engineering, NM General Liability (US direct), Property D&F (US binder), Terrorism
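As an alternative to try-except, the loop can be restricted to the top-level <li> items and the missing <strong> case checked explicitly. The sketch below is one possible way to do that, not the only one. The html string is a shortened stand-in for the page (the disclaimer text is cut down), and extract_fields and results are hypothetical names introduced for this example. It selects on the shared result-info class, so the per-record info-cov-N class name does not have to be rebuilt from a counter.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's HTML (two records).
html = """
<div class="marketing-directories-results"><ul>
  <li>
    <div class="contact-details">
      <h2>A I I Insurance Brokerage of Massachusetts Inc</h2>
    </div>
    <ul class="result-info info-cov-0 cc">
      <li><strong>Accepts Business From:</strong>
        <ul class="cc"><li>U.S.A</li></ul></li>
      <li><strong>Classes of business</strong>
        <ul class="cc"><li>Engineering</li>
          <li>Property D&amp;F (US binder)</li></ul></li>
      <li><strong>Disclaimer:</strong><p>Please note</p></li>
    </ul>
  </li>
  <li>
    <div class="contact-details">
      <h2>ABCO Insurance Underwriters Inc</h2>
    </div>
    <ul class="result-info info-cov-1 cc">
      <li><strong>Accepts Business From:</strong>
        <ul class="cc"><li>U.S.A</li></ul></li>
      <li><strong>Classes of business</strong>
        <ul class="cc"><li>Property D&amp;F (US binder)</li>
          <li>Terrorism</li></ul></li>
    </ul>
  </li>
</ul></div>
"""


def extract_fields(info_list):
    """Hypothetical helper: map each <strong> label to its nested values."""
    fields = {}
    # recursive=False keeps the loop on the top-level <li> items only,
    # so nested items such as <li>U.S.A</li> are never visited here.
    for li in info_list.find_all("li", recursive=False):
        label = li.find("strong")
        if label is None:
            continue  # explicit check instead of catching AttributeError
        key = label.get_text(strip=True).rstrip(":")
        values = [x.get_text(strip=True) for x in li.find_all("li")]
        if values:  # entries without a nested list (the disclaimer) are skipped
            fields[key] = values
    return fields


soup = BeautifulSoup(html, "html.parser")
results = {}
for info_list in soup.select("ul.result-info"):
    record = info_list.find_parent("li")  # the <li> wrapping one company
    name = record.find("h2").get_text(strip=True)
    results[name] = extract_fields(info_list)

for name, fields in results.items():
    print(name, fields)
```

Keeping the values as lists leaves the choice of separator to the caller; ', '.join(fields['Classes of business']) gives the same comma-separated string as the answer above.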
