BeautifulSoup: perform find() on an iterator

Posted: 2019-07-20 12:56:43

Question:

I have a page that I need to parse with BeautifulSoup. Here is my code:

from bs4 import BeautifulSoup

source_html = 'WFM1.html'
data = []

with open(source_html) as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

for tr in soup.find('tbody'):
    name_div = tr.find('div', class_= 'person-name-text inline-block ng-binding ng-hide')
    name = name_div.text.strip()
    shift_span = tr.find('span', class_= 'inline-block ng-binding ng-scope')
    shift = shift_span.text.strip()
    data.append((name, shift))

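When I run it, it raises "TypeError: find() takes no keyword arguments". Is it possible to call find() on an iterator? How can I extract only specific content from each item of the iterator?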

To make this clearer, here is what one item from the iterator looks like:

<tr class="ng-scope" role="button" style="" tabindex="0">
    <td class="person-name-column">
        <div aria-hidden="false" class="wfm-checkbox">
            <label><input aria-invalid="false" class="ng-pristine ng-untouched ng-valid ng-empty" type="checkbox"> <span class="wfm-checkbox-toggle"></span> <span class="wfm-checkbox-label person-name-text inline-block ng-binding">FirstName LastName <!-- ngIf: vm.toggles.ViewScheduleOnTimezoneEnabled && vm.selectedTimezone && personSchedule.Timezone.IanaId !== vm.selectedTimezone --></span></label> <!-- ngIf: vm.toggles.ViewScheduleOnTimezoneEnabled && vm.selectedTimezone && personSchedule.Timezone.IanaId !== vm.selectedTimezone -->
        </div>
        <div aria-hidden="true" class="person-name-text inline-block ng-binding ng-hide">
            FirstName LastName
        </div><!-- ngIf: vm.showWarnings -->
    </td><!-- ngIf: ::vm.toggles.ViewShiftCategoryEnabled -->
    <td class="shift-category-cell ng-scope" role="button" style="cursor: pointer;" tabindex="0"><!-- ngIf: ::personSchedule.ShiftCategory.Name --> <span class="inline-block ng-binding ng-scope" id="name" style="background: rgb(255, 99, 71); color: black;">EX</span> <!-- end ngIf: ::personSchedule.ShiftCategory.Name -->
    <!-- ngIf: ::personSchedule.ShiftCategory.Name --><!-- end ngIf: ::personSchedule.ShiftCategory.Name --></td><!-- end ngIf: ::vm.toggles.ViewShiftCategoryEnabled -->
    <td class="schedule schedule-column">
        <div class="relative time-line-for">
            <!-- ngRepeat: dayOff in ::personSchedule.DayOffs -->
            <!-- ngRepeat: shift in ::personSchedule.Shifts -->
            <div class="shift ng-scope">
                <!-- ngRepeat: projection in ::shift.Projections -->
                <div aria-label="Phone 04:00 - 08:00" class="layer absolute floatleft selectable projection-layer ng-scope noneSelected" role="button" style="left: 3.7037%; width: 14.8148%; background-color: rgb(255, 255, 0);" tabindex="0"></div><!-- end ngRepeat: projection in ::shift.Projections -->
                <div aria-label="Lunch 08:00 - 08:30" class="layer absolute floatleft selectable projection-layer ng-scope noneSelected" role="button" style="left: 18.5185%; width: 1.85185%; background-color: rgb(0, 255, 0);" tabindex="0"></div><!-- end ngRepeat: projection in ::shift.Projections -->
                <div aria-label="Coffee 08:30 - 08:45" class="layer absolute floatleft selectable projection-layer ng-scope noneSelected" role="button" style="left: 20.3704%; width: 0.925926%; background-color: rgb(224, 224, 224);" tabindex="0"></div><!-- end ngRepeat: projection in ::shift.Projections -->
                <div aria-label="Phone 08:45 - 10:30" class="layer absolute floatleft selectable projection-layer ng-scope noneSelected" role="button" style="left: 21.2963%; width: 6.48148%; background-color: rgb(255, 255, 0);" tabindex="0"></div><!-- end ngRepeat: projection in ::shift.Projections -->
                <div aria-label="FL 10:30 - 12:30" class="layer absolute floatleft selectable projection-layer ng-scope noneSelected" role="button" style="left: 27.7778%; width: 7.40741%; background-color: rgb(255, 140, 0);" tabindex="0"></div><!-- end ngRepeat: projection in ::shift.Projections -->
            </div><!-- end ngRepeat: shift in ::personSchedule.Shifts -->
            <!-- ngIf: vm.hasHiddenScheduleAtStart(personSchedule) -->
            <!-- ngIf: vm.hasHiddenScheduleAtEnd(personSchedule) -->
        </div>
    </td><!-- ngIf: ::!vm.toggles.EditAndViewInternalNoteEnabled -->
    <!-- ngIf: ::vm.toggles.EditAndViewInternalNoteEnabled -->
    <td class="schedule-note-column ng-scope" role="button" tabindex="0"><span class="noComment"><i class="mdi mdi-comment"></i></span> <!-- ngIf: vm.getScheduleNoteForPerson(personSchedule.PersonId) && vm.getScheduleNoteForPerson(personSchedule.PersonId).length > 0 --></td><!-- end ngIf: ::vm.toggles.EditAndViewInternalNoteEnabled -->
    <!-- ngIf: ::vm.toggles.ShowContractTimeEnabled -->
    <td class="contract-time contract-time-column ng-binding ng-scope">8:00</td><!-- end ngIf: ::vm.toggles.ShowContractTimeEnabled -->
</tr>


Answer 1:

Your soup object is of type <class 'bs4.BeautifulSoup'>, so you don't need a for loop to iterate over it; you can call find() on it directly.

name_div = soup.find('div', class_= 'person-name-text inline-block ng-binding ng-hide')
name = name_div.text.strip()
shift_span = soup.find('span', class_= 'inline-block ng-binding ng-scope')
shift = shift_span.text.strip()
data.append((name, shift))
print(data)

Output:

[('FirstName LastName', 'EX')]
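As background on the original error, here is a minimal sketch of why the loop in the question fails. Iterating over a Tag yields its direct children, and those include the whitespace between tags as NavigableString objects. NavigableString is a str subclass, so calling find() on one resolves to str.find(), which accepts no keyword arguments such as class_. The markup below is a hypothetical, trimmed-down stand-in for WFM1.html, and html.parser is used only so that the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, trimmed-down markup standing in for WFM1.html.
html = (
    "<table><tbody>\n"
    '<tr><td><div class="person-name-text">FirstName LastName</div></td></tr>\n'
    "</tbody></table>"
)

# html.parser is used here only to avoid depending on lxml.
soup = BeautifulSoup(html, "html.parser")

# Iterating over a Tag yields all direct children, text nodes included.
children = list(soup.find("tbody"))
kinds = [type(child).__name__ for child in children]
print(kinds)

# Keep only real tags before calling Tag.find() with keyword arguments.
rows = [child for child in children if isinstance(child, Tag)]
names = [row.find("div", class_="person-name-text").text.strip() for row in rows]
print(names)
```

The printed list of type names shows the text nodes sitting on either side of the tr element; filtering with isinstance(child, Tag), or asking for the rows explicitly with find_all('tr'), avoids them.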

Update:

If more than one element has the class person-name-text inline-block ng-binding ng-hide, for example if your HTML file contains the following:

<div aria-hidden="true" class="person-name-text inline-block ng-binding ng-hide">
        FirstName LastName
    </div>
    <div aria-hidden="true" class="person-name-text inline-block ng-binding ng-hide">
        FirstName LastName
    </div>
    <div aria-hidden="true" class="person-name-text inline-block ng-binding ng-hide">
        FirstName LastName
    </div>
    <div aria-hidden="true" class="person-name-text inline-block ng-binding ng-hide">
        FirstName LastName
    </div>

You can use find_all() to get all of them:

# find_all() returns a list of every matching tag
name_divs = soup.find_all('div', class_='person-name-text inline-block ng-binding ng-hide')
for name_div in name_divs:  # avoid shadowing the built-in all()
    print(name_div.text.strip())

Output:

FirstName LastName
FirstName LastName
FirstName LastName
FirstName LastName

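Comments:

find() only returns one div and one span, but I need all of them (about 80 each time). find_all() on its own doesn't cut it either, because I need to call div.text.strip() on each div, and so on. Basically, I need to extract the name and shift from every tr in the tbody and append them to a list as a tuple.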
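One way to get what this comment describes is to ask for the rows explicitly with find_all('tr') and then call find() on each row. The sketch below assumes every row has the same structure as the sample tr in the question; the two-row markup and the names in it are hypothetical stand-ins for WFM1.html, and html.parser is used only so the example does not require lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row sample modeled on the <tr> shown in the question.
sample_html = """
<table><tbody>
<tr class="ng-scope">
  <td class="person-name-column">
    <div class="person-name-text inline-block ng-binding ng-hide"> Alice Example </div>
  </td>
  <td class="shift-category-cell ng-scope">
    <span class="inline-block ng-binding ng-scope">EX</span>
  </td>
</tr>
<tr class="ng-scope">
  <td class="person-name-column">
    <div class="person-name-text inline-block ng-binding ng-hide"> Bob Example </div>
  </td>
  <td class="shift-category-cell ng-scope">
    <span class="inline-block ng-binding ng-scope">AM</span>
  </td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
data = []

# find_all('tr') returns only Tag objects, so find() with class_ works on each.
for tr in soup.find("tbody").find_all("tr"):
    name_div = tr.find("div", class_="person-name-text inline-block ng-binding ng-hide")
    shift_span = tr.find("span", class_="inline-block ng-binding ng-scope")
    if name_div is None or shift_span is None:
        continue  # skip rows that lack either element
    data.append((name_div.text.strip(), shift_span.text.strip()))

print(data)
```

Passing the full multi-class string to class_ matches the class attribute exactly as written, so it stops matching if the page emits the classes in a different order; matching on a single class such as person-name-text is a less brittle alternative if that turns out to be a problem.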
