How to parse an HTML table with rowspans in Python?

Posted: 2017-01-09 17:48:45

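Problem

I'm trying to parse an HTML table that includes rowspans; specifically, I'm trying to parse my college schedule.

The problem I'm running into is that when a row contains a cell with a rowspan, the next row has one TD fewer, because the spanning cell now occupies the place of the missing TD.

I have no idea how to account for this, and I'd like to be able to parse this schedule.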
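To make the missing cell concrete, here is a minimal sketch that counts the <td> elements in each row of a toy table. The table and the TdCounter class are made up for illustration and are not part of the schedule; only the standard library is used.

```python
from html.parser import HTMLParser

# Hypothetical two-row table: the last cell of row 1 spans both rows.
HTML = """
<table>
  <tr><td>Mon</td><td>Tue</td><td rowspan="2">Wed</td></tr>
  <tr><td>Mon</td><td>Tue</td></tr>
</table>
"""


class TdCounter(HTMLParser):
    """Counts <td> start tags per <tr> (the toy input has no nested tables)."""

    def __init__(self):
        super().__init__()
        self.counts = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.counts.append(0)        # a new row starts
        elif tag == "td" and self.counts:
            self.counts[-1] += 1         # one more cell in the current row


counter = TdCounter()
counter.feed(HTML)
td_counts = counter.counts
print(td_counts)  # [3, 2]
```

The second row has only two <td> elements even though the table is three columns wide, so indexing cells by position alone puts data in the wrong column.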

What I tried

Pretty much everything I could think of.

The result I get

[
    {
        'blok_eind': 4,
        'blok_start': 3,
        'dag': 4, # Should be 5
        'leraar': 'DOODF000',
        'lokaal': 'ALK C212',
        'vak': 'PROJ-T',
    },
]

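As you can see in the output snippet above, there is an entry whose vak key has the value PROJ-T and whose dag is 4, while it should be 5 (a.k.a. Friday/Vrijdag), as the schedule HTML further down shows.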

The result I want

A Python dict() that looks like the one posted above, but with the right values.

Where:

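day/dag is an int from 1 to 5 representing Monday to Friday
block_start/blok_start is an int that represents when the course starts (the time block, left side of the table)
block_end/blok_eind is an int that represents the block in which the course ends
classroom/lokaal is the code of the classroom the course is in
teacher/leraar is the teacher's ID
course/vak is the ID of the course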
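As a sketch of that target shape, the fields can be written down as a small dataclass. The Lesson name is hypothetical; the values are the ones the PROJ-T entry should end up with.

```python
from dataclasses import dataclass, asdict


@dataclass
class Lesson:
    dag: int         # day, 1 (Monday) to 5 (Friday)
    blok_start: int  # time block in which the course starts
    blok_eind: int   # time block in which the course ends
    lokaal: str      # classroom code
    leraar: str      # teacher ID
    vak: str         # course ID


lesson = Lesson(dag=5, blok_start=3, blok_eind=4,
                lokaal="ALK C212", leraar="DOODF000", vak="PROJ-T")
lesson_dict = asdict(lesson)  # plain dict with the same six keys
print(lesson_dict)
```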

Basic HTML structure for the above data

<center>
    <table>
        <tr>
            <td>
                <table>
                    <tbody>
                        <tr>
                            <td>
                                <font>
                                    TEACHER-ID
                                </font>
                            </td>
                            <td>
                                <font>
                                    <b>
                                        CLASSROOM ID
                                    </b>
                                </font>
                            </td>
                        </tr>
                        <tr>
                            <td>
                                <font>
                                    COURSE ID
                                </font>
                            </td>
                        </tr>
                    </tbody>
                </table>
            </td>
        </tr>
    </table>
</center>
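Read in document order, the three <font> elements in such a cell hold the teacher ID, the classroom ID, and the course ID. A minimal standard-library sketch of that mapping follows; the FontText class and the sample cell are illustrative assumptions, with values borrowed from the PROJ-T cell.

```python
from html.parser import HTMLParser

# Hypothetical single schedule cell, following the structure shown above.
CELL = """
<table>
  <tr>
    <td><font>DOODF000</font></td>
    <td><font><b>ALK C212</b></font></td>
  </tr>
  <tr>
    <td><font>PROJ-T</font></td>
  </tr>
</table>
"""


class FontText(HTMLParser):
    """Collects the stripped text of each <font> element, in document order."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self.in_font = False

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            self.in_font = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "font":
            self.in_font = False

    def handle_data(self, data):
        # Text outside <font> (indentation, newlines) is ignored.
        if self.in_font:
            self.texts[-1] += data.strip()


parser = FontText()
parser.feed(CELL)
teacher, classroom, course = parser.texts
print(teacher, classroom, course)
```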

Code

HTML

<CENTER><font size="3" face="Arial" color="#000000">
<BR></font>
  <font size="6" face="Arial" color="#0000FF">
16AO4EIO1B
&nbsp;</font> <font size="4" face="Arial">
IO1B
</font>
  <BR>
  <TABLE border="3" rules="all" cellpadding="1" cellspacing="1">
    <TR>
      <TD align="center">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial" color="#000000">
Maandag 29-08
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
Dinsdag 30-08
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
Woensdag 31-08
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
Donderdag 01-09
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
Vrijdag 02-09
</font> </TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>1</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
8:30
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
9:20
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=4 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD  nowrap=1><font size="2" face="Arial">
BLEEJ002
</font> </TD>
            <TD  nowrap=1><font size="2" face="Arial">
<B>ALK B021</B>
</font> </TD>
          </TR>
          <TR>
            <TD colspan="2"  nowrap=1><font size="2" face="Arial">
WEBD
</font> </TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>2</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
9:20
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
10:10
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=4 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD  nowrap=1><font size="2" face="Arial">
BLEEJ002
</font> </TD>
            <TD  nowrap=1><font size="2" face="Arial">
<B>ALK B021B</B>
</font> </TD>
          </TR>
          <TR>
            <TD colspan="2"  nowrap=1><font size="2" face="Arial">
WEBD
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>3</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
10:25
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
11:15
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=4 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD  nowrap=1><font size="2" face="Arial">
DOODF000
</font> </TD>
            <TD  nowrap=1><font size="2" face="Arial">
<B>ALK C212</B>
</font> </TD>
          </TR>
          <TR>
            <TD colspan="2"  nowrap=1><font size="2" face="Arial">
PROJ-T
</font> </TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>4</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
11:15
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
12:05
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=4 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD  nowrap=1><font size="2" face="Arial">
BLEEJ002
</font> </TD>
            <TD  nowrap=1><font size="2" face="Arial">
<B>ALK B021B</B>
</font> </TD>
          </TR>
          <TR>
            <TD colspan="2"  nowrap=1><font size="2" face="Arial">
MENT
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>5</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
12:05
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
12:55
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>6</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
12:55
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
13:45
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=4 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD  nowrap=1><font size="2" face="Arial">
JONGJ003
</font> </TD>
            <TD  nowrap=1><font size="2" face="Arial">
<B>ALK B008</B>
</font> </TD>
          </TR>
          <TR>
            <TD colspan="2"  nowrap=1><font size="2" face="Arial">
BURG
</font> </TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>7</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
13:45
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
14:35
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=4 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD  nowrap=1><font size="2" face="Arial">
FLUIP000
</font> </TD>
            <TD  nowrap=1><font size="2" face="Arial">
<B>ALK B004</B>
</font> </TD>
          </TR>
          <TR>
            <TD colspan="2"  nowrap=1><font size="2" face="Arial">
ICT algemeen  Prakti
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>8</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
14:50
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
15:40
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=4 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD  nowrap=1><font size="2" face="Arial">
KOOLE000
</font> </TD>
            <TD  nowrap=1><font size="2" face="Arial">
<B>ALK B008</B>
</font> </TD>
          </TR>
          <TR>
            <TD colspan="2"  nowrap=1><font size="2" face="Arial">
NED
</font> </TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>9</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
15:40
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
16:30
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>10</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
16:30
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
17:20
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
  </TABLE>
  <TABLE cellspacing="1" cellpadding="1">
    <TR>
      <TD valign=bottom> <font size="4" face="Arial" color="#0000FF"></TR></TABLE><font size="3" face="Arial">
Periode1   29-08-2016 (35) - 04-09-2016 (35)   G r u b e r  &amp;  P e t t e r s   S o f t w a r e
</font></CENTER>

Python

from pprint import pprint
from bs4 import BeautifulSoup
import requests

r = requests.get("http://rooster.horizoncollege.nl/rstr/ECO/AMR/400-ECO/Roosters/36"
                 "/c/c00025.htm")
daytable = {
    1: "Maandag",
    2: "Dinsdag",
    3: "Woensdag",
    4: "Donderdag",
    5: "Vrijdag"
}
timetable = {
    1: ("8:30", "9:20"),
    2: ("9:20", "10:10"),
    3: ("10:25", "11:15"),
    4: ("11:15", "12:05"),
    5: ("12:05", "12:55"),
    6: ("12:55", "13:45"),
    7: ("13:45", "14:35"),
    8: ("14:50", "15:40"),
    9: ("15:40", "16:30"),
    10: ("16:30", "17:20"),
}

page = BeautifulSoup(r.content, "lxml")

roster = []
big_rows = 2
last_row_big = False
# There are 10 blocks, each made up out of 2 TR's, run through them
for block_count in range(2, 22, 2):
    # There are 5 days, first column is not data we want
    for day in range(2, 7):
        dayroster = {
            "dag": 0,
            "blok_start": 0,
            "blok_eind": 0,
            "lokaal": "",
            "leraar": "",
            "vak": ""
        }
        # This selector provides the classroom
        table_bold = page.select(
            "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                day) + ") > table > tr > td > font > b")

        # This selector provides the teacher's code and the course ID
        table = page.select(
            "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                day) + ") > table > tr > td > font")

        # This gets the rowspan on the current row and column
        rowspan = page.select(
            "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                day) + ")")

        try:
            if table or table_bold and rowspan[0].attrs.get("rowspan") == "4":
                last_row_big = True
                # Setting end of class
                dayroster["blok_eind"] = (block_count // 2) + 1
            else:
                last_row_big = False
                # Setting end of class
                dayroster["blok_eind"] = (block_count // 2)
        except IndexError:
            pass

        if table_bold:
            x = table_bold[0]
            # Classroom ID
            dayroster["lokaal"] = x.contents[0]

        if table:
            iter = 0
            for x in table:
                content = x.contents[0].lstrip("\r\n").rstrip("\r\n")
                # Cell has data
                if content != "":
                    # Set start of class
                    dayroster["blok_start"] = block_count // 2
                    # Set day of class
                    dayroster["dag"] = day - 1
                    if iter == 0:
                        # Teacher ID
                        dayroster["leraar"] = content
                    elif iter == 1:
                        # Course ID
                        dayroster["vak"] = content
                    iter += 1

        if table or table_bold:
            # Store the data
            roster.append(dayroster)

# Remove duplicates
seen = set()
new_l = []
for d in roster:
    t = tuple(d.items())
    if t not in seen:
        seen.add(t)
        new_l.append(d)
pprint(new_l)

Comments:

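Please include 1) your Python code, 2) the minimal HTML necessary to reproduce the problem, 3) your expected output, and 4) what you're getting, in the question itself rather than on external sites.

@Ryan Done, I hope this is better.

"The problem I'm running into is that when a row contains a cell with a rowspan, the next row has one TD fewer." Are you saying that the <td> is actually missing from the HTML, or are you saying that your code thinks it's missing when it actually isn't?

@JohnGordon The <td> is indeed missing from the HTML. The rowspan attribute in the previous row makes that cell span multiple rows, so the next row needs one <td> fewer. Otherwise the previous row would have 5 columns and the next row 6 (5x <td> of its own plus 1x <td> carried over from the previous row because of the rowspan attribute).

requests returns a 404 error page.

Answer 1: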

It may be better to use bs4 built-in functions such as find_all to parse your table.

You can use the following code:

from pprint import pprint
from bs4 import BeautifulSoup
import requests

r = requests.get("http://rooster.horizoncollege.nl/rstr/ECO/AMR/400-ECO/Roosters/36"
                 "/c/c00025.htm")

content=r.content
page = BeautifulSoup(content, "html")
table=page.find('table')
trs=table.find_all("tr", recursive=False)
tr_count=0
trs.pop(0)
final_table={}

for tr in trs:
    tds=tr.find_all("td", recursive=False)
    if tds:
        td_count=0
        tds.pop(0)
        for td in tds:
            if td.has_attr('rowspan'):                              
                final_table[str(tr_count)+"-"+str(td_count)]=td.text.strip()
                if int(td.attrs['rowspan'])==4:
                    final_table[str(tr_count+1)+"-"+str(td_count)]=td.text.strip()
                # dict.has_key() was removed in Python 3; use the "in" operator
                if str(tr_count)+"-"+str(td_count+1) in final_table:
                    td_count=td_count+1         
            td_count=td_count+1
        tr_count=tr_count+1

roster=[]
for i in range(0,10): #iterate over time
    for j in range(0,5): #iterate over day
        item=final_table[str(i)+"-"+str(j)]
        if len(item)!=0:    
            block_eind=i+1          

            try:
                if final_table[str(i+1)+"-"+str(j)]==final_table[str(i)+"-"+str(j)]:
                        block_eind=i+2
            except:
                pass

            try:
                lokaal=item.split('\r\n \n\n')[0]
                leraar=item.split('\r\n \n\n')[1].split('\n \n\r\n')[0]
                vak=item.split('\n \n\r\n')[1]
            except:
                lokaal=leraar=vak="---"

            dayroster = {
                "dag": j+1,
                "blok_start": i+1,
                "blok_eind": block_eind,
                "lokaal": lokaal,
                "leraar": leraar,
                "vak": vak
            }


            dayroster_double = {
                "dag": j+1,
                "blok_start": i,
                "blok_eind": block_eind,
                "lokaal": lokaal,
                "leraar": leraar,
                "vak": vak
            }

            #use to prevent double dict for same event
            if dayroster_double not in roster:
                roster.append(dayroster)

print (roster)

Comments:

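You'd use find_all; findAll is only there to support BeautifulSoup 3 code and is deprecated in favour of the PEP8-compliant names.

You are right, thanks. The change has been made in my code. It's clear your version is definitely better than mine. Regards

Next nit: ==True is never needed in an if statement; leave testing whether the expression produces a true result to if itself. if td.has_attr('rowspan'): works just as well as if td.has_attr('rowspan')==True:, but is more concise and readable.

Martijn, I know, and I agree with you. I changed the code according to this.

Answer 2: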

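You'll have to track the rowspans of previous rows, one per column.

You can do this simply by copying the integer value of a rowspan into a dictionary, with subsequent rows decrementing that value until it drops to 1 (or, for ease of coding, store the integer value minus 1 and let it drop to 0). You can then adjust the column counts of subsequent rows based on the preceding rowspans.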
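The bookkeeping described above can be sketched independently of any HTML library. This is a hedged illustration, not the answer's code: place_cells is a hypothetical helper, the rows are assumed to be already extracted as (text, rowspan) tuples, and colspan is ignored.

```python
def place_cells(rows, n_cols):
    """Assign each cell its true column, given rows of (text, rowspan) tuples.

    Returns a list of (row_index, col_index, text, rowspan) tuples.
    """
    remaining = {}  # column -> number of following rows still covered
    placed = []
    for r, cells in enumerate(rows):
        cells = iter(cells)
        for col in range(n_cols):
            if remaining.get(col, 0):
                # This column is occupied by a cell from an earlier row.
                remaining[col] -= 1
                continue
            cell = next(cells, None)
            if cell is None:
                continue  # short row with nothing left to place
            text, rowspan = cell
            placed.append((r, col, text, rowspan))
            if rowspan > 1:
                # Store rowspan minus 1 so the counter drops to 0.
                remaining[col] = rowspan - 1
    return placed


rows = [
    [("a", 1), ("b", 2), ("c", 1)],  # "b" spans rows 0 and 1
    [("d", 1), ("e", 1)],            # only two cells: column 1 is taken
]
placed = place_cells(rows, n_cols=3)
for entry in placed:
    print(entry)
```

In the second row, "e" is the second cell in the markup but lands in column 2, because column 1 is still covered by "b".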

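Your table complicates this a little by using a default span of size 2, incrementing in steps of 2, but that can easily be brought back to manageable numbers by dividing by 2.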
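In the schedule each time block occupies two <tr> rows, so a one-block cell carries rowspan=2 and a two-block class carries rowspan=4. A small sketch of the conversion, where extra_blocks is a hypothetical helper name:

```python
def extra_blocks(rowspan_attr, default=2):
    """Convert a raw rowspan (counted in <tr> rows) into the number of
    extra time blocks a cell covers beyond its first block."""
    rowspan = int(rowspan_attr) if rowspan_attr is not None else default
    return rowspan // 2 - 1


# Raw attribute values as they would appear in the HTML (strings).
spans = {raw: extra_blocks(raw) for raw in ("2", "4", "6")}
print(spans)               # {'2': 0, '4': 1, '6': 2}
print(extra_blocks(None))  # 0: a missing attribute counts as one block
```

With this, the end block of a class is its start block plus extra_blocks(rowspan).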

Rather than use massive CSS selectors, select just the table rows, which we'll then iterate over:

roster = []
rowspans = {}  # track rowspanning cells
# every second row in the table
rows = page.select('html > body > center > table > tr')[1:21:2]
for block, row in enumerate(rows, 1):
    # take direct child td cells, but skip the first cell:
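    # (find_all with recursive=False avoids a selector with a leading '>',
    # which newer BeautifulSoup versions may reject)
    daycells = row.find_all('td', recursive=False)[1:]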
    rowspan_offset = 0
    for daynum, daycell in enumerate(daycells, 1):
        # rowspan handling; if there is a rowspan here, adjust to find correct position
        daynum += rowspan_offset
        while rowspans.get(daynum, 0):
            rowspan_offset += 1
            rowspans[daynum] -= 1
            daynum += 1
        # now we have a correct day number for this cell, adjusted for
        # rowspanning cells.
        # update the rowspan accounting for this cell
        rowspan = (int(daycell.get('rowspan', 2)) // 2) - 1
        if rowspan:
            rowspans[daynum] = rowspan

        texts = daycell.select("table > tr > td > font")
        if texts:
            # class info found
            teacher, classroom, course = (c.get_text(strip=True) for c in texts)
            roster.append({
                'blok_start': block,
                'blok_eind': block + rowspan,
                'dag': daynum,
                'leraar': teacher,
                'lokaal': classroom,
                'vak': course
            })

    # days that were skipped at the end due to a rowspan
    while daynum < 5:
        daynum += 1
        if rowspans.get(daynum, 0):
            rowspans[daynum] -= 1

This produces the correct output:

[{'blok_eind': 2,
  'blok_start': 1,
  'dag': 5,
  'leraar': u'BLEEJ002',
  'lokaal': u'ALK B021',
  'vak': u'WEBD'},
 {'blok_eind': 3,
  'blok_start': 2,
  'dag': 3,
  'leraar': u'BLEEJ002',
  'lokaal': u'ALK B021B',
  'vak': u'WEBD'},
 {'blok_eind': 4,
  'blok_start': 3,
  'dag': 5,
  'leraar': u'DOODF000',
  'lokaal': u'ALK C212',
  'vak': u'PROJ-T'},
 {'blok_eind': 5,
  'blok_start': 4,
  'dag': 3,
  'leraar': u'BLEEJ002',
  'lokaal': u'ALK B021B',
  'vak': u'MENT'},
 {'blok_eind': 7,
  'blok_start': 6,
  'dag': 5,
  'leraar': u'JONGJ003',
  'lokaal': u'ALK B008',
  'vak': u'BURG'},
 {'blok_eind': 8,
  'blok_start': 7,
  'dag': 3,
  'leraar': u'FLUIP000',
  'lokaal': u'ALK B004',
  'vak': u'ICT algemeen  Prakti'},
 {'blok_eind': 9,
  'blok_start': 8,
  'dag': 5,
  'leraar': u'KOOLE000',
  'lokaal': u'ALK B008',
  'vak': u'NED'}]

Moreover, this code will continue to work even if courses span more than 2 blocks, or just a single block; any rowspan size is supported.
