Cannot print out the combined table properly from BeautifulSoup

【Title】: Cannot print out the combined table properly from BeautifulSoup 【Posted】: 2018-11-23 17:36:39 【Question】:

The table at this URL uses merged cells, so it does not print as expected and the output format looks strange. Thanks!

# -*- coding:UTF-8 -*-
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1024, 768))
display.start()
from bs4 import BeautifulSoup
from selenium import webdriver
import re

driver = webdriver.Firefox()
driver.get("url")

soup = BeautifulSoup(driver.page_source.encode('utf-8'),'html.parser')
rows = soup.findAll("td", {"class": re.compile(r'table_eng_small_text_.\d')})

result = ','.join(r.text for r in rows)
print(result)

driver.close()
display.stop()
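
The class filter in the snippet above relies on a regular expression, so it can help to check which class names it would accept before running the scraper. The following is a minimal sketch using only the standard library. It assumes that BeautifulSoup tests a compiled pattern against each class value with `search`, which is my understanding of its behaviour; only `table_eng_small_text_t4` appears in this thread, and the other class names are hypothetical.

```python
import re

# Same pattern as in the snippet above, written as a raw string so that
# "\d" is not interpreted as a string escape.
pattern = re.compile(r'table_eng_small_text_.\d')

candidates = [
    'table_eng_small_text_t4',   # appears in this thread
    'table_eng_small_text_b2',   # hypothetical: any character followed by a digit
    'table_eng_small_text',      # hypothetical: no suffix
    'table_eng_text_t4',         # hypothetical: different prefix
]

# Record whether the pattern is found anywhere in each class name.
matches = {name: bool(pattern.search(name)) for name in candidates}
for name, ok in matches.items():
    print(f'{name}: {ok}')
```

If the pattern turns out to be broader than needed, passing the exact class name as a plain string is the simpler option.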

Expected output:

1   10  SEASONS KING(T032)  12  7-1/4   7   4-3/4   1   1-1/4   1.09.35 25.06   22.31   21.98 
2   2   HAPPY SOUND(V107)   1   1/2 1   3/4 2   1-1/4   1.09.56 23.90   22.71   22.95 
3   14  NATURAL FRIENDSHIP(S359)    4   2-1/2   4   2-3/4   3   1-1/2   1.09.59 24.30   22.75   22.54 
4   13  LUCKY PLACE(T004)   14  10  13  6-1/2   4   3   1.09.84 25.50   22.15   22.19 
5   9   NO LAUGHING MATTER(V032)    9   5-3/4   9   5   5   3-1/2   1.09.89 24.82   22.59   22.48 
6   1   FREE NOVEMBER(T123) 5   4   5   3-3/4   6   4-1/2   1.10.07 24.54   22.67   22.86 
7   7   FRIENDS FOREVER(T079)   2   1/2 2   3/4 7   4-1/2   1.10.08 23.98   22.75   23.35 
8   5   REAL SUPREME(L247)  3   1-1/2   3   2-1/4   8   5-3/4   1.10.27 24.14   22.83   23.30 
9   6   BE THERE AHEAD(S193)    6   4-1/4   6   3-3/4   9   6-1/4   1.10.34 24.58   22.63   23.13 
10  8   GOLD PRECIOUS(P364) 13  8-1/4   11  6-1/2   10  6-3/4   1.10.41 25.22   22.43   22.76 
11  11  DUTCH WINDMILL(T288)    10  5-3/4   14  7   11  7   1.10.48 24.82   22.91   22.75 
12  3   HAPPY THREE(V162)   11  5-3/4   8   4-3/4   12  7   1.10.49 24.82   22.55   23.12 
13  4   SILVER GATSBY(T161) 8   5-1/2   12  6-1/2   13  7-1/2   1.10.56 24.78   22.87   22.91 
14  12  CHANS DELIGHT(P420) 7   4-1/2   10  5-3/4   14  9-3/4   1.10.92 24.62   22.91   23.39 

【Answer 1】:

You can also try this solution, which uses only BeautifulSoup and requests:

from bs4 import BeautifulSoup
from requests import get
from re import compile

URL = ("http://www.hkjc.com/english/racing/display_sectionaltime.asp?"
       "RaceDate=03/09/2016&Raceno=1&All=0#Race1")

# get html
html = get(URL).text
soup = BeautifulSoup(html, 'lxml')

# extract table rows
rows = soup.findAll("td", {"class": compile(r'table_eng_small_text_.\d')})

# get items without tabs, newlines etc.
items = [r.text.replace('\t', '').replace('\n', '').replace('\r', '').strip()
         for r in rows]

# remove empty items
items = [item for item in items if item]

# turn table rows into list of lists
table_rows = [items[i:i+16] for i in range(0, len(items), 16)]

# format and print table contents
print('\n'.join(','.join(row[:4] + row[6:7] + row[9:10] + row[12:])
                for row in table_rows))

Which outputs:

1,10,SEASONS KING(T032),12    7-1/4,7    4-3/4,1    1-1/4,1.09.35,25.06,22.31,21.98
2,2,HAPPY SOUND(V107),1    1/2,1    3/4,2    1-1/4,1.09.56,23.90,22.71,22.95
3,14,NATURAL FRIENDSHIP(S359),4    2-1/2,4    2-3/4,3    1-1/2,1.09.59,24.30,22.75,22.54
4,13,LUCKY PLACE(T004),14    10,13    6-1/2,4    3,1.09.84,25.50,22.15,22.19
...
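
In the output above, the merged cells still carry both the position and the margin in one field (for example `12    7-1/4`), whereas the expected output lists them as separate columns. A possible follow-up step, sketched here under the assumption that the two values are always separated by at least two whitespace characters while horse names contain only single spaces, is to split each cell on such runs. The helper names `split_cell` and `to_line` are hypothetical, and the sample row is copied from the first output line.

```python
import re

def split_cell(cell):
    """Split a cell on runs of two or more whitespace characters."""
    return [part for part in re.split(r'\s{2,}', cell.strip()) if part]

def to_line(row, sep='\t'):
    """Flatten a row of cells into one separator-joined line."""
    return sep.join(part for cell in row for part in split_cell(cell))

# First row of the output above, as a list of cell strings.
row = ['1', '10', 'SEASONS KING(T032)', '12    7-1/4', '7    4-3/4',
       '1    1-1/4', '1.09.35', '25.06', '22.31', '21.98']

line = to_line(row)
print(line)
```

Splitting on two or more whitespace characters rather than on any whitespace keeps names such as `SEASONS KING(T032)` in a single column.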

【Answer 2】:

Can you try this?

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.get("http://www.hkjc.com/english/racing/display_sectionaltime.asp?RaceDate=03/09/2016&Raceno=1&All=0#Race1")

soup = BeautifulSoup(driver.page_source.encode('utf-8'),'html.parser')
rows = soup.findAll("td", {"class": 'table_eng_small_text_t4'})

# strip each cell, then use replace() to remove tabs and newlines,
# because some strings contain \t and \n in the middle
n_rows = []
for row in rows:
    row = row.text.strip().replace('\t', '').replace('\n', '')
    # some strings contain non-ASCII characters; drop them
    row = row.encode('ascii', 'ignore').decode('utf-8')
    n_rows.append(row)

# build a list of lists, because so far we only have a flat list of strings
another_rows = []

# get only data that we need
while len(n_rows):
    row = n_rows[:16]
    # remove some data that we don't need
    another_rows.append(row[:4] + row[5:6] + row[7:8] + row[9:])
    n_rows = n_rows[16:]

for row in another_rows:
    # remove all empty data
    row = [x for x in row if x]
    print(', '.join(row))

Output:

1, 10, SEASONS KING(T032), 12    7-1/4, 7    4-3/4, 1    1-1/4, 1.09.35, 25.06, 22.31, 21.98
...
...
14, 12, CHANS DELIGHT(P420), 7    4-1/2, 10    5-3/4, 14    9-3/4, 1.10.92, 24.62, 22.91, 23.39
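
The grouping step in the code above, which cuts the flat list of cells into rows of 16 and keeps selected columns, can also be written without consuming the list in a `while` loop. The sketch below is an illustration under that assumption of 16 cells per table row; `chunked` and `select_columns` are hypothetical helper names, and the placeholder cells are labelled by their flat index in place of the scraped text.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each `size` long (the last may be shorter)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def select_columns(row):
    # Same slices as in the answer above: cells 4, 6 and 8 are dropped.
    return row[:4] + row[5:6] + row[7:8] + row[9:]

# Placeholder cells named by flat index, standing in for the scraped strings.
cells = [f'c{i}' for i in range(32)]

# Skip a trailing chunk that does not form a complete row.
table = [select_columns(row) for row in chunked(cells, 16) if len(row) == 16]
for row in table:
    print(', '.join(row))
```

Checking `len(row) == 16` guards against a trailing partial row if the page ever yields a cell count that is not a multiple of 16.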
