Cannot print out the combined table properly from BeautifulSoup
[Posted]: 2018-11-23 17:36:39 [Question]: Because the table at this URL is a combined (merged) table, it cannot be printed as expected and the output format looks odd. Thanks!
# -*- coding:UTF-8 -*-
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1024, 768))
display.start()
from bs4 import BeautifulSoup
from selenium import webdriver
import re
driver = webdriver.Firefox()
driver.get("url")
soup = BeautifulSoup(driver.page_source.encode('utf-8'),'html.parser')
rows = soup.findAll("td", {"class": re.compile(r'table_eng_small_text_.\d')})
result = ','.join(r.text for r in rows)
print(result)
driver.close()
display.stop()
Expected output:
1 10 SEASONS KING(T032) 12 7-1/4 7 4-3/4 1 1-1/4 1.09.35 25.06 22.31 21.98
2 2 HAPPY SOUND(V107) 1 1/2 1 3/4 2 1-1/4 1.09.56 23.90 22.71 22.95
3 14 NATURAL FRIENDSHIP(S359) 4 2-1/2 4 2-3/4 3 1-1/2 1.09.59 24.30 22.75 22.54
4 13 LUCKY PLACE(T004) 14 10 13 6-1/2 4 3 1.09.84 25.50 22.15 22.19
5 9 NO LAUGHING MATTER(V032) 9 5-3/4 9 5 5 3-1/2 1.09.89 24.82 22.59 22.48
6 1 FREE NOVEMBER(T123) 5 4 5 3-3/4 6 4-1/2 1.10.07 24.54 22.67 22.86
7 7 FRIENDS FOREVER(T079) 2 1/2 2 3/4 7 4-1/2 1.10.08 23.98 22.75 23.35
8 5 REAL SUPREME(L247) 3 1-1/2 3 2-1/4 8 5-3/4 1.10.27 24.14 22.83 23.30
9 6 BE THERE AHEAD(S193) 6 4-1/4 6 3-3/4 9 6-1/4 1.10.34 24.58 22.63 23.13
10 8 GOLD PRECIOUS(P364) 13 8-1/4 11 6-1/2 10 6-3/4 1.10.41 25.22 22.43 22.76
11 11 DUTCH WINDMILL(T288) 10 5-3/4 14 7 11 7 1.10.48 24.82 22.91 22.75
12 3 HAPPY THREE(V162) 11 5-3/4 8 4-3/4 12 7 1.10.49 24.82 22.55 23.12
13 4 SILVER GATSBY(T161) 8 5-1/2 12 6-1/2 13 7-1/2 1.10.56 24.78 22.87 22.91
14 12 CHANS DELIGHT(P420) 7 4-1/2 10 5-3/4 14 9-3/4 1.10.92 24.62 22.91 23.39
[Answer 1]: You can also try this solution, which uses only BeautifulSoup and requests:
from bs4 import BeautifulSoup
from requests import get
from re import compile
URL = ("http://www.hkjc.com/english/racing/display_sectionaltime.asp?"
"RaceDate=03/09/2016&Raceno=1&All=0#Race1")
# get html
html = get(URL).text
soup = BeautifulSoup(html, 'lxml')
# extract table rows
rows = soup.findAll("td", {"class": compile(r'table_eng_small_text_.\d')})
# get items without tabs, newlines etc.
items = [r.text.replace('\t', '').replace('\n', '').replace('\r', '').strip()
         for r in rows]
# remove empty items
items = [item for item in items if item]
# turn table rows into list of lists
table_rows = [items[i:i+16] for i in range(0, len(items), 16)]
# format and print table contents
print('\n'.join(','.join(row[:4] + row[6:7] + row[9:10] + row[12:])
                for row in table_rows))
which outputs:
1,10,SEASONS KING(T032),12 7-1/4,7 4-3/4,1 1-1/4,1.09.35,25.06,22.31,21.98
2,2,HAPPY SOUND(V107),1 1/2,1 3/4,2 1-1/4,1.09.56,23.90,22.71,22.95
3,14,NATURAL FRIENDSHIP(S359),4 2-1/2,4 2-3/4,3 1-1/2,1.09.59,24.30,22.75,22.54
4,13,LUCKY PLACE(T004),14 10,13 6-1/2,4 3,1.09.84,25.50,22.15,22.19
...
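The grouping step in this answer can be tried without touching the network. The following is only a minimal sketch: the `chunk` helper and the three-column sample list are made up for illustration (the real page yields 16 non-empty cells per row, according to the code above).

```python
def chunk(items, size):
    """Split a flat list into consecutive sublists of length `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical flat list of cell texts, three cells per table row
# (an assumption for brevity; the answer above uses 16 per row).
cells = ["1", "10", "SEASONS KING(T032)",
         "2", "2", "HAPPY SOUND(V107)"]

table_rows = chunk(cells, 3)
for row in table_rows:
    print(",".join(row))
```

Note that fixed-size chunking only lines up if every table row contributes the same number of non-empty cells. If a cell that carries data can be empty on the page, filtering out empty items first would shift the columns of all later rows.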
[Answer 2]: Could you try this?
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.get("http://www.hkjc.com/english/racing/display_sectionaltime.asp?RaceDate=03/09/2016&Raceno=1&All=0#Race1")
soup = BeautifulSoup(driver.page_source.encode('utf-8'),'html.parser')
rows = soup.findAll("td", {"class": 'table_eng_small_text_t4'})
# strip tabs and newlines with replace(), because some strings contain them in the middle
n_rows = []
for row in rows:
    row = row.text.strip().replace('\t', '').replace('\n', '')
    # some strings contain non-ASCII characters; drop them
    row = row.encode('ascii', 'ignore').decode('utf-8')
    n_rows.append(row)
# build a list of lists, because so far we only have a flat list of strings
another_rows = []
# take 16 cells at a time and keep only the data we need
while len(n_rows):
    row = n_rows[:16]
    # drop the cells we don't need
    another_rows.append(row[:4] + row[5:6] + row[7:8] + row[9:])
    n_rows = n_rows[16:]
for row in another_rows:
    # remove all empty cells
    row = [x for x in row if x]
    print(', '.join(row))
Output:
1, 10, SEASONS KING(T032), 12 7-1/4, 7 4-3/4, 1 1-1/4, 1.09.35, 25.06, 22.31, 21.98
...
...
14, 12, CHANS DELIGHT(P420), 7 4-1/2, 10 5-3/4, 14 9-3/4, 1.10.92, 24.62, 22.91, 23.39
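The per-cell cleanup in this answer could also be wrapped in a small helper. This is a sketch of a slightly different technique, not the answer's exact code: instead of deleting tabs and newlines, it collapses every run of whitespace into a single space. The name `clean_cell` and the sample strings are assumptions made up for illustration.

```python
def clean_cell(text):
    """Drop non-ASCII characters, then collapse whitespace runs to one space."""
    ascii_only = text.encode("ascii", "ignore").decode("ascii")
    # split() with no arguments splits on any whitespace run and
    # discards leading/trailing whitespace
    return " ".join(ascii_only.split())


# Hypothetical raw cell texts, similar in shape to what td.text may return
raw_cells = ["\t12\r\n\t7-1/4\xa0", "  SEASONS KING(T032)\n", "\r\n\t"]

cleaned = [clean_cell(c) for c in raw_cells]
non_empty = [c for c in cleaned if c]
print(non_empty)  # ['12 7-1/4', 'SEASONS KING(T032)']
```

Collapsing rather than deleting whitespace keeps a separator between values that share one cell, such as a position and its margin, even when they are separated only by a newline in the HTML.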