CSS Selector not parsing anything in Python Webscrape
【Title】: CSS Selector not parsing anything in Python Webscrape 【Posted】: 2016-12-30 19:32:47 【Question】: I am trying to scrape this website using Python and CSS selectors: http://canoeracing.org.uk/marathon/results/burton2016.htm. However, the CSS selector I am using finds nothing to parse in the DOM tree. I have already managed to scrape the page with the web-scraping tool Kimono, which also uses CSS selectors, so I know the selectors are correct. The code is below. The selector I am using targets the second column of every table on the page: body > table > tbody > tr > td:nth-child(2). I took the CSS scraping code from http://www.ilab.rutgers.edu/~vverna/scrape-the-web-using-css-selectors-in-python.html.
import lxml.html
from lxml.cssselect import CSSSelector
# get some html
import requests
r = requests.get('http://canoeracing.org.uk/marathon/results/burton2016.htm')
# build the DOM Tree
tree = lxml.html.fromstring(r.text)
# construct a CSS Selector
sel = CSSSelector('body > table > tbody > tr > td:nth-child(2)')
# Apply the selector to the DOM tree.
results = sel(tree)
print results
# print the HTML for the first result.
match = results[0]
print lxml.html.tostring(match)
# get the href attribute of the first result
print match.get('href')
# print the text of the first result.
print match.text
# get the text out of all the results
data = [result.text for result in results]
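【Answer 1】: There is no tbody in the page source; the browser adds it when it builds the DOM, which is why a selector copied from the browser matches nothing in lxml. The selector you want is body > table > tr > td:nth-child(2).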
With that selector change, the code from the question finds the cells:
In [1]: import lxml.html
In [2]: import requests
In [3]: r = requests.get('http://canoeracing.org.uk/marathon/results/burton2016.htm')
In [4]: tree = lxml.html.fromstring(r.text)
In [5]: results = tree.cssselect('body > table > tr > td:nth-child(2)')
In [6]: print results
[<Element td at 0x7f1cb1334100>, <Element td at 0x7f1cb1334260>, <Element td at 0x7f1cb13342b8>, <Element td at 0x7f1cb1334470>, <Element td at 0x7f1cb1334368>, <Element td at 0x7f1cb13344c8>, <Element td at 0x7f1cb1334578>, <Element td at 0x7f1cb1334628>, <Element td at 0x7f1cb1334aa0>, <Element td at 0x7f1cb1334788>, <Element td at 0x7f1cb13347e0>, <Element td at 0x7f1cb1334940>, <Element td at 0x7f1cb1334a48>, <Element td at 0x7f1cb1334af8>, <Element td at 0x7f1cb1328310>, <Element td at 0x7f1cb1328788>, <Element td at 0x7f1cb1328158>, <Element td at 0x7f1cb1328260>, <Element td at 0x7f1cb1328470>, <Element td at 0x7f1cb1328578>, <Element td at 0x7f1cb1328628>, <Element td at 0x7f1cb1328aa0>, <Element td at 0x7f1cb13288e8>, <Element td at 0x7f1cb1328940>, <Element td at 0x7f1cb1328a48>, <Element td at 0x7f1cb1328e10>, <Element td at 0x7f1cb1328c58>, <Element td at 0x7f1cb1328c00>, <Element td at 0x7f1cb1328db8>, <Element td at 0x7f1cb1328ec0>, <Element td at 0x7f1cb1328f70>, <Element td at 0x7f1cb1328af8>, <Element td at 0x7f1cb13282b8>, <Element td at 0x7f1cb1328cb0>, <Element td at 0x7f1cb132e100>, <Element td at 0x7f1cb132e0a8>, <Element td at 0x7f1cb132e368>, <Element td at 0x7f1cb132e680>, <Element td at 0x7f1cb1343730>, <Element td at 0x7f1cb1343680>, <Element td at 0x7f1cb1343628>, <Element td at 0x7f1cb13435d0>, <Element td at 0x7f1cb1343578>, <Element td at 0x7f1cb13434c8>, <Element td at 0x7f1cb1343470>, <Element td at 0x7f1cb13436d8>, <Element td at 0x7f1cb1343368>, <Element td at 0x7f1cb13432b8>, <Element td at 0x7f1cb1343158>, <Element td at 0x7f1cb13430a8>, <Element td at 0x7f1cb13433c0>, <Element td at 0x7f1cb1343788>, <Element td at 0x7f1cb13437e0>, <Element td at 0x7f1cb1343838>, <Element td at 0x7f1cb1343890>, <Element td at 0x7f1cb13438e8>, <Element td at 0x7f1cb1343940>, <Element td at 0x7f1cb1343998>, <Element td at 0x7f1cb13439f0>, <Element td at 0x7f1cb1343a48>, <Element td at 0x7f1cb1343aa0>, <Element td at 0x7f1cb1343af8>, <Element td at 
0x7f1cb1343b50>, <Element td at 0x7f1cb1343ba8>, <Element td at 0x7f1cb1343c00>, <Element td at 0x7f1cb1343c58>, <Element td at 0x7f1cb1343cb0>, <Element td at 0x7f1cb1343d08>, <Element td at 0x7f1cb1343d60>, <Element td at 0x7f1cb1343db8>, <Element td at 0x7f1cb1343e10>, <Element td at 0x7f1cb1343e68>, <Element td at 0x7f1cb1343ec0>, <Element td at 0x7f1cb1343f18>, <Element td at 0x7f1cb1343f70>, <Element td at 0x7f1cb1343fc8>, <Element td at 0x7f1cb134b050>, <Element td at 0x7f1cb134b0a8>, <Element td at 0x7f1cb134b100>, <Element td at 0x7f1cb134b158>, <Element td at 0x7f1cb134b1b0>, <Element td at 0x7f1cb134b208>, <Element td at 0x7f1cb134b260>, <Element td at 0x7f1cb134b2b8>, <Element td at 0x7f1cb134b310>, <Element td at 0x7f1cb134b368>, <Element td at 0x7f1cb134b3c0>, <Element td at 0x7f1cb134b418>, <Element td at 0x7f1cb134b470>, <Element td at 0x7f1cb134b4c8>, <Element td at 0x7f1cb134b520>, <Element td at 0x7f1cb134b578>, <Element td at 0x7f1cb134b5d0>, <Element td at 0x7f1cb134b628>, <Element td at 0x7f1cb134b680>, <Element td at 0x7f1cb134b6d8>, <Element td at 0x7f1cb134b730>, <Element td at 0x7f1cb134b788>, <Element td at 0x7f1cb134b7e0>, <Element td at 0x7f1cb134b838>, <Element td at 0x7f1cb134b890>, <Element td at 0x7f1cb134b8e8>, <Element td at 0x7f1cb134b940>, <Element td at 0x7f1cb134b998>, <Element td at 0x7f1cb134b9f0>, <Element td at 0x7f1cb134ba48>, <Element td at 0x7f1cb134baa0>, <Element td at 0x7f1cb134baf8>, <Element td at 0x7f1cb134bb50>, <Element td at 0x7f1cb134bba8>, <Element td at 0x7f1cb134bc00>, <Element td at 0x7f1cb134bc58>, <Element td at 0x7f1cb134bcb0>, <Element td at 0x7f1cb134bd08>, <Element td at 0x7f1cb134bd60>, <Element td at 0x7f1cb134bdb8>, <Element td at 0x7f1cb134be10>, <Element td at 0x7f1cb134be68>, <Element td at 0x7f1cb134bec0>, <Element td at 0x7f1cb134bf18>, <Element td at 0x7f1cb134bf70>, <Element td at 0x7f1cb134bfc8>, <Element td at 0x7f1cb134c050>, <Element td at 0x7f1cb134c0a8>, <Element td at 
0x7f1cb134c100>, <Element td at 0x7f1cb134c158>, <Element td at 0x7f1cb134c1b0>, <Element td at 0x7f1cb134c208>, <Element td at 0x7f1cb134c260>, <Element td at 0x7f1cb134c2b8>, <Element td at 0x7f1cb134c310>, <Element td at 0x7f1cb134c368>, <Element td at 0x7f1cb134c3c0>, <Element td at 0x7f1cb134c418>, <Element td at 0x7f1cb134c470>, <Element td at 0x7f1cb134c4c8>, <Element td at 0x7f1cb134c520>, <Element td at 0x7f1cb134c578>, <Element td at 0x7f1cb134c5d0>, <Element td at 0x7f1cb134c628>, <Element td at 0x7f1cb134c680>, <Element td at 0x7f1cb134c6d8>, <Element td at 0x7f1cb134c730>, <Element td at 0x7f1cb134c788>, <Element td at 0x7f1cb134c7e0>, <Element td at 0x7f1cb134c838>, <Element td at 0x7f1cb134c890>, <Element td at 0x7f1cb134c8e8>, <Element td at 0x7f1cb134c940>, <Element td at 0x7f1cb134c998>, <Element td at 0x7f1cb134c9f0>, <Element td at 0x7f1cb134ca48>, <Element td at 0x7f1cb134caa0>, <Element td at 0x7f1cb134caf8>, <Element td at 0x7f1cb134cb50>, <Element td at 0x7f1cb134cba8>, <Element td at 0x7f1cb134cc00>, <Element td at 0x7f1cb134cc58>, <Element td at 0x7f1cb134ccb0>, <Element td at 0x7f1cb134cd08>, <Element td at 0x7f1cb134cd60>, <Element td at 0x7f1cb134cdb8>, <Element td at 0x7f1cb134ce10>, <Element td at 0x7f1cb134ce68>, <Element td at 0x7f1cb134cec0>, <Element td at 0x7f1cb134cf18>, <Element td at 0x7f1cb134cf70>, <Element td at 0x7f1cb134cfc8>, <Element td at 0x7f1cb134d050>, <Element td at 0x7f1cb134d0a8>, <Element td at 0x7f1cb134d100>]
In [7]: match = results[0]
In [8]: print lxml.html.tostring(match)
<td>CONNOR PETERS</td>
In [9]: print match.get('href')
None
In [10]: print match.text
CONNOR PETERS
In [11]: data = [result.text for result in results]
In [12]: print(data)
['CONNOR PETERS', 'NICKY CRESSER', 'MARK WILKES', 'MATT PARKES', 'ALEX ABRAHAM', 'JOE FITZPATRICK', 'RICHARD ROGERS', 'DANNY BEAZLEY', 'JAMES SMYTHE', 'JAMIE CHRISTIE', 'JAMES HINVES', 'DAVID BELBIN', 'TOM DIAPER', 'PETER DEBOER', 'MARTIN RINVOLUCRI', 'LEE HOWSON', 'DAMON GRIMSEY', 'MATTHEW OLIVER', 'JOSHUA BEST', 'CHRIS CARTER', 'DUNCAN OUGHTON', 'HOWARD BLACKMAN', 'PATRICK MONGAN', 'JAMES DORAN', 'MICHAEL FITZSIMONS', 'SHUNA NEAVE', 'GUY PETERS', 'WILLIAM DOUGHTY', 'MICK NADAL', 'BILL LAWRENSON', 'MARK WEVILL', 'JOHN ASTBURY', 'JACOB HUBNER', 'SEB SHAW', 'TONY BATES', 'PETER MIETUS', 'CHRISTOPHER SKELLERN', 'GEORGE RANDALL', 'NEVILLE COLLEY', 'COLIN CHUDLEY', 'DAVE RICKETTS', 'LEWIS SMITH', 'ALASKA SIMPSON', 'DAVID CUDDINGTON', 'BEN BEDDARD', 'DAVID GLOVER', 'DEBORAH QUITTENTON', 'NEIL ORME', 'KASIA CHMIEL', 'RICHARD HUMPHREYS', 'MARCIN KRUCZYNSKI', 'IMRE KUCSKA', 'JOSHUA SMITH', 'DAVE HADLEY', 'LAURENCE FOWKES', 'AMELIA DINGLEY', 'MICHELLE BUTLER', 'LYNDA OUGHTON', 'LUCY GUEST', 'GARETH FERGUSSON', 'TOMASZ CHLIPALA', 'TONY SPENCER', 'KATIE ***ES', 'HAYDYN COOKE-BAYLEY', 'DAVID WALTERS', 'STEPHEN KITSON', 'BEN ASTON', "ANGUS O'CONNOR", 'KEVIN LACK', 'MOLLY LEVER', 'MAX BEDDARD', 'CALLUM ADAIR', 'EMMA WILKINSON', 'DAVE CIANCHI', 'STEPHEN HALL', 'NAT KEMP', 'ANDREW LEGGATT', 'JACK ROUNSLEY', 'KATE MCMANUS', 'RICHARD MONGAN', 'LYNETTE SHAHMORADIAN', 'ALAN WILLIAMS', 'SIMON LEWIS', 'OLIVER 1 COOK', 'SARAH MILLEST', 'ALEXANDRA FARMER', 'RAY SIMMONS', 'CATHERINE CATON', 'KARL ZAREMBA', 'PHIL ROBERTS', 'CLAIRE COOPER', 'EMMA SMITHSON', 'HELEN RANDALL', 'SAM MARSH', 'LIAM NELSON', 'KATH NADAL', 'ADAM PRICE', 'AMANDA MYLETT', 'SAM DARLING', 'JULIA MIETUS', 'LINDSEY LACK', 'STEVE SAUNDERS', 'PHILL BURGESS', 'PENNY GLOVER', 'PETER KILLEY', 'EDWARD SHAW', 'JESS PROCTOR', 'JULIANNE WALTERS', 'JESSICA STEWART', 'KERRY CHRISTIE', 'ANDY COOK', 'LIAM HALL', 'KEITH NEWBOLD', 'JANET HICKMAN', 'ELLIOT COOPS', 'TEIFION ROGERS', 'JUSTIN ROE', 'ABBIE FISHER', 'EMMA CHRISTIE', 'ZARA 
MONTGOMERY', 'TESNI MILES', 'LEWIS ANDREWS', 'CONOR SIMMONS', 'IGGY ROGERS', 'MATTHEW COOK', 'ARCHIE LEVER', 'CHARLIE MAYNE', 'MCKENZIE MILES', 'LIBBY MAYNE', 'ROSS ORME', 'BRUCE BLACKMAN', 'STEPHEN BALL', 'SIMON RICKETTS', 'ALISON CHMIEL', 'PATRICK ALLINSON', 'PASCAL BAUER', 'MICHEAL WALTERS', 'JONATHAN CAVE', 'ANDREW NEVITT', 'MICK MORAN', 'STANI CHMIEL', 'MICHAEL FUDGER', 'LEE CHAMP', 'ROB KIRBY', 'KAY SPENCER', 'JANE MILLAR', 'THOMAS GILL', 'LOUISE CLIVE', 'BECKY FARMER', 'DAVID TARBUCK', 'OSCAR HUISSOON', 'ELLIE LAWLEY', 'ALLISON MILES', 'NICOLA RUDGE', 'EMMA CHRISTIE', 'LEWIS ANDREWS', '01:27:25.46', '01:34:13.50', '01:07:30.70', '01:12:06.66', '01:16:39.34', '00:33:38.65', '00:35:38.33', '00:37:39.45', '00:39:39.12', '01:02:58.03', '01:07:30.70', '01:12:06.66', '00:32:38.65', '00:35:38.33', '00:37:39.45']
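There is also no href attribute on the first match, or on any td as far as I can tell, so I'm not sure what that call was supposed to return.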