Python: scraping an HTML table from a web page. The page's HTML clearly shows Chinese, but after parsing I get the output below (code included). Can anyone explain?

Posted


#! /usr/bin/env python
# coding=utf-8

import urllib2
from bs4 import BeautifulSoup

#-----------------------------------
def main():
    userMainUrl = "http://wsbs.bjepb.gov.cn/air2008/Air1.aspx"
    req = urllib2.Request(userMainUrl)
    resp = urllib2.urlopen(req)
    respHtml = resp.read()
    #print respHtml

    songtasteHtmlEncoding = "utf-8"
    soup = BeautifulSoup(respHtml, from_encoding=songtasteHtmlEncoding)
    #soup = BeautifulSoup(respHtml, from_encoding="GB2312")
    #foundClassH1user = soup.find_all(attrs={"class": "yubao"})
    foundClassH1user = soup.find_all("td")
    print foundClassH1user
    #print soup
    '''
    if foundClassH1user:
        h1userStr = foundClassH1user.string
        print h1userStr
        print soup
    '''

if __name__ == "__main__":
    main()
The output is:
> "C:\Python27\python.exe" -u "C:\Program Files (x86)\UliPad\1212.py"
<td>\xe4\xb8\x9c\xe5\x9f\x8e\xe5\xa4\xa9\xe5\x9d\x9b</td><td>184</td><td>\xe8\x87\xad\xe6\xb0\xa7</td><td>4</td><td>\xe4\xb8\xad\xe5\xba\xa6\xe6\xb1\xa1\xe6\x9f\x93</td>
Part of the output is omitted here. On the web page itself I can see the Chinese text (the original screenshot is not preserved).
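For what it's worth, the escape sequences in the output above are not corruption at all: they are just Python showing the repr of UTF-8-encoded bytes, and the Chinese text is intact inside them. A minimal stdlib-only sketch in Python 3 syntax, reusing one of the byte strings from the output above:

```python
# The "garbled" output is the repr of UTF-8 bytes, not corrupted data.
raw = b"<td>\xe4\xb8\x9c\xe5\x9f\x8e\xe5\xa4\xa9\xe5\x9d\x9b</td><td>184</td>"

print(raw)                   # repr form: b'<td>\xe4\xb8\x9c...'
decoded = raw.decode("utf-8")
print(decoded)               # <td>东城天坛</td><td>184</td>
```

In Python 2, `print` on a *list* of tags shows each element's repr (hence the escapes), while printing a single decoded string renders the characters.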

Answer A: just convert the bytes back into a string:
>>> aa= b"<td>\xe4\xb8\x9c\xe5\x9f\x8e\xe5\xa4\xa9\xe5\x9d\x9b</td><td>184</td><td>\xe8\x87\xad\xe6\xb0\xa7</td><td>4</td><td>\xe4\xb8\xad\xe5\xba\xa6\xe6\xb1\xa1\xe6\x9f\x93</td>"
>>> bb = aa.decode("utf-8").encode("gb2312")
>>> bb.decode("gb2312")
'<td>东城天坛</td><td>184</td><td>臭氧</td><td>4</td><td>中度污染</td>'
>>>

Follow-up:

>>> bb = aa.decode("utf-8").encode("gb2312")

>>> bb.decode("gb2312")
u'\u4e1c\u57ce\u5929\u575b184\u81ed\u6c274\u4e2d\u5ea6\u6c61\u67d3'
I'm getting this here instead, so it still doesn't look right...

Reply:

How exactly did you write your aa?

Follow-up:

aa= b"\xe4\xb8\x9c\xe5\x9f\x8e\xe5\xa4\xa9\xe5\x9d\x9b184\xe8\x87\xad\xe6\xb0\xa74\xe4\xb8\xad\xe5\xba\xa6\xe6\xb1\xa1\xe6\x9f\x93"
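Two things are going on in the follow-up above. First, the asker's `aa` literal omits the `<td>` tags, which is why they are absent from the result. Second, `u'\u4e1c\u57ce...'` is just the interactive prompt echoing the unicode repr; `print` would render the Chinese characters, and the `\uXXXX` escapes already denote exactly the right text. A quick check in Python 3 syntax:

```python
# The \uXXXX escapes shown by the REPL are the same characters seen on the page.
s = "\u4e1c\u57ce\u5929\u575b"
print(s)  # 东城天坛
```

So the decoding in the follow-up actually worked; only the REPL's echo made it look wrong.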

Answer B:
Hello, please change your encoding declaration: change coding=utf-8 to coding=cp936.

Follow-up:

You mean change #coding=utf-8 to cp936?

Reply:

Hello: yes, this is an encoding problem!

Follow-up:

It still hasn't improved.

Reply:

That shouldn't happen. Take a look at my run results (screenshot not preserved):

This answer was accepted by the asker.
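For context on Answer B: the `# coding=` line is a PEP 263 source-encoding declaration. It tells the interpreter how to decode the *script file's own bytes* (its string literals), not how to decode HTML downloaded at runtime, so changing it to cp936 cannot by itself change how the scraped bytes are displayed. A stdlib sketch of what the cookie actually does (the file name is illustrative):

```python
import os
import runpy
import tempfile

# Write a script whose bytes are GBK-encoded; the coding cookie tells the
# interpreter how to decode the source file's string literals.
source = "# -*- coding: gbk -*-\nCITY = '\u4e1c\u57ce'\n"
path = os.path.join(tempfile.mkdtemp(), "demo_coding.py")
with open(path, "wb") as f:
    f.write(source.encode("gbk"))

namespace = runpy.run_path(path)
print(namespace["CITY"])  # 东城
```

With a wrong cookie the literals would decode incorrectly (or fail to parse), which is the only effect the declaration has.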

Web scraping dynamic content with Python (dynamic HTML/Javascript table)

Posted: 2021-10-04 12:31:32
Question:

I want to scrape data from a dynamic HTML table where some of the data only loads after clicking a button (via Javascript). The data I'm interested in is on this webpage; so far I've only managed to scrape the data that loads by default.

On the webpage linked above, I'm trying to extract the data contained in the table called "Fundamental" (picture showing what I am trying to scrape).

This is the code I've written so far:

import pandas as pd
import requests as rq
from bs4 import BeautifulSoup

headers = {"user-agent": "chrome"}
url = "https://www.investing.com/indices/stoxx-600-components"
htmlcontent = rq.get(url, headers=headers).text
soup = BeautifulSoup(htmlcontent, "lxml")

table_price = soup.find("table", {"id": "cr1"})

indexcomponents = []

rows = table_price.find_all("tr")

for row in rows[1:]:
    columns = row.find_all("td")
    indexcomponents.append([
        columns[1].text,
        columns[2].text,
        columns[6].text,
        columns[7].text,
        columns[8].text])

for n in range(len(indexcomponents)):
    print(indexcomponents[n])

I'm well aware that similar questions have been asked before, but I'm a beginner in Python and completely ignorant of Javascript, so I haven't managed to implement the suggested solutions. Thanks in advance for your help!
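As a baseline, the default-loaded rows really can be extracted from the static HTML the way the code above does; the "Fundamental" tab's extra columns are simply not present in that initial HTML, which is why a plain GET cannot see them. A stdlib-only sketch of the static-parsing half, using a small inline HTML sample instead of the live page to avoid a network dependency (the cell values are made up for illustration):

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and self._row is not None:
            self._row.append(data.strip())

sample = ("<table id='cr1'>"
          "<tr><td>Adidas</td><td>175.30</td></tr>"
          "<tr><td>Airbus</td><td>118.36</td></tr>"
          "</table>")
parser = CellCollector()
parser.feed(sample)
print(parser.rows)  # [['Adidas', '175.30'], ['Airbus', '118.36']]
```

Anything that only appears after clicking the tab needs a browser-driven approach (as in the answer below the question) or a direct call to whatever endpoint the Javascript fetches.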

Discussion:

Solution 1:

Selenium together with Scrapy gives you the working solution you need:

import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from shutil import which
from time import sleep


class TableSpider(scrapy.Spider):
    name = 'table'

    allowed_domains = ['www.investing.com']
    start_urls = [
        'https://www.investing.com/indices/stoxx-600-components'
    ]

    def __init__(self):
        chrome_options = Options()
        #chrome_options.add_argument("--headless")

        chrome_path = which("chromedriver")

        self.driver = webdriver.Chrome(executable_path=chrome_path)#, options=chrome_options)
        self.driver.set_window_size(1920, 1080)
        self.driver.get("https://www.investing.com/indices/stoxx-600-components")
        sleep(5)
        # Click the "Fundamental" tab so its columns get rendered.
        fundamental_tab = self.driver.find_element_by_id("filter_fundamental")
        fundamental_tab.click()
        sleep(5)

        self.html = self.driver.page_source
        self.driver.close()

    def parse(self, response):
        resp = Selector(text=self.html)
        for tr in resp.xpath('(//tbody)[2]/tr'):
            yield {
                'Average Vol': tr.xpath(".//td[3]/text()").get(),
                'Market Cap': tr.xpath(".//td[4]/text()").get(),
            }

Output (part of the full output):

    {'Average Vol': '1.31M', 'Market Cap': '7.19B'}
    2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
    {'Average Vol': '950.47K', 'Market Cap': '18.44B'}
    2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
    {'Average Vol': '921.90K', 'Market Cap': '5.82B'}
    2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
    {'Average Vol': '375.59K', 'Market Cap': '5.39B'}
    ... (many similar scraped-item lines omitted) ...
    {'Average Vol': '371.43K', 'Market Cap': '54.56B'}
    2021-07-29 11:23:02 [scrapy.core.engine] INFO: Closing spider (finished)
    2021-07-29 11:23:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 326,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 134730,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'elapsed_time_seconds': 5.182406,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2021, 7, 29, 5, 23, 2, 200527),
     'httpcompression/response_bytes': 911212,
     'httpcompression/response_count': 1,
     'item_scraped_count': 589}
Comments:

Thank you very much for the proposed solution! However, I'm very new to Scrapy. Could you also post how you got this output? I tried using CrawlerProcess but it didn't work. Thanks in advance!

It works much better as a Scrapy project. I used a Scrapy project; the site requires a user agent, which I added in settings.py in my project folder. Don't forget to add the user agent. Thanks.

You can get help from the Scrapy docs: docs.scrapy.org/en/latest/topics/…
