Scraping wsj.com

Posted: 2020-06-21 16:50:36

【Problem description】:

I want to scrape some data from wsj.com and print it. The actual page is https://www.wsj.com/market-data/stocks?mod=md_home_overview_stk_main, and the data I want is NYSE Issues Advancing, Declining and NYSE Share Volume Advancing, Declining.

I tried BeautifulSoup after watching a YouTube video, but I cannot get any class in the page body to return a value.

Here is my code:

from bs4 import BeautifulSoup
import requests


source = requests.get('https://www.wsj.com/market-data/stocks?mod=md_home_overview_stk_main').text

soup = BeautifulSoup(source, 'lxml')

body = soup.find('body')

adv = body.find('td', class_='WSJTables--table__cell--2dzGiO7q WSJTheme--table__cell--1At-VGNg ')


print(adv)

Also, while inspecting the elements in the browser's network tab, I noticed the same data is available as JSON.

Here is the link: https://www.wsj.com/market-data/stocks?id=%7B%22application%22%3A%22WSJ%22%2C%22marketsDiaryType%22%3A%22overview%22%7D&type=mdc_marketsdiary

So I wrote another script to try to parse that data as JSON, but again it does not work.

Here is the code:

import json

import requests

url = 'https://www.wsj.com/market-data/stocks?id=%7B%22application%22%3A%22WSJ%22%2C%22marketsDiaryType%22%3A%22overview%22%7D&type=mdc_marketsdiary'

response = json.loads(requests.get(url).text)

print(response)

The error I get is:

 File "C:\Users\User\Anaconda3\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None

JSONDecodeError: Expecting value
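
This error generally means the response body is not JSON at all (for example, an HTML error page), so the decoder fails on the very first character. A minimal check of what the server actually returns, using the same URL as above:

import requests

url = 'https://www.wsj.com/market-data/stocks?id=%7B%22application%22%3A%22WSJ%22%2C%22marketsDiaryType%22%3A%22overview%22%7D&type=mdc_marketsdiary'

response = requests.get(url)
# if the status is an error or the body starts with '<', json.loads() will raise "Expecting value"
print(response.status_code)
print(response.headers.get('Content-Type'))
print(response.text[:200])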

I also tried several different approaches from this link, but none of them seem to work.

Can you tell me how I can scrape this data?

【Question discussion】:

What is your expected output?

【Answer 1】:
from bs4 import BeautifulSoup
import requests
import json


params = {
    'id': '{"application":"WSJ","marketsDiaryType":"overview"}',
    'type': 'mdc_marketsdiary'
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0"
}

r = requests.get(
    "https://www.wsj.com/market-data/stocks", params=params, headers=headers).json()


data = json.dumps(r, indent=4)

print(data)

Output:


    "id": "\"application\":\"WSJ\",\"marketsDiaryType\":\"overview\"",
    "type": "mdc_marketsdiary",
    "data": 
        "instrumentSets": [
            
                "headerFields": [
                    
                        "value": "name",
                        "label": "Issues"
                    
                ],
                "instruments": [
                    
                        "name": "Advancing",
                        "NASDAQ": "169",
                        "NYSE": "69"
                    ,
                    
                        "name": "Declining",
                        "NASDAQ": "3,190",
                        "NYSE": "2,973"
                    ,
                    
                        "name": "Unchanged",
                        "NASDAQ": "24",
                        "NYSE": "10"
                    ,
                    
                        "name": "Total",
                        "NASDAQ": "3,383",
                        "NYSE": "3,052"
                    
                ]
            ,
            
                "headerFields": [
                    
                        "value": "name",
                        "label": "Issues At"
                    
                ],
                "instruments": [
                    
                        "name": "New Highs",
                        "NASDAQ": "53",
                        "NYSE": "14"
                    ,
                    
                        "name": "New Lows",
                        "NASDAQ": "1,406",
                        "NYSE": "1,620"
                    
                ]
            ,
            
                "headerFields": [
                    
                        "value": "name",
                        "label": "Share Volume"
                    
                ],
                "instruments": [
                    
                        "name": "Total",
                        "NASDAQ": "4,454,691,895",
                        "NYSE": "7,790,947,818"
                    ,
                    
                        "name": "Advancing",
                        "NASDAQ": "506,192,012",
                        "NYSE": "219,412,232"
                    ,
                    
                        "name": "Declining",
                        "NASDAQ": "3,948,035,191",
                        "NYSE": "7,570,377,893"
                    ,
                    
                        "name": "Unchanged",
                        "NASDAQ": "464,692",
                        "NYSE": "1,157,693"
                    
                ]
            
        ],
        "timestamp": "4:00 PM EDT 3/09/20"
    ,
    "hash": "\"id\":\"\\\"application\\\":\\\"WSJ\\\",\\\"marketsDiaryType\\\":\\\"overview\\\"\",\"type\":\"mdc_marketsdiary\",\"data\":\"instrumentSets\":[\"headerFields\":[\"value\":\"name\",\"label\":\"Issues\"],\"instruments\":[\"name\":\"Advancing\",\"NASDAQ\":\"169\",\"NYSE\":\"69\",\"name\":\"Declining\",\"NASDAQ\":\"3,190\",\"NYSE\":\"2,973\",\"name\":\"Unchanged\",\"NASDAQ\":\"24\",\"NYSE\":\"10\",\"name\":\"Total\",\"NASDAQ\":\"3,383\",\"NYSE\":\"3,052\"],\"headerFields\":[\"value\":\"name\",\"label\":\"Issues At\"],\"instruments\":[\"name\":\"New Highs\",\"NASDAQ\":\"53\",\"NYSE\":\"14\",\"name\":\"New Lows\",\"NASDAQ\":\"1,406\",\"NYSE\":\"1,620\"],\"headerFields\":[\"value\":\"name\",\"label\":\"Share Volume\"],\"instruments\":[\"name\":\"Total\",\"NASDAQ\":\"4,454,691,895\",\"NYSE\":\"7,790,947,818\",\"name\":\"Advancing\",\"NASDAQ\":\"506,192,012\",\"NYSE\":\"219,412,232\",\"name\":\"Declining\",\"NASDAQ\":\"3,948,035,191\",\"NYSE\":\"7,570,377,893\",\"name\":\"Unchanged\",\"NASDAQ\":\"464,692\",\"NYSE\":\"1,157,693\"]],\"timestamp\":\"4:00 PM EDT 3/09/20\""

Note: since it is a dict, you can explore it with print(r.keys()).
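
For example, to pull out just the NYSE advancing/declining issues and share volume that the question asks about, the nested dict can be walked directly (a minimal sketch based on the response structure shown above; r is the parsed response from the request):

# r is the dict returned by requests.get(...).json() above
for instrument_set in r['data']['instrumentSets']:
    label = instrument_set['headerFields'][0]['label']  # "Issues", "Issues At" or "Share Volume"
    for row in instrument_set['instruments']:
        if label in ('Issues', 'Share Volume') and row['name'] in ('Advancing', 'Declining'):
            print(label, '-', row['name'], 'NYSE:', row['NYSE'])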

【Discussion】:

This is a brilliant solution, thank you. Could you explain the params and headers you added, where you found them, and in which cases they are needed?

You're welcome :) happy to help. I'm on my phone right now; I'll explain once I'm back at my laptop.

You're welcome. See these very useful tutorials on headers and parameters.

Do you always need the headers? Where did you find them?

@trustory Sometimes you need headers and sometimes you don't. That is an open question you will answer for yourself as you work with the code. You can find the headers in your browser's developer tools: check the Network section and you will see which headers the request uses.

【Answer 2】:

You need to add a header to the request so that the URL does not come back with error=404.

import json

import pandas as pd
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

url = 'https://www.wsj.com/market-data/stocks?id=%7B%22application%22%3A%22WSJ%22%2C%22marketsDiaryType%22%3A%22overview%22%7D&type=mdc_marketsdiary'

# put a browser User-Agent header on the request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:63.0) Gecko/20100101 Firefox/63.0'}
req = Request(url=url, headers=headers)
with urlopen(req) as response:
    page_html = response.read()

# collect every "instruments" list from the JSON payload into one DataFrame
df = pd.DataFrame()
data = json.loads(page_html).get('data')
for instrumentSets in data.get('instrumentSets'):
    for k, v in instrumentSets.items():
        if k == 'instruments':
            df = df.append(pd.DataFrame(v))
df = df.rename(columns={'name': 'Issues'})
df

Result:
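
From there, a small follow-up sketch (assuming df was built as above) to keep only the rows the question asks about; note that Advancing/Declining appear once for the Issues set and once for the Share Volume set:

# keep only the Advancing / Declining rows and show the NYSE column
nyse = df[df['Issues'].isin(['Advancing', 'Declining'])][['Issues', 'NYSE']]
print(nyse)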

【Discussion】:
