Selenium: Web-Scraping Historical Data from Coincodex and transform into a Pandas Dataframe

Posted: 2022-01-04 15:10:12

【Question】:

I am really struggling while trying to scrape some historical data across multiple pages from https://coincodex.com/crypto/bitcoin/historical-data/ with Selenium. Somehow I fail at the following steps:

- Getting the data from the subsequent pages (not only September, which is page 1)
- Removing the "$" from every value
- Converting values given in "B" (billions) into full numbers (1B becomes 1000000000); see the sketch after this list for the last two points
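For the last two points, a small helper along these lines might work. This is only a sketch: the function name, the suffix map, and the thin-space handling are not part of the original question.

# Minimal sketch: strip "$" and "," and expand "B"/"T" suffixes into full numbers.
SUFFIX_MULTIPLIERS = {"B": 1_000_000_000, "T": 1_000_000_000_000}

def to_number(text):
    cleaned = text.replace("\u202f", " ").replace("$", "").replace(",", "").strip()
    if cleaned and cleaned[-1] in SUFFIX_MULTIPLIERS:
        return float(cleaned[:-1]) * SUFFIX_MULTIPLIERS[cleaned[-1]]
    return float(cleaned)

# e.g. to_number("$ 82.73B") -> 82730000000.0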

The predefined task is: use Selenium and BeautifulSoup to web-scrape all data from the beginning of the year until the end of September and turn it into a pandas df. My code so far is:

from selenium import webdriver
import pandas as pd
import time

URL = "https://coincodex.com/crypto/bitcoin/historical-data/"

driver = webdriver.Chrome(executable_path = "/usr/local/bin/chromedriver")
driver.get(URL)
time.sleep(2)

webpage = driver.page_source

from bs4 import BeautifulSoup
# The page source fetched from the driver is parsed with Beautiful Soup.
htmlPage = BeautifulSoup(driver.page_source, 'html.parser')

Table = htmlPage.find('table', class_='styled-table full-size-table')

Rows = Table.find_all('tr', class_='ng-star-inserted')
len(Rows)

# Empty list is created to store the data
extracted_data = []
# Loop to go through each row of table
for i in range(0, len(Rows)):
 try:
  # Empty dictionary to store data present in each row
  RowDict = {}
  # Extracted all the columns of a row and stored in a variable
  Values = Rows[i].find_all('td')
  
  # Values (Open, High, Close etc.) are extracted and stored in dictionary
  if len(Values) == 7:
   RowDict["Date"] = Values[0].text.replace(',', '')
   RowDict["Open"] = Values[1].text.replace(',', '')
   RowDict["High"] = Values[2].text.replace(',', '')
   RowDict["Low"] = Values[3].text.replace(',', '')
   RowDict["Close"] = Values[4].text.replace(',', '')
   RowDict["Volume"] = Values[5].text.replace(',', '')
   RowDict["Market Cap"] = Values[6].text.replace(',', '')
   extracted_data.append(RowDict)
 except:
  print("Row Number: " + str(i))
 finally:
  # To move to the next row
  i = i + 1

extracted_data = pd.DataFrame(extracted_data)
print(extracted_data)

Sorry, I am new to Python and web scraping and hope somebody can help me out. Any help would be greatly appreciated.

【Comments】:

【Answer 1】:

To extract the Bitcoin (BTC) historical data from all seven columns of the Coincodex website and print them into a text file, you need to induce WebDriverWait for visibility_of_all_elements_located(). Using list comprehensions you can build one list per column, create a DataFrame from them, and finally export the values to a TEXT file without the index, using the following locator strategies:

Code block:

driver.get("https://coincodex.com/crypto/bitcoin/historical-data/")
headers = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table th")))]
dates = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(1)")))]
opens = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(2)")))]
highs = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(3)")))]
lows = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(4)")))]
closes = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(5)")))]
volumes = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(6)")))]
marketcaps = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(7)")))]
df = pd.DataFrame(data=list(zip(dates, opens, highs, lows, closes, volumes, marketcaps)), columns=headers)
print(df)
driver.quit()

Note: you have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
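
The export to a text file mentioned at the top of this answer is not shown in the code block; a minimal sketch of that final step (the file name is my own placeholder) could be:

# Write the DataFrame to a tab-separated text file, excluding the index column.
df.to_csv("btc_historical_data.txt", sep="\t", index=False)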
    

Console output:

            Date      Open      High       Low     Close     Volume Market Cap
0   Oct 30, 2021  $ 62,225  $ 62,225  $ 60,860  $ 61,661   $ 82.73B    $ 1.16T
1   Oct 31, 2021  $ 61,856  $ 62,379  $ 60,135  $ 61,340   $ 74.91B    $ 1.15T
2   Nov 01, 2021  $ 61,290  $ 62,368  $ 59,675  $ 61,065   $ 76.19B    $ 1.16T
3   Nov 02, 2021  $ 60,939  $ 64,071  $ 60,682  $ 63,176   $ 74.05B    $ 1.18T
4   Nov 03, 2021  $ 63,167  $ 63,446  $ 61,653  $ 62,941   $ 78.08B    $ 1.18T
5   Nov 04, 2021  $ 62,907  $ 63,048  $ 60,740  $ 61,368   $ 91.06B    $ 1.17T
6   Nov 05, 2021  $ 61,419  $ 62,480  $ 60,770  $ 61,026   $ 78.06B    $ 1.16T
7   Nov 06, 2021  $ 60,959  $ 61,525  $ 60,083  $ 61,416   $ 67.75B    $ 1.15T
8   Nov 07, 2021  $ 61,454  $ 63,180  $ 61,333  $ 63,180   $ 51.66B    $ 1.17T
9   Nov 08, 2021  $ 63,278  $ 67,670  $ 63,278  $ 67,500   $ 74.25B    $ 1.24T
10  Nov 09, 2021  $ 67,511  $ 68,476  $ 66,359  $ 66,913   $ 87.83B    $ 1.27T
11  Nov 10, 2021  $ 66,929  $ 68,770  $ 63,348  $ 64,871   $ 82.52B    $ 1.26T
12  Nov 11, 2021  $ 64,934  $ 65,580  $ 64,199  $ 64,800  $ 100.84B    $ 1.22T
13  Nov 12, 2021  $ 64,774  $ 65,380  $ 62,434  $ 64,315   $ 71.88B    $ 1.21T
14  Nov 13, 2021  $ 64,174  $ 64,850  $ 63,413  $ 64,471   $ 65.34B    $ 1.21T
15  Nov 14, 2021  $ 64,385  $ 65,255  $ 63,623  $ 65,255   $ 59.25B    $ 1.22T
16  Nov 15, 2021  $ 65,500  $ 66,263  $ 63,540  $ 63,716   $ 92.91B    $ 1.23T
17  Nov 16, 2021  $ 63,610  $ 63,610  $ 58,904  $ 60,190  $ 103.18B    $ 1.15T
18  Nov 17, 2021  $ 60,111  $ 60,734  $ 58,758  $ 60,339   $ 96.57B    $ 1.13T
19  Nov 18, 2021  $ 60,348  $ 60,863  $ 56,542  $ 56,749   $ 86.65B    $ 1.11T
20  Nov 19, 2021  $ 56,960  $ 58,289  $ 55,653  $ 58,047   $ 98.57B    $ 1.08T
21  Nov 20, 2021  $ 58,069  $ 59,815  $ 57,486  $ 59,815   $ 61.67B    $ 1.11T
22  Nov 21, 2021  $ 59,670  $ 59,845  $ 58,545  $ 58,681   $ 54.40B    $ 1.12T
23  Nov 22, 2021  $ 58,712  $ 59,061  $ 55,689  $ 56,370   $ 64.89B    $ 1.08T
24  Nov 23, 2021  $ 56,258  $ 57,832  $ 55,778  $ 57,673   $ 80.27B    $ 1.07T
25  Nov 24, 2021  $ 57,531  $ 57,694  $ 55,970  $ 57,103   $ 92.08B    $ 1.07T
26  Nov 25, 2021  $ 57,193  $ 59,333  $ 57,011  $ 58,907   $ 85.14B    $ 1.10T
27  Nov 26, 2021  $ 58,914  $ 59,120  $ 53,660  $ 53,664   $ 90.87B    $ 1.05T
28  Nov 27, 2021  $ 53,559  $ 55,204  $ 53,559  $ 54,487   $ 85.68B    $ 1.03T
29  Nov 28, 2021  $ 54,819  $ 57,315  $ 53,630  $ 57,159   $ 72.40B    $ 1.03T

References

You can find a relevant detailed discussion in:

Python Selenium: How do I print the values from a website in a text file?

【Comments】:

【Answer 2】:

Coincodex exposes a query UI where you can adjust the time range. After setting the start and end dates to January 1 and September 30 and clicking the "Select" button, the site sends a GET request to the backend endpoint https://coincodex.com/api/coincodexcoins/get_historical_data_by_slug/bitcoin/2021-1-1/2021-9-30/1?t=5459791. If you send a request to this URL yourself, you get back all the data you need for this interval:

import requests, json
import pandas as pd
data = json.loads(requests.get('https://coincodex.com/api/coincodexcoins/get_historical_data_by_slug/bitcoin/2021-1-1/2021-9-30/1?t=5459791').text)
df = pd.DataFrame(data['data'])

Output:

              time_start             time_end  price_open_usd  ...  price_avg_ETH    volume_ETH  market_cap_ETH
0    2021-01-01 00:00:00  2021-01-02 00:00:00    28938.896888  ...      39.496780  8.728544e+07    7.341417e+08
1    2021-01-02 00:00:00  2021-01-03 00:00:00    29329.695772  ...      40.934106  9.351177e+07    7.608959e+08
2    2021-01-03 00:00:00  2021-01-04 00:00:00    32148.048500  ...      38.970510  1.448755e+08    7.244327e+08
3    2021-01-04 00:00:00  2021-01-05 00:00:00    32949.399464  ...      31.433580  1.292715e+08    5.843597e+08
4    2021-01-05 00:00:00  2021-01-06 00:00:00    32023.293433  ...      30.478852  1.186652e+08    5.666423e+08
..                   ...                  ...             ...  ...            ...           ...             ...
268  2021-09-26 00:00:00  2021-09-27 00:00:00    42670.363351  ...      14.438247  1.573066e+07    2.718238e+08
269  2021-09-27 00:00:00  2021-09-28 00:00:00    43204.962300  ...      14.157527  1.660821e+07    2.665518e+08
270  2021-09-28 00:00:00  2021-09-29 00:00:00    42111.843283  ...      14.439326  1.782125e+07    2.718712e+08
271  2021-09-29 00:00:00  2021-09-30 00:00:00    41004.598500  ...      14.510256  1.748895e+07    2.732201e+08
272  2021-09-30 00:00:00  2021-10-01 00:00:00    41536.594100  ...      14.454206  1.810257e+07    2.721773e+08

[273 rows x 23 columns]
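
The response has 23 columns; to get something closer to the seven-column table from the question, you could keep only the time and USD-denominated columns. This is only a sketch: apart from time_start, time_end and price_open_usd, which are visible in the output above, the column naming scheme is an assumption about the API response.

# Keep the time columns plus every column ending in "_usd" (assumed naming scheme).
usd_cols = [c for c in df.columns if c.startswith("time_") or c.endswith("_usd")]
btc = df[usd_cols].copy()
btc["time_start"] = pd.to_datetime(btc["time_start"])
print(btc.head())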

【Comments】:

Thank you for your answer. The problem is that we have to use Selenium and web scraping...
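
If the constraint is only that Selenium has to be involved, the same endpoint from this answer could in principle be loaded through the driver instead of requests. A rough sketch, assuming Chrome renders a raw JSON response inside a <pre> element:

import json

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://coincodex.com/api/coincodexcoins/get_historical_data_by_slug/bitcoin/2021-1-1/2021-9-30/1?t=5459791")
# Chrome usually wraps a plain JSON response in a <pre> tag (an assumption about the browser, not the site).
data = json.loads(driver.find_element(By.TAG_NAME, "pre").text)
driver.quit()
df = pd.DataFrame(data["data"])  # same shape as in the requests-based code above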
