Selenium: Web-Scraping Historical Data from Coincodex and transform into a Pandas Dataframe
【Posted】: 2022-01-04 15:10:12
【Question】: I am really struggling to scrape historical data across multiple pages of https://coincodex.com/crypto/bitcoin/historical-data/ using Selenium. I fail at the following steps:
- Getting the data from the subsequent pages (not only September, which is page 1)
- Replacing "$" with "$" for each value
- Converting values with the suffix B (billion) to a full number (1B becomes 1000000000)
The assigned task is: use Selenium and BeautifulSoup to scrape all data from the beginning of the year through the end of September and convert it into a pandas df. My code so far is:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time

URL = "https://coincodex.com/crypto/bitcoin/historical-data/"
# executable_path is deprecated in Selenium 4; newer versions expect a Service object
driver = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
driver.get(URL)
time.sleep(2)  # fixed pause so the table has time to render

# Web page fetched from driver is parsed using Beautiful Soup.
htmlPage = BeautifulSoup(driver.page_source, 'html.parser')
Table = htmlPage.find('table', class_='styled-table full-size-table')
Rows = Table.find_all('tr', class_='ng-star-inserted')
print(len(Rows))

# Empty list is created to store the data
extracted_data = []
# Loop to go through each row of table
for i in range(0, len(Rows)):
    try:
        # Empty dictionary to store data present in each row
        RowDict = {}
        # Extracted all the columns of a row and stored in a variable
        Values = Rows[i].find_all('td')
        # Values (Open, High, Close etc.) are extracted and stored in dictionary
        if len(Values) == 7:
            RowDict["Date"] = Values[0].text.replace(',', '')
            RowDict["Open"] = Values[1].text.replace(',', '')
            RowDict["High"] = Values[2].text.replace(',', '')
            RowDict["Low"] = Values[3].text.replace(',', '')
            RowDict["Close"] = Values[4].text.replace(',', '')
            RowDict["Volume"] = Values[5].text.replace(',', '')
            RowDict["Market Cap"] = Values[6].text.replace(',', '')
            extracted_data.append(RowDict)
    except Exception:
        # Report the row that could not be parsed and continue with the next one
        print("Row Number: " + str(i))

extracted_data = pd.DataFrame(extracted_data)
print(extracted_data)
Sorry, I am new to Python and web scraping, and I hope someone can help me. It would be much appreciated.
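【Answer 1】: To extract the Bitcoin (BTC) historical data from all seven columns on the Coincodex website and print them to a text file, you need to induce WebDriverWait for visibility_of_all_elements_located(). Using list comprehensions, you can then build one list per column, create a DataFrame from those lists, and finally export the values to a text file excluding the index, using the following locator strategies:
Code block: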
driver.get("https://coincodex.com/crypto/bitcoin/historical-data/")
headers = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table th")))]
dates = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(1)")))]
opens = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(2)")))]
highs = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(3)")))]
lows = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(4)")))]
closes = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(5)")))]
volumes = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(6)")))]
marketcaps = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(7)")))]
df = pd.DataFrame(data=list(zip(dates, opens, highs, lows, closes, volumes, marketcaps)), columns=headers)
print(df)
driver.quit()
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
Console output:
Date Open High Low Close Volume Market Cap
0 Oct 30, 2021 $ 62,225 $ 62,225 $ 60,860 $ 61,661 $ 82.73B $ 1.16T
1 Oct 31, 2021 $ 61,856 $ 62,379 $ 60,135 $ 61,340 $ 74.91B $ 1.15T
2 Nov 01, 2021 $ 61,290 $ 62,368 $ 59,675 $ 61,065 $ 76.19B $ 1.16T
3 Nov 02, 2021 $ 60,939 $ 64,071 $ 60,682 $ 63,176 $ 74.05B $ 1.18T
4 Nov 03, 2021 $ 63,167 $ 63,446 $ 61,653 $ 62,941 $ 78.08B $ 1.18T
5 Nov 04, 2021 $ 62,907 $ 63,048 $ 60,740 $ 61,368 $ 91.06B $ 1.17T
6 Nov 05, 2021 $ 61,419 $ 62,480 $ 60,770 $ 61,026 $ 78.06B $ 1.16T
7 Nov 06, 2021 $ 60,959 $ 61,525 $ 60,083 $ 61,416 $ 67.75B $ 1.15T
8 Nov 07, 2021 $ 61,454 $ 63,180 $ 61,333 $ 63,180 $ 51.66B $ 1.17T
9 Nov 08, 2021 $ 63,278 $ 67,670 $ 63,278 $ 67,500 $ 74.25B $ 1.24T
10 Nov 09, 2021 $ 67,511 $ 68,476 $ 66,359 $ 66,913 $ 87.83B $ 1.27T
11 Nov 10, 2021 $ 66,929 $ 68,770 $ 63,348 $ 64,871 $ 82.52B $ 1.26T
12 Nov 11, 2021 $ 64,934 $ 65,580 $ 64,199 $ 64,800 $ 100.84B $ 1.22T
13 Nov 12, 2021 $ 64,774 $ 65,380 $ 62,434 $ 64,315 $ 71.88B $ 1.21T
14 Nov 13, 2021 $ 64,174 $ 64,850 $ 63,413 $ 64,471 $ 65.34B $ 1.21T
15 Nov 14, 2021 $ 64,385 $ 65,255 $ 63,623 $ 65,255 $ 59.25B $ 1.22T
16 Nov 15, 2021 $ 65,500 $ 66,263 $ 63,540 $ 63,716 $ 92.91B $ 1.23T
17 Nov 16, 2021 $ 63,610 $ 63,610 $ 58,904 $ 60,190 $ 103.18B $ 1.15T
18 Nov 17, 2021 $ 60,111 $ 60,734 $ 58,758 $ 60,339 $ 96.57B $ 1.13T
19 Nov 18, 2021 $ 60,348 $ 60,863 $ 56,542 $ 56,749 $ 86.65B $ 1.11T
20 Nov 19, 2021 $ 56,960 $ 58,289 $ 55,653 $ 58,047 $ 98.57B $ 1.08T
21 Nov 20, 2021 $ 58,069 $ 59,815 $ 57,486 $ 59,815 $ 61.67B $ 1.11T
22 Nov 21, 2021 $ 59,670 $ 59,845 $ 58,545 $ 58,681 $ 54.40B $ 1.12T
23 Nov 22, 2021 $ 58,712 $ 59,061 $ 55,689 $ 56,370 $ 64.89B $ 1.08T
24 Nov 23, 2021 $ 56,258 $ 57,832 $ 55,778 $ 57,673 $ 80.27B $ 1.07T
25 Nov 24, 2021 $ 57,531 $ 57,694 $ 55,970 $ 57,103 $ 92.08B $ 1.07T
26 Nov 25, 2021 $ 57,193 $ 59,333 $ 57,011 $ 58,907 $ 85.14B $ 1.10T
27 Nov 26, 2021 $ 58,914 $ 59,120 $ 53,660 $ 53,664 $ 90.87B $ 1.05T
28 Nov 27, 2021 $ 53,559 $ 55,204 $ 53,559 $ 54,487 $ 85.68B $ 1.03T
29 Nov 28, 2021 $ 54,819 $ 57,315 $ 53,630 $ 57,159 $ 72.40B $ 1.03T
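The values in the output above are still strings such as `$ 82.73B`, so two steps from the question (cleaning up the `$` and expanding `B` to a full number) remain open. The following is a minimal sketch of one way to do that conversion, not part of the original answer. The helper name `parse_amount` is hypothetical, and the `K` and `M` suffixes are an assumption, since only `B` and `T` appear in the output.

```python
import re
from decimal import Decimal

# Multipliers for magnitude suffixes. B and T appear in the output above;
# K and M are an assumption for smaller values.
SUFFIXES = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def parse_amount(text: str) -> float:
    """Convert a string such as '$ 82.73B' or '$ 62,225' to a plain number."""
    # Remove the currency sign, thousands separators and every kind of
    # whitespace, including the narrow no-break space (U+202F).
    cleaned = re.sub(r"[\s$,]", "", text)
    multiplier = 1
    if cleaned and cleaned[-1].upper() in SUFFIXES:
        multiplier = SUFFIXES[cleaned[-1].upper()]
        cleaned = cleaned[:-1]
    # Decimal keeps the multiplication exact before converting to float.
    return float(Decimal(cleaned) * multiplier)


if __name__ == "__main__":
    for sample in ["$ 62,225", "$ 82.73B", "$ 1.16T"]:
        print(sample, "->", parse_amount(sample))
```

Assuming the DataFrame from the code block above, the helper could be applied per column, for example `df["Volume"] = df["Volume"].map(parse_amount)`.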
References
You can find a relevant detailed discussion in:
Python Selenium: How do I print the values from a website in a text file?
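【Answer 2】: Coincodex provides a query UI in which you can adjust the time range. After you set the start and end to January 1 and September 30 and click the "Select" button, the site sends a GET request to the backend using the endpoint https://coincodex.com/api/coincodexcoins/get_historical_data_by_slug/bitcoin/2021-1-1/2021-9-30/1?t=5459791. If you send a request to this URL yourself, you get back all the data you need for that interval: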
import requests, json
import pandas as pd
data = json.loads(requests.get('https://coincodex.com/api/coincodexcoins/get_historical_data_by_slug/bitcoin/2021-1-1/2021-9-30/1?t=5459791').text)
df = pd.DataFrame(data['data'])
Output:
time_start time_end price_open_usd ... price_avg_ETH volume_ETH market_cap_ETH
0 2021-01-01 00:00:00 2021-01-02 00:00:00 28938.896888 ... 39.496780 8.728544e+07 7.341417e+08
1 2021-01-02 00:00:00 2021-01-03 00:00:00 29329.695772 ... 40.934106 9.351177e+07 7.608959e+08
2 2021-01-03 00:00:00 2021-01-04 00:00:00 32148.048500 ... 38.970510 1.448755e+08 7.244327e+08
3 2021-01-04 00:00:00 2021-01-05 00:00:00 32949.399464 ... 31.433580 1.292715e+08 5.843597e+08
4 2021-01-05 00:00:00 2021-01-06 00:00:00 32023.293433 ... 30.478852 1.186652e+08 5.666423e+08
.. ... ... ... ... ... ... ...
268 2021-09-26 00:00:00 2021-09-27 00:00:00 42670.363351 ... 14.438247 1.573066e+07 2.718238e+08
269 2021-09-27 00:00:00 2021-09-28 00:00:00 43204.962300 ... 14.157527 1.660821e+07 2.665518e+08
270 2021-09-28 00:00:00 2021-09-29 00:00:00 42111.843283 ... 14.439326 1.782125e+07 2.718712e+08
271 2021-09-29 00:00:00 2021-09-30 00:00:00 41004.598500 ... 14.510256 1.748895e+07 2.732201e+08
272 2021-09-30 00:00:00 2021-10-01 00:00:00 41536.594100 ... 14.454206 1.810257e+07 2.721773e+08
[273 rows x 23 columns]
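To request a different interval, the URL can be assembled from its parts. The sketch below only mirrors the pattern visible in the URL above (coin slug, then unpadded start and end dates). The function name `build_history_url` is hypothetical, and the meaning of the trailing `1` and the `t` query parameter is not documented here, so both are copied verbatim as opaque defaults.

```python
from datetime import date

API_BASE = "https://coincodex.com/api/coincodexcoins/get_historical_data_by_slug"


def build_history_url(slug: str, start: date, end: date,
                      tail: str = "1", t: str = "5459791") -> str:
    """Assemble a historical-data URL following the pattern in the answer.

    `tail` and `t` are copied from the observed URL; their meaning is an
    unknown, so they are passed through unchanged.
    """
    def fmt(d: date) -> str:
        # The observed URL uses unpadded months and days, e.g. 2021-1-1.
        return f"{d.year}-{d.month}-{d.day}"

    return f"{API_BASE}/{slug}/{fmt(start)}/{fmt(end)}/{tail}?t={t}"


if __name__ == "__main__":
    print(build_history_url("bitcoin", date(2021, 1, 1), date(2021, 9, 30)))
```

The resulting string can be passed to `requests.get` exactly as in the snippet above.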
【Comments】:
Thanks for your answer. The problem is that we have to use Selenium and web scraping...