Scraping and writing the table into dataframe shows me TypeError


【Title】: Scraping and writing the table into dataframe shows me TypeError 【Posted】: 2022-01-06 12:11:56 【Question】:

I am trying to scrape a table and write it into a dataframe, but it shows me a TypeError. How can I fix this error?

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select
from selenium import webdriver
import pandas as pd
temp=[]
driver = webdriver.Chrome(r'C:\Program Files (x86)\chromedriver.exe')
driver.get("https://www.fami-qs.org/certified-companies-6-0.html")
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe[title='Inline Frame Example']")))
headers=WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@id='sites']//thead"))).text
rows=WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@id='sites']//tbody"))).text
temp.append(rows)
df = pd.DataFrame(temp,columns=headers)
print(df)

In headers I am passing FAMI-QS Number ... Expiry date, and in rows I will be passing FAM-0694 ... 2022-09-04.
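The root cause can be reproduced without Selenium: `.text` on the `<thead>`/`<tbody>` elements returns one single string each, and `pd.DataFrame` rejects a plain string for `columns` (it must be list-like). A minimal sketch of the failure and the fix, using hypothetical stand-in strings in place of the scraped text:

```python
import pandas as pd

# Stand-ins for what element.text returns: one string each, not lists
headers = "FAMI-QS Number Expiry date"
rows = "FAM-0694 2022-09-04"

try:
    # columns must be list-like; a bare string raises TypeError
    pd.DataFrame([rows], columns=headers)
except TypeError as e:
    print("TypeError:", e)

# Splitting the text into per-cell lists fixes the constructor call
cols = ["FAMI-QS Number", "Expiry date"]
data = [["FAM-0694", "2022-09-04"]]
df = pd.DataFrame(data, columns=cols)
print(df.shape)  # (1, 2)
```

The answers below avoid this splitting entirely, either by letting pandas parse the HTML or by collecting each column's cells into a real list.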


【Solution 1】:

You can get all of the table data using just pandas, from the HTML response of the underlying API call, like below:

Code:

import requests
import pandas as pd

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'}

url = "https://famiqs.viasyst.net/certified-sites"

req = requests.get(url,headers=headers)

table = pd.read_html(req.text)

df = table[0]  # .to_csv('info.csv', index=False) to save it instead

print(df)

Output:

    FAMI-QS Number  ... Expiry date
0          FAM-0694  ...  2022-09-04
1          FAM-1491  ...  2022-10-17
2     FAM-ISFSF-003  ...  2022-10-27
3          FAM-1533  ...  2022-10-31
4          FAM-1090  ...  2022-11-13
...             ...  ...         ...
1472    FAM-1761-01  ...  2024-10-27
1473       FAM-1796  ...  2024-09-29
1474    FAM-1427-01  ...  2023-12-01
1475       FAM-1861  ...  2024-11-22
1476    FAM-0005-07  ...  2024-11-25

[1477 rows x 7 columns]


【Solution 2】:

To scrape the FAMI QS Number and Site Name columns, you need to create a list of the desired texts using a list comprehension, inducing WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following Locator Strategies:

Code block:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

s = Service(r'C:\Program Files (x86)\chromedriver.exe')  # path from the question
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=s, options=options)
driver.get("https://www.fami-qs.org/certified-companies-6-0.html")
FAMI_QS_Numbers = []
Site_Names = []
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe[title='Inline Frame Example']")))
FAMI_QS_Numbers.extend([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@id='sites']//tbody//tr/descendant::td[1]")))])
Site_Names.extend([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@id='sites']//tbody//tr//td/p")))])
df = pd.DataFrame(data=list(zip(FAMI_QS_Numbers, Site_Names)), columns=['FAMI QS Number', 'Site Name'])
print(df)
driver.quit()

Console output:

  FAMI QS Number                             Site Name
0       FAM-1293                    AmTech Ingredients
1       FAM-0841                    3F FEED & FOOD S L
2       FAM-1361                5N Plus Additives GmbH
3    FAM-1301-01                   A & V Corp. Limited
4       FAM-1146  A. + E. Fischer-Chemie GmbH & Co. KG
5       FAM-1589          A.M FOOD CHEMICAL CO LIMITED
6    FAM-0613-01                          A.W.P. S.r.l
7       FAM-0867             AB AGRI POLSKA Sp. z o.o.
8    FAM-1510-02                              AB Vista
9    FAM-1510-01                            AB Vista *
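One caveat with the zip() used above (general Python behavior, not specific to this page): zip stops at the shortest input, so if one locator matches fewer elements than the other, the extra rows are silently dropped from the DataFrame. A small illustration with made-up values:

```python
from itertools import zip_longest

nums = ["FAM-1293", "FAM-0841", "FAM-1361"]
names = ["AmTech Ingredients", "3F FEED & FOOD S L"]  # one element short

# zip() truncates to the shorter list: FAM-1361 disappears
print(list(zip(nums, names)))

# zip_longest() keeps every row and pads the missing cell with None,
# which makes a locator mismatch visible instead of silent
print(list(zip_longest(nums, names)))
```

If the two locators can match different counts, zip_longest (or an explicit length check) is the safer choice.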


【Solution 3】:

To scrape all the data in all the columns, you need to induce WebDriverWait for visibility_of_element_located() of the &lt;table&gt; element, extract its outerHTML, and read the outerHTML using read_html(). You can use the following Locator Strategy:

Code block:

# reuses the driver session and imports set up in the previous answer
driver.get("https://www.fami-qs.org/certified-companies-6-0.html")
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe[title='Inline Frame Example']")))
data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#sites"))).get_attribute("outerHTML")
df = pd.read_html(data)  # note: read_html returns a *list* of DataFrames
print(df)
driver.quit()

Console output:

[  FAMI-QS Number                             Site Name              City  ... Status Certified from Expiry date
0       FAM-1293                    AmTech Ingredients        albert lea  ...  Valid     2020-10-08  2023-10-07
1       FAM-0841                    3F FEED & FOOD S L       vizcolozano  ...  Valid     2020-04-17  2023-04-16
2       FAM-1361                5N Plus Additives GmbH  eisenhüttenstadt  ...  Valid     2020-10-01  2023-09-30
3    FAM-1301-01                   A & V Corp. Limited            xiamen  ...  Valid     2020-09-09  2023-09-08
4       FAM-1146  A. + E. Fischer-Chemie GmbH & Co. KG         wiesbaden  ...  Valid     2020-06-05  2023-06-04
5       FAM-1589          A.M FOOD CHEMICAL CO LIMITED             jinan  ...  Valid     2020-01-07  2023-01-06
6    FAM-0613-01                          A.W.P. S.r.l        crevalcore  ...  Valid     2020-02-27  2023-02-07
7       FAM-0867             AB AGRI POLSKA Sp. z o.o.           smigiel  ...  Valid     2020-08-03  2023-03-19
8    FAM-1510-02                              AB Vista       marlborough  ...  Valid     2020-04-16  2023-04-15
9    FAM-1510-01                            AB Vista *         rotterdam  ...  Valid     2020-04-16  2023-04-15

[10 rows x 7 columns]]
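Note that pd.read_html() always returns a list of DataFrames, one per &lt;table&gt; found, which is why the output above is wrapped in [...]; index into the list to get the table itself. A self-contained sketch with a toy table (recent pandas versions also want a literal HTML string wrapped in StringIO rather than passed directly):

```python
from io import StringIO
import pandas as pd

# A tiny stand-in for the outerHTML extracted from the page
html = """<table id="sites">
  <thead><tr><th>FAMI-QS Number</th><th>Expiry date</th></tr></thead>
  <tbody><tr><td>FAM-0694</td><td>2022-09-04</td></tr></tbody>
</table>"""

tables = pd.read_html(StringIO(html))  # a list of DataFrames, one per <table>
df = tables[0]                         # the actual table
print(df)
```

So in the answer above, `df[0]` (or unpacking with `df, = pd.read_html(data)`) gives the plain DataFrame rather than the one-element list that was printed.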
