I am trying to get the backers and the home city of different projects on Kickstarter
Posted: 2018-01-02 00:14:29
Question: With the code below I am trying to get, for each Kickstarter project, the city the project comes from and the locations of its backers. However, I keep running into the following error:
File "D:/location", line 60
    page1 = urllib.request.urlopen(projects[counter])
IndexError: list index out of range
Does anyone have a more elegant way to feed the pages to urllib.request.urlopen? (See the lines marked with ** ** in the code.)
Code:
# coding: utf-8
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time
from datetime import datetime
from collections import OrderedDict
import re

browser = webdriver.Firefox()
browser.get('https://www.kickstarter.com/discover?ref=nav')
categories = browser.find_elements_by_class_name('category-container')

category_links = []
for category_link in categories:
    # Each item in the list is a tuple of the category's name and its link.
    category_links.append((str(category_link.find_element_by_class_name('f3').text),
                           category_link.find_element_by_class_name('bg-white').get_attribute('href')))

scraped_data = []
now = datetime.now()
counter = 1
for category in category_links:
    browser.get(category[1])
    browser.find_element_by_class_name('sentence-open').click()
    time.sleep(2)
    browser.find_element_by_id('category_filter').click()
    time.sleep(2)

    for i in range(27):
        try:
            time.sleep(2)
            browser.find_element_by_id('category_'+str(i)).click()
            time.sleep(2)
        except:
            pass

    #while True:
    #    try:
    #        browser.find_element_by_class_name('load_more').click()
    #    except:
    #        break

    projects = []
    for project_link in browser.find_elements_by_class_name('clamp-3'):
        projects.append(project_link.find_element_by_tag_name('a').get_attribute('href'))

    for project in projects:
        page1 = urllib.request.urlopen(projects[counter])  # ** line in question **
        soup1 = BeautifulSoup(page1, "lxml")
        page2 = urllib.request.urlopen(projects[counter].split('?')[0]+'/community')  # ** line in question **
        soup2 = BeautifulSoup(page2, "lxml")
        time.sleep(2)
        print(str(counter)+': '+project+'\nStatus: Started.')
        project_dict = OrderedDict()
        project_dict['Category'] = category[0]
        browser.get(project)
        project_dict['Name'] = soup1.find(class_='type-24 type-28-sm type-38-md navy-700 medium mb3').text
        project_dict['Home State'] = str(soup1.find(class_='nowrap navy-700 flex items-center medium type-12').text)
        try:
            project_dict['Backer State'] = str(soup2.find(class_='location-list-wrapper js-location-list-wrapper').text)
        except:
            pass
        print('Status: Done.')
        counter+=1
        scraped_data.append(project_dict)

later = datetime.now()
diff = later - now
print('The scraping took '+str(round(diff.seconds/60.0,2))+' minutes, and scraped '+str(len(scraped_data))+' projects.')
df = pd.DataFrame(scraped_data)
df.to_csv('kickstarter-data.csv')
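Answer 1: If you only use counter to print the project status messages, you can use range or enumerate instead. Here is an example with enumerate:
for counter, project in enumerate(projects):
    ... code ...
enumerate yields (index, item) tuples, so the rest of your code should work as it is.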
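To make the shape of those tuples concrete, here is a minimal sketch. The URLs are made-up placeholders rather than real projects, and start=1 is an optional argument that keeps the printed numbering 1-based, as in the original status messages.

```python
# Hypothetical project URLs, used only for illustration.
projects = [
    "https://www.kickstarter.com/projects/example/alpha?ref=discovery",
    "https://www.kickstarter.com/projects/example/beta?ref=discovery",
]

# enumerate yields (index, item) pairs; start=1 makes the count 1-based.
status_lines = []
for counter, project in enumerate(projects, start=1):
    status_lines.append(str(counter) + ': ' + project)

print("\n".join(status_lines))
```

With start=1 the counter is only suitable for display; it should not be used to index back into projects.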
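A few more things:
List indices start at 0. Because you initialise counter to 1, projects[counter] is always one element ahead of the loop, and on the last iteration it points past the end of the list, which raises the IndexError.
Inside the for loop you do not need projects[counter] at all; just use project.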
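Both points can be checked without touching the network. The sketch below is only an illustration under stated assumptions: the URLs are placeholders, and community_url is a hypothetical helper that wraps the split('?')[0] + '/community' expression from the question.

```python
def community_url(project_url):
    """Drop the query string and append '/community' (hypothetical helper)."""
    return project_url.split('?')[0] + '/community'


# Placeholder URLs, not real projects.
projects = [
    "https://www.kickstarter.com/projects/example/alpha?ref=discovery",
    "https://www.kickstarter.com/projects/example/beta?ref=discovery",
]

# Indexing with a counter that starts at 1 runs off the end of the list
# on the last iteration.
counter = 1
try:
    for project in projects:
        projects[counter]
        counter += 1
    raised = False
except IndexError:
    raised = True

# Using the loop variable directly needs no index at all.
urls = [(project, community_url(project)) for project in projects]
```

In the real scraper, each pair in urls would be what gets passed to urllib.request.urlopen for the project page and its community page.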