无法纠正 - ValueError: unknown url type: Link
Posted
技术标签:
【中文标题】无法纠正 - ValueError: unknown url type: Link【英文标题】:Unable to rectify - ValueError: unknown url type: Link 【发布时间】:2018-01-14 06:44:27 【问题描述】:我目前正在运行此代码以将文章 url 链接抓取到 csv 文件中,并访问这些 url(在 csv 文件中)以将相应的信息抓取到文本文件中。
我能够抓取到 csv 文件的链接,但我无法访问 csv 文件以抓取更多信息(文本文件也未创建)并且我遇到了 ValueError
import csv
from lxml import html
from time import sleep
import requests
from bs4 import BeautifulSoup
import urllib
import urllib2
from random import randint
outputFile = open("All_links.csv", r'wb')
fileWriter = csv.writer(outputFile)
fileWriter.writerow(["Link"])
#fileWriter.writerow(["Sl. No.", "Page Number", "Link"])
url1 = 'https://www.marketingweek.com/page/'
url2 = '/?s=big+data'
sl_no = 1
#iterating from 1st page through 361th page
for i in xrange(1, 361):
#generating final url to be scraped using page number
url = url1 + str(i) + url2
#Fetching page
response = requests.get(url)
sleep(randint(10, 20))
#using html parser
htmlContent = html.fromstring(response.content)
#Capturing all 'a' tags under h2 tag with class 'hentry-title entry-title'
page_links = htmlContent.xpath('//div[@class = "archive-constraint"]//h2[@class = "hentry-title entry-title"]/a/@href')
for page_link in page_links:
print page_link
fileWriter.writerow([page_link])
sl_no += 1
with open('All_links.csv', 'rb') as f1:
f1.seek(0)
reader = csv.reader(f1)
for line in reader:
url = line[0]
soup = BeautifulSoup(urllib2.urlopen(url))
with open('LinksOutput.txt', 'a+') as f2:
for tag in soup.find_all('p'):
f2.write(tag.text.encode('utf-8') + '\n')
这是我遇到的错误:
File "c:\users\rrj17\documents\visual studio 2015\Projects\webscrape\webscrape\webscrape.py", line 47, in <module>
soup = BeautifulSoup(urllib2.urlopen(url))
File "C:\Python27\lib\urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 421, in open
protocol = req.get_type()
File "C:\Python27\lib\urllib2.py", line 283, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: Link
请求一些帮助。
【问题讨论】:
【参考方案1】:尝试跳过csv
文件中的第一行...您可能会在不知不觉中尝试解析标题。
with open('All_links.csv', 'rb') as f1:
reader = csv.reader(f1)
next(reader) # read the header and send it to oblivion
for line in reader: # NOW start reading
...
您也不需要f1.seek(0)
,因为f1
在读取模式下会自动指向文件的开头。
【讨论】:
不..发生错误。很抱歉造成混乱。以上是关于无法纠正 - ValueError: unknown url type: Link的主要内容,如果未能解决你的问题,请参考以下文章
Java项目/maven项目无法添加Web,Tomcat显示Unknow