Adding elements to a pool array in Python inside a function
【Posted】: 2022-01-07 23:24:36
【Question】: I want to write a web crawler, and I need to add the links found on a page to an array inside the pool, but the pool only works on the urls it is given, not on the additional links that I append inside the def function.
from concurrent import futures
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
from urllib.request import urlopen

def linksSearchAndAppend(url):
    req = Request(url)
    html_page = urlopen(req)
    soup = BeautifulSoup(html_page, "lxml")
    links = []
    for link in soup.findAll('a'):
        links.append(link.get('href'))
        if link[0]=="/":
            link[0]==""
            link=url+link
    global urls
    urls.append(links)
    print (urls)

urlListend=open("urlList.txt", "r")
urls=[]
for line in urlListend:
    urls.append(line.rstrip())
urlListend.close()

#main multithreading is working
e = futures.ThreadPoolExecutor(max_workers=8)
for url in urls:
    e.submit(linksSearchAndAppend, url)
e.shutdown()
【Comments】:
As far as I can tell, the linksSearchAndAppend function is not even being called.

【Answer 1】:

from concurrent import futures
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
from urllib.request import urlopen

def linksSearchAndAppend(url):
    req = Request(url)
    html_page = urlopen(req)
    soup = BeautifulSoup(html_page, "lxml")
    #print (soup)
    links = []
    for link in soup.findAll('a'):
        links.append(link.get('href'))
        #if link[0]=="/":
        #    link[0]==""
        #    link=url+link
    global urls
    urls.append(links)
    print (links)

urlListend=open("urlList.txt", "r")
urls=[]
for line in urlListend:
    urls.append(line.rstrip())
urlListend.close()

#main multithreading is working
e = futures.ThreadPoolExecutor(max_workers=8)
for url in urls:
    e.submit(linksSearchAndAppend, url)
e.shutdown()
【Comments】:
It still does not work with the appending part, but apart from the if case this works.

As it is currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.

【Answer 2】: This works, but it still needs an "alreadysearchedUrls" array so that it does not search urls that have already been searched (a minimal sketch of such a set follows the code below).
from concurrent import futures
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
from urllib.request import urlopen

def linksSearchAndAppend(url):
    req = Request(url)
    html_page = urlopen(req)
    soup = BeautifulSoup(html_page, "lxml")
    #print (soup)
    links = []
    for link in soup.findAll('a'):
        links.append(link.get('href'))
        #if link[0]=="/":
        #    link[0]==""
        #    link=url+link
    global urls
    urls.append(links)
    print (urls)

urlListend=open("urlList.txt", "r")
urls=[]
for line in urlListend:
    urls.append(line.rstrip())
urlListend.close()

#main multithreading is working
for i in urls:
    e = futures.ThreadPoolExecutor(max_workers=8)
    for url in urls:
        e.submit(linksSearchAndAppend, url)
    e.shutdown()
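
A minimal sketch of the "alreadysearchedUrls" idea mentioned in this answer, assuming a plain set guarded by a lock; the names already_searched and mark_new are illustrative, not part of the answer:

import threading

already_searched = set()
already_searched_lock = threading.Lock()

def mark_new(url):
    # Returns True the first time a url is seen and False afterwards,
    # so concurrent workers never process the same url twice.
    with already_searched_lock:
        if url in already_searched:
            return False
        already_searched.add(url)
        return True

Inside linksSearchAndAppend the fetch would then be guarded with: if not mark_new(url): return, and the same check can be applied before submitting urls to the pool.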
【Comments】:
I still don't know how to add elements to the pool from a running task, but I guess a workaround was found this way.
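
One possible way to hand new urls to workers that are already running, sketched below as an illustration rather than taken from the answers above: keep the ThreadPoolExecutor, but let a fixed number of worker tasks pull urls from a shared queue.Queue, so that any worker can put() newly found links back onto that queue while the pool is running. Names such as to_visit, seen and crawl_worker are made up for this sketch.

import queue
import threading
from concurrent import futures
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

NUM_WORKERS = 8
to_visit = queue.Queue()   # urls waiting to be crawled
seen = set()               # plays the "alreadysearchedUrls" role
seen_lock = threading.Lock()

def crawl_worker():
    while True:
        url = to_visit.get()
        if url is None:                      # sentinel: no more work, stop this worker
            return
        try:
            soup = BeautifulSoup(urlopen(Request(url)), "lxml")
            for a in soup.findAll('a'):
                href = a.get('href')
                if href and href.startswith('/'):
                    href = url + href        # crude join, as in the question
                if not href:
                    continue
                with seen_lock:
                    if href in seen:
                        continue
                    seen.add(href)
                to_visit.put(href)           # new work added while the pool is running
        except Exception:
            pass                             # skip pages that fail to load or parse
        finally:
            to_visit.task_done()

with open("urlList.txt") as f:               # seed urls; no workers are running yet
    for line in f:
        url = line.rstrip()
        seen.add(url)
        to_visit.put(url)

with futures.ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    for _ in range(NUM_WORKERS):
        pool.submit(crawl_worker)
    to_visit.join()                          # blocks until every queued url has been processed
    for _ in range(NUM_WORKERS):
        to_visit.put(None)                   # release the workers so shutdown can finish

Calling pool.submit directly from inside a worker is also possible, but then the pool must not be shut down before all of that late-submitted work has been created; the queue-plus-join pattern above avoids having to track that. In a real crawl one would also limit the depth or restrict links to one domain, otherwise the queue keeps growing.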