Adding elements to a pool array in Python from inside a function

Posted: 2022-01-07 23:24:36

Question:

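I want to write a web crawler. I need to add the links found on each page to the array that the pool is working through, but the pool only processes the URLs it was given at the start, not the links I append inside the function.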

from concurrent import futures
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup



def linksSearchAndAppend(url):
    req = Request(url)
    html_page = urlopen(req)

    soup = BeautifulSoup(html_page, "lxml")

    links = []
    for link in soup.findAll('a'):
        links.append(link.get('href'))
        if link[0]=="/":
            link[0]==""
            link=url+link

    global urls
    urls.append(links)
    print (urls)
    



urlListend=open("urlList.txt", "r")
urls=[]
for line in urlListend:
    urls.append(line.rstrip())
urlListend.close()
#main multithreading is working
e = futures.ThreadPoolExecutor(max_workers=8)
for url in urls:
    e.submit(linksSearchAndAppend, url)
e.shutdown()
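
The `if link[0]=="/"` block above appears to be an attempt to turn relative links into absolute ones. A minimal sketch of that step, assuming that is the intent, can use `urljoin` from the standard library. The helper name `resolve_links` and the example URLs are made up for illustration and are not part of the original code.

```python
from urllib.parse import urljoin


def resolve_links(base_url, hrefs):
    """Turn raw href values into absolute URLs, skipping missing ones."""
    resolved = []
    for href in hrefs:
        # link.get('href') returns None for <a> tags without an href attribute
        if not href:
            continue
        # urljoin handles "/path", "page.html" and already-absolute URLs
        resolved.append(urljoin(base_url, href))
    return resolved


example = resolve_links(
    "https://example.com/docs/",
    ["/about", "page.html", None, "https://other.example/x"],
)
print(example)
```

Inside `linksSearchAndAppend`, this would be applied to the string returned by `link.get('href')`, not to the `link` tag object itself.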

Comments:

As far as I can tell, the linksSearchAndAppend function is not even being called.

Answer 1:
from concurrent import futures
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup



def linksSearchAndAppend(url):
    req = Request(url)
    html_page = urlopen(req)

    soup = BeautifulSoup(html_page, "lxml")
    #print (soup)
    links = []
    for link in soup.findAll('a'):
        links.append(link.get('href'))
        #if link[0]=="/":
        #    link[0]==""
        #    link=url+link

    global urls
    urls.append(links)
    print (links)
    



urlListend=open("urlList.txt", "r")
urls=[]
for line in urlListend:
    urls.append(line.rstrip())
urlListend.close()
#main multithreading is working
e = futures.ThreadPoolExecutor(max_workers=8)
for url in urls:
    e.submit(linksSearchAndAppend, url)
e.shutdown()
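
A likely reason the original version looked as if the function was never called: an exception raised inside a task submitted to `ThreadPoolExecutor` is stored on the returned future and is not printed unless the future is inspected. In the question's code, `link` is a BeautifulSoup tag rather than a string, so `link[0]` most likely raises before `print` is reached. The sketch below shows how to surface such errors. `failing_task` is a hypothetical stand-in for the worker, not the original function.

```python
from concurrent import futures


def failing_task(url):
    # Hypothetical worker that hits a bug, similar to indexing a tag with [0]
    raise KeyError(0)


def run_and_collect_errors(urls):
    """Submit every URL and return a dict of url -> exception for failed tasks."""
    errors = {}
    with futures.ThreadPoolExecutor(max_workers=8) as executor:
        future_to_url = {executor.submit(failing_task, u): u for u in urls}
        for future in futures.as_completed(future_to_url):
            exc = future.exception()  # None if the task finished normally
            if exc is not None:
                errors[future_to_url[future]] = exc
    return errors


errors = run_and_collect_errors(["https://example.com"])
for url, exc in errors.items():
    print(f"{url} failed: {exc!r}")
```

Calling `future.result()` instead of `future.exception()` would re-raise the error in the main thread, which is often enough while debugging.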

Comments:

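It still does not work with the append part, but apart from the if case this works.

As it's currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.

Answer 2: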

This works, but it still needs an "alreadysearchedUrls" array so that it does not search URLs that have already been searched again.

from concurrent import futures
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup



def linksSearchAndAppend(url):
    req = Request(url)
    html_page = urlopen(req)

    soup = BeautifulSoup(html_page, "lxml")
    #print (soup)
    links = []
    for link in soup.findAll('a'):
        links.append(link.get('href'))
        #if link[0]=="/":
        #    link[0]==""
        #    link=url+link

    global urls
    urls.append(links)
    print (urls)
    



urlListend=open("urlList.txt", "r")
urls=[]
for line in urlListend:
    urls.append(line.rstrip())
urlListend.close()
#main multithreading is working
for i in urls:

    e = futures.ThreadPoolExecutor(max_workers=8)
    for url in urls:
        e.submit(linksSearchAndAppend, url)
    e.shutdown()
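
The "alreadysearchedUrls" idea mentioned in this answer could be sketched as a small set-based helper. A set gives fast membership checks, and the lock is there on the assumption that several worker threads may offer URLs at the same time. The class name `SeenUrls` and its method are hypothetical, not taken from the answer.

```python
import threading


class SeenUrls:
    """Hypothetical helper that remembers which URLs were already scheduled."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def add_if_new(self, url):
        """Return True the first time a URL is offered, False afterwards."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True


seen = SeenUrls()
candidates = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",  # duplicate, should be skipped
]
to_crawl = [u for u in candidates if seen.add_if_new(u)]
print(to_crawl)
```

A worker would call `add_if_new` on each link it finds and only hand on the ones for which it returns `True`.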

Comments:

I still don't know how to add elements to the pool from a running process, but I guess I found a workaround this way.
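
One possible way to feed new work into a pool that is already running is to let each task return the links it found and have the main thread submit them as futures complete, instead of appending to a global list from inside the workers. The sketch below is an assumption about what the asker wants, not a confirmed solution. `FAKE_LINKS` and `fetch_links` are hypothetical stand-ins for the real download-and-parse step so that the example runs without network access.

```python
from concurrent import futures

# Hypothetical in-memory "web": page -> links found on that page
FAKE_LINKS = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_LINKS.get(url, [])


def crawl(start_urls, max_workers=8):
    """Crawl from start_urls, submitting newly found links to the same pool."""
    seen = set(start_urls)  # only touched by the main thread, so no lock needed
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        pending = {executor.submit(fetch_links, u) for u in seen}
        while pending:
            done, pending = futures.wait(
                pending, return_when=futures.FIRST_COMPLETED
            )
            for future in done:
                # result() re-raises any exception from the worker
                for link in future.result():
                    if link not in seen:
                        seen.add(link)
                        pending.add(executor.submit(fetch_links, link))
    return seen


crawled = crawl(["a"])
print(sorted(crawled))
```

Because only the main thread decides what gets submitted, the set of visited URLs doubles as the duplicate check, and the loop ends on its own once no unfinished tasks remain.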
