PHP Parallel curl requests


Posted: 2012-03-07 16:53:01

Question:

I'm building a simple application that reads JSON data from 15 different URLs, and I specifically need to do this on the server. Since I'm using file_get_contents($url), I wrote a simple script:

$websites = array(
    $url1,
    $url2,
    $url3,
     ...
    $url15
);

foreach ($websites as $website) {
    $data[] = file_get_contents($website);
}

It turns out to be very slow, because it waits for the first request to finish before starting the next one.
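For orientation before the answers below: the sequential loop can be replaced with PHP's curl_multi API, which drives all transfers on one event loop. This is a minimal sketch under my own naming (the helper fetch_all() is not from the question or any answer here):

```php
<?php
// Hypothetical helper (my own, for illustration): fetch many URLs
// concurrently with curl_multi and return bodies keyed by URL.
function fetch_all(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers; curl_multi_select() sleeps until data arrives,
    // so this does not busy-spin the CPU.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0) {
            curl_multi_select($mh, 1);
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
```

With this, `$data = fetch_all($websites);` replaces the foreach loop, and the total time is roughly that of the slowest single request instead of the sum of all 15.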

Comments:

- Googling "curl parallel requests" turns up plenty of results.
- PHP is a single-threaded language; it has no built-in support for concurrency. You could write a script that fetches a single URL (given as an argument) and execute 15 instances of it.
- Thanks for all the input. :)
- In case anyone stumbles on this page: GordonM's comment above is incorrect. The PHP curl library specifically supports multiple parallel requests. Beyond that, you can build fully multithreaded PHP applications with the pthreads extension, although doing so is entirely unnecessary and overkill here, since the curl extension supports this out of the box.

Answer 1:

If you mean multi-curl, something like this may help:

$nodes = array($url1, $url2, $url3);
$node_count = count($nodes);

$curl_arr = array();
$master = curl_multi_init();

for ($i = 0; $i < $node_count; $i++) {
    $url = $nodes[$i];
    $curl_arr[$i] = curl_init($url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($master, $curl_arr[$i]);
}

do {
    curl_multi_exec($master, $running);
} while ($running > 0);

for ($i = 0; $i < $node_count; $i++) {
    $results[] = curl_multi_getcontent($curl_arr[$i]);
}
print_r($results);

Hope this helps.

Comments:

- Trying it now... :) I'll let you know whether it works. Thanks a lot.
- Oh, this happens to me all the time! Either they upvote the answer but don't accept it, or accept it without upvoting. Frustrating.
- May I know what $running contains?
- @ramyabr it's a boolean (passed by reference) indicating whether multicurl is still running and fetching data.
- Your multi_exec loop will work, but it will also waste massive amounts of CPU, pegging 100% of one core until everything has downloaded, because the loop spams curl_multi_exec(), an async function, as fast as possible until all transfers are done. If you change it to `do { curl_multi_exec($master, $running); if ($running > 0) curl_multi_select($mh, 1); } while ($running > 0);` it will use ~1% CPU instead of 100% (an even better loop can still be built, which would be something like `for (;;) { curl_multi_exec($mh, $running); if ($running < 1) break; curl_multi_select($mh, 1); }`).

Answer 2:

I'd like to provide a more complete example that doesn't peg the CPU at 100% and doesn't crash on minor errors or other unexpected situations.

It also shows how to fetch headers and the body, get request info, and handle redirects manually.

Disclaimer: this code is intended to be extended into a library or used as a quick starting point, so the functions in it are kept to a minimum.

function mtime() {
    return microtime(true);
}

function ptime($prev) {
    $t = microtime(true) - $prev;
    $t = $t * 1000;
    return str_pad($t, 20, 0, STR_PAD_RIGHT);
}

// This function exists to add compatibility for CURLM_CALL_MULTI_PERFORM for old curl versions, on modern curl it will only run once and be the equivalent of calling curl_multi_exec
function curl_multi_exec_full($mh, &$still_running) {
    // In theory curl_multi_exec should never return CURLM_CALL_MULTI_PERFORM (-1) because it has been deprecated
    // In practice it sometimes does
    // So imagine that this just runs curl_multi_exec once and returns its value
    do {
        $state = curl_multi_exec($mh, $still_running);

        // curl_multi_select($mh, $timeout) simply blocks for $timeout seconds while curl_multi_exec() returns CURLM_CALL_MULTI_PERFORM
        // We add it to prevent CPU 100% usage in case this thing misbehaves (especially for old curl on windows)
    } while ($still_running > 0 && $state === CURLM_CALL_MULTI_PERFORM && curl_multi_select($mh, 0.1));
    return $state;
}

// This function replaces curl_multi_select and makes the name make more sense, since all we're doing is waiting for curl; it also forces a minimum sleep time between requests to avoid excessive CPU usage.
function curl_multi_wait($mh, $minTime = 0.001, $maxTime = 1) {
    $umin = $minTime * 1000000;

    $start_time = microtime(true);

    // it sleeps until there is some activity on any of the descriptors (curl files)
    // it returns the number of descriptors (curl files that can have activity)
    $num_descriptors = curl_multi_select($mh, $maxTime);

    // if the system returns -1, it means that the wait time is unknown, and we have to decide the minimum time to wait
    // but our `$timespan` check below catches this edge case, so this `if` isn't really necessary
    if ($num_descriptors === -1) {
        usleep($umin);
    }

    $timespan = (microtime(true) - $start_time) * 1000000; // elapsed time in microseconds, same unit as $umin

    // This thing runs very fast, up to 1000 times for 2 urls, which wastes a lot of CPU
    // This will reduce the runs so that each interval is separated by at least minTime
    if ($timespan < $umin) {
        usleep($umin - $timespan);
        //print "sleep for ".($umin - $timespan).PHP_EOL;
    }
}

$handles = [
    [
        CURLOPT_URL => "http://example.com/",
        CURLOPT_HEADER => false,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => false,
    ],
    [
        CURLOPT_URL => "http://www.php.net",
        CURLOPT_HEADER => false,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => false,

        // this function is called by curl for each header received
        // This complies with RFC822 and RFC2616, please do not suggest edits to make use of the mb_ string functions, it is incorrect!
        // https://***.com/a/41135574
        CURLOPT_HEADERFUNCTION => function ($ch, $header) {
            print "header from http://www.php.net: " . $header;
            //$header = explode(':', $header, 2);
            //if (count($header) < 2) { // ignore invalid headers
            //    return $len;
            //}
            //$headers[strtolower(trim($header[0]))][] = trim($header[1]);

            return strlen($header);
        },
    ]
];

//create the multiple cURL handle
$mh = curl_multi_init();

$chandles = [];
foreach ($handles as $opts) {
    // create cURL resources
    $ch = curl_init();

    // set URL and other appropriate options
    curl_setopt_array($ch, $opts);

    // add the handle
    curl_multi_add_handle($mh, $ch);

    $chandles[] = $ch;
}

//execute the multi handle
$prevRunning = null;
$count = 0;
do {
    $time = mtime();

    // $running contains the number of currently running requests
    $status = curl_multi_exec_full($mh, $running);
    $count++;

    print ptime($time) . ": curl_multi_exec status=$status running $running" . PHP_EOL;

    // One less is running, meaning one has finished
    if ($running < $prevRunning) {
        print ptime($time) . ": curl_multi_info_read" . PHP_EOL;

        // msg: The CURLMSG_DONE constant. Other return values are currently not available.
        // result: One of the CURLE_* constants. If everything is OK, the CURLE_OK will be the result.
        // handle: Resource of type curl indicates the handle which it concerns.
        while ($read = curl_multi_info_read($mh, $msgs_in_queue)) {

            $info = curl_getinfo($read['handle']);

            if ($read['result'] !== CURLE_OK) {
                // handle the error somehow
                print "Error: " . $info['url'] . PHP_EOL;
            }

            if ($read['result'] === CURLE_OK) {
                /*
                // This will automatically follow the redirect and still give you control over the previous page
                // TODO: max redirect checks and redirect timeouts
                if (isset($info['redirect_url']) && trim($info['redirect_url']) !== '') {
                    print "running redirect: " . $info['redirect_url'] . PHP_EOL;
                    $ch3 = curl_init();
                    curl_setopt($ch3, CURLOPT_URL, $info['redirect_url']);
                    curl_setopt($ch3, CURLOPT_HEADER, 0);
                    curl_setopt($ch3, CURLOPT_RETURNTRANSFER, 1);
                    curl_setopt($ch3, CURLOPT_FOLLOWLOCATION, 0);
                    curl_multi_add_handle($mh, $ch3);
                }
                */

                print_r($info);
                $body = curl_multi_getcontent($read['handle']);
                print $body;
            }
        }
    }

    // Still running? keep waiting...
    if ($running > 0) {
        curl_multi_wait($mh);
    }

    $prevRunning = $running;
} while ($running > 0 && $status == CURLM_OK);

//close the handles
foreach ($chandles as $ch) {
    curl_multi_remove_handle($mh, $ch);
}
curl_multi_close($mh);

print $count . PHP_EOL;

Comments:

- Your multi_exec() loop is pointless and will always exit on the first line... If you absolutely insist on supporting CURLM_CALL_MULTI_PERFORM (deprecated in curl since at least 2012 and no longer used), the loop should be: `for (;;) { do { $ex = curl_multi_exec($mh, $still_running); } while ($ex === CURLM_CALL_MULTI_PERFORM); if ($ex !== CURLM_OK) { /* handle curl error? */ } if ($still_running < 1) { break; } curl_multi_select($mh, 1); }` Your code handles CURLM_CALL_MULTI_PERFORM (CCMP for short) incorrectly: you should not run select() if you get CCMP, you should call multi_exec() again. Worse, as of (2012?) curl never returns CCMP anymore, so your `$state === CCMP` check will always fail, meaning your exec loop will always exit after the first iteration.
- My original rationale was to add it for backward compatibility with older curl (pre-2012), and it's fine if the loop exits immediately. That's also why I packaged it as curl_multi_exec_full, which can be renamed to curl_multi_exec for post-2012 compatibility. On CCMP it will select and exec again. I really appreciate your comments and would welcome more explanation of why the code is wrong; right now I don't see the mistake.
- For one: you run select() if you get CCMP, which is wrong. If you get CCMP you should not wait for more data to arrive; you should run curl_multi_exec() again immediately. (A single multi_exec() call may use too much CPU/time, but CCMP allows programs that need very low latency/realtime behavior to do other things in between. So many people misunderstood how to use it correctly that the curl developers decided to deprecate it: too many got it wrong, and very few actually needed it. On the curl mailing list only one person complained and actually used it.)
- For two: you never run select() when you don't get CCMP, but that's also wrong. Sometimes (these days, OFTEN) you should run select() even when you don't get CCMP, and your code doesn't.

Answer 3:

I'm not particularly fond of the approach in any of the existing answers.

Timo's code: it may sleep/select() during CURLM_CALL_MULTI_PERFORM, which is wrong, and it may also fail to sleep when ($still_running > 0 && $exec != CURLM_CALL_MULTI_PERFORM), which can make the code spin at 100% CPU usage (of one core) for no reason.

Sudhir's code: it does not sleep while $still_running > 0, and it spams calls to the async function curl_multi_exec() until everything has downloaded, which makes PHP use 100% CPU (of one core) until all downloads complete. In other words, it fails to sleep while downloading.

Here's an approach with neither of those problems:

$websites = array(
    "http://google.com",
    "http://example.org"
    // $url2,
    // $url3,
    // ...
    // $url15
);
$mh = curl_multi_init();
foreach ($websites as $website) {
    $worker = curl_init($website);
    curl_setopt_array($worker, [
        CURLOPT_RETURNTRANSFER => 1
    ]);
    curl_multi_add_handle($mh, $worker);
}
for (;;) {
    $still_running = null;
    do {
        $err = curl_multi_exec($mh, $still_running);
    } while ($err === CURLM_CALL_MULTI_PERFORM);
    if ($err !== CURLM_OK) {
        // handle curl multi error?
    }
    if ($still_running < 1) {
        // all downloads completed
        break;
    }
    // some haven't finished downloading, sleep until more data arrives:
    curl_multi_select($mh, 1);
}
$results = [];
while (false !== ($info = curl_multi_info_read($mh))) {
    if ($info["result"] !== CURLE_OK) {
        // handle download error?
    }
    $results[curl_getinfo($info["handle"], CURLINFO_EFFECTIVE_URL)] = curl_multi_getcontent($info["handle"]);
    curl_multi_remove_handle($mh, $info["handle"]);
    curl_close($info["handle"]);
}
curl_multi_close($mh);
var_export($results);

Note that one problem shared by all three approaches (my answer, Sudhir's, and Timo's) is that they open all connections simultaneously. If you have 1,000,000 websites to fetch, these scripts will try to open 1,000,000 connections at once. If you need to download only, say, 50 websites at a time, try something like this:

$websites = array(
    "http://google.com",
    "http://example.org"
    // $url2,
    // $url3,
    // ...
    // $url15
);
var_dump(fetch_urls($websites, 50));

function fetch_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $return_fault_reason = true): array
{
    if ($max_connections < 1) {
        throw new InvalidArgumentException("max_connections MUST be >=1");
    }
    foreach ($urls as $key => $foo) {
        if (! is_string($foo)) {
            throw new \InvalidArgumentException("all urls must be strings!");
        }
        if (empty($foo)) {
            unset($urls[$key]); // ?
        }
    }
    unset($foo);
    // DISABLED for benchmarking purposes: $urls = array_unique($urls); // remove duplicates.
    $ret = array();
    $mh = curl_multi_init();
    $workers = array();
    $work = function () use (&$ret, &$workers, &$mh, $return_fault_reason) {
        // > If an added handle fails very quickly, it may never be counted as a running_handle
        while (1) {
            do {
                $err = curl_multi_exec($mh, $still_running);
            } while ($err === CURLM_CALL_MULTI_PERFORM);
            if ($still_running < count($workers)) {
                // some workers finished, fetch their response and close them
                break;
            }
            $cms = curl_multi_select($mh, 1);
            // var_dump('sr: ' . $still_running . " c: " . count($workers) . " cms: " . $cms);
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            // echo "NOT FALSE!";
            // var_dump($info);
            if ($info['msg'] !== CURLMSG_DONE) {
                continue;
            }
            if ($info['result'] !== CURLE_OK) {
                if ($return_fault_reason) {
                    $ret[$workers[(int) $info['handle']]] = print_r(array(
                        false,
                        $info['result'],
                        "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result'])
                    ), true);
                }
            } elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
                if ($return_fault_reason) {
                    $ret[$workers[(int) $info['handle']]] = print_r(array(
                        false,
                        $err,
                        "curl error " . $err . ": " . curl_strerror($err)
                    ), true);
                }
            } else {
                $ret[$workers[(int) $info['handle']]] = curl_multi_getcontent($info['handle']);
            }
            curl_multi_remove_handle($mh, $info['handle']);
            assert(isset($workers[(int) $info['handle']]));
            unset($workers[(int) $info['handle']]);
            curl_close($info['handle']);
        }
        // echo "NO MORE INFO!";
    };
    foreach ($urls as $url) {
        while (count($workers) >= $max_connections) {
            // echo "TOO MANY WORKERS!\n";
            $work();
        }
        $neww = curl_init($url);
        if (! $neww) {
            trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of system resources", E_USER_WARNING);
            if ($return_fault_reason) {
                $ret[$url] = array(
                    false,
                    -1,
                    "curl_init() failed"
                );
            }
            continue;
        }
        $workers[(int) $neww] = $url;
        curl_setopt_array($neww, array(
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_SSL_VERIFYHOST => 0,
            CURLOPT_SSL_VERIFYPEER => 0,
            CURLOPT_TIMEOUT_MS => $timeout_ms
        ));
        curl_multi_add_handle($mh, $neww);
        // curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
    }
    while (count($workers) > 0) {
        // echo "WAITING FOR WORKERS TO BECOME 0!";
        // var_dump(count($workers));
        $work();
    }
    curl_multi_close($mh);
    return $ret;
}
This downloads the entire list while never fetching more than 50 URLs simultaneously. (But even this approach stores all results in RAM, so it can still run out of memory; if you want to store the results in a database instead of RAM, the curl_multi_getcontent part can be modified to write to a database rather than into a RAM-persistent variable.)
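To make the RAM point concrete, one option is to hand each finished body to a caller-supplied callback as it completes, so nothing accumulates in the function. This is my own sketch, not code from the answer: the name fetch_urls_cb and its callback signature are assumptions, and it uses spl_object_id() because in PHP 8+ curl handles are objects rather than resources:

```php
<?php
// Hypothetical variant (my own): stream results out through $on_done($url, $body)
// instead of returning an array; $body is false when the transfer failed.
function fetch_urls_cb(array $urls, int $max_connections, callable $on_done): void
{
    if ($max_connections < 1) {
        throw new InvalidArgumentException("max_connections MUST be >=1");
    }
    $mh = curl_multi_init();
    $workers = [];
    $drain = function () use (&$workers, $mh, $on_done) {
        do {
            $err = curl_multi_exec($mh, $running);
        } while ($err === CURLM_CALL_MULTI_PERFORM);
        // Nothing finished yet? Sleep until more data arrives.
        if (count($workers) > 0 && $running >= count($workers)) {
            curl_multi_select($mh, 1);
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            $h = $info['handle'];
            $url = $workers[spl_object_id($h)];
            // Deliver the body (or false on error) immediately, then free the handle,
            // so the body can go straight to a database/file instead of staying in RAM.
            $on_done($url, $info['result'] === CURLE_OK ? curl_multi_getcontent($h) : false);
            curl_multi_remove_handle($mh, $h);
            unset($workers[spl_object_id($h)]);
            curl_close($h);
        }
    };
    foreach ($urls as $url) {
        while (count($workers) >= $max_connections) {
            $drain();
        }
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10); // assumed per-request timeout
        $workers[spl_object_id($ch)] = $url;
        curl_multi_add_handle($mh, $ch);
    }
    while (count($workers) > 0) {
        $drain();
    }
    curl_multi_close($mh);
}
```

Usage would be along the lines of `fetch_urls_cb($websites, 50, function ($url, $body) use ($pdo) { /* INSERT into the database here */ });`, keeping peak memory bounded by the 50 in-flight transfers rather than the whole result set.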

Comments:

- Could you tell me what $return_fault_reason amounts to?
- @AliNiaz sorry, I forgot about that when copying the code from this answer; $return_fault_reason is supposed to be an argument saying whether a failed download should simply be ignored or should come with an error message. I've updated the code with the $return_fault_reason argument now.
