cURL returns an empty array
【Title】: cURL returns null array 【Posted】: 2021-08-11 10:28:35 【Question】: I built a simple web scraper with PHP cURL. It should grab all images from the Amazon page that comes up when searching for the keyword samsung.
Here is the code:
$curl = curl_init(); // $curl is going to be data type curl resource
$search_string = "samsung";
$url = "https://www.amazon.com/s?k$search_string";
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); // ssl
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // storing in variable
$result = curl_exec($curl);
preg_match_all("!https://m.media-amazon.com/images/I/[^\s]*?._AC_UL320_.jpg!", $result, $matches);
print_r($matches);
curl_close($curl);
But right now I get an empty array:
Array ( [0] => Array ( ) )
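I don't know why this happens, so if you know what is wrong or how I can deal with it, please tell me. I would really appreciate any ideas.
Thanks in advance.
Note that I used the regex [^\s]*? instead of a specific image name so that every image available on the page is matched.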
Update #1:
Result of running the following command:
curl --head https://www.amazon.com/s?k=samsung
HTTP/1.1 503 Service Unavailable
Content-Type: text/html
Content-Length: 2671
Connection: keep-alive
Server: Server
Date: Tue, 15 Jun 2021 20:59:38 GMT
x-amz-rid: 9BVX8KQMWJ4QDJ75ETYV
Vary: Content-Type,Accept-Encoding,X-Amzn-CDN-Cache,X-Amzn-AX-Treatment,User-Agent
Last-Modified: Fri, 14 May 2021 19:08:48 GMT
ETag: "a6f-5c24ef9383000"
Accept-Ranges: bytes
Strict-Transport-Security: max-age=47474747; includeSubDomains; preload
Permissions-Policy: interest-cohort=()
X-Cache: Error from cloudfront
Via: 1.1 5345148f0ba8ae3c67b69d035acdbfc5.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: AMS50-C1
X-Amz-Cf-Id: AHdq2-QLEtCE4WvXZIEh_P75D8hCrHP09EAkNqBer5VBS-pI-blj1w==
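【Comments】:
The response to your request is most likely a redirect or a rewrite, and it does not include the content you are looking for.
@DonR So what can I do to fix this?
You have to handle the redirect the way a browser does and request the new resource.
@DonR Could you explain more or give me a link to an example?
Step 1: open a terminal/cmd and simply run curl to see what it returns for your URL. But you may want to fix the typo in the URL first (add echo $url; to see what is wrong).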
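The advice in the last comment can also be followed from PHP by checking the HTTP status and the transport error before looking at the body. The following is a minimal sketch, not the asker's code: the helper name fetch_with_diagnostics, the redirect limit, and the timeout values are assumptions made up for this illustration. It sets CURLOPT_FOLLOWLOCATION so that redirects are followed the way a browser would follow them.

```php
<?php
// Hypothetical helper (not from the original post): fetch a URL and report
// the HTTP status and any transport error alongside the body.
function fetch_with_diagnostics(string $url): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects like a browser
        CURLOPT_MAXREDIRS      => 5,    // assumed limit
        CURLOPT_CONNECTTIMEOUT => 5,    // assumed timeout in seconds
        CURLOPT_TIMEOUT        => 15,   // assumed overall timeout in seconds
    ]);

    $body = curl_exec($ch);

    return [
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE), // 0 if no HTTP response arrived
        'error'  => curl_error($ch),                       // '' when the transfer succeeded
        'body'   => $body === false ? null : $body,
    ];
}

// Usage (performs a real request, so it is left commented out):
// $r = fetch_with_diagnostics('https://www.amazon.com/s?k=samsung');
// echo $r['status'], ' ', $r['error'], "\n";
```

A status such as the 503 shown in Update #1 means the body is an error page, so an empty match array is expected no matter what the regex looks like.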
【Answer 1】:
First issue: your code
$url = "https://www.amazon.com/s?k$search_string";
should be (note the "=")
$url = "https://www.amazon.com/s?k=$search_string";
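A way to avoid this class of typo is to let PHP assemble the query string. This is only a sketch: the helper name build_search_url is hypothetical, and http_build_query is used here as one possible choice, not something the answer prescribes.

```php
<?php
// Hypothetical helper: build the search URL from a keyword.
function build_search_url(string $keyword): string
{
    // http_build_query() inserts the "=" and URL-encodes the value.
    return 'https://www.amazon.com/s?' . http_build_query(['k' => $keyword]);
}

echo build_search_url('samsung'), "\n";            // https://www.amazon.com/s?k=samsung
echo build_search_url('samsung galaxy s21'), "\n"; // https://www.amazon.com/s?k=samsung+galaxy+s21
```

Because the key and value are passed separately, a missing "=" cannot happen, and keywords containing spaces or other special characters are encoded automatically.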
Second issue: Amazon is smart and will not let you scrape at will.
You can see what your request actually returns with:
$result = curl_exec($curl);
var_dump($result);
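Third issue: the regex does not work. You should test regexes at https://www.phpliveregex.com/#tab-preg-match-all (right-click > View Source on the page, then copy and paste the page contents).
In my test your regex returned no results, but this one did: https://m.media-amazon.com/images/I/[^\s]*?.jpg
Maybe the string fragment ._AC_UL320_ is also part of Amazon's anti-scraping measures... :(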
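The difference between the two patterns can be reproduced offline. The sketch below uses a made-up two-image snippet (the two URLs are taken from the output listed in Answer 2) and compares the question's pattern with a relaxed one. The relaxed pattern is an assumption for illustration: it escapes the dots, stops at quotes, and accepts any size suffix.

```php
<?php
// Sample markup for illustration; the two image URLs appear in the output of Answer 2.
$html = '<img src="https://m.media-amazon.com/images/I/81HdcaHSq4L._AC_UY218_.jpg">'
      . '<img src="https://m.media-amazon.com/images/I/216-OX9rBaL._SS72_.png">';

// The question's pattern: it only matches URLs that end in "._AC_UL320_.jpg".
$strict = '!https://m.media-amazon.com/images/I/[^\s]*?._AC_UL320_.jpg!';

// A relaxed pattern: dots are escaped, quotes end the URL, any size suffix is allowed.
$relaxed = '~https://m\.media-amazon\.com/images/I/[^\s"\']+?\.jpg~';

$strictCount  = preg_match_all($strict, $html, $strictMatches);
$relaxedCount = preg_match_all($relaxed, $html, $relaxedMatches);

var_dump($strictCount);      // int(0): no URL carries the _AC_UL320_ suffix
print_r($relaxedMatches[0]); // one entry: the .jpg URL
```

The size suffix in the sample is _AC_UY218_, not _AC_UL320_, which is enough to make the original pattern come back empty even when the page itself loads correctly.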
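【Comments】:
You can scrape AWS as long as you emulate a real user well enough. I assume the OP's User-Agent is still cURL/PHP or similar, which is an obvious red flag. Guzzle with a CookieJar helps a lot, because it keeps previously set cookies; if those are missing, things like a captcha get triggered.
You trigger the anti-scraping measures by sending no browser-like Accept header and no user agent. Interestingly, curl/xxx is apparently a blacklisted user agent, but "libcurl" (a non-standard UA string) is not, so you can get a correct result by running:
curl 'https://www.amazon.com/s?k=samsung' --compressed -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' --user-agent 'libcurl'
【Answer 2】: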
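Instead of https://www.amazon.com/s?k$search_string it should be 'https://www.amazon.com/s?k='.urlencode($search_string);
Amazon.com also requires you to send an Accept-Encoding header, otherwise you risk getting a gzip-compressed response that never gets decompressed, which means you need CURLOPT_ENCODING. Amazon also blocks you if you do not send a User-Agent header, so you must set CURLOPT_USERAGENT. And Amazon blocks you if there is no browser-like Accept header, so you need CURLOPT_HTTPHEADER => array('accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng')
Also, do not parse HTML with regex. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs HTML uses. HTML is not a regular language and therefore cannot be parsed by regular expressions. Regex queries are not equipped to break HTML down into its meaningful parts. Use an HTML parser such as DOMDocument instead.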
This code
<?php
$curl = curl_init(); // $curl is going to be data type curl resource
$search_string = "samsung";
$url = "https://www.amazon.com/s?k=".urlencode($search_string);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); // ssl
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // storing in variable
curl_setopt_array($curl,array(
CURLOPT_ENCODING =>'',
CURLOPT_USERAGENT=>'libcurl',
CURLOPT_HTTPHEADER=>array(
'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
)
));
$html=curl_exec($curl);
$domd = new DOMDocument();
@$domd->loadHTML($html);
foreach($domd->getElementsByTagName("img") as $img)
echo $img->getAttribute("src"),"\n";
outputs
//fls-na.amazon.com/1/batch/1/OP/ATVPDKIKX0DER:136-7756522-9160852:777GSTVR1XJ9MBF1N0KN$uedata=s:%2Frd%2Fuedata%3Fstaticb%26id%3D777GSTVR1XJ9MBF1N0KN:0
https://images-na.ssl-images-amazon.com/images/G/01/gno/sprites/nav-sprite-global-1x-hm-dsk-reorg._CB405937547_.png
https://m.media-amazon.com/images/I/81HdcaHSq4L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/91eAcgt9fSL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/81afsli5ctL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61m1Dot5KCL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61HFJwSDQ4L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/216-OX9rBaL._SS72_.png
https://m.media-amazon.com/images/I/21OXy0oJ8VL._SS160_.png
https://m.media-amazon.com/images/I/61jfI8GyQgL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61LUNEgB6iL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/813dec-cszS._AC_UY218_.jpg
https://m.media-amazon.com/images/I/81AT+Flc+EL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/216-OX9rBaL._SS72_.png
https://m.media-amazon.com/images/I/21OXy0oJ8VL._SS160_.png
https://m.media-amazon.com/images/I/61a5ejk6K2L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/81+3SWSAhDL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61pwE8H34zL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/71ejkOW4y2L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/71G6eW8H8hL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/91dFUw5MUTS._AC_UY218_.jpg
https://m.media-amazon.com/images/I/81P4RzFnw6L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/712iry8nIYL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61VgW9ZZXiL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61ft-L7HnUL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/51icdppvRVL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/6164p9jY2jS._AC_UY218_.jpg
https://m.media-amazon.com/images/I/51skvShlcsL._AC_UY218_.jpg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/93913ead-ae42-4933-8fc4-e9f88b0396c9/1635f47b-1fa9-40ca-8d85-47f529c1ba8b/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/6aa489c6-af9d-48d0-94c8-cce1a4f50fc7/ff2a7805-3166-41b9-9881-d00901ca9dfd/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/73b89b9f-ee28-446f-8535-beacd328c95a/8caa5478-3583-49f9-9dcb-6e5b0a254fa6/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/457fd8ad-f566-4682-bb66-fd865954aec0/fb2cdc76-7ed6-4b86-9196-d40c3ead2914/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/5c60fcd5-17c1-4389-8423-2252436f21c8/0125e72d-9178-4048-bea3-9d268a406a05/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/f852e5ab-0fa9-4f91-b195-b0facc4d0d70/30b0ec08-79b2-428d-98df-aadffd2c00eb/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/d173de56-5162-463f-be97-d256c1895024/7974c773-0c53-43a1-bfb4-91d7cc3ce801/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/2cfe5e10-6a7e-43f4-80c7-d87f212b8007/43e8a030-58c5-491a-9854-cd4d8824a873/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/personalization/ybh/loading-4x-gray._CB485916920_.gif
https://assoc-na.associates-amazon.com/abid/um?s=136-7756522-9160852&m=ATVPDKIKX0DER
//fls-na.amazon.com/1/batch/1/OP/ATVPDKIKX0DER:136-7756522-9160852:777GSTVR1XJ9MBF1N0KN$uedata=s:%2Frd%2Fuedata%3Fnoscript%26id%3D777GSTVR1XJ9MBF1N0KN:0
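The list above mixes product thumbnails with sprites, tracking pixels, and duplicates. If only the product images are wanted, the src values can be filtered after parsing. The following is a sketch under stated assumptions: the helper name product_image_urls is hypothetical, and the filter (host m.media-amazon.com, path under /images/I/) is inferred from the URLs in the output, not from any documented Amazon rule. It uses libxml_use_internal_errors in place of the @ operator to silence parser warnings.

```php
<?php
// Hypothetical helper: return unique product-image URLs found in <img src="...">.
function product_image_urls(string $html): array
{
    $previous = libxml_use_internal_errors(true); // collect parser warnings instead of using @
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Assumed filter: product thumbnails live on this host under /images/I/.
        if (parse_url($src, PHP_URL_HOST) === 'm.media-amazon.com'
            && strpos((string) parse_url($src, PHP_URL_PATH), '/images/I/') === 0) {
            $urls[] = $src;
        }
    }

    // Drop duplicates and reindex from 0.
    return array_values(array_unique($urls));
}

// Offline sample built from URLs that appear in the output above.
$sample = '<html><body>'
    . '<img src="https://m.media-amazon.com/images/I/81HdcaHSq4L._AC_UY218_.jpg">'
    . '<img src="https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif">'
    . '<img src="https://m.media-amazon.com/images/I/216-OX9rBaL._SS72_.png">'
    . '<img src="https://m.media-amazon.com/images/I/216-OX9rBaL._SS72_.png">'
    . '</body></html>';

print_r(product_image_urls($sample));
```

With the sample markup, the grey-pixel placeholder is dropped and the repeated .png is returned once. In real use, $html would be the string returned by curl_exec.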
【Comments】:
【Answer 3】: $url = "https://www.amazon.com/s?k$search_string"; Yes, your URL is wrong. The actual URL is shown below; you can try:
$url = "https://www.amazon.com/s?k=$search_string";
【Comments】:
【Answer 4】: First, there is a typo. Change
$url = "https://www.amazon.com/s?k".$search_string;
to
$url = "https://www.amazon.com/s?k=".$search_string;
Amazon expects certain header values when content is requested; see the following curl request:
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36');
curl_setopt($curl, CURLOPT_HTTPHEADER, array(
'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
));
curl_setopt($curl, CURLOPT_ENCODING, '');
$result=curl_exec($curl);
Finally, change your preg_match_all call from
preg_match_all("!https://m.media-amazon.com/images/I/[^\s]*?._AC_UL320_.jpg!", $result, $matches);
to
preg_match_all('/(https?:\/\/\S+\.(?:jpg|png|gif))\s+/', $result, $matches);
Full code:
<?php
$curl = curl_init();
$search_string = "samsung";
$url = "https://www.amazon.com/s?k=".$search_string;
// Set headers to match what a browser sends to Amazon; you can inspect them with any browser's developer tools.
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36');
curl_setopt($curl, CURLOPT_HTTPHEADER, array(
'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
));
curl_setopt($curl, CURLOPT_ENCODING, '');
$result=curl_exec($curl);
preg_match_all('/(https?:\/\/\S+\.(?:jpg|png|gif))\s+/', $result, $matches);
print_r($matches);
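The request options used across the answers, plus the cookie persistence mentioned in the comments under Answer 1, can be collected in one place. This is a sketch, not a guaranteed way past Amazon's checks: the helper name browser_like_options and the cookie file location are assumptions made for illustration. Unlike the snippets above, it leaves CURLOPT_SSL_VERIFYPEER at its default so certificate verification stays on.

```php
<?php
// Hypothetical helper: one place for the browser-like options discussed above.
function browser_like_options(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_ENCODING       => '', // offer all supported encodings, decode automatically
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
        CURLOPT_HTTPHEADER     => [
            'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        ],
        CURLOPT_COOKIEFILE     => $cookieFile, // send cookies saved by earlier requests
        CURLOPT_COOKIEJAR      => $cookieFile, // save cookies when the handle is closed
    ];
}

$cookieFile = sys_get_temp_dir() . '/scraper-cookies.txt'; // assumed location
$curl = curl_init();
$applied = curl_setopt_array(
    $curl,
    browser_like_options('https://www.amazon.com/s?k=samsung', $cookieFile)
);
var_dump($applied); // bool(true) when every option was accepted

// $result = curl_exec($curl); // the actual request is left commented out
```

Pointing CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR at the same file gives plain cURL roughly the behavior the comment attributes to Guzzle's CookieJar: cookies set by one response are sent with the next request.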
【Comments】: