PowerShell + Selenium Web Scraping: Proxies (03)

Posted PS_cmdlet


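The previous post covered basic Selenium operations. Actually scraping a target site with Selenium takes some additional disguise, such as routing the browser through a proxy. This makes it less likely that the site identifies the crawler and adds your own IP address to its blacklist, which could get that IP permanently banned from the site or restricted from certain content.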

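To that end, we found some free proxy sites on the web. Accessing the target site through the free proxy IPs they list is relatively safer, because your own IP address is not exposed to the target site.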

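To recap, the logic so far is:

1. Decide on the target URL to scrape.

2. Use a proxy IP to disguise yourself when accessing the target URL.

3. Build a pool of proxy IPs, which still needs further validation and updating.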
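
Step 2 comes down to handing Chrome a `--proxy-server` argument, as the commented-out `AddArgument` line in the script hints. The following is only a minimal sketch of how one entry from a validated pool might be turned into that argument. The function name `Get-ProxyArgument` and the sample record are hypothetical, and the record is assumed to carry `IPAddress` and `Port` properties like the objects the 66ip.cn region of the script builds.

```powershell
# Hypothetical helper: pick one proxy from a pool and build Chrome's --proxy-server argument.
function Get-ProxyArgument {
    param(
        # Objects that expose IPAddress and Port properties (an assumption of this sketch)
        [Parameter(Mandatory)] [object[]] $ProxyList
    )
    # Choose a random entry so that requests are spread across the pool.
    $proxy = Get-Random -InputObject $ProxyList
    '--proxy-server=http://{0}:{1}' -f $proxy.IPAddress, $proxy.Port
}

# Sample data only: 192.0.2.10 is a documentation address, not a working proxy.
$sample = @([pscustomobject]@{ IPAddress = '192.0.2.10'; Port = '8080'; Region = 'test' })
$proxyArgument = Get-ProxyArgument -ProxyList $sample
$proxyArgument

# With the Selenium assemblies loaded, the result would be passed to Chrome like this:
# $ChromeOption.AddArgument($proxyArgument)
```

Picking the entry at random is just one possible policy. Free proxies tend to be short-lived, so the pool should be re-validated before each run.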

# Load the Selenium assemblies first if they are not already loaded (adjust the paths to your setup)
#Import-Module D:\tools\Selenium\WebDriver.Support.dll
#Import-Module D:\tools\Selenium\WebDriver.dll
$proxyurl = 'http://www.66ip.cn/' # Free proxy list to scrape
$testurl = "https://www.baidu.com" # URL used to check whether a proxy works
$ChromeOption = New-Object OpenQA.Selenium.Chrome.ChromeOptions
$ChromeOption.AddExcludedArgument("enable-automation") # Hide the "controlled by automated test software" infobar
$ChromeOption.AddArguments("--start-maximized") # Open Chrome with a maximized window
$ChromeOption.AddArgument('--disable-blink-features=AutomationControlled') # Make "window.navigator.webdriver" report false
#$ChromeOption.AddArgument('--proxy-server=http://219.159.38.200:56210') # Route Chrome through a proxy when accessing the target site
$ChromeDriver = New-Object OpenQA.Selenium.Chrome.ChromeDriver($ChromeOption)
$ChromeDriver.Navigate().GoToUrl($proxyurl)
Start-Sleep -Seconds 5 # Give the page time to load
#region   https://www.89ip.cn
<#
$i = 0
$proxyIPs = @()
while ($true)
{
	$i++
	if ($i -ne 1)
	{
		$ChromeDriver.FindElementByLinkText('下一页').Click() # Go to the next page ("下一页" is the site's "next page" link text)
		Start-Sleep -Seconds 3
	}
	$trs = $ChromeDriver.FindElementsByCssSelector('tbody tr')
	if ($trs.Count -gt 0)
	{
		$j = 0
		foreach ($tr in $trs)
		{
			$j++
			$w = $j.ToString() + '/' + $trs.Count.ToString()
			$percent = "{0:0.0%}" -f ($j/$trs.Count)
			Write-Progress -Activity "Testing proxy addresses" -Status "Please wait, page $i, $w, $percent" -PercentComplete ($j/($trs.Count) * 100)
			
			$trinfo = $tr.Text -split ' '
			$recordtime = $trinfo[4] + " " + $trinfo[5]
			try
			{
				$testproxy = "http://{0}:{1}" -f ($trinfo[0]), ($trinfo[1])
				$testresult = Invoke-WebRequest -Uri $testurl -Proxy $testproxy -TimeoutSec 3 -ErrorAction Stop
				if ($testresult.StatusCode -eq 200)
				{
					Write-Host $testproxy
					$obj = New-Object psobject
					$obj | Add-Member -MemberType NoteProperty -Name IP -Value $trinfo[0] -Force
					$obj | Add-Member -MemberType NoteProperty -Name Port -Value $trinfo[1] -Force
					$obj | Add-Member -MemberType NoteProperty -Name Region -Value $trinfo[2] -Force
					$obj | Add-Member -MemberType NoteProperty -Name ISP -Value $trinfo[3] -Force
					$obj | Add-Member -MemberType NoteProperty -Name RecordTime -Value $recordtime -Force
					#$obj | Export-Csv d:\ProxyServerList-20210828.csv -Encoding UTF8 -Append -Force -NoTypeInformation
					$proxyIPs += $obj
				}
			}
			catch
			{
				#$errormsg = $_.Exception.Message
				#Write-Host "$testproxy Test Failed "
			}
		}
	}
    else
    {
        break
    }
}
$proxyIPs | Export-Csv d:\ProxyServerList-20210829.csv -Encoding UTF8 -Force -NoTypeInformation

#>
#endregion

#region http://www.66ip.cn/
$proxylist = @()
$regionnames = ($ChromeDriver.FindElementsByTagName('li') | Select-Object Text -Last 34).Text # The script treats the last 34 <li> items as the region links
foreach($regionname in $regionnames)
{
    $ChromeDriver.FindElementByLinkText($regionname).Click()
    sleep 3
    $trcount = ($ChromeDriver.FindElementsByTagName('tr') | Measure-Object).Count
    $filtercount = $trcount - 3 # Skip the first 3 <tr> rows and keep the rest
    $iplist = $ChromeDriver.FindElementsByTagName('tr') | Select-Object Text -Last $filtercount
    $j = 0
    if ($iplist.Count -gt 0)
    {
        foreach($ipstring in $iplist.Text)
        {
            $ipinfo = $ipstring -split ' '
            $ipaddress = $ipinfo[0]
            $ipport = $ipinfo[1]
            $ipregion = $ipinfo[2]
            $iptype = $ipinfo[3]
            $j++
            $w = $j.ToString() + '/' + $iplist.Count.ToString()
            $percent = "{0:0.0%}" -f ($j/$iplist.Count)
            Write-Progress -Activity "Testing proxy addresses" -Status "Please wait, region $ipregion, $w, $percent" -PercentComplete ($j/($iplist.Count) * 100)
	    		
            try
            {
                $testproxy = "http://{0}:{1}" -f $ipaddress, $ipport
	    		$testresult = Invoke-WebRequest -Uri $testurl -Proxy $testproxy -TimeoutSec 3 -ErrorAction Stop
	    		if ($testresult.StatusCode -eq 200)
	    		{
                    Write-Host $testproxy
	    			$obj = New-Object psobject
                    $obj |Add-Member -MemberType NoteProperty -Name IPAddress -Value $ipaddress -Force
                    $obj |Add-Member -MemberType NoteProperty -Name Port -Value $ipport -Force
                    $obj |Add-Member -MemberType NoteProperty -Name Region -Value $ipregion -Force
                    $proxylist +=$obj
                }
            }
            catch
            {
                # The proxy failed or timed out; skip it
            }
        }
    }
}
$proxylist | Select-Object IPAddress,Port,Region -Unique | Out-GridView # Show the de-duplicated working proxies in a grid window
#endregion
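
Both loops split each row's text on a single space and index into the result. That works only while the page renders exactly one space between columns and the non-proxy rows are skipped by count. A slightly more defensive variant might look like the sketch below. It assumes a row's text has the shape `IP port region ...`. The function name `ConvertTo-ProxyRecord` is hypothetical, and the check only confirms that the first column looks like a dotted IPv4 address, not that each octet is in range.

```powershell
# Hypothetical helper: turn one table row's text into a proxy record, or $null if the row is unusable.
function ConvertTo-ProxyRecord {
    param([string] $RowText)

    # Split on any run of whitespace so double spaces or tabs do not shift the columns.
    $fields = $RowText.Trim() -split '\s+'
    if ($fields.Count -lt 2) { return $null }

    # The first column must look like a dotted IPv4 address (header rows fail this check).
    if ($fields[0] -notmatch '^\d{1,3}(\.\d{1,3}){3}$') { return $null }

    # The second column must be an integer in the valid TCP port range.
    $port = 0
    if (-not [int]::TryParse($fields[1], [ref] $port)) { return $null }
    if ($port -lt 1 -or $port -gt 65535) { return $null }

    # The region column is optional.
    $region = ''
    if ($fields.Count -gt 2) { $region = $fields[2] }

    [pscustomobject]@{
        IPAddress = $fields[0]
        Port      = $port
        Region    = $region
    }
}

# Sample row only: 192.0.2.10 is a documentation address, not a working proxy.
ConvertTo-ProxyRecord '192.0.2.10  8080 Beijing HTTP'
```

Rows that return `$null` can simply be skipped before the `Invoke-WebRequest` test, which avoids spending a 3-second timeout on header rows or malformed entries.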

 

 
