PowerShell + Selenium Web Scraping: Proxies (03)
Posted PS_cmdlet
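The previous post covered basic Selenium operations. To actually scrape a target site with Selenium, some additional disguise is needed, for example configuring the browser to reach the target site through a proxy. This makes it less likely that the site identifies the traffic as a crawler and adds your real IP to its blacklist, which could leave your IP permanently blocked from the site or barred from specific content.
To that end, we found some free proxy listing sites on the web. Visiting the target site through the free proxy IPs they publish is comparatively safer, because your own IP is not exposed to the target site.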
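To recap the logic so far:
1. Decide on the target URL to scrape
2. Use a proxy IP to disguise yourself when visiting the target URL
3. Build a proxy IP pool, which still needs further validation and updating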
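Step 3 comes down to checking whether each candidate proxy can actually fetch a known page. A minimal sketch of that check as a reusable function follows; the function name `Test-ProxyEndpoint` and its parameters are hypothetical and only illustrate the idea, and the default test URL and 3-second timeout simply mirror the values used in the script in this post.

```powershell
# Hypothetical helper (not part of the original script): returns $true when
# a request to $TestUrl succeeds through the given proxy, $false otherwise.
function Test-ProxyEndpoint {
    param(
        [Parameter(Mandatory)] [string] $IPAddress,
        [Parameter(Mandatory)] [int]    $Port,
        [string] $TestUrl    = 'https://www.baidu.com',
        [int]    $TimeoutSec = 3
    )

    $proxy = 'http://{0}:{1}' -f $IPAddress, $Port
    try {
        # -ErrorAction Stop turns connection failures and timeouts into catchable errors
        $response = Invoke-WebRequest -Uri $TestUrl -Proxy $proxy -TimeoutSec $TimeoutSec -UseBasicParsing -ErrorAction Stop
        return ($response.StatusCode -eq 200)
    }
    catch {
        # Unreachable, refused, or too slow: treat the proxy as unusable
        return $false
    }
}

# Usage: nothing is expected to act as a proxy on this local port, so this should report False
Test-ProxyEndpoint -IPAddress '127.0.0.1' -Port 9 -TimeoutSec 1
```

Returning a plain boolean keeps the caller simple: it can filter a list of candidates with `Where-Object` instead of repeating the try/catch for every proxy source.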
# Uncomment to load the Selenium assemblies if they are not already loaded in this session
#ipmo D:\tools\Selenium\WebDriver.Support.dll
#ipmo D:\tools\Selenium\WebDriver.dll
$proxyurl = 'http://www.66ip.cn/'      # Site that lists free proxies
$testurl = "https://www.baidu.com"     # Page used to check whether a proxy works
$ChromeOption = New-Object OpenQA.Selenium.Chrome.ChromeOptions
$ChromeOption.AddExcludedArgument("enable-automation") # Hide the "controlled by automated test software" infobar
$ChromeOption.AddArguments("--start-maximized") # Open Chrome with a maximized window
$ChromeOption.AddArgument('--disable-blink-features=AutomationControlled') # Keep window.navigator.webdriver from reporting true
#$ChromeOption.AddArgument('--proxy-server=http://219.159.38.200:56210') # Route browser traffic through a proxy when visiting the target site
$ChromeDriver = New-Object OpenQA.Selenium.Chrome.ChromeDriver($ChromeOption)
$ChromeDriver.Navigate().GoToUrl($proxyurl)
Start-Sleep -Seconds 5 # Give the page time to load
#region https://www.89ip.cn
# Alternative proxy source, left disabled here. To use it, point $proxyurl at https://www.89ip.cn and remove the block-comment markers.
<#
$i = 0
$proxyIPs = @()
while ($true)
{
    $i++
    if ($i -ne 1)
    {
        # Move to the next page ('下一页' is the site's "next page" link text)
        $ChromeDriver.FindElementByLinkText('下一页').Click()
        Start-Sleep -Seconds 3
    }
    $trs = $ChromeDriver.FindElementsByCssSelector('tbody tr')
    if ($trs.Count -gt 0)
    {
        $j = 0
        foreach ($tr in $trs)
        {
            $j++
            $w = $j.ToString() + '/' + $trs.Count.ToString()
            $percent = "{0:0.0%}" -f ($j/$trs.Count)
            Write-Progress -Activity "Process test proxy address" -Status "Please wait, page $i, $w, $percent" -PercentComplete ($j/($trs.Count) * 100)
            # Row text is space separated: IP, port, region, ISP, record date, record time
            $trinfo = $tr.Text -split ' '
            $recordtime = $trinfo[4] + " " + $trinfo[5]
            try
            {
                $testproxy = "http://{0}:{1}" -f ($trinfo[0]), ($trinfo[1])
                $testresult = Invoke-WebRequest -Uri $testurl -Proxy $testproxy -TimeoutSec 3 -ErrorAction Stop
                if ($testresult.StatusCode -eq 200)
                {
                    Write-Host $testproxy
                    $obj = [pscustomobject]@{
                        IP         = $trinfo[0]
                        Port       = $trinfo[1]
                        Region     = $trinfo[2]
                        ISP        = $trinfo[3]
                        RecordTime = $recordtime
                    }
                    #$obj | epcsv d:\ProxyServerList-20210828.csv -Encoding UTF8 -Append -Force -NoTypeInformation
                    $proxyIPs += $obj
                }
            }
            catch
            {
                # Proxy failed or timed out: skip it
                #$errormsg = $_.Exception.Message
                #Write-Host "$testproxy Test Failed "
            }
        }
    }
    else
    {
        # No rows on this page: stop paging
        break
    }
}
$proxyIPs | Export-Csv d:\ProxyServerList-20210829.csv -Encoding UTF8 -Force -NoTypeInformation
#>
#endregion
#region http://www.66ip.cn/
# Note: the FindElementBy*/FindElementsBy* helpers belong to Selenium 3.x; Selenium 4 uses FindElement([OpenQA.Selenium.By]::...) instead.
$proxylist = @()
# Text of the last 34 <li> elements, used below as the region link text
$regionnames = ($ChromeDriver.FindElementsByTagName('li') | Select-Object Text -Last 34).Text
foreach ($regionname in $regionnames)
{
    $ChromeDriver.FindElementByLinkText($regionname).Click()
    Start-Sleep -Seconds 3
    # Skip the first 3 <tr> rows and treat the rest as proxy entries
    $rows = $ChromeDriver.FindElementsByTagName('tr')
    $filtercount = $rows.Count - 3
    if ($filtercount -gt 0)
    {
        $iplist = @($rows | Select-Object Text -Last $filtercount)
        $j = 0
        foreach ($ipstring in $iplist.Text)
        {
            # Row text is space separated: IP, port, region, type
            $ipinfo = $ipstring -split ' '
            $ipaddress = $ipinfo[0]
            $ipport = $ipinfo[1]
            $ipregion = $ipinfo[2]
            $iptype = $ipinfo[3]
            $j++
            $w = $j.ToString() + '/' + $iplist.Count.ToString()
            $percent = "{0:0.0%}" -f ($j/$iplist.Count)
            Write-Progress -Activity "Process test proxy address" -Status "Please wait, current region $ipregion, $w, $percent" -PercentComplete ($j/($iplist.Count) * 100)
            try
            {
                $testproxy = "http://{0}:{1}" -f $ipaddress, $ipport
                $testresult = Invoke-WebRequest -Uri $testurl -Proxy $testproxy -TimeoutSec 3 -ErrorAction Stop
                if ($testresult.StatusCode -eq 200)
                {
                    Write-Host $testproxy
                    $proxylist += [pscustomobject]@{
                        IPAddress = $ipaddress
                        Port      = $ipport
                        Region    = $ipregion
                    }
                }
            }
            catch
            {
                # Proxy failed or timed out: skip it
            }
        }
    }
}
# Show the de-duplicated list of working proxies in a grid view
$proxylist | Select-Object IPAddress, Port, Region -Unique | Out-GridView
#endregion
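
Once a proxy has passed the check, it can be fed back into Chrome through the `--proxy-server` argument that appears commented out near the top of the script. The sketch below shows one way to build that argument from an entry shaped like the rows in `$proxylist`. The helper name `Get-ProxyServerArgument` and the sample address are hypothetical, and the Selenium lines are left as comments because they need the WebDriver assemblies loaded.

```powershell
# Hypothetical helper (not part of the original script): formats the Chrome
# command-line argument that routes browser traffic through an HTTP proxy.
function Get-ProxyServerArgument {
    param(
        [Parameter(Mandatory)] [string] $IPAddress,
        [Parameter(Mandatory)] [string] $Port
    )
    '--proxy-server=http://{0}:{1}' -f $IPAddress, $Port
}

# Sample entry with the same properties as the rows collected in $proxylist.
# 203.0.113.10 is a documentation-only address, not a real proxy.
$sample = [pscustomobject]@{ IPAddress = '203.0.113.10'; Port = '8080'; Region = 'example' }

$proxyArgument = Get-ProxyServerArgument -IPAddress $sample.IPAddress -Port $sample.Port
$proxyArgument   # --proxy-server=http://203.0.113.10:8080

# With Selenium loaded, the argument would be applied before the driver is created:
# $ChromeOption = New-Object OpenQA.Selenium.Chrome.ChromeOptions
# $ChromeOption.AddArgument($proxyArgument)
# $ChromeDriver = New-Object OpenQA.Selenium.Chrome.ChromeDriver($ChromeOption)
```

In a real run, the sample entry could be replaced with `$proxylist | Get-Random` to pick a different working proxy for each browser session.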