VC CWebBrowser2: how to get the text content of a web page

Posted 2023-05-11


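Reference answer A: With CWebBrowser2, first use the Document property to obtain an IHTMLDocument2 pointer (the property returns an IDispatch, which is then queried for IHTMLDocument2). To reach the page elements, use IHTMLDocument2's all property (get_all in C++) to obtain an IHTMLElementCollection, then iterate over the collection with its item method and compare each element's ID until you have the IHTMLElement interface pointer for the element you want.
Finally, call pElem->get_innerText(&bstr); to read that element's text.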
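
The answer above is specific to C++ and COM. As a rough, platform-independent illustration of the same idea (walk every element, match on the id, then read the inner text), here is a minimal Python 3 sketch built on the standard-library html.parser module. It is not the MSHTML API: the names InnerTextById and inner_text_by_id and the sample markup are made up for illustration, and the sketch assumes markup with explicit closing tags.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class InnerTextById(HTMLParser):
    """Collect the text inside the first element whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while the parser is inside the target element
        self.done = False   # True once the target element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1     # nested element inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1      # entering the target element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def inner_text_by_id(html, target_id):
    """Return the stripped text of the element with the given id, or ''."""
    parser = InnerTextById(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


sample = '<div id="a">x</div><p id="msg">Hello <b>world</b><br>!</p><p>z</p>'
print(inner_text_by_id(sample, "msg"))  # Hello world!
```

Like get_innerText, the helper returns only the text, with nested tags such as the b element dropped.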

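Getting a web page's HTML text with Python

Python crawler basics

1. Getting the page text

Using the urllib2 package (Python 2), fetch the HTML text of a page from its URL and return it: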

# coding: utf-8
import sys
import urllib2

# Force the default string encoding to UTF-8 (a Python 2-only workaround)
reload(sys)
sys.setdefaultencoding("utf-8")

def getHtml(url):
    response = urllib2.urlopen(url)
    html = response.read()
    # Decode according to the page's encoding if needed
    # html = unicode(html, 'utf-8')
    return html

url = 'https://www.cnblogs.com/'
print getHtml(url)

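Or:

def getHtml(url):
    # Create a urllib2.Request instance, passing the target URL as its argument
    request = urllib2.Request(url)
    # Pass the Request object to urlopen(), which sends it to the server
    # and returns the response as a file-like object
    response = urllib2.urlopen(request)
    # The file-like object supports the usual file methods;
    # read() returns the whole body as a string, assigned here to html
    html = response.read()
    # Decode according to the page's encoding if needed
    # html = unicode(html, 'utf-8')
    return html

url = 'https://www.cnblogs.com/'
print getHtml(url)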
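
The snippets above target Python 2, where urllib2 exists. For readers on Python 3, the following is a minimal sketch of the same fetch, assuming urllib.request as the replacement module. The name get_html and the default_charset parameter are illustrative, not part of the original code. It also acts on the commented-out decoding step by using the charset declared in the response's Content-Type header.

```python
import urllib.request


def get_html(url, default_charset="utf-8"):
    """Fetch url and return the body decoded to str."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
        # Use the charset declared in Content-Type, else fall back to the default.
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset, errors="replace")
```

Calling get_html('https://www.cnblogs.com/') would return the page as a str. Passing errors="replace" keeps a mislabelled page from raising UnicodeDecodeError, at the cost of substituting replacement characters for undecodable bytes.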

Next, add a User-Agent (UA) and a timeout:

def getHtml(url):
    # Build the UA header
    ua_header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
    # Build the Request from the URL plus headers; it will carry the IE 9.0 User-Agent
    request = urllib2.Request(url, headers=ua_header)
    # Set the timeout in seconds
    response = urllib2.urlopen(request, timeout=60)
    html = response.read()
    return html

url = 'https://www.cnblogs.com/'
print getHtml(url)
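
A timeout is only useful if the caller deals with the failure it produces. The following Python 3 sketch, assuming urllib.request and urllib.error in place of urllib2, shows one way to catch the common failure modes. The function name get_html_or_none and the choice to return None on failure are illustrative assumptions.

```python
import socket
import urllib.error
import urllib.request

# Example User-Agent value; any browser-like string can be substituted.
UA_HEADER = {
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
}


def get_html_or_none(url, timeout=60):
    """Return the response body as bytes, or None if the request fails."""
    request = urllib.request.Request(url, headers=UA_HEADER)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 503.
        print("HTTP error:", err.code)
    except urllib.error.URLError as err:
        # DNS failure, refused connection, unsupported scheme, and so on.
        print("Request failed:", err.reason)
    except socket.timeout:
        # Can be raised directly if the timeout expires while reading the body.
        print("Timed out after", timeout, "seconds")
    return None
```

HTTPError is caught before URLError because it is a subclass of URLError. With the order reversed, the more specific handler would never run.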

Adding header fields:

def getHtml(url):
    ua = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
    # Pass the UA dict when constructing the Request
    request = urllib2.Request(url, headers=ua)
    # A specific header can also be added or changed with Request.add_header()
    request.add_header("Connection", "keep-alive")
    response = urllib2.urlopen(request)
    html = response.read()
    # Check the response status code
    print 'Response code:', response.code
    # Header values can be inspected with Request.get_header()
    print "Connection:", request.get_header("Connection")
    # Or, using the keyword argument
    print request.get_header(header_name="Connection")
    # print html
    return html
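
One detail of add_header() and get_header() is easy to trip over: the Request object stores each header name after applying str.capitalize(), and get_header() looks names up case-sensitively. The short Python 3 sketch below (urllib.request assumed in place of urllib2, which behaves the same way here) shows the effect. It builds a Request without sending anything, and the shortened User-Agent value is a placeholder.

```python
import urllib.request

request = urllib.request.Request("https://www.cnblogs.com/")
request.add_header("Connection", "keep-alive")
request.add_header("User-Agent", "Mozilla/5.0")

# Header names are stored via str.capitalize(), so "User-Agent"
# becomes "User-agent" inside the Request object.
print(request.get_header("Connection"))  # keep-alive
print(request.get_header("User-agent"))  # Mozilla/5.0
print(request.get_header("User-Agent"))  # None: the lookup is case-sensitive
print(sorted(request.header_items()))
```

Also note that urllib's HTTP handler sets Connection: close when it actually sends the request, so the keep-alive value here demonstrates the header API rather than enabling a persistent connection.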

Adding a random User-Agent:

# coding: utf-8
import random
import sys
import urllib2

# Force the default string encoding to UTF-8 (a Python 2-only workaround)
reload(sys)
sys.setdefaultencoding("utf-8")

def getHtml(url):
    # Define a UA pool and pick one entry at random for each call
    ua_list = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
        "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    ]
    user_agent = random.choice(ua_list)
    # print user_agent
    request = urllib2.Request(url)
    request.add_header("Connection", "keep-alive")
    request.add_header("User-Agent", user_agent)
    response = urllib2.urlopen(request, data=None, timeout=60)
    html = response.read()
    # print 'Response code:', response.code
    # print 'URL:', response.geturl()
    # print 'Info:', response.info()
    return html
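
For Python 3, the random-UA approach could be sketched as follows, assuming urllib.request in place of urllib2. The names UA_POOL, build_request, and the rng parameter are illustrative additions. Building the request in its own function lets the header logic be checked without touching the network.

```python
import random
import urllib.request

# Sample pool; the strings are illustrative and can be replaced.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
]


def build_request(url, rng=random):
    """Build a Request carrying a User-Agent picked at random from UA_POOL."""
    request = urllib.request.Request(url)
    request.add_header("Connection", "keep-alive")
    request.add_header("User-Agent", rng.choice(UA_POOL))
    return request


def get_html(url, timeout=60):
    """Fetch url with a randomly chosen User-Agent and return the raw bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Passing a seeded generator such as random.Random(0) as rng makes the choice reproducible, which helps when debugging which User-Agent a server rejected.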
