Web Crawlers: Basic Usage

Posted 2021-03-10 jr-xiaojian

GET requests

Basic GET request

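    // Imports needed by this snippet (Apache HttpClient 4.x)
    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    // 1. "Open the browser": create the HttpClient object
    CloseableHttpClient httpClient = HttpClients.createDefault();

    // 2. "Enter the address": create an HttpGet object for the GET request
    HttpGet get = new HttpGet("http://112.124.1.187/index.html?typeId=16");

    // 3. Send the request with the HttpClient object and receive the response
    //    (execute throws IOException; the enclosing method must declare or catch it)
    CloseableHttpResponse response = httpClient.execute(get);

    // 4. Parse the response and read the data
    if (response.getStatusLine().getStatusCode() == 200) {
        HttpEntity entity = response.getEntity();
        String content = EntityUtils.toString(entity, "utf-8");
        System.out.println(content);
    }

    // 5. Release the response and the client
    response.close();
    httpClient.close();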

GET request with parameters (they can be written directly after the address, but that hard-codes them)

    // 1. "Open the browser": create the HttpClient object
    //    (try-with-resources closes it when the block ends)
    try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
        // Target address, including its parameter: http://112.124.1.187/index.html?typeId=16
        // Create the URIBuilder from the address without the query string
        URIBuilder uriBuilder = new URIBuilder("http://112.124.1.187/index.html");
        // Add the parameter; setParameter returns the builder, so further
        // setParameter(key, value) calls can be chained for more parameters
        uriBuilder.setParameter("typeId", "16");
        // 2. "Enter the address": create the HttpGet object from the built URI
        HttpGet get = new HttpGet(uriBuilder.build());
        // 3. Send the request with the HttpClient object and receive the response
        try (CloseableHttpResponse response = httpClient.execute(get)) {
            // 4. Parse the response and read the data
            if (response.getStatusLine().getStatusCode() == 200) {
                HttpEntity entity = response.getEntity();
                String content = EntityUtils.toString(entity, "utf-8");
                System.out.println(content);
            }
        }
    } catch (URISyntaxException | IOException e) {
        e.printStackTrace();
    }

POST requests

Basic usage is the same as for GET: replace HttpGet with HttpPost.

POST request with parameters

    // 1. "Open the browser": create the HttpClient object
    CloseableHttpClient httpClient = HttpClients.createDefault();

    // Target address: http://112.124.1.187/index.html, with typeId=16 sent as a form parameter
    // 2. "Enter the address": create the HttpPost object for the POST request
    HttpPost post = new HttpPost("http://112.124.1.187/index.html");
    // 2.1 Declare a List that holds the form parameters
    List<NameValuePair> params = new ArrayList<>();
    // 2.2 Add the parameter
    params.add(new BasicNameValuePair("typeId", "16"));
    // 2.3 Create the form Entity object, which URL-encodes the parameters
    //     (this constructor throws UnsupportedEncodingException, an IOException)
    UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "utf-8");
    // 2.4 Attach the form Entity object to the POST request
    post.setEntity(formEntity);

    // 3. Send the request with the HttpClient object and receive the response
    CloseableHttpResponse response = null;
    try {
        response = httpClient.execute(post);
        // 4. Parse the response and read the data
        if (response.getStatusLine().getStatusCode() == 200) {
            HttpEntity entity = response.getEntity();
            String content = EntityUtils.toString(entity, "utf-8");
            System.out.println(content);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // Both close() calls can throw IOException, so the enclosing method must declare it
        if (response != null) {
            response.close();
        }
        httpClient.close();
    }
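
To make step 2.3 more concrete, here is a minimal sketch of what a URL-encoded form body looks like on the wire. It uses only the JDK's `URLEncoder` and is not how `UrlEncodedFormEntity` is implemented; the class name `FormEncodingSketch`, the method `encodeForm`, and the second parameter `q` are hypothetical and exist only for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration; not part of Apache HttpClient.
class FormEncodingSketch {

    // Encodes one name or value as application/x-www-form-urlencoded text.
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Joins the encoded name=value pairs with '&', in insertion order.
    static String encodeForm(Map<String, String> params) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> p : params.entrySet()) {
            body.add(enc(p.getKey()) + "=" + enc(p.getValue()));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("typeId", "16");
        params.put("q", "web crawler"); // hypothetical second parameter
        System.out.println(encodeForm(params)); // prints typeId=16&q=web+crawler
    }
}
```

The space in the second value becomes `+`, which is why the parameters are encoded by the library rather than concatenated by hand.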

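As with any connection-based operation, HttpClient connects once and disconnects, then connects and disconnects again the next time it is needed. This wastes resources, which is where the concept of a "pool" comes in.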
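
The idea of a pool can be sketched with the JDK alone. The toy class below is not HttpClient's implementation, and every name in it (`SimplePoolSketch`, `borrow`, `release`, `createdCount`) is hypothetical. It only shows the reuse behaviour: a released resource is handed out again instead of being created anew. Unlike a real connection pool, it limits only how many idle resources are kept, not how many are in use at once.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Toy pool for illustration only; not how HttpClient manages connections.
class SimplePoolSketch<T> {
    private final BlockingQueue<T> idle;   // resources waiting to be reused
    private final Supplier<T> factory;     // creates a resource when none is idle
    private final AtomicInteger created = new AtomicInteger();

    SimplePoolSketch(int maxIdle, Supplier<T> factory) {
        this.idle = new ArrayBlockingQueue<>(maxIdle);
        this.factory = factory;
    }

    // Reuse an idle resource if one exists; otherwise pay the cost of creating one.
    T borrow() {
        T resource = idle.poll();
        if (resource == null) {
            created.incrementAndGet();
            resource = factory.get();
        }
        return resource;
    }

    // Hand the resource back so that the next borrow() can reuse it.
    void release(T resource) {
        idle.offer(resource); // dropped if the idle queue is already full
    }

    int createdCount() {
        return created.get();
    }

    public static void main(String[] args) {
        SimplePoolSketch<Object> pool = new SimplePoolSketch<>(2, Object::new);
        Object first = pool.borrow();
        pool.release(first);
        Object second = pool.borrow();
        System.out.println(first == second);     // prints true
        System.out.println(pool.createdCount()); // prints 1
    }
}
```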

HttpClient connection pool

    public static void main(String[] args) {
        // Create the connection pool manager
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        // Set the maximum total number of connections
        cm.setMaxTotal(10);
        // Set the maximum number of connections per route (per target host)
        cm.setDefaultMaxPerRoute(2);

        // Send requests through the connection pool manager
        doGet(cm);
        doGet(cm);
    }

    private static void doGet(PoolingHttpClientConnectionManager cm) {
        // Build an HttpClient that draws its connections from the shared pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();

        HttpGet httpGet = new HttpGet("http://112.124.1.187");
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf-8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (response != null) {
                try {
                    // Closing the response releases its connection
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            // Do not close the HttpClient here; the pool manages the connections,
            // and closing the client would also shut down the shared pool
            // httpClient.close();
        }
    }

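Request parameters

Here, "request parameters" does not mean the parameters appended to the URL. It means the rules that have to be fixed in advance for the request itself. For example, because of network conditions or the target server, a request sometimes needs more time to complete, so we have to define the relevant timeouts ourselves.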

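    HttpGet get = new HttpGet("http://112.124.1.187/index.html?typeId=16");
    // Configure the request
    RequestConfig config = RequestConfig.custom()
            .setConnectTimeout(10000)           // maximum time to establish the connection, in milliseconds
            .setConnectionRequestTimeout(500)   // maximum time to obtain a connection from the connection manager, in milliseconds
            .setSocketTimeout(10 * 1000)        // maximum wait for data on the socket, in milliseconds
            .build();
    // Apply the configuration to the request
    get.setConfig(config);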
