Stop hyping Python crawlers, Java can obviously do it too | Crawling CSDN and Zhihu articles with Java

Posted by lwx-apollo

Preface

I've been learning Python for a while and have read a lot of Python articles, many of them about web crawlers. It got me thinking: Java can clearly do the same thing, so why does everyone treat crawling as if it were Python's exclusive feature?

I felt I had to speak up for Java, so I put together a Java crawler during my lunch break to share with you.

Adding the dependency

Pull in the jsoup crawler library:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
</dependency>
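
Before the full crawler, here is a minimal sketch of the jsoup flow used throughout this post: connect, set a User-Agent and timeout, fetch the Document, then select elements by CSS class. The URL and class name mirror the CSDN code below but are only illustrative.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupQuickStart {
    public static void main(String[] args) throws Exception {
        //fetch and parse the page
        Document doc = Jsoup.connect("https://www.csdn.net/nav/java")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0")
                .timeout(5000)
                .get();
        //select list items by class and print the linked titles
        Elements items = doc.body().getElementsByClass("clearfix");
        for (Element item : items) {
            System.out.println(item.select("div h2 a").text());
        }
    }
}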

Start crawling

Crawl the article list

/**
 *
 * @param type
 *          keyword / channel to crawl
 * @param size
 *          number of articles to crawl this run; values <= 0 default to 100, values above 1000 are capped at 1000
 * @return  the list of crawled articles
 * @throws IOException
 */
public static List<CrawlerArticle> searchCSDNList(String type, int size) throws IOException {
    if (size <= 0) {
        size = 100;
    } else if (size > 1000) {
        size = 1000;
    }
    int num = 1;
    //wrap each crawled article in a CrawlerArticle and collect them in an ArrayList
    List<CrawlerArticle> resultList = new ArrayList<CrawlerArticle>(size);
    while (true) {
        if (resultList.size() >= size) {
            break;
        }
//            String url = "https://so.csdn.net/so/search?q=" + type + "&t=blog&p=" + num;
        String url = "https://www.csdn.net/nav/" + type ;

        //open an HTTP Connection to the target URL
        Connection conn = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0")
                .timeout(1000)
                .method(Connection.Method.GET);
        //fetch and parse the page's HTML document
        Document doc = conn.get();
        Element body = doc.body();

        Elements articleList = body.getElementsByClass("clearfix");

        for (Element article : articleList) {
            CrawlerArticle articleEntity = new CrawlerArticle();
            //title
            Elements div_h2_a = article.getElementsByClass("title").select("div h2 a");
            if (div_h2_a != null && div_h2_a.size() > 0) {
                Element linkNode = div_h2_a.get(0);
                //article URL
                articleEntity.setAddress(linkNode.attr("href"));
                articleEntity.setTitle(linkNode.text());
            } else {
                continue;
            }

            Elements subscribeNums = article.getElementsByClass("is_digg click_heart");
            if (subscribeNums != null && subscribeNums.size() > 0) {
                articleEntity.setSubscribeNum(getNum(subscribeNums));
            }else {
                articleEntity.setSubscribeNum(0);
            }

            Elements descNodes = article.getElementsByClass("summary oneline");
            if (descNodes != null && descNodes.size() > 0) {
                Element descNode = descNodes.get(0);
                articleEntity.setSecondTitle(descNode.text());
            }

            //read count
            Elements readNums = article.getElementsByClass("read_num");
            if (readNums != null && readNums.size() > 0) {
                articleEntity.setReadNum(getNum(readNums));
            } else {
                continue;
            }

            Elements commonNums = article.getElementsByClass("common_num");
            if (commonNums != null && commonNums.size() > 0) {
                articleEntity.setCommentNum(getNum(commonNums));
            }else {
                articleEntity.setCommentNum(0);
            }
            Elements datetimes = article.getElementsByClass("datetime");
            if (datetimes != null && datetimes.size() > 0) {
                articleEntity.setPublishTime(datetimes.get(0).text());
            } else {
                articleEntity.setPublishTime(MyDateUtils.formatDate(new Date(), "yyyy-MM-dd"));
            }
            articleEntity.setBlogType("CSDN");

            System.out.println("文章原地址:" + articleEntity.getAddress());
            System.out.println("文章阅读数+++++++++++:" + articleEntity.getReadNum());
            //将阅读量大于100的url存储到数据库
            if (articleEntity.getReadNum() > 100) {
                resultList.add(articleEntity);
            }
            if (resultList.size() >= size) {
                break;
            }
        }
        //log how many list items were found on this page
        System.out.println("Items found on this page: " + articleList.size());
        //num was meant for paging with the commented-out search URL above; the /nav/ URL ignores it
        num++;
    }
    return resultList;
}
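
The listing above relies on a CrawlerArticle entity and a getNum helper that the post does not show. As an assumption, CrawlerArticle is a plain POJO with getters and setters for the fields used above (title, address, secondTitle, readNum, commentNum, subscribeNum, publishTime, blogType), and getNum, sitting in the same crawler class, might look roughly like this minimal sketch that pulls the digits out of the first matched element's text:

    //hypothetical helper: extract an integer from the first matched element's text,
    //e.g. "阅读数 1024" -> 1024; returns 0 if no digits are found (does not handle "1.2万"-style counts)
    private static int getNum(Elements elements) {
        String digits = elements.get(0).text().replaceAll("\\D+", "");
        return digits.isEmpty() ? 0 : Integer.parseInt(digits);
    }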

Crawl a single article

/**
 *
 * @param url
 *          blog URL to fetch
 * @param ipList
 *          list of proxy IPs from the pool
 */
private static void search(String url, List<String> ipList) {
    Thread thread = new Thread() {
        @Override
        public void run() {
            Connection conn = null;
            Document doc = null;
            int retries = 0;
            //retry up to 10 times, picking a random proxy from the pool each time
            out:
            while (retries < 10) {
                int random = new Random().nextInt(ipList.size());
                try {
                    conn = Jsoup.connect(url) 
                            .proxy(ipList.get(random), ipAndPort.get(ipList.get(random)))
                            .userAgent("Mozilla/5.0 (Windows NT 6.1; W…) Gecko/20100101 Firefox/60.0") 
                            .timeout(1000) 
                            .method(Connection.Method.GET);  
                    doc = conn.get();
                    break out;
                } catch (Exception e) {
                    retries++;
                }
            }

            //process the fetched HTML document (doc stays null if every retry failed)
            try {
                String s = doc.outerHtml();
                String title = doc.title();
                System.out.println(title);
                //TODO map the document into an entity class; see the list-crawling code above
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    };
    thread.start();

}
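
The original post does not show how these two pieces are wired together. As an assumption, a driver in the same CrawlCSDN class might crawl the list first and then fetch each article URL through the proxy pool (ipAndPort, defined in the next listing); a minimal sketch:

    public static void main(String[] args) throws IOException {
        //crawl the article list for the "java" channel, then fetch each article through a proxy
        List<CrawlerArticle> articles = searchCSDNList("java", 50);
        List<String> proxyIps = new ArrayList<>(ipAndPort.keySet());
        for (CrawlerArticle article : articles) {
            search(article.getAddress(), proxyIps);   //each call runs in its own thread
        }
    }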

/**
 * Self-built proxy IP pool, keyed by IP with the port as the value
 */
static Map<String, Integer> ipAndPort = new ConcurrentHashMap<>();
static {
    try {
        InputStream is = CrawlCSDN.class.getClassLoader().getResourceAsStream("ip.txt");
        //read it line by line via an IO stream
        BufferedReader br = new BufferedReader(new InputStreamReader(is));
        String line;
        while ((line = br.readLine()) != null) {
            //each line is expected to be "ip:port"
            String[] split = line.split(SymbolConstants.COLON_SYMBOL);
            if (split.length == 2) {
                ipAndPort.put(split[0], Integer.valueOf(split[1]));
            }
        }
        br.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
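
The pool is loaded from an ip.txt file on the classpath, one proxy per line in ip:port form (assuming SymbolConstants.COLON_SYMBOL is the colon character). The addresses below are placeholders, not working proxies:

127.0.0.1:8080
10.0.0.2:3128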

If you'd like the IP pool, leave a comment or send me a private message. Writing these posts takes effort, so if you found it useful, please like, favorite, and share!
