Stop hyping Python crawlers, Java can do this too | Crawling CSDN and Zhihu articles with Java

Posted by lwx-apollo

Preface

I have been learning Python recently and reading a lot of Python articles, many of them about Python crawlers. It got me wondering: Java can do exactly the same thing, so why does everyone treat web crawling as if it were a Python-only feature?

I felt I had to speak up for Java, so I put together a Java crawler during my lunch break to share with you.

Importing the dependency

Add the jsoup crawler dependency:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
</dependency>
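
Before moving on to the crawler itself, it is worth a quick smoke test that the dependency resolves. The snippet below is not from the original post, just a minimal sketch that fetches a page with jsoup and prints its title (the class name and URL are placeholders):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupSmokeTest {
    public static void main(String[] args) throws Exception {
        // fetch a page and print its <title> to confirm jsoup is on the classpath
        Document doc = Jsoup.connect("https://www.csdn.net")
                .userAgent("Mozilla/5.0")
                .timeout(5000)
                .get();
        System.out.println(doc.title());
    }
}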

Start crawling

Crawling the article list

/**
 *
 * @param type
 *          keyword/category to crawl
 * @param size
 *          number of articles to crawl in this run; must be greater than 0 and at most 1000
 * @return  the list of crawled articles
 * @throws IOException
 */
public static List<CrawlerArticle> searchCSDNList(String type, int size) throws IOException {
    if (size <= 0) {
        size = 100;
    } else if (size > 1000) {
        size = 1000;
    }
    int num = 1;
    // wrap each crawled article in a CrawlerArticle and collect them in an ArrayList
    List<CrawlerArticle> resultList = new ArrayList<CrawlerArticle>(size);
    while (true) {
        if (resultList.size() >= size) {
            break;
        }
//            String url = "https://so.csdn.net/so/search?q=" + type + "&t=blog&p=" + num;
        // note: num only matters for the paged search url above; the nav url always returns the same page
        String url = "https://www.csdn.net/nav/" + type;

        // open an HTTP Connection to the url
        Connection conn = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0")
                .timeout(1000)
                .method(Connection.Method.GET);
        // fetch the page's html document
        Document doc = conn.get();
        Element body = doc.body();

        Elements articleList = body.getElementsByClass("clearfix");

        for (Element article : articleList) {
            CrawlerArticle articleEntity = new CrawlerArticle();
            // title
            Elements div_h2_a = article.getElementsByClass("title").select("div h2 a");
            if (div_h2_a != null && div_h2_a.size() > 0) {
                Element linkNode = div_h2_a.get(0);
                // article url
                articleEntity.setAddress(linkNode.attr("href"));
                articleEntity.setTitle(linkNode.text());
            } else {
                continue;
            }

            // like count
            Elements subscribeNums = article.getElementsByClass("is_digg click_heart");
            if (subscribeNums != null && subscribeNums.size() > 0) {
                articleEntity.setSubscribeNum(getNum(subscribeNums));
            } else {
                articleEntity.setSubscribeNum(0);
            }

            // summary
            Elements descNodes = article.getElementsByClass("summary oneline");
            if (descNodes != null && descNodes.size() > 0) {
                Element descNode = descNodes.get(0);
                articleEntity.setSecondTitle(descNode.text());
            }

            // read count
            Elements readNums = article.getElementsByClass("read_num");
            if (readNums != null && readNums.size() > 0) {
                articleEntity.setReadNum(getNum(readNums));
            } else {
                continue;
            }

            // comment count
            Elements commonNums = article.getElementsByClass("common_num");
            if (commonNums != null && commonNums.size() > 0) {
                articleEntity.setCommentNum(getNum(commonNums));
            } else {
                articleEntity.setCommentNum(0);
            }

            // publish time, falling back to today's date
            Elements datetimes = article.getElementsByClass("datetime");
            if (datetimes != null && datetimes.size() > 0) {
                articleEntity.setPublishTime(datetimes.get(0).text());
            } else {
                articleEntity.setPublishTime(MyDateUtils.formatDate(new Date(), "yyyy-MM-dd"));
            }
            articleEntity.setBlogType("CSDN");

            System.out.println("Article url: " + articleEntity.getAddress());
            System.out.println("Article read count: " + articleEntity.getReadNum());
            // only keep articles with more than 100 reads (these are the ones stored in the database)
            if (articleEntity.getReadNum() > 100) {
                resultList.add(articleEntity);
            }
            if (resultList.size() >= size) {
                break;
            }
        }
        // print how many candidate articles this page contained
        System.out.println("Articles on this page: " + articleList.size());
        num++;
    }
    return resultList;
}
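
The method above calls a getNum helper and a CrawlerArticle entity that the post does not show. Purely as a hedged sketch, here is one way they might look; the field names are guesses inferred from the setters used above, and the regex-based number extraction is an assumption:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.select.Elements;

// Hypothetical helper: pull the first run of digits out of the matched elements' text,
// e.g. "1024 reads" -> 1024, falling back to 0 when no number is found.
private static int getNum(Elements elements) {
    Matcher matcher = Pattern.compile("\\d+").matcher(elements.text());
    return matcher.find() ? Integer.parseInt(matcher.group()) : 0;
}

// Hypothetical entity whose fields mirror the setters called above (getters and setters omitted).
public class CrawlerArticle {
    private String title;         // article title
    private String secondTitle;   // summary line
    private String address;       // article url
    private String publishTime;   // publish date, e.g. "yyyy-MM-dd"
    private String blogType;      // source site, e.g. "CSDN"
    private int readNum;          // read count
    private int commentNum;       // comment count
    private int subscribeNum;     // like count
}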

Crawling a single article

/**
 *
 * @param url
 *          the blog url to crawl
 * @param ipList
 *          list of proxy hosts from the proxy pool
 */
private static void search(String url, List<String> ipList) {
    Thread thread = new Thread() {
        @Override
        public void run() {
            Connection conn = null;
            Document doc = null;
            int retries = 0;
            // retry up to 10 times, picking a random proxy from the pool on each attempt
            out:
            while (retries < 10) {
                int random = new Random().nextInt(ipList.size());
                try {
                    conn = Jsoup.connect(url)
                            .proxy(ipList.get(random), ipAndPort.get(ipList.get(random)))
                            .userAgent("Mozilla/5.0 (Windows NT 6.1; W…) Gecko/20100101 Firefox/60.0")
                            .timeout(1000)
                            .method(Connection.Method.GET);
                    doc = conn.get();
                    break out;
                } catch (Exception e) {
                    retries++;
                }
            }

            // parse the fetched html document (doc stays null if every retry failed)
            try {
                String s = doc.outerHtml();
                String title = doc.title();
                System.out.println(title);
                //TODO map the document into an entity; see the article-list crawler above
            } catch (Exception e) {
            }
        }
    };
    thread.start();
}



/**
 * self-built proxy ip pool: host -> port
 */
static Map<String, Integer> ipAndPort = new ConcurrentHashMap<>();

static {
    try {
        InputStream is = CrawlCSDN.class.getClassLoader().getResourceAsStream("ip.txt");
        // read the file line by line as an IO stream
        BufferedReader br = new BufferedReader(new InputStreamReader(is));
        String line;
        while ((line = br.readLine()) != null) {
            String[] split = line.split(SymbolConstants.COLON_SYMBOL);
            if (split.length == 2) {
                ipAndPort.put(split[0], Integer.valueOf(split[1]));
            }
        }
        br.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
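
To tie the pieces together, here is a hedged usage sketch that could sit in the same CrawlCSDN class. It assumes ip.txt on the classpath holds one host:port entry per line (which is what the static block above parses); the "java" category and the size of 50 are arbitrary examples:

public static void main(String[] args) throws IOException {
    // crawl up to 50 articles from the "java" list page
    List<CrawlerArticle> articles = searchCSDNList("java", 50);

    // the proxy hosts come straight from the ip pool loaded above
    List<String> ipList = new ArrayList<>(ipAndPort.keySet());

    // fetch each article body on its own thread via search()
    for (CrawlerArticle article : articles) {
        search(article.getAddress(), ipList);
    }
}

Since search() spawns a raw Thread per URL, a fixed-size ExecutorService would keep the number of concurrent connections under control for larger crawls; the sketch simply keeps the original approach.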

If you need the IP pool, leave a comment or send me a private message. Writing this up takes effort, so if you found it useful, a like, a bookmark and a follow would be much appreciated!
