Stop hyping Python crawlers, my mighty Java can do it too | Scraping CSDN and Zhihu articles with Java
Posted by lwx-apollo
Preface
I've been studying Python for a while now and reading plenty of Python articles, many of them about web crawlers. It got me thinking: Java can do exactly the same thing, so why does everyone treat crawling as if it were a Python-only feature?
I felt I had to speak up for Java, so over my lunch break I threw together a Java crawler to share with you.
Adding dependencies
Pull in the jsoup crawler/HTML-parsing library:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
</dependency>
Start crawling
Crawling the article list
/**
 *
 * @param type
 *            keyword/category to crawl
 * @param size
 *            number of articles to crawl this run, capped at 1000 (non-positive values default to 100)
 * @return the list of crawled articles
 * @throws IOException
 */
public static List<CrawlerArticle> searchCSDNList(String type, int size) throws IOException {
    if (size <= 0) {
        size = 100;
    } else if (size > 1000) {
        size = 1000;
    }
    int num = 1;
    // Collect the crawled articles as CrawlerArticle entities in an ArrayList
    List<CrawlerArticle> resultList = new ArrayList<CrawlerArticle>(size);
    while (true) {
        if (resultList.size() >= size) {
            break;
        }
        // Paginated search URL (num is the page number):
        // String url = "https://so.csdn.net/so/search?q=" + type + "&t=blog&p=" + num;
        String url = "https://www.csdn.net/nav/" + type;
        // Build the HTTP Connection for the URL
        Connection conn = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0")
                .timeout(1000)
                .method(Connection.Method.GET);
        // Fetch and parse the page's HTML document
        Document doc = conn.get();
        Element body = doc.body();
        Elements articleList = body.getElementsByClass("clearfix");
        int sizeBefore = resultList.size();
        for (Element article : articleList) {
            CrawlerArticle articleEntity = new CrawlerArticle();
            // Title and article URL
            Elements div_h2_a = article.getElementsByClass("title").select("div h2 a");
            if (div_h2_a != null && div_h2_a.size() > 0) {
                Element linkNode = div_h2_a.get(0);
                articleEntity.setAddress(linkNode.attr("href"));
                articleEntity.setTitle(linkNode.text());
            } else {
                continue;
            }
            // Like count
            Elements subscribeNums = article.getElementsByClass("is_digg click_heart");
            if (subscribeNums != null && subscribeNums.size() > 0) {
                articleEntity.setSubscribeNum(getNum(subscribeNums));
            } else {
                articleEntity.setSubscribeNum(0);
            }
            // Summary
            Elements descNodes = article.getElementsByClass("summary oneline");
            if (descNodes != null && descNodes.size() > 0) {
                Element descNode = descNodes.get(0);
                articleEntity.setSecondTitle(descNode.text());
            }
            // Read count
            Elements readNums = article.getElementsByClass("read_num");
            if (readNums != null && readNums.size() > 0) {
                articleEntity.setReadNum(getNum(readNums));
            } else {
                continue;
            }
            // Comment count
            Elements commonNums = article.getElementsByClass("common_num");
            if (commonNums != null && commonNums.size() > 0) {
                articleEntity.setCommentNum(getNum(commonNums));
            } else {
                articleEntity.setCommentNum(0);
            }
            // Publish time, falling back to today's date
            Elements datetimes = article.getElementsByClass("datetime");
            if (datetimes != null && datetimes.size() > 0) {
                articleEntity.setPublishTime(datetimes.get(0).text());
            } else {
                articleEntity.setPublishTime(MyDateUtils.formatDate(new Date(), "yyyy-MM-dd"));
            }
            articleEntity.setBlogType("CSDN");
            System.out.println("Article URL: " + articleEntity.getAddress());
            System.out.println("Article read count+++++++++++: " + articleEntity.getReadNum());
            // Keep only articles with more than 100 reads (these get stored in the database later)
            if (articleEntity.getReadNum() > 100) {
                resultList.add(articleEntity);
            }
            if (resultList.size() >= size) {
                break;
            }
        }
        System.out.println("Articles found on this page++++++++++++: " + articleList.size());
        // Guard against an infinite loop: the active /nav/ URL is not paginated, so if a
        // pass over the page adds nothing new, stop instead of re-fetching the same page
        if (resultList.size() == sizeBefore) {
            break;
        }
        num++;
    }
    return resultList;
}
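The listing above calls a getNum helper and fills a CrawlerArticle entity, neither of which is shown in the post. CrawlerArticle is presumably a plain POJO whose fields match the setters above (title, address, secondTitle, readNum, subscribeNum, commentNum, publishTime, blogType). Here is a minimal sketch of what getNum might look like, assuming it pulls the first run of digits out of the matched element's text:

// Hypothetical helper, inferred from the calls above: extract the integer
// embedded in the first matched element's text, e.g. "阅读数 1024" -> 1024
private static int getNum(Elements elements) {
    String text = elements.get(0).text();
    String digits = text.replaceAll("\\D+", ""); // strip everything that is not a digit
    return digits.isEmpty() ? 0 : Integer.parseInt(digits);
}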
Crawling a single article
/**
 *
 * @param url
 *            the blog article's URL
 * @param ipList
 *            the proxy pool list
 */
private static void search(String url, List<String> ipList) {
    Thread thread = new Thread() {
        @Override
        public void run() {
            Connection conn = null;
            Document doc = null;
            int retries = 0;
            while (retries < 10) {
                // Pick a random proxy from the pool for each attempt
                int random = new Random().nextInt(ipList.size());
                try {
                    conn = Jsoup.connect(url)
                            .proxy(ipList.get(random), ipAndPort.get(ipList.get(random)))
                            .userAgent("Mozilla/5.0 (Windows NT 6.1; W…) Gecko/20100101 Firefox/60.0")
                            .timeout(1000)
                            .method(Connection.Method.GET);
                    doc = conn.get();
                    break;
                } catch (Exception e) {
                    // This proxy failed or timed out; retry with another one
                    retries++;
                }
            }
            // All retries exhausted without a successful fetch
            if (doc == null) {
                System.out.println("Failed to fetch after 10 retries: " + url);
                return;
            }
            // Process the page's HTML document
            try {
                String s = doc.outerHtml();
                String title = doc.title();
                System.out.println(title);
                // TODO: map the page into an entity, as in the article-list crawler above
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    };
    thread.start();
}
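Putting the two pieces together, a hypothetical caller could crawl the article list first and then fetch each article through the proxy pool. The keyword "ai" and the article count are placeholders:

public static void main(String[] args) throws IOException {
    // Crawl up to 20 qualifying articles from the chosen category
    List<CrawlerArticle> articles = searchCSDNList("ai", 20);
    // Use the proxies loaded from ip.txt (see the pool below)
    List<String> ipList = new ArrayList<>(ipAndPort.keySet());
    for (CrawlerArticle article : articles) {
        // Each call spawns a background thread that fetches one article
        search(article.getAddress(), ipList);
    }
}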
/**
 * Self-built proxy IP pool
 */
static Map<String, Integer> ipAndPort = new ConcurrentHashMap<>();

static {
    try {
        InputStream is = CrawlCSDN.class.getClassLoader().getResourceAsStream("ip.txt");
        // getResourceAsStream returns null when ip.txt is missing from the classpath
        if (is == null) {
            throw new IllegalStateException("ip.txt not found on the classpath");
        }
        // Read the file line by line as an IO stream
        BufferedReader br = new BufferedReader(new InputStreamReader(is));
        String line;
        while ((line = br.readLine()) != null) {
            String[] split = line.split(SymbolConstants.COLON_SYMBOL);
            if (split.length == 2) {
                ipAndPort.put(split[0], Integer.valueOf(split[1]));
            }
        }
        br.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
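For completeness, each line of ip.txt is expected to hold a host and port separated by a colon (assuming SymbolConstants.COLON_SYMBOL is ":"). A made-up example file:

127.0.0.1:8080
10.0.2.15:3128
192.168.1.100:8888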
If you need the IP pool, leave a comment or send me a private message. Writing these up takes real effort, so if you enjoyed it, please drop a like, favorite, and comment!