Java example: crawling a website's content and URLs

Posted by 土木转行的人才



Source: https://blog.csdn.net/weixin_38409425/article/details/78616688 (the original author's blog post)

The full code is as follows:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;
import java.net.URLConnection;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Web crawler
 *
 * @author jacke 陈
 *
 */
public class SpirderUrl {

    public static void spiderURL(String url, String regex, String filename) {

        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");

        String time = sdf.format(new Date());
        System.out.println(time);

        URL realURL = null;
        URLConnection connection = null;
        BufferedReader br = null;
        PrintWriter pw = null;
        PrintWriter pw1 = null;

        Pattern pattern = Pattern.compile(regex);
        try {
            realURL = new URL(url);
            connection = realURL.openConnection();
            // connection.connect();

            File fileDir = new File("E:/spider/" + time);
            if (!fileDir.exists()) {
                fileDir.mkdirs();
            }
            // Write the crawled content to the matching directory on drive E:
            pw = new PrintWriter(
                    new FileWriter("E:/spider/" + time + "/" + filename + "_content.txt"), true);
            pw1 = new PrintWriter(
                    new FileWriter("E:/spider/" + time + "/" + filename + "_URL.txt"), true);

            br = new BufferedReader(new InputStreamReader(connection.getInputStream()));
            String line = null;

            // Read the page line by line: save every line, and save each URL match separately
            while ((line = br.readLine()) != null) {
                pw.println(line);
                Matcher matcher = pattern.matcher(line);
                while (matcher.find()) {
                    pw1.println(matcher.group());
                }
            }
            System.out.println("Crawl succeeded!");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Any of these may still be null if an exception was thrown early,
            // so check before closing to avoid a NullPointerException
            try {
                if (br != null) {
                    br.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            if (pw != null) {
                pw.close();
            }
            if (pw1 != null) {
                pw1.close();
            }
        }

    }

    public static void main(String[] args) {
        String url = "https://www.cnblogs.com/csh520mjy/p/";
        // In Java source, each regex backslash is written as "\\"
        String regex = "(http|https)://[\\w+\\.?/?]+\\.[A-Za-z]+";
        spiderURL(url, regex, "8btc");
    }

}
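
The read-and-match loop in spiderURL can also be tried without a network connection or an E: drive. The following is only a minimal sketch, not the original author's code: the class SpiderLoopSketch and the method copyAndExtract are hypothetical names, the input comes from any Reader instead of a URLConnection, and the pattern is assumed to be the one from main written with ordinary Java escaping (`\\w`, `\\.`). It uses try-with-resources so that the reader and both writers are closed even when an exception occurs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the original post.
class SpiderLoopSketch {

    // Assumed to be the pattern intended in main(), with normal Java string escaping.
    static final Pattern URL_PATTERN =
            Pattern.compile("(http|https)://[\\w+\\.?/?]+\\.[A-Za-z]+");

    // Copies every line from source to contentOut and writes each regex match
    // to urlOut, one per line. Returns the number of matches found.
    static int copyAndExtract(Reader source, Writer contentOut, Writer urlOut, Pattern pattern) {
        int found = 0;
        // try-with-resources closes all three resources automatically
        try (BufferedReader br = new BufferedReader(source);
             PrintWriter pw = new PrintWriter(contentOut, true);
             PrintWriter pw1 = new PrintWriter(urlOut, true)) {
            String line;
            while ((line = br.readLine()) != null) {
                pw.println(line);
                Matcher matcher = pattern.matcher(line);
                while (matcher.find()) {
                    pw1.println(matcher.group());
                    found++;
                }
            }
        } catch (IOException e) {
            // Rethrow as unchecked so callers are not forced to handle IOException
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a downloaded page
        String page = "<a href=\"https://www.cnblogs.com/csh520mjy/p/\">blog</a>";
        StringWriter content = new StringWriter();
        StringWriter urls = new StringWriter();
        int count = copyAndExtract(new StringReader(page), content, urls, URL_PATTERN);
        System.out.println(count + " match(es):");
        System.out.print(urls);
    }
}
```

Note that this pattern always ends on a dot followed by letters, so anything after the last such segment is dropped: for the sample line above the single match is https://www.cnblogs.com, without the /csh520mjy/p/ path. A URL such as http://example.com/a/index.html is matched in full because it ends in .html.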

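Crawl results: running the program creates a directory named after the current date under E:/spider/ and writes two files into it: 8btc_content.txt (the page source, line by line) and 8btc_URL.txt (every URL matched by the regular expression).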
