Java example: crawling a website's content and URLs
Posted by 土木转行的人才
https://blog.csdn.net/weixin_38409425/article/details/78616688 (source: this blogger's original post)
The code is as follows:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;
import java.net.URLConnection;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
 * Web crawler: saves a page's raw content and the URLs found in it.
 *
 * @author jacke 陈
 */
public class SpirderUrl {

    public static void spiderURL(String url, String regex, String filename) {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        String time = sdf.format(new Date());
        System.out.println(time);
        URL realURL = null;
        URLConnection connection = null;
        BufferedReader br = null;
        PrintWriter pw = null;
        PrintWriter pw1 = null;
        Pattern pattern = Pattern.compile(regex);
        try {
            realURL = new URL(url);
            connection = realURL.openConnection();
            // connection.connect();
            // One output directory per day: E:/spider/<yyyy-MM-dd>
            File fileDir = new File("E:/spider/" + time);
            if (!fileDir.exists()) {
                fileDir.mkdirs();
            }
            // Store the crawled content in that directory on drive E:
            pw = new PrintWriter(
                    new FileWriter("E:/spider/" + time + "/" + filename + "_content.txt"), true);
            pw1 = new PrintWriter(
                    new FileWriter("E:/spider/" + time + "/" + filename + "_URL.txt"), true);
            br = new BufferedReader(new InputStreamReader(connection.getInputStream()));
            String line = null;
            // Read the page line by line: copy each line to the content file
            // and write every regex match to the URL file
            while ((line = br.readLine()) != null) {
                pw.println(line);
                Matcher matcher = pattern.matcher(line);
                while (matcher.find()) {
                    pw1.println(matcher.group());
                }
            }
            System.out.println("Crawl succeeded!");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Any of these can still be null if an exception was thrown early,
            // so check before closing to avoid a NullPointerException
            try {
                if (br != null) {
                    br.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            if (pw != null) {
                pw.close();
            }
            if (pw1 != null) {
                pw1.close();
            }
        }
    }
    public static void main(String[] args) {
        String url = "https://www.cnblogs.com/csh520mjy/p/";
        // In a Java string literal, \\w and \\. become the regex tokens \w and \.
        String regex = "(http|https)://[\\w+\\.?/?]+\\.[A-Za-z]+";
        spiderURL(url, regex, "8btc");
    }
}
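Crawl result: running main writes two files under E:/spider/<current date>/, 8btc_content.txt (the raw page) and 8btc_URL.txt (the matched URLs).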
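To check what the URL pattern actually matches without touching the network or the E: drive, the matching loop can be tried on a hard-coded line. The following is only a minimal sketch: the class name UrlRegexDemo, the helper extractUrls, and the example.com sample line are all assumptions made for illustration and are not part of the original post. It uses the same pattern as main, written with ordinary Java string escaping (\\w, \\.).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
class UrlRegexDemo {
    // Same pattern as in main, with single-level escaping in the Java string.
    static final Pattern URL_PATTERN =
            Pattern.compile("(http|https)://[\\w+\\.?/?]+\\.[A-Za-z]+");

    // Collects every match of URL_PATTERN in one line of text, in order.
    static List<String> extractUrls(String line) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(line);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up sample line; example.com is a placeholder domain.
        String line = "<a href=\"https://www.example.com/p/123.html\">post</a>";
        System.out.println(extractUrls(line));
    }
}
```

Running it prints [https://www.example.com/p/123.html]. Note that the pattern must end in a dot followed by letters, so a URL such as https://b.org/x is cut back to https://b.org; a looser pattern may be needed if full paths matter.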