A Simple Web Crawler Written in Java
Posted by SuperShuiShui
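What it does:
Crawls download links for movies.
Main technologies:
1. Jsoup: an HTML parser for Java
2. Regular expressions
Result demo:
All of the extracted links can be downloaded with Xunlei or directly in a browser.
Workflow:
Pick the target ---> fetch the page source ---> extract the target information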
Fetching the page source:
// Parse a web page. url_str: target page URL; matchValue: class name to match
public static String parseWeb(String url_str, String matchValue) throws IOException {
    URL url = new URL(url_str);
    Document document = Jsoup.parse(url, 30000); // parse(URL url, int timeoutMillis)
    Elements elements = document.getElementsByClass(matchValue);
    // Combined inner HTML of all matched elements
    return elements.html();
}
Extracting the target information:
// Collect every substring of result that matches regex
public static ArrayList<String> matchResources(String result, String regex) {
    ArrayList<String> strs = new ArrayList<String>();
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(result);
    while (matcher.find()) {
        // Each match is prefixed with the site's base URL
        strs.add(DEFAULT_URL + matcher.group());
    }
    return strs;
}
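To make the matching step easier to try in isolation, here is a minimal sketch that runs the same find-loop over a hard-coded string instead of a downloaded page. The class name `RegexExtractDemo`, the base URL `https://example.com/`, and the HTML fragment are all made up for illustration; only the loop itself mirrors the method above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexExtractDemo {
    // Hypothetical base URL, used only for this illustration.
    static final String BASE_URL = "https://example.com/";

    // Collect every substring of text that matches regex, prefixed with BASE_URL.
    static List<String> matchResources(String text, String regex) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(BASE_URL + matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up HTML fragment containing two relative page links.
        String html = "<a href=\"/html/202001/1234.html\">A</a>"
                    + "<a href=\"/html/202002/56789.html\">B</a>";
        // In Java source, \\d is the regex \d (one digit) and \\. is a literal dot.
        List<String> pages = matchResources(html, "html/\\d{6}/\\d{4,}\\.html");
        for (String page : pages) {
            System.out.println(page);
        }
    }
}
```

Note that a regex only sees text, so it will also match strings that appear outside of link attributes; for stricter extraction one would read the href attributes through the HTML parser instead.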
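These two functions do most of the work, so I put them in a utility class.
Utility class:
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class MyUtil {
    private static final String DEFAULT_URL = "*********"; // redacted by the author, who did not want to spread adult content

    // Parse a web page. url_str: target page URL; matchValue: class name to match
    public static String parseWeb(String url_str, String matchValue) throws IOException {
        URL url = new URL(url_str);
        Document document = Jsoup.parse(url, 30000); // parse(URL url, int timeoutMillis)
        Elements elements = document.getElementsByClass(matchValue);
        // Combined inner HTML of all matched elements
        return elements.html();
    }

    // Collect every substring of result that matches regex
    public static ArrayList<String> matchResources(String result, String regex) {
        ArrayList<String> strs = new ArrayList<String>();
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(result);
        while (matcher.find()) {
            // Each match is prefixed with the site's base URL
            strs.add(DEFAULT_URL + matcher.group());
        }
        return strs;
    }

    // Remove duplicates in place (a HashSet does not preserve the original order)
    public static ArrayList<String> removeDuplicate(ArrayList<String> strs) {
        Set<String> set = new HashSet<String>(strs);
        strs.clear();
        strs.addAll(set);
        return strs;
    }
}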
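One detail of the de-duplication step is worth seeing on its own: a `HashSet` gives no guarantee about iteration order, so the list may come back reordered. If the page order matters, a `LinkedHashSet` is one alternative. The sketch below is not part of the original utility class; the class name `DedupDemo` and the sample file names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class DedupDemo {
    // Remove duplicates while keeping the first occurrence of each item in place.
    static List<String> removeDuplicateKeepOrder(List<String> items) {
        // LinkedHashSet remembers insertion order, unlike HashSet.
        return new ArrayList<>(new LinkedHashSet<>(items));
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("b.html", "a.html", "b.html", "c.html");
        System.out.println(removeDuplicateKeepOrder(urls)); // prints [b.html, a.html, c.html]
    }
}
```

This version returns a new list instead of modifying its argument, which also makes it safe to call with immutable lists.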
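Next, define an interface:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public interface ICL {
    // Walk the given category and return the URLs of all pages that hold resources
    public ArrayList<String> method1(String kind) throws IOException;

    // Iterate over the given URLs, calling method3 on each, and return all resources
    public ArrayList<String> method2(List<String> strs) throws IOException;

    // Return the download resource found at the given URL
    public String method3(String str) throws IOException;

    // Print the links
    public void print(ArrayList<String> links);
}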
Its implementation class:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ClImpl implements ICL {
    private static final String DEFAULT_URL = "***********"; // redacted by the author, who did not want to spread adult content

    public ClImpl() {
    }

    // Walk the given category and return all of its page URLs. kind: the category path
    @Override
    public ArrayList<String> method1(String kind) throws IOException {
        String result = MyUtil.parseWeb(DEFAULT_URL + kind, "row col5 clearfix");
        // \\d matches a digit; \\. matches a literal dot
        ArrayList<String> urls = MyUtil.matchResources(result, "html/\\d{6}/\\d{4,}\\.html");
        return MyUtil.removeDuplicate(urls);
    }

    // Iterate over the given URLs, calling method3 on each, and return all resources
    @Override
    public ArrayList<String> method2(List<String> urls) throws IOException {
        ArrayList<String> links = new ArrayList<String>();
        for (String url : urls) {
            links.add(method3(url));
        }
        return links;
    }

    // Return the download resource found at the given URL
    @Override
    public String method3(String url) throws IOException {
        String result = MyUtil.parseWeb(url, "download");
        ArrayList<String> links = MyUtil.matchResources(result, "https://.*?mp4");
        // Throws IndexOutOfBoundsException if the page contains no matching link
        String link = links.get(0);
        // matchResources prepends DEFAULT_URL to every match; dropping the first 23
        // characters (presumably the length of the redacted URL) restores the absolute link
        return link.substring(23);
    }

    @Override
    public void print(ArrayList<String> links) {
        for (String string : links) {
            System.out.println(string);
        }
    }
}
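The `substring(23)` call in `method3` only works while the prefix added by `matchResources` is exactly 23 characters long. A minimal sketch of a less fragile variant follows, assuming the intent is to strip that prefix from a link that is already absolute. The class name `PrefixStripDemo`, the helper `stripBase`, and both URLs are hypothetical.

```java
class PrefixStripDemo {
    // Hypothetical base URL standing in for the redacted DEFAULT_URL.
    static final String BASE_URL = "https://example.com/";

    // Drop the base URL if the link starts with it; otherwise return the link unchanged.
    static String stripBase(String link) {
        return link.startsWith(BASE_URL) ? link.substring(BASE_URL.length()) : link;
    }

    public static void main(String[] args) {
        // Simulates what matchResources yields for a match that was already absolute.
        String prefixed = BASE_URL + "https://cdn.example.com/video/1.mp4";
        System.out.println(stripBase(prefixed));
    }
}
```

Deriving the offset from the constant means the code keeps working if the base URL changes, and the `startsWith` check avoids cutting into a link that was never prefixed.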
With the main functionality in place, here is a simple test class:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Scanner;

public class Test01 {
    static ArrayList<String> links = new ArrayList<String>();
    static ClImpl c = new ClImpl();
    static Scanner scan = new Scanner(System.in);

    public static void main(String[] args) throws IOException {
        boolean flag = true;
        while (flag) {
            System.out.println("1-三上悠亚 2-桥本有菜 3-深田咏美 4-波多野结衣 5-吉泽明步 886-Exit");
            int key = scan.nextInt();
            switch (key) {
            case 1:
                int indexOfSsyy = 2;
                start("av/ssyy/");
                turnPage("av/ssyy/", indexOfSsyy);
                break;
            case 2:
                int indexOfQbyc = 2;
                start("av/qbyc/");
                turnPage("av/qbyc/", indexOfQbyc);
                break;
            case 3:
                int indexOfStym = 2;
                start("av/stym/");
                turnPage("av/stym/", indexOfStym);
                break;
            case 4:
                int indexOfBdyjy = 2;
                start("av/bdyjy/");
                turnPage("av/bdyjy/", indexOfBdyjy);
                break;
            case 5:
                int indexOfJzmb = 2;
                start("av/jzmb/");
                turnPage("av/jzmb/", indexOfJzmb);
                break;
            case 886:
                System.out.println("Exited");
                flag = false;
                break;
            default:
                System.out.println("Invalid input, please try again");
                break;
            }
        }
    }

    // Crawl one listing page and print every download link found on it
    public static void start(String kind) throws IOException {
        links = c.method1(kind);
        links = c.method2(links);
        c.print(links);
    }

    // Ask whether to continue; page N of a category lives at kind + "index_N.html"
    public static void turnPage(String kind, int index) throws IOException {
        System.out.println("1-Next page  any other number-Back to the previous menu");
        int menu = scan.nextInt();
        switch (menu) {
        case 1:
            start(kind + "index_" + index + ".html");
            index++;
            turnPage(kind, index);
            break;
        default:
            break;
        }
    }
}
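The recursive `turnPage` adds one stack frame per page turned. As a rough sketch of the same paging logic written as a loop, the example below only builds the page paths and does no crawling. The class name `PagingDemo`, the helpers `pagePath` and `browse`, and the category path `list/demo/` are made up, and the user input is simulated with a `Scanner` over a string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class PagingDemo {
    // Page 1 is the category path itself; later pages follow the index_N.html pattern.
    static String pagePath(String kind, int index) {
        return index == 1 ? kind : kind + "index_" + index + ".html";
    }

    // Visit page 1, then keep advancing while the next input number is 1.
    static List<String> browse(Scanner in, String kind) {
        List<String> visited = new ArrayList<>();
        int index = 1;
        visited.add(pagePath(kind, index));
        while (in.hasNextInt() && in.nextInt() == 1) {
            index++;
            visited.add(pagePath(kind, index));
        }
        return visited;
    }

    public static void main(String[] args) {
        // Simulated input: "next page" twice, then any other number to go back.
        Scanner simulated = new Scanner("1 1 0");
        for (String path : browse(simulated, "list/demo/")) {
            System.out.println(path);
        }
    }
}
```

In the real program each visited path would be passed to `start` instead of being collected in a list.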
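That is the whole process of building this little movie crawler. If you are interested in crawlers, you can use it as a reference and adapt it to your own needs. The code is admittedly rough.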