java爬虫下载FTP网站目录文件

Posted 2023-02-09 wblearn

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了java爬虫下载FTP网站目录文件相关的知识，希望对你有一定的参考价值。

java爬虫下载FTP网站目录文件

写在前面
ftp网站带目录递归爬取
java多线程爬虫
写在最后

写在前面

爬虫的本质就是自动化的去模拟正常人类发起的网络请求，然后获取网络请求所返回的数据。跟我们人手动去点击一个连接，访问一个网页获取数据，并没有什么本质的区别。下面用java的方式来爬虫

ftp网站带目录递归爬取

爬取的ftp网站地址

http://learning.happymmall.com/

FTP网站带目录递归爬取的思路，可以参考python爬取的思路

同样的，java利用jsoup库也是按照这个思路爬取的，代码贴出来:

public void get_url(String page_url)
        Document doc = null;
        String url = "";
        try 
            // 获取需要爬取数据的地址集
            List<String> nameList = new ArrayList<String>();
            List<String> hrefList = new ArrayList<String>();
            try 
                url = page_url;
                doc = Jsoup.connect(url).timeout(3000).get(); // 设置连接超时时间
             catch (Exception e2) 
                System.out.println(e2);
            
            Elements links = doc.select("a");
            for (Element element : links) 
                String href = element.attr("href");
                hrefList.add(href);
                String name = element.text();
                nameList.add(name);
            

            for(int i=1;i<nameList.size();i++)
                String name = nameList.get(i);
                String href = hrefList.get(i);
                if(name.contains("../"))
                    continue;
                else 
                    if (name.contains("/"))
                        String newHref = url+href;
                        get_url(newHref);
                    else 
                        FileUrlDownloadUtil.downloadFile(url,href,"Z:\\\\tmp");
                    
                
            
         catch (Exception e) 
            e.printStackTrace();

public class  FileUrlDownloadUtil

    /**
     * 说明：根据指定URL将文件下载到指定目标位置
     * @param urlPath
     *            下载路径
     * @param downloadDir
     *            文件存放目录
     * @return 返回下载文件
     */
    @SuppressWarnings("finally")
    public static File downloadFile(String dominUrl,String urlPath, String downloadDir) 
        File file = null;
        try 
            System.setProperty("https.protocols", "TLSv1,TLSv1.1,TLSv1.2,SSLv3");
            // 统一资源
            URL url = new URL(dominUrl+urlPath);
            // 连接类的父类，抽象类
            URLConnection urlConnection = url.openConnection();
            // http的连接类
            HttpURLConnection httpURLConnection = (HttpURLConnection) urlConnection;
            //设置超时
            httpURLConnection.setConnectTimeout(1000*5);
            //设置请求方式，默认是GET
            httpURLConnection.setRequestMethod("GET");
            // 设置字符编码
            httpURLConnection.setRequestProperty("Charset", "UTF-8");
            // 打开到此 URL引用的资源的通信链接（如果尚未建立这样的连接）。
            httpURLConnection.connect();
            // 文件大小
            int fileLength = httpURLConnection.getContentLength();

            // 控制台打印文件大小
            System.out.println("您要下载的文件大小为:" + fileLength / (1024 * 1024) + "MB");

            // 建立链接从请求中获取数据
            URLConnection con = url.openConnection();
            BufferedInputStream bin = new BufferedInputStream(httpURLConnection.getInputStream());
            // 指定文件名称(有需求可以自定义)
            String fullUrl = dominUrl+urlPath;
            String fileFullName=(dominUrl+urlPath).substring("https://sec.sipsik.net/".length(),fullUrl.length());

            // 指定存放位置(有需求可以自定义)
            String path = downloadDir + File.separatorChar + fileFullName;
            file = new File(path);
            // 校验文件夹目录是否存在，不存在就创建一个目录
            if (!file.getParentFile().exists()) 
                file.getParentFile().mkdirs();
            

            OutputStream out = new FileOutputStream(file);
            int size = 0;
            int len = 0;
            byte[] buf = new byte[2048];
            while ((size = bin.read(buf)) != -1) 
                len += size;
                out.write(buf, 0, size);
                // 控制台打印文件下载的百分比情况
                System.out.println("下载了-------> " + len * 100 / fileLength + "%\\n");
            
            // 关闭资源
            bin.close();
            out.close();
            System.out.println("文件下载成功！");
         catch (MalformedURLException e) 
            // TODO Auto-generated catch block
            e.printStackTrace();
         catch (IOException e) 
            // TODO Auto-generated catch block
            e.printStackTrace();
            System.out.println("文件下载失败！");
         finally 
            return file;

java多线程爬虫

我们抓取的目标的数据量，有时是非常庞大的，甚至几千万上亿的数据量，而有些甚至会要求实时的更新，对于上面的这种同步抓取肯定行不通。我们一般会使用并发和分布式来解决速度的问题。

java可以利用线程池ExecutorService来多线程抓取，代码贴出来：

 @Test
    public  void runMethod()
        Document doc = null;
        String url = "https://xxxxxxxxx";
            // 获取需要爬取数据的地址集
            List<String> nameList = new ArrayList<String>();
            try 
                doc = Jsoup.connect(url).timeout(3000).get(); // 设置连接超时时间
             catch (Exception e2) 
                System.out.println(e2);
            
            Elements links = doc.select("a");
            for (Element element : links) 
                String name = element.text();
                nameList.add(name);
            

        ExecutorService executorService = Executors.newCachedThreadPool();
        List<Task> tasks = new ArrayList<>();
            for(int i=1;i<nameList.size();i++)
                tasks.add(new Task(nameList.get(i),url));
            
        try 
            executorService.invokeAll(tasks);
            executorService.shutdown();
         catch (InterruptedException e) 
            e.printStackTrace();

public class Task implements Callable<Object> 
    private String name;
    private String url;
    public Task(String name,String url)
        this.name = name;
        this.url = url;
    
    @Override
    public Object call() throws Exception 
        FileUrlDownloadUtil.downloadFile("https://xxxxxxx", name, "Z:\\\\tmp");
        return null;

关于Java线程池ExecutorService 的理解与使用参考这里

写在最后

相信大家都知道python爬虫很流行，相比较java来说,python的http库类更佳丰富,用java需要几十行代码才能完成的事情,python往往只需要十几行。爬虫讲究的是短糙快，你的目标网站会随时变化，要是用java写，等你启动完IDE，搭建好环境，写代码编译，别人用python可能都爬了10个站了。用恰当的语言做最有效率的事，能帮助你节省不少时间。

以上是关于java爬虫下载FTP网站目录文件的主要内容，如果未能解决你的问题，请参考以下文章