How do you programmatically download a webpage in Java

Posted 2023-03-05

【Title】: How do you Programmatically Download a Webpage in Java 【Posted】: 2010-09-19 07:23:18 【Question】:

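I would like to be able to fetch a web page's HTML and save it to a String, so I can do some processing on it. Also, how could I handle various types of compression?

How would I go about doing that using Java?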

【Comments】:

This is basically a special case of ***.com/questions/921262/…

【Answer 1】:

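Well, you could go with the built-in libraries such as URL and URLConnection, but they don't give you very much control.

~~Personally I'd go with the Apache HTTPClient library.~~ Edit: HTTPClient has been set to end of life by Apache. The replacement is: HTTP Components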

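【Comments】:

Is there no Java version of System.Net.WebRequest?

Sort of, that would be URL. :-) For example: new URL("http://google.com").openStream() // => InputStream

@Jonathan: Mostly what Daniel said, although WebRequest gives you more control than URL. HTTPClient is closer in functionality, IMO.

【Answer 2】: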

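On a Unix/Linux box you could just run 'wget', but this is not really an option if you're writing a cross-platform client. Of course, this assumes that you don't really want to do much with the data between downloading it and it hitting the disk.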

【Comments】:

I would also start with this approach and refactor it later if it turns out to be insufficient.

【Answer 3】:

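Here's some tested code using Java's URL class. I'd recommend doing a better job than I do here of handling the exceptions or passing them up the call stack, though.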

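import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

public static void main(String[] args) {
    URL url;
    InputStream is = null;
    BufferedReader br;
    String line;

    try {
        url = new URL("http://***.com/");
        is = url.openStream();  // throws an IOException
        br = new BufferedReader(new InputStreamReader(is));

        // print the page one line at a time
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
    } catch (MalformedURLException mue) {
        mue.printStackTrace();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    } finally {
        try {
            if (is != null) is.close();
        } catch (IOException ioe) {
            // nothing to see here
        }
    }
}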
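The snippet above decodes the bytes with the platform default charset. As a minimal sketch of one way to honor the charset the server declares, the class below reads the `Content-Type` header and falls back to UTF-8. The names `PageFetcher`, `charsetOf` and `fetch`, the 10-second timeouts, and the UTF-8 fallback are assumptions made for illustration; `readAllBytes` assumes Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
class PageFetcher {

    /** Extracts the charset from a Content-Type value, falling back to UTF-8. */
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                // case-insensitive match on the "charset=" parameter
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8; // assumed default
    }

    /** Downloads the resource at the URL and decodes it into a String. */
    static String fetch(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(10_000); // assumed timeout values
        conn.setReadTimeout(10_000);
        Charset cs = charsetOf(conn.getContentType());
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), cs);
        }
    }
}
```

A call such as `String html = PageFetcher.fetch(new URL("http://example.com/"));` would then return the page as a String. Pages that declare their encoding only in a `<meta>` tag are not covered by this sketch.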

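【Comments】:

DataInputStream.readLine() is deprecated, but other than that it is a very good example. I am using an InputStreamReader() wrapped in a BufferedReader() to get the readLine() function.

This doesn't take character encoding into account, so while it will appear to work for ASCII text, it will eventually produce "strange characters" when there is a mismatch.

In the 3rd line replace DataInputStream with BufferedReader. And replace "dis = new DataInputStream(new BufferedInputStream(is));" with "dis = new BufferedReader(new InputStreamReader(is));"

@akapelko Thank you. I updated my answer to remove the calls to deprecated methods.

What about closing the InputStreamReader?

【Answer 4】: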

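Bill's answer is very good, but you may want to do some things with the request, like compression or user agents. The following code shows how you can handle various types of compression in your requests.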

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

URL url = new URL(urlStr);
HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // Cast shouldn't fail
HttpURLConnection.setFollowRedirects(true);
// allow both GZip and Deflate (ZLib) encodings
conn.setRequestProperty("Accept-Encoding", "gzip, deflate");
String encoding = conn.getContentEncoding();
InputStream inStr = null;

// create the appropriate stream wrapper based on
// the encoding type
if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
    inStr = new GZIPInputStream(conn.getInputStream());
} else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
    inStr = new InflaterInputStream(conn.getInputStream(),
      new Inflater(true));
} else {
    inStr = conn.getInputStream();
}

To also set the user agent, add the following code:

conn.setRequestProperty ( "User-agent", "my agent name");
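To go from the wrapped stream to a String, a sketch along the following lines may help. It repeats the same gzip/deflate selection as the snippet above (including its raw-deflate `Inflater(true)` choice) in a small helper that can be exercised without a network connection. The names `ContentDecoder`, `decode` and `readBody` are hypothetical, the charset is supplied by the caller, and `readAllBytes` assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, not part of the original answer.
class ContentDecoder {

    /** Wraps the raw stream according to the Content-Encoding value (may be null). */
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if ("gzip".equalsIgnoreCase(contentEncoding)) {
            return new GZIPInputStream(raw);
        }
        if ("deflate".equalsIgnoreCase(contentEncoding)) {
            // raw deflate, mirroring the Inflater(true) used in the answer
            return new InflaterInputStream(raw, new Inflater(true));
        }
        return raw; // no encoding, or one this sketch does not handle
    }

    /** Decompresses (if needed) and decodes the whole body into a String. */
    static String readBody(InputStream raw, String contentEncoding, Charset cs) throws IOException {
        try (InputStream in = decode(raw, contentEncoding)) {
            return new String(in.readAllBytes(), cs);
        }
    }

    public static void main(String[] args) throws IOException {
        // Offline demo: gzip a small page in memory, then decode it again.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("<html>hi</html>".getBytes(StandardCharsets.UTF_8));
        }
        InputStream body = new ByteArrayInputStream(buf.toByteArray());
        System.out.println(readBody(body, "gzip", StandardCharsets.UTF_8));
    }
}
```

With a live connection, the call would look like `ContentDecoder.readBody(conn.getInputStream(), conn.getContentEncoding(), StandardCharsets.UTF_8)`, assuming the page is UTF-8.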

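【Comments】:

For those looking to convert the InputStream to a string, see this answer.

setFollowRedirects helps. I use setInstanceFollowRedirects in my case; before using it I was getting empty web pages in many cases. I assume you are trying to use compression to download the file faster.

【Answer 5】: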

I'd use a decent HTML parser like Jsoup. It's then as easy as:

String html = Jsoup.connect("http://***.com").get().html();

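It handles GZIP and chunked responses and character encoding fully transparently. It offers more advantages as well, like HTML traversing and manipulation by CSS selectors, the way jQuery can do. You only have to grab it as a Document, not as a String.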

Document document = Jsoup.connect("http://google.com").get();

You really don't want to run basic String methods or even regex on HTML to process it.

See also:

What are the pros and cons of leading HTML parsers in Java?

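【Comments】:

Good answer. A bit late. ;)

Great library :) Thanks.

Why has nobody told me about .html() before? I was looking so hard into how to easily store the HTML fetched by Jsoup, and this helps a lot.

For newcomers: if you use this library on Android, you need to use it in a different thread, because by default it runs on the same application thread, which will cause the application to throw NetworkOnMainThreadException.

【Answer 6】: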

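None of the approaches above downloads the web page text as it looks in the browser. These days a lot of data is loaded into the browser by scripts in the HTML page. None of the techniques above supports scripts; they download only the HTML text. HtmlUnit supports JavaScript. So if you want to download the web page text as it looks in the browser, you should use HtmlUnit.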

【Comments】:

【Answer 7】:

Jetty has an HTTP client which can be used to download a web page.

package com.zetcode;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;

public class ReadWebPageEx5 {

    public static void main(String[] args) throws Exception {

        HttpClient client = null;

        try {

            client = new HttpClient();
            client.start();

            String url = "http://example.com";

            ContentResponse res = client.GET(url);

            System.out.println(res.getContentAsString());

        } finally {

            if (client != null) {

                client.stop();
            }
        }
    }
}

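The example prints the contents of a simple web page.

In the Reading a web page in Java tutorial I have written six examples of downloading a web page programmatically in Java, using URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.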

【Comments】:

【Answer 8】:

I used the actual answer to this post (URL) and wrote the output to a file.

package test;

import java.net.*;
import java.io.*;

public class PDFTest {
    public static void main(String[] args) throws Exception {
        URL oracle = new URL("http://www.fetagracollege.org");
        String fileName = "D:\\a_01\\output.txt";

        // try-with-resources closes the reader and flushes and closes the writer
        try (BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
             PrintWriter writer = new PrintWriter(fileName, "UTF-8")) {
            String inputLine;

            // echo each line to the console and append it to the file
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
                writer.println(inputLine);
            }
        }
    }
}

【Comments】:

【Answer 9】:

This class may help: it fetches the page source and filters out some of the information.

public class MainActivity extends AppCompatActivity {

    EditText url;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate( savedInstanceState );
        setContentView( R.layout.activity_main );

        url = ((EditText)findViewById( R.id.editText));
        DownloadCode obj = new DownloadCode();

        try {
            String tag1= "<div class=\"description\">";
            // get() blocks until the background download has finished
            String l = obj.execute( "http://www.nu.edu.pk/Campus/Chiniot-Faisalabad/Faculty" ).get();

            // keep the text between the first description <div> and the next </div>
            String[] t1 = l.split(tag1);
            String[] t2 = t1[1].split( "</div>" );
            url.setText( t2[0] );
        }
        catch (Exception e)
        {
            Toast.makeText( this,e.toString(),Toast.LENGTH_SHORT ).show();
        }
    }

    // AsyncTask<Params, Progress, Result>: input, progress, output
    class DownloadCode extends AsyncTask<String,Void,String>
    {
        @Override
        protected String doInBackground(String... WebAddress) // only the first address is used
        {
            StringBuilder htmlcontent = new StringBuilder();
            try {
                URL url = new URL( WebAddress[0] );
                HttpURLConnection c = (HttpURLConnection) url.openConnection();
                c.connect();
                InputStream input = c.getInputStream();
                InputStreamReader reader = new InputStreamReader( input );

                // read the response one character at a time
                int data = reader.read();
                while (data != -1)
                {
                    htmlcontent.append( (char) data );
                    data = reader.read();
                }
            }
            catch (Exception e)
            {
                Log.i("Status : ",e.toString());
            }
            return htmlcontent.toString();
        }
    }
}
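The filtering step in the class above can be isolated into a plain string helper, sketched below under the assumption that the wanted text sits between a known opening marker and the next closing tag. The names `HtmlSlice` and `between` are hypothetical. This kind of string slicing is brittle on real-world HTML, so it is only reasonable for quick extraction from pages whose markup is known and stable.

```java
// Hypothetical helper, not part of the original answer.
class HtmlSlice {

    /**
     * Returns the trimmed text between the first occurrence of `open`
     * and the next occurrence of `close`, or null if either is missing.
     */
    static String between(String html, String open, String close) {
        int start = html.indexOf(open);
        if (start < 0) {
            return null; // opening marker not found
        }
        start += open.length();
        int end = html.indexOf(close, start);
        if (end < 0) {
            return null; // closing marker not found after the opening one
        }
        return html.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String page = "<body><div class=\"description\"> Professor of CS </div></body>";
        System.out.println(between(page, "<div class=\"description\">", "</div>"));
    }
}
```

Returning null instead of indexing into the result of `split` avoids an `ArrayIndexOutOfBoundsException` when the marker is absent from the page.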

【Comments】:

【Answer 10】:

You will most likely need to extract the source from a secure web page (https protocol). In the following example, the HTML file is saved to c:\temp\filename.html. Enjoy!

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

import javax.net.ssl.HttpsURLConnection;

/**
 * <b>Get the Html source from the secure url </b>
 */
public class HttpsClientUtil {
    public static void main(String[] args) throws Exception {
        String httpsURL = "https://***.com";
        String FILENAME = "c:\\temp\\filename.html";
        BufferedWriter bw = new BufferedWriter(new FileWriter(FILENAME));
        URL myurl = new URL(httpsURL);
        HttpsURLConnection con = (HttpsURLConnection) myurl.openConnection();
        con.setRequestProperty ( "User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0" );
        InputStream ins = con.getInputStream();
        InputStreamReader isr = new InputStreamReader(ins, "Windows-1252");
        BufferedReader in = new BufferedReader(isr);
        String inputLine;

        // Write each line into the file
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
            bw.write(inputLine);
            bw.newLine(); // readLine() strips the line terminator, so restore it
        }
        in.close();
        bw.close();
    }
}

【Comments】:

【Answer 11】:

Use NIO.2's powerful Files.copy(InputStream in, Path target) to do this:

URL url = new URL( "http://download.me/" );
try (InputStream in = url.openStream()) { // Files.copy does not close the stream itself
    Files.copy( in, Paths.get( "downloaded.html" ) );
}
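One detail worth knowing: `Files.copy` throws `FileAlreadyExistsException` when the target file exists, unless `StandardCopyOption.REPLACE_EXISTING` is passed. The sketch below saves the page, overwriting any earlier copy, and then reads it back into a String, which is what the question asks for. The names `PageSaver` and `saveAndRead` are hypothetical, and decoding as UTF-8 is an assumption about the page.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of the original answer.
class PageSaver {

    /** Copies the resource at the URL to the target path and returns it as a String. */
    static String saveAndRead(URL url, Path target) throws IOException {
        // Files.copy leaves the stream open, so close it with try-with-resources
        try (InputStream in = url.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        // assumed encoding: UTF-8
        return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
    }
}
```

A call might look like `PageSaver.saveAndRead(new URL("http://download.me/"), Paths.get("downloaded.html"))`.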

【Comments】:
