Using Java to pull data from a webpage?

Posted 2023-02-24

【Title】: Using Java to pull data from a webpage? 【Posted】: 2011-09-03 18:51:50 【Question】:

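I'm trying to write my first program in Java. The goal is a program that browses a website and downloads files for me. However, I don't know how to use Java to interact with the internet. Can anyone tell me which topics to look up or read about, or recommend some good resources?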

【Comments on the question】:

You can use Apache's HttpClient. There is a somewhat similar answer here.

【Answer 1】:

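The simplest solution (one that does not depend on any third-party library or platform) is to create a URL instance pointing to the web page or link you want to download, and to read the content using streams.

For example: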

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;


public class DownloadPage {

    public static void main(String[] args) throws IOException {

        // Make a URL to the web page
        URL url = new URL("http://***.com/questions/6159118/using-java-to-pull-data-from-a-webpage");

        // Get the input stream through URL Connection
        URLConnection con = url.openConnection();
        InputStream is = con.getInputStream();

        // Once you have the Input Stream, it's just plain old Java IO stuff.

        // For this case, since you are interested in getting plain-text web page
        // I'll use a reader and output the text content to System.out.

        // For binary content, it's better to directly read the bytes from stream and write
        // to the target file.

        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
            String line;

            // read each line and write to System.out
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
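
The comments in the code above suggest copying raw bytes when the content is binary, which is what the question (downloading files) needs. A minimal sketch of that idea with the standard library could look like the following. The class name `DownloadFile`, the method names, and the example URL and file name are hypothetical and are not part of the original answer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class DownloadFile {

    // Copies every byte from the stream to the target file, replacing the
    // file if it already exists. Returns the number of bytes written.
    static long save(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Opens the URL and streams its content to the target file.
    // try-with-resources closes the stream even if the copy fails.
    static long download(String address, Path target) throws IOException {
        try (InputStream in = new URL(address).openStream()) {
            return save(in, target);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL and file name (assumptions); replace with a real download.
        long bytes = download("https://example.com/", Paths.get("example.html"));
        System.out.println("Saved " + bytes + " bytes");
    }
}
```

Because no reader is involved, no character decoding takes place, so images, archives and other binary files arrive unchanged.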

Hope this helps.

【Comments】:

Hi, when I implemented this I got the HTML file in the console. How can I get a specific value from the website?

【Answer 2】:

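The basics

Have a look at these to build a solution more or less from scratch:

- Start from the basics: the chapter on Networking in The Java Tutorial, including Working With URLs
- Make things easier for yourself: Apache HttpComponents (including HttpClient)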
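
As a small taste of what working with URLs involves, the sketch below splits an address into its parts using `java.net.URL`. The class name `UrlParts`, the method `describe`, and the sample address are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlParts {

    // Returns one "name = value" line per URL component.
    static String describe(String address) throws MalformedURLException {
        URL url = new URL(address);
        return String.join("\n",
                "protocol = " + url.getProtocol(),
                "host = " + url.getHost(),
                "port = " + url.getPort(),      // -1 when the address has no explicit port
                "path = " + url.getPath(),
                "query = " + url.getQuery(),    // null when there is no query string
                "ref = " + url.getRef());       // the fragment after '#', or null
    }

    public static void main(String[] args) throws MalformedURLException {
        // Sample address (an assumption, not taken from the answer)
        System.out.println(describe(
                "http://example.com:8080/docs/index.html?name=networking#DOWNLOADING"));
    }
}
```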

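Easy glue-and-stitch stuff

You always have the option of calling external tools from Java using exec() and similar methods. For instance, you could use wget or cURL.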
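
A rough sketch of that approach is shown below. It uses `ProcessBuilder` rather than `Runtime.exec()`, and it assumes that `curl` is installed and on the `PATH`. The class name `ExternalDownload`, its methods, and the example URL and file name are hypothetical.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class ExternalDownload {

    // Builds the curl argument list:
    //   -L   follow redirects
    //   -sS  hide the progress meter but still report errors
    //   -o   write the response body to the given file
    static List<String> curlCommand(String url, String outputFile) {
        return Arrays.asList("curl", "-L", "-sS", "-o", outputFile, url);
    }

    // Runs the command, forwards its output to this process's console,
    // and returns the exit code (0 normally means success).
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Placeholder URL and file name (assumptions)
        int exitCode = run(curlCommand("https://example.com/", "example.html"));
        System.out.println("curl exited with code " + exitCode);
    }
}
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as `&` or `?`.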

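Hardcore stuff

Then, if you want to look into something more full-fledged, thankfully the need for automated web testing has given us some very practical tools. Look at:

- HtmlUnit (powerful and simple)
- Selenium, Selenium-RC
- WebDriver/Selenium2 (still in the works)
- JBehave and JBehave Web

Some other libraries were written specifically with web scraping in mind:

- JSoup
- Jaunt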

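Some workarounds

Java is a language, but also a platform, with many other languages running on it. Some of them come with great syntactic sugar or libraries that make it easy to build scrapers.

Check out:

- Groovy (and its XmlSlurper)
- Scala (with great XML support, as presented here and here)

If you know of a great library for Ruby (JRuby, with an article on scraping with JRuby and HtmlUnit) or Python (Jython), or you simply prefer these languages, then give their JVM ports a chance.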

Some supplements

Some other similar questions:

- Scrape data from HTML using Java
- Options for HTML Scraping

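【Comments】:

There is one thing I did not write in that answer: I really would not recommend doing this kind of thing in Java (of course, you may have no choice, but I'm just pointing it out). It is doable, and there are plenty of tools for it, but Java's inherent verbosity makes it less friendly for experimenting with the web services you want to scrape. I would usually rather do this from a dynamic language with a REPL, or directly from the browser's console. But of course nothing prevents you from starting that way and then implementing the solution in Java, or in another JVM-based language!

【Answer 3】:

Here is my solution, using URL and a try-with-resources statement, with the exceptions caught.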

/**
 * Created by mona on 5/27/16.
 */
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

public class ReadFromWeb {
    public static void readFromWeb(String webURL) throws IOException {
        // Creating the URL and opening the stream inside the resource specification
        // lets the catch blocks below handle failures from both steps.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new URL(webURL).openStream()))) {
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
            throw new MalformedURLException("URL is malformed: " + webURL);
        } catch (IOException e) {
            e.printStackTrace();
            // Keep the original exception as the cause instead of discarding it
            throw new IOException("Failed to read " + webURL, e);
        }
    }

    public static void main(String[] args) throws IOException {
        String url = "https://madison.craigslist.org/search/sub";
        readFromWeb(url);
    }
}

You can also save it to a file if needed, or parse it with an XML or HTML library.
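
To give an idea of what pulling one specific value out of the downloaded text might look like, here is a small sketch that extracts the page title. It deliberately uses a regular expression from the standard library instead of an XML or HTML library, which is only reasonable for very simple cases. For real pages, a proper parser such as JSoup (mentioned in another answer) is likely to be more robust. The class name `TitleExtractor` and the sample HTML are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleExtractor {

    // Matches the text between <title ...> and </title>, ignoring case
    // and allowing the title to span several lines.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or an empty Optional if no title is found.
    static Optional<String> extractTitle(String html) {
        Matcher matcher = TITLE.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Sample input (an assumption); in practice this would be the downloaded page text.
        String html = "<html><head><title> Example Page </title></head><body></body></html>";
        System.out.println(extractTitle(html).orElse("(no title)")); // prints "Example Page"
    }
}
```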

【Comments】:

【Answer 4】:

Since Java 11, the most convenient way is to use java.net.http.HttpClient from the standard library.

Example:

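import java.net.Authenticator;
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpClient.Redirect;
import java.net.http.HttpClient.Version;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpResponse.BodyHandlers;
import java.time.Duration;

public class HttpClientExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
             .version(Version.HTTP_1_1)
             .followRedirects(Redirect.NORMAL)
             .connectTimeout(Duration.ofSeconds(20))
             // Placeholder proxy; drop this line to connect directly
             .proxy(ProxySelector.of(new InetSocketAddress("proxy.example.com", 80)))
             // Needs a default Authenticator to be installed; drop this line otherwise
             .authenticator(Authenticator.getDefault())
             .build();

        HttpRequest request = HttpRequest.newBuilder()
             .uri(URI.create("https://foo.com/"))
             .timeout(Duration.ofMinutes(2))
             .GET()
             .build();

        HttpResponse<String> response = client.send(request, BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}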
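
A pared-down variant of the same idea, without a proxy or authenticator, might look like the sketch below. It also checks the status code before returning the body. The class name `SimpleFetch`, the `User-Agent` value, the 30-second timeout, and the example URL are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class SimpleFetch {

    // Builds a GET request with a timeout and an identifying User-Agent header.
    static HttpRequest buildRequest(String address) {
        return HttpRequest.newBuilder()
                .uri(URI.create(address))
                .timeout(Duration.ofSeconds(30))
                .header("User-Agent", "SimpleFetch-example")
                .GET()
                .build();
    }

    // Sends the request and returns the body, treating any status other
    // than 200 as a failure.
    static String fetch(HttpClient client, String address)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(address), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException(
                    "Unexpected status " + response.statusCode() + " for " + address);
        }
        return response.body();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        // Placeholder URL (an assumption)
        System.out.println(fetch(client, "https://example.com/"));
    }
}
```

One `HttpClient` instance can be reused for many requests, so it is usually created once and passed around rather than rebuilt for every call.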

【Comments】:

【Answer 5】:

I use the following code for my API:

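import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;

public class PrintPage {
    public static void main(String[] args) {
        try (InputStream content = new URL("https://***.com/questions/6159118/using-java-to-pull-data-from-a-webpage").openStream()) {
            int c;
            // Casting each byte to char is only correct for single-byte encodings such as ASCII
            while ((c = content.read()) != -1) System.out.print((char) c);
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException ie) {
            ie.printStackTrace();
        }
    }
}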

You can capture the characters and convert them into a string.
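
One way to do that conversion is sketched below. It reads the whole stream into a byte array and decodes it in one step. This assumes the content is encoded as UTF-8, which the answer does not state, and it needs Java 9 or later for `readAllBytes()`. The class name `StreamToString` is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class StreamToString {

    // Reads the stream to the end and decodes the bytes as UTF-8 (an assumption
    // about the page encoding; use the charset the server actually declares).
    static String readAll(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for url.openStream() in this sketch.
        byte[] sample = "h\u00e9llo, w\u00f6rld".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(sample)) {
            System.out.println(readAll(in));
        }
    }
}
```

Decoding the bytes together handles multi-byte characters correctly, which a byte-by-byte cast to `char` does not.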

【Comments】:
