How to display all AJAX requests using HtmlUnit

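I want to get a list of all network calls made by a web page. This is the URL of the page:

https://www.upwork.com/o/jobs/browse/?q=Java&sort=renew_time_int%2Bdesc

If you look at Developer Console -> Network, you will see the list of requests the page makes.

Here is my code: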

public static void main(String[] args) throws IOException {
        final WebClient webClient = configWebClient();
        final List<String> list = new ArrayList<>();
        new WebConnectionWrapper(webClient) {
            @Override
            public WebResponse getResponse(final WebRequest request) throws IOException {
                final WebResponse response = super.getResponse(request);
                list.add(request.getUrl().toString());
                return response;
            }
        };
        webClient.getPage("https://www.upwork.com/ab/find-work/");
        list.forEach(System.out::println); 
    }

    private static WebClient configWebClient() {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60);

        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.waitForBackgroundJavaScriptStartingBefore(5_000);
        webClient.waitForBackgroundJavaScript(3_000);
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setRedirectEnabled(true);
        webClient.getOptions().setUseInsecureSSL(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getCookieManager().setCookiesEnabled(true);
        webClient.setAjaxController(new AjaxController());
        return webClient;
    }

Here is the output:

https://www.upwork.com/o/jobs/browse/?q=Java&sort=renew_time_int%2Bdesc
https://www.upwork.com/o/jobs/browse/?q=Java
https://www.upwork.com:443/o/jobs/browse/js/328ecc3.js?4af40b2
https://www.googletagmanager.com/gtm.js?id=GTM-5XK7SV
https://client.perimeterx.net/PXSs13U803/main.min.js
https://assets.static-upwork.com/components/11.4.0/core.11.4.0.air2.min.js
https://assets.static-upwork.com/global-components/@latest/ugc.js
https://assets.static-upwork.com/global-components/@latest/ugc/ugc.6jcmqb32.js
https://www.upwork.com:443/static/jsui/JobSearchUI/assets/4af40b2/js/55260a3.js

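As you can see, it does not contain the XHR calls. What am I doing wrong?

Answer

Your question uses two different URLs; I hope I used the correct one.

  • As has been mentioned here many times, the .waitForBackground... methods are not options; you have to call them after you have triggered some web requests.
  • The A in AJAX stands for asynchronous; webClient.getPage() is a synchronous call, which means you have to wait afterwards for all of the JavaScript to finish.
  • Calling this page with HtmlUnit seems to produce some JS errors. That may prevent some of the page's JavaScript from being executed (there are still some JavaScript features that HtmlUnit (Rhino) does not support; any help is welcome).

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.WebRequest;
    import com.gargoylesoftware.htmlunit.WebResponse;
    import com.gargoylesoftware.htmlunit.util.WebConnectionWrapper;

    // The class name is illustrative.
    public class ListRequests {
        public static void main(String[] args) throws IOException {
            final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60);
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            final List<String> list = new ArrayList<>();
            // Constructing the wrapper installs it as the client's web connection.
            new WebConnectionWrapper(webClient) {
                @Override
                public WebResponse getResponse(final WebRequest request) throws IOException {
                    final WebResponse response = super.getResponse(request);
                    list.add(request.getHttpMethod() + " " + request.getUrl());
                    return response;
                }
            };

            webClient.getPage("https://www.upwork.com/o/jobs/browse/?q=Java&sort=renew_time_int%2Bdesc");
            // Wait only after the page load has started the background scripts.
            webClient.waitForBackgroundJavaScript(10_000);

            list.forEach(System.out::println);
        }
    }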
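The timing issue behind the first two points can be illustrated without HtmlUnit. The following is a minimal sketch using only the JDK; the class AsyncRequestLogDemo and its methods loadPage, elapseTimer and waitForBackground are hypothetical stand-ins, not HtmlUnit APIs. It shows that a request log read right after a synchronous "page load" misses a call that a background task makes later, and that the call appears once the caller waits for the background work.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a browser; none of these names come from HtmlUnit.
class AsyncRequestLogDemo {

    // Thread-safe list: the background "script" appends from another thread.
    final List<String> log = new CopyOnWriteArrayList<>();
    private final ExecutorService background = Executors.newSingleThreadExecutor();
    private final CountDownLatch timerElapsed = new CountDownLatch(1);

    // Plays the role of getPage(): logs the page request synchronously, then
    // schedules an asynchronous call and returns without waiting for it.
    void loadPage(String url) {
        log.add("GET " + url);
        background.submit(() -> {
            timerElapsed.await();             // the script has not run yet
            log.add("POST " + url + "/api");  // the asynchronous call
            return null;
        });
    }

    // Simulates time passing so that the page's script gets to run.
    void elapseTimer() {
        timerElapsed.countDown();
    }

    // Plays the role of waitForBackgroundJavaScript(timeout): blocks until the
    // background work is done or the timeout expires.
    boolean waitForBackground(long timeoutMillis) {
        background.shutdown();
        try {
            return background.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Returns {entries seen right after loadPage, entries seen after waiting}.
    static int[] demo() {
        AsyncRequestLogDemo browser = new AsyncRequestLogDemo();
        browser.loadPage("https://example.test/jobs");
        int before = browser.log.size();   // only the page request so far
        browser.elapseTimer();
        browser.waitForBackground(5_000);
        int after = browser.log.size();    // now includes the asynchronous call
        return new int[] {before, after};
    }

    public static void main(String[] args) {
        int[] sizes = demo();
        System.out.println("logged before waiting: " + sizes[0]);
        System.out.println("logged after waiting:  " + sizes[1]);
    }
}
```

Running main reports one logged entry before waiting and two after. The same ordering applies to the HtmlUnit code: the wait has to come after getPage(), because before that call there is no background work to wait for.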
