How to get the full HTML code from a website using Jsoup or any other method
Posted: 2021-08-18 22:17:07
Question: I am trying to get the HTML code from a website. If the page's code is small, like this one (https://abdelftahzowail.github.io/WriteUpsideDown/), I get the full code, but if the page's code is large, like this one (https://www.pixel4k.com/page/1?s=deadpool), I do not get the full code.
I tried Jsoup and HttpURLConnection, and neither gave me the full code.
Here is my code:
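// Imports needed at the top of the enclosing class file
import android.util.Log;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Network calls must not run on the Android main thread, so fetch in a worker thread
Thread thread = new Thread(() -> {
    try {
        // maxBodySize(0) removes jsoup's body size cap; timeout(0) disables the timeout
        Document doc = Jsoup.connect(editText.getText().toString())
                .header("Accept-Encoding", "gzip, deflate")
                .userAgent("Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36")
                .maxBodySize(0)
                .timeout(0)
                .get();
        Log.i("IMPORTANT !!!!", "doc ( " + editText.getText().toString() + " )\n" + doc);
    } catch (Exception e) {
        Log.i("IMPORTANT !!!!", "error : " + e);
    }
});
thread.start();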
This is the code I get from this website (https://www.pixel4k.com/page/1?s=deadpool):
<!doctype html>
<html class="no-js" lang="en-US" prefix="og: http://ogp.me/ns#">
<head>
<meta charset="UTF-8">
<title>You searched for deadpool - 4k Wallpapers ,Hd Wallpapers,Desktop Wallpapers, Free Backgrounds Download, Widescreen Wallpapers</title>
<link rel="icon" href="https://www.pixel4k.com/wp-content/uploads/2018/09/favicon.ico" type="image/x-icon">
<link rel="apple-touch-icon" href="apple-touch-icon.png">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black">
<link rel="stylesheet" type="text/css" media="all" href="https://www.pixel4k.com/wp-content/themes/pxxx/style.css">
<link rel="pingback" href="https://www.pixel4k.com/xmlrpc.php">
<meta name="google-site-verification" content="xHAo1q6wJG7bz-iw00VylrwaMabFjK_xSyU1jakgwaQ">
<meta name="wot-verification" content="317f71c46e1fb6060ce1">
<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js" type="f8f50ad6803275492fa5ce1d-text/javascript"></script>
<script type="f8f50ad6803275492fa5ce1d-text/javascript">(adsbygoogle=window.adsbygoogle||[]).push({google_ad_client:"ca-pub-2555268506534283",enable_page_level_ads:true});</script> <!--[if lt IE 9]>
<script src="https://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<meta name="robots" content="noindex,follow">
<link rel="next" href="https://www.pixel4k.com/search/deadpool/page/2">
<meta property="og:locale" content="en_US">
<meta property="og:type" content="object">
<meta property="og:title" content="You searched for deadpool - 4k Wallpapers ,Hd Wallpapers,Desktop Wallpapers, Free Backgrounds Download, Widescreen Wallpapers">
<meta property="og:url" content="https://www.pixel4k.com/search/deadpool">
<meta property="og:site_name" content="4k Wallpapers ,Hd Wallpapers,Desktop Wallpapers, Free Backgrounds Download, Widescreen Wallpapers">
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:title" content="You searched for deadpool - 4k Wallpapers ,Hd Wallpapers,Desktop Wallpapers, Free Backgrounds Download, Widescreen Wallpapers">
<script type="application/ld+json">{"@context":"https:\/\/schema.org","@type":"Person","url":"https:\/\/www.pixel4k.com\/","sameAs":[],"@id":"#person","name":"Mika"}</script>
<link rel="dns-prefetch" href="//ajax.googleapis.com">
<link rel="dns-prefetch" href="//www.pixel4k.com">
<link rel="alternate" type="application/rss+xml" title="4k Wallpapers ,Hd Wallpapers,Desktop Wallpapers, Free Backgrounds Download, Widescreen Wallpapers » Feed" href="https://www.pixel4k.com/feed">
<link rel="alternate" type="application/rss+xml" title="4k Wallpapers ,Hd Wallpapers,Desktop Wallpapers, Free Backgrounds Download, Widescreen Wallpapers » Comments Feed" href="https://www.pixel4k.com/comments/feed">
<link rel="alternate" type="application/rss+xml" title="4k Wallpapers ,Hd Wallpapers,Desktop Wallpapers, Free Backgrounds Download, Widescreen Wallpapers » Search Results for “deadpool” Feed" href="https://www.pixel4k.com/search/deadpool/feed/rss2/">
<style type="text/css">img.wp-smiley,img.emoji{display:inline!important;border:none!important;box-shadow:none!important;height:1em!important;width:1em!important;margin:0 .07em!important;vertical-align:-.1em!important;background:none!important;padding:0!important}</style>
<link rel="stylesheet" id="wp-block-library-css" href="https://www.pixel4k.com/wp-includes/css/dist/block-library/style.min.css?ver=5.3.8" type="text/css" media="all">
<style id="rocket-lazyload-inline-css" type="text/css">.rll-youtube-player{position:relative;padding-bottom:56.23%;height:0;overflow:hidden;max-width:100%;background:#000;margin:5px}.rll-youtube-player iframe{position:absolute;top:0;left:0;width:100%;height:100%;z-index:100;background:0 0}.rll-youtube-player img{bottom:0;display:block;left:
But this app (https://play.google.com/store/apps/details?id=com.teejay.trebedit&hl=en&gl=US) does get the full code.
What should I do?
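Comments:
I suspect the thread dies before you get the whole document. Could you put Thread.sleep(20 * 1_000) after thread.start() and see whether it changes the behavior?
@spi It didn't change anything.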
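Answer 1:
You are getting all the data (both of your URLs and your code produce the full HTML), but the Android logger does not print everything when you call it.
If you write to a file instead of using a log statement, you will most likely see that all the data is there.
See: What is the size limit for Logcat and how to change its capacity?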
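A minimal sketch of the two workarounds this answer points to: splitting the text into pieces small enough for the logger, and writing the whole document to a file. The class name HtmlDump, the helper names, and the 3000-character chunk size are assumptions for illustration (Logcat's per-message limit is roughly 4 KB, so the value only needs to stay below that). The sketch prints to System.out so it runs on a plain JVM; on Android each piece would go to Log.i instead, and java.nio.file requires API level 26 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class HtmlDump {
    // Assumed chunk size, chosen to stay below Logcat's per-message limit.
    static final int CHUNK_SIZE = 3000;

    /** Splits text into consecutive pieces of at most size characters. */
    public static List<String> chunk(String text, int size) {
        List<String> parts = new ArrayList<>();
        for (int start = 0; start < text.length(); start += size) {
            parts.add(text.substring(start, Math.min(text.length(), start + size)));
        }
        return parts;
    }

    /** Writes the whole document to a file so the logger cannot cut it off. */
    public static void writeToFile(String html, Path target) throws IOException {
        Files.write(target, html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for doc.toString(): a long HTML-like string built locally.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append("<p>row</p>");
        }
        String html = sb.toString();

        // On Android, each part would be passed to Log.i(TAG, part).
        List<String> parts = chunk(html, CHUNK_SIZE);
        System.out.println(parts.size() + " chunks for " + html.length() + " chars");

        Path out = Files.createTempFile("page", ".html");
        writeToFile(html, out);
        System.out.println("wrote " + Files.size(out) + " bytes to " + out);
    }
}
```

In the question's code, the string to pass in would be doc.toString() (or doc.outerHtml()). Comparing its length() with the size of the written file is a quick way to confirm that the download was complete and only the log output was truncated.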
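Answer 2:
I searched for the maximum length of a String in Java. According to Takahiko Kawasaki in this question, the maximum length is 65536 characters.
Since the method you are using writes the page's HTML code into a String, this means your code will work as expected if the page you are trying to download is smaller than 65,536 bytes.
I don't know what you need to do with the HTML code once you have it, so the following suggestion may not be enough for your needs, but: have you tried storing the HTML code in a StringBuffer instead of a String?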
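The length claim in this answer is easy to test before switching to StringBuffer. As far as I know, the 65,535 figure is a limit on string constants stored in class files, not on String objects built at runtime, which are bounded by the int length of the backing array and by available heap. The sketch below builds a String well past 65,536 characters; the class name StringLengthCheck and the 200,000-character size are assumptions for illustration.

```java
class StringLengthCheck {
    /** Builds a String of the requested length at runtime. */
    public static String build(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append('x');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 200_000 is an arbitrary size chosen to exceed 65,536 characters.
        String big = build(200_000);
        System.out.println(big.length()); // prints 200000
    }
}
```

If this runs without error, the String type itself is unlikely to be what truncates the page, and the logger limit described in Answer 1 is the more probable cause.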