How to bypass the Oracle ADF loopback script when scraping a website with the PHP cURL library?

Posted: 2019-05-28 11:33:58

I am scraping a website that uses the Oracle ADF loopback script, which keeps redirecting me to the same page. How can I bypass it?

Here is my PHP code.

<?php
    $url = 'https://www.mywebsite.com/faces/index.jspx';
    $ch = curl_init($url);

    // Persist cookies between requests in a file next to this script.
    curl_setopt($ch, CURLOPT_COOKIEJAR, dirname(__FILE__) . '/cookie.txt');
    curl_setopt($ch, CURLOPT_COOKIEFILE, dirname(__FILE__) . '/cookie.txt');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $headers = array();
    $headers[] = 'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36';
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    // Note: this disables TLS certificate verification.
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $data = curl_exec($ch);
    if (curl_errno($ch)) { // check for execution errors before closing the handle
      echo 'Scraper error: ' . curl_error($ch);
      curl_close($ch);
      exit;
    }
    curl_close($ch);

    echo $data;
?>

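When I run the code above, I am redirected to the same page.

It also adds some query string parameters, such as ?_afrLoop=39478247795404&_afrWindowMode=0&_afrWindowId=null

On the real site, _afrWindowId holds a random alphanumeric string, but I get null.

After manually stopping the page redirect, I get a page containing the following Oracle loopback script, which causes the redirect. What should I do?

Loopback script: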

    <html lang="el-GR"><head><script>
/*
** Copyright (c) 2008, Oracle and/or its affiliates. All rights reserved.
*/

/**
 * This is the loopback script to process the url before the real page loads. It introduces
 * a separate round trip. During this first roundtrip, we currently do two things: 
 * - check the url hash portion, this is for the PPR Navigation. 
 * - do the new window detection
 * the above two are both controled by parameters in web.xml
 * 
 * Since it's very lightweight, so the network latency is the only impact. 
 * 
 * here are the list of will-pass-in parameters (these will replace the param in this whole
 * pattern: 
 *        viewIdLength                           view Id length (characters), 
 *        loopbackIdParam                        loopback Id param name, 
 *        loopbackId                             loopback Id,
 *        loopbackIdParamMatchExpr               loopback Id match expression, 
 *        windowModeIdParam                      window mode param name, 
 *        windowModeParamMatchExpr               window mode match expression, 
 *        clientWindowIdParam                    client window Id param name, 
 *        clientWindowIdParamMatchExpr           client window Id match expression, 
 *        windowId                               window Id, 
 *        initPageLaunch                         initPageLaunch, 
 *        enableNewWindowDetect                  whether we want to enable new window detection
 *        jsessionId                             session Id that needs to be appended to the redirect URL
 *        enablePPRNav                           whether we want to enable PPR Navigation
 *
 */

var id = null; 
var query = null; 
var href = document.location.href; 
var hashIndex = href.indexOf("#"); 
var hash = null;

/* process the hash part of the url, split the url */
if (hashIndex > 0) 
{ 
  hash = href.substring(hashIndex + 1); 
  /* only analyze hash when pprNav is on (bug 8832771) */
  if (false && hash && hash.length > 0) 
  { 
    hash = decodeURIComponent(hash); 
    if (hash.charAt(0) == "@") 
    { 
      query = hash.substring(1); 
    } 
    else 
    { 
      var state = hash.split("@"); 
      id = state[0]; 
      query = state[1]; 
    } 
  } 
  href = href.substring(0, hashIndex); 
} 

/* process the query part */
var queryIndex = href.indexOf("?"); 
if (queryIndex > 0) 
{
  /* only when pprNav is on, we take in the query from the hash portion */
  query = (query || (id && id.length>0))? query: href.substring(queryIndex); 
  href = href.substring(0, queryIndex); 
} 

var jsessionIndex = href.indexOf(';');
if (jsessionIndex > 0)
{
  href = href.substring(0, jsessionIndex);
}

/* we will replace the viewId only when pprNav is turned on (bug 8832771) */
if (false) 
{
  if (id != null && id.length > 0) 
  { 
    href = href.substring(0, href.length - 11) + id;
  } 
}

var isSet = false; 
if (query == null || query.length == 0) 
{ 
  query = "?"; 
} 
else if (query.indexOf("_afrLoop=") >= 0) 
{ 
  isSet = true; 
  query = query.replace(/_afrLoop=[^&]*/, "_afrLoop=39279593944826"); 
} 
else 
{ 
  query += "&"; 
} 
if (!isSet) 
{ 
  query = query += "_afrLoop=39279593944826"; 
} 

/* below is the new window detection logic */
var initWindowName = "_afr_init_"; // temporary window name set to a new window
var windowName = window.name;

// if the window name is "_afr_init_", treat it as redirect case of a new window
if ((true) && (!windowName || windowName==initWindowName || 
    windowName!="null"))  
{ 
  /* append the _afrWindowMode param */
  var windowMode;
  if (true) 
  {
    /* this is the initial page launch case, 
       also this could be that we couldn't detect the real windowId from the server side */
    windowMode=0;
  }
  else if ((href.indexOf("/__ADFvDlg__") > 0) || (query.indexOf("__ADFvDlg__") >= 0))
  {
    /* this is the dialog case */
    windowMode=1;
  }
  else 
  {
    /* this is the ctrl-N case */
    windowMode=2;
  }

  if (query.indexOf("_afrWindowMode=") >= 0) 
  { 
    query = query.replace(/_afrWindowMode=[^&]*/, "_afrWindowMode="+windowMode); 
  } 
  else 
  { 
    query = query += "&_afrWindowMode="+windowMode; 
  } 

  /* append the _afrWindowId param */
  var clientWindowId;
  /* in case we couldn't detect the windowId from the server side */
  if (!windowName || windowName == initWindowName) 
  {
    clientWindowId = "null";

    // set window name to an initial name so we can figure out whether a page is loaded from
    // cache when doing Ctrl+N with IE
    window.name = initWindowName;
  }
  else 
  {
    clientWindowId = windowName;
  }

  if (query.indexOf("_afrWindowId=") >= 0) 
  { 
    query = query.replace(/_afrWindowId=\w*/, "_afrWindowId="+clientWindowId); 
  } 
  else 
  { 
    query = query += "&_afrWindowId="+clientWindowId; 
  } 
}


var sess = "";

if (sess.length > 0)
  href += sess; 

/* if pprNav is on, then the hash portion should have already been processed */
if ((false) || (hash == null))
  document.location.replace(href + query);
else 
  document.location.replace(href + query + "#" + hash);
</script>
</head>
</html>
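
cURL does not execute JavaScript, so the redirect computed by the script above never happens on its own inside a PHP scraper. One possible approach is to reproduce the script's URL arithmetic in PHP and issue the second request manually. The sketch below is only an illustration: the function names `extractAfrLoop` and `buildLoopbackUrl` are hypothetical, it appends the three `_afr*` parameters at the end of the query instead of replacing them in place, and it assumes the server accepts the rebuilt URL when the session cookies from the first response are sent back, which the thread does not confirm.

```php
<?php
// Sketch only: mirrors what the loopback script computes for a first visit.
// Both function names are hypothetical helpers, not part of cURL or ADF.

/** Return the _afrLoop value embedded in the loopback page, or null if none is found. */
function extractAfrLoop(string $html): ?string
{
    // The server writes the value as a literal such as "_afrLoop=39279593944826".
    if (preg_match('/_afrLoop=(\d+)/', $html, $m)) {
        return $m[1];
    }
    return null;
}

/** Build the URL a browser would be sent to by the loopback script on initial page launch. */
function buildLoopbackUrl(string $url, string $afrLoop): string
{
    // Like the script: drop the fragment, split off the query, drop any ";..." session suffix.
    $url   = explode('#', $url, 2)[0];
    $parts = explode('?', $url, 2);
    $base  = explode(';', $parts[0], 2)[0];
    $query = $parts[1] ?? '';

    // Keep every existing parameter except the three that the loopback script sets itself.
    $pairs = $query === '' ? [] : explode('&', $query);
    $pairs = array_values(array_filter($pairs, function ($pair) {
        return !preg_match('/^_afr(Loop|WindowMode|WindowId)=/', $pair);
    }));

    $pairs[] = '_afrLoop=' . $afrLoop;
    $pairs[] = '_afrWindowMode=0';  // the script's "initial page launch" case
    $pairs[] = '_afrWindowId=null'; // a fresh window has no name, so the script sends "null"

    return $base . '?' . implode('&', $pairs);
}

// Example with a fragment of the loopback page instead of a live response.
$html = '<script>query = query += "_afrLoop=39279593944826";</script>';
$loop = extractAfrLoop($html);
if ($loop !== null) {
    echo buildLoopbackUrl('https://www.mywebsite.com/faces/index.jspx', $loop), "\n";
}
```

In a scraper, the first response body would be passed to `extractAfrLoop`, and the URL returned by `buildLoopbackUrl` would then be requested on the same cURL handle (via `CURLOPT_URL`) so that the cookie jar from the first request is reused.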

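Comments:

Would deactivating the loopback feature of the ADF project work for you?

@MrAdibou I can't deactivate it, because I am scraping another website that I don't own.

Answer 1: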

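The correct way to crawl ADF pages is to pass the following parameter in the URL

*domain.com*?org.apache.myfaces.trinidad.outputMode=webcrawler

on every GET request your script makes. Keep in mind that the page will look different once you switch to crawler mode, because it is not meant for humans, but it should still contain all the raw details you need to scrape.

This is an old question and the OP has probably long since moved on to better things, but I wanted to answer it here to help others who run into the same problem.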
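
As a rough illustration of the suggestion above, the parameter can be appended to the request URL before it is handed to cURL. The helper name `withCrawlerMode` is hypothetical, and whether a given ADF site still honors this parameter is an assumption; the comments on this answer suggest it may have been removed.

```php
<?php
// Hypothetical helper: appends the crawler output-mode parameter named in the answer.
function withCrawlerMode(string $url): string
{
    $param = 'org.apache.myfaces.trinidad.outputMode=webcrawler';

    // Split off a fragment, if any, so the parameter lands in the query string.
    $fragment = '';
    $hashPos = strpos($url, '#');
    if ($hashPos !== false) {
        $fragment = substr($url, $hashPos);
        $url = substr($url, 0, $hashPos);
    }

    // Use "?" when the URL has no query string yet, "&" otherwise.
    $separator = (strpos($url, '?') === false) ? '?' : '&';

    return $url . $separator . $param . $fragment;
}

echo withCrawlerMode('https://www.mywebsite.com/faces/index.jspx'), "\n";
```

The result can be passed straight to `curl_init()` in place of the original `$url`.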

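Comments:

Ashvin, I am using the PHP cURL library and can't set the output mode the way you describe. I think it can be set in ADF, but not in PHP.

I am referring to a parameter in the URL you are requesting.

Interesting... Is this officially documented anywhere? Or is it an unreleased "feature"?

I can't find this in the repo anymore; it has most likely been discontinued. I remember it from when I contributed there a few years ago. If anyone still needs this capability, email mode seems to be the closest, since as I recall it inlines all CSS and turns off all kinds of dynamic scripts. The email parameter appears to be "org.apache.myfaces.trinidad.agent.email", from github.com/apache/myfaces-trinidad/blob/master/trinidad-impl/…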
