How to authenticate a shibboleth multi-hostname website with httr in R

Posted: 2016-04-22 03:50:00

Question:

Note: ipums international and ipums usa probably use the same system. ipums usa allows quicker registration. If you would like to test your code, try signing up at https://usa.ipums.org/usa-action/users/request_access

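I am trying to programmatically download files from https://international.ipums.org/ with the R language and httr. I need to use httr rather than RCurl because, after authenticating, I need to download large files directly to disk and not into RAM. As far as I know, this is currently only possible with httr.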

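The reproducible code below documents my best attempt at getting from the login page (https://international.ipums.org/international-action/users/login) to the main post-authentication page. Any tips or hints would be appreciated. Thanks!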

my_email <- "email@address.com"
my_password <- "password"

tf <- tempfile()

# use httr, because i need to download a large file after authentication
# and only httr supports that with its `write_disk()` option
library(httr)

# turn off ssl verify, otherwise the subsequent GET command will fail
set_config( config( ssl_verifypeer = 0L ) )

GET( "https://international.ipums.org/Shibboleth.sso/Login?target=https%3A%2F%2Finternational.ipums.org%2Finternational-action%2Fmenu" )

# connect to the starting login page of the website
( a <- GET( "https://international.ipums.org/international-action/users/login" , verbose( info = TRUE ) ) )

# which takes me through to a lot of websites, but ultimately (in my browser) lands at
shibboleth_url <- "https://live.identity.popdata.org:443/idp/Authn/UserPassword"

# construct authentication information?
base_values <- list( "j_username" = my_email , "j_password" = my_password )
idp_values <- list( "j_username" = my_email , "j_password" = my_password ,  "_idp_authn_lc_key"=subset( a$cookies , domain == "live.identity.popdata.org" )$value , "JSESSIONID" = subset( a$cookies , domain == "#HttpOnly_live.identity.popdata.org" )$value )
ipums_values <- list( "j_username" = my_email , "j_password" = my_password ,  "_idp_authn_lc_key"=subset( a$cookies , domain == "live.identity.popdata.org" )$value , "JSESSIONID" = subset( a$cookies , domain == "international.ipums.org" )$value)

# i believe this is where the main login should happen, but it looks like it's failing
GET( shibboleth_url , query = idp_values )
POST( shibboleth_url , body = base_values )
writeBin( GET( shibboleth_url , query = idp_values )$content , tf )

readLines( tf )
# The MPC account authentication system has encountered an error
# This error can sometimes occur if you did not close your browser after logging out of an application previously.  It may also occur for other reasons.  Please close your browser and try your action again.

writeBin( GET( "https://live.identity.popdata.org/idp/profile/SAML2/Redirect/SSO" , query = idp_values )$content , tf )
POST( "https://live.identity.popdata.org/idp/profile/SAML2/Redirect/SSO" , body = idp_values )
readLines( tf )
# same error as above

# return to the main login page..
writeBin( GET( "https://international.ipums.org/international-action/menu" , query = ipums_values )$content , tf )
readLines( tf )
# ..not logged in

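Comments:

Have you considered RSelenium for this?

@Thomas hi, i am not sure where to start. i am open to it, so long as it can download arbitrarily large files post-authentication (httr can but RCurl cannot)

Without an actual account, there is no way to try this :(

@cyberj0g see the edit at the top, you should be able to get an account easily

Answer 1: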

You have to use set_cookies() to send your cookies to the server:

library(httr)
library(rvest)
#my_email <- "xxx"
#my_password <- "yyy"
tf <- tempfile()
set_config( config( ssl_verifypeer = 0L ) )

# Get first page
p1 <- GET( "https://international.ipums.org/international-action/users/login" , verbose( info = TRUE ) )

# Post Login credentials
b2 <- list( "j_username" = my_email , "j_password" = my_password )
c2 <- c(JSESSIONID=p1$cookies[p1$cookies$domain=="#HttpOnly_live.identity.popdata.org",]$value,
           `_idp_authn_lc_key`=p1$cookies[p1$cookies$domain=="live.identity.popdata.org",]$value)
p2 <- POST(p1$url,body = b2, set_cookies(.cookies = c2), encode="form" )

# Parse hidden fields
h2 <- read_html(p2$content)
form <-  h2 %>% html_form() 

# Post hidden fields
b3 <- list( "RelayState"=form[[1]]$fields[[1]]$value, "SAMLResponse"=form[[1]]$fields[[2]]$value)
c3 <- c(JSESSIONID=p1$cookies[p1$cookies$domain=="#HttpOnly_live.identity.popdata.org",]$value,
           `_idp_session`=p2$cookies[p2$cookies$name=="_idp_session",]$value,
           `_idp_authn_lc_key`=p2$cookies[p2$cookies$name=="_idp_authn_lc_key",]$value)
p3 <- POST( form[[1]]$url , body=b3, set_cookies(.cookies = c3), encode = "form")

# Get interesting page
c4 <- c(JSESSIONID=p3$cookies[p3$cookies$domain=="international.ipums.org" & p3$cookies$name=="JSESSIONID",]$value,
           `_idp_session`=p3$cookies[p3$cookies$name=="_idp_session",]$value,
           `_idp_authn_lc_key`=p3$cookies[p3$cookies$name=="_idp_authn_lc_key",]$value)
p4 <- GET( "https://international.ipums.org/international-action/menu", set_cookies(.cookies = c4) )
writeBin(p4$content , tf )
readLines( tf )[55]

Since the result is

[1] "    <li class=\"lastItem\"><a href=\"/international-action/users/logout\">Logout</a></li>"

I guess you are logged in...
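The code above picks RelayState and SAMLResponse by their position in the form. As a hedged alternative, the sketch below looks the hidden fields up by name, reading the page with xml2 directly so it does not depend on the exact structure of the object returned by html_form(). The helper name hidden_fields and the sample page (including its sp.example.org URL and field values) are made up for illustration; the real page returned by the identity provider may look different.

```r
library(xml2)

# Return the hidden <input> fields of the first <form> in an HTML page
# as a named list (field name -> field value).
# `hidden_fields` is a hypothetical helper, not part of httr or rvest.
hidden_fields <- function(html) {
  doc <- read_html(html)
  inputs <- xml_find_all(doc, "(//form)[1]//input[@type='hidden']")
  values <- as.list(xml_attr(inputs, "value"))
  names(values) <- xml_attr(inputs, "name")
  values
}

# Made-up page imitating an auto-submitting SAML form (assumption)
sample_page <- '<html><body>
  <form action="https://sp.example.org/Shibboleth.sso/SAML2/POST" method="post">
    <input type="hidden" name="RelayState" value="cookie:abc123"/>
    <input type="hidden" name="SAMLResponse" value="PHNhbWw+ZmFrZTwvc2FtbD4="/>
    <input type="submit" value="Continue"/>
  </form></body></html>'

fields <- hidden_fields(sample_page)
names(fields)
```

With a real response, something like hidden_fields(rawToChar(p2$content)) should give a list whose RelayState and SAMLResponse entries can be passed as the POST body, assuming the page contains those two fields.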

Comments:

Perfect, thank you. Credited at github.com/ajdamico/asdfree/commit/…

Answer 2:

@HubertL made a lot of steps in the right direction, but I think his answer is incomplete.

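First, when you implement automated web authorization, the thing to focus on is the cookies used in the "normal" manual workflow. You can easily monitor them with the developer tools of any modern browser.

There we see the JSESSIONID and _shibsession* cookies. The first holds the website's JSP session id; the second is most likely used only for shibboleth authorization. The server probably binds them together somehow, but JSESSIONID requires no authorization: you get it as soon as you open the website. So we have to obtain the _shibsession* cookie for our JSESSIONID to become authorized. That is the point of the Shibboleth authorization process with its many redirects. See the comments in the code.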

login_ipums = function(user, password) {

  library(httr)
  library(rvest)

  set_config( config( ssl_verifypeer = 0L ) )

  #important - httr preserves cookies on subsequent requests to the same host, we don't need that because of sessions expiration
  handle_reset("https://usa.ipums.org/")

  #set login and password
  login1 = GET( "https://usa.ipums.org/usa-action/users/login" )
  form_auth = list( "j_username" = user , "j_password" = password )

  #collect the cookies from the first response as a named vector
  l1_cookies=login1$cookies$value
  names(l1_cookies)=login1$cookies$name

  #receive auth tokens as html hidden fields in a form
  login2 = POST(login1$url, body = form_auth, set_cookies(.cookies=l1_cookies), encode="form")
  login2_form = read_html(login2$content) %>% html_form() 

  l2_cookies=login2$cookies$value
  names(l2_cookies)=login2$cookies$name

  #submit the form back (browser submits it back automatically with JS)
  login3 = POST(login2_form[[1]]$url, body=list(RelayState=login2_form[[1]]$fields$RelayState$value, 
                                                SAMLResponse=login2_form[[1]]$fields$SAMLResponse$value), 
                set_cookies(.cookies=l2_cookies), 
                encode="form")

  #now we have what we came for - the _shibsession_* and JSESSIONID cookies
  login_cookies = login3$cookies$value
  names(login_cookies)=login3$cookies$name

  return(login_cookies)
}

After calling login_ipums, we get the following cookies:

> cookies=login_ipums(my_email, my_password)
> names(cookies)
[1] "JSESSIONID"      
[2] "_idp_authn_lc_key"             
[3] "_shibsession_7573612e69..."

Here, JSESSIONID and _shibsession_* are both used for site-wide authorization. _idp_authn_lc_key is probably not needed, but leaving it in does no harm.
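The repeated value/name assignments in login_ipums can be folded into a small helper, and checking for a _shibsession_* cookie is a quick sanity test that the login actually went through. This is a minimal sketch in base R: cookies_to_vector, has_shib_session and the example cookie table are hypothetical, and it assumes the cookie table has name and value columns as used in the code above.

```r
# Turn a cookie table (columns `name` and `value`) into the named
# character vector that httr's set_cookies(.cookies = ...) accepts.
# Both helpers are hypothetical names introduced for illustration.
cookies_to_vector <- function(cookie_df) {
  stats::setNames(as.character(cookie_df$value), cookie_df$name)
}

# TRUE if any cookie name starts with "_shibsession_"
has_shib_session <- function(cookies) {
  any(grepl("^_shibsession_", names(cookies)))
}

# Made-up cookie table, standing in for something like login3$cookies
example_cookies <- data.frame(
  domain = c("usa.ipums.org", "usa.ipums.org"),
  name   = c("JSESSIONID", "_shibsession_example"),
  value  = c("aaa", "bbb"),
  stringsAsFactors = FALSE
)

login_cookies <- cookies_to_vector(example_cookies)
has_shib_session(login_cookies)
```

If has_shib_session() returns FALSE on the cookies from a real login, the credentials or one of the redirect steps most likely failed, and later downloads would probably return the login page instead of the file.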

Now you can easily download files like this:

cookies=login_ipums(my_email, my_password)
target = GET("https://usa.ipums.org/usa-action/downloads/extract_files/usa_00001.dat.gz",
         set_cookies(.cookies=cookies),
         write_disk("file.bin", overwrite = TRUE))

Important: as you can see, I am using IPUMS USA, not International. To check this code with your account, replace usa with international everywhere, including the *-action part of the URLs.
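Swapping usa for international in both the host name and the *-action path segment can be sketched as a tiny helper. ipums_url is a hypothetical name, and the URL pattern is inferred only from the URLs quoted in this thread, so other IPUMS projects may not follow it.

```r
# Build an IPUMS URL for a given project ("usa" or "international").
# The pattern https://<project>.ipums.org/<project>-action/<path> is an
# assumption based on the URLs quoted in this thread.
ipums_url <- function(project, path) {
  sprintf("https://%s.ipums.org/%s-action/%s", project, project, path)
}

ipums_url("international", "users/login")
ipums_url("usa", "downloads/extract_files/usa_00001.dat.gz")
```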

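Comments:

Awesome, thank you! @HubertL's code worked for me, and it looks like yours does too. Thanks!!

OK, but note that @HubertL's code never explicitly sets the _shibsession_* cookie (the actual authorization token). It relies, probably unintentionally, on httr's cookie persistence mechanism, which may be a problem in production.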
