Web scraping company details from LinkedIn --- not able to get body tag inside

【Posted】: 2019-12-05 11:48:14 【Question】:

I am trying to get company information from a LinkedIn company page, but I cannot get anything inside the body. Can you tell me what is wrong?

I need to get:

company
website
industry
employees
etc.

But I can't. The only HTML I receive is the output shown further down.

Code:

import requests
from bs4 import BeautifulSoup

# Fetch the company page with no extra headers or cookies.
linkdine_company_about = requests.get('https://www.linkedin.com/company/exxonmobil')
html = BeautifulSoup(linkdine_company_about.text, 'html.parser')
print(html)

Output:

<pre>
exxonmobil
https://www.linkedin.com/company/exxonmobil
      <html><head>
      <script type="text/javascript">
      window.onload = function () {
          // Parse the tracking code from cookies.
          var trk = "bf";
          var trkInfo = "bf";
          var cookies = document.cookie.split("; ");
          for (var i = 0; i < cookies.length; ++i) {
              if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {
                  trk = cookies[i].substring(8);
              } else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {
                  trkInfo = cookies[i].substring(8);
              }
          }
          if (window.location.protocol == "http:") {
              // If "sl" cookie is set, redirect to https.
              for (var i = 0; i < cookies.length; ++i) {
                  if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {
                      window.location.href = "https:" +
                          window.location.href.substring(window.location.protocol.length);
                      return;
                  }
              }
          }
          // Get the new domain. For international domains such as
          // fr.linkedin.com, we convert it to www.linkedin.com
          var domain = "www.linkedin.com";
          if (domain != location.host) {
              var subdomainIndex = location.host.indexOf(".linkedin");
              if (subdomainIndex != -1) {
                  domain = "www" + location.host.substring(subdomainIndex);
              }
          }
          window.location.href = "https://" + domain + "/authwall?trk=" + trk + "&trkInfo=" + trkInfo +
              "&originalReferer=" + document.referrer.substr(0, 200) +
              "&sessionRedirect=" + encodeURIComponent(window.location.href);
      }
    </script>
    </head></html>
***
Process finished with exit code 0
</pre>

【Comments】:

【Answer 1】:

You just need to pass headers with the request; here is how you can do it.

Don't forget to replace the Cookie value with your own.

import requests
from bs4 import BeautifulSoup

# Browser-like request headers; the Cookie value must come from your own session.
headers = {
    'Host': 'www.linkedin.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    # Note: decoding a 'br' response may require a Brotli package to be installed.
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Cookie': '',  # replace with your own cookies.
    'Upgrade-Insecure-Requests': '1',
    'Cache-Control': 'max-age=0',
    'TE': 'Trailers'
}

r = requests.get(
    'https://www.linkedin.com/company/exxonmobil', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
# prettify is a method, so it has to be called to get the formatted markup.
print(soup.prettify())
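
Two small helpers may be handy around the code above; this is only a sketch under stated assumptions, not part of the original answer. `parse_cookie_string` turns a `name=value; name2=value2` string (for example, one copied from the browser's request headers) into a dict, and `looks_like_authwall_stub` is a heuristic that flags a response like the one in the question: no `<body>` element and a reference to the `/authwall` path. Both function names and the cookie names in the example are hypothetical.

```python
from html.parser import HTMLParser


def parse_cookie_string(raw: str) -> dict:
    """Split a 'name=value; name2=value2' string into a dict (values kept verbatim)."""
    cookies = {}
    for pair in raw.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


class _BodyFinder(HTMLParser):
    """Records whether a <body> start tag appears in the markup."""

    def __init__(self):
        super().__init__()
        self.has_body = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.has_body = True


def looks_like_authwall_stub(html: str) -> bool:
    """Heuristic only: no <body> element and a mention of the /authwall path."""
    finder = _BodyFinder()
    finder.feed(html)
    return not finder.has_body and "/authwall" in html


# Hypothetical cookie names and sample markup, for illustration only.
example = parse_cookie_string('session_id=abc123; token="x=y"')
stub = '<html><head><script>window.location.href = "/authwall?trk=bf";</script></head></html>'
page = "<html><head></head><body><h1>ExxonMobil</h1></body></html>"
```

With the answer's code, the parsed dict could be passed as `cookies=` to `requests.get` instead of filling in the `'Cookie'` header entry, and `looks_like_authwall_stub(r.text)` could be checked before building `soup`, so that a redirect stub is not mistaken for the company page.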

【Comments】:
