s3cmd performance is extremely poor when a user transfers a 10TB file

Posted: 2021-10-16 11:55:18

I am trying to transfer a 10TB file to COS (Cloud Object Storage) using s3cmd.

To transfer the file, I am using the following command:

python3 cloud-s3.py --upload s3cmd /data/10TB.txt pr-bucket1 --multipart-chunk-size-mb 1024 --limit-rate 100M --no-check-md5

Transferring this file takes about 55 hours.

Are there any other parameters that could improve its performance?

By comparison, the AWS CLI takes about 22 hours to transfer the same file.

Why is s3cmd's performance so poor? Is it designed this way?

Can anyone help me resolve this?

Here is what I have in my cloud-s3.py file:

#!/usr/bin/env python3

import sys
import argparse
import subprocess

def main(argv):
    parser = argparse.ArgumentParser(description='Cloud project. Prereq:pip3 ')
    parser.add_argument("-i", "--install", help="Command to install either s3cmd / aws cli.", choices=["s3cmd", "aws_cli"], dest='installation', type= str)
    parser.add_argument("-c", "--configure", help="Command to configure either s3cmd / aws cli.", choices=["s3cmd", "aws_cli"], dest='configure', type= str)
    parser.add_argument("-u", "--upload", help="Command to transfer file to the bucket. protocol, file path and bucket name are required. Upload supports GPG encryption.", nargs=3, type=str)
    parser.add_argument("-l", "--list", help="Command to list the bucket items. protocol, bucket name is required.", nargs=2, type=str)
    parser.add_argument("-e", "--encrypt", help="Flag to send an encrypted file. The encryption password needs to be given while configuring s3cmd. Other users would need to use gpg -d <file> to decrypt it. And should enter the password you supplied.", action='store_true', dest='encryption')
    parser.add_argument("-d", "--disable-multipart", help="Flag to disable multipart transfer for the current transfer. FYI, By default the multipart transfer is enabled for files larger than the default multipart chunk size. Refer .s3cfg text file.", dest='disable_multipart', action='store_true')
    parser.add_argument("-s", "--multipart-chunk-size-mb", help="Size of each chunk of a multipart upload. Files bigger than SIZE are automatically uploaded as multithreaded-multipart, smaller files are uploaded using the traditional method. SIZE is in Mega-Bytes, default chunk size is 15MB, minimum allowed chunk size is 5MB, maximum is 5GB.", dest='chunk_size', type=str, nargs=1)
    parser.add_argument("--sync", help="Conditional Transfer. Only files that doesn't exist at the destination in the same version are transferred. Note: sync doesn't support GPG encryption.", dest='sync_data', nargs=3, type=str)
    parser.add_argument("--limit-rate", help="Limit the upload or download speed to amount bytes per second.  Amount may be expressed in bytes, kilobytes with the k suffix, or megabytes with the m suffix", dest='limit_rate', nargs=1, type=str)
    parser.add_argument("--no-check-md5", help="Do not check MD5 sums when comparing files for [sync]. Only size will be compared. May significantly speed up transfer but may also miss some changed files.", dest='no_checksum', action='store_true')

    argument = parser.parse_args(argv)  # parse the argv passed in from main() rather than re-reading sys.argv
    install = argument.installation
    config = argument.configure
    upload = argument.upload
    list_bucket = argument.list
    encrypt_enabled = argument.encryption
    disable_multipart = argument.disable_multipart
    chunk_size = argument.chunk_size
    sync = argument.sync_data
    limit_rate = argument.limit_rate
    no_checksum = argument.no_checksum

    if install == 's3cmd':
        print("s3 cmd")
        subprocess.call('sudo pip3 install s3cmd', shell=True)
    elif install == 'aws_cli':  # fixed: the argparse choice is "aws_cli", so 'aws cli' could never match
        print("aws cli")
    if config == "s3cmd":
        print("config s3 cmd")
        subprocess.run('s3cmd --configure', shell=True)
    elif config == "aws_cli":
        print("config aws cli")
    if upload:
        print("upload")
        protocol = argument.upload[0]
        filename = argument.upload[1]
        bucketname = "s3://"
        bucketname += argument.upload[2]
        print("protocol = ", protocol)
        print("filename = ", filename)
        print("bucket = ", bucketname)
        upload_list = [protocol, "put", filename, bucketname]
        if encrypt_enabled :
            upload_list.append("-e")
        if disable_multipart :
            upload_list.append("--disable-multipart")
        if chunk_size :
            upload_list.append("--multipart-chunk-size-mb")
            upload_list.append(argument.chunk_size[0])
        if limit_rate :
            upload_list.append("--limit-rate")
            upload_list.append(argument.limit_rate[0])
        print("\n Print upload list :\n")
        print(upload_list)
        subprocess.run(upload_list)

    if list_bucket:
        print("list")
        protocol = argument.list[0]
        bucketname = "s3://"
        bucketname += argument.list[1]
        subprocess.run([protocol, "ls", bucketname])

    if sync:
        print("executing s3 sync")
        protocol = argument.sync_data[0]
        filename = argument.sync_data[1]
        bucketname = "s3://"
        bucketname += argument.sync_data[2]
        print("protocol = ", protocol)
        print("filename = ", filename)
        print("bucket = ", bucketname)
        sync_list = [protocol, "sync", filename, bucketname]
        if disable_multipart :
            sync_list.append("--disable-multipart")
        if chunk_size :
            sync_list.append("--multipart-chunk-size-mb")
            sync_list.append(argument.chunk_size[0])
        if limit_rate :
            sync_list.append("--limit-rate")
            sync_list.append(argument.limit_rate[0])
        if no_checksum :
            sync_list.append("--no-check-md5")
        print("\n Print sync list :\n")
        print(sync_list)
        subprocess.run(sync_list)

if __name__ == "__main__":
    main(sys.argv[1:])

Comments:

Why use s3cmd? The last releases were in 2020 and 2018. Use the AWS Command-Line Interface (CLI) instead.

Answer 1:

The general recommendation for uploading to S3 is to use multipart upload for files larger than 100MB.
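A plausible explanation for the s3cmd/AWS CLI gap is that s3cmd uploads multipart chunks one at a time, while the AWS CLI and boto3 upload several parts concurrently. As a point of comparison, here is a minimal boto3 sketch of a concurrent multipart upload (the bucket name and file path come from the question; the object key, part size, and concurrency level are illustrative, and credentials/endpoint setup for COS is omitted):

import boto3
from boto3.s3.transfer import TransferConfig

# Mirror the question's 1GB chunk size, but let boto3 upload
# up to 10 parts in parallel instead of one at a time.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,   # use multipart for files over 100MB
    multipart_chunksize=1024 * 1024 * 1024,  # 1GB parts, as in the s3cmd command
    max_concurrency=10,                      # number of parallel part uploads
    use_threads=True,
)

s3 = boto3.client("s3")  # endpoint_url/credentials for COS would go here
s3.upload_file("/data/10TB.txt", "pr-bucket1", "10TB.txt", Config=config)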

Another way to speed up uploads is S3 Transfer Acceleration: https://aws.amazon.com/s3/transfer-acceleration/
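Transfer Acceleration is an AWS S3 feature, so it will not help with a third-party COS endpoint, but for uploads that do target AWS the setup with boto3 looks roughly like this (bucket name and file path taken from the question; acceleration incurs an extra per-GB cost):

import boto3
from botocore.config import Config

s3 = boto3.client("s3")

# One-time setup: enable acceleration on the bucket.
s3.put_bucket_accelerate_configuration(
    Bucket="pr-bucket1",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Then route requests through the accelerate endpoint.
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_accel.upload_file("/data/10TB.txt", "pr-bucket1", "10TB.txt")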

But this is an extreme case: even with a 100Mbps connection, uploading 1TB takes about 23 hours.
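That figure checks out with a quick back-of-the-envelope calculation (assuming decimal units, 1TB = 10^12 bytes):

link_mbps = 100                      # 100 Mbps connection
bytes_per_sec = link_mbps * 1e6 / 8  # 12.5 MB/s
size_bytes = 1e12                    # 1 TB
hours = size_bytes / bytes_per_sec / 3600
print(f"{hours:.1f} hours")          # ~22.2 hours per TB at full line rate

Note also that the command in the question passes --limit-rate 100M, which caps s3cmd at 100 MB/s; at that cap, a 10TB transfer cannot finish in less than roughly 28 hours regardless of the tool.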

Uploading to S3 over the internet is not a good option at this scale; AWS does have other products for moving data of this size, such as AWS Snowball.

