gsutil int error using rsync on large files
【Title】: gsutil int error using rsync on large files 【Posted】: 2015-05-06 03:37:15 【Question】: Environment:
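Windows 2012 R2 server [server]
Python: 2.7.9
gsutil: 4.9
Running as SYSTEM in an elevated command prompt (full access to all files)
The bucket is also named [server]
Background: we are trying to back up about 5 TB of data to GCS using gsutil.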
Execution: we started with the following command:
python d:\gsutil\gsutil -m rsync -R d:\data\ gs://[server]
Most of the data was copied, except for 482 large files. We then tried:
python d:\gsutil\gsutil rsync -R d:\data\ gs://[server]
...and the sync failed on the first file that had previously failed to copy. We then ran the following:
python d:\gsutil\gsutil -d rsync -R d:\data\ gs://[server]
and received the following output:
Copying file://d:\data\CDC-Exp-Mar-1-2015\CDC-148_170_sample_6\CDC-148_170_sample_6_trimQ20_filter50.blast [Content-Type=application/octet-stream]...
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this you and any
users that download such composite files will need to have a compiled
crcmod installed (see "gsutil help crcmod").
DEBUG 0304 10:01:53.209000 oauth2_client.py] GetAccessToken: checking cache for key [key]
DEBUG 0304 10:01:53.209000 oauth2_client.py] FileSystemTokenCache.GetToken: key=[key] present (cache_file=c:\windows\temp\oauth2_client-tokencache._.[key])
DEBUG 0304 10:01:53.209000 oauth2_client.py] GetAccessToken: token from cache: AccessToken(token=[token], expiry=2015-03-04
18:00:44.617000Z)
INFO 0304 10:01:53.224000 base_api.py] Calling method storage.objects.insert with StorageObjectsInsertRequest: <StorageObjectsInsertRequest
bucket: u'[server]'
object: <Object
acl: []
bucket: u'[server]'
contentLanguage: 'en'
contentType: 'application/octet-stream'
name: u'CDC-Exp-Mar-1-2015/CDC-148_170_sample_6/CDC-148_170_sample_6_trimQ20_filter50.blast'>>
INFO 0304 10:01:53.224000 base_api.py] Making http POST to https://www.googleapis.com/resumable/upload/storage/v1/b/[server]/o?fields=generation%2Ccrc32c%2Cmd5Hash%2Csize&alt=json&prettyPrint=True
&uploadType=resumable
INFO 0304 10:01:53.240000 base_api.py] Headers: 'X-Upload-Content-Length': '144853423157',
'X-Upload-Content-Type': 'application/octet-stream',
'accept': 'application/json',
'accept-encoding': 'gzip, deflate',
'content-length': '189',
'content-type': 'application/json',
'user-agent': 'apitools gsutil/4.9 (win32)'
INFO 0304 10:01:53.240000 base_api.py] Body:
"bucket": "[server]", "contentType": "application/octet-stream", "name": "CDC-Exp-Mar-1-2015/CDC-148_170_sample_6/CDC-148_170_sample_6_trimQ20_filter50.blast", "contentLanguage": "en"
connect: (www.googleapis.com, 443)
send: 'POST /resumable/upload/storage/v1/b/[server]/o?fields=generation%2Ccrc32c%2Cmd5Hash%2Csize&alt=json&prettyPrint=True&uploadType=resumable HTTP/1.1\r\nHost: www.googleapis.com\r\ncontent-len
gth: 189\r\naccept-encoding: gzip, deflate\r\naccept: application/json\r\nuser-agent: apitools gsutil/4.9 (win32)\r\nx-upload-content-length: 144853423157\r\nx-upload-content-type: application/octet-s
tream\r\ncontent-type: application/json\r\nauthorization: Bearer [token]\r\n\r\n"bucket": "[server]", "contentType": "a
pplication/octet-stream", "name": "CDC-Exp-Mar-1-2015/CDC-148_170_sample_6/CDC-148_170_sample_6_trimQ20_filter50.blast", "contentLanguage": "en"'
reply: 'HTTP/1.1 200 OK\r\n'
header: Location: https://www.googleapis.com/resumable/upload/storage/v1/b/[server]/o?fields=generation%2Ccrc32c%2Cmd5Hash%2Csize&alt=json&prettyPrint=True&uploadType=resumable&upload_id=AEnB2UqXH
kYq0s8RJk87LK8Bx-sHU60uRvytO8NBnV-dFQAEo1uBPm-bDlGnnGqpx4hMyaa5qgQtMMq0kXWL_ezfo6G1jMyGKw
header: Vary: Origin
header: Vary: X-Origin
header: Cache-Control: no-cache, no-store, max-age=0, must-revalidate
header: Pragma: no-cache
header: Expires: Fri, 01 Jan 1990 00:00:00 GMT
header: Date: Wed, 04 Mar 2015 17:01:53 GMT
header: Content-Length: 0
header: Server: UploadServer ("Built on Feb 18 2015 18:10:26 (1424311826)")
header: Content-Type: text/html; charset=UTF-8
header: Alternate-Protocol: 443:quic,p=0.08
connect: (www.googleapis.com, 443)
send: 'POST /resumable/upload/storage/v1/b/[server]/o?fields=generation%2Ccrc32c%2Cmd5Hash%2Csize&alt=json&prettyPrint=True&uploadType=resumable HTTP/1.1\r\nHost: www.googleapis.com\r\ncontent-len
gth: 189\r\naccept-encoding: gzip, deflate\r\naccept: application/json\r\nuser-agent: apitools gsutil/4.9 (win32)\r\nx-upload-content-length: 144853423157\r\nx-upload-content-type: application/octet-s
tream\r\ncontent-type: application/json\r\nauthorization: Bearer [token]\r\n\r\n"bucket": "[server]", "contentType": "a
pplication/octet-stream", "name": "CDC-Exp-Mar-1-2015/CDC-148_170_sample_6/CDC-148_170_sample_6_trimQ20_filter50.blast", "contentLanguage": "en"'
reply: 'HTTP/1.1 200 OK\r\n'
header: Location: https://www.googleapis.com/resumable/upload/storage/v1/b/[server]/o?fields=generation%2Ccrc32c%2Cmd5Hash%2Csize&alt=json&prettyPrint=True&uploadType=resumable&upload_id=AEnB2Urlx
0WvbB5z9k9uvC9Qv4DeW4cCFLfn559_20nZKChCqSukmPYZcmZm7a_kwCrqubbRqF2an1HOv_lrMcPkdfpDinluQg
header: Vary: Origin
header: Vary: X-Origin
header: Cache-Control: no-cache, no-store, max-age=0, must-revalidate
header: Pragma: no-cache
header: Expires: Fri, 01 Jan 1990 00:00:00 GMT
header: Date: Wed, 04 Mar 2015 17:01:53 GMT
header: Content-Length: 0
header: Server: UploadServer ("Built on Feb 18 2015 18:10:26 (1424311826)")
header: Content-Type: text/html; charset=UTF-8
header: Alternate-Protocol: 443:quic,p=0.08
INFO 0304 10:01:53.631000 base_api.py] Response of type Object: <Object
acl: []>
DEBUG: Exception stack trace:
Traceback (most recent call last):
File "d:\gsutil\gslib\__main__.py", line 524, in _RunNamedCommandAndHandleExceptions
debug_level, parallel_operations)
File "d:\gsutil\gslib\command_runner.py", line 272, in RunNamedCommand
return_code = command_inst.RunCommand()
File "d:\gsutil\gslib\commands\rsync.py", line 967, in RunCommand
fail_on_error=True)
File "d:\gsutil\gslib\command.py", line 1148, in Apply
arg_checker, should_return_results, fail_on_error)
File "d:\gsutil\gslib\command.py", line 1219, in _SequentialApply
worker_thread.PerformTask(task, self)
File "d:\gsutil\gslib\command.py", line 1654, in PerformTask
results = task.func(cls, task.args, thread_state=self.thread_gsutil_api)
File "d:\gsutil\gslib\commands\rsync.py", line 866, in _RsyncFunc
headers=cls.headers)
File "d:\gsutil\gslib\copy_helper.py", line 2360, in PerformCopy
allow_splitting=allow_splitting)
File "d:\gsutil\gslib\copy_helper.py", line 1695, in _UploadFileToObject
dst_obj_metadata, preconditions, gsutil_api, logger)
File "d:\gsutil\gslib\copy_helper.py", line 1539, in _UploadFileToObjectResumable
progress_callback=progress_callback)
File "d:\gsutil\gslib\cloud_api_delegator.py", line 248, in UploadObjectResumable
tracker_callback=tracker_callback, progress_callback=progress_callback)
File "d:\gsutil\gslib\gcs_json_api.py", line 956, in UploadObjectResumable
apitools_strategy=apitools_transfer.RESUMABLE_UPLOAD)
File "d:\gsutil\gslib\gcs_json_api.py", line 804, in _UploadObject
additional_headers, progress_callback)
File "d:\gsutil\gslib\gcs_json_api.py", line 861, in _PerformResumableUpload
additional_headers=addl_headers)
File "d:\gsutil\gslib\third_party\storage_apitools\transfer.py", line 790, in StreamMedia
additional_headers=additional_headers, use_chunks=False)
File "d:\gsutil\gslib\third_party\storage_apitools\transfer.py", line 749, in __StreamMedia
additional_headers=additional_headers)
File "d:\gsutil\gslib\third_party\storage_apitools\transfer.py", line 826, in __SendMediaBody
body=body_stream)
File "d:\gsutil\gslib\third_party\storage_apitools\http_wrapper.py", line 103, in __init__
self.body = body
File "d:\gsutil\gslib\third_party\storage_apitools\http_wrapper.py", line 124, in body
self.headers['content-length'] = str(len(self.__body))
OverflowError: long int too large to convert to int
We tried changing our .boto file to set rsync_buffer_lines = 64000, but this had no effect.
Any help is appreciated.
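The last frame of the traceback calls `len()` on the upload body, and the log reports an `X-Upload-Content-Length` of 144853423157 bytes. The sketch below reproduces that class of failure. It assumes the body object reports its size through `__len__`; `FakeBodyStream` and `safe_len` are hypothetical names for illustration, not part of gsutil or apitools. CPython requires the result of `len()` to fit in `sys.maxsize`, which is 2**31 - 1 on a 32-bit build.

```python
import sys


class FakeBodyStream(object):
    """Hypothetical stand-in for an upload body that reports its size via __len__."""

    def __init__(self, size):
        self._size = size

    def __len__(self):
        return self._size


# Size taken from the X-Upload-Content-Length header in the debug log.
FILE_SIZE = 144853423157
# Value of sys.maxsize on a 32-bit CPython build.
MAX_SIZE_32BIT = 2 ** 31 - 1


def safe_len(obj):
    """Return len(obj), or the OverflowError raised when the size does not fit."""
    try:
        return len(obj)
    except OverflowError as exc:
        return exc


# The file from the log is far larger than a 32-bit build can report via len().
print(FILE_SIZE > MAX_SIZE_32BIT)

# One byte past sys.maxsize triggers the overflow on any build, 32-bit or 64-bit.
print(repr(safe_len(FakeBodyStream(sys.maxsize + 1))))
```

On a 64-bit interpreter, `len(FakeBodyStream(FILE_SIZE))` succeeds because the size fits in `sys.maxsize`. That is consistent with the failure appearing only on the large files.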
【Comments】:
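Are you using 32-bit Python? I suspect this is a gsutil/apitools bug related to that.
Judging by the "win32" in the debug log, it looks like it. Confirmed: this is a bug in gsutil versions 4.8 through 4.10. If possible, a fix will be ready for gsutil 4.11.
Yes, we are using 32-bit Python, as recommended in How to install gsutil. Thanks for looking into it; I will wait for gsutil 4.11.
FYI, the gsutil 4.14 download contains a different variant of this bug (handling large files with 32-bit Python). It will be fixed in gsutil 4.15; until then, a pre-release build is available at storage.googleapis.com/prerelease/…
【Answer 1】: Version 4.11, combined with moving gsutil to the C:\ drive (as described here: gsutil doesn't work, when executed from drive D, on Windows #238), cleared up all the problems we were seeing. Thanks!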
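The comments turn on whether the interpreter is a 32-bit build. A minimal sketch for checking this directly from Python follows; `interpreter_bits` and `len_can_report` are hypothetical helper names. It uses the pointer size and `sys.maxsize`, which reflect the interpreter build rather than the operating system.

```python
import struct
import sys


def interpreter_bits():
    """Return the pointer width of the running interpreter in bits."""
    return struct.calcsize("P") * 8


def len_can_report(size_in_bytes):
    """Return True if len() on this interpreter can return a size this large."""
    return size_in_bytes <= sys.maxsize


print("Python is %d-bit, sys.maxsize = %d" % (interpreter_bits(), sys.maxsize))
# Size of the file that failed in the question's debug log.
print(len_can_report(144853423157))
```

Run this with the same `python` executable that launches gsutil. If it reports 32-bit, files over 2**31 - 1 bytes fall into the range where the `len()` call in the traceback overflows.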