Import a zip file to Python Notebook in IBM Data Science Experience (IBM DSX)
Posted: 2017-09-19 02:36:43
Question: I have a zip file, train.zip (1.1 GB), that I want to import into a Python Notebook, unzip, and then work on. I imported it as a StringIO object using the Insert StringIO Object option.
from io import StringIO
import requests
import json
import pandas as pd

# @hidden_cell
# This function accesses a file in your Object Storage. The definition contains your credentials.
# You might want to remove those credentials before you share your notebook.
def get_object_storage_file_with_credentials_xxxxxx(container, filename):
    """This function returns a StringIO object containing
    the file content from Bluemix Object Storage."""

    url1 = ''.join(['https://identity.open.softlayer.com', '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': 'member_xxxxxx', 'domain': {'id': 'xxxxxxx'},
            'password': 'xxxxx),(xxxxx'}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    for e1 in resp1_body['token']['catalog']:
        if(e1['type']=='object-store'):
            for e2 in e1['endpoints']:
                if(e2['interface']=='public' and e2['region']=='dallas'):
                    url2 = ''.join([e2['url'],'/', container, '/', filename])
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    resp2 = requests.get(url=url2, headers=headers2)
    return StringIO(resp2.text)

# Your data file was loaded into a StringIO object and you can process the data.
# Please read the documentation of pandas to learn more about your possibilities to load your data.
# pandas documentation: http://pandas.pydata.org/pandas-docs/stable/io.html
data_1 = get_object_storage_file_with_credentials_xxxxxx('DefaultProjectxxxxxxxx', 'train.zip')
This gives the output:
data_1
<_io.StringIO at 0x7f8a288cd3a8>
But when I try to unzip it with ZipFile, I get the following error:
from zipfile import ZipFile
file = ZipFile(data_1)
BadZipFile: File is not a zip file
How can I access this file in IBM DSX?
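Comments:
Possible duplicate of "Unzip buffer with Python?"
Answer 1: You can use the function shown below to fetch a zip file from Object Storage and extract it to disk. The credentials parameter is the dictionary from the code that DSX inserts into the notebook. This function is also on gist.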
import zipfile
from io import BytesIO
import requests
import json
import pandas as pd

def get_zip_file(credentials):
    # Authenticate against the identity service to obtain a token
    url1 = ''.join(['https://identity.open.softlayer.com', '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': credentials['username'],
                                  'domain': {'id': credentials['domain_id']},
                                  'password': credentials['password']}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    # Find the public object-store endpoint for the configured region
    for e1 in resp1_body['token']['catalog']:
        if(e1['type']=='object-store'):
            for e2 in e1['endpoints']:
                if(e2['interface']=='public' and e2['region']==credentials['region']):
                    url2 = ''.join([e2['url'],'/', credentials['container'], '/', credentials['filename']])
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    r = requests.get(url=url2, headers=headers2, stream=True)
    # r.content is the raw bytes of the response, so wrap it in BytesIO (not StringIO)
    z = zipfile.ZipFile(BytesIO(r.content))
    z.extractall()  # save zip contents to disk
    return(z)

z = get_zip_file(credentials)
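For an archive of this size (1.1 GB), r.content holds the whole download in memory before extraction starts. The following is a minimal sketch of an alternative, not part of the original answer: it copies a binary file-like source to a temporary file on disk in chunks and extracts from there. The helper name extract_zip_stream and its parameters are hypothetical. With requests, the r.raw attribute of a response fetched with stream=True is one such file-like source, assuming the server does not apply a content encoding.

```python
import shutil
import tempfile
import zipfile


def extract_zip_stream(source, target_dir, chunk_size=1024 * 1024):
    """Spool a binary file-like `source` to disk, then extract it.

    `source` is any object with a binary read() method (hypothetical
    helper; for example `r.raw` from a streamed requests response).
    Returns the list of member names in the archive.
    """
    with tempfile.TemporaryFile() as spool:
        # Copy in fixed-size chunks so the archive never sits fully in memory
        shutil.copyfileobj(source, spool, chunk_size)
        # ZipFile needs a seekable binary file; rewind before reading
        spool.seek(0)
        with zipfile.ZipFile(spool) as archive:
            archive.extractall(target_dir)
            return archive.namelist()
```

Inside get_zip_file, this would replace the BytesIO(r.content) step with a call such as extract_zip_stream(r.raw, '.'), under the assumptions stated above.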
Answer 2: The ZipFile constructor expects a file name or a binary file-like object, not the file contents as text. See here for a solution: "Unzip buffer with Python?"
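As a rough illustration of the point about buffers: a zip archive is binary data, so the bytes have to reach ZipFile undecoded, wrapped in BytesIO rather than StringIO. The sketch below builds a tiny archive in memory to stand in for train.zip. The helper list_members and the member name train.csv are made up for the example. In the question's code, the equivalent change would be to use resp2.content (raw bytes) instead of resp2.text (decoded text).

```python
import io
import zipfile

# Build a small zip archive in memory as a stand-in for the downloaded file
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("train.csv", "id,label\n1,0\n")
zip_bytes = buffer.getvalue()  # raw bytes, as an HTTP response body would be


def list_members(raw_bytes):
    """Open raw zip bytes through a binary buffer and list the members."""
    with zipfile.ZipFile(io.BytesIO(raw_bytes)) as archive:
        return archive.namelist()


print(list_members(zip_bytes))
```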