Import a zip file to Python Notebook in IBM Data Science Experience (IBM DSX)

Posted 2023-03-06


Asked: 2017-09-19 02:36:43

Question:

I have a zip file, train.zip (1.1 GB), that I want to import into a Python Notebook, unzip, and then start working on. I imported it as a StringIO object using the Insert StringIO Object option.

from io import StringIO
import requests
import json
import pandas as pd

# @hidden_cell
# This function accesses a file in your Object Storage. The definition contains your credentials.
# You might want to remove those credentials before you share your notebook.
def get_object_storage_file_with_credentials_xxxxxx(container, filename):
    """This function returns a StringIO object containing
    the file content from Bluemix Object Storage."""

    url1 = ''.join(['https://identity.open.softlayer.com', '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': 'member_xxxxxx','domain': {'id': 'xxxxxxx'},
            'password': 'xxxxx),(xxxxx'}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    for e1 in resp1_body['token']['catalog']:
        if(e1['type']=='object-store'):
            for e2 in e1['endpoints']:
                if(e2['interface']=='public'and e2['region']=='dallas'):
                    url2 = ''.join([e2['url'],'/', container, '/', filename])
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    resp2 = requests.get(url=url2, headers=headers2)
    return StringIO(resp2.text)

# Your data file was loaded into a StringIO object and you can process the data.
# Please read the documentation of pandas to learn more about your possibilities to load your data.
# pandas documentation: http://pandas.pydata.org/pandas-docs/stable/io.html
data_1 = get_object_storage_file_with_credentials_xxxxxx('DefaultProjectxxxxxxxx', 'train.zip')

This gives the output:

data_1
<_io.StringIO at 0x7f8a288cd3a8>

But when I try to unzip it with ZipFile, I get the following error:

from zipfile import ZipFile
file = ZipFile(data_1)

BadZipFile: File is not a zip file

How can I access the file in IBM DSX?

Comments:

Possible duplicate of Unzip buffer with Python?

Answer 1:

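You can use the function shown below to save a zip file from Object Storage. The credentials parameter is the dictionary from the code that DSX inserts into the notebook. This function is also on gist.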

import zipfile
from io import BytesIO
import requests
import json
import pandas as pd

def get_zip_file(credentials):
    """Download a zip file from Object Storage, extract it, and return the ZipFile."""

    # Request an auth token from the identity service.
    url1 = ''.join(['https://identity.open.softlayer.com', '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': credentials['username'],
                                  'domain': {'id': credentials['domain_id']},
                                  'password': credentials['password']}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()

    # Find the public object-store endpoint for the configured region.
    for e1 in resp1_body['token']['catalog']:
        if e1['type'] == 'object-store':
            for e2 in e1['endpoints']:
                if e2['interface'] == 'public' and e2['region'] == credentials['region']:
                    url2 = ''.join([e2['url'], '/', credentials['container'], '/', credentials['filename']])

    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    r = requests.get(url=url2, headers=headers2, stream=True)

    # Wrap the raw bytes (not decoded text) in BytesIO so ZipFile can read them.
    z = zipfile.ZipFile(BytesIO(r.content))
    z.extractall()  # save zip contents to disk

    return z

z = get_zip_file(credentials)
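
Because r.content loads the entire response body into memory, a 1.1 GB archive may be a tight fit on a small notebook kernel. One possible alternative, sketched here under assumptions, is to spool the body to a temporary file in chunks and open that file with ZipFile. The helper name zip_from_chunks is hypothetical and not part of the original answer. With requests, the chunks could come from r.iter_content(chunk_size=...) on a response created with stream=True; the example below uses an in-memory stand-in so it runs without network access.

```python
import io
import tempfile
import zipfile


def zip_from_chunks(chunks):
    """Spool an iterable of byte chunks to a temporary file and open it as a zip.

    Hypothetical helper for illustration; not part of the original answer.
    """
    spool = tempfile.TemporaryFile()  # opened in binary mode ("w+b") by default
    for chunk in chunks:
        if chunk:  # skip empty chunks
            spool.write(chunk)
    spool.seek(0)
    # ZipFile does not close a file object it was handed; the temporary file
    # is deleted once it is closed or garbage-collected.
    return zipfile.ZipFile(spool)


# Stand-in for a streamed HTTP body: a small archive cut into 16-byte pieces.
source = io.BytesIO()
with zipfile.ZipFile(source, "w") as archive:
    archive.writestr("train.csv", "id,label\n1,0\n")
payload = source.getvalue()
pieces = [payload[i:i + 16] for i in range(0, len(payload), 16)]

with zip_from_chunks(pieces) as z:
    member_names = z.namelist()
    content = z.read("train.csv").decode("utf-8")
```

In a notebook, the call would presumably look like zip_from_chunks(r.iter_content(chunk_size=1024 * 1024)), followed by z.extractall() as in the function above.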


Answer 2:

The ZipFile constructor expects a file name (or a seekable binary file-like object), not the file content as text. See here for a solution: Unzip buffer with Python?
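
To illustrate the difference, here is a minimal sketch that needs no network access. It builds a small archive in memory as a stand-in for train.zip, then opens it once through a BytesIO buffer and once through a StringIO buffer. The file name train.csv and its contents are made up for the example; raw_bytes plays the role of the response's content attribute, and the decoded string plays the role of its text attribute.

```python
import io
import zipfile

# Build a small zip archive in memory to stand in for train.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("train.csv", "id,label\n1,0\n2,1\n")
raw_bytes = buffer.getvalue()  # the kind of data resp.content would hold

# Binary buffer: ZipFile can seek in it and read the archive structure.
with zipfile.ZipFile(io.BytesIO(raw_bytes)) as archive:
    names = archive.namelist()
    first_line = archive.read("train.csv").decode("utf-8").splitlines()[0]

# Text buffer: decoding the bytes to str (as resp.text does) and wrapping
# the result in StringIO gives ZipFile something it cannot parse.
try:
    zipfile.ZipFile(io.StringIO(raw_bytes.decode("latin-1")))
    text_buffer_ok = True
except (zipfile.BadZipFile, OSError, TypeError):
    text_buffer_ok = False
```

Applied to the question's function, this suggests returning BytesIO(resp2.content) instead of StringIO(resp2.text).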

