How to load a dataset in streaming mode on Google Colab?



Posted: 2021-10-07 09:00:26

Question:

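I am trying to save some disk space so that I can use the CommonVoice French dataset (19G) on Google Colab, because my notebook always crashes when it runs out of disk space. I saw in the HuggingFace documentation that we can load a dataset in streaming mode, so that we can "iterate over it directly without having to download the entire dataset". I tried to use that mode in Google Colab but cannot get it to work, and I have not found anything on SO about this issue.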

!pip install datasets
!pip install 'datasets[streaming]'
!pip install aiohttp

from datasets import load_dataset

common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)


ImportError                               Traceback (most recent call last)
<ipython-input-24-489f8a0ca4e4> in <module>()
----> 1 common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)

/usr/local/lib/python3.7/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, script_version, use_auth_token, task, streaming, **config_kwargs)
    811         if not config.AIOHTTP_AVAILABLE:
    812             raise ImportError(
--> 813                 f"To be able to use dataset streaming, you need to install dependencies like aiohttp "
    814                 f'using "pip install \'datasets[streaming]\'" or "pip install aiohttp" for instance'
    815             )

ImportError: To be able to use dataset streaming, you need to install dependencies like aiohttp using "pip install 'datasets[streaming]'" or "pip install aiohttp" for instance

NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.

Is there a reason why Google Colab does not allow loading a dataset in streaming mode?
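The traceback above fails on the check `config.AIOHTTP_AVAILABLE`. One possible explanation, stated here as an assumption rather than a confirmed diagnosis, is that `datasets` computed that flag when it was first imported, before `aiohttp` was installed, so the running session still sees the old value until the runtime is restarted. The sketch below uses only the standard library to report both conditions; the helper name `streaming_dependency_status` is hypothetical and not part of `datasets`.

```python
import importlib.util
import sys


def streaming_dependency_status(package="aiohttp"):
    """Report whether `package` can be found and whether `datasets` is already loaded."""
    # find_spec returns None when a top-level package is not installed.
    installed = importlib.util.find_spec(package) is not None
    # If datasets is already in sys.modules, any availability flag it set at
    # import time (assumption) will not change until the runtime restarts.
    datasets_loaded = "datasets" in sys.modules
    return {"installed": installed, "datasets_already_imported": datasets_loaded}


status = streaming_dependency_status()
print(status)
```

If `installed` is true and `datasets_already_imported` is also true, restarting the Colab runtime and importing `datasets` again may be worth trying.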



Comments:

Ran the pip installs for datasets and aiohttp, but got a different error:
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/datasets/load.py", line 835, in load_dataset
    use_auth_token=use_auth_token,
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 139, in _get_extraction_protocol
    raise NotImplementedError(f"Extraction protocol for file at {urlpath} is not implemented yet")

Oh no! Thanks. I guess that answers it... :(

Maybe you could post a comment on how it resolved your issue, so that other readers can benefit from it too.

Answer 1:

Writing an answer for future reference. Based on @kkgarg's comment, it seems that the streaming feature is not implemented yet.

!pip install aiohttp
!pip install datasets
from datasets import load_dataset, load_metric

common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)


/usr/local/lib/python3.7/dist-packages/datasets/utils/streaming_download_manager.py in _get_extraction_protocol(self, urlpath)
    137         elif path.endswith(".zip"):
    138             return "zip"
--> 139         raise NotImplementedError(f"Extraction protocol for file at {urlpath} is not implemented yet")
    141     def download_and_extract(self, url_or_urls):

NotImplementedError: Extraction protocol for file at https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/tr.tar.gz is not implemented yet

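This indicates that streaming is not yet implemented or supported for this dataset. Perhaps it is because common_voice is distributed as a compressed archive that has to be extracted first, which streaming does not support (?): in the traceback, _get_extraction_protocol recognizes paths ending in .zip but raises for the .tar.gz URL. The streaming feature itself must be implemented, though, since it is in the documentation...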
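Both errors in this thread (the `ImportError` and the `NotImplementedError`) are raised by the `load_dataset(..., streaming=True)` call itself, so they can be caught before any iteration starts. The following is a minimal sketch of that pattern, not an official `datasets` recipe. The wrapper `try_streaming` and the stand-in `fake_loader` are hypothetical names, introduced only so that the example runs without network access.

```python
def try_streaming(load_fn, *args, **kwargs):
    """Call load_fn(..., streaming=True); return (dataset, None) or (None, error)."""
    try:
        return load_fn(*args, streaming=True, **kwargs), None
    except (ImportError, NotImplementedError) as err:
        # ImportError: streaming dependencies are missing.
        # NotImplementedError: the archive format cannot be streamed.
        return None, err


# Hypothetical stand-in for datasets.load_dataset, for illustration only.
def fake_loader(path, name=None, split=None, streaming=False):
    if path == "common_voice":
        raise NotImplementedError(
            "Extraction protocol for file at <url> is not implemented yet"
        )
    return iter([{"path": path, "split": split}])


dataset, error = try_streaming(fake_loader, "common_voice", "fr", split="train")
print(dataset, "|", error)
```

In a notebook, `load_dataset` from `datasets` would be passed in place of `fake_loader`, for example `try_streaming(load_dataset, "common_voice", "fr", split="train")`. If `error` is set, the remaining options are a regular download or a different dataset.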

