How to load a dataset in streaming mode on Google Colab?

Posted 2023-03-29

Asked: 2021-10-07 09:00:26

Question:

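I am trying to save disk space so that I can use the CommonVoice French dataset (19 GB) on Google Colab, because my notebook keeps crashing when it runs out of disk space. The HuggingFace documentation says a dataset can be loaded in streaming mode, so that we can "iterate over it directly without having to download the entire dataset". I tried to use that mode in Google Colab but could not get it to work, and I have not found anything about this problem on SO.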

!pip install datasets
!pip install 'datasets[streaming]'
!pip install aiohttp

from datasets import load_dataset

common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)

Then I get the following error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-24-489f8a0ca4e4> in <module>()
----> 1 common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)

/usr/local/lib/python3.7/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, script_version, use_auth_token, task, streaming, **config_kwargs)
    811         if not config.AIOHTTP_AVAILABLE:
    812             raise ImportError(
--> 813                 f"To be able to use dataset streaming, you need to install dependencies like aiohttp "
    814                 f'using "pip install \'datasets[streaming]\'" or "pip install aiohttp" for instance'
    815             )

ImportError: To be able to use dataset streaming, you need to install dependencies like aiohttp using "pip install 'datasets[streaming]'" or "pip install aiohttp" for instance

---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------

Is there a reason why Google Colab does not allow loading a dataset in streaming mode?

If not, what am I missing?
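The ImportError above is raised by the `config.AIOHTTP_AVAILABLE` check visible in the traceback, so a reasonable first diagnostic is to confirm whether `aiohttp` is importable in the current runtime. The following is a minimal sketch; `is_importable` is a hypothetical helper, not part of `datasets`. It is an assumption, not confirmed in this thread, that the flag is computed when `datasets` is first imported, in which case installing `aiohttp` afterwards would only take effect after restarting the Colab runtime.

```python
import importlib.util


def is_importable(package_name: str) -> bool:
    """Return True if the import system can locate the given top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Report the packages that matter for streaming mode.
for name in ("aiohttp", "datasets"):
    status = "found" if is_importable(name) else "missing"
    print(f"{name}: {status}")
```

If `aiohttp` is reported as found but the ImportError persists, restarting the runtime and re-running the cells is worth trying before looking further.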

Comments:

I ran pip for datasets and aiohttp, but got a different error:
```
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/datasets/load.py", line 835, in load_dataset
    use_auth_token=use_auth_token,
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 139, in _get_extraction_protocol
    raise NotImplementedError(f"Extraction protocol for file at {urlpath} is not implemented yet")
```

Oh no! Thanks. I guess that answers it... :(

Maybe you could comment on how it solved your problem, so that other readers can benefit too.

Answer 1:

Writing an answer for future reference. Based on @kkgarg's comment, it seems that streaming is not implemented yet for this dataset.

!pip install aiohttp
!pip install datasets
from datasets import load_dataset, load_metric

common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)

This triggers the following error:

/usr/local/lib/python3.7/dist-packages/datasets/utils/streaming_download_manager.py in _get_extraction_protocol(self, urlpath)
    137         elif path.endswith(".zip"):
    138             return "zip"
--> 139         raise NotImplementedError(f"Extraction protocol for file at {urlpath} is not implemented yet")
    140 
    141     def download_and_extract(self, url_or_urls):

NotImplementedError: Extraction protocol for file at https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/tr.tar.gz is not implemented yet

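This indicates that streaming is not yet implemented or supported for this dataset. Perhaps that is because common_voice ships as a compressed archive that has to be extracted, which streaming does not support (?). The streaming feature itself has certainly been implemented, since it is in the documentation...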
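The traceback suggests the extraction protocol is picked from the file extension, and only the `.zip` branch is visible there. The sketch below mimics that kind of check to show why a `.tar.gz` URL would fall through to `NotImplementedError`. The function name `guess_extraction_protocol` and the example URLs are hypothetical; this is not the library's actual code.

```python
def guess_extraction_protocol(urlpath: str) -> str:
    """Hypothetical stand-in for an extension-based protocol lookup."""
    # Ignore any query string before inspecting the extension.
    path = urlpath.split("?")[0]
    if path.endswith(".zip"):
        return "zip"
    # Any other extension, including ".tar.gz", is unsupported in this sketch.
    raise NotImplementedError(
        f"Extraction protocol for file at {urlpath} is not implemented yet"
    )


for url in ("https://example.com/data.zip", "https://example.com/fr.tar.gz"):
    try:
        print(url, "->", guess_extraction_protocol(url))
    except NotImplementedError as err:
        print(url, "->", err)
```

Under this reading, the failure depends on how the dataset's files are packaged, not on Google Colab itself.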
