How to load a dataset in streaming mode on Google Colab?
Posted: 2021-10-07 09:00:26
[Question description]: I am trying to save some disk space so I can use the CommonVoice French dataset (19G) on Google Colab, because my notebook keeps crashing for lack of disk space. I saw in the HuggingFace documentation that a dataset can be loaded in streaming mode, so that you can iterate over it directly without having to download the entire dataset. I tried to use that mode in Google Colab, but I cannot get it to work, and I did not find anything about this issue on SO.
!pip install datasets
!pip install 'datasets[streaming]'
!pip install aiohttp

from datasets import load_dataset

# Load the train split lazily instead of downloading the full 19G archive
common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)
Then I get the following error:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-24-489f8a0ca4e4> in <module>()
----> 1 common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)
/usr/local/lib/python3.7/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, script_version, use_auth_token, task, streaming, **config_kwargs)
811 if not config.AIOHTTP_AVAILABLE:
812 raise ImportError(
--> 813 f"To be able to use dataset streaming, you need to install dependencies like aiohttp "
814 f'using "pip install \'datasets[streaming]\'" or "pip install aiohttp" for instance'
815 )
ImportError: To be able to use dataset streaming, you need to install dependencies like aiohttp using "pip install 'datasets[streaming]'" or "pip install aiohttp" for instance
---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.
To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------
Is there a reason why Google Colab would not allow loading a dataset in streaming mode?
Otherwise, what am I missing?
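For context, this is roughly how I expected to consume the stream once load_dataset succeeds, as a minimal sketch based on the HuggingFace streaming docs (the variable sample is just for illustration):

# common_voice_train is an iterable dataset: examples are fetched lazily, one at a time
sample = next(iter(common_voice_train))
print(sample)  # a single example dict (audio path, transcript, etc.)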
[Comments]:
Ran pip for datasets, aiohttp, but got a different error: ``` File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/datasets/load.py", line 835, in load_dataset use_auth_token=use_auth_token, File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 139, in _get_extraction_protocol raise NotImplementedError(f"Extraction protocol for file at urlpath is not implemented yet") ```
Oh no! Thanks. I guess that answers it... :(
Maybe you could post how this resolved your issue, so other readers can benefit from it as well.
[Answer 1]:
Writing this up as an answer for future reference. Based on @kkgarg's comment, it seems the streaming feature is not implemented yet.
!pip install aiohttp
!pip install datasets
from datasets import load_dataset, load_metric
common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)
triggers the following error:
/usr/local/lib/python3.7/dist-packages/datasets/utils/streaming_download_manager.py in _get_extraction_protocol(self, urlpath)
137 elif path.endswith(".zip"):
138 return "zip"
--> 139 raise NotImplementedError(f"Extraction protocol for file at urlpath is not implemented yet")
140
141 def download_and_extract(self, url_or_urls):
NotImplementedError: Extraction protocol for file at https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/tr.tar.gz is not implemented yet
This indicates that streaming is not implemented or supported for this case. It may be because loading common_voice requires extracting a .tar.gz archive, which streaming does not support (?), since the streaming feature itself has certainly been implemented — it is described in the documentation...
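As a quick sanity check (just a sketch, not part of the original answer), it may help to compare the installed library version with the version of the docs you were reading, since streaming support for archived datasets keeps evolving between releases:

import datasets
print(datasets.__version__)  # streaming .tar.gz-based datasets may require a newer datasets release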
[Discussion]: