Sample exact number of rows from CSV files using Dask

Posted 2023-03-11

Asked: 2020-06-16 16:31:18

Question:

I want to use Dask to create a subsample of n rows. I tried two approaches:

1. Using frac:

import dask.dataframe as dd

read_path = ["test_data\\small1.csv", "test_data\\small2.csv", "huge.csv"]
df = dd.read_csv(read_path)
df = df.sample(frac=0.0001)  # keep roughly 0.01% of the rows
df = df.compute()

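It runs fast enough: selecting 10,000 rows from a dataset of 100 million takes 16 seconds. But it does not guarantee an exact number of rows, because frac is used and the resulting count gets rounded.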

2. Using a for loop:

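import random
from time import perf_counter

import dask.dataframe as dd

nrows = 10000
res_df = []
length = csv_loader.get_length()  # per-file lengths (helper not shown)
total_len = sum(length)
start = perf_counter()
# Global indices of the rows to skip
inds = random.sample(range(total_len), total_len - nrows - len(length))
min_bound = 0
relative_inds = []
# Convert the global indices into sorted per-file indices
for leng in length:
    relative_inds.append(
        sorted([i - min_bound for i in inds if min_bound <= i < min_bound + leng])
    )
    min_bound += leng
# Load each file, skipping the chosen rows
for ind, fil in enumerate(read_path):
    res_df.append(dd.read_csv(fil, skiprows=relative_inds[ind], sample=1000000))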

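Here I compute the indices of the rows to skip and then load from the CSV with skiprows. This approach is very slow, and it sometimes crashes if I need to read 0 rows from one of the small CSVs. But it does guarantee an exact number of rows.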
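
As a hedged illustration of the index bookkeeping described above, the sketch below sorts the global indices once and uses bisect to slice out each file's share, instead of rescanning the whole index list for every file. The function name split_indices_by_file is hypothetical and not part of the original code; it only builds the per-file lists and does not call Dask.

```python
from bisect import bisect_left
from typing import List


def split_indices_by_file(indices: List[int], lengths: List[int]) -> List[List[int]]:
    """Map global row indices to sorted, file-relative indices.

    `lengths` holds the number of lines in each file, in file order.
    """
    ordered = sorted(indices)
    result = []
    min_bound = 0
    for leng in lengths:
        # Locate the slice of indices that falls in [min_bound, min_bound + leng)
        lo = bisect_left(ordered, min_bound)
        hi = bisect_left(ordered, min_bound + leng)
        result.append([i - min_bound for i in ordered[lo:hi]])
        min_bound += leng
    return result
```

Each inner list can then be passed as skiprows for the matching file, as in the loop above. A file that receives no indices gets an empty list.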

Is there a fast way to get an exact number of rows using Dask?

Answer 1:

I found a solution:

total_len = get_total_length()  # total number of rows across all CSVs
frac = nrows / total_len

# Floating-point division can land just below the target, so nudge frac
# upward until int(total_len * frac) equals nrows.
counter = 1
while int(total_len * frac) != nrows:
    frac = nrows / (total_len - counter)
    counter += 1

res_df = dd.read_csv(read_path)
res_df = res_df.sample(frac=frac)
res_df = res_df.compute()
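
For reference, the fraction search above could be wrapped in a small helper with input checks and a bound on the number of adjustments. This is only a sketch: the name find_exact_frac and the max_steps parameter are hypothetical, and the helper checks the arithmetic int(total_len * frac) == nrows only; it does not verify the length of the frame that sample actually returns.

```python
def find_exact_frac(nrows: int, total_len: int, max_steps: int = 1000) -> float:
    """Return a fraction f such that int(total_len * f) == nrows.

    Raises ValueError for impossible inputs and RuntimeError if no
    suitable fraction is found within max_steps adjustments.
    """
    if not 0 < nrows <= total_len:
        raise ValueError("nrows must be between 1 and total_len")
    frac = nrows / total_len
    counter = 1
    while int(total_len * frac) != nrows:
        # Stop before the denominator reaches zero or the search runs away
        if counter >= min(max_steps, total_len):
            raise RuntimeError("no suitable fraction found")
        frac = nrows / (total_len - counter)
        counter += 1
    return frac
```

The returned value can be passed as frac to sample in place of a hard-coded constant.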

You can follow the next link to see how to count the rows in a CSV efficiently.
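
One possible way to count the rows, shown here as a minimal sketch and not necessarily the method behind the link, is to read each file in binary chunks and count newline bytes. The names count_lines and count_data_rows are hypothetical. The sketch assumes one record per line, so quoted fields that contain embedded newlines would be overcounted.

```python
from typing import Iterable, List


def count_lines(path: str, chunk_size: int = 1 << 20) -> int:
    """Count lines in a file by scanning fixed-size binary chunks."""
    count = 0
    last = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
            last = chunk[-1:]
    # A final line without a trailing newline still counts as a line
    if last and last != b"\n":
        count += 1
    return count


def count_data_rows(paths: Iterable[str], has_header: bool = True) -> List[int]:
    """Per-file row counts, excluding one header line per file if present."""
    header = 1 if has_header else 0
    return [max(count_lines(p) - header, 0) for p in paths]
```

Summing the list returned by count_data_rows(read_path) gives a total that could play the role of total_len in the answer above.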
