Need to compare very large files (around 1.5 GB) in Python

【Title】: Need to compare very large files around 1.5GB in python
【Posted】: 2013-04-13 04:18:13
【Question】:
"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU@GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000@YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL@GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH@GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH@GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"

The above is sample data. The data is sorted by email address, and the file is very large, around 1.5 GB.

I want the output in another CSV file, like this:

"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1,0 days
"Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","B2C","2025",1,0 days
"DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792",1,0 days
"Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","B2C","800",1,0 days
"Rail","0000.ANU@GMAIL.COM","NR251764697526","24JUN2011","B2C","595",1,0 days
"Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","B2C","957",1,0 days
"Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011","B2C","212",1,0 days
"DF","0000PRANNOY0000@YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080",1,0 days
"Rail","0000RAHUL@GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731",1,0 days
"DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","B2C","2000",1,0 days
"DF","0001HARISH@GMAIL.COM","NF251352240086","09DEC2010","B2C","4006",1,0 days
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",2,3 days
"DF","0001HARISH@GMAIL.COM","NF252022031180","22DEC2010","B2C","3439",3,10 days
"Rail","000AYUSH@GMAIL.COM","NR2151213260036","28NOV2012","B2C","41",1,0 days
"Rail","000AYUSH@GMAIL.COM","NR2151313264432","29NOV2012","B2C","96",2,1 days
"Rail","000AYUSH@GMAIL.COM","NR2151413266728","29NOV2012","B2C","96",3,0 days
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96",4,9 days
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96",5,0 days
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96",6,4 days
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96",7,0 days
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136",8,44 days
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136",9,0 days

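That is, if an entry occurs for the first time I need to append 1, if it occurs a second time I need to append 2, and so on. In other words, I need to count the occurrences of each email address in the file, and if an email appears twice or more I want the difference between the dates. Keep in mind that the dates are not sorted, so we also have to sort them per email address.

I am looking for a Python solution, using numpy, pandas, or any other library, that can handle this kind of massive data without running out of memory. I have a dual-core processor, CentOS 6.3, and 4 GB of RAM.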

【Comments】:

Put them in a database. Sort by name, then by date.

This sounds like something that follows a Map-Reduce approach.

【Answer 1】:

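Use the built-in sqlite3 database: you can insert the data, then sort and group it as needed, and working with files larger than the available RAM is not a problem.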
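As a rough illustration of the suggestion above, here is a minimal Python 3 sketch, not the answerer's code. The table name `bookings`, the column names, and the helper functions `load_rows` and `numbered_rows` are all hypothetical. It assumes the six-column layout of the sample data, English month abbreviations, and that "difference" means days since the previous booking of the same email, which is what the desired output in the question suggests.

```python
import csv
import io
import sqlite3
from datetime import datetime


def load_rows(conn, lines):
    """Insert CSV rows into SQLite; dates are stored as ISO text so they sort chronologically."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bookings "
        "(kind TEXT, email TEXT, ref TEXT, day TEXT, channel TEXT, amount TEXT)"
    )
    rows = (
        (r[0], r[1], r[2],
         datetime.strptime(r[3], "%d%b%Y").date().isoformat(),
         r[4], r[5])
        for r in csv.reader(lines)
        if r  # skip blank lines
    )
    conn.executemany("INSERT INTO bookings VALUES (?, ?, ?, ?, ?, ?)", rows)
    # An index on the sort key lets SQLite return rows in order without a full sort.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_email_day ON bookings (email, day)")
    conn.commit()


def numbered_rows(conn):
    """Yield each row plus its per-email occurrence number and days since the previous row."""
    prev_email, prev_day, count = None, None, 0
    cursor = conn.execute(
        "SELECT kind, email, ref, day, channel, amount "
        "FROM bookings ORDER BY email, day"
    )
    for kind, email, ref, day, channel, amount in cursor:
        current = datetime.strptime(day, "%Y-%m-%d").date()
        if email != prev_email:
            count, gap = 1, 0          # first occurrence of this email
        else:
            count += 1
            gap = (current - prev_day).days
        prev_email, prev_day = email, current
        yield (kind, email, ref, day, channel, amount, count, gap)


if __name__ == "__main__":
    sample = io.StringIO(
        '"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"\n'
        '"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"\n'
        '"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"\n'
    )
    conn = sqlite3.connect(":memory:")  # use a file path for the real 1.5 GB input
    load_rows(conn, sample)
    for row in numbered_rows(conn):
        print(row)
```

For the real file one would pass an open file object instead of the `StringIO` sample and connect to an on-disk database, so that only the cursor's current row needs to be held in Python at any time.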

【Answer 2】:

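Make sure you have pandas 0.11, and read these docs: http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables, as well as these recipes: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore (especially "merging on millions of rows").

Here is a solution that seems to work. This is the workflow:

    1. Read the data from the csv in chunks and append it to an HDFStore.
    2. Iterate over the store, creating another store that does the combining.

Essentially, we take a chunk from the table and combine it with a chunk from every other part of the file. The combiner function does not reduce; instead it calculates your function (the difference in days) between all elements in that chunk, eliminating duplicates as it goes and keeping the latest data after each loop. It is almost like a recursive reduce.

This should be O(num_of_chunks**2) in memory and computation time. In your case the chunksize could be, say, 1m (or more).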

processing [0] [datastore.h5]
processing [1] [datastore_0.h5]
    count                date  diff                        email
4       1 2011-06-24 00:00:00     0           0000.ANU@GMAIL.COM
1       1 2011-06-24 00:00:00     0          00000.POO@GMAIL.COM
0       1 2010-07-26 00:00:00     0           00000000@11111.COM
2       1 2013-01-01 00:00:00     0         0000650000@YAHOO.COM
3       1 2013-01-26 00:00:00     0       00009.GAURAV@GMAIL.COM
5       1 2011-10-29 00:00:00     0          0000MANNU@GMAIL.COM
6       1 2011-11-21 00:00:00     0    0000PRANNOY0000@GMAIL.COM
7       1 2011-06-26 00:00:00     0  0000PRANNOY0000@YAHOO.CO.IN
8       1 2012-10-25 00:00:00     0          0000RAHUL@GMAIL.COM
9       1 2011-05-10 00:00:00     0            0000SS0@GMAIL.COM
12      1 2010-12-09 00:00:00     0         0001HARISH@GMAIL.COM
11      2 2010-12-12 00:00:00     3         0001HARISH@GMAIL.COM
10      3 2010-12-22 00:00:00    13         0001HARISH@GMAIL.COM
14      1 2012-11-28 00:00:00     0           000AYUSH@GMAIL.COM
15      2 2012-11-29 00:00:00     1           000AYUSH@GMAIL.COM
17      3 2012-12-08 00:00:00    10           000AYUSH@GMAIL.COM
18      4 2012-12-12 00:00:00    14           000AYUSH@GMAIL.COM
13      5 2013-01-25 00:00:00    58           000AYUSH@GMAIL.COM
import pandas as pd
import StringIO
import numpy as np
from time import strptime
from datetime import datetime

# your data
data = """
"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU@GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000@YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL@GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH@GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH@GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"
"""


# read in and create the store
data_store_file = 'datastore.h5'
store = pd.HDFStore(data_store_file,'w')

def dp(x, **kwargs):
    return [ datetime(*strptime(v,'%d%b%Y')[0:3]) for v in x ]

chunksize=5
reader = pd.read_csv(StringIO.StringIO(data),names=['x1','email','x2','date','x3','x4'],
                     header=0,usecols=['email','date'],parse_dates=['date'],
                     date_parser=dp, chunksize=chunksize)

for i, chunk in enumerate(reader):
    chunk['indexer'] = chunk.index + i*chunksize

    # create the global index, and keep it in the frame too
    df = chunk.set_index('indexer')

    # need to set a minimum size for the email column
    store.append('data',df,min_itemsize={'email' : 100})

store.close()

# define the combiner function
def combiner(x):

    # given a group of emails (the same), return a combination
    # with the new data

    # sort by the date
    y = x.sort('date')

    # calc the diff in days (an integer)
    y['diff'] = (y['date']-y['date'].iloc[0]).apply(lambda d: float(d.item().days))
    y['count'] = pd.Series(range(1,len(y)+1),index=y.index,dtype='float64')  
    
    return y

# reduce the store (and create a new one by chunks)
in_store_file = data_store_file
in_store1 = pd.HDFStore(in_store_file)

# iter on the store 1
for chunki, df1 in enumerate(in_store1.select('data',chunksize=2*chunksize)):
    print "processing [%s] [%s]" % (chunki,in_store_file)

    out_store_file = 'datastore_%s.h5' % chunki
    out_store = pd.HDFStore(out_store_file,'w')

    # iter on store 2
    in_store2 = pd.HDFStore(in_store_file)
    for df2 in in_store2.select('data',chunksize=chunksize):

        # concat & drop dups
        df = pd.concat([df1,df2]).drop_duplicates(['email','date'])

        # group and combine
        result = df.groupby('email').apply(combiner)
            
        # remove the mi (that we created in the groupby)
        result = result.reset_index('email',drop=True)
            
        # only store those rows which are in df2!
        result = result.reindex(index=df2.index).dropna()

        # store to the out_store
        out_store.append('data',result,min_itemsize={'email' : 100})
    in_store2.close()
    out_store.close()
    in_store_file = out_store_file

in_store1.close()

# show the reduced store
print pd.read_hdf(out_store_file,'data').sort(['email','diff'])

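【Comments】:

My first thought was also to put the data in a database, but that does not solve how I keep track of the first occurrence of each email address, count the occurrences, and get the date differences.

@Jeff my file has more than 20 million rows. If I start comparing, the unique values alone will be more than 5 million rows, so I would get 5 million * 20 million complexity, which could take months to finish. How can I reduce the complexity from n*n so that I can handle such a large volume?

So the first time an email appears, is its date the reference date for the subsequent occurrences? For the second email it is simple: the count is 1 and the days is the number of days. What about the third email? Are the days updated to the difference between the 3rd and the 1st, or are the earlier days somehow involved (maybe the max of the current days and the 3rd minus the reference date)?

【Answer 3】: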

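Another possible (sysadmin) way, which avoids a database and SQL queries, as well as heavy demands on runtime processes and hardware resources.

Update 20/04: added more code and a simplified approach:

    1. Convert the timestamp to seconds (since the Epoch) and use UNIX sort on the email and this new field (i.e. sort -k2 -k4 -n -t, < converted_input_file > output_file).
    2. Initialize 3 variables: EMAIL, PREV_TIME and COUNT.
    3. Iterate over each line. If a new email is encountered, append "1,0 days". Update PREV_TIME=timestamp, COUNT=1, EMAIL=new_email.
    4. Next line: 3 possible cases
       a) Same email, different timestamp: calculate the days, increment COUNT, update PREV_TIME, append "COUNT,difference_in_days".
       b) Same email, same timestamp: increment COUNT, append "COUNT,0 days".
       c) New email: start from step 3.

An alternative to step 1 is to add a new TIMESTAMP field and remove it when printing out the line.

Note: if 1.5GB is too large to sort in one go, split it into smaller chunks, using the email as the split point. You can run these chunks in parallel on different machines.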
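Splitting on the email only works when all rows for one email are adjacent. Under that same assumption (the question states the file is already sorted by email), a streaming pass in Python could sort the dates one email group at a time and avoid a global sort altogether. The following Python 3 sketch shows that alternative; it is not the method of this answer, the function name `annotate_sorted_by_email` is hypothetical, and it assumes English month abbreviations and that the difference wanted is the number of days since the previous booking of the same email.

```python
import csv
import io
from datetime import datetime
from itertools import groupby
from operator import itemgetter


def annotate_sorted_by_email(rows):
    """Yield each row with its occurrence number and the days since the previous booking.

    `rows` must be lists of fields in which rows sharing an email (field 2) are adjacent.
    """
    for _email, group in groupby(rows, key=itemgetter(1)):
        # Only the rows of a single email are held in memory at a time.
        dated = sorted(
            ((datetime.strptime(r[3], "%d%b%Y"), r) for r in group),
            key=itemgetter(0),
        )
        prev = None
        for count, (when, row) in enumerate(dated, start=1):
            days = 0 if prev is None else (when - prev).days
            prev = when
            yield row + [count, "%d days" % days]


if __name__ == "__main__":
    sample = io.StringIO(
        '"DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"\n'
        '"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"\n'
        '"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"\n'
        '"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"\n'
    )
    for out_row in annotate_sorted_by_email(csv.reader(sample)):
        print(out_row)
```

Memory use is bounded by the largest single-email group rather than by the file size, and the whole job is one pass over the input.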

/usr/bin/gawk -F'","' '{
    split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ");
    for (i=1; i<=12; i++) mdigit[month[i]]=i;
    print $0 "," mktime(substr($4,6,4) " " mdigit[substr($4,3,3)] " " substr($4,1,2) " 00 00 00")
}' < input.txt | /usr/bin/sort -k2 -k7 -n -t, > output_file.txt

output_file.txt:

"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1280102400
"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439",1291852800
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",1292112000
"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006",1292976000
...

You then pipe the output to a Perl, Python or AWK script to handle steps 2 to 4.
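Steps 2 to 4 could look roughly like the Python 3 sketch below. It is an illustration under stated assumptions, not code from the answer: the function name `annotate` is hypothetical, the email is taken to be the second comma-separated field and the epoch timestamp the last one (as in output_file.txt above), and no field is assumed to contain an embedded comma.

```python
import sys

SECONDS_PER_DAY = 60 * 60 * 24


def annotate(lines):
    """Append ',COUNT,N days' to each line of the sorted, timestamped input."""
    # Step 2: the three state variables.
    email = None
    prev_time = 0
    count = 0
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        fields = line.split(",")  # naive split: assumes no commas inside fields
        cur_email, cur_time = fields[1], int(fields[-1])
        if cur_email != email:
            # Step 3: a new email starts a new count.
            email, prev_time, count = cur_email, cur_time, 1
            days = 0
        else:
            # Step 4: same email; an identical timestamp simply yields 0 days.
            count += 1
            # round() absorbs a 23- or 25-hour day around a DST change.
            days = round((cur_time - prev_time) / SECONDS_PER_DAY)
            prev_time = cur_time
        yield "%s,%d,%d days" % (line, count, days)


if __name__ == "__main__":
    for out in annotate(sys.stdin):
        print(out)
```

Saved under a name of your choice, the script would read the sorted file on standard input and write the annotated lines to standard output, keeping only the three state variables in memory.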

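【Comments】:

This gawk script performs the COUNT and DAYS calculation on my sorted output.

It worked, thanks mate :)! I have applied the split and only [root@amanka Desktop]# gawk 'BEGIN OFS = ","; COUNT = 0; PREV_TIME = 0; EMAIL = 0; while(( getline line 0 ) split(line, a , ",") if (EMAIL != a[2]) EMAIL = a[2]; COUNT = 1; PREV_TIME = a[7]; print line, "1,0 days" else if (PREV_TIME == a[7]) COUNT = COUNT + 1; print line, COUNT, "0 days"; else DAYS = ((a[7] - PREV_TIME)/(60*60*24)); PREV_TIME = a[7]; COUNT = COUNT + 1; print line, COUNT, DAYS " days"; '

You're welcome. I'm curious to know: 1) how much memory did gawk+sort use? 2) how long did it take to process the 1.5GB file?

I'm not sure about the memory, but it took about 10-12 minutes, and sorting the output took about 15 minutes. It is incredibly fast compared with any other language solution I could think of: even a single O(n) parse of the file in Python took about 35 minutes, and the shell script cut that in half.

I have more questions about the same file, and it would be great if you could help me. You can find the question at ***.com/questions/16186224/…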
