Filtering CSV data using Python/numpy


【Title】: Filtering CSV data using Python/numpy 【Posted】: 2014-01-27 11:50:20 【Question】:

I am working with a CSV file.

            id     gender       disease       read      write    science 
  1.        11       male      cancer, diabetes 34         46         39  
  2.        20       male      diabetes         60         52         61  
  3.        12       male      diabetes         37         44         39  
  4.        16       male      cancer           47         31         36  
  5.         7       male      diabetes         57         54         47  
  6.        21       male      diabetes         44         44         50  
  7.        15       male      diabetes         39         39         26  
  8.        22       male      diabetes         42         39         56  
  9.         9       male      cancer           48         49         44  
 10.        18       male      diabetes         50         33         44  
 11.         5       male      diabetes         47         40          .  
 12.        14       male      diabetes         47         41         42  
 13.         3       male      diabetes         63         65         63  
 14.        24       male         fever         52         62         47  
 15.         8     female      diabetes         39         44         44  
 16.         1     female      cancer           34         44         39  
 17.         4     female      diabetes         44         50         39  
 18.         2     female      diabetes         39         41         42  
 19.        19     female      cancer           28         46         44  
 20.        17     female      diabetes         47         57         44  
 21.         6     female      diabetes         47         41         40  
 22.        10     female      diabetes         47         54         53  
 23.        13     female      diabetes         47         46         47  
 24.        23     female      diabetes         65         65         58  
 25.        25     female    Breast cancer         47         44         42  

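I want to get every row where the person has cancer. Some people have both diabetes and cancer, so those rows must be matched as well. The result should be: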

1.         11       male      cancer, diabetes 34         46         39  
4.         16       male      cancer           47         31         36
9.         9       male      cancer           48         49         44  
19.        19     female      cancer           28         46         44 
25.        25     female    Breast cancer         47         44         42


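from os.path import dirname, join  # needed for join() and dirname() below

import pandas as pd
import numpy as np

# read_csv already returns a DataFrame, so the from_records call is redundant
ppl_ve_cancer = pd.read_csv(join(dirname(__file__), 'data.csv'))
delta = pd.DataFrame.from_records(ppl_ve_cancer)
disease = delta['disease']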

Now, how do I filter the disease column, and once it is filtered, how do I get the rest of the data (id, gender, read, write, science) for the matching rows?

【Answer 1】:

Here is a more pandas-centric approach: read all of the data into a DataFrame, create a has cancer column, then filter on it:

from io import StringIO  # Python 3; the original used the Python 2 StringIO module
import pandas

datastring = StringIO("""\
id,gender,disease,read,write,science
11,male,"cancer,diabetes",34,46,39
20,male,diabetes,60,52,61
12,male,diabetes,37,44,39
16,male,cancer,47,31,36
7,male,diabetes,57,54,47
21,male,diabetes,44,44,50
15,male,diabetes,39,39,26
22,male,diabetes,42,39,56
9,male,cancer,48,49,44
18,male,diabetes,50,33,44
5,male,diabetes,47,40,-999
14,male,diabetes,47,41,42
3,male,diabetes,63,65,63
24,male,fever,52,62,47
8,female,diabetes,39,44,44
1,female,cancer,34,44,39
4,female,diabetes,44,50,39
2,female,diabetes,39,41,42
19,female,cancer,28,46,44
17,female,diabetes,47,57,44
6,female,diabetes,47,41,40
10,female,diabetes,47,54,53
13,female,diabetes,47,46,47
23,female,diabetes,65,65,58
25,female,"Breast cancer",47,44,42
""")

df = pandas.read_csv(datastring, na_values=-999)

# create the `has cancer` column
df['has cancer'] = df.disease.apply(lambda row: 'cancer' in row)

# print the filtered data
print(df[df['has cancer']].to_string())


    id  gender          disease  read  write  science has cancer
0   11    male  cancer,diabetes    34     46       39       True
3   16    male           cancer    47     31       36       True
8    9    male           cancer    48     49       44       True
15   1  female           cancer    34     44       39       True
18  19  female           cancer    28     46       44       True
24  25  female    Breast cancer    47     44       42       True
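The question also asks how to pull out only certain columns for the matching rows. A minimal sketch, assuming a small inline subset of the question's data (the names sample, mask, and wanted are illustrative only): build the boolean mask without storing it as a column, then pass it to .loc together with a list of column labels.

```python
from io import StringIO

import pandas as pd

# Small inline sample (a subset of the question's rows), for illustration only.
sample = StringIO("""\
id,gender,disease,read,write,science
11,male,"cancer,diabetes",34,46,39
20,male,diabetes,60,52,61
16,male,cancer,47,31,36
25,female,"Breast cancer",47,44,42
""")
df = pd.read_csv(sample)

# Boolean mask: True where the disease text mentions "cancer".
mask = df['disease'].apply(lambda text: 'cancer' in text)

# .loc accepts the row mask and a list of column labels in one step.
wanted = df.loc[mask, ['id', 'gender', 'read', 'write', 'science']]
print(wanted.to_string(index=False))
```

Keeping the mask in a variable instead of a new column leaves the original DataFrame unchanged.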

【Answer 2】:

This answer covers exactly what you need. All it takes is df[df['A'].str.contains("hello")]:

from os.path import dirname, join

import pandas as pd

# read_csv already returns a DataFrame, so no from_records conversion is needed
delta = pd.read_csv(join(dirname(__file__), 'data.csv'))
# boolean mask: True for rows whose disease text contains "cancer"
query = delta['disease'].str.contains('cancer')
delta_filtered = delta[query]
print(delta_filtered)
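Two details may matter on real data: capitalisation ("Cancer" vs "cancer") and missing values in the disease column. The sketch below uses made-up rows (not the question's file) to show the case, na, and regex parameters of str.contains, assuming you want missing values treated as non-matches.

```python
import pandas as pd

# Hypothetical rows: one capitalised value and one missing value.
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'disease': ['Cancer', 'diabetes', None, 'Breast cancer'],
})

# case=False ignores capitalisation, na=False counts missing values as
# non-matches, and regex=False treats the pattern as a literal substring.
mask = df['disease'].str.contains('cancer', case=False, na=False, regex=False)
filtered = df[mask]
print(filtered)
```

Without na=False, a missing value in the column produces a missing entry in the mask, which cannot be used directly for boolean indexing.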

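【Answer 3】:

This reads your CSV file, keeps only the lines that contain "cancer", and unpacks each one into variables that you can use right away or store for later.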

with open("input.csv") as infile:
    for line in infile:
        if "cancer" in line:  # keep only lines that mention cancer
            line = line.rstrip("\n")  # strip the trailing newline
            # split the line on tabs and unpack the fields for later use
            pid, gender, disease, read, write, science = line.split('\t')
            print(pid, gender, disease, read, write, science)

Input:

id  gender  disease          read    write   science
11  male    cancer, diabetes 34  46  39
20  male    diabetes     60  52  61
12  male    diabetes     37  44  39
16  male    cancer           47  31  36

Output:

11 male cancer, diabetes 34 46 39
16 male cancer           47 31 36
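The loop above assumes tab-separated input. If the file is comma-separated with quoted fields instead (such as "cancer,diabetes"), splitting the line by hand would break on the embedded comma. A possible variant, sketched with the standard csv module and an inline stand-in for the file (raw and matches are illustrative names), tests only the disease field:

```python
import csv
from io import StringIO

# Inline stand-in for a comma-separated file; the quoted field contains a comma.
raw = StringIO('''id,gender,disease,read,write,science
11,male,"cancer,diabetes",34,46,39
20,male,diabetes,60,52,61
16,male,cancer,47,31,36
''')

# DictReader uses the header row as keys and respects the quoting.
matches = [row for row in csv.DictReader(raw) if 'cancer' in row['disease']]

for row in matches:
    print(row['id'], row['gender'], row['disease'],
          row['read'], row['write'], row['science'])
```

Each match is a dict of strings, so numeric fields such as read still need an int() conversion before arithmetic.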

【Comments】:

Thanks! I need objects as the result so that I can easily run other algorithms on them later. I tried results = [t for t in delta if t["disease"] == 'cancer'] but it did not work.
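A likely reason the list comprehension in the comment fails: iterating over a DataFrame yields its column labels (strings), not its rows, so t["disease"] indexes a string. An exact == 'cancer' test would also miss combined values such as "cancer,diabetes". One way to get row objects, sketched with a small stand-in for delta, is to filter first and then convert the remaining rows to dicts:

```python
import pandas as pd

# Stand-in for the commenter's `delta` DataFrame (three of the question's rows).
delta = pd.DataFrame({
    'id': [11, 20, 16],
    'gender': ['male', 'male', 'male'],
    'disease': ['cancer,diabetes', 'diabetes', 'cancer'],
})

# Iterating a DataFrame yields column labels, not rows.
print(list(delta))  # ['id', 'gender', 'disease']

# Filter with a substring match, then turn each remaining row into a dict.
mask = delta['disease'].str.contains('cancer')
results = delta[mask].to_dict('records')

for person in results:
    print(person['id'], person['disease'])
```

delta[mask].itertuples(index=False) is an alternative if attribute access (person.id) is preferred over dict keys.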
