Detect duplicate data on csv

Posted: 2021-10-28 22:29:54

Question:

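Using Node.js with the fast-csv package, I currently have this parsing function. It reads a CSV file, remaps the headers, iterates over each row, and fires events based on the row data.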

validateRows: (filePath, payload, validators) => new Promise((resolve, reject) => {
        const invalidRecords = [];
        const validRecords = [];

        fs.createReadStream(filePath)
            .pipe(csv.parse({
                headers: (headers) => mapHeaderToRelated(headers, payload), delimiter: ";", discardUnmappedColumns: true
            }))
            .validate((data, cb) => {
                // Run every validator against the row and collect all error messages
                const errors = validators.reduce((err, func) => [...err, ...func(data)], []);

                if (errors.length > 0) {
                    return cb(null, false, errors);
                }

                return cb(null, true);
            })
            .on("error", (error) => {
                console.log("There is some error");
                reject(error);
            })
            .on("data", (row) => {
                validRecords.push(row);
            })
            .on("data-invalid", (row, rowNumber, reason) => {
                invalidRecords.push({
                    data: row,
                    rowNumber: rowNumber,
                    reason: reason
                });
            })
            .on("end", (rowCount) => {
                console.log(`Parsed ${rowCount} rows. Valid Count: ${validRecords.length} Invalid Count: ${invalidRecords.length}`);

                resolve({
                    invalidRecords,
                    validRecords
                });
            });
    }),

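I need to detect records whose phone number appears more than once. If there are duplicates, for example several rows with the same phone number, all of those rows should be treated as invalid and pushed to the invalid records array.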

Example CSV:

| name   | surname | gender  | phone  | 
| ------ | ------- | -------- | -----  |
| John   | Doe     | Male     | 123456 |
| Joh    | Deo     | Unknown  | 123456 |
| Jane   | Doe     | Female   | 999999 |

The output I want after parsing the CSV:


{
  validRecords: [
    {
      name: "Jane",
      surname: "Doe",
      gender: "Female",
      phone: "999999"
    }
  ],
  invalidRecords: [
    {
      data: {
        name: "John",
        surname: "Doe",
        gender: "Male",
        phone: "123456"
      },
      rowNumber: 1,
      reason: ["Duplicate data"]
    },
    {
      data: {
        name: "Joh",
        surname: "Deo",
        gender: "Unknown",
        phone: "123456"
      },
      rowNumber: 2,
      reason: ["Duplicate data"]
    }
  ]
}

How can I solve this?

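Comments:

Please post a minimal reproducible example of your attempt, documenting the input and the expected output, using the [<>] snippet editor.

Sorry, I couldn't come up with a minimal example.

Answer 1: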

I extended my on("end") handler with the following code and a few helper functions. It works now.

.on("end", (rowCount) => {
                console.log(`Parsed ${rowCount} rows. Valid Count: ${validCustomers.length} Invalid Count: ${invalidCustomers.length}`);

                const allCustomers = [...invalidCustomers, ...validCustomers];

                const duplicateNumbers = findDuplicatePhoneNumbers(allCustomers);

                flagDuplicateCustomers(allCustomers, duplicateNumbers);

                // Valid but duplicate customers are pushed to the invalid customers and reason set to "Duplicate"
                const validButDuplicateCustomers = getDuplicateCustomers(validCustomers);
                validButDuplicateCustomers.forEach((c) => {
                    invalidCustomers.push({
                        data: c,
                        reason: ["Duplicate"]
                    });
                });

                // Add reason "Duplicate" for Invalid and Duplicate customers
                const invalidAndDuplicateCustomers = getDuplicateCustomers(invalidCustomers);
                invalidAndDuplicateCustomers.forEach((c) => {
                    if (c.reason) {
                        c.reason = [...c.reason, "Duplicate"];
                    }
                });

                const validAndNotDuplicate = getNonDuplicateCustomers(validCustomers);

                resolve({
                    invalidCustomers: invalidCustomers,
                    validCustomers: validAndNotDuplicate
                });
            });
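A duplicate can only be recognized once every row has been read, which is why the check sits in the "end" handler rather than in validate. The same post-processing can be sketched as a pure function that does not mutate the rows with an isDuplicate flag. This is only a sketch under assumptions: partitionByDuplicates is a hypothetical name, invalid entries are assumed to wrap the row under data (as in the question's "data-invalid" handler), and the "Invalid gender" reason is made up for illustration.

```javascript
// Hypothetical helper (not part of fast-csv): moves valid rows whose `key` value
// is listed in `duplicateValues` into the invalid list, and appends a "Duplicate"
// reason to entries that were already invalid.
// Assumes invalid entries look like { data, rowNumber, reason }.
function partitionByDuplicates(validRecords, invalidRecords, duplicateValues, key) {
  const dupes = new Set(duplicateValues);

  const stillValid = [];
  const newlyInvalid = [];
  for (const row of validRecords) {
    if (dupes.has(row[key])) {
      newlyInvalid.push({ data: row, reason: ["Duplicate"] });
    } else {
      stillValid.push(row);
    }
  }

  // Copy each affected entry instead of mutating it in place
  const updatedInvalid = invalidRecords.map((entry) =>
    dupes.has(entry.data[key])
      ? { ...entry, reason: [...(entry.reason || []), "Duplicate"] }
      : entry
  );

  return {
    validRecords: stillValid,
    invalidRecords: [...updatedInvalid, ...newlyInvalid],
  };
}

// Usage with made-up rows; "Invalid gender" is an illustrative reason only
const result = partitionByDuplicates(
  [
    { name: "John", customer_phone: "123456" },
    { name: "Jane", customer_phone: "999999" },
  ],
  [{ data: { name: "Joh", customer_phone: "123456" }, rowNumber: 2, reason: ["Invalid gender"] }],
  ["123456"],
  "customer_phone"
);
console.log(JSON.stringify(result, null, 2));
```

Returning new arrays keeps the inputs untouched, so the function can be called from the "end" handler just before resolve without side effects on the collected rows.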

The helper methods are:

const getDuplicateCustomers = (customers) => customers.filter((customer) => customer.isDuplicate);

const getNonDuplicateCustomers = (records) => records.filter((record) => !record.isDuplicate);

const findDuplicatePhoneNumbers = (customers) => {
    const duplicates = [];

    // Sort a copy so the caller's array is left untouched. Comparing as strings
    // keeps equal numbers adjacent even when they contain non-digit characters.
    const sortedCustomers = [...customers].sort((a, b) =>
        String(a.customer_phone).localeCompare(String(b.customer_phone)));

    // After sorting, identical phone numbers sit next to each other
    sortedCustomers.forEach((customer, index, array) => {
        const nextCustomer = array[index + 1];

        if (!nextCustomer) {
            return;
        }

        if (customer.customer_phone === nextCustomer.customer_phone) {
            duplicates.push(customer);
        }
    });

    const duplicatePhoneNumbers = duplicates.map((customer) => customer.customer_phone);
    const uniqueDuplicatePhoneNumbers = [...new Set(duplicatePhoneNumbers)];

    return uniqueDuplicatePhoneNumbers;
};

const flagDuplicateCustomers = (customers, duplicateNumbers) => {
    // Nothing to flag when no duplicate numbers were found
    if (!duplicateNumbers || duplicateNumbers.length === 0) {
        return;
    }

    const duplicateCustomers = customers.filter((customer) => duplicateNumbers.includes(customer.customer_phone));

    duplicateCustomers.forEach((customer) => {
        customer.isDuplicate = true;
    });
};
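As an alternative to sorting and comparing neighbours, the duplicate phone numbers can also be collected in a single pass by counting occurrences in a Map. The following is a minimal sketch, not the answer's method: findDuplicateValues is a hypothetical name, and the sample rows are made up to mirror the question's example with the customer_phone field used above.

```javascript
// Hypothetical helper: returns every value of `key` that occurs in more than one row.
function findDuplicateValues(rows, key) {
  const counts = new Map();
  for (const row of rows) {
    const value = row[key];
    // Rows without a value for `key` are skipped rather than counted as duplicates
    if (value === undefined || value === null || value === "") continue;
    counts.set(value, (counts.get(value) || 0) + 1);
  }
  // Keep only the values seen at least twice
  return [...counts].filter(([, count]) => count > 1).map(([value]) => value);
}

// Usage with made-up rows
const sampleRows = [
  { name: "John", customer_phone: "123456" },
  { name: "Joh", customer_phone: "123456" },
  { name: "Jane", customer_phone: "999999" },
];
const duplicatePhones = findDuplicateValues(sampleRows, "customer_phone");
console.log(duplicatePhones); // [ '123456' ]
```

Counting avoids the sort and does not depend on a comparator, at the cost of holding one Map entry per distinct phone number.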
