How to skip the lines of an Excel file loaded to a Pandas dataframe if data types are wrong (checking types)

Posted 2023-03-11

[Title]: How to skip the lines of an Excel file loaded to a Pandas dataframe if data types are wrong (checking types)
[Posted]: 2021-07-28 18:52:40
[Question]:

I just wrote this code:

import os
import pandas as pd

files = os.listdir(path)

#AllData = pd.DataFrame() 

for f in files:
    info = pd.read_excel(f, "File")
    info.fillna(0)
    try:
        info['Country'] = info['Country'].astype('str')
    except ValueError:
        continue
    try:
        info['Name'] = info['Name'].astype('str')
    except ValueError:
        continue
    try:
        info['Age'] = info['Age'].astype('int')
    except ValueError as error:
        continue
        
    writer = pd.ExcelWriter("Output.xlsx")
    info.to_excel(writer, "Sheet 1")
    writer.save()

It reads some Excel files, selects a sheet named "File" and puts all of its data into a dataframe. When it finishes, it returns all the records.

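What I want is to check the type of every value in each column and, if a value is not of the type I want for that column, skip that row of the source. In the end, the output should contain only the data that fits the types I want.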

I tried using astype, but it did not work as expected.

So: read the source, check with astype, and if a value does not match, skip the row and keep running the code.

[Comments]:

This code is invalid. It's not possible to assign a loop to a variable like that.

[Answer 1]:

I'll start by saying that type checking and type conversion are two different things.

Pandas' astype is for type conversion (it "casts" one type to another; it does not check whether a value is of a certain type).

However, if what you want is to drop the rows that cannot be converted to a numeric type, you can do this:

info['Age'] = pd.to_numeric(info['Age'], errors='coerce')
info = info.dropna()

Note that you don't need a try-except block here. We use to_numeric because it accepts errors='coerce': any value that cannot be converted becomes NaN, and dropna() then removes the rows that contain NaNs.
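A slightly fuller sketch of the same idea, under stated assumptions: the sample dataframe is made up to stand in for one sheet, and the names age, valid, rejected and clean are illustrative only. Unlike a bare dropna(), this version looks only at the Age column, so a missing value in another column does not remove a row, and it also rejects non-whole numbers because the question asked for int.

```python
import pandas as pd

# Hypothetical sample standing in for one sheet loaded with pd.read_excel
info = pd.DataFrame(
    {
        "Country": ["FR", "DE", "ES", "IT", "PT"],
        "Name": ["Ann", "Bob", "Cy", "Dee", "Eve"],
        "Age": [31, "unknown", "42", None, 27.5],
    }
)

# Values that cannot be parsed as numbers become NaN instead of raising
age = pd.to_numeric(info["Age"], errors="coerce")

# Keep a row only if Age parsed and is a whole number
valid = age.notna() & (age % 1 == 0)

rejected = info[~valid]          # original values, useful for a log
clean = info[valid].copy()
clean["Age"] = age[valid].astype("int64")

print(clean)
print(rejected)
```

Because rejected keeps the unconverted values, it can be written to a log or a separate sheet to show where the spreadsheet needs fixing.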

Update about type checking:

Here I'll add some of the information you asked about in the comments on how to check types in a pandas dataframe:

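- How to get the types pandas inferred for each column?
- How to check the types of all values in the whole dataframe?
- Some useful type-checking functions
- Ways of checking types in Python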

How to get the types pandas inferred for each column?

columns_dtypes = df.dtypes

It will output something like this:

Country    object
Name       object
Age         int64
dtype: object

Note that if your "Age" column contains some NaN values, its dtype may be float64.
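A small illustration of that note, using made-up series rather than the asker's data. The conversion to pandas' nullable "Int64" dtype at the end is an optional extra that the answer does not mention; it is one way to keep whole numbers alongside missing values.

```python
import pandas as pd

ages_complete = pd.Series([25, 31, 47])
ages_missing = pd.Series([25, None, 47])

print(ages_complete.dtype)  # integer dtype, no missing values
print(ages_missing.dtype)   # float dtype, because NaN is a float

# Optional: the nullable integer dtype keeps integers and marks gaps as <NA>
ages_nullable = ages_missing.astype("Int64")
print(ages_nullable)
```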

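When a column contains strings, its dtype will be object, as happens when you load an Excel file into a dataframe like in your example. See below for how to check whether an object is a Python string (type str).

Pandas documentation listing all the dtypes: https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html?highlight=basics#dtypes

Other useful information about Pandas dtypes: what are all the dtypes that pandas recognizes?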

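How to check the types of all values in the whole dataframe?

There are many ways to do this.

Here is one of them. I chose this code because it is clear and simple: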

# Iterate over all the columns
for (column_name, column_data) in info.items():
    print("column_name: ", column_name)
    print("column_data: ", column_data.values)

    # Iterate over all the values of this column
    for column_value in column_data.values:
        # print the value and its type
        print(column_value, type(column_value))
        # So here you can check the type and do something with that
        # For example, log the error to a log file
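
To make the "log the error" comment concrete, here is a minimal sketch that collects every cell whose Python type differs from what is expected and reports it through the standard logging module. The EXPECTED_TYPES mapping, the find_type_errors helper, the logger name and the sample data are all assumptions made for illustration.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("type_check")  # hypothetical logger name

# Hypothetical expectations: column name -> Python type each value should have
EXPECTED_TYPES = {"Country": str, "Name": str, "Age": int}


def find_type_errors(frame, expected_types):
    """Return (row_label, column, value, type_name) for each mismatching cell."""
    errors = []
    for column_name, expected in expected_types.items():
        for row_label, value in frame[column_name].items():
            if not isinstance(value, expected):
                errors.append(
                    (row_label, column_name, value, type(value).__name__)
                )
    return errors


# Made-up data; the mixed Age column has dtype object, as a messy sheet would
info = pd.DataFrame(
    {
        "Country": ["FR", "DE", "ES"],
        "Name": ["Ann", "Bob", "Cy"],
        "Age": [31, "unknown", 42.5],
    }
)

errors = find_type_errors(info, EXPECTED_TYPES)
for row_label, column_name, value, type_name in errors:
    logger.warning(
        "row %s, column %s: %r has type %s", row_label, column_name, value, type_name
    )
```

The row labels in the result are dataframe index labels, not Excel row numbers; with a default index and a header row, the spreadsheet row is usually the label plus 2.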

Some useful type-checking functions:

How to test whether an object (as reported by df.dtypes in the output above) is a string? isinstance(object_to_test, str). See: How to find out if a Python object is a string?
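The same test can be applied to a whole column at once. This is only a sketch with a made-up series; names, is_text and only_text are illustrative.

```python
import pandas as pd

# A column of dtype object holding a mix of strings and other values
names = pd.Series(["Ann", 17, "Cy", None], name="Name")

# True where the stored value is a Python str
is_text = names.map(lambda value: isinstance(value, str))

only_text = names[is_text]
print(only_text)
```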

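Now, if you have a column that contains strings (such as "hello", "world" and so on), some of which represent ints, and you want to check whether those strings represent a number or an int, you can use these functions:

How to check whether a string is an int?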

def str_is_int(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

How to check whether a string is a number?

def str_is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

Python strings have an isdigit() method, but it cannot be used to check for an int or a number, because it fails for values such as one = "+1" or minus_one = "-1".
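A quick side-by-side comparison makes the difference visible. str_is_int is repeated from above so the snippet runs on its own; the sample strings are arbitrary.

```python
def str_is_int(s):
    try:
        int(s)
        return True
    except ValueError:
        return False


samples = ["42", "+1", "-1", "3.5", "abc"]

# Map each sample to (isdigit result, str_is_int result)
results = {s: (s.isdigit(), str_is_int(s)) for s in samples}

for text, (digit_check, int_check) in results.items():
    print(f"{text!r}: isdigit={digit_check}, str_is_int={int_check}")
```

The two checks agree on "42", "3.5" and "abc", but isdigit() rejects the signed values that int() accepts.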

Finally, here are two common ways of checking "types" in Python:

object_to_test = 1

print(type(object_to_test) is int)
print(type(object_to_test) in (int, float))  # Check whether it is one of those types

print( isinstance(object_to_test, int) )

isinstance(object_to_test, str) returns True if object_to_test is of type str or of any subclass of str.

type(object_to_test) is str returns True only if object_to_test is exactly of type str (subclasses of str are excluded).
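The difference only shows up with subclasses. In this sketch, Label is a hypothetical str subclass defined purely for the demonstration; the bool case is included because bool is a subclass of int in Python, which matters when checking an "Age"-like column.

```python
class Label(str):
    """Hypothetical subclass of str, used only for this demonstration."""


value = Label("hello")

subclass_isinstance = isinstance(value, str)  # subclasses are accepted
subclass_exact = type(value) is str           # exact type only

bool_isinstance = isinstance(True, int)       # bool is a subclass of int
bool_exact = type(True) is int

print(subclass_isinstance, subclass_exact)
print(bool_isinstance, bool_exact)
```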

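There is also a library called pandas-stubs that may be useful for type safety: https://github.com/VirtusLab/pandas-stubs.

[Comments]:

I also need to check the types in order to write a log file, so that I know where the errors in the spreadsheet are.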
