Need to strip CSV Column Number Data of Letters - Pandas

Posted 2023-03-12

【Title】: Need to strip CSV Column Number Data of Letters - Pandas 【Posted】: 2021-07-30 04:52:15 【Question】:

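I'm working with a .csv that has columns in which the numeric data contains letters. I want to strip the letters so that the column can be a float or an integer.

I have tried the following:

Using a loop and a defined procedure to strip the string data from the object columns, here the 'MPG' column, leaving only the numeric values.

It should print the names of the columns that have at least one entry ending in the characters 'mpg'.

Code in a Jupyter notebook cell:

Step 1: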

MPG_cols = []
for colname in df.columns[df.dtypes == 'object']:  
    if df[colname].str.endswith('mpg').any(): 
        MPG_cols.append(colname)
print(MPG_cols)

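I use .str so that I can apply element-wise string methods, and I only want to consider the string columns.

This gave me the output:

['Power']  # so far so good

Step 2: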

#define the value to be removed using loop

def remove_mpg(pow_val):
    """For each value, take the number before the 'mpg'
    unless it is not a string value. This will only happen
    for NaNs so in that case we just return NaN.
    """
    if isinstance(pow_val, str):
        i=pow_val.replace('mpg', '') 
        return float(pow_val.split(' ')[0]) 
    else:
                    return np.nan

    position_cols = ['Vehicle_type'] 

for colname in MPG_cols:
    df[colname] = df[colname].apply(remove_mpg)

df[MPG_cols].head()

The error I get:

ValueError                                Traceback (most recent call last)
<ipython-input-37-45b7f6d40dea> in <module>
     15 
     16 for colname in MPG_cols:
---> 17     df[colname] = df[colname].apply(remove_mpg)
     18 
     19 df[MPG_cols].head()

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   3846             else:
   3847                 values = self.astype(object).values
-> 3848                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   3849 
   3850         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-37-45b7f6d40dea> in remove_mpg(pow_val)
      8     if isinstance(pow_val, str):
      9         i=pow_val.replace('mpg', '')
---> 10         return float(pow_val.split(' ')[0])
     11     else:
     12                     return np.nan

ValueError: could not convert string to float: 'null'

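I applied similar code to a different column and it worked there, but not here.

Any guidance would be greatly appreciated.

Best,

【Comments】:

Could you include a sample of the DataFrame in your question? That would help us reproduce the problem you are running into. 【Answer 1】:

This will work,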

import pandas as pd
pd.to_numeric(pd.Series(['$2', '3#', '1mpg']).str.replace('[^0-9]', '', regex=True))

0    2
1    3
2    1
dtype: int64

Complete solution,

for i in range(df.shape[1]):
    if(df.iloc[:,i].dtype == 'object'):
        df.iloc[:,i] = pd.to_numeric(df.iloc[:,i].str.replace('[^0-9]', '', regex=True))
df.dtypes

Selecting columns that should not be changed

for i in range(df.shape[1]):
    # 'colA', 'colB' are columns which should remain same.
    if df.iloc[:, i].dtype == 'object' and df.columns[i] not in ['colA', 'colB']:
        df.iloc[:,i] = pd.to_numeric(df.iloc[:,i].str.replace('[^0-9]', '', regex=True))
df.dtypes
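
One caveat with the pattern '[^0-9]' is that it also removes decimal points and minus signs, so '34.6mpg' would become 346. Below is a minimal sketch of a variant that keeps the first decimal number in each entry, leaves named columns alone, and works on a copy so the original frame is preserved. The function name strip_units, its keep parameter, and the sample column names and values are all made up for illustration.

```python
import pandas as pd

def strip_units(df, keep=()):
    """Return a copy of df with unit text stripped from object columns.

    Columns listed in `keep` (a hypothetical parameter) are left untouched.
    """
    out = df.copy()  # never overwrite the original data
    for col in out.columns:
        if out[col].dtype == object and col not in keep:
            # Pull out the first number, keeping sign and decimal point.
            number = out[col].str.extract(r'(-?\d+(?:\.\d+)?)', expand=False)
            # Entries with no number at all (e.g. 'null') become NaN.
            out[col] = pd.to_numeric(number, errors='coerce')
    return out

# Made-up sample data, for illustration only.
cars = pd.DataFrame({
    'Model': ['Civic LX', 'Golf'],
    'Power': ['34.6mpg', 'null'],
    'Price': ['$2', '3#'],
}, dtype=object)

clean = strip_units(cars, keep=['Model'])
print(clean)
print(clean.dtypes)
```

Because the function returns a new frame, the model names in cars stay available even if the cleaned result turns out to be wrong.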

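【Comments】:

Wow, thank you! OK, the range solution did the job on MPG, MPH and the dollar signs in the currency column, but it worked a little too well. I have a column of object data that holds the car model names, and the code overwrote that data. Is there a way to restore just that one column, or do I need to re-import that column from the original .csv?

If it is only one column, you can use if( (df.iloc[:,i].dtype == 'object') & (df.columns[i] != 'Column') ); for several columns, use not in []. I'll update the answer; let me know whether it solves the problem.

下午蟑螂, thank you so much for the reply. I'm only a beginner programmer, so I'm a bit confused. Are you saying I should use the code snippet you provided to restore the lost column of data? Or are you saying I should run your snippet instead of the one I ran, so that the column is excluded? My current need is to restore the lost column of data, to get the make and model names back.

You have to rerun the code and load the data back into Python. Always keep the original data. I think you only need to reload the data; once it has been overwritten, the information is gone. You can always create a new DataFrame and only replace or overwrite the original once you have the result you want. I hope that helps.

【Answer 2】:

I think you need to revisit the logic of the function remove_mpg. One way to adjust it is as follows: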

import re
import numpy as np
def get_me_float(pow_val):
    my_numbers = re.findall(r"(\d+.*\d+)mpg", pow_val)
    if len(my_numbers) > 0 :
        return float(my_numbers[0])
    else:
        return np.nan

For example, to test the function:

my_pow_val=['34mpg','34.6mpg','0mpg','mpg','anything']
for each_pow in my_pow_val:
    print(get_me_float(each_pow))

Output:

34.0
34.6
nan
nan
nan
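
Note that the pattern above needs at least two digits, which is why '0mpg' comes back as nan, and that re.findall raises a TypeError when it is handed a non-string such as NaN. The sketch below is one possible adjustment, not the answer's own code: it accepts single-digit values, tolerates missing values, and treats placeholder strings such as 'null' (the value in the question's traceback) as NaN. The names mpg_to_float and MPG_PATTERN are made up.

```python
import math
import re

# One number (optional decimals), optional spaces, then the 'mpg' suffix.
MPG_PATTERN = re.compile(r'(\d+(?:\.\d+)?)\s*mpg')

def mpg_to_float(value):
    """Return the number in front of 'mpg', or NaN when there is none."""
    if not isinstance(value, str):
        return math.nan  # real missing values (NaN/None) stay missing
    match = MPG_PATTERN.search(value)
    return float(match.group(1)) if match else math.nan

for raw in ['34mpg', '34.6 mpg', '0mpg', 'mpg', 'null', None]:
    print(repr(raw), '->', mpg_to_float(raw))
```

It could then be applied in the same way as in the question, for example df[colname].apply(mpg_to_float) for each name in MPG_cols.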

【Comments】:

【Answer 3】:

Why not use the converters parameter of the read_csv function to strip the extra characters while loading the csv file?

def strip_mpg(s):
    return float(s.rstrip(' mpg'))

df = read_csv(..., converters={'Power': strip_mpg}, ...)
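
A self-contained sketch of this approach, reading from an in-memory string instead of a file. The sample rows and the 'Model' column are made up, and the converter here is a slightly more defensive variant: it maps values with no number, such as the 'null' from the question's traceback, to NaN instead of raising.

```python
import io
import math

import pandas as pd

def strip_mpg(s):
    """Converter: '34.6 mpg' -> 34.6; anything non-numeric -> NaN."""
    text = s.strip()
    if text.endswith('mpg'):
        text = text[:-len('mpg')].strip()
    try:
        return float(text)
    except ValueError:
        return math.nan

# Made-up CSV content standing in for the real file.
csv_text = "Model,Power\nCivic,34.6 mpg\nGolf,null\nPolo,30mpg\n"

df = pd.read_csv(io.StringIO(csv_text), converters={'Power': strip_mpg})
print(df)
print(df.dtypes)
```

Doing the cleanup at load time means the text columns are never touched, so there is nothing to restore afterwards.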

【Comments】: