TypeError: 'str' object is not an iterator when concatenating csv files

Posted: 2017-09-07 04:54:06

Question:

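I have a set of csv files to concatenate, and I wrote a function to do the job. However, the final csv (which groups all the csv files) has the header duplicated in its first two rows, and the header is repeated again each time a new csv is concatenated.

As follows: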

    from_line   all_chars_in_the_same_row   page_number words   char    left    top right   bottom
    from_line   all_chars_in_same_row   page_number words   char    left    top right   bottom
0   0   ['m', 'i', 'i', 'l', 'm', 'u', 'i', 'l', 'i', 'l']  1841729699_001  [[mi, il, mu, il, il]]  m   38  104 2456    2492
1   0   ['m', 'i', 'i', 'l', 'm', 'u', 'i', 'l', 'i', 'l']  1841729699_001  [[mi, il, mu, il, il]]  i   40  102 2442    2448

Then, when it is concatenated with a new csv file:

2048    49  ['L', 'A', 'C', 'H', 'E', 'T', 'E', 'U', 'R', 'D', 'É', 'C', 'L', 'A', 'R', 'E', 'A', 'V', 'O', 'I', 'R', 'P', 'R', 'I', 'S', 'C', 'O', 'N', 'N', 'A', 'I', 'S', 'S', 'A', 'N', 'C', 'E', 'D', 'E', 'S', 'C', 'O', 'N', 'D', 'I', 'T', 'I', 'O', 'N', 'S', 'G', 'É', 'N', 'É', 'R', 'A', 'L', 'E', 'S', 'D', 'E', 'V', 'E', 'N', 'T', 'E', 'S', 'T', 'I', 'P', 'U', 'L', 'É', 'E', 'S', 'A', 'U', 'V', 'E', 'R', 'S', 'O', '.'] 1841729699_001  [[lacheteur, declare, avoir, pris, connaissance, des, conditions, generales, de, vente, stipulees, au, verso.]] 0   2364    2366    3426    3429
    from_line   all_chars_in_same_row   page_number words   char    left    top right   bottom
0   0   ['m', 'i', 'i', 'l', 'm', 'u', 'i', 'l', 'i', 'l']  1841729699_001  [[mi, il, mu, il, il]]  m   38  104 2456    2492
1   0   ['m', 'i', 'i', 'l', 'm', 'u', 'i', 'l', 'i', 'l']  1841729699_001  [[mi, il, mu, il, il]]  i   40  102 2442    2448

And so on. My function is as follows:

import os
import glob
import pandas

def concatenate(indir="files",outfile="concatenated.csv"):
    os.chdir(indir)
    fileList=glob.glob("*.csv")
    dfList=[]
    colnames=[" ","from_line","all_chars_in_the_same_row","page_number","words","char","left","top","right","bottom"]
    for filename in fileList:

        print(filename)
        df=pandas.read_csv(filename,header=None)
        dfList.append(df)
    concatDf=pandas.concat(dfList,axis=0)
    concatDf.columns=colnames
    concatDf.to_csv(outfile,index=None)

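To avoid the duplicated header in the first two rows, and the repeated header each time a new file is concatenated, I tried adding: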

header = next(filename)

as follows:

import os
import glob
import pandas

def concatenate(indir="files",outfile="concatenated.csv"):
    os.chdir(indir)
    fileList=glob.glob("*.csv")
    dfList=[]
    colnames=[" ","from_line","all_chars_in_the_same_row","page_number","words","char","left","top","right","bottom"]
    for filename in fileList:

        print(filename)
        header=next(filename)# l got an error in this line
        df=pandas.read_csv(header,header=None)
        dfList.append(df)
    concatDf=pandas.concat(dfList,axis=0)
    concatDf.columns=colnames
    concatDf.to_csv(outfile,index=None)

I get the following error:

  File "<input>", line 13, in concatenate
TypeError: 'str' object is not an iterator

**EDIT1** After making these changes:

import os
import glob
import pandas

def concatenate(indir="files",outfile="concatenated.csv"):
    os.chdir(indir)
    fileList=glob.glob("*.csv")
    dfList=[]
    colnames=[" ","from_line","all_chars_in_the_same_row","page_number","words","char","left","top","right","bottom"]
    for filename in fileList:

        print(filename)
        with open(filename) as f:
             header=next(f)
             df = pandas.read_csv(header, header=None)
             dfList.append(df)
    concatDf = pandas.concat(dfList, axis=0)
    concatDf.columns = colnames
    concatDf.to_csv(outfile, index=None)

I get the following error:

Traceback (most recent call last):
  File "/usr/lib/python3.5/code.py", line 91, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "<input>", line 15, in concatenate
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 646, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 389, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 730, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 923, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 1390, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 373, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4184)
  File "pandas/parser.pyx", line 667, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:8449)
FileNotFoundError: File b',from_line,all_chars_in_same_row,page_number,words,char,left,top,right,bottom\n' does not exist

**EDIT2**

After running this code, the first column appears twice:

import os
import pandas as pd
import glob

fileList=glob.glob("file*.csv")
colNames=[" ","from_line","all_chars_in_the_same_row","page_number","words","char","left","top","right","bottom"]

final_df = pd.DataFrame(columns=colNames)
for fileName in fileList:
    df=pd.read_csv(fileName,skiprows=0) # skip first row w/ headers since you want to set column names yourself
    df.columns = colNames

    final_df = pd.concat([final_df, df], axis=0)


print(final_df)





        from_line
0   0   0
1   1   0
2   2   0
3   3   0
4   4   0
5   5   0
6   6   0
7   7   0
8   8   0
9   9   0
10  10  1
11  11  1
12  12  2

But in the original csv file I have this:

    from_line
0   0
1   0
2   0
3   0
4   0
5   0
6   0
7   0
8   0
9   0

Answer 1:

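glob.glob returns a list of file name strings that match a given shell-style wildcard pattern (not a regular expression). A string is iterable, but it is not an iterator, so next() cannot be called on it directly. An open file object is an iterator: each call to next() returns the next line.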
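
The difference between an iterable and an iterator can be checked in a few lines. This is a minimal sketch: the file name `file1.csv` is hypothetical, and `io.StringIO` is used only as an in-memory stand-in for `open(filename)`.

```python
import io

# A file name as returned by glob.glob is a plain string (hypothetical name).
name = "file1.csv"

try:
    next(name)  # the same kind of call as header=next(filename) in the question
except TypeError as exc:
    error_message = str(exc)  # the message from the question's traceback

# iter() turns the string into an iterator over its characters,
# which is rarely useful for a file name.
first_char = next(iter(name))

# A file object is its own iterator and yields one line per next() call.
f = io.StringIO("a,b\n1,2\n")  # stand-in for open(filename)
file_is_own_iterator = iter(f) is f
header_line = next(f)
```

So the fix is not to make the string an iterator, but to open the file it names and call `next()` on the file object.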

Try opening the file with: f = open(filename)

Your code would look like this:

fileList=glob.glob("*.csv")
for filename in fileList:
    with open(filename) as f:
        header = next(f)
        ...

Note that this code has a number of other issues, both in style and in approach. But as for the error, this should fix it.

import os
import pandas as pd
import glob

file_list=glob.glob("file*.csv")
col_names=[" ","from_line","all_chars_in_the_same_row","page_number","words","char","left","top","right","bottom"]

final_df = pd.DataFrame(columns=col_names)
for filename in file_list:
    df=pd.read_csv(filename,skiprows=0)
    df.columns = col_names

    final_df = pd.concat([final_df, df], axis=0)

# print all
...
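
One way to complete the `with open(filename) as f:` loop shown earlier in this answer is sketched here, as an illustration rather than the answerer's own code. After `next(f)` consumes the header line, the open handle `f` (not the header string) is passed to `pandas.read_csv`. Passing the header text itself makes pandas treat it as a file path, which is what produced the FileNotFoundError in EDIT1. The shortened `COL_NAMES` list, the `concatenate` signature, and the temporary demo files are assumptions made for this example.

```python
import glob
import os
import tempfile

import pandas as pd

# Shortened, hypothetical column list; the first entry names the saved index column.
COL_NAMES = ["saved_index", "from_line", "page_number", "char"]


def concatenate(indir, pattern="*.csv"):
    """Read every matching csv without its header row and stack the rows."""
    frames = []
    for path in sorted(glob.glob(os.path.join(indir, pattern))):
        with open(path, newline="") as f:
            next(f)  # consume the header line; an empty file would raise StopIteration
            # Pass the file handle, not the header string, to read_csv.
            frames.append(pd.read_csv(f, header=None, names=COL_NAMES))
    # A single concat at the end avoids re-copying the data on every iteration.
    return pd.concat(frames, ignore_index=True)


# Demo on two small temporary files that each carry their own header row.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(2):
        with open(os.path.join(tmp, f"file{i}.csv"), "w", newline="") as out:
            out.write(",from_line,page_number,char\n")
            out.write(f"0,{i},p{i},m\n")
            out.write(f"1,{i},p{i},i\n")
    combined = concatenate(tmp)
```

If the header row does not need to be inspected, `pd.read_csv(path, header=0, names=COL_NAMES)` should give the same frame without opening the file by hand.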

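Comments:

@Solo, I can't see your code. But I need an iterator, because I have a set of csv files to concatenate. How does f=open(filename) iterate over all the csv files?

"String objects are not iterable": strings are iterable, but a string is not an iterator.

@timgeb: Correct. My apologies, I will edit the wording of the answer.

Answer 2: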

Update: removed the first column, which is a redundant index column.

import os
import pandas as pd
import glob
os.chdir('/home/max')

fileList=glob.glob("file*.csv")
colNames=["redundant_index_column","from_line","all_chars_in_the_same_row","page_number"]#,"words","char","left","top","right","bottom"]

final_df = pd.DataFrame(columns=colNames)
for fileName in fileList:
    df=pd.read_csv(fileName,skiprows=0)
    df.columns = colNames

    print(df)
    final_df = pd.concat([final_df, df], axis=0).reset_index(drop=True)


final_df = final_df.drop(['redundant_index_column'], axis=1)

print(final_df)

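Comments:

It works, but the first column is duplicated, as follows: from_line 0 0 0 1 1 0 2 2 0 3 3 0 4 4 0 5 5 0 6 6 0 7 7 0 8 8 0 9 9 0 10 10 1 11 11 1 12 12 2

Do you mean that your first column shows up as two columns? I tested this with some arbitrary 3-row CSVs and it did not happen in my tests. Could you elaborate?

I have provided a sample of my csv file and of the final concatenated csv file. Yes, the first column is duplicated. It is this column in colnames=[" ",

Thanks for the clarification. It is not a duplicate of the first column. The first "column" you see is the pandas row index.

Can I disable it? These files were previously processed with pandas, so I only need to keep a single pandas index column.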
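
The last comment is left unanswered in the thread. A possible approach, sketched under the assumption that each input file was written by pandas with its default `index=True` (so the unnamed first column is a saved row index), is to read that column back as the index with `index_col=0`, renumber the rows with `ignore_index=True`, and write the result with `index=False`. The two in-memory CSV strings below are made-up stand-ins for the real files.

```python
import io

import pandas as pd

# Made-up stand-ins for two csv files saved by pandas with the default index=True,
# so each begins with an unnamed index column.
csv_a = ",from_line,char\n0,0,m\n1,0,i\n"
csv_b = ",from_line,char\n0,1,u\n1,2,l\n"

# index_col=0 treats the saved first column as the row index instead of as data.
frames = [pd.read_csv(io.StringIO(text), index_col=0) for text in (csv_a, csv_b)]

# ignore_index=True discards the per-file indexes and renumbers rows from 0.
combined = pd.concat(frames, ignore_index=True)

# index=False leaves the row index out of the written csv entirely.
out = io.StringIO()
combined.to_csv(out, index=False)
result_text = out.getvalue()
```

With this, the only index is the one pandas keeps in memory. When printing, `print(combined.to_string(index=False))` hides it as well.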
