python pandas read_csv delimiter in column data

Posted 2023-03-11

Original title: python pandas read_csv delimiter in column data. Asked: 2015-09-03 02:56:16.

Question:

I have a CSV file of this type:

12012;My Name is Mike. What is your's?;3;0 
1522;In my opinion: It's cool; or at least not bad;4;0
21427;Hello. I like this feature!;5;1

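I want to load this data into a pandas.DataFrame. But read_csv(sep=";") raises an exception because of the semicolon in the user-generated message column on line 2 (In my opinion: It's cool; or at least not bad). All the remaining columns always have numeric dtypes.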
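For reference, the failure can be reproduced with a minimal sketch like the one below. It assumes the three sample lines above are the whole file and feeds them in through io.StringIO instead of a file on disk; the variable names are only illustrative.

```python
import io
import pandas as pd

# The three sample lines from the question, held in memory.
sample = (
    "12012;My Name is Mike. What is your's?;3;0\n"
    "1522;In my opinion: It's cool; or at least not bad;4;0\n"
    "21427;Hello. I like this feature!;5;1\n"
)

error = None
try:
    # Line 1 has 4 fields, line 2 splits into 5 because of the stray ';'.
    pd.read_csv(io.StringIO(sample), sep=";", header=None)
except pd.errors.ParserError as exc:
    error = exc
    print(type(exc).__name__, exc)
```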

What is the most convenient way to handle this?

Comments:

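Can you explain more about your problem? What is your expected output?

My goal is to parse this CSV data into a DataFrame. But it throws an exception because one column contains a semicolon, and pandas thinks it should split that column in two.

Who is generating these ambiguous files, and is there any way you can move heaven and earth to get them to produce something sane?

Answer 1: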

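Dealing with unquoted delimiters is always a nuisance. In this case, since the broken text appears to be surrounded by three correctly encoded columns (one before it, two after), we can recover. TBH, I'd just use the standard Python csv reader and build a DataFrame from it: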

import csv
import pandas as pd

with open("semi.dat", "r", newline="") as fp:
    reader = csv.reader(fp, delimiter=";")
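    # Keep the first field and the last two; re-join everything in between
    # with ';' so the message text is restored as a single column.
    rows = [x[:1] + [';'.join(x[1:-2])] + x[-2:] for x in reader]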
    df = pd.DataFrame(rows)

which produces

       0                                              1  2  3
0  12012               My Name is Mike. What is your's?  3  0
1   1522  In my opinion: It's cool; or at least not bad  4  0
2  21427                    Hello. I like this feature!  5  1

We can then save it right away, and the output will be quoted correctly:

In [67]: df.to_csv("fixedsemi.dat", sep=";", header=None, index=False)

In [68]: more fixedsemi.dat
12012;My Name is Mike. What is your's?;3;0
1522;"In my opinion: It's cool; or at least not bad";4;0
21427;Hello. I like this feature!;5;1

In [69]: df2 = pd.read_csv("fixedsemi.dat", sep=";", header=None)

In [70]: df2
Out[70]: 
       0                                              1  2  3
0  12012               My Name is Mike. What is your's?  3  0
1   1522  In my opinion: It's cool; or at least not bad  4  0
2  21427                    Hello. I like this feature!  5  1

Comments:

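Works fine. That's a nice workaround. Thanks! Anyway, is there a way to hook into the pandas parser and do the split and join "on the fly"?

Is there a better solution for large CSV files? This takes too much time.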
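One possible variation for larger files, offered only as a sketch: skip the csv module and split each line with plain string methods, taking one field from the left and two from the right, so whatever is left in the middle is the message. This relies on the same assumption as the answer above (exactly one leading column and two trailing numeric columns). The column names, split_record, and read_semi are hypothetical names made up for illustration, and whether this is actually faster on a given file has not been measured here.

```python
import io
import pandas as pd

# Hypothetical column names; the original file has no header row.
COLUMNS = ["id", "message", "rating", "flag"]

def split_record(line, n_trailing=2):
    """Split one line into the leading id, the free-text message, and the
    last n_trailing fields. Semicolons inside the message are preserved."""
    first, rest = line.strip().split(";", 1)
    return [first, *rest.rsplit(";", n_trailing)]

def read_semi(fp):
    """Build a DataFrame from an open text file, skipping blank lines."""
    rows = [split_record(line) for line in fp if line.strip()]
    df = pd.DataFrame(rows, columns=COLUMNS)
    # Every column except the message is assumed to hold integers.
    numeric = [c for c in COLUMNS if c != "message"]
    df[numeric] = df[numeric].astype(int)
    return df

# Sample data from the question, in memory instead of a file on disk.
sample = io.StringIO(
    "12012;My Name is Mike. What is your's?;3;0 \n"
    "1522;In my opinion: It's cool; or at least not bad;4;0\n"
    "21427;Hello. I like this feature!;5;1\n"
)
df = read_semi(sample)
print(df)
```

With a real file, the same function could be called as read_semi(open("semi.dat", encoding="utf-8")), where the encoding is also an assumption.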
