Read specific columns with pandas or other python module

Posted 2023-02-23

【Title】Read specific columns with pandas or other python module 【Posted】2014-11-21 16:21:26 【Question】:

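I have a csv file from a webpage. I want to read some of the columns in the downloaded file (the csv version can be downloaded in the upper right corner).

Let's say I want 2 columns:

Column 59, which in the header is star_name, and column 60, which in the header is ra.

However, for some reason the authors of the webpage sometimes decide to move the columns around.

In the end I want something like this, keeping in mind that values can be missing.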

data = #read data in a clever way
names = data['star_name']
ras = data['ra']

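That would keep my program from breaking when the columns are moved again in the future, as long as they keep the correct names.

Until now I have tried various approaches with the csv module and, more recently, the pandas module. Both without any luck.

EDIT (added two lines plus the header of my data file. Sorry, it is extremely long.)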

# name, mass, mass_error_min, mass_error_max, radius, radius_error_min, radius_error_max, orbital_period, orbital_period_err_min, orbital_period_err_max, semi_major_axis, semi_major_axis_error_min, semi_major_axis_error_max, eccentricity, eccentricity_error_min, eccentricity_error_max, angular_distance, inclination, inclination_error_min, inclination_error_max, tzero_tr, tzero_tr_error_min, tzero_tr_error_max, tzero_tr_sec, tzero_tr_sec_error_min, tzero_tr_sec_error_max, lambda_angle, lambda_angle_error_min, lambda_angle_error_max, impact_parameter, impact_parameter_error_min, impact_parameter_error_max, tzero_vr, tzero_vr_error_min, tzero_vr_error_max, K, K_error_min, K_error_max, temp_calculated, temp_measured, hot_point_lon, albedo, albedo_error_min, albedo_error_max, log_g, publication_status, discovered, updated, omega, omega_error_min, omega_error_max, tperi, tperi_error_min, tperi_error_max, detection_type, mass_detection_type, radius_detection_type, alternate_names, molecules, star_name, ra, dec, mag_v, mag_i, mag_j, mag_h, mag_k, star_distance, star_metallicity, star_mass, star_radius, star_sp_type, star_age, star_teff, star_detected_disc, star_magnetic_field
11 Com b,19.4,1.5,1.5,,,,326.03,0.32,0.32,1.29,0.05,0.05,0.231,0.005,0.005,0.011664,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2008,2011-12-23,94.8,1.5,1.5,2452899.6,1.6,1.6,Radial Velocity,,,,,11 Com,185.1791667,17.7927778,4.74,,,,,110.6,-0.35,2.7,19.0,G8 III,,4742.0,,
11 UMi b,10.5,2.47,2.47,,,,516.22,3.25,3.25,1.54,0.07,0.07,0.08,0.03,0.03,0.012887,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2009,2009-08-13,117.63,21.06,21.06,2452861.05,2.06,2.06,Radial Velocity,,,,,11 UMi,229.275,71.8238889,5.02,,,,,119.5,0.04,1.8,24.08,K4III,1.56,4340.0,,

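【Comments】:

Does it work if you do df = pd.read_csv('data.csv', usecols=['star_name','ra'])?

No. It gives me a ValueError: 'star_name' is not in list. I tried something similar with the keyword names instead of usecols, but that did not work either (it ran without errors, though).

I don't see either of those columns in that table, so maybe that is the problem. Post the first few rows of the data you are actually using.

One problem seems to be that the column names begin with a single space, e.g. " star_name".

@PaulH Yes, it is not visible in the table, but it is in the downloaded version. I have added a few lines above.

@ajcr I will try the skipinitialspace parameter in pd.read_csv. Thanks for noticing.

【Answer 1】:

An easy way to do this is to use the pandas library like this.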

import pandas as pd
fields = ['star_name', 'ra']

df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print df.keys()
# See content in 'star_name'
print df.star_name

The problem here was skipinitialspace, which removes the spaces in the header. So ' star_name' becomes 'star_name'.
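To make the effect visible, here is a minimal sketch, assuming a made-up three-column miniature of the file (the in-memory csv_text and the variable names are illustrative and not part of the original data). It compares the header with and without skipinitialspace and shows that a missing value comes back as NaN.

```python
import io
import pandas as pd

# Hypothetical miniature of the file: header names after the first
# one start with a space, and one star_name value is missing.
csv_text = (
    "# name, star_name, ra\n"
    "11 Com b,11 Com,185.1791667\n"
    "11 UMi b,,229.275\n"
)

# Read only the header row, without stripping anything.
raw_columns = list(pd.read_csv(io.StringIO(csv_text), nrows=0).columns)

# Strip the space after each delimiter, then select columns by name.
df = pd.read_csv(
    io.StringIO(csv_text),
    skipinitialspace=True,
    usecols=["star_name", "ra"],
)

names = df["star_name"]  # the missing entry is NaN
ras = df["ra"]

print(raw_columns)
print(list(df.columns))
```

Because the columns are selected by name, the lookup keeps working if the authors reorder the columns, as long as the names stay the same.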

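【Comments】:

Thanks for skipinitialspace.

Columns can also be specified by number, e.g. usecols=[0, 1] will read only the first two columns.

@Daniel Thaagaard Andreasen, how would I write it if I also need 'utf-8'? I also have to use skipinitialspace=True.

【Answer 2】:

According to the latest pandas documentation, you can read a csv file and select only the columns you want to read.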

import pandas as pd

df = pd.read_csv('some_data.csv', usecols = ['col1','col2'], low_memory = True)

Here we use usecols, which reads only the selected columns into the dataframe.

We use low_memory so that the file is processed internally in chunks.
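Note that low_memory only changes how the parser works internally; read_csv still returns one complete DataFrame. If the goal is to handle a large file piece by piece, a different parameter, chunksize, makes read_csv return an iterator of DataFrames. A minimal sketch, assuming a small made-up in-memory table standing in for 'some_data.csv' (csv_text and the variable names are illustrative):

```python
import io
import pandas as pd

# Hypothetical stand-in for 'some_data.csv'.
csv_text = (
    "col1,col2,col3\n"
    "1,a,x\n"
    "2,b,y\n"
    "3,c,z\n"
)

# With chunksize set, read_csv yields DataFrames of at most 2 rows each,
# and usecols still limits every chunk to the selected columns.
chunks = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["col1", "col2"],
    chunksize=2,
)

total = 0
rows_seen = 0
for chunk in chunks:
    total += chunk["col1"].sum()
    rows_seen += len(chunk)

print(total, rows_seen)
```

Only one chunk is held in memory at a time, so this pattern suits aggregations that can be accumulated incrementally.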

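【Comments】:

If you want the file to be parsed in chunks, you should leave low_memory at its default value of True. See the docs.

【Answer 3】:

The answers above are in python2. So for python 3 users I am giving this answer. You can use the following code: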

import pandas as pd
fields = ['star_name', 'ra']

df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print(df.keys())
# See content in 'star_name'
print(df.star_name)

【Comments】:

【Answer 4】:

I solved the above problem in a different way: I read the entire csv file, but tweak the display part to show only the desired content.

import pandas as pd

df = pd.read_csv('data.csv', skipinitialspace=True)
print(df[['star_name', 'ra']])

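This can help in some scenarios when learning the basics and filtering data based on the columns of a dataframe.

【Comments】:

Thanks for sharing :) This is also how I would do it now.

I am learning the basics and trying to get into data analysis. @DanielThaagaardAndreasen, can you guide me on the steps for using pandas? Any good resources or tutorials?

Sure :) I think a lot can be found on youtube. I like this channel: youtube.com/channel/UCnVzApLJE2ljPZSeQylSEyg

Regarding the answer: in this case pandas will first load the entire dataset into memory. That can be a problem if the file is large.

【Answer 5】:

I think you need to try this method.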

import pandas as pd

data_df = pd.read_csv('data.csv')

print(data_df['star_name'])

print(data_df['ra'])
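With the file shown in the question, the header names carry a leading space, so this lookup would likely raise a KeyError unless the names are cleaned up first. A minimal sketch, assuming a tiny made-up sample of the file (csv_text is illustrative), that strips the whitespace from the column labels after reading:

```python
import io
import pandas as pd

# Hypothetical miniature of the file with space-prefixed header names.
csv_text = (
    "# name, star_name, ra\n"
    "11 Com b,11 Com,185.1791667\n"
    "11 UMi b,11 UMi,229.275\n"
)

data_df = pd.read_csv(io.StringIO(csv_text))

# Normalise the labels: ' star_name' becomes 'star_name'.
data_df.columns = data_df.columns.str.strip()

star_names = data_df["star_name"].tolist()
ras = data_df["ra"].tolist()

print(star_names)
print(ras)
```

This only cleans the column labels; passing skipinitialspace=True to read_csv also strips the space after each delimiter in the data rows.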

【Comments】:
