numpy genfromtxt IndexError when using comments

Posted 2023-02-21

Asked: 2021-12-27 01:47:15

Question:

I am trying to import data from a text file into Python using genfromtxt(). The code I currently have is

import numpy as np

lowResOmni = np.genfromtxt('omni low res 7-14 to 7-18.txt', dtype=[('year', int), ('SOY', float)
                                                             , ('B', float), ('Bx', float), ('By', float), ('Bz', float)
                                                             , ('plasmaTemp', float), ('ionDensity', float), ('plasmaSpeed', float), ('plasmaPressure', float)
                                                             , ('pFlux1', float), ('pFlux2', float), ('pFlux4', float), ('pFlux10', float)
                                                             , ('DST', int), ('AL', int), ('AU', int)]
                                            , comments="#", skip_header=2, usemask=True
                                            # per-column fill values, keyed by column index
                                            , missing_values={0:'', 1:''
                                                            , 2:999.9, 3:999.9, 4:999.9, 5:999.9
                                                            , 6:9999999., 7:999.9, 8:9999., 9:99.99
                                                            , 10:999999.99, 11:99999.99, 12:99999.99, 13:99999.99
                                                            , 14:99999, 15:99999, 16:99999})

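Whenever the .txt file begins with a line starting with #, this gives me an IndexError: list index out of range. I have tried this exact code on a copy of the .txt file with the comment lines removed, and it works fine, but I would rather not keep two separate files for the data and the comments.

For example,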

TIME_AT_CENTER_OF_HOUR 1AU_IP_MAG_AVG_B 1AU_IP_BX,_GSE 1AU_IP_BY,_GSM 1AU_IP_BZ,_GSM 1AU_IP_PLASMA_TEMP 1AU_IP_N_(ION) 1AU_IP_PLASMA_SPEED 1AU_IP_FLOW_PRESSURE 1AU_PROTONS>1_MEV 1AU_PROTONS>2_MEV 1AU_PROTONS>4_MEV 1AU_PROTONS>10_MEV 1-H_DST 1-H_AL-INDEX AU-INDEX
Year____Secs-of-year                 nT             nT             nT             nT              Deg_K         Per_cc                Km/s                  nPa   1/(SQcm-ster-s)   1/(SQcm-ster-s)   1/(SQcm-ster-s)    1/(SQcm-ster-s)      nT           nT       nT
  2000 16849800.000000          5.50000       -4.90000      -0.800000       -1.20000            321609.        2.80000             606.000              1.92000           614.000           156.000           25.5000            1.87000     -29         -279      234
  2000 16853400.000000          4.30000       -2.90000       -2.90000       0.400000            200127.        3.40000             611.000              2.42000           625.000           159.000           26.2000            1.91000     -20         -245      164
  2000 16857000.000000          3.90000       -2.10000       -2.50000        1.40000            174932.        3.70000             615.000              2.70000           549.000           142.000           23.2000            1.79000     -12         -264      113
  2000 16860600.000000          3.60000       -1.30000       -2.40000       0.600000            148701.        3.40000             616.000              2.61000           492.000           125.000           20.5000            1.62000     -14         -155      109
  2000 16864200.000000          4.10000       -1.00000       -2.20000       0.500000            116372.        2.70000             614.000              2.20000           485.000           124.000           20.5000            1.73000     -20         -140       89
  2000 16867800.000000          4.30000       -1.40000       -1.00000       -3.60000            96452.0        2.50000             607.000              1.91000           465.000           119.000           19.5000            1.63000     -19         -275      240
#  
# Key Parameter and Survey data (labels K0,K1,K2) are preliminary browse data.
# Generated by CDAWeb on: Mon Nov 15 15:35:02 2021

works fine, but

# now we have an error for some reason
TIME_AT_CENTER_OF_HOUR 1AU_IP_MAG_AVG_B 1AU_IP_BX,_GSE 1AU_IP_BY,_GSM 1AU_IP_BZ,_GSM 1AU_IP_PLASMA_TEMP 1AU_IP_N_(ION) 1AU_IP_PLASMA_SPEED 1AU_IP_FLOW_PRESSURE 1AU_PROTONS>1_MEV 1AU_PROTONS>2_MEV 1AU_PROTONS>4_MEV 1AU_PROTONS>10_MEV 1-H_DST 1-H_AL-INDEX AU-INDEX
Year____Secs-of-year                 nT             nT             nT             nT              Deg_K         Per_cc                Km/s                  nPa   1/(SQcm-ster-s)   1/(SQcm-ster-s)   1/(SQcm-ster-s)    1/(SQcm-ster-s)      nT           nT       nT
  2000 16849800.000000          5.50000       -4.90000      -0.800000       -1.20000            321609.        2.80000             606.000              1.92000           614.000           156.000           25.5000            1.87000     -29         -279      234
  2000 16853400.000000          4.30000       -2.90000       -2.90000       0.400000            200127.        3.40000             611.000              2.42000           625.000           159.000           26.2000            1.91000     -20         -245      164
  2000 16857000.000000          3.90000       -2.10000       -2.50000        1.40000            174932.        3.70000             615.000              2.70000           549.000           142.000           23.2000            1.79000     -12         -264      113
  2000 16860600.000000          3.60000       -1.30000       -2.40000       0.600000            148701.        3.40000             616.000              2.61000           492.000           125.000           20.5000            1.62000     -14         -155      109
  2000 16864200.000000          4.10000       -1.00000       -2.20000       0.500000            116372.        2.70000             614.000              2.20000           485.000           124.000           20.5000            1.73000     -20         -140       89
  2000 16867800.000000          4.30000       -1.40000       -1.00000       -3.60000            96452.0        2.50000             607.000              1.91000           465.000           119.000           19.5000            1.63000     -19         -275      240
#  
# Key Parameter and Survey data (labels K0,K1,K2) are preliminary browse data.
# Generated by CDAWeb on: Mon Nov 15 15:35:02 2021

suddenly fails.

Interestingly, the comments at the bottom of the file do not cause a problem.

Thanks for any input!

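Comments:

Just to be sure, which operating system are you using, and what kind of line endings does your file have? LF or CRLF?

@Nullman Python 3.7.5 on macOS 11.2. The .txt file has LF line endings and is UTF-8 encoded.

Have you tried skip_header=3? I'm not sure which happens first, skipping the header or skipping the comment lines. You may need to show the full traceback so that we (and you) can see where the error occurs.

@hpaulj OK, so skip_header=3 worked. I guess skip_header=2 just wasn't accounting for the commented line starting with #. But then I'm confused about why the comments='#' argument doesn't handle it. It must skip the header lines first and only then check for comments.

Answer 1: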

Copying and pasting the example from the question:

In [328]: data=np.genfromtxt(txt, dtype=None, skip_header=3)
In [329]: data
Out[329]: 
array([(2000, 16849800., 5.5, -4.9, -0.8, -1.2, 321609., 2.8, 606., 1.92, 614., 156., 25.5, 1.87, -29, -279, 234),
       (2000, 16853400., 4.3, -2.9, -2.9,  0.4, 200127., 3.4, 611., 2.42, 625., 159., 26.2, 1.91, -20, -245, 164),
       (2000, 16857000., 3.9, -2.1, -2.5,  1.4, 174932., 3.7, 615., 2.7 , 549., 142., 23.2, 1.79, -12, -264, 113),
       (2000, 16860600., 3.6, -1.3, -2.4,  0.6, 148701., 3.4, 616., 2.61, 492., 125., 20.5, 1.62, -14, -155, 109),
       (2000, 16864200., 4.1, -1. , -2.2,  0.5, 116372., 2.7, 614., 2.2 , 485., 124., 20.5, 1.73, -20, -140,  89),
       (2000, 16867800., 4.3, -1.4, -1. , -3.6,  96452., 2.5, 607., 1.91, 465., 119., 19.5, 1.63, -19, -275, 240)],
      dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', '<f8'), ('f8', '<f8'), ('f9', '<f8'), ('f10', '<f8'), ('f11', '<f8'), ('f12', '<f8'), ('f13', '<f8'), ('f14', '<i8'), ('f15', '<i8'), ('f16', '<i8')])

Skipping only two lines:

In [330]: data=np.genfromtxt(txt, dtype=None, skip_header=2)
Traceback (most recent call last):
  File "<ipython-input-330-03af4b10cbea>", line 1, in <module>
    data=np.genfromtxt(txt, dtype=None, skip_header=2)
  File "/usr/local/lib/python3.8/dist-packages/numpy/lib/npyio.py", line 2124, in genfromtxt
    raise ValueError(errmsg)
ValueError: Some errors were detected !
    Line #4 (got 17 columns instead of 16)
    Line #5 (got 17 columns instead of 16)
    Line #6 (got 17 columns instead of 16)
    Line #7 (got 17 columns instead of 16)
    Line #8 (got 17 columns instead of 16)
    Line #9 (got 17 columns instead of 16)

With skip_header=2, it tries to read the third line as data (comment detection must happen later):

In [335]: data=np.genfromtxt(txt, dtype=None, skip_header=2, max_rows=1)
<ipython-input-335-83f145ee8d7c>:1: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
  data=np.genfromtxt(txt, dtype=None, skip_header=2, max_rows=1)
In [336]: data
Out[336]: 
array([b'Year____Secs-of-year', b'nT', b'nT', b'nT', b'nT', b'Deg_K',
       b'Per_cc', b'Km/s', b'nPa', b'1/(SQcm-ster-s)', b'1/(SQcm-ster-s)',
       b'1/(SQcm-ster-s)', b'1/(SQcm-ster-s)', b'nT', b'nT', b'nT'],
      dtype='|S20')

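It detects only 16 fields here, because Year____Secs-of-year is a single token. That throws off the field count for the remaining rows, which have 17 columns.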
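The same behaviour can be reproduced on a much smaller input. The following is a minimal sketch, not the original data: SAMPLE and load are made-up names, and the three-column layout is an assumption chosen so that the units line has fewer tokens than the data rows, as in the real file.

```python
import io
import numpy as np

# Hypothetical miniature of the file: one leading comment, two header lines
# (the second has fewer tokens than the data rows), two data rows,
# and one trailing comment.
SAMPLE = (
    "# leading comment\n"
    "NAME_A NAME_B NAME_C\n"
    "unit_ab unit_c\n"
    "1 2.5 3.5\n"
    "4 5.5 6.5\n"
    "# trailing comment\n"
)

def load(skip):
    """Parse SAMPLE, skipping `skip` raw lines before any comment handling."""
    return np.genfromtxt(io.StringIO(SAMPLE), comments="#", skip_header=skip)

# skip_header=3 accounts for the leading comment line plus both header lines.
good = load(3)

# skip_header=2 leaves the two-token units line as the first parsed row,
# so the three-column data rows are reported as inconsistent.
try:
    load(2)
    failed = False
except ValueError:
    failed = True
```

Here the mismatch surfaces as a ValueError. The IndexError in the question presumably comes from the same miscount interacting with the per-column missing_values dictionary, but that part is not verified here.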

Comments:

I see, so skip_header runs before the comments are handled. Thanks!
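If keeping skip_header=2 regardless of how many comment lines precede the header is preferable, one possible workaround is to drop whole-line comments before genfromtxt sees them. This is only a sketch: genfromtxt_ignoring_comment_lines and sample_lines are hypothetical names, and it relies on genfromtxt accepting a list of strings as its first argument.

```python
import numpy as np

def genfromtxt_ignoring_comment_lines(lines, comment="#", **kwargs):
    """Hypothetical helper: remove whole-line comments first, so that
    skip_header counts only the lines that remain."""
    kept = [line for line in lines if not line.lstrip().startswith(comment)]
    return np.genfromtxt(kept, comments=comment, **kwargs)

# Made-up stand-in for the file contents (not the original data).
sample_lines = [
    "# leading comment\n",
    "NAME_A NAME_B NAME_C\n",
    "unit_ab unit_c\n",
    "1 2.5 3.5\n",
    "4 5.5 6.5\n",
    "# trailing comment\n",
]

# skip_header=2 now refers to the two real header lines.
data = genfromtxt_ignoring_comment_lines(sample_lines, skip_header=2)
```

With a real file, an open file object can be passed as lines, for example inside a with open(...) block, since iterating over it yields the lines one by one.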
