Python error when finding count of cells where value was found

Posted 2023-02-23

Asked: 2018-08-04 18:52:51

Question:

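I have the following code on toy data, and it works the way I want. The last 2 columns give the number of times the value in column Jan is found in column URL (across all rows), and the number of distinct rows of column URL in which the value in column Jan is found.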

import pandas as pd

sales = [{'account': '3', 'Jan': 'xxx', 'Feb': '200 .jones', 'URL': 'ea2018-001.pdf try bbbbb why try'},
         {'account': '1', 'Jan': 'try', 'Feb': '210', 'URL': ''},
         {'account': '2', 'Jan': 'bbbbb', 'Feb': '90', 'URL': 'ea2017-104.pdf bbbbb cc for why try'}]
df = pd.DataFrame(sales)
df

df['found_in_column'] = df['Jan'].apply(lambda x: ''.join(df['URL'].tolist()).count(x))
df['distinct_finds'] = df['Jan'].apply(lambda x: sum(df['URL'].str.contains(x)))

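Why does the same code fail in the last case below? How can I change my code to avoid the error? In my last example the first column contains special characters, and I suspect they are causing the problem. But the rows with index 3 and 4 also contain special characters, and the code runs fine on them.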

answer2=answer[['Value','non_repeat_pdf']].iloc[0:11]


print(answer2)

            Value                                     non_repeat_pdf
0     effect\nive    Initials: __\nDL_  -1- Date: __\n8/14/2017\n...
1         closing                                               @@@@
2       executing                                               @@@@
3          order,                                               @@@@
4         waives:                                               @@@@
5           right                                               @@@@
6          notice                                               @@@@
7       intention                                               @@@@
8        prohibit                                               @@@@
9         further                                               @@@@
10  participation                                               @@@@

answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))

Out[220]: 
0     1
1     0
2     1
3     0
4     1
5     1
6     0
7     0
8     1
9     0
10    0
Name: Value, dtype: int64

answer2=answer[['Value','non_repeat_pdf']].iloc[10:11]


print(answer2)

            Value non_repeat_pdf
10  participation           @@@@

answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))

Out[212]: 
10    0
Name: Value, dtype: int64

answer2=answer[['Value','non_repeat_pdf']].iloc[11:12]


print(answer2)

       Value non_repeat_pdf
11  1818(e);           @@@@

answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))

Traceback (most recent call last):

  File "<ipython-input-215-2df7f4b2de41>", line 1, in <module>
    answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))

  File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py", line 2355, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)

  File "pandas/_libs/src\inference.pyx", line 1574, in pandas._libs.lib.map_infer

  File "<ipython-input-215-2df7f4b2de41>", line 1, in <lambda>
    answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))

  File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\strings.py", line 1562, in contains
    regex=regex)

  File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\strings.py", line 254, in str_contains
    stacklevel=3)

  File "C:\Users\AppData\Local\Continuum\anaconda3\lib\warnings.py", line 99, in _showwarnmsg
    msg.file, msg.line)

  File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1069, in _showwarning
    file.write(formatWarning(message, category, filename, lineno, line))

  File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\utils.py", line 69, in formatWarning
    file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name

IndexError: list index out of range
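Reading the traceback from the bottom up, the IndexError is raised while a warning is being printed, not by `str.contains` itself. The sketch below reproduces only that mechanism with the standard library. `short_name` copies the expression from the last traceback frame; `fragile_hook` and `emit` are hypothetical stand-ins, and it is an assumption that the real warning was attributed to an interactive cell name such as `<ipython-input-...>`.

```python
import warnings

def short_name(filename):
    # Same expression as the last frame of the traceback (PyPDF2/utils.py, line 69).
    return filename.replace("/", "\\").rsplit("\\", 1)[1]

def fragile_hook(message, category, filename, lineno, file=None, line=None):
    # Hypothetical stand-in for a third-party replacement of warnings.showwarning.
    print(f"{short_name(filename)}:{lineno}: {category.__name__}: {message}")

def emit(filename):
    # warn_explicit lets the sketch choose the filename the warning is attributed to.
    warnings.warn_explicit("demo warning", UserWarning, filename, 1)

outcomes = []
with warnings.catch_warnings():          # restores filters and showwarning on exit
    warnings.simplefilter("always")
    warnings.showwarning = fragile_hook
    for name in ("C:/pkg/pandas/core/strings.py", "<ipython-input-215-2df7f4b2de41>"):
        try:
            emit(name)
            outcomes.append("shown")
        except IndexError as exc:
            # No path separator in the name, so rsplit returns a one-item list.
            outcomes.append(f"IndexError: {exc}")

print(outcomes)
```

If this reading is right, any warning raised from an interactive cell would crash the same way, so the error depends on whether a warning is emitted rather than on the data in a particular row.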

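Update

I modified my code and removed all special characters from the Value column. I still get the error... what could be the problem? Even though the error occurs, the new column is still added to my answer2 dataframe.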

answer2=answer[['Value','non_repeat_pdf']]

print(answer2)

             Value                                     non_repeat_pdf
0              law    Initials: __\nDL_  -1- Date: __\n8/14/2017\n...
1        concerned                                                   
2           rights                                                   
3                c                                                   
4          violate                                                   
5                8                                                   
6        agreement                                                   
7           voting                                                   
8       previously                                                   
9      supervisory                                                   
10             its                                                   
11        exercise                                                   
12            occs                                                   
13        entities                                                   
14           those                                                   
15        approved                                                   
16          1818h2                                                   
17               9                                                   
18             are                                                   
19          manner                                                   
20           their                                                   
21         affairs                                                   
22               b                                                   
23         solicit                                                   
24         procure                                                   
25        transfer                                                   
26         attempt                                                   
27      extraneous                                                   
28    modification                                                   
29            vote                                                   
           ...                                                ...
1552       closing                                                   
1553       heavily                                                   
1554            pm                                                   
1555    throughout                                                   
1556          half                                                   
1557        window                                                   
1558   sixtysecond                                                   
1559      activity                                                   
1560      sampling                                                   
1561         using                                                   
1562          hour                                                   
1563      violated                                                   
1564          euro                                                   
1565         rates                                                   
1566   derivatives                                                   
1567    portfolios                                                   
1568     valuation                                                   
1569       parties                                                   
1570      numerous                                                   
1571          they                                                   
1572     reference                                                   
1573       because                                                   
1574            us                                                   
1575     important                                                   
1576        moment                                                   
1577      snapshot                                                   
1578           cet                                                   
1579           215                                                   
1580       finance                                                   
1581   supervision                                                   

[1582 rows x 2 columns]

answer2['found_in_all_PDF'] = answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))

Traceback (most recent call last):

  File "<ipython-input-298-4dc80361895c>", line 1, in <module>
    answer2['found_in_all_PDF'] = answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))

  File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 2331, in __setitem__
    self._set_item(key, value)

  File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 2404, in _set_item
    self._check_setitem_copy()

  File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 1873, in _check_setitem_copy
    warnings.warn(t, SettingWithCopyWarning, stacklevel=stacklevel)

  File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\warnings.py", line 99, in _showwarnmsg
    msg.file, msg.line)

  File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1069, in _showwarning
    file.write(formatWarning(message, category, filename, lineno, line))

  File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\utils.py", line 69, in formatWarning
    file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name

IndexError: list index out of range

Update 2

The following works:

answer2=answer[['Value','non_repeat_pdf']]

xyz= answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))
xyz=xyz.to_frame()
xyz.columns=['found_in_all_PDF']
pd.concat([answer2, xyz], axis=1)
Out[305]: 
             Value                                     non_repeat_pdf  \
0              law    Initials: __\nDL_  -1- Date: __\n8/14/2017\n...   
1        concerned                                                      
2           rights                                                      
3                c                                                      
4          violate                                                      
5                8                                                      
6        agreement                                                      
7           voting                                                      
8       previously                                                      
9      supervisory                                                      
10             its                                                      
11        exercise                                                      
12            occs                                                      
13        entities                                                      
14           those                                                      
15        approved                                                      
16          1818h2                                                      
17               9                                                      
18             are                                                      
19          manner                                                      
20           their                                                      
21         affairs                                                      
22               b                                                      
23         solicit                                                      
24         procure                                                      
25        transfer                                                      
26         attempt                                                      
27      extraneous                                                      
28    modification                                                      
29            vote                                                      
           ...                                                ...   
1552       closing                                                      
1553       heavily                                                      
1554            pm                                                      
1555    throughout                                                      
1556          half                                                      
1557        window                                                      
1558   sixtysecond                                                      
1559      activity                                                      
1560      sampling                                                      
1561         using                                                      
1562          hour                                                      
1563      violated                                                      
1564          euro                                                      
1565         rates                                                      
1566   derivatives                                                      
1567    portfolios                                                      
1568     valuation                                                      
1569       parties                                                      
1570      numerous                                                      
1571          they                                                      
1572     reference                                                      
1573       because                                                      
1574            us                                                      
1575     important                                                      
1576        moment                                                      
1577      snapshot                                                      
1578           cet                                                      
1579           215                                                      
1580       finance                                                      
1581   supervision                                                      

      found_in_all_PDF  
0                    6  
1                    1  
2                    4  
3                 1036  
4                    9  
5                   93  
6                    4  
7                    2  
8                    1  
9                    2  
10                   6  
11                   1  
12                   0  
13                   1  
14                   3  
15                   1  
16                   0  
17                  25  
18                  20  
19                   3  
20                  14  
21                   4  
22                 358  
23                   2  
24                   1  
25                   2  
26                   6  
27                   1  
28                   1  
29                   3  
               ...  
1552                 3  
1553                 2  
1554                 0  
1555                 5  
1556                 2  
1557                 3  
1558                 0  
1559                 2  
1560                 1  
1561                 5  
1562                 2  
1563                 7  
1564                 8  
1565                 3  
1566                 0  
1567                 1  
1568                 1  
1569                 4  
1570                 1  
1571                 9  
1572                 2  
1573                 2  
1574                96  
1575                 1  
1576                 1  
1577                 1  
1578                 0  
1579                 0  
1580                 1  
1581                 0  

[1582 rows x 3 columns]
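One possible reading of why Update 2 works: `answer[['Value','non_repeat_pdf']]` is a selection from `answer`, and assigning a new column to it is what triggers the `SettingWithCopyWarning` visible in the second traceback, whereas `pd.concat` builds a new frame without that assignment. A minimal sketch of a more direct alternative, assuming an explicit `.copy()` is acceptable, is shown below. The three-row `answer` frame is made-up toy data, and the corpus is joined once rather than once per row, as suggested in the comments.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `answer` frame.
answer = pd.DataFrame({
    'Value': ['law', 'rights', '1818(e);'],
    'non_repeat_pdf': ['law and more law', 'section 1818(e); of the law', ''],
})

# An explicit copy makes answer2 an independent frame, so adding a column
# does not raise SettingWithCopyWarning.
answer2 = answer[['Value', 'non_repeat_pdf']].copy()

# Join the corpus once instead of once per row.
corpus = ''.join(answer2['non_repeat_pdf'].tolist())

# Total occurrences of each value across all rows (plain substring count).
answer2['found_in_all_PDF'] = answer2['Value'].apply(lambda v: corpus.count(v))

# Number of rows containing each value; regex=False treats the value literally.
answer2['distinct_finds'] = answer2['Value'].apply(
    lambda v: int(answer2['non_repeat_pdf'].str.contains(v, regex=False).sum())
)

print(answer2)
```

The original `answer` frame is left unchanged, and both new columns live only on `answer2`.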

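Comments:

1. I can't reproduce the error. If you want me to look into it, please share the dataset and describe your environment, e.g. the python, pandas and numpy versions. 2. In general, your toy code looks somewhat inefficient: 2.1 Why do you join all rows of "non_repeat_pdf" at every step (1582 times)? Do it once beforehand. 2.2 "non_repeat_pdf" looks large; maybe joining it 1582 times is too much, and that produces the error? 3. What is your goal? Maybe there is another way to do this?

2. How would you change my toy code?

Join the list once: s = ''.join(df['URL'].tolist()) df['found_in_column'] = df['Jan'].apply(lambda x,s: s.count(x),s=s)

It looks like you have a dictionary in "Value" and a corpus in "non_repeat_pdf". Can "non_repeat_pdf" be tokenized the same way as "Value", e.g. with str.split(' ')?

They are two columns. I create the Value column via ['pdf_text'].str.split(' ', expand=True) and then melt the dataframe... How does that affect the code above?

Answer 1: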

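Unfortunately, I can't reproduce exactly the same error in my environment. What I do see is a warning about incorrect regex usage. Because of the parentheses in the string "1818(e);", your string is interpreted as a regular expression with a capture group. Try using str.contains with regex=False.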

answer2 = pd.DataFrame({'Value': {11: '1818(e);'}, 'non_repeat_pdf': {11: '@@@@'}})
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x,regex=False)))

Output:

11    0
Name: Value, dtype: int64
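To see why the parentheses matter, the value can be compiled with Python's `re` module directly. This is a small illustrative sketch: the `text` string is a made-up sample, and `re.escape` is shown as one alternative to `regex=False` when a regex-based call is still needed.

```python
import re

value = "1818(e);"
text = "pursuant to section 1818(e); the order"   # hypothetical sample text

# Compiled as a regex, the parentheses form a capture group, not literal characters.
as_regex = re.compile(value)
print(as_regex.groups)                         # 1
print(as_regex.search(text) is None)           # True: the pattern looks for "1818e;"
print(as_regex.search("1818e;") is not None)   # True

# Escaping the value makes every character literal again.
literal = re.compile(re.escape(value))
print(literal.groups)                          # 0
print(literal.search(text) is not None)        # True

# A plain substring test gives the same answer as the escaped pattern.
print(value in text)                           # True
```

So with the default `regex=True`, a value like this both triggers the match-group warning and silently searches for the wrong text, which is a second reason to pass `regex=False` here.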

Comments:

I tried a few more things and updated the question. Please help.
