Python error when finding the count of cells where a value was found
Posted: 2018-08-04 18:52:51
Question: I have the following code on toy data, and it works the way I want. The last 2 columns give the number of times the value in column Jan is found in column URL, and the number of distinct rows of column URL in which the value in column Jan is found.
import pandas as pd

sales = [{'account': '3', 'Jan': 'xxx', 'Feb': '200 .jones', 'URL': 'ea2018-001.pdf try bbbbb why try'},
         {'account': '1', 'Jan': 'try', 'Feb': '210', 'URL': ''},
         {'account': '2', 'Jan': 'bbbbb', 'Feb': '90', 'URL': 'ea2017-104.pdf bbbbb cc for why try'}]
df = pd.DataFrame(sales)
df
# Number of times each 'Jan' value occurs in the concatenated 'URL' column
df['found_in_column'] = df['Jan'].apply(lambda x: ''.join(df['URL'].tolist()).count(x))
# Number of distinct 'URL' rows that contain each 'Jan' value
df['distinct_finds'] = df['Jan'].apply(lambda x: sum(df['URL'].str.contains(x)))
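Why does the same code fail in the last case below? How can I change my code to avoid the error? In my last example the first column contains special characters, and I suspect they are causing the problem. But the rows with index 3 and 4 also contain special characters, and the code runs fine on them.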
answer2=answer[['Value','non_repeat_pdf']].iloc[0:11]
print(answer2)
Value non_repeat_pdf
0 effect\nive Initials: __\nDL_ -1- Date: __\n8/14/2017\n...
1 closing @@@@
2 executing @@@@
3 order, @@@@
4 waives: @@@@
5 right @@@@
6 notice @@@@
7 intention @@@@
8 prohibit @@@@
9 further @@@@
10 participation @@@@
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
Out[220]:
0 1
1 0
2 1
3 0
4 1
5 1
6 0
7 0
8 1
9 0
10 0
Name: Value, dtype: int64
answer2=answer[['Value','non_repeat_pdf']].iloc[10:11]
print(answer2)
Value non_repeat_pdf
10 participation @@@@
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
Out[212]:
10 0
Name: Value, dtype: int64
answer2=answer[['Value','non_repeat_pdf']].iloc[11:12]
print(answer2)
Value non_repeat_pdf
11 1818(e); @@@@
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
Traceback (most recent call last):
File "<ipython-input-215-2df7f4b2de41>", line 1, in <module>
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py", line 2355, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/src\inference.pyx", line 1574, in pandas._libs.lib.map_infer
File "<ipython-input-215-2df7f4b2de41>", line 1, in <lambda>
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\strings.py", line 1562, in contains
regex=regex)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\strings.py", line 254, in str_contains
stacklevel=3)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\warnings.py", line 99, in _showwarnmsg
msg.file, msg.line)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1069, in _showwarning
file.write(formatWarning(message, category, filename, lineno, line))
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\utils.py", line 69, in formatWarning
file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name
IndexError: list index out of range
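Update
I modified my code and removed all special characters from the Value column. I still get the error... What could be the problem?
Even though the error is raised, the new column is still added to my answer2 dataframe.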
answer2=answer[['Value','non_repeat_pdf']]
print(answer2)
Value non_repeat_pdf
0 law Initials: __\nDL_ -1- Date: __\n8/14/2017\n...
1 concerned
2 rights
3 c
4 violate
5 8
6 agreement
7 voting
8 previously
9 supervisory
10 its
11 exercise
12 occs
13 entities
14 those
15 approved
16 1818h2
17 9
18 are
19 manner
20 their
21 affairs
22 b
23 solicit
24 procure
25 transfer
26 attempt
27 extraneous
28 modification
29 vote
... ...
1552 closing
1553 heavily
1554 pm
1555 throughout
1556 half
1557 window
1558 sixtysecond
1559 activity
1560 sampling
1561 using
1562 hour
1563 violated
1564 euro
1565 rates
1566 derivatives
1567 portfolios
1568 valuation
1569 parties
1570 numerous
1571 they
1572 reference
1573 because
1574 us
1575 important
1576 moment
1577 snapshot
1578 cet
1579 215
1580 finance
1581 supervision
[1582 rows x 2 columns]
answer2['found_in_all_PDF'] = answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))
Traceback (most recent call last):
File "<ipython-input-298-4dc80361895c>", line 1, in <module>
answer2['found_in_all_PDF'] = answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 2331, in __setitem__
self._set_item(key, value)
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 2404, in _set_item
self._check_setitem_copy()
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 1873, in _check_setitem_copy
warnings.warn(t, SettingWithCopyWarning, stacklevel=stacklevel)
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\warnings.py", line 99, in _showwarnmsg
msg.file, msg.line)
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1069, in _showwarning
file.write(formatWarning(message, category, filename, lineno, line))
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\utils.py", line 69, in formatWarning
file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name
IndexError: list index out of range
Update 2
The following works:
answer2=answer[['Value','non_repeat_pdf']]
xyz= answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))
xyz=xyz.to_frame()
xyz.columns=['found_in_all_PDF']
pd.concat([answer2, xyz], axis=1)
Out[305]:
Value non_repeat_pdf \
0 law Initials: __\nDL_ -1- Date: __\n8/14/2017\n...
1 concerned
2 rights
3 c
4 violate
5 8
6 agreement
7 voting
8 previously
9 supervisory
10 its
11 exercise
12 occs
13 entities
14 those
15 approved
16 1818h2
17 9
18 are
19 manner
20 their
21 affairs
22 b
23 solicit
24 procure
25 transfer
26 attempt
27 extraneous
28 modification
29 vote
... ...
1552 closing
1553 heavily
1554 pm
1555 throughout
1556 half
1557 window
1558 sixtysecond
1559 activity
1560 sampling
1561 using
1562 hour
1563 violated
1564 euro
1565 rates
1566 derivatives
1567 portfolios
1568 valuation
1569 parties
1570 numerous
1571 they
1572 reference
1573 because
1574 us
1575 important
1576 moment
1577 snapshot
1578 cet
1579 215
1580 finance
1581 supervision
found_in_all_PDF
0 6
1 1
2 4
3 1036
4 9
5 93
6 4
7 2
8 1
9 2
10 6
11 1
12 0
13 1
14 3
15 1
16 0
17 25
18 20
19 3
20 14
21 4
22 358
23 2
24 1
25 2
26 6
27 1
28 1
29 3
...
1552 3
1553 2
1554 0
1555 5
1556 2
1557 3
1558 0
1559 2
1560 1
1561 5
1562 2
1563 7
1564 8
1565 3
1566 0
1567 1
1568 1
1569 4
1570 1
1571 9
1572 2
1573 2
1574 96
1575 1
1576 1
1577 1
1578 0
1579 0
1580 1
1581 0
[1582 rows x 3 columns]
Comments:
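1. I cannot reproduce the error. If you want me to look into it, please share the dataset and describe your environment, e.g. your python, pandas and numpy versions. 2. In general, your toy code looks somewhat inefficient: 2.1 Why do you join all rows of "non_repeat_pdf" at every step (1582 times)? Do it once beforehand. 2.2 "non_repeat_pdf" looks large; perhaps joining it 1582 times is too much and that produces the error? 3. What is your goal? Maybe there is another way to achieve it?
Regarding 2: how would you change my toy code?
Like this:
s = ''.join(df['URL'].tolist())
df['found_in_column'] = df['Jan'].apply(lambda x,s: s.count(x),s=s)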
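It looks like you have a dictionary in "Value" and a corpus in "non_repeat_pdf". Could "non_repeat_pdf" be tokenized the same way as "Value", e.g. with str.split(' ')?
They are two columns. I create the Value column with ['pdf_text'].str.split(' ', expand=True) and then melt the dataframe... How does that affect the code above?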
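The two suggestions in the comments above (join the text once, and tokenize the corpus) can be sketched on the toy data from the question. This is only an illustration under stated assumptions: the names corpus, token_counts and token_count are hypothetical, and the rows are joined with a space rather than '' so that words from neighbouring rows do not run together, which can change substring counts relative to the original code.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "Jan": ["xxx", "try", "bbbbb"],
    "URL": [
        "ea2018-001.pdf try bbbbb why try",
        "",
        "ea2017-104.pdf bbbbb cc for why try",
    ],
})

# Build the corpus once instead of once per row inside apply().
corpus = " ".join(df["URL"].tolist())

# Substring counts, as in the question, but reusing the single joined string.
df["found_in_column"] = df["Jan"].apply(corpus.count)

# Whole-token counts: split on whitespace once, then look each word up.
# Unlike substring counting, "c" would not match inside "cc" here.
token_counts = Counter(corpus.split())
df["token_count"] = df["Jan"].map(lambda word: token_counts[word])

print(df)
```

Whether substring counts or whole-token counts are the right measure depends on the goal, which the thread leaves open; short values such as "c" or "b" get very large substring counts in the question's output.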
Answer 1:
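Unfortunately, I cannot reproduce exactly the same error in my environment. What I do see is a warning about incorrect regex usage. Because of the parentheses in the string "1818(e);", your string is interpreted as a regular expression with a capture group. Try using str.contains with regex=False.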
import pandas as pd
answer2 = pd.DataFrame({'Value': {11: '1818(e);'}, 'non_repeat_pdf': {11: '@@@@'}})
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x, regex=False)))
Output:
11 0
Name: Value, dtype: int64
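To make the difference concrete, here is a small sketch comparing three ways of searching for "1818(e);". The sample text in non_repeat_pdf is made up for illustration and is not from the original data. With the default regex=True, the parentheses form a group, so the pattern matches "1818e;" and not the literal string; regex=False or re.escape both search for the literal text.

```python
import re
import warnings

import pandas as pd

# Hypothetical one-row frame; the text column is invented sample data.
answer2 = pd.DataFrame({
    "Value": {11: "1818(e);"},
    "non_repeat_pdf": {11: "see 12 U.S.C. 1818(e); and related text"},
})
needle = answer2["Value"].iloc[0]
text = answer2["non_repeat_pdf"]

# Default regex=True: "(e)" is a group, so the literal parentheses never match.
# pandas may warn about the match group here, so warnings are silenced.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    as_regex = int(text.str.contains(needle).sum())

# Literal substring search: parentheses are ordinary characters.
as_literal = int(text.str.contains(needle, regex=False).sum())

# Regex search with the metacharacters escaped first.
as_escaped = int(text.str.contains(re.escape(needle)).sum())

print(as_regex, as_literal, as_escaped)
```

re.escape is mainly useful when the search term has to be combined with other regex syntax; for plain substring checks regex=False is the simpler choice.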
Comments:
I tried a few more things and updated the question. Please help.
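Regarding the first update: that traceback passes through _check_setitem_copy and warnings.warn(t, SettingWithCopyWarning, ...) before failing inside PyPDF2's _showwarning and formatWarning. That suggests, though it is not confirmed in this thread, that the IndexError is raised while a pandas warning is being displayed, not by the counting itself. Under that assumption, one way to avoid triggering the warning is to take an explicit copy of the column selection before adding a column. The data below is a made-up stand-in for answer.

```python
import pandas as pd

# Hypothetical stand-in for the real `answer` frame.
answer = pd.DataFrame({
    "Value": ["law", "rights", "occs"],
    "non_repeat_pdf": ["law and rights under the law", "", ""],
    "other": [1, 2, 3],
})

# .copy() makes answer2 an independent frame, so assigning a new column
# does not go through pandas' chained-assignment (SettingWithCopy) check.
answer2 = answer[["Value", "non_repeat_pdf"]].copy()

# Join the text once, then count substring occurrences of each value.
corpus = "".join(answer2["non_repeat_pdf"].tolist())
answer2["found_in_all_PDF"] = answer2["Value"].apply(corpus.count)

print(answer2)
```

This would also be consistent with the workaround in Update 2, which never assigns into answer2 and builds the result with pd.concat instead.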