Finding specific text in pandas dataframe column
【Posted】: 2020-05-13 11:35:09
【Question】: I have a dataframe with a column containing paper references, and I want to find every reference that is repeated anywhere in that column.
Here are some rows from the dataframe:
In [1]:
df4.iloc[0:2]
Out[2]:
**cit2ref** **reference** **_id**
0 NaN All about depression: Diagnosis. (2013). Retrieved December 7, 2016,from All About Depression,
http://www.allaboutdepression.com/dia_03.html Y17-1020
0 NaN American Psychological Association. (2016). Center for epidemiological studies depression (CESD).
Retrieved December 7, 2016, from American Psychological Association,
http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/depression-scale.aspx Y17-1020
More rows:
**cit2ref** **reference** **_id**
0 NaN All about depression: Diagnosis. (2013). Retrieved December 7, 2016, from All About Depression, http://www.allaboutdepression.com/dia_03.html Y17-1020
0 NaN American Psychological Association. (2016). Center for epidemiological studies depression (CESD). Retrieved December 7, 2016, from American Psychological Association, http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/depression-scale.aspx Y17-1020
0 NaN American Psychological Association. (2016). Patient health questionnaire (PHQ-9 %27 PHQ-2). Retrieved December 09, 2016, from http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/patient-health.aspx Y17-1020
0 NaN Beattie, G.S. (2005, November). Social Causes of Depression. Retrieved May 31, 2017, from http:// www.personalityresearch.org/papers/beattie.html Y17-1020
0 Burton (2012) Burton, N. (2012, June 5). Depressive Realism. Retrieved May 31, 2017, from https:// www.psychologytoday.com/blog/hide-and-seek/ 201206/depressive-realism Y17-1020
0 NaN Clark, P., Niblett, T. (1988, October 25). The CN2 induction Algorithm. Retrieved May 10, 2017, from https://pdfs.semanticscholar.org/766f/ e3586bda3f36cbcce809f5666d2c2b96c98c.pdf Y17-1020
0 Choudhury, 2014 De Choudhury, M., Counts, S., Horvits, E., %27 Hoff, A. (2014). Characterizing and Predicting Postpartum Depression from Shared Facebook Data. Y17-1020
0 NaN De Choudhury, M., Gamon, M., Couns, S., %27 Horvitz, E. (2013). Predicting Depression via Social Media. Y17-1020
0 Gotlib and Joormann (2010) Gotlib IH, Kasch KL, Traill S, Joormann J, Arnow BA, Johnson SL. (2010) Coherence and specificity of information-processing biases in depression and social phobia. J Abnorm Psychol. 2004;113(3): 386-98. Y17-1020
0 NaN Gotlib, I. H., %27 Hammen, C. L. (1992). Psychological aspects of depression: Toward a cognitive- interpersonal integration. New York: Wiley. Y17-1020
0 NaN Gotlib IH, Joormann J. Cognition and depression: current status and future directions. Annu Rev Clin Psychol. 2010;6:285-312. Y17-1020
0 NaN Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and Tingshao Zhu. "Predicting Depression of Social Media User on Different Observation Windows." 2015 IEEE/ WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI- IAT) (2015): n. pag. Web. Y17-102
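Here '0' is the index of the first paper, which has many references. There are 40k papers, each with about 20 references.
I am looking for any reference that is used again in other papers (each paper has a different index here), together with the indices where it appears and the number of times it is repeated.
I tried regular expressions and the pandas methods value_counts(sort=True).sort_index() and sort_values(), but neither helped.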
Here is the screenshot of the dataframe with 2 papers as indexed '0' and '1'
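【Comments】:
Can you explain what you mean by a reference? Is "American Psychological Association. (2016)." a reference? "Beattie, G.S. (2005, November)."? An example of what you want to achieve would help.
@sammywemmy The 'reference' column value (that is, the entire text up to the '_id' column value) is one reference of a research paper. Scroll horizontally to see the whole row.
@Chris I added more images of the indexed dataframe. I don't know how to write the expected output as code or a dataframe, but I highlighted what I expect. cit2ref has many NaN values because the rows belong to the same citing paper and the value is unknown; I can't drop them because they help align the references with the actual paper.
You can reply to this comment once you have edited the question and I will take another look. Reading about the minimal reproducible example, or this link, may also help. These are meant to guide you toward writing better questions.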
【Answer 1】:
IIUC, use pandas.DataFrame.index.groupby.
Using a dummy dataframe, df (note that I added the last three rows for demonstration):
print(df)
cit2ref reference _id
0 NaN All about depression: Diagnosis. (2013). Retri... Y17-1020
0 NaN American Psychological Association. (2016). Ce... Y17-1020
0 NaN American Psychological Association. (2016). Pa... Y17-1020
0 NaN Beattie, G.S. (2005, November). Social Causes ... Y17-1020
0 NaN Burton (2012) Burton, N. (2012, June 5). D... Y17-1020
0 NaN Clark, P., Niblett, T. (1988, October 25). The... Y17-1020
0 NaN Choudhury, 2014 De Choudhury, M., Counts, ... Y17-1020
0 NaN De Choudhury, M., Gamon, M., Couns, S., %27 Ho... Y17-1020
0 NaN Gotlib and Joormann (2010) Gotlib IH, Kasch K... Y17-1020
0 NaN Gotlib, I. H., %27 Hammen, C. L. (1992). Psych... Y17-1020
0 NaN Gotlib IH, Joormann J. Cognition and depressio... Y17-1020
0 NaN Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and T... Y17-102
1 NaN All about depression: Diagnosis. (2013). Retri... Y17-1020
1 NaN American Psychological Association. (2016). Ce... Y17-1020
1 NaN ***. Not to be grouped-by Y17-102
Then groupby:
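import pandas as pd

# Map each reference text to the index labels (paper numbers) where it occurs
df.index.groupby(df['reference'])
# or convert each group of labels to a plain list, which prints more cleanly
d = {k: list(v) for k, v in df.index.groupby(df['reference']).items()}
# A Series keeps each list in a single cell; DataFrame.from_dict(d, orient='index')
# would instead spread lists of unequal length across NaN-padded columns
new_df = pd.Series(d).reset_index()
print(new_df)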
index 0
0 All about depression: Diagnosis. (2013). Retri... [0, 1]
1 American Psychological Association. (2016). Ce... [0, 1]
2 American Psychological Association. (2016). Pa... [0]
3 Beattie, G.S. (2005, November). Social Causes ... [0]
4 Burton (2012) Burton, N. (2012, June 5). D... [0]
5 Choudhury, 2014 De Choudhury, M., Counts, ... [0]
6 Clark, P., Niblett, T. (1988, October 25). The... [0]
7 De Choudhury, M., Gamon, M., Couns, S., %27 Ho... [0]
8 Gotlib IH, Joormann J. Cognition and depressio... [0]
9 Gotlib and Joormann (2010) Gotlib IH, Kasch K... [0]
10 Gotlib, I. H., %27 Hammen, C. L. (1992). Psych... [0]
11 Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and T... [0]
12 ***. Not to be grouped-by [1]
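The same table can also be built with a regular DataFrame.groupby, which returns the index list and the count side by side. The following is only a sketch: the frame is a made-up miniature with placeholder reference strings, and the column names paper, papers and count are chosen here for illustration.

```python
import pandas as pd

# Hypothetical miniature of the frame above: the index is the paper number,
# and the reference strings are shortened placeholders.
df = pd.DataFrame(
    {"reference": ["Ref A", "Ref B", "Ref C", "Ref A", "Ref B", "Ref D"]},
    index=[0, 0, 0, 1, 1, 1],
)

# Turn the paper index into a regular column so it can be aggregated.
flat = df.rename_axis("paper").reset_index()

# One row per distinct reference: the papers citing it and how many rows it has.
summary = (
    flat.groupby("reference")
    .agg(papers=("paper", list), count=("paper", "size"))
    .reset_index()
)
print(summary)
```

Grouping on the exact string means two references only match when their text is identical, character for character.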
You can see which references appear under which indices. If you want counts instead, use len in place of list:
# Count how many rows carry each reference text
d = {k: len(v) for k, v in df.index.groupby(df['reference']).items()}
new_df = pd.DataFrame.from_dict(d, orient='index').reset_index()
print(new_df)
Output:
index 0
0 All about depression: Diagnosis. (2013). Retri... 2
1 American Psychological Association. (2016). Ce... 2
2 American Psychological Association. (2016). Pa... 1
3 Beattie, G.S. (2005, November). Social Causes ... 1
4 Burton (2012) Burton, N. (2012, June 5). D... 1
5 Choudhury, 2014 De Choudhury, M., Counts, ... 1
6 Clark, P., Niblett, T. (1988, October 25). The... 1
7 De Choudhury, M., Gamon, M., Couns, S., %27 Ho... 1
8 Gotlib IH, Joormann J. Cognition and depressio... 1
9 Gotlib and Joormann (2010) Gotlib IH, Kasch K... 1
10 Gotlib, I. H., %27 Hammen, C. L. (1992). Psych... 1
11 Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and T... 1
12 ***. Not to be grouped-by 1
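The counts above include references that occur only once. If only the repeated ones are of interest, one possible approach is to filter on the count, or to select the duplicated rows directly. This is a sketch on a made-up frame with placeholder reference strings, not the original data.

```python
import pandas as pd

# Hypothetical miniature frame: index = paper number, placeholder references.
df = pd.DataFrame(
    {"reference": ["Ref A", "Ref B", "Ref C", "Ref A", "Ref B", "Ref D"]},
    index=[0, 0, 0, 1, 1, 1],
)

# Count occurrences of each reference text across the whole column.
counts = df["reference"].value_counts()

# Keep only the references that occur more than once.
repeated = counts[counts > 1]
print(repeated)

# Rows of the original frame whose reference text also appears in another row;
# keep=False marks every member of a duplicate group, not just the later ones.
dupe_rows = df[df["reference"].duplicated(keep=False)]
print(dupe_rows)
```

Note that this counts rows, so a reference listed twice within a single paper would also show up as repeated.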
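【Comments】:
Thanks. Does this look up every reference across the whole 'reference' column, check for duplicates, and give the count and the indices when one is found?
Yes. The first part gives the indices of each reference, and taking the length of each group is the same as counting. Note, however, that this includes items that are not repeated (see "***. Not to be grouped-by").
The dict comprehension only converts the values from arrays to lists for a prettier repr. Turning it into a dataframe has the same effect, but the result is easier to manage than a dict.
I got this. I don't understand what the result in the index column is, or why there are 521 new columns with NaN values. imgur.com/pAwBcTv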
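A likely explanation for the many NaN columns in the last comment: when a dict whose values are lists of different lengths is passed to DataFrame.from_dict with orient='index', every list element gets its own column, so the number of columns equals the length of the longest list. The sketch below uses a made-up two-key dict to show the difference from a Series, which keeps each list in one cell; the column names reference and indices are illustrative.

```python
import pandas as pd

# Hypothetical dict shaped like `d` in the answer: reference -> list of indices.
d = {"Ref A": [0, 1], "Ref B": [0]}

# orient='index' treats each list as a row, so every list element becomes
# its own column and shorter lists are padded with NaN.
wide = pd.DataFrame.from_dict(d, orient="index")
print(wide.shape)

# A Series stores each list as a single object, giving one value per reference.
tall = pd.Series(d).reset_index()
tall.columns = ["reference", "indices"]
print(tall)
```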