Constructing a co-occurrence matrix in python pandas
【Posted】: 2014-01-01 16:18:39
【Question】: I know how to do this in R. However, is there any function in pandas that transforms a dataframe into an nxn co-occurrence matrix containing the counts of two aspects co-occurring?
For example, a DataFrame df:
import pandas as pd
df = pd.DataFrame({'TFD' : ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack' : ['1', '0', '1', '1', '0', '0'],
                   'Trans' : ['1', '1', '1', '0', '0', '1'],
                   'Dop' : ['1', '0', '1', '0', '1', '1']}).set_index('TFD')
print(df)
>>>
     Dop  Snack  Trans
TFD
AA     1      1      1
SL     0      0      1
BB     1      1      1
D0     0      1      0
Dk     1      0      0
FF     1      0      1
[6 rows x 3 columns]
would yield:
       Dop  Snack  Trans
Dop      0      2      3
Snack    2      0      2
Trans    3      2      0
Since the matrix is mirrored across the diagonal, I guess there is a way to optimize the code.
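【Comments】:
【Answer 1】: It's simple linear algebra: you multiply the transpose of the matrix by the matrix itself (your example contains strings, so don't forget to convert them to integers):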
>>> df_asint = df.astype(int)
>>> coocc = df_asint.T.dot(df_asint)
>>> coocc
       Dop  Snack  Trans
Dop      4      2      3
Snack    2      3      2
Trans    3      2      4
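Why the product gives co-occurrence counts can be written out for a 0/1 matrix with one row per observation. The symbols $X$, $x_{ki}$, $C$ and $n$ are notation introduced here for illustration and do not appear in the answer:

```latex
C = X^{\top} X, \qquad
C_{ij} = \sum_{k=1}^{n} x_{ki}\, x_{kj}
```

Each product $x_{ki} x_{kj}$ is 1 only when row $k$ has both column $i$ and column $j$ set, so $C_{ij}$ counts the rows in which the two co-occur. The diagonal entry $C_{ii}$ counts the rows in which column $i$ is set, which is why the diagonal of the result is 4, 3 and 4 rather than 0.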
If, as in the R answer, you want to reset the diagonal, you can use numpy's fill_diagonal:
>>> import numpy as np
>>> np.fill_diagonal(coocc.values, 0)
>>> coocc
       Dop  Snack  Trans
Dop      0      2      3
Snack    2      0      2
Trans    3      2      0
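The steps of this answer can be combined into one helper. This is a minimal sketch, not the answerer's code: the name `cooccurrence` is hypothetical, values that are missing or cannot be parsed are assumed to mean 0, and `pd.to_numeric` is used for the conversion. The diagonal is cleared on a NumPy copy rather than through `coocc.values`, because newer pandas versions may return a copy or a read-only array there.

```python
import numpy as np
import pandas as pd


def cooccurrence(df, zero_diagonal=True):
    """Column-by-column co-occurrence counts of a 0/1 DataFrame.

    Hypothetical helper: entries are coerced to numbers, and anything
    missing or unparseable is assumed to mean "absent" (0).
    """
    numeric = df.apply(pd.to_numeric, errors="coerce").fillna(0).astype(int)
    # Transpose times the frame itself, copied into a writable array.
    counts = numeric.T.dot(numeric).to_numpy(copy=True)
    if zero_diagonal:
        np.fill_diagonal(counts, 0)
    return pd.DataFrame(counts, index=numeric.columns, columns=numeric.columns)


df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': ['1', '0', '1', '1', '0', '0'],
                   'Trans': ['1', '1', '1', '0', '0', '1'],
                   'Dop': ['1', '0', '1', '0', '1', '1']}).set_index('TFD')

coocc = cooccurrence(df)
print(coocc)
```

Passing `zero_diagonal=False` keeps the per-column totals on the diagonal, which is useful if the counts are to be normalised afterwards.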
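【Comments】:
I should look into numpy more. You are just taking the dot product of the matrix with its transpose. I thought I could do it in one step in pandas with df.T.dot(df), but I get a dtype error.
Your frame contains strings; they need to be converted, either as @alko suggests or with df.convert_objects(convert_numeric=True)
@Jeff Yes, I realized that; I was coding and replying at the same time.
@alko How can I skip NaN values in the above solution? I don't want to drop an entire column because one of its observations is NaN.
@vagabond How about df.fillna(0)?
【Answer 2】:
A demo in NumPy: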
import numpy as np
np.random.seed(3) # for reproducibility
# Generate data: 5 labels, 10 examples, binary.
label_headers = 'Alice Bob Carol Dave Eve'.split(' ')
label_data = np.random.randint(0,2,(10,5)) # binary here but could be any integer.
print('labels:\n{0}'.format(label_data))
# Compute cooccurrence matrix
cooccurrence_matrix = np.dot(label_data.transpose(),label_data)
print('\ncooccurrence_matrix:\n{0}'.format(cooccurrence_matrix))
# Compute cooccurrence matrix in percentage
# FYI: http://***.com/questions/19602187/numpy-divide-each-row-by-a-vector-element
#      http://***.com/questions/26248654/numpy-return-0-with-divide-by-zero/32106804#32106804
cooccurrence_matrix_diagonal = np.diagonal(cooccurrence_matrix)
with np.errstate(divide='ignore', invalid='ignore'):
    # Divide each row i by the number of occurrences of label i
    cooccurrence_matrix_percentage = np.nan_to_num(np.true_divide(cooccurrence_matrix, cooccurrence_matrix_diagonal[:, None]))
print('\ncooccurrence_matrix_percentage:\n{0}'.format(cooccurrence_matrix_percentage))
Output:
labels:
[[0 0 1 1 0]
[0 0 1 1 1]
[0 1 1 1 0]
[1 1 0 0 0]
[0 1 1 0 0]
[0 1 0 0 0]
[0 1 0 1 1]
[0 1 0 0 1]
[1 0 0 1 0]
[1 0 1 1 1]]
cooccurrence_matrix:
[[3 1 1 2 1]
[1 6 2 2 2]
[1 2 5 4 2]
[2 2 4 6 3]
[1 2 2 3 4]]
cooccurrence_matrix_percentage:
[[ 1. 0.33333333 0.33333333 0.66666667 0.33333333]
[ 0.16666667 1. 0.33333333 0.33333333 0.33333333]
[ 0.2 0.4 1. 0.8 0.4 ]
[ 0.33333333 0.33333333 0.66666667 1. 0.5 ]
[ 0.25 0.5 0.5 0.75 1. ]]
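Note that dividing each row by its diagonal entry gives an asymmetric matrix: entry (i, j) is the share of label i's occurrences that also contain label j. If a symmetric normalisation is wanted, one option is Jaccard similarity, i.e. "i and j" divided by "i or j". The following is only a sketch under stated assumptions, not part of the demo: it hard-codes the label matrix printed above and assumes the data is binary.

```python
import numpy as np

# Label matrix copied from the printed output above (10 examples, 5 labels).
label_data = np.array([[0, 0, 1, 1, 0],
                       [0, 0, 1, 1, 1],
                       [0, 1, 1, 1, 0],
                       [1, 1, 0, 0, 0],
                       [0, 1, 1, 0, 0],
                       [0, 1, 0, 0, 0],
                       [0, 1, 0, 1, 1],
                       [0, 1, 0, 0, 1],
                       [1, 0, 0, 1, 0],
                       [1, 0, 1, 1, 1]])

intersection = label_data.T @ label_data          # examples with i and j
counts = np.diagonal(intersection)                # examples with each label
union = counts[:, None] + counts[None, :] - intersection  # examples with i or j

# Labels that never occur have union == 0; their similarity is left at 0.
jaccard = np.divide(intersection, union,
                    out=np.zeros(union.shape, dtype=float),
                    where=union > 0)
print(np.round(jaccard, 3))
```

With this normalisation Alice/Bob and Bob/Alice both come out as 1 / (3 + 6 - 1) = 0.125.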
Plotting a heatmap with matplotlib:
import numpy as np
np.random.seed(3) # for reproducibility
import matplotlib.pyplot as plt
def show_values(pc, fmt="%.2f", **kw):
    '''
    Heatmap with text in each cell with matplotlib's pyplot
    Source: http://***.com/a/25074150/395857
    By HYRY
    '''
    pc.update_scalarmappable()
    ax = pc.axes  # get_axes() is not available in recent matplotlib
    # zip replaces Python 2's itertools.izip; ravel() also copes with a 2-D array
    for p, color, value in zip(pc.get_paths(), pc.get_facecolors(), pc.get_array().ravel()):
        x, y = p.vertices[:-2, :].mean(0)
        if np.all(color[:3] > 0.5):
            color = (0.0, 0.0, 0.0)
        else:
            color = (1.0, 1.0, 1.0)
        ax.text(x, y, fmt % value, ha="center", va="center", color=color, **kw)
def cm2inch(*tupl):
    '''
    Specify figure size in centimeter in matplotlib
    Source: http://***.com/a/22787457/395857
    By gns-ank
    '''
    inch = 2.54
    if isinstance(tupl[0], tuple):
        return tuple(i/inch for i in tupl[0])
    else:
        return tuple(i/inch for i in tupl)
def heatmap(AUC, title, xlabel, ylabel, xticklabels, yticklabels):
    '''
    Inspired by:
    - http://***.com/a/16124677/395857
    - http://***.com/a/25074150/395857
    '''
    # Plot it out
    fig, ax = plt.subplots()
    c = ax.pcolor(AUC, edgecolors='k', linestyle= 'dashed', linewidths=0.2, cmap='RdBu', vmin=0.0, vmax=1.0)
    # put the major ticks at the middle of each cell
    ax.set_yticks(np.arange(AUC.shape[0]) + 0.5, minor=False)
    ax.set_xticks(np.arange(AUC.shape[1]) + 0.5, minor=False)
    # set tick labels
    #ax.set_xticklabels(np.arange(1,AUC.shape[1]+1), minor=False)
    ax.set_xticklabels(xticklabels, minor=False)
    ax.set_yticklabels(yticklabels, minor=False)
    # set title and x/y labels
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    # Remove last blank column
    plt.xlim( (0, AUC.shape[1]) )
    # Hide the tick marks (tick1On/tick2On are not available in recent matplotlib)
    ax = plt.gca()
    ax.tick_params(axis='both', which='major', length=0)
    # Add color bar
    plt.colorbar(c)
    # Add text in each cell
    show_values(c)
    # Proper orientation (origin at the top left instead of bottom left)
    ax.invert_yaxis()
    ax.xaxis.tick_top()
    # resize
    fig = plt.gcf()
    fig.set_size_inches(cm2inch(40, 20))
def main():
    # Generate data: 5 labels, 10 examples, binary.
    label_headers = 'Alice Bob Carol Dave Eve'.split(' ')
    label_data = np.random.randint(0,2,(10,5)) # binary here but could be any integer.
    print('labels:\n{0}'.format(label_data))
    # Compute cooccurrence matrix
    cooccurrence_matrix = np.dot(label_data.transpose(),label_data)
    print('\ncooccurrence_matrix:\n{0}'.format(cooccurrence_matrix))
    # Compute cooccurrence matrix in percentage
    # FYI: http://***.com/questions/19602187/numpy-divide-each-row-by-a-vector-element
    #      http://***.com/questions/26248654/numpy-return-0-with-divide-by-zero/32106804#32106804
    cooccurrence_matrix_diagonal = np.diagonal(cooccurrence_matrix)
    with np.errstate(divide='ignore', invalid='ignore'):
        cooccurrence_matrix_percentage = np.nan_to_num(np.true_divide(cooccurrence_matrix, cooccurrence_matrix_diagonal[:, None]))
    print('\ncooccurrence_matrix_percentage:\n{0}'.format(cooccurrence_matrix_percentage))
    # Add count in labels
    label_header_with_count = [ '{0} ({1})'.format(label_header, cooccurrence_matrix_diagonal[label_number]) for label_number, label_header in enumerate(label_headers)]
    print('\nlabel_header_with_count: {0}'.format(label_header_with_count))
    # Plotting
    x_axis_size = cooccurrence_matrix_percentage.shape[0]
    y_axis_size = cooccurrence_matrix_percentage.shape[1]
    title = "Co-occurrence matrix\n"
    xlabel= ''#"Labels"
    ylabel= ''#"Labels"
    xticklabels = label_header_with_count
    yticklabels = label_header_with_count
    heatmap(cooccurrence_matrix_percentage, title, xlabel, ylabel, xticklabels, yticklabels)
    plt.savefig('image_output.png', dpi=300, format='png', bbox_inches='tight') # use format='svg' or 'pdf' for vectorial pictures
    #plt.show()
if __name__ == "__main__":
    main()
    #cProfile.run('main()') # if you want to do some profiling
(PS: a neat visualization of a co-occurrence matrix in D3.js.)
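【Comments】:
Why does Alice-Bob yield a different value from Bob-Alice (0.33 vs. 0.17)? To normalize the co-occurrence matrix, I don't think you should just divide each row by its diagonal entry. I used Jaccard similarity (cooccurrence_matrix is your "i and j"; now compute "i or j" and divide each entry of the matrix by it). You should find that the matrix is symmetric: Alice/Bob yields the same value as Bob/Alice.
【Answer 3】: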
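If you have a larger corpus and a term-frequency matrix, sparse matrix multiplication may be more efficient. I use the same matrix multiplication trick as in alko's answer on this page.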
import scipy.sparse as sp
X = sp.csr_matrix(df.astype(int).values) # convert dataframe to sparse matrix
Xc = X.T * X # sparse matrix multiplication (transpose times matrix)
Xc.setdiag(0) # reset diagonal
print(Xc.todense()) # to print co-occurrence matrix in dense format
Xc here will be the co-occurrence matrix in sparse format (call Xc.tocsr() if CSR is specifically needed).
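The product only counts shared rows when the entries are 0/1; with raw term counts each entry becomes a sum of products of counts. The sketch below shows one way to binarise first. It is an illustration under assumptions: the small `counts` array is made-up data, and binarising with `X > 0` is a choice made here, not something the answer prescribes.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical term-document counts: 3 documents (rows) x 3 terms (columns).
counts = np.array([[2, 0, 1],
                   [1, 3, 0],
                   [0, 1, 1]])

X = sp.csr_matrix(counts)
X_bin = (X > 0).astype(np.int64)   # 1 wherever the term appears at all

Xc = (X_bin.T @ X_bin).tocsr()     # documents that contain both terms
Xc.setdiag(0)                      # drop self co-occurrence
Xc.eliminate_zeros()               # remove explicit zeros left on the diagonal

print(Xc.toarray())
```

Skipping the binarisation step on the same data would give, for example, 2 for the first two terms instead of 1, because the second document's counts 1 and 3 are multiplied rather than counted once.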
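【Comments】:
This only holds if the TD matrix is binary.
【Answer 4】: To elaborate further on this question, if you want to construct a co-occurrence matrix from sentences, you can do the following: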
import numpy as np
import pandas as pd
def create_cooccurrence_matrix(sentences, window_size=2):
    """Create co occurrence matrix from given list of sentences.
    Returns:
    - vocabs: dictionary mapping each word to its row/column index
    - co_occ_matrix_sparse: sparse co occurrence matrix
    Example:
    ===========
    sentences = ['I love nlp', 'I love to learn',
                 'nlp is future', 'nlp is cool']
    vocabs,co_occ = create_cooccurrence_matrix(sentences)
    df_co_occ = pd.DataFrame(co_occ.todense(),
                             index=list(vocabs.keys()),
                             columns=list(vocabs.keys()))
    df_co_occ = df_co_occ.sort_index()[sorted(vocabs.keys())]
    df_co_occ.style.applymap(lambda x: 'color: red' if x>0 else '')
    """
    import scipy.sparse
    import nltk
    vocabulary = {}
    data = []
    row = []
    col = []
    # word_tokenize needs NLTK's tokenizer data to be downloaded beforehand
    tokenizer = nltk.tokenize.word_tokenize
    for sentence in sentences:
        sentence = sentence.strip()
        tokens = [token for token in tokenizer(sentence) if token != u""]
        for pos, token in enumerate(tokens):
            i = vocabulary.setdefault(token, len(vocabulary))
            # neighbours at most window_size positions to the left or right
            start = max(0, pos-window_size)
            end = min(len(tokens), pos+window_size+1)
            for pos2 in range(start, end):
                if pos2 == pos:
                    continue
                j = vocabulary.setdefault(tokens[pos2], len(vocabulary))
                data.append(1.)
                row.append(i)
                col.append(j)
    # duplicate (row, col) pairs are summed when the matrix is converted
    size = len(vocabulary)
    cooccurrence_matrix_sparse = scipy.sparse.coo_matrix((data, (row, col)), shape=(size, size))
    return vocabulary, cooccurrence_matrix_sparse
Usage:
sentences = ['I love nlp', 'I love to learn',
             'nlp is future', 'nlp is cool']
vocabs, co_occ = create_cooccurrence_matrix(sentences)
df_co_occ = pd.DataFrame(co_occ.todense(),
                         index=list(vocabs.keys()),
                         columns=list(vocabs.keys()))
df_co_occ = df_co_occ.sort_index()[sorted(vocabs.keys())]
# Styler.applymap was renamed to Styler.map in pandas 2.1
df_co_occ.style.applymap(lambda x: 'color: red' if x>0 else '')
# If not in jupyter notebook, print(df_co_occ)
Output: (the rendered table was not preserved)
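To make the role of window_size concrete: for each token, the function counts the tokens that lie at most window_size positions to its left or right, so 2 is simply a default and larger values link words that are further apart. The sketch below illustrates this without any dependencies. It assumes plain whitespace tokenisation instead of nltk, and `window_pairs` is a hypothetical helper, not the answer's function.

```python
from collections import Counter


def window_pairs(sentence, window_size=2):
    """Count ordered (token, neighbour) pairs within window_size positions."""
    tokens = sentence.split()  # assumption: whitespace tokenisation
    pairs = Counter()
    for pos, token in enumerate(tokens):
        start = max(0, pos - window_size)
        end = min(len(tokens), pos + window_size + 1)
        for pos2 in range(start, end):
            if pos2 != pos:  # a token is not its own neighbour
                pairs[(token, tokens[pos2])] += 1
    return pairs


narrow = window_pairs('I love to learn', window_size=1)
wide = window_pairs('I love to learn', window_size=2)

print(('I', 'to') in narrow)     # "to" is two positions away from "I"
print(wide[('I', 'to')])
print(('I', 'learn') in wide)    # three positions apart
```

With window_size=1 only adjacent words are paired; with window_size=2 "I" and "to" are paired as well, while "I" and "learn" still are not.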
【Comments】:
Why is window_size=2 here? Could you explain?