SpaCy's most_similar() function returns error on GPU
【Posted】: 2020-10-21 13:59:03
【Question】: I am trying to evaluate the performance of spaCy's most_similar method (https://spacy.io/api/vectors#most_similar). I am curious whether it runs faster on a GPU. The function is as follows:
def spacy_most_similar(word, topn=10):
    ms = nlp_ru.vocab.vectors.most_similar(nlp_ru(word).vector.reshape(1,100), n=topn)
    words = [nlp_ru.vocab.strings[w] for w in ms[0][0]]
    distances = ms[2]
    return words, distances

spacy_most_similar("дерево", 10)
This works on the CPU version, but on the GPU (using CuPy arrays instead of NumPy) I get this error:
TypeError Traceback (most recent call last)
<ipython-input-8-ea5e049ec55b> in <module>()
7 distances = ms[2]
8 return words, distances
----> 9 spacy_most_similar("дерево", 10)
<ipython-input-8-ea5e049ec55b> in spacy_most_similar(word, topn)
3 print(nlp_ru(word).vector.reshape(1,100).shape)
4 ms = nlp_ru.vocab.vectors.most_similar(
----> 5 nlp_ru(word).vector.reshape(1,100), n=topn)
6 words = [nlp_ru.vocab.strings[w] for w in ms[0][0]]
7 distances = ms[2]
vectors.pyx in spacy.vectors.Vectors.most_similar()
TypeError: list indices must be integers or slices, not cupy.core.core.ndarray
I also tried this approach:
def spacy_most_similar(word, topn=10):
    ms = nlp_ru.vocab.vectors.most_similar(np.asarray([nlp_ru.vocab.vectors[nlp_ru.vocab.strings[word]]]), n=topn)
    words = [nlp_ru.vocab.strings[w] for w in ms[0][0]]
    distances = ms[2]
    return words, distances

spacy_most_similar("дерево", 10)
Everything works on the CPU, but for the GPU version (where I changed np to cp):
import cupy as cp

def spacy_most_similar(word, topn=10):
    with cp.cuda.Device(0):
        nlp_ru.vocab.vectors.data = cp.asarray(nlp_ru.vocab.vectors.data)
        ms = nlp_ru.vocab.vectors.most_similar(cp.asarray([nlp_ru.vocab.vectors[nlp_ru.vocab.strings[word]]]), n=topn)
        words = [nlp_ru.vocab.strings[w] for w in ms[0][0]]
        distances = ms[2]
        return words, distances

spacy_most_similar("дерево", 10)
I get this error:
TypeError Traceback (most recent call last)
<ipython-input-6-876656d5f75d> in <module>()
7 distances = ms[2]
8 return words, distances
----> 9 spacy_most_similar("дерево", 10)
<ipython-input-6-876656d5f75d> in spacy_most_similar(word, topn)
3 with cp.cuda.Device(0):
4 nlp_ru.vocab.vectors.data = cp.asarray(nlp_ru.vocab.vectors.data)
----> 5 ms = nlp_ru.vocab.vectors.most_similar(cp.asarray([nlp_ru.vocab.vectors[nlp_ru.vocab.strings[word]]]), n=topn)
6 words = [nlp_ru.vocab.strings[w] for w in ms[0][0]]
7 distances = ms[2]
vectors.pyx in spacy.vectors.Vectors.most_similar()
TypeError: unhashable type: 'cupy.core.core.ndarray'
Could you help me construct the correct CuPy input for the most_similar() method?
【Answer 1】: Given the existing source code, I doubt that you can run most_similar on the GPU:
def most_similar(self, queries, *, batch_size=1024, n=1, sort=True):
    """For each of the given vectors, find the n most similar entries
    to it, by cosine.

    Queries are by vector. Results are returned as a `(keys, best_rows,
    scores)` tuple. If `queries` is large, the calculations are performed in
    chunks, to avoid consuming too much memory. You can set the `batch_size`
    to control the size/space trade-off during the calculations.

    queries (ndarray): An array with one or more vectors.
    batch_size (int): The batch size to use.
    n (int): The number of entries to return for each query.
    sort (bool): Whether to sort the n entries returned by score.
    RETURNS (tuple): The most similar entries as a `(keys, best_rows, scores)`
        tuple.
    """
    filled = sorted(list({row for row in self.key2row.values()}))
    if len(filled) < n:
        raise ValueError(Errors.E198.format(n=n, n_rows=len(filled)))
    xp = get_array_module(self.data)

    norms = xp.linalg.norm(self.data[filled], axis=1, keepdims=True)
    norms[norms == 0] = 1
    vectors = self.data[filled] / norms

    best_rows = xp.zeros((queries.shape[0], n), dtype='i')
    scores = xp.zeros((queries.shape[0], n), dtype='f')
    # Work in batches, to avoid memory problems.
    for i in range(0, queries.shape[0], batch_size):
        batch = queries[i : i+batch_size]
        batch_norms = xp.linalg.norm(batch, axis=1, keepdims=True)
        batch_norms[batch_norms == 0] = 1
        batch /= batch_norms
        # batch   e.g. (1024, 300)
        # vectors e.g. (10000, 300)
        # sims    e.g. (1024, 10000)
        sims = xp.dot(batch, vectors.T)
        best_rows[i:i+batch_size] = xp.argpartition(sims, -n, axis=1)[:,-n:]
        scores[i:i+batch_size] = xp.partition(sims, -n, axis=1)[:,-n:]

        if sort and n >= 2:
            sorted_index = xp.arange(scores.shape[0])[:,None][i:i+batch_size],xp.argsort(scores[i:i+batch_size], axis=1)[:,::-1]
            scores[i:i+batch_size] = scores[sorted_index]
            best_rows[i:i+batch_size] = best_rows[sorted_index]

    for i, j in numpy.ndindex(best_rows.shape):
        best_rows[i, j] = filled[best_rows[i, j]]
    # Round values really close to 1 or -1
    scores = xp.around(scores, decimals=4, out=scores)
    # Account for numerical error we want to return in range -1, 1
    scores = xp.clip(scores, a_min=-1, a_max=1, out=scores)
    row2key = {row: key for key, row in self.key2row.items()}
    keys = xp.asarray(
        [[row2key[row] for row in best_rows[i] if row in row2key]
         for i in range(len(queries))], dtype="uint64")
    return (keys, best_rows, scores)
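Note that filled is already a CPU object (a plain Python list). It can be indexed correctly by indices taken from a NumPy array, but not by indices taken from a CuPy array. The error TypeError: list indices must be integers or slices, not cupy.core.core.ndarray comes from the following two lines: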
for i, j in numpy.ndindex(best_rows.shape):
    best_rows[i, j] = filled[best_rows[i, j]]
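One possible way around that loop, shown only as a sketch and not as spaCy's own fix, is to turn filled into an array of the same array module and use integer-array indexing instead of indexing a Python list element by element. The helper name map_rows and the xp parameter are hypothetical. The example runs with NumPy; passing cupy as xp is assumed to behave the same way, since cupy.asarray and integer-array indexing follow the NumPy interface.

```python
import numpy as np


def map_rows(filled, best_rows, xp=np):
    """Map positions within `filled` back to the stored row numbers.

    `map_rows` and `xp` are hypothetical names for this sketch.
    `xp` is the array module: numpy here, cupy assumed to work alike.
    """
    # Move the Python list into an array that lives where best_rows lives.
    filled_arr = xp.asarray(filled, dtype=best_rows.dtype)
    # Integer-array indexing replaces the per-element Python loop.
    return filled_arr[best_rows]


# Toy data: four filled rows, two queries with two candidates each.
filled = [0, 2, 5, 7]
best_rows = np.array([[3, 1], [0, 2]], dtype="i")
mapped = map_rows(filled, best_rows)
```

Here mapped has the same shape as best_rows, and each entry is the value of filled at that position.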
If you think finding the most similar words on the GPU is valuable, you can open an issue at https://github.com/explosion/spaCy/issues or write your own most_similar (which I believe is simple).
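A minimal sketch of what such a hand-written most_similar could look like, assuming the vector table is available as a 2-D array. The function name most_similar_xp and its xp parameter are hypothetical and not part of spaCy. It is demonstrated with NumPy so it runs anywhere; passing cupy as xp is assumed to work because the calls used (asarray, linalg.norm, dot, argsort, take_along_axis) mirror the NumPy API. It uses a full argsort for brevity, where the spaCy source uses argpartition.

```python
import numpy as np


def most_similar_xp(data, queries, n=1, xp=np):
    """Return (rows, scores) of the n rows of `data` closest to each query
    by cosine similarity. Hypothetical helper, not spaCy's API."""
    data = xp.asarray(data, dtype="f")
    queries = xp.asarray(queries, dtype="f")
    # Normalise rows; zero vectors get norm 1 to avoid division by zero.
    d_norms = xp.linalg.norm(data, axis=1, keepdims=True)
    d_norms[d_norms == 0] = 1
    q_norms = xp.linalg.norm(queries, axis=1, keepdims=True)
    q_norms[q_norms == 0] = 1
    # sims has shape (number of queries, number of data rows).
    sims = xp.dot(queries / q_norms, (data / d_norms).T)
    # Sort each row by descending similarity and keep the first n columns.
    rows = xp.argsort(-sims, axis=1)[:, :n]
    scores = xp.take_along_axis(sims, rows, axis=1)
    return rows, scores


# Toy vector table with four 2-D vectors and a single query.
table = [[1, 0], [0, 1], [1, 1], [-1, 0]]
rows, scores = most_similar_xp(table, [[1, 0]], n=2)
```

The returned rows are row indices into data. Mapping them back to keys would still go through key2row, as in the source above. On the GPU the indices would likely need to be copied to the host first (for example with cupy.asnumpy) before they are used in a Python dict lookup.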
【Comments】:
@Jahjajaka Did this answer your question? Was it helpful? Please consider ***.com/help/someone-answers