How to convert a list of strings into a tensor in PyTorch?

Posted 2023-03-12


【Title】: How to convert a list of strings into a tensor in PyTorch? 【Posted】: 2017-11-20 22:05:50 【Question】:

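I am working on a classification problem in which I have a list of strings as class labels, and I want to convert them into a tensor. So far I have tried converting the list of strings into a NumPy array with the np.array function provided by the numpy module.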

truth = torch.from_numpy(np.array(truths))

But I get the following error.

RuntimeError: can't convert a given np.ndarray to a tensor - it has an invalid type. The only supported types are: double, float, int64, int32, and uint8.

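Can anyone suggest an alternative approach? Thanks

【Comments】:

Taking the ASCII/Unicode value of each character could be a workaround (ASCII fits in uint8).

OK, thanks, I will try that.

How about simply converting the string labels to numbers or one-hot vectors?

I agree with @BiBi - it sounds like you want a one-hot encoding.

Yes, I am using one-hot encoding. Thanks @BiBi

【Answer 1】: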

# works only when every string is numeric, e.g. '1.5'
truth = [float(x) for x in truths]
truth = np.asarray(truth)
truth = torch.from_numpy(truth)

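【Comments】:

【Answer 2】:

Unfortunately, you can't at the moment. I also don't think it would be a good idea, since it would make PyTorch clumsy. A popular workaround is to convert the labels to a numeric type using sklearn.

Here is a short example: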

from sklearn import preprocessing
import torch

labels = ['cat', 'dog', 'mouse', 'elephant', 'pandas']
le = preprocessing.LabelEncoder()
targets = le.fit_transform(labels)
# targets: array([0, 1, 3, 2, 4])  (classes are sorted alphabetically)

targets = torch.as_tensor(targets)
# targets: tensor([0, 1, 3, 2, 4])

Since you will probably need to convert between the true labels and the transformed labels, it is a good idea to keep the variable le.
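As a rough sketch of why keeping le is useful, the snippet below maps class indices back to the original strings with inverse_transform. The predicted tensor is made-up data standing in for a model's output, not something from the original answer.

```python
import torch
from sklearn import preprocessing

labels = ['cat', 'dog', 'mouse', 'elephant', 'pandas']
le = preprocessing.LabelEncoder()
targets = torch.as_tensor(le.fit_transform(labels))

# Made-up class indices, standing in for model predictions.
predicted = torch.tensor([4, 0, 2])

# inverse_transform expects array-like input, so convert the tensor to NumPy first.
names = le.inverse_transform(predicted.cpu().numpy())
print(list(names))
```

The encoder stores its classes in sorted order in le.classes_, so the same index always maps back to the same string as long as the fitted le object is reused.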

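【Comments】:

【Answer 3】:

If you don't want to use sklearn, another solution is to keep your original list and create an additional list of indices, which you can later use to look up the original values. I needed this in particular when I had to keep track of my original strings while batching the tokenized strings.

Example below: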

import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

labels = ['cat', 'dog', 'mouse']
# endpoint=False, so this yields one float index per label
sentence_idx = np.linspace(0, len(labels), len(labels), False)
# array([0., 1., 2.])
torch_idx = torch.tensor(sentence_idx)
# do whatever you like with it in torch, e.g. pass it to a DataLoader
dataset = TensorDataset(torch_idx)
loader = DataLoader(dataset, batch_size=1, shuffle=True)
for batch in loader:
    print(batch[0])
    print(labels[int(batch[0].item())])

# example output (the order varies between runs because shuffle=True):
# tensor([0.], dtype=torch.float64)
# cat
# tensor([1.], dtype=torch.float64)
# dog
# tensor([2.], dtype=torch.float64)
# mouse

For my specific use case, the code looks like this:

input_ids, attention_masks, labels = tokenize_sentences(tokenizer, sentences, labels, max_length)

# create an index tensor to keep track of each sentence's original position
sentence_idx = np.linspace(0,len(sentences), len(sentences),False )
torch_idx = torch.tensor(sentence_idx)
dataset = TensorDataset(input_ids, attention_masks, labels, torch_idx)
loader = DataLoader(dataset, batch_size=1, shuffle=True)

for batch in loader:
    _, logit = model(batch[0], 
                     token_type_ids=None,
                     attention_mask=batch[1],
                     labels=batch[2])

    pred_flat = np.argmax(logit.detach(), axis=1).flatten()
    print(pred_flat)
    print(batch[2])
    if pred_flat == batch[2]:
        print("\nThe following sentence was predicted correctly:")
        print(sentences[int(batch[3].item())])
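For class labels specifically, a sklearn-free variant of the same idea is to build the string-to-index mapping by hand. This is a minimal sketch, not part of the original answer; the names classes and class_to_idx are chosen here for illustration.

```python
import torch

labels = ['cat', 'dog', 'mouse', 'dog', 'cat']

# Hypothetical helper names; sorted() keeps the mapping deterministic.
classes = sorted(set(labels))
class_to_idx = {name: i for i, name in enumerate(classes)}

# Integer (int64) class indices, the dtype loss functions such as
# torch.nn.CrossEntropyLoss expect for targets.
targets = torch.tensor([class_to_idx[name] for name in labels], dtype=torch.long)

# Map indices back to the original strings.
decoded = [classes[i] for i in targets.tolist()]
print(targets, decoded)
```

Using integer indices rather than the float64 values produced by np.linspace avoids the int(...) cast when indexing back into the Python list.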

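【Comments】:

【Answer 4】:

The trick is to first find the maximum length of the words in the list, and then, in a second loop, fill a zero-padded tensor. Note that UTF-8 is a variable-length encoding: the Hebrew letters in the example below take two bytes each, while ASCII characters take one byte.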

In[]
import torch

words = ['שלום', 'beautiful', 'world']
max_l = 0
ts_list = []
for w in words:
    ts_list.append(torch.ByteTensor(list(bytes(w, 'utf8'))))
    max_l = max(ts_list[-1].size()[0], max_l)

w_t = torch.zeros((len(ts_list), max_l), dtype=torch.uint8)
for i, ts in enumerate(ts_list):
    w_t[i, 0:ts.size()[0]] = ts
w_t

Out[]
tensor([[215, 169, 215, 156, 215, 149, 215, 157,   0],
        [ 98, 101,  97, 117, 116, 105, 102, 117, 108],
        [119, 111, 114, 108, 100,   0,   0,   0,   0]], dtype=torch.uint8)
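
To get the strings back, the padding has to be stripped before decoding. The following is a minimal sketch of that round trip, assuming the original strings contain no NUL (0) bytes, since those would be indistinguishable from the padding; the variable names are illustrative.

```python
import torch

words = ['שלום', 'beautiful', 'world']
encoded = [w.encode('utf-8') for w in words]
max_len = max(len(b) for b in encoded)

# Same zero-padded uint8 layout as in the answer above.
w_t = torch.zeros((len(encoded), max_len), dtype=torch.uint8)
for i, b in enumerate(encoded):
    w_t[i, :len(b)] = torch.tensor(list(b), dtype=torch.uint8)

# Strip the trailing zero padding from each row, then decode the bytes.
decoded = [bytes(row.tolist()).rstrip(b'\x00').decode('utf-8') for row in w_t]
print(decoded)
```

If NUL bytes can occur in the data, storing each word's byte length alongside the tensor is a safer way to undo the padding.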

【Comments】:
