ML.NET Cookbook: (18) How do I train a model on text data?

Posted 2021-06-29 dotNET跨平台


In general, every ML.NET learner expects its features to be a vector of floats. So if some of your data is not already a float, you need to convert it to float.
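As a minimal sketch of such a conversion, the program below turns an integer column into a float column with ConvertType. The Row class and the WordCount column are hypothetical names invented for this illustration; they are not part of the original example.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input row, used only for this illustration.
public class Row
{
    public int WordCount { get; set; }
}

public static class FloatConversionSketch
{
    // Loads the integers into an IDataView and converts the column to float (DataKind.Single).
    public static float[] ConvertToFloat(int[] counts)
    {
        var mlContext = new MLContext();
        var data = mlContext.Data.LoadFromEnumerable(
            counts.Select(c => new Row { WordCount = c }));

        var pipeline = mlContext.Transforms.Conversion.ConvertType(
            "WordCountFloat", "WordCount", DataKind.Single);

        var converted = pipeline.Fit(data).Transform(data);
        return converted.GetColumn<float>("WordCountFloat").ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", ConvertToFloat(new[] { 3, 12 })));
    }
}
```

Text columns cannot be converted this way, which is why the rest of this recipe deals with feature extraction instead.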

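If we want to learn from text data, we need to "extract features" from the text. An entire research area, NLP (natural language processing), deals with this problem. ML.NET offers some basic mechanisms for text feature extraction: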

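Text normalization (removing punctuation and diacritics, switching to lowercase, and so on).
Separator-based tokenization.
Stop-word removal.
N-gram and skip-gram extraction.
TF-IDF rescaling.
Bag-of-words conversion.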
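To make these steps concrete, here is a small sketch in plain C# (no ML.NET) of normalization, tokenization, stop-word removal, n-gram extraction and bag-of-words counting. It illustrates the ideas only and is not how ML.NET implements them; the stop-word list and all method names are assumptions made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TextFeatureSketch
{
    // Hypothetical, deliberately tiny stop-word list.
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "the", "a", "is", "are", "you" };

    // Normalization: lowercase the text and drop punctuation characters.
    public static string Normalize(string text) =>
        new string(text.ToLowerInvariant().Where(c => !char.IsPunctuation(c)).ToArray());

    // Separator-based tokenization on spaces.
    public static string[] Tokenize(string text) =>
        text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    // Stop-word removal.
    public static string[] RemoveStopWords(IEnumerable<string> tokens) =>
        tokens.Where(t => !StopWords.Contains(t)).ToArray();

    // N-gram extraction: every run of n consecutive tokens, joined by a space.
    public static IEnumerable<string> Ngrams(IReadOnlyList<string> tokens, int n)
    {
        for (int i = 0; i + n <= tokens.Count; i++)
            yield return string.Join(" ", tokens.Skip(i).Take(n));
    }

    // Bag of words (or of n-grams): how often each term occurs.
    public static Dictionary<string, int> Bag(IEnumerable<string> terms) =>
        terms.GroupBy(t => t).ToDictionary(g => g.Key, g => g.Count());

    public static void Main()
    {
        var text = "Why are you threatening me? I'm not being disruptive.";
        var tokens = RemoveStopWords(Tokenize(Normalize(text)));

        foreach (var pair in Bag(Ngrams(tokens, 2)))
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}
```

The ML.NET transforms produce numeric vectors rather than dictionaries, but the counts stored in those vectors follow the same idea.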

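ML.NET offers a "one-stop shop" operation called TextFeaturizer (FeaturizeText in the code that follows) that runs a combination of the steps above as one big "text featurization". We have tested it extensively on text datasets, and we are confident that it performs reasonably well without any need to dig into the individual operations.

However, we also offer a selection of elementary operations that let you customize your NLP processing. The example that follows shows how we use them.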

Wikipedia detox dataset:

Sentiment   SentimentText
1 Stop trolling, zapatancas, calling me a liar merely demonstartes that you arer Zapatancas. You may choose to chase every legitimate editor from this site and ignore me but I am an editor with a record that isnt 99% trolling and therefore my wishes are not to be completely ignored by a sockpuppet like yourself. The consensus is overwhelmingly against you and your trollin g lover Zapatancas,  
1 ::::: Why are you threatening me? I'm not being disruptive, its you who is being disruptive.   
0 " *::Your POV and propaganda pushing is dully noted. However listing interesting facts in a netral and unacusitory tone is not POV. You seem to be confusing Censorship with POV monitoring. I see nothing POV expressed in the listing of intersting facts. If you want to contribute more facts or edit wording of the cited fact to make them sound more netral then go ahead. No need to CENSOR interesting factual information. "
0 ::::::::This is a gross exaggeration. Nobody is setting a kangaroo court. There was a simple addition concerning the airline. It is the only one disputed here.

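using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms.Text;

// Assumed setup, not shown in the original: an MLContext instance.
// dataPath is likewise assumed to hold the path to the tab-separated dataset file.
var mlContext = new MLContext();

// Create the loader: define the data columns and where to find them in the text file.
var loader = mlContext.Data.CreateTextLoader(new[] 
    {
        new TextLoader.Column("IsToxic", DataKind.Boolean, 0),
        new TextLoader.Column("Message", DataKind.String, 1),
    },
    hasHeader: true
);

// Load the data.
var data = loader.Load(dataPath);

// Inspect the message texts that were read from the file.
var messageTexts = data.GetColumn<string>(data.Schema["Message"]).Take(20).ToArray();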

// Apply the various text operations supported by ML.NET.
var pipeline =
    // One-stop shop: run the full text featurization.
    mlContext.Transforms.Text.FeaturizeText("TextFeatures", "Message")

    // Normalize the message text for the later transforms.
    .Append(mlContext.Transforms.Text.NormalizeText("NormalizedMessage", "Message"))

    // NLP pipeline 1: bag of words.
    .Append(mlContext.Transforms.Text.ProduceWordBags("BagOfWords", "NormalizedMessage"))

    // NLP pipeline 2: bag of bigrams, using hashes instead of dictionary indices.
    .Append(mlContext.Transforms.Text.ProduceHashedWordBags("BagOfBigrams", "NormalizedMessage", 
                ngramLength: 2, useAllLengths: false))

    // NLP pipeline 3: bag of tri-character sequences with TF-IDF weighting.
    .Append(mlContext.Transforms.Text.TokenizeIntoCharactersAsKeys("MessageChars", "Message"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("BagOfTrichar", "MessageChars", 
                ngramLength: 3, weighting: NgramExtractingEstimator.WeightingCriteria.TfIdf))

    // NLP pipeline 4: word embeddings.
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("TokenizedMessage", "NormalizedMessage"))
    .Append(mlContext.Transforms.Text.ApplyWordEmbedding("Embeddings", "TokenizedMessage",
                WordEmbeddingEstimator.PretrainedModelKind.SentimentSpecificWordEmbedding));

// Let's fit the pipeline, and then apply it to the same data.
// Note that even on a small 70KB dataset, the pipeline above can take up to a minute to fit.
var transformedData = pipeline.Fit(data).Transform(data);

// Inspect some columns of the resulting dataset.
// GetColumn takes the column name (or a schema column, as above); it does not take an MLContext argument.
var embeddings = transformedData.GetColumn<float[]>("Embeddings").Take(10).ToArray();
var unigrams = transformedData.GetColumn<float[]>("BagOfWords").Take(10).ToArray();
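
The code above stops at feature extraction. As a hedged sketch of the remaining step, the program below feeds the one-stop text features into a binary classifier. The choice of SdcaLogisticRegression, the MessageRow and ToxicityPrediction classes, and the handful of in-memory rows (shortened from the dataset excerpt shown earlier) are assumptions made for illustration; a real run would load the full dataset file, and a model fitted on four rows is not meaningful.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row types for this sketch.
public class MessageRow
{
    public bool IsToxic { get; set; }
    public string Message { get; set; }
}

public class ToxicityPrediction
{
    [ColumnName("PredictedLabel")]
    public bool PredictedLabel { get; set; }

    public float Probability { get; set; }
}

public static class TrainOnTextSketch
{
    // Featurizes the Message column, trains a classifier on it, and scores one message.
    public static ToxicityPrediction TrainAndPredict(MessageRow[] rows, string message)
    {
        var mlContext = new MLContext(seed: 0);
        var data = mlContext.Data.LoadFromEnumerable(rows);

        // Text featurization followed by an (assumed) choice of trainer.
        var pipeline = mlContext.Transforms.Text.FeaturizeText("TextFeatures", "Message")
            .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
                labelColumnName: "IsToxic", featureColumnName: "TextFeatures"));

        var model = pipeline.Fit(data);

        var engine = mlContext.Model
            .CreatePredictionEngine<MessageRow, ToxicityPrediction>(model);
        return engine.Predict(new MessageRow { Message = message });
    }

    public static void Main()
    {
        // Shortened excerpts of the four sample rows shown earlier.
        var rows = new[]
        {
            new MessageRow { IsToxic = true,  Message = "Stop trolling, zapatancas, calling me a liar" },
            new MessageRow { IsToxic = true,  Message = "Why are you threatening me? I'm not being disruptive, its you who is being disruptive." },
            new MessageRow { IsToxic = false, Message = "No need to CENSOR interesting factual information." },
            new MessageRow { IsToxic = false, Message = "This is a gross exaggeration. Nobody is setting a kangaroo court." },
        };

        var prediction = TrainAndPredict(rows, "Why are you being disruptive?");
        Console.WriteLine($"Toxic: {prediction.PredictedLabel}, probability: {prediction.Probability}");
    }
}
```

Any of the other feature columns produced earlier (BagOfWords, BagOfBigrams, BagOfTrichar, Embeddings) could be passed as featureColumnName instead of TextFeatures, provided the transforms that create them are part of the pipeline.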
