Vectorise a string

Posted 2023-03-12


【Title】: Vectorise a string 【Posted】: 2018-07-27 14:53:57 【Question】:

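I'm a Python newbie trying to vectorise a string, without any luck. So far I extract the text of an article from a URL, and now I'm trying to classify that article, but it isn't working.

(I keep getting the error: raise AttributeError(attr + " not found") AttributeError: lower not found)

Nothing I've found seems to help either.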

    url = input("Paste the webiste containing the article you want to analise here: ")
    print "Analysing Webpage"
    #Gets the URL from the extension
    #Goose loaded
    g = Goose()
    #Extract the text and feed it to the classifier
    article = g.extract(url=url)
    article = article.cleaned_text
    article = clean(article)
    article = str(article)
    print "Vectorising Text"
    article = article.split()
    vect = CountVectorizer(min_df=0., max_df=1.0)
    X = vect.fit_transform(article)
    X.toarray()
    X = vect.transform(X).toarray()
    print X
    print "Predicting Political Bias"
    loaded_model = pickle.load(open("text_clf_svm.pkl", 'rb'))
    predicted_svm = loaded_model.predict(X)
    print predicted_svm

Any kind of help or pointers would be very welcome and much appreciated =)


【Answer 1】:

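It looks like you are calling fit_transform on the new text. That produces a different X matrix from the one you (or someone else) trained the classifier with. The two need to be "aligned", i.e. built from the same vocabulary with the same columns.

In your case, your X matrix contains the word "lower", but the model was trained on a matrix that does not have this word.

So either get hold of the CountVectorizer that was fitted when the model was trained and only call transform on the new text, or use fit_transform, but then train the model on the complete corpus and use that model in production later on.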
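A minimal sketch of the misalignment described here, assuming scikit-learn's CountVectorizer with default settings. The two training sentences and the new sentence are made-up stand-ins, not the asker's data. Refitting on the new text builds a fresh vocabulary, while reusing the fitted vectorizer keeps the training columns and silently drops unseen words such as "lower".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the training corpus and a new article.
train_docs = ["taxes and public spending", "markets and private enterprise"]
new_doc = ["lower taxes for private markets"]

# Vocabulary learned at training time.
train_vect = CountVectorizer()
X_train = train_vect.fit_transform(train_docs)

# Refitting on the new text learns a different vocabulary,
# so its columns no longer match what the classifier saw.
refit_vect = CountVectorizer()
X_refit = refit_vect.fit_transform(new_doc)

# Reusing the fitted vectorizer keeps the training columns;
# words it never saw are ignored.
X_aligned = train_vect.transform(new_doc)

print("train:", X_train.shape, "refit:", X_refit.shape, "aligned:", X_aligned.shape)
print("'lower' known at training time:", "lower" in train_vect.vocabulary_)
print("'lower' in refit vocabulary:", "lower" in refit_vect.vocabulary_)
```

Only the matrix produced by `train_vect.transform` has the same number of columns as `X_train`, which is what a classifier trained on `X_train` expects.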
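One way to follow this advice, sketched under assumptions: bundle the vectorizer and the classifier in a scikit-learn Pipeline, so the fitted vocabulary is pickled together with the model. The tiny corpus, the labels, and the choice of LinearSVC are illustrative only; how text_clf_svm.pkl was actually built is not known from the question.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data; the real corpus and labels are not shown in the question.
train_docs = [
    "raise public spending and expand welfare",
    "expand public healthcare and raise wages",
    "cut taxes and shrink government",
    "shrink regulation and cut public spending",
]
train_labels = ["left", "left", "right", "right"]

# The vectorizer is fitted once, here, together with the classifier.
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LinearSVC()),
])
text_clf.fit(train_docs, train_labels)

# Pickling the whole pipeline stores the fitted vocabulary with the model.
blob = pickle.dumps(text_clf)
loaded_model = pickle.loads(blob)

# At prediction time, pass the raw article as one document in a list.
# The pipeline only calls transform on it, never fit_transform.
article = "the party wants to cut taxes and lower public spending"
predicted = loaded_model.predict([article])
print(predicted)
```

Note that the article goes in as a single string inside a list. Splitting it into words first, as `article.split()` does, would make the vectorizer treat every word as a separate document.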

Hope this helps.

Regards, Nicolas
