NLP - Found input variables with inconsistent numbers of samples
Posted: 2021-02-25 10:30:14
Question: I am trying to train a model to recognize greetings in a sample dataset collected from Tripadvisor, and I get the error below when I try to train the model.
Here is the link to the dataset - https://nextit-public.s3-us-west-2.amazonaws.com/rsics.html?fbclid=IwAR0CktLQtuPBaZNk03odCKdrjN3LjYl_ouuFBbWvyj-yQ-BvzJ0v_n9w9xo
Here is my code:
import streamlit as st
import numpy as np
import pandas as pd
# NLP Pkgs
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import os
# Main Stuff
st.title("Greetings NLP - Presence")
st.subheader("Created using Streamlit - Harshil Parikh ")
# Loading the data into streamlit
@st.cache
def load_data(nrows):
    #data = pd.read_csv('/Users/harshilparikh/Desktop/INT/data/selections.csv', nrows=nrows)
    dataset = st.cache(pd.read_csv)('/Users/harshilparikh/Desktop/INT/data/selections.csv')
    return dataset
data_load_state = st.text('Loading data...')
dataset = load_data(1000)
data_load_state.text('Data loaded.')
#Displaying all data first
if st.checkbox('Show Raw data'):
    st.subheader('Raw Data')
    st.write(dataset)
# GREETING TAB
st.subheader('Greetings')
greet = st.sidebar.multiselect("Select Greeting", dataset['Greeting'].unique())
select = dataset[(dataset['Greeting'].isin(greet))]
# SEPARATING ONLY TWO COLUMNS FROM THE DATA
greet_select = select[['Greeting','Selected']]
select_check= st.checkbox("Display records with greeting")
if select_check:
    st.write(greet_select)
#Text- Preprocessing - Range from 0 to 6758 total feedback
nltk.download('stopwords')
corpus = []
for i in range(0, 6758):
    review = re.sub('[^a-zA-Z]', '', str(dataset['Selected'][i]))
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ''.join(review)
corpus.append(review)
#BAG OF WORDS
cv = CountVectorizer(max_features = 6758)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
st.write(X)
st.write(y)
st.write(cv)
#Training sets (800 values)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
#X_train[0, 0:10] #First 10 rows of the first column of X_train.
# NLP - Naive Bayes algorithm
classifier = GaussianNB()
classifier.fit(X_train, y_train)
I am trying to learn basic NLP. Any help would be greatly appreciated.
The error I am getting:
ValueError: Found input variables with inconsistent numbers of samples: [1, 6759]
Traceback:
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/streamlit/script_runner.py", line 332, in _run_script
    exec(code, module.__dict__)
File "/Users/harshilparikh/Desktop/INT/first.py", line 90, in <module>
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/model_selection/_split.py", line 2127, in train_test_split
    arrays = indexable(*arrays)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/utils/validation.py", line 292, in indexable
    check_consistent_length(*result)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/utils/validation.py", line 255, in check_consistent_length
    raise ValueError("Found input variables with inconsistent numbers of"
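The two numbers in the message are the sample counts of the arrays passed to train_test_split: X has 1 row and y has 6759. A minimal sketch of that kind of check in plain Python follows. The helper name describe_lengths is hypothetical and is not part of scikit-learn; it only loosely mirrors the length comparison and could be called just before the split to see which input is short.

```python
def describe_lengths(**arrays):
    """Return {name: number of samples}; raise if the counts disagree.

    Hypothetical debugging helper, loosely mirroring the length check
    that train_test_split applies to its inputs.
    """
    lengths = {name: len(values) for name, values in arrays.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Inconsistent numbers of samples: {lengths}")
    return lengths


# One feature row but three labels: a scaled-down version of [1, 6759].
X_demo = [[0, 1, 0]]
y_demo = [1, 0, 1]

try:
    describe_lengths(X=X_demo, y=y_demo)
except ValueError as exc:
    message = str(exc)
    print(message)
```

In the script from the question, the equivalent call would be describe_lengths(X=X, y=y) placed right before train_test_split.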
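Comments:
I urgently need help!
Where does the error occur? Can you add the error to your question?
Is your code indented correctly? As written, you only add one review to the corpus?
The indentation is correct, it does not raise any indentation errors. I want to add the 6758 review records, with whether they contain a greeting (1 or 0), and then fit a model on them. I hope that clarifies my problem @chefhose. Thanks again for the reply.
Answer 1:
Your error is raised when train_test_split is called: X and y need to have the same length, and here they do not. I suspect the problem is in your for loop. You only append the last review, after the loop has finished, instead of appending every review to your corpus. Try this: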
for i in range(0, 6758):
    review = re.sub('[^a-zA-Z]', '', str(dataset['Selected'][i]))
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ''.join(review)
    corpus.append(review)
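To make the effect of the indentation visible, here is a small self-contained sketch with three made-up reviews (the strings and variable names are placeholders, not data from the question). Note one assumption: the sketch replaces non-letters with a space, whereas the snippet above replaces them with an empty string.

```python
import re

# Placeholder reviews standing in for dataset['Selected'].
reviews = ["Hello there!", "Good morning, team", "Thanks a lot"]

# append() outside the loop: runs once, after the loop, so only the
# last cleaned review ends up in the corpus.
corpus_outside = []
for text in reviews:
    cleaned = re.sub("[^a-zA-Z]", " ", text).lower()
corpus_outside.append(cleaned)

# append() inside the loop: runs once per review.
corpus_inside = []
for text in reviews:
    cleaned = re.sub("[^a-zA-Z]", " ", text).lower()
    corpus_inside.append(cleaned)

print(len(corpus_outside), len(corpus_inside))  # prints: 1 3
```

A corpus with a single entry is what yields an X with one row, which matches the 1 in [1, 6759].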
Comments:
This gives me a different error - ValueError: Found input variables with inconsistent numbers of samples: [6758, 6759] File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/streamlit/script_runner.py", line 332, in _run_script exec(code, module.__dict__) File "/Users/harshilparikh/Desktop/INT/first.py", line 90, in
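The new pair [6758, 6759] suggests the loop now runs, but range(0, 6758) visits only 6758 rows while y = dataset.iloc[:, 1].values takes every row, apparently 6759. One possible way around this is to loop over the column itself instead of a hardcoded count. The sketch below uses short placeholder lists rather than the real CSV, and its names are illustrative only.

```python
import re

# Placeholder columns; in the question these come from the dataset.
selected = ["Hi, how are you?", "Need help with booking",
            "Good evening!", "Cancel my trip"]
labels = [1, 0, 1, 0]

# A hardcoded bound that is one short, playing the role of 6758.
hardcoded_rows = 3
corpus_short = [
    re.sub("[^a-zA-Z]", " ", str(selected[i])).lower()
    for i in range(hardcoded_rows)
]

# Iterating over the column keeps the corpus and the labels in step,
# whatever the number of rows is.
corpus_full = [
    re.sub("[^a-zA-Z]", " ", str(text)).lower()
    for text in selected
]

print(len(corpus_short), len(labels))  # prints: 3 4
print(len(corpus_full), len(labels))   # prints: 4 4
```

In the original script the same idea would be range(len(dataset)) in place of range(0, 6758), assuming the labels are still taken from every row of the dataset.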