Data set appears to have dim 3, and estimator expected is <= 2


Posted: 2019-10-26 08:41:11

Question:

I am testing the following code sample.

# Load the required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Plot styling
import seaborn as sns; sns.set()  # for plot styling
# %matplotlib inline
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
#Read the csv file
dataset=pd.read_csv('C:\\my_path\\CLV.csv')
#Explore the dataset
dataset.head()  # first 5 rows
len(dataset)  # number of rows
#descriptive statistics of the dataset
dataset.describe().transpose()


#Visualizing the data - displot
plot_income = sns.distplot(dataset["INCOME"])
plot_spend = sns.distplot(dataset["SPEND"])
plt.xlabel('Income / spend')


#Violin plot of Income and Spend
f, axes = plt.subplots(1,2, figsize=(12,6), sharex=True, sharey=True)
v1 = sns.violinplot(data=dataset, x='INCOME', color="skyblue",ax=axes[0])
v2 = sns.violinplot(data=dataset, x='SPEND',color="lightgreen", ax=axes[1])
v1.set(xlim=(0,420))



#Using the elbow method to find the optimum number of clusters
from sklearn.cluster import KMeans
wcss = []
for i in range(1,11):
    km=KMeans(n_clusters=i,init='k-means++', max_iter=300, n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)
plt.plot(range(1,11),wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('wcss')
plt.show()


##Fitting kmeans to the dataset with k=4
km4=KMeans(n_clusters=4,init='k-means++', max_iter=300, n_init=10, random_state=0)
y_means = km4.fit_predict(X)
#Visualizing the clusters for k=4
plt.scatter(X[y_means==0,0],X[y_means==0,1],s=50, c='purple',label='Cluster1')
plt.scatter(X[y_means==1,0],X[y_means==1,1],s=50, c='blue',label='Cluster2')
plt.scatter(X[y_means==2,0],X[y_means==2,1],s=50, c='green',label='Cluster3')
plt.scatter(X[y_means==3,0],X[y_means==3,1],s=50, c='cyan',label='Cluster4')
plt.scatter(km4.cluster_centers_[:,0], km4.cluster_centers_[:,1],s=200,marker='s', c='red', alpha=0.7, label='Centroids')
plt.title('Customer segments')
plt.xlabel('Annual income of customer')
plt.ylabel('Annual spend from customer on site')
plt.legend()
plt.show()


########
The plot shows the distribution of the 4 clusters. We could interpret them as the following customer segments:
1.  Cluster 1: Customers with medium annual income and low annual spend
2.  Cluster 2: Customers with high annual income and medium to high annual spend
3.  Cluster 3: Customers with low annual income
4.  Cluster 4: Customers with medium annual income but high annual spend
Cluster 4 straight away is one potential customer segment. However, Cluster 2 and 3 can be segmented further to arrive at a more specific target customer group. Let us now look at how the clusters are created when k=6:
########


##Fitting kmeans to the dataset - k=6
km4=KMeans(n_clusters=6,init='k-means++', max_iter=300, n_init=10, random_state=0)
y_means = km4.fit_predict(X)
#Visualizing the clusters
plt.scatter(X[y_means==0,0],X[y_means==0,1],s=50, c='purple',label='Cluster1')
plt.scatter(X[y_means==1,0],X[y_means==1,1],s=50, c='blue',label='Cluster2')
plt.scatter(X[y_means==2,0],X[y_means==2,1],s=50, c='green',label='Cluster3')
plt.scatter(X[y_means==3,0],X[y_means==3,1],s=50, c='cyan',label='Cluster4')
plt.scatter(X[y_means==4,0],X[y_means==4,1],s=50, c='magenta',label='Cluster5')
plt.scatter(X[y_means==5,0],X[y_means==5,1],s=50, c='orange',label='Cluster6')
plt.scatter(km4.cluster_centers_[:,0], km4.cluster_centers_[:,1],s=200,marker='s', c='red', alpha=0.7, label='Centroids')
plt.title('Customer segments')
plt.xlabel('Annual income of customer')
plt.ylabel('Annual spend from customer on site')
plt.legend()
plt.show()

When I reach these lines:

from sklearn.cluster import KMeans
wcss = []
for i in range(1,11):
    km=KMeans(n_clusters=i,init='k-means++', max_iter=300, n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)

I get the following error:

ValueError: Found array with dim 3. Estimator expected <= 2.

The data can be downloaded here:

https://github.com/sowmyacr/kmeans_cluster/blob/master/CLV.csv

The dataset should have 2 dimensions, but for some reason Python seems to think it has 3. Can someone explain what is happening here? Also, how can I fix this? Thanks.

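Comments:

scikit-learn expects 2D numeric arrays as the training dataset for a fit function. The dataset you are passing in is a 3D array; you need to reshape it into a 2D array.

That is all I can point out, because the definition of the X used in km.fit(X) is missing. Your code also shows km.fit(X,y), which must be a typo.

Yes, that was a mistake; I just fixed it. How can it be a 3D array? I have rows and columns; that is two-dimensional. I still don't understand what is wrong here. Even when I run dataset.shape, I get: (303, 2)

Why include all those plotting commands when the error does not occur there? Keep the question and the code focused. Show the traceback so we can see clearly where the error occurs. If we think the problem is in the fit(X) call, then we need to see X. I don't see where it is created.

The above code runs fine in Python 3, without the error described.

Answer 1: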

I got it working:

x1 = np.array(dataset["INCOME"])
x2 = np.array(dataset["SPEND"])
X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)
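
As a side note, the same two-column array can probably be built without the zip-and-reshape step. The sketch below is only an illustration: the values are made up and stand in for the INCOME and SPEND columns, and the names income, spend and X_zip are not from the original post. It compares the approach above with np.column_stack.

```python
import numpy as np

# Made-up values standing in for the INCOME and SPEND columns of CLV.csv.
income = np.array([233, 250, 204, 236, 354])
spend = np.array([150, 187, 172, 178, 163])

# The approach used above: zip the columns into pairs, then reshape.
X_zip = np.array(list(zip(income, spend))).reshape(len(income), 2)

# A shorter equivalent: stack the 1-D columns side by side.
X = np.column_stack((income, spend))

print(X.shape)                   # (n_samples, n_features)
print(np.array_equal(X, X_zip))  # both approaches give the same array
```

With the DataFrame from the question, dataset[["INCOME", "SPEND"]].to_numpy() should give an array of the same (n_samples, 2) shape, assuming a pandas version that provides to_numpy.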

So, the full script looks like this.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
#Plot styling
import seaborn as sns; sns.set()  # for plot styling
# %matplotlib inline
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
#Read the csv file
dataset=pd.read_csv('C:\\path_here\\customer_segmentation.csv')
#Explore the dataset
dataset.head()  # first 5 rows
len(dataset)  # number of rows
#descriptive statistics of the dataset
dataset.describe().transpose()


#Visualizing the data - displot
plot_income = sns.distplot(dataset["INCOME"])
plot_spend = sns.distplot(dataset["SPEND"])
plt.xlabel('Income / spend')


#Violin plot of Income and Spend
f, axes = plt.subplots(1,2, figsize=(12,6), sharex=True, sharey=True)
x1 = sns.violinplot(data=dataset, x='INCOME', color="skyblue",ax=axes[0])
x2 = sns.violinplot(data=dataset, x='SPEND',color="lightgreen", ax=axes[1])
x1.set(xlim=(0,420))


# https://blog.cambridgespark.com/how-to-determine-the-optimal-number-of-clusters-for-k-means-clustering-14f27070048f
# https://pythonprogramminglanguage.com/kmeans-elbow-method/
x1 = np.array(dataset["INCOME"])
x2 = np.array(dataset["SPEND"])
X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)



#Using the elbow method to find the optimum number of clusters
from sklearn.cluster import KMeans
wcss = []
for i in range(1,11):
    km=KMeans(n_clusters=i,init='k-means++', max_iter=300, n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)
plt.plot(range(1,11),wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('wcss')
plt.show()
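
For anyone hitting the same ValueError, a quick check before calling fit is to print X.ndim and X.shape. The question never shows how X was created, so the sketch below assumes, purely for illustration, that X picked up an extra leading axis; the data values and the names data and X_bad are hypothetical.

```python
import numpy as np

# Made-up (n_samples, n_features) data standing in for the two CLV.csv columns.
data = np.array([[233, 150], [250, 187], [204, 172], [236, 178]])

# Hypothetical mistake: wrapping the table in an extra list adds a leading axis.
X_bad = np.array([data])
print(X_bad.ndim, X_bad.shape)  # 3 (1, 4, 2)

# Collapse everything except the last (feature) axis to get a 2-D array.
X = X_bad.reshape(-1, X_bad.shape[-1])
print(X.ndim, X.shape)  # 2 (4, 2)
```

If X.ndim prints 3 at the point where km.fit(X) is called, that would be consistent with the error message, and reshaping to (n_samples, n_features) as above is one way to resolve it.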
