What does the iloc function do in the iris dataset?
Posted: 2021-08-14 07:49:09
Question: Can anyone explain the bolded parts of this code? I have read the pandas and sklearn documentation, but I still find it a bit hard to understand. I want to adapt this for my own data and would like to understand it better.
X = df.iloc[0:100, **[0,1]**].values
plt.scatter(**X[:50, 0], X[:50, 1]**,alpha=0.5, c='b', edgecolors='none', label='setosa %2s'%(y[0]))
plt.scatter(**X[50:100, 0], X[50:100, 1]**,alpha=0.5, c='r', edgecolors='none', label='versicolor %2s'%(y[50]))
The full code is below:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
#from sklearn import cross_validation
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from mlclass2 import simplemetrics, plot_decision_2d_lda
df = pd.read_csv('https://archive.ics.uci.edu/ml/'
'machine-learning-databases/iris/iris.data', header=None)
X = df.iloc[0:100, [0,1]].values
y = df.iloc[0:100, 4].values
y = np.where(y == 'Iris-setosa', 0, 1)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=5)
stdscaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = stdscaler.transform(X)
X_train_scaled = stdscaler.transform(X_train)
X_test_scaled = stdscaler.transform(X_test)
# plot data
plt.scatter(X[:50, 0], X[:50, 1],alpha=0.5, c='b', edgecolors='none', label='setosa %2s'%(y[0]))
plt.scatter(X[50:100, 0], X[50:100, 1],alpha=0.5, c='r', edgecolors='none', label='versicolor %2s'%(y[50]))
plt.xlabel('sepal length [cm]')
plt.ylabel('sepal width [cm]')  # column 1 of iris.data is sepal width, not petal length
plt.legend(loc='lower right')
plt.show()
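Answer 1: .values returns only the values of the DataFrame, with the axis labels removed.
.iloc uses integer-position-based indexing.
The .iloc part of the code says that for the independent variables we only want the first 100 rows of columns 0 and 1, and for the dependent variable we only want the first 100 rows of column 4. If this part is still confusing, I suggest looking into slice notation. Put simply, a row slice passed to .iloc has the form .iloc[start:stop], where stop is exclusive, so 0:100 selects the rows at positions 0 through 99.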
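To make the row-slice and column-list selection concrete, here is a minimal sketch on a tiny frame. The five rows are a hypothetical stand-in for iris.data (built by hand so nothing is downloaded), and the names subset, X and y are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 5-row stand-in for iris.data read with header=None,
# so the column labels are simply the integers 0..4.
df = pd.DataFrame([
    [5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
    [4.9, 3.0, 1.4, 0.2, "Iris-setosa"],
    [4.7, 3.2, 1.3, 0.2, "Iris-setosa"],
    [7.0, 3.2, 4.7, 1.4, "Iris-versicolor"],
    [6.4, 3.2, 4.5, 1.5, "Iris-versicolor"],
])

# Rows at positions 0, 1, 2 (the stop value 3 is excluded);
# columns at positions 0 and 1, given as a list.
subset = df.iloc[0:3, [0, 1]]

# .values drops the row and column labels and keeps only the numbers.
X = subset.values

# A single integer (not a list) selects one column, so the result is 1-D.
y = df.iloc[0:3, 4].values

print(subset.shape)  # (3, 2)
print(X.shape)       # (3, 2)
print(y.shape)       # (3,)
```

The same pattern scaled up to 0:100 gives the 100-by-2 feature array and the 100-element label array used in the question.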
The original DataFrame:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
df = pd.read_csv('https://archive.ics.uci.edu/ml/'
'machine-learning-databases/iris/iris.data', header=None)
X = df.iloc[0:100, [0,1]].values
y = df.iloc[0:100, 4].values
y = np.where(y == 'Iris-setosa', 0, 1)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=5)
stdscaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = stdscaler.transform(X)
X_train_scaled = stdscaler.transform(X_train)
X_test_scaled = stdscaler.transform(X_test)
print(df)
Output:
       0    1    2    3               4
0    5.1  3.5  1.4  0.2     Iris-setosa
1    4.9  3.0  1.4  0.2     Iris-setosa
2    4.7  3.2  1.3  0.2     Iris-setosa
3    4.6  3.1  1.5  0.2     Iris-setosa
4    5.0  3.6  1.4  0.2     Iris-setosa
..   ...  ...  ...  ...             ...
145  6.7  3.0  5.2  2.3  Iris-virginica
146  6.3  2.5  5.0  1.9  Iris-virginica
147  6.5  3.0  5.2  2.0  Iris-virginica
148  6.2  3.4  5.4  2.3  Iris-virginica
149  5.9  3.0  5.1  1.8  Iris-virginica

[150 rows x 5 columns]
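df.iloc[0:100, [0,1]].values - notice that only columns 0 and 1 are returned here. The rows start at position 0 and stop at 100 (the stop is exclusive, so rows 0 through 99), following [start:stop]. To be clear, only columns 0 and 1 are selected because of the list [0,1].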
[[5.1 3.5]
[4.9 3. ]
[4.7 3.2]
[4.6 3.1]
[5. 3.6]
[5.4 3.9]
[4.6 3.4]
[5. 3.4]
[4.4 2.9]
[4.9 3.1]
[5.4 3.7]
[4.8 3.4]
[4.8 3. ]
[4.3 3. ]
[5.8 4. ]
[5.7 4.4]
[5.4 3.9]
[5.1 3.5]
df.iloc[0:100, 4].values - same as above, but selecting only column 4.
['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa']
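The question also bolds X[:50, 0] and X[:50, 1] in the plt.scatter calls. Those are plain NumPy slices on the array that .values returned, and the same [start:stop] idea applies. Below is a rough sketch, assuming a synthetic 100-by-2 array in place of the real measurements; the variable names setosa_x, setosa_y, versicolor_x and versicolor_y are hypothetical and only there to label each slice.

```python
import numpy as np

# Hypothetical stand-in for X and the raw labels: 100 rows, 2 feature columns.
# The first 50 rows are one species and the next 50 the other, which is the
# order in which iris.data stores its rows.
X = np.arange(200, dtype=float).reshape(100, 2)
labels = np.array(["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50)

# Same encoding as in the question: setosa -> 0, anything else -> 1.
y = np.where(labels == "Iris-setosa", 0, 1)

# X[rows, col]: a row slice plus a single column index gives a 1-D array.
setosa_x = X[:50, 0]          # column 0 of rows 0..49
setosa_y = X[:50, 1]          # column 1 of rows 0..49
versicolor_x = X[50:100, 0]   # column 0 of rows 50..99
versicolor_y = X[50:100, 1]   # column 1 of rows 50..99

print(setosa_x.shape, versicolor_x.shape)  # (50,) (50,)
print(y[0], y[50])                         # 0 1
```

So each plt.scatter call receives the x coordinates (column 0) and y coordinates (column 1) of one block of 50 rows, and y[0] and y[50] are the encoded class labels that end up in the legend text.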