How can you split a node based on a categorical variable in Scikit Learn Decision Tree?
【Title】: How can you split a node based on a categorical variable in Scikit Learn Decision Tree? 【Posted】: 2018-06-21 03:31:46 【Question】: I am trying to build a decision tree for the following dataset: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice
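This dataset contains some categorical variables (for example, husband's occupation: 1, 2, 3, 4). When I build the decision tree, the categorical values are split on a "less than or greater than" threshold. In other words, my tree has a node that splits the data as follows: "Husband's occupation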
import pandas as pd
import numpy as np
import os
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import seaborn as sns
import graphviz
import pydotplus
import io
from scipy import misc
os.chdir("path") #path containing datacontra.csv file
# Read the categorical columns as plain Python objects (np.object was removed from NumPy)
data = pd.read_csv("datacontra.csv", dtype={'Age': np.float64, 'EduW': object, 'EduH': object, 'Child': np.int64, 'ReliW': object, 'WorkW': object, 'OccuH': object, 'SOLI': object, 'MediaExp': object, 'T': object})
data.describe()
data.head()
data.tail()
data.info()
train, test = train_test_split(data,test_size = 0.05)
print("Training size" + str(len(train)))
print("Test size " + str(len(test)))
train.shape
features = list(data.columns[:9])
label = list(data.columns[9])
print(list(data.columns[:9]))
print(list(data.columns[9]))
X_train = train[features]
print(X_train.shape)
y_train = train[label]
print(y_train.shape)
X_test= test[features]
y_test = test[label]
c = DecisionTreeClassifier()
dt = c.fit(X_train,y_train)
path = ("/Users/sabinekuypers/Documents/Charlotte 461/")
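def show_tree(tree, features, path):
    # Export the fitted tree to DOT, render it to a PNG at `path`, then display it
    f = io.StringIO()
    export_graphviz(tree, out_file=f, feature_names=features)
    pydotplus.graph_from_dot_data(f.getvalue()).write_png(path)
    img = plt.imread(path)  # scipy.misc.imread is no longer available in SciPy
    plt.rcParams["figure.figsize"] = (20, 20)
    plt.imshow(img)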
show_tree(dt, features,'dt_tree.png')
y_pred = c.predict(X_test)
y_pred
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)*100
print("Accuracy: ",round(score,1),"%")
Thank you in advance.
【Comments】:
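【Answer 1】: Although decision trees can in principle handle categorical values, in sklearn you have to binary-encode them. For example, your feature Husband's Occupation, with values [1,2,3,4], should become three features, each a binary indicator for one occupation value. You can do this in pandas with pd.get_dummies, as follows: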
occ_dummies = pd.get_dummies(data["OccuH"], drop_first=True)  # use `data`, the DataFrame from the question
data = pd.concat([data.drop("OccuH", axis=1), occ_dummies], axis=1)
From there you can carry on with your data as before.
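To make the effect on the tree concrete, here is a minimal sketch on made-up data. The `toy` DataFrame, its values, and the `prefix="OccuH"` argument are assumptions for illustration and are not part of the original dataset or answer. After encoding, any split on an occupation column is a yes/no test on a single category rather than an ordering of the codes.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: column names mirror the question, values are made up.
toy = pd.DataFrame({
    "Age": [24.0, 45.0, 43.0, 42.0, 36.0, 19.0, 38.0, 21.0],
    "OccuH": ["2", "3", "3", "3", "3", "1", "4", "2"],  # categorical codes
    "T": ["1", "1", "1", "2", "2", "3", "3", "3"],      # class label
})

# Replace the single coded column with one 0/1 indicator column per category
# (the first category is dropped, so 4 categories give 3 columns).
occ_dummies = pd.get_dummies(toy["OccuH"], prefix="OccuH", drop_first=True)
encoded = pd.concat([toy.drop("OccuH", axis=1), occ_dummies], axis=1)

feature_names = [col for col in encoded.columns if col != "T"]
X = encoded[feature_names].astype(float)  # indicator columns become 0.0 / 1.0
y = encoded["T"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Text rendering of the fitted tree; a split on an indicator column
# appears as a test against 0.5, i.e. "is this category or not".
print(export_text(clf, feature_names=feature_names))
```

The `prefix` argument only affects the generated column names (`OccuH_2`, `OccuH_3`, `OccuH_4`), which makes the exported tree easier to read.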
A brief note on the drop_first kwarg: it is used to avoid creating a linear dependency among the dummy columns, as described in "One-hot vs dummy encoding in Scikit-learn".
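The difference drop_first makes can be checked directly. The following is a small sketch on a made-up Series; the values and the `prefix` argument are assumptions for illustration only. Without drop_first, the indicator columns of every row sum to 1, which is the linear dependency mentioned above. With it, the dropped category is represented by a row of all zeros.

```python
import pandas as pd

# Hypothetical occupation codes, stored as strings like a categorical column.
occupation = pd.Series(["1", "2", "3", "4", "2"], name="OccuH")

# Full one-hot encoding: one column per category (4 columns).
full = pd.get_dummies(occupation, prefix="OccuH")

# Dummy encoding: the first category is dropped (3 columns).
reduced = pd.get_dummies(occupation, prefix="OccuH", drop_first=True)

print(list(full.columns))
print(list(reduced.columns))

# In the full encoding each row sums to exactly 1, so any one column
# can be computed from the others.
print(full.sum(axis=1).tolist())

# In the reduced encoding, rows belonging to the dropped category are all zeros.
print((reduced.sum(axis=1) == 0).tolist())
```

Linear dependence is mainly a concern for linear models. For a decision tree, keeping all four indicator columns is likely also workable, so treat drop_first here as a reasonable default rather than a requirement.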
【Comments】: