Build Decision Tree Classification

Posted: 2021-04-22 05:36:44

I have two datasets, partb_data1 and partb_data2: a sample of bank customers with features reflecting customer characteristics, plus whether the bank kept them as customers (churn). Exited is the churn flag (1 if the customer has left the bank, 0 if they are still with the bank). I use partb_data1 as the training set and partb_data2 as the test set.

Here is my data:

> dput(head(partb_data1))
structure(list(RowNumber = 1:6, CustomerId = c(15634602L, 15647311L, 
15619304L, 15701354L, 15737888L, 15574012L), Surname = c("Hargrave", 
"Hill", "Onio", "Boni", "Mitchell", "Chu"), CreditScore = c(619L, 
608L, 502L, 699L, 850L, 645L), Geography = c("France", "Spain", 
"France", "France", "Spain", "Spain"), Gender = c("Female", "Female", 
"Female", "Female", "Female", "Male"), Age = c(42L, 41L, 42L, 
39L, 43L, 44L), Tenure = c(2L, 1L, 8L, 1L, 2L, 8L), Balance = c(0, 
83807.86, 159660.8, 0, 125510.82, 113755.78), NumOfProducts = c(1L, 
1L, 3L, 2L, 1L, 2L), HasCrCard = c(1L, 0L, 1L, 0L, 1L, 1L), IsActiveMember = c(1L, 
1L, 0L, 0L, 1L, 0L), EstimatedSalary = c(101348.88, 112542.58, 
113931.57, 93826.63, 79084.1, 149756.71), Exited = c(1L, 0L, 
1L, 0L, 0L, 1L)), row.names = c(NA, 6L), class = "data.frame")



> dput(head(partb_data2))
structure(list(RowNumber = 8001:8006, CustomerId = c(15629002L, 
15798053L, 15753895L, 15595426L, 15645815L, 15632848L), Surname = c("Hamilton", 
"Nnachetam", "Blue", "Madukwe", "Mills", "Ferrari"), CreditScore = c(747L, 
707L, 590L, 603L, 615L, 634L), Geography = c("Germany", "Spain", 
"Spain", "Spain", "France", "France"), Gender = c("Male", "Male", 
"Male", "Male", "Male", "Female"), Age = c(36L, 32L, 37L, 57L, 
45L, 36L), Tenure = c(8L, 9L, 1L, 6L, 5L, 1L), Balance = c(102603.3, 
0, 0, 105000.85, 0, 69518.95), NumOfProducts = c(2L, 2L, 2L, 
2L, 2L, 1L), HasCrCard = c(1L, 1L, 0L, 1L, 1L, 1L), IsActiveMember = c(1L, 
0L, 0L, 1L, 1L, 0L), EstimatedSalary = c(180693.61, 126475.79, 
133535.99, 87412.24, 164886.64, 116238.39), Exited = c(0L, 0L, 
0L, 1L, 0L, 0L)), row.names = c(NA, 6L), class = "data.frame")

I created a classification tree to predict churn. Below is the code:

library(tidyverse)
library(caret)
library(rpart)
library(rpart.plot)

# Training set: first 500 rows of partb_data1; test set: last 150 rows of partb_data2
train.data <- head(partb_data1, 500)
test.data <- tail(partb_data2, 150)

# Build the model
modelb <- rpart(Exited ~., data = train.data, method = "class")
# Visualize the decision tree with rpart.plot
rpart.plot(modelb)

# Make predictions on the test data
predicted.classes <- modelb %>% 
  predict(test.data, type = "class")
head(predicted.classes)

# Compute model accuracy rate on test data
mean(predicted.classes == test.data$Exited)
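Beyond a single accuracy number, a confusion matrix shows where the tree errs. A minimal sketch using caret's confusionMatrix() (caret is already loaded above; treating "1", the churners, as the positive class is my assumption, not something stated in the post):

```r
# Both arguments must be factors over the same levels ("0"/"1" here)
actual <- factor(test.data$Exited, levels = c(0, 1))
cm <- confusionMatrix(predicted.classes, actual, positive = "1")
cm$table                                     # 2x2 table of predicted vs. actual
cm$byClass[c("Sensitivity", "Specificity")]  # recall on churners and non-churners
```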

### Pruning the tree:

# Fit the model on the training set
modelb2 <- train(
  Exited ~., data = train.data, method = "rpart",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
)
# Plot model accuracy vs different values of
# cp (complexity parameter)
plot(modelb2)

# Print the best tuning parameter cp that
# maximizes the model accuracy
modelb2$bestTune

# Plot the final tree model
plot(modelb2$finalModel)

# Make predictions on the test data
predicted.classes <- modelb2 %>% predict(test.data)
# Compute model accuracy rate on test data
mean(predicted.classes == test.data$Exited)
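One caveat, offered as a hedged aside rather than something the original post addresses: caret's train() decides between classification and regression from the class of the outcome, so an integer Exited is fit as a regression tree and tuning is not done on Accuracy. Converting the outcome to a factor before calling train() keeps this a classification problem:

```r
# Sketch: make the 0/1 outcome a factor so train() treats it as classification
train.data$Exited <- factor(train.data$Exited, levels = c(0, 1))
test.data$Exited  <- factor(test.data$Exited,  levels = c(0, 1))
```

With that change, the train() call above tunes cp on classification accuracy as intended.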

Note: I have built the test set from partb_data2.

Is the procedure I followed correct? Do I have to make any changes to accomplish my goal of a classification tree? Your help is very welcome!

EDITED!!!

[Question discussion]:

[Answer 1]:

Your head(partb_data1$Exited, 500) is not a data.frame. Because of the $, you extracted a single column of partb_data1, so it is just an integer vector and cannot work:

class(head(partb_data1$Exited, 500))
[1] "integer"
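A minimal self-contained demonstration of the difference (toy data frame, not the original dataset):

```r
# $ extracts one column as a bare vector; head() on the data frame keeps rows
df <- data.frame(Exited = c(1L, 0L, 1L), CreditScore = c(619L, 608L, 502L))
class(head(df$Exited, 2))  # "integer" -- a vector, not usable as rpart's data=
class(head(df, 2))         # "data.frame" -- what rpart(..., data = ...) expects
```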

[Comments]:

True, that part was wrong; I will change it to this new code: # Split the data into training and test set train.data I will rephrase the question to make sure the procedure is correct! Thanks @james_rodriguez

[Answer 2]:

There are always many options for the procedure.

But you are right to split the data into training and test sets. Cross-validation could also be used instead. You are applying cross-validation to your training set, which is usually not necessary, but it is possible.

I think using the full data for the CV should also be fine, but what you did is not wrong.
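As a sketch of that alternative (hypothetical code, assuming partb_data1 and partb_data2 share the columns shown above), cross-validating a single tree on the combined data could look like:

```r
library(caret)

# Stack both files and cross-validate on everything instead of a fixed split
full.data <- rbind(partb_data1, partb_data2)
full.data$Exited <- factor(full.data$Exited, levels = c(0, 1))

set.seed(123)  # make the fold assignment reproducible
model.cv <- train(
  Exited ~ ., data = full.data, method = "rpart",
  trControl = trainControl(method = "cv", number = 10),
  tuneLength = 10
)
model.cv$results  # accuracy for each cp value, averaged over the 10 folds
```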

[Comments]:
