How to plot/visualize a C50 decision tree in R?


Posted: 2014-02-11 22:53:37

[Question]:

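I am using the C50 decision tree algorithm. I can build the tree and get the summary, but I cannot figure out how to plot or visualize the tree.

My C50 model is called credit_model.

In other decision tree packages I usually use something like plot(credit_model). In rpart it is rpart.plot(credit_model).

What is the equivalent for the C50 algorithm?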

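[Comments]:

As of now (2015-08-07), C5.0 has a plot function.

There is more on this topic in "Visualizing C5.0 Decision Tree?". All the answers here are outdated anyway, since there is now a plot function.

[Answer 1]: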

Currently, there is nothing built in. I have been working on adapters for the partykit package (e.g. as.party), but have not gotten very far.

Max
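
For readers on a later C50 release: the comments on this question report that the package has since gained a plot function for tree models (but not for rule-based ones). The following is a minimal sketch under that assumption. The iris data and the name iris_model are illustrative choices and are not part of the original post.

```r
# Sketch only: assumes a C50 release that ships a plot() method for C5.0 trees.
if (requireNamespace("C50", quietly = TRUE)) {
  library(C50)

  # Fit a small tree on a dataset bundled with base R (illustrative choice).
  iris_model <- C5.0(Species ~ ., data = iris)

  # Text view of the tree, as mentioned in the question.
  print(summary(iris_model))

  # Graphical view; per the comments this works for trees,
  # not for rule-based models.
  plot(iris_model)
}
```

If plot() fails on an older installation, the GraphViz route described in the next answer remains an option.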

[Comments]:

It is there now, for newcomers... it plots trees, but not rule-based models... By the way: cran.r-project.org/web/packages/C50/C50.pdf

[Answer 2]:

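You can use the following routine, which converts the decision tree directly into the GraphViz dot language. You then draw it with GraphViz, which must be installed beforehand (http://www.graphviz.org/).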
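
To give a feel for what such a dot file looks like, here is a tiny hand-built sketch in base R. The node names, labels, and split values are made up for illustration. It mirrors the shape of the output (one line per node, one line per edge, wrapped in a digraph block), not the routine's exact text.

```r
# Illustrative node and edge tables; all names and split values are made up.
nodes <- data.frame(
  id    = c("node1", "node2", "node3"),
  label = c("total_day_minutes", "yes", "no"),
  shape = c("diamond", "box3d", "box3d"),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from  = c("node1", "node1"),
  to    = c("node2", "node3"),
  label = c("> 264.4", "<= 264.4"),
  stringsAsFactors = FALSE
)

# One dot statement per node and per edge.
node_lines <- sprintf('%s [shape="%s" label="%s"];',
                      nodes$id, nodes$shape, nodes$label)
edge_lines <- sprintf('%s -> %s [label="%s"];',
                      edges$from, edges$to, edges$label)

# Wrap everything in a directed graph laid out top to bottom.
dot_text <- c("digraph g {", 'graph [rankdir="TB"];',
              node_lines, edge_lines, "}")

# Write to a temporary file and echo the text.
dot_file <- file.path(tempdir(), "tiny_tree.dot")
writeLines(dot_text, dot_file)
cat(dot_text, sep = "\n")
```

Question nodes are drawn as diamonds and class nodes as boxes, matching the default shapes of the routine's shape.question and shape.conclusion arguments.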

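Edit: Version 2 is included below. It can handle multi-branch trees (version 1 could only handle trees with two-way splits). Version 2.2 fixes a missing initialization.

Example call in R: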

library(C50)
data(churn)
treeModel <- C5.0(x = churnTrain[, -20], y = churnTrain$churn)
C5.0.graphviz(treeModel, 'C:\\mydotfile.txt')

Then, from the operating system (e.g. the Windows command prompt):

dot -Tpng "C:\mydotfile.txt" -o "C:\mygraph.png"

You can then open mygraph.png as a PNG (bitmap) and use it in your application.

For more details, see the original post: http://r-project-thanos.blogspot.de/2014/09/plot-c50-decision-trees-in-r.html
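
The dot step can also be run from within R. The sketch below assumes Graphviz is installed and that its dot executable is on the PATH. render_dot is a hypothetical helper name introduced here; it is not part of C50 or Graphviz.

```r
# Hypothetical helper: render a dot file to PNG by calling Graphviz from R.
render_dot <- function(dot_file, png_file) {
  # Locate the Graphviz executable; Sys.which() returns "" if it is missing.
  dot_bin <- unname(Sys.which("dot"))
  if (!nzchar(dot_bin)) {
    stop("Graphviz 'dot' was not found on the PATH")
  }
  # A missing input file is a common cause of "dot: can't open ..." errors.
  if (!file.exists(dot_file)) {
    stop("dot file not found: ", dot_file)
  }
  # shQuote() applies the quoting style of the current platform's shell.
  status <- system2(dot_bin,
                    c("-Tpng", shQuote(dot_file), "-o", shQuote(png_file)))
  if (status != 0) {
    stop("dot exited with status ", status)
  }
  invisible(png_file)
}

# Usage (paths are placeholders):
# render_dot("C:/mydotfile.txt", "C:/mygraph.png")
```

Checking for the input file up front gives a clearer message than the one dot prints when it is handed a wrong or badly quoted path.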

C5.0.graphviz <- function( C5.0.model,   filename, fontname ='Arial',
    col.draw ='black',col.font ='blue',col.conclusion ='lightpink',
    col.question = 'grey78', shape.conclusion ='box3d',shape.question ='diamond',
    bool.substitute = 'None', prefix=FALSE, vertical=TRUE ) {

    library(cwhmisc)   # provides cpos()
    library(stringr)   # provides str_trim()

    # Keep only the tree text between the two marker lines of the C5.0 output
    treeout <- C5.0.model$output
    treeout<- substr(treeout,   cpos(treeout, 'Decision tree:', start=1)+14,nchar(treeout))
    treeout<- substr(treeout,   1,cpos(treeout, 'Evaluation on training data', start=1)-2)

    # Pre-allocated tables for nodes (variables), edges (connectors) and the parent stack
    variables <- data.frame(matrix(nrow=500, ncol=4))
    names(variables) <- c('SYMBOL','TOKEN', 'TYPE' , 'QUERY')
    connectors <- data.frame(matrix(nrow=500, ncol=3))
    names(connectors) <- c('TOKEN', 'START','END')
    theStack <- data.frame(matrix(nrow=500, ncol=1))
    names(theStack) <- c('ITEM')
    theStackIndex <- 1
    currentvar <- 1
    currentcon <- 1
    open_connection <- TRUE
    previousindent <- -1
    firstindent <- 4
    # Optional relabelling of boolean splits ('= 0' / '= 1')
    substitutes <- data.frame(None=c('= 0','= 1'), yesno=c('no','yes'),
        truefalse=c('false', 'true'),TF=c('F','T'))
    dtreestring<-unlist( scan(text= treeout,   sep='\n', what =list('character')))

    for (linecount in c(1:length(dtreestring))) {
        # Depth of the line: one level per 4 characters of indentation or ':   ' marker
        lineindent<-0
        shortstring <- str_trim(dtreestring[linecount], side='left')
        leadingspaces <- nchar(dtreestring[linecount]) - nchar(shortstring)
        lineindent <- leadingspaces/4
        dtreestring[linecount]<-str_trim(dtreestring[linecount], side='left')
        while (!is.na(cpos(dtreestring[linecount], ':   ', start=1)) ) {
            lineindent<-lineindent + 1
            dtreestring[linecount]<-substr(dtreestring[linecount],
                ifelse(is.na(cpos(dtreestring[linecount], ':   ', start=1)), 1,
                cpos(dtreestring[linecount], ':   ', start=1)+4),
                nchar(dtreestring[linecount]) )
            shortstring <- str_trim(dtreestring[linecount], side='left')
            leadingspaces <- nchar(dtreestring[linecount]) - nchar(shortstring)
            lineindent <- lineindent + leadingspaces/4
            dtreestring[linecount]<-str_trim(dtreestring[linecount], side='left')
        }
        # A ':...' marker opens one more level; strip it if present
        if (!is.na(cpos(dtreestring[linecount], ':...', start=1))) {
            lineindent<- lineindent +  1
        }
        dtreestring[linecount]<-substr(dtreestring[linecount],
            ifelse(is.na(cpos(dtreestring[linecount], ':...', start=1)), 1,
            cpos(dtreestring[linecount], ':...', start=1)+4),
            nchar(dtreestring[linecount]) )
        dtreestring[linecount]<-str_trim(dtreestring[linecount])
        # Split 'variable operator value: class' into condition and optional class
        stringlist <- strsplit(dtreestring[linecount],'\\:')
        stringpart <- strsplit(unlist(stringlist)[1],'\\s')
        if (open_connection==TRUE) {
            # New question node; push it on the stack
            variables[currentvar,'TOKEN'] <- unlist(stringpart)[1]
            variables[currentvar,'SYMBOL'] <- paste('node',as.character(currentvar), sep='')
            variables[currentvar,'TYPE'] <- shape.question
            variables[currentvar,'QUERY'] <- 1
            theStack[theStackIndex,'ITEM']<-variables[currentvar,'SYMBOL']
            theStack[theStackIndex,'INDENT'] <-firstindent
            theStackIndex<-theStackIndex+1
            currentvar <- currentvar + 1
            if(currentvar>2) {
                connectors[currentcon - 1,'END'] <- variables[currentvar - 1, 'SYMBOL']
            }
        }
        # Edge label: operator and value, with optional boolean substitution
        connectors[currentcon,'TOKEN'] <- paste(unlist(stringpart)[2],unlist(stringpart)[3])
        if (connectors[currentcon,'TOKEN']=='= 0') {
            connectors[currentcon,'TOKEN'] <- as.character(substitutes[1,bool.substitute])
        }
        if (connectors[currentcon,'TOKEN']=='= 1') {
            connectors[currentcon,'TOKEN'] <- as.character(substitutes[2,bool.substitute])
        }
        # Find the node this edge starts from
        if (open_connection==TRUE) {
            if (lineindent<previousindent) {
                theStackIndex <- theStackIndex-(( previousindent- lineindent)  +1 )
                currentsymbol <-theStack[theStackIndex,'ITEM']
            } else {
                currentsymbol <-variables[currentvar - 1,'SYMBOL']
            }
        } else {
            currentsymbol <-theStack[theStackIndex-((previousindent -lineindent ) +1    ),'ITEM']
            theStackIndex <- theStackIndex-(( previousindent- lineindent)    )
        }
        connectors[currentcon, 'START'] <- currentsymbol
        currentcon <- currentcon + 1
        open_connection <- TRUE
        if (length(unlist(stringlist))==2) {
            # The line ends in a class label: add a conclusion (leaf) node
            stringpart2 <- strsplit(unlist(stringlist)[2],'\\s')
            variables[currentvar,'TOKEN']   <- paste(ifelse((prefix==FALSE),'','Class'), unlist(stringpart2)[2])
            variables[currentvar,'SYMBOL']  <- paste('node',as.character(currentvar), sep='')
            variables[currentvar,'TYPE']        <- shape.conclusion
            variables[currentvar,'QUERY']   <- 0
            currentvar <- currentvar + 1
            connectors[currentcon - 1,'END'] <- variables[currentvar - 1,'SYMBOL']
            open_connection <- FALSE
        }
        previousindent<-lineindent
    }

    # Assemble the dot text: graph header, then nodes, then edges
    runningstring <- paste('digraph g {', 'graph ', sep='\n')
    runningstring <- paste(runningstring, ' [rankdir="', sep='')
    runningstring <- paste(runningstring, ifelse(vertical==TRUE,'TB','LR'), sep='' )
    runningstring <- paste(runningstring, '"]', sep='')
    for (lines in c(1:(currentvar-1))) {
        runningline <- paste(variables[lines,'SYMBOL'], '[shape="')
        runningline <- paste(runningline,variables[lines,'TYPE'], sep='' )
        runningline <- paste(runningline,'" label ="', sep='' )
        runningline <- paste(runningline,variables[lines,'TOKEN'], sep='' )
        runningline <- paste(runningline,
            '" style=filled fontcolor=', sep='')
        runningline <- paste(runningline, col.font)
        runningline <- paste(runningline,' color=' )
        runningline <- paste(runningline, col.draw)
        runningline <- paste(runningline,' fontname=')
        runningline <- paste(runningline, fontname)
        runningline <- paste(runningline,' fillcolor=')
        runningline <- paste(runningline,
            ifelse(variables[lines,'QUERY']== 0 ,col.conclusion,col.question))
        runningline <- paste(runningline,'];')
        runningstring <- paste(runningstring,   runningline , sep='\n')
    }
    for (lines in c(1:(currentcon-1))) {
        runningline <- paste (connectors[lines,'START'], '->')
        runningline <- paste (runningline, connectors[lines,'END'])
        runningline <- paste (runningline,'[label="')
        runningline <- paste (runningline,connectors[lines,'TOKEN'], sep='')
        runningline <- paste (runningline,'" fontname=', sep='')
        runningline <- paste (runningline, fontname)
        runningline <- paste (runningline,'];')
        runningstring <- paste(runningstring,   runningline , sep='\n')
    }
    runningstring <- paste(runningstring,'}')

    # Print the dot text to the console and write it to the output file
    cat(runningstring)
    sink(filename, split=TRUE)
    cat(runningstring)
    sink()
}

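Thanos

[Comments]:

How do I make the graphviz function available in R?

You don't make it available. You call Graphviz's dot command from the operating system (Windows, Linux, etc.). Its input is the text file generated by the C5.0.graphviz function presented above, and its output is an image file that you can use in your application (Word, etc.).

I get the error dot: can't open 'mytree.txt'. Any idea why?

[Answer 3]: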

Here is the function you are looking for:

C5.0.graphviz <- function( C5.0.model, filename, fontname ='Arial',col.draw ='black',
                           col.font ='blue',col.conclusion ='lightpink',col.question = 'grey78',
                           shape.conclusion ='box3d',shape.question ='diamond',
                           bool.substitute = 'None', prefix=FALSE, vertical=TRUE ) {

  library(cwhmisc)   # provides cpos()
  library(stringr)   # provides str_trim()

  # Keep only the tree text between the two marker lines of the C5.0 output
  treeout <- C5.0.model$output
  treeout<- substr(treeout, cpos(treeout, 'Decision tree:', start=1)+14,nchar(treeout))
  treeout<- substr(treeout, 1,cpos(treeout, 'Evaluation on training data', start=1)-2)

  # Pre-allocated tables for nodes (variables), edges (connectors) and the parent stack
  variables <- data.frame(matrix(nrow=500, ncol=4))
  names(variables) <- c('SYMBOL','TOKEN', 'TYPE' , 'QUERY')
  connectors <- data.frame(matrix(nrow=500, ncol=3))
  names(connectors) <- c('TOKEN', 'START','END')
  theStack <- data.frame(matrix(nrow=500, ncol=1))
  names(theStack) <- c('ITEM')
  theStackIndex <- 1
  currentvar <- 1
  currentcon <- 1
  open_connection <- TRUE
  previousindent <- -1
  firstindent <- 4
  # Optional relabelling of boolean splits ('= 0' / '= 1')
  substitutes <- data.frame(None=c('= 0','= 1'), yesno=c('no','yes'),
                            truefalse=c('false', 'true'),TF=c('F','T'))
  dtreestring<-unlist( scan(text= treeout,   sep='\n', what =list('character')))

  for (linecount in c(1:length(dtreestring))) {
    # Depth of the line: one level per 4 characters of indentation or ':   ' marker
    lineindent<-0
    shortstring <- str_trim(dtreestring[linecount], side='left')
    leadingspaces <- nchar(dtreestring[linecount]) - nchar(shortstring)
    lineindent <- leadingspaces/4
    dtreestring[linecount]<-str_trim(dtreestring[linecount], side='left')
    while (!is.na(cpos(dtreestring[linecount], ':   ', start=1)) ) {
      lineindent<-lineindent + 1
      dtreestring[linecount]<-substr(dtreestring[linecount],
                                     ifelse(is.na(cpos(dtreestring[linecount], ':   ', start=1)), 1,
                                            cpos(dtreestring[linecount], ':   ', start=1)+4),
                                     nchar(dtreestring[linecount]) )
      shortstring <- str_trim(dtreestring[linecount], side='left')
      leadingspaces <- nchar(dtreestring[linecount]) - nchar(shortstring)
      lineindent <- lineindent + leadingspaces/4
      dtreestring[linecount]<-str_trim(dtreestring[linecount], side='left')
    }
    # A ':...' marker opens one more level; strip it if present
    if (!is.na(cpos(dtreestring[linecount], ':...', start=1))) {
      lineindent<- lineindent +  1
    }
    dtreestring[linecount]<-substr(dtreestring[linecount],
                                   ifelse(is.na(cpos(dtreestring[linecount], ':...', start=1)), 1,
                                          cpos(dtreestring[linecount], ':...', start=1)+4),
                                   nchar(dtreestring[linecount]) )
    dtreestring[linecount]<-str_trim(dtreestring[linecount])
    # Split 'variable operator value: class' into condition and optional class
    stringlist <- strsplit(dtreestring[linecount],'\\:')
    stringpart <- strsplit(unlist(stringlist)[1],'\\s')
    if (open_connection==TRUE) {
      # New question node; push it on the stack
      variables[currentvar,'TOKEN'] <- unlist(stringpart)[1]
      variables[currentvar,'SYMBOL'] <- paste('node',as.character(currentvar), sep='')
      variables[currentvar,'TYPE'] <- shape.question
      variables[currentvar,'QUERY'] <- 1
      theStack[theStackIndex,'ITEM']<-variables[currentvar,'SYMBOL']
      theStack[theStackIndex,'INDENT'] <-firstindent
      theStackIndex<-theStackIndex+1
      currentvar <- currentvar + 1
      if(currentvar>2) {
        connectors[currentcon - 1,'END'] <- variables[currentvar - 1, 'SYMBOL']
      }
    }
    # Edge label: operator and value, with optional boolean substitution
    connectors[currentcon,'TOKEN'] <- paste(unlist(stringpart)[2],unlist(stringpart)[3])
    if (connectors[currentcon,'TOKEN']=='= 0') {
      connectors[currentcon,'TOKEN'] <- as.character(substitutes[1,bool.substitute])
    }
    if (connectors[currentcon,'TOKEN']=='= 1') {
      connectors[currentcon,'TOKEN'] <- as.character(substitutes[2,bool.substitute])
    }
    # Find the node this edge starts from
    if (open_connection==TRUE) {
      if (lineindent<previousindent) {
        theStackIndex <- theStackIndex-(( previousindent- lineindent)  +1 )
        currentsymbol <-theStack[theStackIndex,'ITEM']
      } else {
        currentsymbol <-variables[currentvar - 1,'SYMBOL']
      }
    } else {
      currentsymbol <-theStack[theStackIndex-((previousindent -lineindent ) +1    ),'ITEM']
      theStackIndex <- theStackIndex-(( previousindent- lineindent)    )
    }
    connectors[currentcon, 'START'] <- currentsymbol
    currentcon <- currentcon + 1
    open_connection <- TRUE
    if (length(unlist(stringlist))==2) {
      # The line ends in a class label: add a conclusion (leaf) node
      stringpart2 <- strsplit(unlist(stringlist)[2],'\\s')
      variables[currentvar,'TOKEN']  <- paste(ifelse((prefix==FALSE),'','Class'), unlist(stringpart2)[2])
      variables[currentvar,'SYMBOL']  <- paste('node',as.character(currentvar), sep='')
      variables[currentvar,'TYPE']   <- shape.conclusion
      variables[currentvar,'QUERY']  <- 0
      currentvar <- currentvar + 1
      connectors[currentcon - 1,'END'] <- variables[currentvar - 1,'SYMBOL']
      open_connection <- FALSE
    }
    previousindent<-lineindent
  }

  # Assemble the dot text: graph header, then nodes, then edges
  runningstring <- paste('digraph g {', 'graph ', sep='\n')
  runningstring <- paste(runningstring, ' [rankdir="', sep='')
  runningstring <- paste(runningstring, ifelse(vertical==TRUE,'TB','LR'), sep='' )
  runningstring <- paste(runningstring, '"]', sep='')
  for (lines in c(1:(currentvar-1))) {
    runningline <- paste(variables[lines,'SYMBOL'], '[shape="')
    runningline <- paste(runningline,variables[lines,'TYPE'], sep='' )
    runningline <- paste(runningline,'" label ="', sep='' )
    runningline <- paste(runningline,variables[lines,'TOKEN'], sep='' )
    runningline <- paste(runningline,
                         '" style=filled fontcolor=', sep='')
    runningline <- paste(runningline, col.font)
    runningline <- paste(runningline,' color=' )
    runningline <- paste(runningline, col.draw)
    runningline <- paste(runningline,' fontname=')
    runningline <- paste(runningline, fontname)
    runningline <- paste(runningline,' fillcolor=')
    runningline <- paste(runningline,
                         ifelse(variables[lines,'QUERY']== 0 ,col.conclusion,col.question))
    runningline <- paste(runningline,'];')
    runningstring <- paste(runningstring, runningline , sep='\n')
  }
  for (lines in c(1:(currentcon-1))) {
    runningline <- paste (connectors[lines,'START'], '->')
    runningline <- paste (runningline, connectors[lines,'END'])
    runningline <- paste (runningline,'[label="')
    runningline <- paste (runningline,connectors[lines,'TOKEN'], sep='')
    runningline <- paste (runningline,'" fontname=', sep='')
    runningline <- paste (runningline, fontname)
    runningline <- paste (runningline,'];')
    runningstring <- paste(runningstring, runningline , sep='\n')
  }
  runningstring <- paste(runningstring,'}')

  # Print the dot text to the console and write it to the output file
  cat(runningstring)
  sink(filename, split=TRUE)
  cat(runningstring)
  sink()
}
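
The function above works out how deep each line of the printed tree sits by counting 4-character prefix groups: ':...', ':   ', or plain indentation. The sketch below isolates that single idea in base R. tree_depth is a hypothetical name, and the sample lines are made up to resemble the layout the function expects; they are not real model output.

```r
# Hypothetical helper: depth of each printed tree line, counted in
# 4-character prefix groups (':...', ':   ' or four spaces).
tree_depth <- function(lines) {
  # The pattern can match the empty string, so every line yields a match.
  prefix <- regmatches(lines, regexpr("^(:\\.\\.\\.|:   |    )*", lines))
  nchar(prefix) / 4
}

# Made-up lines in the layout the parser expects.
sample_lines <- c(
  "total_day_minutes > 264.4:",
  ":...voice_mail_plan = yes: no (45/1)",
  ":   voice_mail_plan = no:",
  ":   :...total_eve_minutes <= 187.7: no (20/2)",
  ":       total_eve_minutes > 187.7: yes (60/5)",
  "total_day_minutes <= 264.4: no (2000/100)"
)

depths <- tree_depth(sample_lines)
print(data.frame(depth = depths, line = sample_lines))
```

A regular expression keeps the sketch short. The function above reaches the same kind of count with repeated cpos() and substr() calls.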
