基于R语言的用户分析

Posted

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了基于R语言的用户分析相关的知识,希望对你有一定的参考价值。

1. 基本分析理论

C5.0是决策树模型中的算法,79年由J R Quinlan发展,并提出了ID3算法,主要针对离散型属性数据,其后又不断的改进,形成C4.5,它在ID3基础上增加了队连续属性的离散化。C5.0是C4.5应用于大数据集上的分类算法,主要在执行效率和内存使用方面进行了改进。
C4.5算法是ID3算法的修订版,采用GainRatio来加以改进方法,选取有最大GainRatio的分割变量作为准则,避免ID3算法过度配适的问题。
C5.0算法则是C4.5算法的修订版,适用于处理大数据集,采用Boosting方式提高模型准确率,又称为BoostingTrees,在软件上计算速度比较快,占用的内存资源较少。
决策树模型,也称规则推理模型。通过对训练样本的学习,建立分类规则;依据分类规则,实现对新样本的分类;属于有指导(监督)式的学习方法,有两类变量:目标变量(输出变量),属性变量(输入变量)。
决策树模型与一般统计分类模型的主要区别:决策树的分类是基于逻辑的,一般统计分类模型是基于非逻辑的。
常见的算法有CHAID、CART、Quest和C5.0。对于每一个决策要求分成的组之间的“差异”最大。各种决策树算法之间的主要区别就是对这个“差异”衡量方式的区别。
决策树很擅长处理非数值型数据,这与神经网络智能处理数值型数据比较而言,就免去了很多数据预处理工作。
C5.0是经典的决策树模型算法之一,可生成多分支的决策树,目标变量为分类变量,使用C5.0算法可以生成决策树或者规则集。C5.0模型根据能偶带来的最大信息增益的字段拆分样本。第一次拆分确定的样本子集随后再次拆分,通常是根据另一个字段进行拆分,这一过程重复进行指导样本子集不能在被拆分为止。最后,重新缉拿眼最低层次的拆分,哪些对模型值没有显著贡献的样本子集被提出或者修剪。
C5.0优点:
C5.0模型在面对数据遗漏和输入字段很多的问题时非常稳健;
C5.0模型比一些其他类型的模型易于理解,模型退出的规则有非常直观的解释;
C5.0也提供强大技术以提高分类的精度。
C5.0算法
C5.0算法选择分支变量的依据:以信息熵的下降速度作为确定最佳分支变量和分割阀值的依据。信息熵的下降意味着信息的不确定性下降。

2. 涉及先关包
library(C50)
library(dplyr)

3. 实例

library(C50)
data(churn)

churn_data <- churnTrain
outcome_name <- ‘churn‘

# make the outcome variable easier to read
churn_data[,outcome_name] <- as.factor(ifelse(churn_data[,outcome_name]==‘yes‘,‘Does_Churn‘, ‘Stays‘))

interesting_interactions <- function(the_data_frame, outcome_name) {
  # install.packages(...) if missing
  require(C50)
  require(dplyr)
  
  c5model <- C5.0(
    x = the_data_frame[,setdiff(names(the_data_frame), outcome_name)],
    y = the_data_frame[,outcome_name],
    rules = TRUE
  )
  
  rule_munger <- capture.output(c5model$rules , split = TRUE)
  rule_munger <- strsplit(rule_munger,‘\\\\n‘)
  rule_munger <- gsub(x = rule_munger[[1]], pattern = ‘\\\\|\"‘, replacement = ‘‘)[-1]
  
  # extract results into data frame format
  rule_count <- 0
  conds_last <- 0
  cover_last <- 0
  ok_last <- 0
  lift_last <-  0
  class_last <- 0
  
  rules <- c()
  for (entry in rule_munger) {
    print(entry)
    if (substr(entry,1,5) == ‘rules‘)
      print(entry)
    
    # track only lines starting with conds or type - ignore rest
    if (substr(entry,1,5) == ‘conds‘ |
        substr(entry,1,4) == ‘type‘) {
      if (substr(entry,1,5) == ‘conds‘) {
        rule_count <- rule_count + 1
        conds_last <-
          strsplit(x = strsplit(x = entry, split = " ")[[1]][1], split = ‘=‘)[[1]][2]
        # cover is the number of training cases covered by the rule
        cover_last <-
          strsplit(x = strsplit(x = entry, split = " ")[[1]][2], split = ‘=‘)[[1]][2]
        # ok is the number of positives covered by class,
        ok_last <-
          strsplit(x = strsplit(x = entry, split = " ")[[1]][3], split = ‘=‘)[[1]][2]
        # lift is the estimated accuracy of the rule
        lift_last <-
          strsplit(x = strsplit(x = entry, split = " ")[[1]][4], split = ‘=‘)[[1]][2]
        # class predicted by
        class_last <-
          strsplit(x = strsplit(x = entry, split = " ")[[1]][5], split = ‘=‘)[[1]][2]
      }
      
      if (substr(entry,1,4) == ‘type‘) {
        # variable type
        type_last <-
          strsplit(x = strsplit(x = entry, split = " ")[[1]][1], split = ‘=‘)[[1]][2]
        att_last <-
          strsplit(x = strsplit(x = entry, split = " ")[[1]][2], split = ‘=‘)[[1]][2]
        
        # sniff out optional parameters
        elts_last <- ‘‘
        if (grepl(x = entry, pattern = ‘elts‘)) {
          elts_last <- strsplit(x = entry, split = "elts=")[[1]][2]
        }
        
        cut_last <- ‘‘
        if (grepl(x = entry, pattern = ‘cut‘)) {
          cut_last <-
            strsplit(
              x = strsplit(
                x = entry, split = "cut="
              )[[1]][2], split = ‘ ‘
            )[[1]][1]
        }
        
        val_last <- ‘‘
        if (grepl(x = entry, pattern = ‘val‘)) {
          val_last <- strsplit(x = entry, split = "val=")[[1]][2]
        }
        
        result_last <- ‘‘
        if (grepl(x = entry, pattern = ‘result‘)) {
          result_last <- strsplit(x = entry, split = "result=")[[1]][2]
        }
        
        rules <- rbind(
          rules, c(
            rule_count,
            conds_last,
            cover_last,
            ok_last,
            lift_last,
            type_last,
            att_last,
            elts_last,
            result_last,
            cut_last,
            val_last,
            class_last
          )
        )
      }
      
    }
    
  }
  if (!is.null(rules)) {
    
    rules <- data.frame(rules)
    
    names(rules) <-
      c(
        ‘rule_number‘, ‘conditions‘, ‘cover‘, ‘true_pos‘,
        ‘lift‘, ‘type‘, ‘attribute‘, ‘elts‘, ‘cut‘, ‘result‘,
        ‘value‘, ‘outcome‘
      )
    rules[, 1:6] <- sapply(rules[, 1:6], as.character)
    rules[, 1:6] <- sapply(rules[, 1:6], as.numeric)
    
    if (length(unique(rules$rule_number) > 0)) {
      rules %>% dplyr::arrange(desc(lift)) -> rules
    }
    
  }
  return (rules)
}

 

results <- interesting_interactions(the_data_frame = churn_data, outcome_name = outcome_name)

print_rules <- function(rules_found, rulenum) {
  print(‘‘)
  print(paste0(‘Rule #‘, rulenum))
  dplyr::filter(rules_found, rule_number == rulenum) -> pulled_rule
  dplyr::select(pulled_rule, cover, true_pos, outcome) %>% head(1) -> rule_def
  dplyr::select(pulled_rule, attribute, elts, cut, result, value) -> conditions
  
  print(paste0(‘In ‘, rule_def$cover, ‘ cases, ‘, round(rule_def$true_pos/rule_def$cover,2)*100, ‘% customers ‘,
               as.character(rule_def$outcome),‘ when:‘))
  
  for (cond_id in seq(nrow(conditions))) {
    cond <- conditions[cond_id,]
    #attribute elts cut result value
    if (nchar(as.character(cond$elts)) > 0) {
      print(paste0(cond$attribute, 
                   ‘: ‘, cond$elts))
    } else if (nchar(as.character(cond$value)) > 0) {
      print(paste0(cond$attribute, 
                   ‘ == ‘, cond$value))
    } else {
      print(paste0(cond$attribute, " ", cond$cut, " ", cond$result))
    }
  }
  print(‘‘)
}

for (rule_number in unique(results$rule_number))
  print_rules(results, rule_number)

## [1] ""
## [1] "Rule #1"
## [1] "In 60 cases, 100% customers Does_Churn when:"
## [1] "international_plan == yes"
## [1] "total_intl_calls < 2"
## [1] ""
## [1] ""
## [1] "Rule #2"
## [1] "In 57 cases, 100% customers Does_Churn when:"
## [1] "international_plan == yes"
## [1] "total_intl_minutes > 13.1"
## [1] ""
## [1] ""
## [1] "Rule #3"
## [1] "In 32 cases, 100% customers Does_Churn when:"
## [1] "total_day_minutes < 120.5"
## [1] "number_customer_service_calls > 3"
## [1] ""
## [1] ""
## [1] "Rule #4"
## [1] "In 79 cases, 96% customers Does_Churn when:"
## [1] "total_day_minutes < 160.2"
## [1] "total_eve_charge < 19.83"
## [1] "number_customer_service_calls > 3"
## [1] ""
## [1] ""
## [1] "Rule #5"
## [1] "In 43 cases, 95% customers Does_Churn when:"
## [1] "international_plan == no"
## [1] "voice_mail_plan == no"
## [1] "total_day_minutes > 246.60001"
## [1] "total_eve_charge > 20.5"
## [1] ""
## [1] ""
## [1] "Rule #6"
## [1] "In 28 cases, 93% customers Does_Churn when:"
## [1] "total_day_minutes < 264.39999"
## [1] "total_eve_calls < 125"
## [1] "total_eve_charge < 12.05"
## [1] "number_customer_service_calls > 3"
## [1] ""
## [1] ""
## [1] "Rule #7"
## [1] "In 78 cases, 90% customers Does_Churn when:"
## [1] "voice_mail_plan == no"
## [1] "total_day_minutes > 223.2"
## [1] "total_eve_charge > 20.5"
## [1] "total_night_minutes > 174.2"
## [1] ""
## [1] ""
## [1] "Rule #8"
## [1] "In 114 cases, 79% customers Does_Churn when:"
## [1] "voice_mail_plan == no"
## [1] "total_day_minutes > 223.2"
## [1] "total_eve_charge > 20.5"
## [1] ""
## [1] ""
## [1] "Rule #9"
## [1] "In 152 cases, 62% customers Does_Churn when:"
## [1] "total_day_minutes > 223.2"
## [1] "total_eve_charge > 20.5"
## [1] ""
## [1] ""
## [1] "Rule #10"
## [1] "In 211 cases, 60% customers Does_Churn when:"
## [1] "total_day_minutes > 264.39999"
## [1] ""
## [1] ""
## [1] "Rule #12"
## [1] "In 768 cases, 97% customers Stays when:"
## [1] "international_plan == no"
## [1] "voice_mail_plan == yes"
## [1] "number_customer_service_calls < 3"
## [1] ""
## [1] ""
## [1] "Rule #11"
## [1] "In 2221 cases, 97% customers Stays when:"
## [1] "international_plan == no"
## [1] "total_day_minutes < 223.2"
## [1] "number_customer_service_calls < 3"
## [1] ""
## [1] ""
## [1] "Rule #13"
## [1] "In 140 cases, 96% customers Stays when:"
## [1] "account_length < 123"
## [1] "total_eve_minutes < 187.7"
## [1] "total_night_minutes < 151.89999"
## [1] ""
## [1] ""
## [1] "Rule #14"
## [1] "In 45 cases, 98% customers Stays when:"
## [1] "international_plan == no"
## [1] "voice_mail_plan == yes"
## [1] "total_day_minutes > 264.39999"
## [1] ""
## [1] ""
## [1] "Rule #15"
## [1] "In 1972 cases, 96% customers Stays when:"
## [1] "total_day_minutes < 264.39999"
## [1] "total_intl_minutes < 13.1"
## [1] "total_intl_calls > 2"
## [1] "number_customer_service_calls < 3"
## [1] ""
## [1] ""
## [1] "Rule #16"
## [1] "In 197 cases, 95% customers Stays when:"
## [1] "total_day_minutes > 120.5"
## [1] "total_day_minutes < 160.2"
## [1] "total_eve_charge > 19.83"
## [1] ""
## [1] ""
## [1] "Rule #17"
## [1] "In 155 cases, 94% customers Stays when:"
## [1] "voice_mail_plan == no"
## [1] "total_day_minutes < 277"
## [1] "total_night_minutes < 126.9"
## [1] ""
## [1] ""
## [1] "Rule #18"
## [1] "In 1675 cases, 89% customers Stays when:"
## [1] "total_day_minutes > 160.2"
## [1] "total_day_minutes < 264.39999"
## [1] "total_eve_charge > 12.05"
## [1] ""
## [1] ""
## [1] "Rule #19"
## [1] "In 434 cases, 89% customers Stays when:"
## [1] "total_eve_charge < 12.26"
## [1] ""

以上是关于基于R语言的用户分析的主要内容,如果未能解决你的问题,请参考以下文章

R语言应用实战-基于R语言的判别分析:fisher判别法,距离判别法以及Bayers判别法(附源代码)

R语言︱情感分析—词典型代码实践(最基础)

中文分词实践(基于R语言)

基于R语言实现Lasso回归分析

中文分词实践(基于R语言)

R语言实战应用-基于R语言的对应分析