Data Analysis Series: Condensed Highlights

Posted jcjc


Data Analysis (Part 3)

Before analyzing the UCI data, it helps to review a few decision tree concepts.

  • Here is a recommended blog post on decision trees:
    http://www.cnblogs.com/yonghao/p/5061873.html
  • Basic characteristics of a decision tree (DT)

    • DT is a supervised learning method, thus we need labeled data

    • It runs as a single process, so it is not a good fit for giant datasets

    • PS: It works pretty well on small and clean datasets

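  • UCI data characteristics: UCI credit approval data set

    • 690 data entries, a relatively small dataset

    • 15 attributes, which is a small number

    • only about 5% of the entries have missing values

    • 2-class data

  • Putting these two together, we know DT should work well for our dataset

With that background, we can try implementing a decision tree in code. Use the skeleton provided by Teacher Duan (段老师) and write your own code following these steps: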
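
The dataset properties listed above can be checked with a few lines of Python. The sketch below is only an illustration: the helper name `summarize` is made up, and the four sample rows are invented lines in the crx.data layout (16 comma-separated fields, label last), not real records. It assumes missing fields are written as "?".

```python
def summarize(lines):
    """Return basic size and missing-value statistics for crx-style rows."""
    rows = [line.strip().split(",") for line in lines if line.strip()]
    n_rows = len(rows)
    # The last column is the class label, so it is not counted as an attribute
    n_attributes = len(rows[0]) - 1 if rows else 0
    # A row counts as "missing" if at least one field is exactly "?"
    rows_with_missing = sum(1 for row in rows if "?" in row)
    labels = sorted({row[-1] for row in rows})
    return {
        "rows": n_rows,
        "attributes": n_attributes,
        "missing_row_fraction": rows_with_missing / n_rows if n_rows else 0.0,
        "labels": labels,
    }


# Invented sample rows for illustration only (not taken from the real file)
sample = [
    "a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,00043,560,+",
    "a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,00280,824,+",
    "b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,00100,3,+",
    "b,?,2.0,u,g,w,v,0.5,f,f,0,f,g,?,0,-",
]
print(summarize(sample))
```

To run it on the real file, pass the lines of the opened file to `summarize` instead of `sample`.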

  • Copy and paste your code into the function readfile(file_name), under the comment # Your code here.

  • Make sure your input and output match what is described in the docstring

  • Make a minor improvement to handle missing data: use the string "missing" to represent missing values. Note that they appear as "?" in the raw file.

  • Implement is_missing(value), class_counts(rows), is_numeric(value) as directed in the docstring
  • Implement class Determine. This object represents a node of our DT.
    • It takes 2 inputs (a column index and a value) and has one method, match.

    • We can think of it as the Question we are asking at each node.

  • Implement the method partition(rows, question) as described in the docstring
    • Use the Determine class to partition the data into 2 groups

  • Implement the method gini(rows) as described in the docstring
    • Here is the formula for Gini impurity: $\mathrm{Gini} = 1 - \sum_{i=1}^{n} p_i^2$

      • where $n$ is the number of classes

      • $p_i$ is the proportion of rows that belong to class $i$

  • Implement the method info_gain(left, right, current_uncertainty) as described in the docstring
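    • Here is the formula for Information Gain: $IG = I_{\text{current}} - p_{\text{left}} \cdot \mathrm{Gini}(\text{left}) - p_{\text{right}} \cdot \mathrm{Gini}(\text{right})$

      • where $\mathrm{Gini}(\cdot)$ is the Gini impurity computed by gini(rows)

      • $I_{\text{current}}$ is current_uncertainty

      • $p_{\text{left}}$ is the percentage/probability of the left branch, same story for $p_{\text{right}}$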

  • My code is as follows, for reference only

    def readfile(file_name):
        """
        Read the data file and return structured, cleaned data as a list.

        :param file_name: relative path under data folder
        :return: data, in this case it should be a 2-D list of the form
            [[data1_1, data1_2, ...],
             [data2_1, data2_2, ...],
             [data3_1, data3_2, ...],
             ...]

        i.e.
            [['a', 58.67, 4.46, 'u', 'g', 'q', 'h', 3.04, 't', 't', 6.0, 'f', 'g', '00043', 560.0, '+'],
             ['a', 24.5, 0.5, 'u', 'g', 'q', 'h', 1.5, 't', 'f', 0.0, 'f', 'g', '00280', 824.0, '+'],
             ['b', 27.83, 1.54, 'u', 'g', 'w', 'v', 3.75, 't', 't', 5.0, 't', 'g', '00100', 3.0, '+'],
             ...]

        A couple of things you should note:
        1. You need to handle missing data. In this case let's use "missing" to represent all missing data
        2. Be careful with data types. For instance,
            "58.67" and "0.2356" should be numbers, not strings
            "00043" should be a string, not a number
            It is OK to treat all numbers as float in this case. (You don't need to worry about differentiating integer and float)
        """
        # Your code here
        output = []
        # A context manager closes the file even if parsing fails
        with open(file_name, 'r') as data_file:
            for line in data_file:
                # strip() removes the newline without cutting the last character
                # of a final line that has no trailing newline
                fields = line.strip().split(",")
                if fields == ['']:
                    continue  # skip blank lines
                row = []
                for field in fields:
                    if field == '?':
                        row.append('missing')
                    elif field.isdigit() and len(field) > 1 and field.startswith('0'):
                        # zero-padded codes such as "00043" stay strings
                        row.append(field)
                    else:
                        try:
                            row.append(float(field))
                        except ValueError:
                            # categorical values such as 'a' or '+' stay strings
                            row.append(field)
                output.append(row)
        return output


    def is_missing(value):
        """
        Determine whether the given value is missing data; see readfile(), where the "missing" marker is defined
        :param value: value to be checked
        :return: boolean (True, False) of whether the input value is the same as our "missing" notation
        """
        return value == 'missing'


    def class_counts(rows):
        """
        Count how many data samples there are for each label
        :param rows: Input is a 2D list in the form of what you have returned in readfile()
        :return: Output is a dictionary/map in the form:
            {"label_1": #count,
             "label_2": #count,
             "label_3": #count,
             ...
            }
        """
        # First attempt: hard-coded, so it only works for the two labels in this dataset ('+', '-').
        # The extensions below generalise it to data with arbitrary labels.
        # label_dict = {}
        # count1 = 0
        # count2 = 0
        # # rows is the result returned by readfile
        # for row in rows:
        #     if row[-1] == '+':
        #         count1 += 1
        #     elif row[-1] == '-':
        #         count2 += 1
        # label_dict['+'] = count1
        # label_dict['-'] = count2
        # return label_dict

        # Extension 1
        # Works for any set of labels, using two loops. The first loop collects the distinct labels into label_list.
        # The second loop walks label_list and, for each label, scans all rows again to count the matching ones.
        # label_dict = {}
        # label_list = []
        # for row in rows:
        #     label = row[-1]
        #     if label_list == []:
        #         label_list.append(label)
        #     else:
        #         if label in label_list:
        #             continue
        #         else:
        #             label_list.append(label)
        #
        # for label_i in label_list:
        #     count_row_i = 0
        #     for row_i in rows:
        #         if label_i == row_i[-1]:
        #             count_row_i += 1
        #     label_dict[label_i] = count_row_i
        # print(label_dict)
        # return label_dict
        #

        # Extension 2
        # A single pass: the dict keys record which labels have been seen, and the values accumulate their counts
        label_dict = {}
        for row in rows:
            label = row[-1]
            if label in label_dict:
                label_dict[label] += 1
            else:
                label_dict[label] = 1
        return label_dict


    def is_numeric(value):
        """
        Test if the input is a number (float/int)
        :param value: Input is a value to be tested
        :return: Boolean (True/False)
        """
        # Your code here
        # An earlier attempt used eval() to turn the string form back into a value:
        # if type(eval(str(value))) == int or type(eval(str(value))) == float:
        #     return True
        # eval() is not needed here, and it is a security risk on untrusted input.

        # readfile() has already converted numeric fields to numbers, so a type check is enough.
        # Letters, "missing" and zero-padded codes such as "00043" are strings and return False.
        # bool is excluded because it is a subclass of int.
        return isinstance(value, (int, float)) and not isinstance(value, bool)


    class Determine:
        """
        This class is used for comparison. It stores a column index and a value;
        the match method compares numbers or strings.
        Think of it as the "question" asked at each node of the decision tree, e.g.:
            Is today's temperature cold or hot?
            Is today's weather sunny, cloudy, or rainy?
        """
        def __init__(self, column, value):
            """
            initial structure of our object
            :param column: column index of our "question"
            :param value: splitting value of our "question"
            """
            self.column = column
            self.value = value

        def match(self, example):
            """
            Compares example data and self.value
            note that you need to determine whether the data asked is numeric or categorical/string
            Be careful with missing data
            :param example: a full row of data
            :return: boolean(True/False) of whether input data is GREATER THAN self.value (numeric) or the SAME AS self.value (string)
            """
            # Your code here. "missing" is a string too, so it needs no special case
            example_value = example[self.column]
            # Use ">" only when both sides are numeric. Some columns mix zero-padded
            # strings with numbers (column 10, for example), and an int cannot be ordered against a str.
            if is_numeric(example_value) and isinstance(self.value, (int, float)):
                return example_value > self.value
            return example_value == self.value

        def __repr__(self):
            """
            Used when printing the tree
            :return: the question as a readable string
            """
            # ">" mirrors the strict comparison in match()
            condition = ">" if is_numeric(self.value) else "is"
            # Relies on a module-level list `header` of column names, which is not defined in this snippet
            return "{} {} {}?".format(
                header[self.column], condition, str(self.value))


    def partition(rows, question):
        """
        Split the data: rows that satisfy the question go into true_rows, the rest into false_rows
        :param rows: data set/subset
        :param question: Determine object you implemented above
        :return: 2 lists based on the answer of the question
        """
        # Your code here. question is a Determine object
        true_rows, false_rows = [], []
        # Iterate over the 2-D list because Determine.match only looks at the given column of a single row
        for row in rows:
            if question.match(row):
                true_rows.append(row)
            else:
                false_rows.append(row)
        return true_rows, false_rows


    def gini(rows):
        """
        Compute the Gini value of a set of rows, one way of expressing dispersion
        :param rows: data set/subset
        :return: Gini value, the "impurity"
        """
        data_set_size = len(rows)    # total number of rows
        class_dict = class_counts(rows)
        sum_subgini = 0
        for class_dict_value in class_dict.values():
            sub_gini = (class_dict_value / data_set_size) ** 2
            sum_subgini += sub_gini
        # named impurity so the local variable does not shadow the function name
        impurity = 1 - sum_subgini
        return impurity


    def info_gain(left, right, current_uncertainty):
        """
        Compute the information gain
        Please refer to the .md tutorial for details
        :param left: left branch
        :param right: right branch
        :param current_uncertainty: current uncertainty (data)
        """
        p_left = len(left) / (len(left) + len(right))
        p_right = 1 - p_left
        return current_uncertainty - p_left * gini(left) - p_right * gini(right)


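    # Use this data to test the quality of your own code
    data = readfile("E:datacrx.data")
    # Column 2 is numeric, so the split value is a number; the string '1.8' would never match a float
    t, f = partition(data, Determine(2, 1.8))
    print(info_gain(t, f, gini(data)))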
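
To make the Gini impurity and information gain formulas concrete, here is a small self-contained sketch that is separate from the skeleton code above. It works on plain lists of labels rather than full rows, the names `gini_of_labels` and `info_gain_of_split` are made up for this illustration, and treating an empty branch as pure (impurity 0) is an assumption of the sketch.

```python
from collections import Counter


def gini_of_labels(labels):
    """Gini impurity 1 - sum(p_i ** 2) for a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0  # assumption: an empty branch is treated as pure
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def info_gain_of_split(left, right):
    """Parent impurity minus the size-weighted impurity of the two branches."""
    parent = left + right
    p_left = len(left) / len(parent)
    p_right = 1.0 - p_left
    return (gini_of_labels(parent)
            - p_left * gini_of_labels(left)
            - p_right * gini_of_labels(right))


# Two classes in equal proportion give the largest two-class impurity
print(gini_of_labels(["+", "+", "-", "-"]))          # 0.5
# A split that separates the classes perfectly removes all of that impurity
print(info_gain_of_split(["+", "+"], ["-", "-"]))    # 0.5
# A split that leaves both branches as mixed as the parent gains nothing
print(info_gain_of_split(["+", "-"], ["+", "-"]))    # 0.0
```

The higher the information gain, the better the question separates the classes, which is why a tree builder would compare this value across candidate questions.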

 

January 2, 2019

















































































































































































