Data Analysis Series: Condensed Highlights
Posted by jcjc
Data Analysis (Part 3)
Before analyzing the UCI data, it helps to review a few decision tree concepts.
- A recommended blog post on decision trees: http://www.cnblogs.com/yonghao/p/5061873.html
Basic characteristics of a decision tree (DT):
- A DT is a supervised learning method, so it needs labeled data.
- It runs as a single process, so it is not a good fit for very large datasets.
- PS: it works well on small, clean datasets.
Characteristics of the UCI data (UCI credit approval data set):
- 690 data entries, a relatively small dataset
- 15 attributes, which is quite few
- Missing values account for only 5%
- 2 classes
Comparing these two lists, a DT should work well for our dataset.
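The dataset figures above can be checked once the file has been parsed into a 2-D list. The following is only a minimal sketch: the function name summarize, its missing_marker parameter, and the assumption that the class label is the last column are illustrative and do not come from the skeleton. The source does not say how the 5% is measured, so the sketch assumes it means the fraction of rows with at least one missing value.

```python
def summarize(rows, missing_marker="missing"):
    """Return basic size statistics for a 2-D list of parsed rows.

    Assumption: the last column of every row is the class label.
    """
    n_rows = len(rows)
    n_attributes = len(rows[0]) - 1 if rows else 0
    # Count rows that contain at least one missing attribute value.
    rows_with_missing = sum(
        1 for row in rows if any(value == missing_marker for value in row[:-1])
    )
    return {
        "rows": n_rows,
        "attributes": n_attributes,
        "missing_row_fraction": rows_with_missing / n_rows if n_rows else 0.0,
        "classes": sorted({row[-1] for row in rows}),
    }


if __name__ == "__main__":
    # Hypothetical toy rows, not taken from the real data file.
    sample = [
        ["a", 58.67, "u", "+"],
        ["b", "missing", "u", "+"],
        ["a", 24.5, "missing", "-"],
        ["b", 30.0, "y", "-"],
    ]
    print(summarize(sample))
```

On the real data, the returned values can be compared against the figures in the list above.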
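Putting this together, we can now try implementing a decision tree in code. Using the skeleton provided by instructor Duan (段老师), write your own code in the following steps: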
- Copy and paste your code into the function readfile(file_name) under the comment # Your code here.
- Make sure your input and output match what I described in the docstring.
- Make a minor improvement to handle missing data: in this case, use the string "missing" to represent missing data. Note that it appears as "?" in the data file.
- Implement is_missing(value), class_counts(rows), and is_numeric(value) as directed in the docstring.
- Implement the class Determine. This object represents a node of our DT.
  - It has 2 inputs and a method.
  - We can think of it as the question we are asking at each node.
- Implement the method partition(rows, question) as described in the docstring.
  - Use the Determine class to partition the data into 2 groups.
- Implement the method gini(rows) as described in the docstring.
- Implement the method info_gain(left, right, current_uncertainty) as described in the docstring.
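The helper steps listed above can be sketched roughly as follows. This is a minimal sketch, not the skeleton's actual code: the skeleton and its docstrings are not shown here, so the constructor arguments of Determine (column, value), the method name match, the >= rule for numeric questions, and the assumption that the class label is the last column are all illustrative choices. The impurity used is the standard Gini impurity, 1 minus the sum of squared class proportions.

```python
from collections import Counter


def is_missing(value):
    """Return True if the value is the placeholder for missing data."""
    return value == "missing"


def is_numeric(value):
    """Return True for int or float values (bool is excluded on purpose)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def class_counts(rows):
    """Count rows per class label; the label is assumed to be the last column."""
    return dict(Counter(row[-1] for row in rows))


class Determine:
    """A question asked at a node: compare one column against one value."""

    def __init__(self, column, value):
        self.column = column  # assumed input 1: index of the attribute
        self.value = value    # assumed input 2: value to compare against

    def match(self, row):
        """Numeric questions use >=, all other questions use equality."""
        candidate = row[self.column]
        if is_numeric(self.value):
            # A missing or non-numeric entry cannot satisfy a numeric question.
            return is_numeric(candidate) and candidate >= self.value
        return candidate == self.value


def partition(rows, question):
    """Split rows into (true_rows, false_rows) according to the question."""
    true_rows, false_rows = [], []
    for row in rows:
        (true_rows if question.match(row) else false_rows).append(row)
    return true_rows, false_rows


def gini(rows):
    """Gini impurity: 1 - sum(p_i ** 2) over the class proportions p_i."""
    total = len(rows)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in class_counts(rows).values())


def info_gain(left, right, current_uncertainty):
    """Parent impurity minus the size-weighted impurity of the two children."""
    total = len(left) + len(right)
    if total == 0:
        return 0.0
    p = len(left) / total
    return current_uncertainty - p * gini(left) - (1 - p) * gini(right)


if __name__ == "__main__":
    # Hypothetical toy rows: [attribute, attribute, class label].
    rows = [["a", 58.67, "+"], ["a", 24.5, "+"],
            ["b", "missing", "-"], ["b", 30.0, "-"]]
    left, right = partition(rows, Determine(0, "a"))
    print(class_counts(rows), gini(rows), info_gain(left, right, gini(rows)))
```

On the toy rows, the question on column 0 separates the two classes completely, so both children are pure and the gain equals the parent impurity. A question that leaves both children as mixed as the parent yields a gain of zero.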
My code is as follows, for reference only:
def readfile(file_name):
    """
    This function reads data file and returns structured and cleaned data in a list
    :param file_name: relative path under data folder
    :return: data, in this case it should be a 2-D list of the form
    [[data1_1, data1_2, ...],
     [data2_1, data2_2, ...],
     [data3_1, data3_2, ...],
     ...]
    i.e.
    [['a', 58.67, 4.46, 'u', 'g', 'q', 'h', 3.04, 't', 't', 6.0, 'f', 'g', '00043', 560.0, '+'],
     ['a', 24.5, 0.5, 'u', 'g', 'q', 'h', 1.5, 't', 'f', 0.0, 'f', 'g', '00280', 824.0, '+'],
     ['b', 27.83, 1.54, 'u', 'g', 'w', 'v', 3.75, 't', 't', 5.0, 't', 'g', '00100', 3.0, '+'],
     ...]
    Couple things you should note:
    1. You need to handle missing data. In this case let's use "missing" to represent all missing data
    2. Be careful of data types. For instance,
       "58.67" and "0.2356" should be number and not a string
       "00043" should be string but not a number
    It is OK to treat all numbers as float in this case. (You don't need to worry about differentiating integer and float)
    """
    # Your code here
    data_ = open(file_name, 'r')
    # print(data_)
    lines = data_.readlines()
    output = []
    # never use built-in names unless you mean to replace it
    for list_str in lines:
        str_list = list_str[:-1].split(",")
        # keep it
        # str_list.remove(str_list[len(str_list)-1])
        data = []
        for substr in str_list:
            if substr.isdigit():
                if len(substr) > 1 and substr.startswith('0'):
                    data.append(substr)
                else:
                    substr = int(substr)
                    data.append(substr)
            else:
                try:
                    current = float(substr)
                    data.append(current)
                except ValueError as e:
                    if substr == '?':