Data Analysis Series: Highlights in Brief

Posted jcjc


Data Analysis Series: Highlights in Brief (Part 2)

So, now that we have the datasets provided by UCI, how do we work with them properly?

  • First, download a data file to your local machine, for example the crx.data file

  • We will process the crx.data file using pure Python

  • The steps are as follows

    • input : crx.data file

    • output : A 2-D list

    • it should look like

      >>> output
      [[data_0], [data_1], [data_2], ...]
    • individual data example

      >>> output[0]
      ['b', 30.83, 0, 'u', 'g', 'w', 'v', 1.25, 't', 't', '01', 'f', 'g', '00202', 0, '+']
    • Mind the data types; don't turn every field into a string.

  • My code is as follows, for reference only

       def load_crx(file_name):
           """Parse crx.data into a 2-D list, converting each field to a suitable type."""
           output = []
           # never use built-in names (such as list or str) unless you mean to replace them
           with open(file_name, 'r') as data_file:
               for line in data_file:
                   # drop the trailing newline, then split the record on commas
                   str_list = line.rstrip("\n").split(",")
                   data = []
                   for substr in str_list:
                       if substr.isdigit():
                           if len(substr) > 1 and substr.startswith('0'):
                               # keep zero-padded values such as '01' or '00202' as strings
                               data.append(substr)
                           else:
                               data.append(int(substr))
                       else:
                           try:
                               data.append(float(substr))
                           except ValueError:
                               # non-numeric field: mark '?' as missing, keep anything else as-is
                               if substr == '?':
                                   substr = 'missing'
                               data.append(substr)
                   output.append(data)
           return output

       # path as it appears in the original post; point it at your local copy of crx.data
       file_name = "E:datacrx.data"
       output = load_crx(file_name)
  • After the steps above, we can already feel that we are doing real data work, and we can see the importance of data types
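As a quick sanity check, the conversion rules from the code above can be tried on a single line held in memory. This is only a minimal sketch: the helper name `convert_field` and the variable `sample_line` are illustrative and do not appear in the original code; the sample text is the first record shown in the expected output earlier.

```python
def convert_field(field):
    """Convert one raw field with the same rules as the loop above (hypothetical helper)."""
    if field.isdigit():
        # Zero-padded values such as '01' stay strings; other digit runs become int
        if len(field) > 1 and field.startswith("0"):
            return field
        return int(field)
    try:
        return float(field)
    except ValueError:
        # Non-numeric text: '?' marks a missing value, anything else is kept unchanged
        return "missing" if field == "?" else field


sample_line = "b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+\n"
row = [convert_field(f) for f in sample_line.rstrip("\n").split(",")]
print(row)
```

The printed row should match the individual data example given in the task description, with numbers as `int` or `float` and everything else as strings.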

OK, back to the point. Before you do anything:

It is important for you to at least have a rough idea of what kind of data you are dealing with. For instance, if you have read through all the files in the data folder and the description on the website, you should at least know that:

  • This dataset consists of 690 credit card applicants' personal information and whether or not they were approved for the credit card.

  • Each data entry has 15 attributes, and the data type of each attribute is listed on the website

    • We see that A2, A3, A8, A11, A14, A15 are continuous (numeric)

    • All others are categorical (choices)

  • 37 cases (5%) have one or more missing values

  • This dataset has 2 classes, positive and negative, meaning approved and declined
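These facts can be checked against the parsed 2-D list. The sketch below assumes rows in the format produced earlier (missing values replaced by the string 'missing', class label in the last position). The function name `summarize` is an illustrative choice, and the three short rows are toy data made up for the demonstration, not real records.

```python
from collections import Counter


def summarize(rows):
    """Return basic facts about parsed rows: size, width, missing rows, class counts."""
    widths = {len(row) for row in rows}
    rows_with_missing = sum(1 for row in rows if "missing" in row)
    class_counts = Counter(row[-1] for row in rows)
    return {
        "entries": len(rows),
        "fields_per_entry": sorted(widths),
        "entries_with_missing": rows_with_missing,
        "class_counts": dict(class_counts),
    }


# Toy rows (shortened and invented) just to exercise the function
toy_rows = [
    ["b", 30.83, "t", "+"],
    ["a", "missing", "f", "-"],
    ["missing", 24.5, "missing", "-"],
]
summary = summarize(toy_rows)
print(summary)
```

Run on the full crx.data file, one would expect 690 entries, 16 fields per entry (15 attributes plus the class label), 37 entries with missing values, and two classes, per the dataset description.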

If you haven't already read through all of this information, go back and try to understand your dataset first.

Here is the link:

https://archive.ics.uci.edu/ml/datasets/Credit+Approval
  • From the data files and the description on the website

  • we now understand what this data is actually used for

  • and we also know which attribute and class each value parsed by Python corresponds to
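One way to make that correspondence explicit is to label each parsed row with the attribute names A1 to A15 plus a class key, and then split the columns by kind using the list of continuous attributes given earlier. This is a minimal sketch; the names `label_row`, `ATTRIBUTES`, `CONTINUOUS`, and the key `"class"` are illustrative choices, not something defined by the dataset.

```python
ATTRIBUTES = [f"A{i}" for i in range(1, 16)]          # A1 .. A15
CONTINUOUS = {"A2", "A3", "A8", "A11", "A14", "A15"}  # per the dataset description


def label_row(row):
    """Map a 16-element parsed row to {attribute name: value}, plus a 'class' key."""
    if len(row) != len(ATTRIBUTES) + 1:
        raise ValueError(f"expected 16 fields, got {len(row)}")
    labeled = dict(zip(ATTRIBUTES, row[:-1]))
    labeled["class"] = row[-1]
    return labeled


# First record, as shown in the expected output earlier
row = ['b', 30.83, 0, 'u', 'g', 'w', 'v', 1.25, 't', 't', '01', 'f', 'g', '00202', 0, '+']
labeled = label_row(row)
continuous = {k: v for k, v in labeled.items() if k in CONTINUOUS}
categorical = {k: v for k, v in labeled.items() if k not in CONTINUOUS and k != "class"}
print(continuous)
print(labeled["class"])
```

Note that A11 and A14 are listed as continuous, yet the leading-zero rule in the parsing code keeps '01' and '00202' as strings. Whether to convert them to numbers is a decision to make deliberately before any analysis.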

Now that we know the attributes and classes of this data, stay tuned for the next steps in working with it...

  • December 28, 2018
