Understand the data
A new data set (problem) is a wrapped gift. It’s full of promise and anticipation at the miracles you can wreak once you’ve solved it. But it remains a mystery until you’ve opened it. This chapter is about opening up your new data set so you can see what’s inside, get an appreciation for what you’ll be able to do with the data, and start thinking about how you’ll approach model building with it.
Attributes (the variables being used to make predictions) are also known as the following:
■Predictors
■Features
■Independent variables
■Inputs
Labels are also known as the following:
■Outcomes
■Targets
■Dependent variables
■Responses
Different Types of Attributes and Labels Drive Modeling Choices
The attributes come in two different types: numeric variables and categorical (or factor) variables. Attribute 1 (height) is a numeric variable, the most common type of attribute. Attribute 2 is gender and is indicated by the entry Male or Female. This type of attribute is called a categorical or factor variable. Categorical variables have the property that there’s no order relation between the various values: Male < Female has no meaning (despite centuries of squabbling). Categorical variables can be two‐valued, like Male/Female, or multivalued, like states (AL, AK, AR . . . WY). Other distinctions can be drawn regarding attributes (integer versus float, for example), but they do not have the same impact on machine learning algorithms. The numeric versus categorical distinction matters because many machine learning algorithms take numeric attributes only; they cannot handle categorical or factor variables. Penalized regression algorithms deal only with numeric attributes. The same is true for support vector machines, kernel methods, and K‐nearest neighbors.
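Since several algorithms accept numeric attributes only, a categorical attribute usually has to be converted to numbers before it can be used. One common way to do this (an assumption here, the text above does not prescribe a method) is to give each category its own 0/1 indicator column. The helper name `one_hot_encode` and the sample values are hypothetical, chosen only for illustration:

```python
def one_hot_encode(values):
    """Turn a list of categorical values into 0/1 indicator rows.

    Returns the sorted category names and one row per input value,
    with a 1 in the column of that value's category.
    """
    categories = sorted(set(values))
    column_of = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[column_of[value]] = 1
        encoded.append(row)
    return categories, encoded


# Hypothetical sample of the gender attribute
genders = ["Male", "Female", "Female", "Male"]
categories, encoded = one_hot_encode(genders)

print(categories)
for original, row in zip(genders, encoded):
    print(original, row)
```

The indicator columns carry no ordering between categories, which matches the point that Male < Female has no meaning.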
When the labels are numeric, the problem is called a regression problem. When the labels are categorical, the problem is called a classification problem. If the categorical target takes only two values, the problem is called a binary classification problem. If it takes more than two values, the problem is called a multiclass classification problem.
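The naming rules above can be written down as a small check. This is only a sketch: the function name `problem_type` is hypothetical, and it assumes that numeric labels are true quantities. Integer class codes such as 0/1 would be misread as regression and need to be handled by hand.

```python
def problem_type(labels):
    """Name the problem type implied by a list of labels."""
    def is_numeric(y):
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(y, (int, float)) and not isinstance(y, bool)

    if all(is_numeric(y) for y in labels):
        return "regression"

    n_classes = len(set(labels))
    if n_classes == 2:
        return "binary classification"
    if n_classes > 2:
        return "multiclass classification"
    raise ValueError("need at least two distinct label values")


# Hypothetical label columns
print(problem_type([1.5, 2.0, 3.7]))
print(problem_type(["yes", "no", "yes"]))
print(problem_type(["AL", "AK", "AR"]))
```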
The classification problem might also be simpler than the regression problem. Consider, for instance, the difference in complexity between a topographic map with a single contour line (say the 100‐foot contour line) and a topographic map with contour lines every 10 feet. The single contour divides the map into the areas that are higher than 100 feet and those that are lower, and it contains considerably less information than the more detailed contour map. A classifier is trying to compute a single dividing contour without regard for behavior distant from the decision boundary, whereas regression is trying to draw the whole map. (Note: I don’t fully understand this point yet.)
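One way to read the contour analogy is to take a numeric target and reduce it to a two‐valued one with a threshold. The elevations below are made‐up numbers, used only to illustrate how much detail the single 100‐foot contour discards:

```python
# Hypothetical elevations in feet (the regression target)
elevations_ft = [42.0, 97.5, 100.5, 180.0, 350.0]

# The single 100-foot contour turns each elevation into a class label
THRESHOLD_FT = 100.0
above_contour = [e > THRESHOLD_FT for e in elevations_ft]

# Regression has to reproduce every distinct value;
# the classifier only has to get the side of the contour right.
distinct_regression_targets = len(set(elevations_ft))
distinct_class_targets = len(set(above_contour))

for elevation, label in zip(elevations_ft, above_contour):
    print(elevation, label)
print(distinct_regression_targets, distinct_class_targets)
```

Here 100.5 feet and 350 feet receive the same class label, so an error far from the contour costs the classifier nothing, while a regression model would be penalized for it.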
Items to Check:
Number of rows and columns
Number of categorical variables and number of unique values for each
Missing values
Summary statistics for attributes and labels
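The checklist above can be sketched with the Python standard library alone. The tiny inline data set, its column names, and the helper names `is_number` and `profile` are all hypothetical. The sketch also assumes that an empty field means a missing value and that every row has the same number of fields.

```python
import csv
import io
import statistics

# Hypothetical data set: one missing weight in the second row
RAW = """height,gender,weight
180,Male,80.5
165,Female,
172,Female,61.0
190,Male,95.2
"""


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def profile(header, rows):
    """Collect size, missing counts, and per-column summaries."""
    report = {"n_rows": len(rows), "n_cols": len(header), "columns": {}}
    for j, name in enumerate(header):
        column = [row[j] for row in rows]
        present = [v for v in column if v != ""]
        info = {"missing": len(column) - len(present)}
        if present and all(is_number(v) for v in present):
            nums = [float(v) for v in present]
            info["kind"] = "numeric"
            info["min"] = min(nums)
            info["max"] = max(nums)
            info["mean"] = statistics.mean(nums)
        else:
            info["kind"] = "categorical"
            info["unique"] = sorted(set(present))
        report["columns"][name] = info
    return report


reader = csv.reader(io.StringIO(RAW))
header = next(reader)
rows = [row for row in reader if row]
report = profile(header, rows)

print(report["n_rows"], "rows,", report["n_cols"], "columns")
for name, info in report["columns"].items():
    print(name, info)
```

For a real file, `io.StringIO(RAW)` would be replaced by an open file handle. The same four checks apply to the label column as well as to the attributes.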
Classification Problems: Detecting Unexploded Mines Using Sonar
To be continued.