Hierarchical Bayesian Learning

Posted JoAnna_L


Frequentist inference is a type of statistical inference that draws conclusions from sample data by emphasizing the frequency or proportion of the data. An alternative name is frequentist statistics.

This is the inference framework on which the well-established methodologies of statistical hypothesis testing and confidence intervals are based.

Other than frequentist inference, the main alternative approach to statistical inference is Bayesian inference, while another is fiducial inference.

Beyond the interpretation of probability itself, the frequentist and Bayesian approaches to inference differ in two major ways:

1. In a frequentist approach to inference, unknown parameters are often, but not always, treated as having fixed but unknown values that cannot be treated as random variates in any sense, so no probabilities can be associated with them. Frequentist inference supports making operational decisions and estimating parameters, with or without confidence intervals, and it is based solely on the (one set of) evidence.

In contrast, a Bayesian approach to inference does allow probabilities to be associated with unknown parameters, where these probabilities can sometimes have a frequency probability interpretation as well as a Bayesian one. The Bayesian approach allows these probabilities to be interpreted as representing the scientist's belief that given values of the parameter are true.

Bayesian inference is explicitly based on the evidence and prior opinion, which allows it to be based on multiple sets of evidence.

2. While "probabilities" are involved in both approaches to inference, the probabilities are associated with different types of things. The result of a Bayesian approach can be a probability distribution for what is known about the parameters given the results of the experiment or study. The result of a frequentist approach is either a "true or false" conclusion from a significance test or a conclusion in the form that a given sample-derived confidence interval covers the true value: either of these conclusions has a given probability of being correct, where this probability has either a frequency probability interpretation or a pre-experiment interpretation.

 

Example: A frequentist does not say that there is a 95% probability that the true value of a parameter lies within a particular confidence interval. The claim is instead that 95% of confidence intervals constructed by the same procedure contain the true value.
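The repeated-sampling reading of a confidence interval can be checked with a small simulation. The sketch below is only an illustration under assumed settings that do not come from the text: normally distributed data with a known standard deviation, a z-interval with multiplier 1.96, and hypothetical values for the true mean, sample size, and number of trials.

```python
import math
import random


def coverage_rate(true_mean=10.0, sigma=2.0, n=30, trials=2000, seed=0):
    """Fraction of nominal 95% z-intervals (known sigma) that contain true_mean."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)  # interval half-width with known sigma
    hits = 0
    for _ in range(trials):
        # Draw a fresh sample and build an interval around its mean.
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        centre = sum(sample) / n
        if centre - half_width <= true_mean <= centre + half_width:
            hits += 1
    return hits / trials


# The long-run fraction of intervals covering the fixed true mean.
print(coverage_rate())
```

Each individual interval either contains the true mean or it does not; the 95% figure describes the procedure across many repetitions, which is why the printed fraction should land near 0.95 rather than exactly on it.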

 

Efron's comparative adjectives

| | Bayes | Frequentist |
|---|---|---|
| Basis | Belief (prior) | Behavior (method) |
| Resulting Characteristic | Principled Philosophy | Opportunistic Methods |
| _ | One distribution | Many distributions (bootstrap?) |
| Ideal Application | Dynamic (repeated sampling) | Static (one sample) |
| Target Audience | Individual (subjective) | Community (objective) |
| Modeling Characteristic | Aggressive | Defensive |

Probability and likelihood

A probability refers to variable data for a fixed hypothesis while a likelihood refers to variable hypotheses for a fixed set of data.

Each fixed set of observational conditions is associated with a probability distribution and each set of observations can be interpreted as a sample from that distribution – the frequentist view of probability.

Alternatively, a set of observations may result from sampling any of a number of distributions (each resulting from a set of observational conditions). The probabilistic relationship between a fixed sample and a variable distribution (resulting from a variable hypothesis) is termed likelihood, a Bayesian view of probability.
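The distinction between probability and likelihood can be made concrete with one formula read two ways. The sketch below assumes a binomial model with hypothetical numbers (10 trials, 3 successes, success probability 0.3), none of which come from the text. With the hypothesis fixed and the data varying, the values sum to 1. With the data fixed and the hypothesis varying, the same formula gives a likelihood curve whose area over the hypothesis need not equal 1.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials with success probability p."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


n = 10

# Probability: fixed hypothesis (p = 0.3), variable data k = 0..n.
prob_total = sum(binom_pmf(k, n, 0.3) for k in range(n + 1))

# Likelihood: fixed data (k = 3), variable hypothesis p on a grid over [0, 1].
step = 0.001
grid = [i * step for i in range(1001)]
likelihood = [binom_pmf(3, n, p) for p in grid]

# Trapezoid-rule area under the likelihood curve; it is not constrained to be 1.
area = sum((likelihood[i] + likelihood[i + 1]) / 2.0 * step for i in range(1000))

# Hypothesis on the grid with the highest likelihood for the fixed data.
best_p = grid[max(range(len(grid)), key=likelihood.__getitem__)]

print(prob_total, area, best_p)
```

For this model the area equals 1/(n + 1), so the likelihood becomes a distribution over the hypothesis only after it is combined with a prior and normalized.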

The likelihood principle says that all of the information in a sample is contained in the likelihood function, which is accepted as a valid probability distribution by Bayesians (but not by frequentists).

 

Many statisticians accept the cautionary words of statistician George Box: "All models are wrong, but some are useful."

Bayes’ theorem

The assumed occurrence of a real-world event will typically modify preferences between certain options. This is done by modifying the degrees of belief attached, by an individual, to the events defining the options.

Suppose in a study of the effectiveness of cardiac treatments, with the patients in hospital j having survival probability $\theta_j$, the survival probability will be updated with the occurrence of y, the event in which a hypothetical controversial serum is created which, as believed by some, increases survival in cardiac patients.

In order to make updated probability statements about $\theta_j$, given the occurrence of event y, we must begin with a model providing a joint probability distribution for $\theta_j$ and y. This can be written as a product of the two distributions that are often referred to as the prior distribution $P(\theta)$ and the sampling distribution $P(y \mid \theta)$ respectively:

$$P(\theta, y) = P(\theta)\, P(y \mid \theta)$$

Using the basic property of conditional probability, the posterior distribution will yield:

$$P(\theta \mid y) = \frac{P(\theta, y)}{P(y)} = \frac{P(y \mid \theta)\, P(\theta)}{P(y)}$$

This equation, showing the relationship between the conditional probability and the individual events, is known as Bayes' theorem. This simple expression encapsulates the technical core of Bayesian inference, which aims to incorporate the updated belief, $P(\theta \mid y)$, in appropriate and solvable ways.
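A discrete version of this update is easy to compute by hand or in code. The following sketch is an illustration only: it assumes three hypothetical candidate values for a hospital's survival probability, a uniform prior over them, and made-up data (8 survivors out of 10 treated patients) standing in for the event y. The function name `posterior` is likewise chosen for the example.

```python
import math


def posterior(prior, likelihood):
    """Bayes' theorem on a finite set of hypotheses.

    prior[theta] is P(theta); likelihood[theta] is P(y | theta).
    Returns P(theta | y) = P(y | theta) P(theta) / P(y).
    """
    joint = {theta: prior[theta] * likelihood[theta] for theta in prior}
    evidence = sum(joint.values())  # P(y), the normalizing constant
    return {theta: value / evidence for theta, value in joint.items()}


# Hypothetical candidate survival probabilities with a uniform prior.
thetas = [0.5, 0.7, 0.9]
prior = {theta: 1.0 / len(thetas) for theta in thetas}

# Hypothetical data: 8 of 10 treated patients survive (binomial sampling model).
survived, treated = 8, 10
likelihood = {
    theta: math.comb(treated, survived)
    * theta ** survived
    * (1.0 - theta) ** (treated - survived)
    for theta in thetas
}

post = posterior(prior, likelihood)
for theta in thetas:
    print(theta, round(post[theta], 3))
```

Under these assumed numbers the posterior mass shifts toward 0.7, the candidate closest to the observed survival fraction, while the three posterior probabilities still sum to 1.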

Exchangeability

 

 

Finite exchangeability

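If $y_1, \ldots, y_n$ are independent and identically distributed, then they are exchangeable, but the converse is not necessarily true. For example, suppose a box contains one blue ball and one red ball, and the balls are drawn without replacement. Let $y_i = 1$ if the $i$-th ball drawn is red and $y_i = 0$ otherwise. The probability of drawing the red ball first and the probability of drawing the blue ball first are both 1/2, i.e. $P(y_1 = 1, y_2 = 0) = P(y_1 = 0, y_2 = 1) = 1/2$, so $y_1$ and $y_2$ are exchangeable.

But the probability of selecting a red ball on the second draw given that the red ball has already been selected in the first draw is 0, and is not equal to the probability that the red ball is selected in the second draw, which is 1/2 (i.e. $P(y_2 = 1 \mid y_1 = 1) = 0 \neq P(y_2 = 1) = 1/2$).

Thus, $y_1$ and $y_2$ are not independent.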
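The two-ball example is small enough to enumerate exactly. The sketch below assumes one red and one blue ball drawn without replacement, codes red as 1 and blue as 0, and uses exact fractions; the variable names are chosen for illustration.

```python
from fractions import Fraction
from itertools import permutations

# All equally likely draw orders for one red (1) and one blue (0) ball.
outcomes = list(permutations([1, 0]))
prob = {outcome: Fraction(1, len(outcomes)) for outcome in outcomes}


def p(event):
    """Probability of the set of outcomes for which event(outcome) is true."""
    return sum((pr for outcome, pr in prob.items() if event(outcome)), Fraction(0))


# Exchangeability: swapping the order of the draws leaves the joint probability unchanged.
exchangeable = prob[(1, 0)] == prob[(0, 1)]

# Independence would require P(second red | first red) == P(second red).
p_first_red = p(lambda o: o[0] == 1)
p_second_red = p(lambda o: o[1] == 1)
p_both_red = p(lambda o: o[0] == 1 and o[1] == 1)
p_second_red_given_first_red = p_both_red / p_first_red

print(exchangeable, p_second_red, p_second_red_given_first_red)
```

The joint probabilities are symmetric in the draw order, yet conditioning on the first draw changes the probability of the second, which is exactly the gap between exchangeability and independence.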

Infinite exchangeability

Hierarchical models

Components

Bayesian hierarchical modeling makes use of two important concepts in deriving the posterior distribution, namely:

1. Hyperparameter: parameter of the prior distribution

2. Hyperprior: distribution of a hyperparameter

Say a random variable Y follows a normal distribution with parameter $\theta$ as the mean and 1 as the variance, that is $Y \mid \theta \sim N(\theta, 1)$. The parameter $\theta$ has a prior distribution given by a normal distribution with mean $\mu$ and variance 1, i.e. $\theta \mid \mu \sim N(\mu, 1)$. Furthermore, $\mu$ follows another distribution given, for example, by the standard normal distribution $N(0, 1)$. The parameter $\mu$ is called the hyperparameter, while its distribution given by $N(0, 1)$ is an example of a hyperprior distribution.

The notation of the distribution of Y changes as another parameter is added, i.e. $Y \mid \theta, \mu \sim N(\theta, 1)$. If there is another stage, say, $\mu$ follows another normal distribution with mean $\beta$ and variance $\epsilon$, meaning $\mu \sim N(\beta, \epsilon)$, then $\beta$ and $\epsilon$ can also be called hyperparameters while their distributions are hyperprior distributions as well.
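One way to get a feel for such a hierarchy is to sample from it top-down. The sketch below assumes the three-level normal model with unit variances: a hyperparameter drawn from N(0, 1), a parameter drawn from a normal centred on it, and an observation drawn from a normal centred on the parameter. The sample size and seed are arbitrary choices for the example.

```python
import random
import statistics


def sample_y(rng):
    """Draw one observation Y by sampling each level of the hierarchy in turn."""
    mu = rng.gauss(0.0, 1.0)      # hyperprior: mu ~ N(0, 1)
    theta = rng.gauss(mu, 1.0)    # prior: theta | mu ~ N(mu, 1)
    return rng.gauss(theta, 1.0)  # sampling model: Y | theta ~ N(theta, 1)


rng = random.Random(42)
draws = [sample_y(rng) for _ in range(50000)]

sample_mean = statistics.mean(draws)
sample_var = statistics.variance(draws)
print(sample_mean, sample_var)
```

Because each level adds independent noise of variance 1, the marginal distribution of Y has mean 0 and variance 1 + 1 + 1 = 3, so the sample variance should be close to 3 rather than to the variance 1 of the sampling model alone.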

Framework

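Let $y_j$ be an observation and $\theta_j$ a parameter governing the data generating process for $y_j$.

Assume further that the parameters $\theta_1, \theta_2, \ldots, \theta_j$ are generated exchangeably from a common population, with distribution governed by a hyperparameter $\phi$.
The Bayesian hierarchical model contains the following stages:

Stage I: $y_j \mid \theta_j, \phi \sim P(y_j \mid \theta_j, \phi)$

Stage II: $\theta_j \mid \phi \sim P(\theta_j \mid \phi)$
Stage III: $\phi \sim P(\phi)$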

The likelihood, as seen in stage I, is $P(y_j \mid \theta_j, \phi)$, with $P(\theta_j, \phi)$ as its prior distribution. Note that the likelihood depends on $\phi$ only through $\theta_j$.

The prior distribution from stage I can be broken down into:

$$P(\theta_j, \phi) = P(\theta_j \mid \phi)\, P(\phi)$$ [from the definition of conditional probability]

Here $\phi$ is the hyperparameter, with hyperprior distribution $P(\phi)$.

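Thus, the posterior distribution is proportional to:

$$P(\phi, \theta_j \mid y_j) \propto P(y_j \mid \theta_j, \phi)\, P(\theta_j, \phi)$$ [using Bayes' theorem]
$$P(\phi, \theta_j \mid y_j) \propto P(y_j \mid \theta_j)\, P(\theta_j \mid \phi)\, P(\phi)$$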
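The factorized posterior can be evaluated numerically on a grid. The sketch below is an illustration under assumptions not fixed by the text: all three stages are normal with unit variance (observation centred on the parameter, parameter centred on the hyperparameter, hyperparameter centred on 0), the observed value is a hypothetical y = 3, and the grid limits and resolution are arbitrary.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def grid_posterior_means(y, lo=-8.0, hi=8.0, steps=321):
    """Posterior means of (theta, phi) given y from an unnormalized grid posterior."""
    step = (hi - lo) / (steps - 1)
    grid = [lo + i * step for i in range(steps)]
    total = theta_sum = phi_sum = 0.0
    for phi in grid:
        p_phi = normal_pdf(phi, 0.0, 1.0)  # stage III: hyperprior P(phi)
        for theta in grid:
            weight = (
                normal_pdf(y, theta, 1.0)        # stage I: P(y | theta)
                * normal_pdf(theta, phi, 1.0)    # stage II: P(theta | phi)
                * p_phi
            )
            total += weight
            theta_sum += theta * weight
            phi_sum += phi * weight
    # Dividing by the total plays the role of the normalizing constant P(y).
    return theta_sum / total, phi_sum / total


theta_mean, phi_mean = grid_posterior_means(3.0)
print(theta_mean, phi_mean)
```

For this particular all-normal model the posterior means are also available in closed form, 2y/3 for the parameter and y/3 for the hyperparameter, which gives a quick check on the grid result. A grid is used here only because it mirrors the proportionality statement directly; it does not scale beyond a few dimensions.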

Example

To further illustrate this, consider the example: A teacher wants to estimate how well a male student did on his SAT. He uses information on the student's high school grades and his current grade point average (GPA) to come up with an estimate. His current GPA, denoted by $Y$, has a likelihood given by some probability function with parameter $\theta$, i.e. $Y \mid \theta \sim P(Y \mid \theta)$. This parameter $\theta$ is the SAT score of the student. The SAT score is viewed as a sample coming from a common population distribution indexed by another parameter $\phi$, which is the high school grade of the student. That is, $\theta \mid \phi \sim P(\theta \mid \phi)$. Moreover, the hyperparameter $\phi$ follows its own distribution given by $P(\phi)$, a hyperprior. To solve for the SAT score given information on the GPA,

$$P(\theta, \phi \mid Y) \propto P(Y \mid \theta, \phi)\, P(\theta, \phi)$$
$$P(\theta, \phi \mid Y) \propto P(Y \mid \theta)\, P(\theta \mid \phi)\, P(\phi)$$

All information in the problem is used to solve for the posterior distribution. Instead of relying only on the prior distribution and the likelihood function, the model also uses hyperpriors, which supply additional information and can sharpen beliefs about the behavior of a parameter.

2-stage hierarchical model

In general, the joint posterior distribution of interest in 2-stage hierarchical models is:

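$$P(\theta, \phi \mid Y) = \frac{P(Y \mid \theta, \phi)\, P(\theta, \phi)}{P(Y)} = \frac{P(Y \mid \theta)\, P(\theta \mid \phi)\, P(\phi)}{P(Y)}$$
$$P(\theta, \phi \mid Y) \propto P(Y \mid \theta)\, P(\theta \mid \phi)\, P(\phi)$$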

3-stage hierarchical model

For 3-stage hierarchical models, the posterior distribution is given by:

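$$P(\theta, \phi, X \mid Y) = \frac{P(Y \mid \theta)\, P(\theta \mid \phi)\, P(\phi \mid X)\, P(X)}{P(Y)}$$
$$P(\theta, \phi, X \mid Y) \propto P(Y \mid \theta)\, P(\theta \mid \phi)\, P(\phi \mid X)\, P(X)$$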

