GWAS Background Knowledge

Posted 2023-04-27


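The Hardy-Weinberg law states that, under ideal conditions, the frequency of each allele stays constant from one generation to the next, i.e. the gene pool remains in equilibrium. The law is applied in biology, ecology, and genetics. Conditions: (1) the population is sufficiently large; (2) individuals mate at random; (3) no mutation; (4) no selection; (5) no migration; (6) no genetic drift.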


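I used to confuse these two concepts, MAF filtering and the Hardy-Weinberg equilibrium (HWE) test, but the difference is simple: one filters on allele frequency, the other on genotype frequency. For a site with genotypes AA, AT, and TT, the frequency of A is an allele frequency and the frequency of AA is a genotype frequency. MAF filtering acts directly on the allele frequency. The HWE test instead derives the ideal (AA, AT, TT) distribution from the genotype data, runs a goodness-of-fit test against the observed distribution, and filters on the resulting P value: the smaller the P value, the more the site deviates from Hardy-Weinberg equilibrium.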
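The goodness-of-fit step described above can be sketched in Python. This is a minimal illustration, not the procedure of any particular GWAS tool: the function name `hwe_chi_square` and the toy genotype counts are hypothetical, the site is assumed to be biallelic with both alleles observed, and a plain chi-square test with 1 degree of freedom is used (real pipelines often prefer an exact test).

```python
import math


def hwe_chi_square(n_aa: int, n_at: int, n_tt: int) -> tuple:
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium.

    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    Assumes both alleles are observed, so no expected count is zero.
    """
    n = n_aa + n_at + n_tt
    # Allele frequency of A estimated from the genotype counts.
    p = (2 * n_aa + n_at) / (2 * n)
    q = 1.0 - p
    # Expected genotype counts under HWE: n*p^2, n*2pq, n*q^2.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_at, n_tt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of the chi-square distribution with 1 df:
    # P(Z^2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: the first site matches HWE, the second lacks heterozygotes.
for counts in [(9, 42, 49), (30, 20, 50)]:
    chi2, p_value = hwe_chi_square(*counts)
    print(counts, f"chi2={chi2:.2f}", f"p={p_value:.3g}")
```

A site would then be dropped when its p-value falls below whatever HWE threshold the analysis uses.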

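Principal component analysis (PCA)
Explanation:
PCA is a multivariate statistical method that applies a linear transformation to many variables to obtain a smaller number of important ones. In practice, to analyze a problem thoroughly, researchers often propose many related variables (or factors), because each one reflects some information about the problem to a different degree. But when a multivariate problem is studied with statistical methods, too many variables add complexity. One naturally wants fewer variables that carry more information. In many cases the variables are correlated, and when two variables are correlated, the information they carry about the problem overlaps to some extent. From all the original variables, PCA builds as few new variables as possible such that the new variables are pairwise uncorrelated and retain as much of the original information as possible. PCA was first introduced by Karl Pearson for non-random variables and was later extended by Harold Hotelling to random vectors. The amount of information is usually measured by the sum of squared deviations or by the variance.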

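PCA algorithm
The steps of the PCA algorithm can be summarized as follows.
Given m data points, each with n dimensions:
1) Arrange the original data by columns into a matrix X with n rows and m columns.
2) Zero-center each row of X (each row represents one attribute) by subtracting that row's mean.
3) Compute the covariance matrix $C = \frac{1}{m} X X^{\mathsf{T}}$.
4) Compute the eigenvalues of the covariance matrix and the corresponding eigenvectors.
5) Arrange the eigenvectors as the rows of a matrix, ordered from top to bottom by decreasing eigenvalue, and take the first k rows to form the matrix P.
6) $Y = PX$ is the data reduced to k dimensions.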
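The six steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the function name `pca` and the toy matrix are hypothetical, the covariance matrix is taken as C = (1/m) X Xᵀ on the centered data, and the projection is Y = PX. `np.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np


def pca(X: np.ndarray, k: int) -> tuple:
    """PCA by eigen-decomposition. X has shape (n, m): n attributes, m samples."""
    # Step 2: zero-center each row (attribute).
    Xc = X - X.mean(axis=1, keepdims=True)
    m = X.shape[1]
    # Step 3: covariance matrix C = (1/m) X X^T, shape (n, n).
    C = Xc @ Xc.T / m
    # Step 4: eigenvalues and eigenvectors; eigh returns eigenvalues in
    # ascending order and eigenvectors as columns.
    eigvals, eigvecs = np.linalg.eigh(C)
    # Step 5: order by decreasing eigenvalue and keep the top k eigenvectors as rows of P.
    order = np.argsort(eigvals)[::-1][:k]
    P = eigvecs[:, order].T
    # Step 6: Y = P X is the data reduced to k dimensions, shape (k, m).
    Y = P @ Xc
    return P, Y, eigvals[order]


# Toy data: 2 attributes (rows), 5 samples (columns), already zero-mean.
X = np.array([[-1.0, -1.0, 0.0, 2.0, 0.0],
              [-2.0, 0.0, 0.0, 1.0, 1.0]])
P, Y, top_eigvals = pca(X, k=1)
print("P =", P)
print("Y =", Y)
print("variance kept =", top_eigvals)
```

For this toy matrix the covariance matrix is [[1.2, 0.8], [0.8, 1.2]], so the leading eigenvalue is 2 and the first principal direction lies along the diagonal. The sign of an eigenvector is arbitrary, so P and Y may come out with flipped signs.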

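From the mathematical principles above, we can see some of PCA's capabilities and limits. In essence, PCA takes the directions of greatest variance as the main features and decorrelates the data along orthogonal directions, so that the data are uncorrelated across those directions.
PCA therefore has limitations. It removes linear correlation well but cannot handle higher-order correlation. For data with higher-order correlation, consider Kernel PCA, which uses a kernel function to turn nonlinear correlation into linear correlation (not discussed further here). PCA also assumes that the main features of the data lie along orthogonal directions; if several high-variance directions are not orthogonal, PCA performs much worse.
Finally, PCA is a parameter-free technique: given the same data, and setting data cleaning aside, anyone will get the same result, with no subjective parameters involved. This makes PCA easy to implement generically, but it cannot be tuned to a specific problem.
I hope this article helps readers understand the mathematical basis and implementation of PCA, and with it PCA's applicable scenarios and limits, so that they can use the algorithm more effectively.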

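A Manhattan plot draws the p-values of all SNP sites from a GWAS from left to right across the whole genome. To make the result more intuitive, the p-values are usually converted to -log10(p-value). The height of a site on the Y axis then reflects the strength of its association with the phenotype or disease: the stronger the association (i.e. the lower the p-value), the higher the point. In general, because of linkage disequilibrium (LD), SNPs around a strongly associated site show similar signal strength, which decays toward both sides. As a result, a Manhattan plot shows a series of neat signal peaks (shown in red in the original figure), and the locations of these peaks are usually what the study really cares about. In GWAS, the p-value threshold is generally set below 10^-6 or even 10^-8, although it sometimes depends on how the actual data behave.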
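The transformation behind a Manhattan plot can be sketched without any plotting library. In this minimal example the function name `manhattan_points`, the record layout `(chromosome, position, p-value)`, and the toy values are all hypothetical; the default threshold of 1e-6 is the looser of the two values mentioned above. It assumes every p-value is strictly positive, since -log10(0) is undefined.

```python
import math


def manhattan_points(results, threshold=1e-6):
    """Turn (chrom, pos, p) records into plot-ready points.

    Returns (chrom, pos, -log10(p), is_significant) tuples sorted by
    chromosome and position, i.e. from left to right along the genome.
    """
    points = []
    for chrom, pos, p in sorted(results):
        points.append((chrom, pos, -math.log10(p), p < threshold))
    return points


# Hypothetical SNP results: (chromosome, base-pair position, p-value).
toy_results = [
    (2, 5_000, 3e-9),
    (1, 1_200, 0.04),
    (1, 800, 0.5),
    (2, 4_900, 2e-7),
]

for chrom, pos, height, hit in manhattan_points(toy_results):
    print(f"chr{chrom}:{pos}\t-log10(p)={height:.2f}\tsignificant={hit}")
```

The third element of each tuple is the Y coordinate of the point, and the neighboring hits on chromosome 2 mimic the LD-driven peak described above.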

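The genomic inflation factor λ is defined as the ratio of the median of the empirically observed test statistics to the expected median, and so quantifies how much bulk inflation there is and the excess false positive rate it causes. In other words, λ is the median of the observed chi-square test statistics divided by the expected median of the chi-square distribution. The expected value of λ is 1. The further the observed λ exceeds 1, the more severe the population stratification and the more likely false positives are, so the stratification needs to be corrected again.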
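As a rough sketch of that definition, λ can be computed from a list of p-values using only the Python standard library. The function name `genomic_inflation` is hypothetical, and the sketch assumes the p-values come from two-sided tests with 1 degree of freedom, so each one can be mapped back to a chi-square statistic as the square of a standard normal quantile. The constant 0.4549364 is the median of the chi-square distribution with 1 degree of freedom.

```python
from statistics import NormalDist, median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(p_values):
    """Genomic inflation factor from two-sided p-values of 1-df tests.

    Each p-value must lie in (0, 1].
    """
    std_normal = NormalDist()
    # A 1-df chi-square statistic is a squared z-score, and a two-sided
    # p-value p corresponds to z = Phi^{-1}(p / 2) up to sign.
    chi2_stats = [std_normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2_stats) / CHI2_1DF_MEDIAN


# Evenly spread p-values mimic the null; squaring them mimics inflation.
null_like = [(i - 0.5) / 1001 for i in range(1, 1002)]
inflated = [p * p for p in null_like]
print(f"lambda (null-like) = {genomic_inflation(null_like):.3f}")
print(f"lambda (inflated)  = {genomic_inflation(inflated):.3f}")
```

Under the null, the median p-value is 0.5 and λ comes out close to 1; systematically smaller p-values push λ above 1.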

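With a sequencing depth of 30X and a human genome of about 3 billion bases, I receive 90 billion bases. Bases are written as the characters A, T, C, and G, and each base has a corresponding quality value that is also encoded as a character (search for Phred quality scores), so I receive 180 billion characters in total. Because my sequencing strategy is PE150, I get 90 billion / 150 = 600 million reads.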
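The arithmetic above can be checked in a few lines of Python. The constant names are illustrative; the values are the ones in the text, and the pair count assumes that the 600 million reads are split evenly into forward and reverse reads.

```python
GENOME_SIZE_BP = 3_000_000_000  # approximate human genome size
DEPTH = 30                      # 30X coverage
READ_LENGTH = 150               # PE150: each read is 150 bp long

total_bases = GENOME_SIZE_BP * DEPTH
# Each base call comes with one quality character in the FASTQ file.
total_characters = total_bases * 2
total_reads = total_bases // READ_LENGTH
# Paired-end sequencing produces reads in pairs.
read_pairs = total_reads // 2

print(f"bases:      {total_bases:,}")
print(f"characters: {total_characters:,}")
print(f"reads:      {total_reads:,}")
print(f"read pairs: {read_pairs:,}")
```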

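How is the minor allele frequency (MAF) calculated? Suppose the genotypes at a site are AA, AT, or TT. We can compute the allele frequencies of A and T, with qA + qT = 1, and whichever is smaller is the minor allele frequency. For example, if qA = 0.3 and qT = 0.7, the MAF of this site is 0.3. This filter is used because when the MAF is very small, say below 0.02, most samples carry the same genotype at the site, so the site contributes very little information and increases false positives. In the extreme case of MAF = 0, every sample has the same genotype; such sites contribute no information and only add computation, so sites are filtered by MAF.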
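A minimal sketch of this calculation, assuming a biallelic site and genotypes stored as two-character strings. The function name `minor_allele_frequency`, the toy genotypes, and the site names are hypothetical; the 0.02 cutoff is the example value from the text.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """MAF at one biallelic site from genotype strings such as "AA", "AT", "TT"."""
    allele_counts = Counter(allele for g in genotypes for allele in g)
    if len(allele_counts) < 2:
        return 0.0  # monomorphic site: only one allele observed
    return min(allele_counts.values()) / sum(allele_counts.values())


# 1 AA + 4 AT + 5 TT gives 6 A alleles and 14 T alleles out of 20.
site_1 = ["AA"] + ["AT"] * 4 + ["TT"] * 5
# Every sample is TT, so the site carries no information.
site_2 = ["TT"] * 10

MAF_CUTOFF = 0.02
for name, genotypes in [("site_1", site_1), ("site_2", site_2)]:
    maf = minor_allele_frequency(genotypes)
    print(name, f"MAF={maf:.2f}", "keep" if maf >= MAF_CUTOFF else "drop")
```

The first site reproduces the qA = 0.3 example, and the second is the MAF = 0 case that the filter removes.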

MAF is the Minor Allele Frequency. It can be used to exclude SNPs which are not informative because they show little variation in the sample set being analyzed. For instance, if a SNP shows variation in only 1 of the 89 individuals, it is not useful statistically and should be removed.

In classical genetics, if genes A and B are mutated, and each mutation by itself produces a unique phenotype but the two mutations together show the same phenotype as the gene A mutation, then gene A is epistatic and gene B is hypostatic. For example, the gene for total baldness is epistatic to the gene for brown hair. In this sense, epistasis can be contrasted with genetic dominance, which is an interaction between alleles at the same gene locus. As the study of genetics developed, and with the advent of molecular biology, epistasis started to be studied in relation to quantitative trait loci (QTL) and polygenic inheritance.

An unbiased estimator is an accurate statistic that's used to approximate a population parameter. "Accurate" in this sense means that it's neither an overestimate nor an underestimate. If an overestimate or underestimate does happen, the mean of the difference is called a "bias."

Confounding variables (a.k.a. confounders or confounding factors) are a type of extraneous variable that are related to a study’s independent and dependent variables. A variable must meet two conditions to be a confounder:

If you have collected the data, you can include the possible confounders as control variables in your regression models. In this way, you will control for the impact of the confounding variable.
Characteristics of statistical control:

Definition: An experimental artifact is an aspect of the experiment itself that biases measurements. Example: an early experiment finds that the heart rate of aquatic birds is higher when they are above water than when they are submerged.
Although often used interchangeably, confounds and artifacts refer to two different kinds of threats to the validity of social psychological research.
Within a given social-psychological experiment, researchers are attempting to establish a relationship between a treatment (also known as an independent variable or a predictor) and an outcome (also known as a dependent variable or a criterion). Usually, but not always, they are trying to prove that the treatment causes the outcome and that differential levels of the treatment lead to differential levels.

Confounds are threats to internal validity.[1] Confounds refer to variables that should have been held constant within a specific study but were accidentally allowed to vary (and covary with the independent/predictor variable). A confound exists when the treatment influences the outcome, but not for the theoretical reason proposed by the researchers. Confounds may be related to the "reactivity" of the study (e.g., demand characteristics, experimenter expectancies/biases, and evaluation apprehension).
Suggestions for minimizing confounds include telling participants a believable and coherent cover story (to reduce demand characteristics or to attempt to keep them constant across conditions) and keeping researchers, research assistants, and others who have contact with participants "blind" to the experimental condition to which participants are assigned (to minimize experimenter expectancies/biases).

Artifacts, on the other hand, refer to variables that should have been systematically varied, either within or across studies, but that were accidentally held constant. Artifacts are thus threats to external validity. Artifacts are factors that covary with the treatment and the outcome. Campbell and Stanley[2] identify several artifacts. The major threats to internal validity are history, maturation, testing, instrumentation, statistical regression, selection, experimental mortality, and selection-history interactions.
One way to minimize the influence of artifacts is to use a pretest-posttest control group design. Within this design, "groups of people who are initially equivalent (at the pretest phase) are randomly assigned to receive the experimental treatment or a control condition and then assessed again after this differential experience (posttest phase)".[3] Thus, any effects of artifacts are (ideally) equally distributed in participants in both the treatment and control conditions.
Principal component analysis (PCA) is an effective means of extracting key information from phenotypically complex traits that are highly correlated while retaining the original information (7, 8). PCA can transform a set of correlated variables into a substantially smaller set of uncorrelated variables as principal components (PCs), which can capture most information from the original data (9). In this study, PCA was performed for rice architecture, and a genome-wide association study (GWAS) using PC scores was utilized to identify genetic factors regulating plant architecture. This approach was validated as effective in identifying causal genes associated with plant architecture.

Mechanism. Pleiotropy describes the genetic effect of a single gene on multiple phenotypic traits. The underlying mechanism is genes that code for a product that is either used by various cells or has a cascade-like signaling function that affects various targets.

A mixed model is a good choice here: it will allow us to use all the data we have (higher sample size) and account for the correlations between data coming from the sites and mountain ranges. We will also estimate fewer parameters and avoid problems with multiple comparisons that we would encounter while using separate regressions.

Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters).
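To make the shrinkage idea concrete, here is a minimal sketch of soft-thresholding, the operation that gives the lasso solution in the special case of an orthonormal design matrix. In that case, and with the objective written as half the residual sum of squares plus λ times the sum of absolute coefficients, each lasso coefficient is the ordinary least-squares coefficient shrunk toward zero by λ. The coefficient values and the penalty 0.5 are hypothetical, and a general lasso fit needs an iterative solver rather than this closed form.

```python
def soft_threshold(z: float, lam: float) -> float:
    """Shrink z toward zero by lam; values with |z| <= lam become exactly zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical least-squares coefficients for an orthonormal design.
ols_coefs = [2.5, -0.3, 0.8, -1.7, 0.1]
lasso_coefs = [soft_threshold(b, 0.5) for b in ols_coefs]

for ols, lasso in zip(ols_coefs, lasso_coefs):
    print(f"OLS {ols:+.2f} -> lasso {lasso:+.2f}")
```

Small coefficients are set exactly to zero, which is where the sparse models mentioned above come from.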

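- Estimation uses maximum likelihood.
Terms: fixed effects and random effects.
$Y = X\beta + Zu + \varepsilon$
The expression has two parts, called the fixed part ($X\beta$) and the random part ($Zu + \varepsilon$). Compared with an ordinary linear model, the linear mixed model mainly decomposes the original random error in finer detail.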

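Earlier we described how to interpret analysis of variance (ANOVA) through a model, namely the ANOVA model. Take one-way ANOVA as an example: suppose the single factor is occupation and the dependent variable is income. The one-way ANOVA model can then be written as:
$y_{ij} = \mu + a_i + \varepsilon_{ij}$
$\mu$ is the average monthly income of all respondents
$a_i$ is the effect of the i-th occupation on average monthly income
$\varepsilon_{ij}$ is the random error of this respondent relative to the average monthly income of the i-th occupation
$y_{ij}$ is the income of a given respondent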

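Interpreting ANOVA through a model is therefore the more precise approach. To review this material, see the article "SPSS Analysis Techniques: Model Interpretation of One-Way ANOVA Results".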

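In the earlier introduction to ANOVA we covered several types step by step: one-way ANOVA, multi-way ANOVA, and ANOVA with random factors and covariates. What if all of these occur in the same analysis? Here we introduce the most basic mixed-effects model, the linear mixed model, which is one of the basic models for this situation.
Video: https://www.youtube.com/watch?v=zM4VZR0px8E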

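The linear mixed model is more complex than the ANOVA models above, so we explain it with an example. Suppose we have data on 5,000 students from 100 schools, with the following variables:
==student ID, school name, school type, seat number, sex, entrance score, zhongkao (senior high school entrance exam) score==
Suppose the goal is to build a regression equation for the zhongkao score with the entrance score as the independent variable. Following the standard ANOVA-model approach, the entrance score (interval data) is a covariate, while school (100 schools), school type (boys' school, girls' school, military-style school), and sex (male, female) are factors, some of them fixed and some random.
Suppose we consider only the school factor (school) and the entrance score (Rscores) when building the regression model for the zhongkao score. If school is treated as a fixed factor (100 schools), the model is:
$y_{ij} = \mu + \mathrm{Rscores}_{ij} + \mathrm{school}_j + \varepsilon_{ij}$
$y_{ij}$ is the zhongkao score of a given student
$\mathrm{Rscores}_{ij}$ is the effect of that student's entrance score (the student's starting level) on the zhongkao score
$\mathrm{school}_j$ is the effect of the school on that student's zhongkao score
$\varepsilon_{ij}$ is the random error between students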

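Rewritten as a regression model, this becomes:
$y_{ij} = a + \beta_1 \mathrm{Rscores}_{ij} + \sum_j \beta_j \mathrm{school}_j + e_{ij}$
$\beta_1$ is the effect of the entrance score (the regression coefficient)
$\beta_j$ is the effect of the j-th school on the zhongkao score
$e_{ij}$ is the random error of the i-th student in the j-th school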

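The regression equation above looks fine, but from another angle it ignores a good deal of deeper information, as the following two plots show:

The left scatter plot shows data from a single school; the right one includes data from four schools. The trend lines in the two plots show that school-driven differences in students' zhongkao scores (the dependent variable) include differences in both intercept and slope.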

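If we consider only a single school, the regression model can be written as:

$y_i = \alpha + \beta_1 \mathrm{Rscores}_i + e_i$
where the subscript i denotes the i-th student. For one school on its own this model is adequate, but problems appear once several schools are considered together. The right-hand plot shows that teaching quality differs between schools, which means that the scores of students within the same school are not independent: students at good schools tend to score higher, and students at weak schools tend to score lower.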

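The right-hand plot contains data from four schools, and the four regression lines have different intercepts. This difference reflects the difference in teaching quality between schools: students with the same entrance score may end up with different expected zhongkao scores after studying at different schools. Allowing the intercept to vary, the model is extended to:

$y_{ij} = (a_0 + u_{0j}) + \beta_1 \mathrm{Rscores}_{ij} + e_{ij}$
$y_{ij}$ is the zhongkao score of the i-th student in the j-th school
$a_0$ is the overall average level across schools
$u_{0j}$ is the variation in zhongkao score caused by differences between schools
$\mathrm{Rscores}_{ij}$ is the entrance score, i.e. the student's starting level
$\beta_1$ is how strongly the student's starting level affects the zhongkao score
$e_{ij}$ is the random error between students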

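The right-hand plot also shows that, besides the intercepts, the slopes of the regression lines differ. In other words, the clustering of scores within schools shows up not only as different average levels but also as different dispersion of scores across schools, that is, in how strongly the entrance score affects the zhongkao score. Schools with a steep slope show a stronger effect and schools with a shallow slope a weaker one. The model therefore needs a further extension:
$y_{ij} = (a_0 + u_{0j}) + (\beta_1 + u_{1j}) \mathrm{Rscores}_{ij} + e_{ij}$
$u_{1j}$ is the school-specific deviation in the slope, i.e. how the effect on the zhongkao score differs between schools
Rearranging the expression gives:
$y_{ij} = (a_0 + \beta_1 \mathrm{Rscores}_{ij}) + (u_{0j} + u_{1j} \mathrm{Rscores}_{ij} + e_{ij})$
The expression has two parts, called the fixed part and the random part. Compared with an ordinary linear model, the linear mixed model mainly decomposes the original random error in finer detail.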
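The random-intercept, random-slope model above can be made tangible with a small simulation. This sketch does not fit a mixed model; it only generates data from the model and fits a separate least-squares line per school, so that the differing intercepts and slopes become visible. All numbers (average score 60, slope 5, the standard deviations, 4 schools of 50 students) and the function names are hypothetical, and entrance scores are simulated as centered standard normal values.

```python
import random


def simulate_school(rng, n_students, a0, b1, sd_u0, sd_u1, sd_e):
    """Simulate one school: y = (a0 + u0) + (b1 + u1) * r + e."""
    u0 = rng.gauss(0.0, sd_u0)  # school-specific intercept deviation u_0j
    u1 = rng.gauss(0.0, sd_u1)  # school-specific slope deviation u_1j
    rows = []
    for _ in range(n_students):
        r = rng.gauss(0.0, 1.0)    # centered entrance score Rscores_ij
        e = rng.gauss(0.0, sd_e)   # student-level error e_ij
        fixed = a0 + b1 * r        # fixed part
        rand = u0 + u1 * r + e     # random part
        rows.append((r, fixed + rand, fixed, rand))
    return u0, u1, rows


def ols_line(xs, ys):
    """Least-squares (intercept, slope) for one school's data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


rng = random.Random(0)
for school in range(4):
    u0, u1, rows = simulate_school(rng, 50, a0=60.0, b1=5.0,
                                   sd_u0=4.0, sd_u1=1.5, sd_e=3.0)
    intercept, slope = ols_line([r[0] for r in rows], [r[1] for r in rows])
    print(f"school {school}: true line ({60 + u0:.1f}, {5 + u1:.1f}), "
          f"fitted line ({intercept:.1f}, {slope:.1f})")
```

A real analysis would estimate the fixed coefficients and the variances of the random terms jointly with a mixed-model package instead of fitting each school separately.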

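Gene Set Analysis in GWAS
Gene set analysis (GSA) performs association analysis at the gene or pathway level. It builds on SNP-level GWAS results and mines them at a higher level to find more useful information. MAGMA is a tool for GSA. Its official website describes it as follows:

MAGMA is a tool for gene analysis and generalized gene-set analysis of GWAS data. It can be used to analyze both raw genotype data as well as summary SNP p-values from a previous GWAS or meta-analysis.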

![GWAS websites and software](https://note.youdao.com/src/82618652255B494594E3000ED751969C)

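GWAS analysis has two major pitfalls:
Pitfall 1: the association results are false positives (there are results, but they are wrong);
Pitfall 2: the target trait is controlled by many genes, each with too weak an effect, so no significantly associated site is found (no results at all).
Common ways to deal with these two pitfalls include:
1) Increase the sample size to improve statistical power.
2) Improve the phenotyping system.
   - Improve the precision of phenotyping;
   - Evaluate phenotypes with multi-dimensional methods, for example metabolomics.
3) Make full use of prior information.
   - Use candidate genes or known reference genes to lower the threshold within reason.
4) Pay attention to controlling and optimizing the statistical model.
   - Correct for population structure, kinship, and outlier samples;
   - Account for other factors such as sex, daily routine, and age.
5) Validate candidate genes with a multi-stage design.
   - Stage I: use a relaxed threshold to obtain candidate sites;
   - Stages II to n: validate in independent populations.
6) Use gene-based or pathway-based association analysis to improve statistical power.
7) Add more omics data for joint analysis, for example transcriptome and epigenome data.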

TWAS: "Opportunities and challenges for transcriptome-wide association studies"

"Integrative approaches for large-scale transcriptome-wide association studies"

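Mendelian randomization
Mendelian randomization (MR) is a study design that follows Mendel's law that parental alleles are randomly assigned to offspring. If a genotype determines a phenotype and is associated with disease through that phenotype, the genotype can be used as an instrumental variable to infer the association between the phenotype and the disease.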

The SNP is associated with the exposure.
The SNP is not associated with confounding variables.
The SNP is associated with the outcome only through the exposure.
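A small simulation can illustrate why a SNP that meets these three assumptions is useful as an instrumental variable. The sketch uses the Wald ratio, the simplest instrumental-variable estimator: the SNP-outcome covariance divided by the SNP-exposure covariance. The source does not name an estimator, so this choice is an assumption, and every number below (allele frequency 0.3, SNP effect 0.5, true causal effect 0.3, 20,000 samples) is hypothetical.

```python
import random


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def simulate(n, true_effect, rng):
    """Simulate genotype Z, exposure X, and outcome Y with an unobserved confounder U."""
    zs, xs, ys = [], [], []
    for _ in range(n):
        z = sum(rng.random() < 0.3 for _ in range(2))  # allele count 0, 1, or 2
        u = rng.gauss(0.0, 1.0)                        # confounder, independent of Z
        x = 0.5 * z + u + rng.gauss(0.0, 1.0)          # Z is associated with X
        y = true_effect * x + u + rng.gauss(0.0, 1.0)  # Z reaches Y only through X
        zs.append(z)
        xs.append(x)
        ys.append(y)
    return zs, xs, ys


zs, xs, ys = simulate(20_000, true_effect=0.3, rng=random.Random(1))
ols_estimate = covariance(xs, ys) / covariance(xs, xs)   # biased by the confounder
wald_estimate = covariance(zs, ys) / covariance(zs, xs)  # Wald ratio using the SNP

print(f"true effect:   0.30")
print(f"naive OLS:     {ols_estimate:.2f}")
print(f"Wald ratio IV: {wald_estimate:.2f}")
```

Because the confounder pushes the exposure and the outcome in the same direction, the naive regression overestimates the effect, while the Wald ratio stays close to the simulated value.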

Loading the GWAS Catalog in R

library(readr)
# Read the tab-separated GWAS Catalog export into a tibble.
gwas_cat <- read_tsv("https://www.genome.gov/admin/gwascatalog.txt")
