9.2 GWAS：关联分析——TASSEL（GLM/MLM/CMLM)

Posted 2023-05-10

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了9.2 GWAS：关联分析——TASSEL（GLM/MLM/CMLM)相关的知识，希望对你有一定的参考价值。

参考技术A TASSEL是最早出现的用于动植物关联分析的软件，还可以对进化模式以及连锁不平衡进行评估，功能非常强大，要说缺点，可能就是真的有点慢。

表型数据处理在下面这篇帖子中有介绍，这里使用BLUE值进行关联分析。
3.2 GWAS：最佳线性无偏估计量——BLUE值计算（多年单点有重复） - (jianshu.com)

Tassel的安装在亲缘关系计算中有提到：
8.GWAS：亲缘关系——TASSEL&GCTA - (jianshu.com)

关联分析所用到的vcf文件是在上一步亲缘关系中，进行排序后的文件

将群体结构分析中生成的.Q文件，增加一列对应的sample名，一行亚群名。
5. GWAS：群体结构——Admixture - (jianshu.com)

亲缘关系得到的kinship文件进行整理，第一行为sample数，第一列为sample名，中间为矩阵，下图以GCTA结果为例。
8. GWAS：亲缘关系——TASSEL&GCTA - (jianshu.com)

-fork1 vcf文件 Troot.vcf
-fork2 表型数据文件 trait.txt
-fork3 群体结构Q文件 Q.txt

混合线性模型中要加入系谱矩阵，即亲缘关系K矩阵。

结果文件：
主要关注第六列p值，以及第七列marker_Rsq即R2贡献率。

R语言之Logistic回归分析

一、probit回归模型
在R中，可以使用glm函数(广义线性模型)实现，只需将选项binomial选项设为probit即可，并使用summary函数得到glm结果的细节，但是和lm不同，summary对于广义线性模型并不能给出决定系数，需要使用pscl包中的pR2函数得到伪决定系数，然后再使用summary得到细节
> library(RSADBE)
> data(sat)
> pass_probit <- glm(Pass~Sat,data=sat,binomial(probit))
> summary(pass_probit)
> library(pscl)
> pR2(pass_probit)
> predict(pass_probit,newdata=list(Sat=400),type = "response")
> predict(pass_probit,newdata=list(Sat=700),type = "response")

二、logistic回归模型

可使用glm函数和其选项family=binomial拟合logistic回归模型。

> library(RSADBE)
> data(sat)
> pass_logistic <- glm(Pass~Sat,data=sat,family = ‘binomial‘)
> summary.glm(pass_logistic)
> pR2(pass_logistic)
> with(pass_logistic, pchisq(null.deviance - deviance, df.null
+ - df.residual, lower.tail = FALSE))
> confint(pass_logistic)
> predict.glm(pass_logistic,newdata=list(Sat=400),type = "response")
> predict.glm(pass_logistic,newdata=list(Sat=700),type = "response")
> sat_x <- seq(400,700, 10)
> pred_l <- predict(pass_logistic,newdata=list(Sat=sat_x),type="response")
> plot(sat_x,pred_l,type="l",ylab="Probability",xlab="Sat_M")
上述代码解释：
通过glm函数拟合logistic模型，并通过summary.glm得到模型结果细节，其中Null deviance和Residual deviance类似于线性回归模型中的残差平方和，看用来评估拟合优度，Null deviance是没有任何信息的模型的残差，如果自变量对因变量有影响，则Residual deviance应该明显小于Null deviance。
使用pR2函数得到伪决定系数，通过with函数得到模型整体的显著性水平，我们通过summary.glm函数得到了null.deviance、deviance,、df.null、df.residual，使用with函数提取并通过pchisq函数得到偏差为null.deviance - deviance，自由度为df.null - df.residual的P值。

使用confint函数得到回归系数的置信区间，并通过predict.glm预测自变量为400和700时的模型的值。

使用plot函数做模型图。

三、使用Hosmer-Lemeshow拟合优度检验
构造该统计量的步骤为
1.使用分类和拟合函数对拟合值排序
2.将排序后的拟合值分为g组，g的值一般选择6-10
3.找到每组观察和预期的数
4.对这些组进行卡方拟合优度检验。

实现代码为
> pass_hat <- fitted(pass_logistic)
> hosmerlem <- function(y, yhat, g=10)    {
+   cutyhat <- cut(yhat,breaks = quantile(yhat, probs=seq(0,1, 1/g)), include.lowest=TRUE)
+      obs = xtabs(cbind(1 - y, y) ~ cutyhat)
+      expect = xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
+      chisq = sum((obs - expect)^2/expect)
+      P = 1 - pchisq(chisq, g - 2)
+   return(list(chisq=chisq,p.value=P))
+                    }
> hosmerlem(pass_logistic$y, pass_hat)
首先用fitted函数提取拟合值，然后自定义函数进行计算

四、广义线性模型的残差图
广义线性模型的残差和一般线性模型的残差有所不同，但是作用类似
1.响应残差
真实值和拟合值之间的差
2.异常残差
对第i个观测值，异常残差是模型中异常观测值总和的平方根。
3.皮尔逊残差
4.局部残差
5.沃金残差
可以使用residuals函数得到上述残差

> library(RSADBE)
> data(sat)
> pass_logistic <- glm(Pass~Sat,data=sat,family = ‘binomial‘)
> par(mfrow=c(1,3), oma=c(0,0,3,0))
> plot(fitted(pass_logistic), residuals(pass_logistic,"response"), col="red", > xlab="Fitted Values", ylab="Response Residuals")
> points(fitted(pass_probit), residuals(pass_probit,"response"), col="green")
> abline(h=0)
> plot(fitted(pass_logistic), residuals(pass_logistic,"deviance"), col="red", > xlab="Fitted Values", ylab="Deviance Residuals")
> points(fitted(pass_probit), residuals(pass_probit,"deviance"), col="green")
> abline(h=0)
> plot(fitted(pass_logistic), residuals(pass_logistic,"pearson"), col="red",   xlab="Fitted Values", ylab="Pearson Residuals")
> points(fitted(pass_probit), residuals(pass_probit,"pearson"), col="green")
> abline(h=0)
> title(main="Response, Deviance, and Pearson Residuals Comparison for the Logistic and > Probit Models",outer=TRUE)

上述代码分别计算了响应残差、异常残差和皮尔逊残差并作图

五、广义线性模型的影响点和杠杆点
和一般线性模型基本一样，广义线性模型也是使用hatvalues、cooks.distance、dfbetas、dffits计算影响点和杠杆点，但是判断标准有所改变
1.hatvalues值大于2(p+1)/2，可认为观测值有杠杆效应
2.Cooks距离比F分布的10%分位数大，可认为该观测值对参数估计有影响，如果超出50%分位数，则认为是强影响点
3.dfbetas、dffits的经验法则是，如果绝对值大于1，则认为观测值对协变量存在影响

> hatvalues(pass_logistic)
> cooks.distance(pass_logistic)
> dfbetas(pass_logistic)
> dffits(pass_logistic)
> cbind(hatvalues(pass_logistic),cooks.distance(pass_logistic),
dfbetas(pass_logistic),dffits(pass_logistic))
> hatvalues(pass_logistic)>2*(length(pass_logistic$coefficients)-1)
/length(pass_logistic$y)
> cooks.distance(pass_logistic)>qf(0.1,length(pass_logistic$coefficients),
length(pass_logistic$y)-length(pass_logistic$coefficients))
> cooks.distance(pass_logistic)>qf(0.5,length(pass_logistic$coefficients),
length(pass_logistic$y)-length(pass_logistic$coefficients))
> par(mfrow=c(1,3))
> plot(dfbetas(pass_logistic)[,1],ylab="DFBETAS - INTERCEPT")
> plot(dfbetas(pass_logistic)[,2],ylab="DFBETAS - SAT")
> plot(dffits(pass_logistic),ylab="DFFITS")

以上是关于9.2 GWAS：关联分析——TASSEL（GLM/MLM/CMLM)的主要内容，如果未能解决你的问题，请参考以下文章

GWAS分析- P值计算过程 (七)

全基因组关联分析（GWAS）的计算原理

一行命令学会全基因组关联分析(GWAS)的meta分析

全基因组关联分析GWAS专题2——连锁不平衡

9.3 GWAS：关联分析——EMMAX

绘图之全基因组关联分析可视化(GWAS)