GATK - Read groups

Posted 2023-04-23



Running GATK failed with the error: java.lang.IllegalArgumentException: samples cannot be empty
This problem has been discussed on the GATK Forum:
IllegalArgumentException: samples cannot be empty

Picard ValidateSamFile

ValidateSamFile has two modes:
VERBOSE: exits after reporting 100 errors and prints the errors to the terminal;
SUMMARY: outputs a table showing the number of errors and warnings.

The problem turned out to be MISSING_READ_GROUP in the BAM file; the WARNING entries in the output can be ignored.
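The validation step above can be scripted. The following is a minimal sketch that only builds the command line; the jar location picard.jar and the input name sample.bam are placeholders, and it assumes Picard's classic KEY=VALUE argument style (I=, MODE=, IGNORE_WARNINGS=).

```python
import shlex


def build_validate_cmd(bam_path, mode="SUMMARY", ignore_warnings=True,
                       picard_jar="picard.jar"):
    """Return the argument list for a Picard ValidateSamFile run.

    picard_jar is a placeholder path; adjust it to your installation.
    """
    if mode not in ("VERBOSE", "SUMMARY"):
        raise ValueError("mode must be VERBOSE or SUMMARY")
    cmd = ["java", "-jar", picard_jar, "ValidateSamFile",
           "I=" + bam_path, "MODE=" + mode]
    if ignore_warnings:
        # Hide WARNING entries so that only errors are reported
        cmd.append("IGNORE_WARNINGS=true")
    return cmd


cmd = build_validate_cmd("sample.bam")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=False)
```

Building the list separately from running it makes the command easy to inspect before anything is executed.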

Aligning with bwa using the -R option

Using the bwa -R option

See the original English version
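As a rough illustration of the -R option, the sketch below assembles a read group string for bwa mem. The values (FLOWCELL1.LANE1, MOM, LIB-MOM-1) and the file names are made up, and the use of bwa mem is an assumption. bwa expects the fields to be separated by the two characters backslash and t, which it converts to tabs in the output SAM header.

```python
def bwa_read_group(rg_id, sample, platform="ILLUMINA", library=None, unit=None):
    """Build the string passed to `bwa mem -R` (all values are illustrative)."""
    fields = [("ID", rg_id), ("SM", sample), ("PL", platform),
              ("LB", library), ("PU", unit)]
    parts = ["@RG"] + [f"{tag}:{value}" for tag, value in fields if value]
    # Join with a literal backslash + "t", not a real tab character
    return "\\t".join(parts)


rg = bwa_read_group("FLOWCELL1.LANE1", "MOM", library="LIB-MOM-1")
# Hypothetical reference and FASTQ file names
cmd = ["bwa", "mem", "-R", rg, "ref.fa", "reads_1.fq", "reads_2.fq"]
print(" ".join(cmd))
```

Passing cmd as a list to subprocess.run avoids shell quoting problems with the backslashes.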

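During sequencing:
When one library is prepared from a sample and sequenced on a single lane, the resulting set of reads can be defined as one read group;
when a sample is prepared as separate, independent libraries, the resulting sets of reads belong to different read groups.
In SAM/BAM/CRAM files, a read group is described by a set of tags that record technical features of the sequencing run. With this information available, the BAM file can go through duplicate marking and base quality score recalibration.
GATK requires input BAM files to contain read groups and fails with an error if they are missing.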
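One quick way to see whether a file carries read groups is to look at the @RG lines of its header. The sketch below parses header text that has already been obtained (for example from samtools view -H); the header shown is invented for illustration.

```python
def read_groups_from_header(header_text):
    """Parse the @RG lines of a SAM header into a list of tag dictionaries."""
    groups = []
    for line in header_text.splitlines():
        columns = line.split("\t")
        if columns[0] != "@RG":
            continue
        # Each remaining column looks like TAG:value
        groups.append(dict(c.split(":", 1) for c in columns[1:] if ":" in c))
    return groups


# Invented header, for illustration only
header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "@RG\tID:FLOWCELL1.LANE1\tSM:MOM\tPL:ILLUMINA\tLB:LIB-MOM-1\n"
)

groups = read_groups_from_header(header)
if not groups:
    print("No @RG lines found; GATK is likely to reject this file")
for rg in groups:
    print(rg.get("ID"), rg.get("SM"))
```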

Example:

Read group format required by GATK
A read group line starts with @RG.

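ID = Read group identifier
Each read group has its own unique ID.
For Illumina data, read group IDs are composed of the flowcell name and the lane number.
During base quality score recalibration, read group IDs are needed to separate technical batch effects: all reads in the same read group are assumed to share the same error model.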

PU = Platform Unit
The Platform Unit consists of three parts: FLOWCELL_BARCODE.LANE.SAMPLE_BARCODE
FLOWCELL_BARCODE refers to the unique identifier for a particular flow cell;
LANE indicates the lane of the flow cell;
SAMPLE_BARCODE is a sample/library-specific identifier.
PU is not required by GATK, but when both PU and ID are present, PU takes precedence over ID.

SM = Sample
The name of the sample the reads belong to. SM must be set correctly, because the VCF files that GATK produces use this name as well.

PL = Platform/technology used to produce the read
The sequencing platform: ILLUMINA, SOLID, LS454, HELICOS or PACBIO.

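LB = DNA preparation library identifier
When duplicates are marked, LB is used to determine which read groups might contain molecular duplicates, because the same library is sometimes sequenced on several lanes. To keep the lanes apart, reads produced on different lanes get their own read group ID, whether they come from the same library or from different ones.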
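The field descriptions above can be turned into a small sanity check. This is only a sketch: treating ID, SM, PL and LB as mandatory and matching PL case-insensitively are assumptions of the example, not rules confirmed by the post, and the sample values are invented.

```python
# Assumption of this sketch: these four tags are treated as mandatory
REQUIRED_TAGS = ("ID", "SM", "PL", "LB")
KNOWN_PLATFORMS = {"ILLUMINA", "SOLID", "LS454", "HELICOS", "PACBIO"}


def check_read_group(rg):
    """Return a list of problems found in one read group dictionary."""
    problems = [f"missing {tag}" for tag in REQUIRED_TAGS if not rg.get(tag)]
    platform = rg.get("PL")
    if platform and platform.upper() not in KNOWN_PLATFORMS:
        problems.append(f"unexpected PL value {platform!r}")
    return problems


def recalibration_key(rg):
    """Mirror the rule that PU takes precedence over ID when both exist."""
    return rg.get("PU") or rg["ID"]


# Invented values, for illustration only
rg = {"ID": "FLOWCELL1.LANE1", "PU": "FLOWCELL1.1.ACGT",
      "SM": "MOM", "PL": "ILLUMINA", "LB": "LIB-MOM-1"}
print(check_read_group(rg))
print(recalibration_key(rg))
```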

Extracting ID and PU from read names
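A hedged sketch of how this extraction might look, assuming read names follow the colon-separated Illumina layout instrument:run:flowcell:lane:tile:x:y. Read-name layouts differ between instruments and pipelines, so check your own FASTQ files first; the read name and barcode below are invented.

```python
def id_and_pu_from_read_name(read_name, sample_barcode):
    """Derive a read group ID and PU from an Illumina-style read name.

    Assumes the layout instrument:run:flowcell:lane:tile:x:y, optionally
    followed by a space and further fields, which are ignored here.
    """
    fields = read_name.lstrip("@").split()[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected read name layout: {read_name!r}")
    flowcell, lane = fields[2], fields[3]
    rg_id = f"{flowcell}.{lane}"                  # flowcell name + lane number
    pu = f"{flowcell}.{lane}.{sample_barcode}"    # FLOWCELL_BARCODE.LANE.SAMPLE_BARCODE
    return rg_id, pu


# Invented read name and barcode, for illustration only
name = "@INSTR1:42:FLOWCELL1:2:1101:10003:23460 1:N:0:ACGT"
print(id_and_pu_from_read_name(name, "ACGT"))
```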

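Example: multi-sample and multiplexed
Three samples: MOM, DAD, KID;
Library prep: two libraries per sample, one with 200 bp inserts and one with 400 bp inserts;
Sequencing: each library is run on two lanes of an Illumina HiSeq;
Reads: 3 x 2 x 2 = 12 lanes of data, which can be written out as 12 BAM files, each carrying its own @RG header line.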
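The twelve read groups of this example can be enumerated with a short sketch. The naming scheme (FLOWCELLn.LANEn, LIB-sample-insert) and the figure of eight lanes per flowcell are assumptions made for illustration; they are not the original table.

```python
from itertools import product

SAMPLES = ["MOM", "DAD", "KID"]
INSERT_SIZES = [200, 400]      # one library per insert size
LANES_PER_LIBRARY = 2
LANES_PER_FLOWCELL = 8         # assumption of this sketch


def trio_read_groups():
    """Enumerate one read group per (sample, library, lane) combination."""
    groups = []
    lane_index = 0
    for sample, insert in product(SAMPLES, INSERT_SIZES):
        library = f"LIB-{sample}-{insert}"
        for _ in range(LANES_PER_LIBRARY):
            flowcell = lane_index // LANES_PER_FLOWCELL + 1
            lane = lane_index % LANES_PER_FLOWCELL + 1
            groups.append({
                "ID": f"FLOWCELL{flowcell}.LANE{lane}",
                "PL": "ILLUMINA",
                "LB": library,
                "SM": sample,
                "PI": str(insert),
            })
            lane_index += 1
    return groups


for rg in trio_read_groups():
    print("@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in rg.items()))
```

Each sample ends up with four read groups that share one SM value, and each library appears in two read groups that share one LB value.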

References:
Read groups
Picard
Learning whole-genome sequencing (WGS) data analysis from scratch: Part 4, building the main WGS pipeline

How can I display only the user names when opening /etc/group in Python?

Here is my code:

grouplist = open("/etc/group", "r")
with grouplist as f2:
    with open("group", "w+") as f1:
        f1.write(f2.read())
        f1.seek(0, 0)
        command = f1.read()
        print()
        print(command)

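What can I use to make it display only the user names, without the ":x:1000:" part?

Answer

You are almost there. With a small fix to your code, this is the solution.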

with open("/etc/group", "r") as source:
    with open("./group", "w+") as destination:
        # Copy the whole file, then rewind the source to read it again
        destination.write(source.read())
        source.seek(0, 0)
        for line in source:
            # Everything before the first ':' is the name
            print(line[:line.index(':')])

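Note, however, that this only prints the names; it still copies the original file in full.

The following version writes only the names to the new file.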

with open("/etc/group", "r") as source:
    with open("./group", "w+") as destination:
        for line in source:
            name = line[:line.index(':')]
            destination.write('%s\n' % name)
            print(name)

Another answer

How about using split()?

with open("/etc/group", "r") as f2:
    for line in f2:
        # split() takes the separator positionally; the name is the first field
        fields = line.split(":")
        print(fields[0])
