Quick question: I have some mapped reads in a BAM file that have good read quality, but they carry SAM flag 0x200, which means they didn't pass the vendor check. Should I include them in downstream analysis or not?
Long question: what's the relationship between the read quality score and the Chastity score?
First, the read quality score, which most people will already know:
The read quality score (Phred score) is calculated as Q = -10 * log10(P(error_base)), where P(error_base) is the probability that the base call is incorrect.
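To make the formula concrete, here is a minimal Python sketch of the conversion in both directions. It assumes the logarithm is base 10, as in the standard Phred definition; the function names are my own and not from any library.

```python
import math

def phred_from_error_prob(p_error):
    """Phred score Q = -10 * log10(P(error_base))."""
    return -10.0 * math.log10(p_error)

def error_prob_from_phred(q):
    """Inverse of the above: P(error_base) = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)

# Print the error probability implied by a few common Phred scores.
for q in (10, 20, 30, 40):
    print(q, error_prob_from_phred(q))
```

So Q20 corresponds to a 1 in 100 chance that the base is wrong, and Q30 to 1 in 1000.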
Second, the Chastity score used during the vendor check:
For reads in FASTQ format, the header has a 'Y/N' field that indicates whether the read passed the filtering step. The corresponding SAM flag is 0x200, meaning "not passing filters, such as platform/vendor quality controls". How does Illumina set the filtering criteria?
As far as I know, read filtering by Illumina Real Time Analysis (RTA) happens during the run, and it is determined by the Chastity score. The Chastity score is calculated as “the ratio of the highest of the four (base type) intensities to the sum of highest two”. Illumina describes the vendor check as follows:
"To remove the least reliable data from the analysis, the raw data can be filtered to remove any clusters that have “too much” intensity corresponding to bases other than the called base. By default, the purity of the signal from each cluster is examined over the first 25 cycles and calculated as Chastity = Highest_Intensity / (Highest_Intensity + Next_Highest_Intensity) for each cycle. The new default filtering implemented at the base calling stage allows at most one cycle that is less than the Chastity threshold. The higher the value, the better. This value is very dependent on cluster density, since the major cause of an impure signal in the early cycles is the presence of another cluster within a few micrometers."
So, to my understanding, in every cycle the sequencer images a cluster and records four signals, one per base (am I right?), and the base with the strongest signal is the one that gets called. The larger the gap between the strongest signal and the next strongest, the more reliable the base call. Over the first 25 cycles, Illumina allows at most one cycle whose Chastity falls below the threshold; otherwise the read is marked as vendor failed. Is my understanding right so far?
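Here is how I would sketch that rule in Python, just to check my own reading of the Illumina description. The function names and the toy intensities are made up for illustration. The threshold of 0.6 is an assumption on my part (it is the value I have usually seen cited as the default, but it is not stated in the quoted text), and real RTA works on corrected intensities that I am not modelling here.

```python
def chastity(intensities):
    """Chastity for one cycle: highest / (highest + next highest).

    `intensities` holds the four per-base signal values for one cluster.
    """
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0

def passes_filter(cycles, threshold=0.6, n_cycles=25, max_failures=1):
    """True if at most `max_failures` of the first `n_cycles` cycles
    have Chastity below `threshold` (threshold value is an assumption)."""
    failures = sum(1 for c in cycles[:n_cycles] if chastity(c) < threshold)
    return failures <= max_failures

# Toy data: a clean cycle and an ambiguous cycle (values are invented).
clean_cycle = [100.0, 10.0, 5.0, 5.0]   # chastity = 100 / 110
mixed_cycle = [50.0, 50.0, 0.0, 0.0]    # chastity = 0.5

one_bad = [mixed_cycle] + [clean_cycle] * 24
two_bad = [mixed_cycle] * 2 + [clean_cycle] * 23

print(passes_filter(one_bad))  # one low-chastity cycle is tolerated
print(passes_filter(two_bad))  # two low-chastity cycles fail the read
```

Note that under this rule only the first 25 cycles matter, so nothing that happens later in the read changes the pass/fail outcome.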
But what is the relationship between the Phred score and the Chastity score, if there is one? Can I still use vendor-failed reads if they have high Phred scores?
Thanks! Tao
Curious. Why are the vendor failed reads in your dataset?
I downloaded the BAM file from GTEx (dbGaP). It contains all the reads, including mapped, unmapped, and vendor-failed reads. For a sample with ~100M reads, ~12M are labeled as vendor failed, including both mapped and unmapped reads. Some of the vendor-failed reads have good read quality, so I'm not sure whether I should include them.
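In case it helps anyone reproduce the count, this is roughly how the vendor-failed reads can be tallied from SAM text (for example, the output of samtools view). It is only a sketch: the helper names are mine, and the three example records are invented, not taken from the GTEx file.

```python
QC_FAIL = 0x200  # SAM flag bit: not passing platform/vendor quality controls

def is_vendor_failed(flag):
    """True if the 0x200 bit is set in a SAM FLAG value."""
    return bool(flag & QC_FAIL)

def count_qc_fail(sam_lines):
    """Return (failed, total) over SAM alignment lines, skipping '@' headers."""
    total = failed = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        total += 1
        if is_vendor_failed(flag):
            failed += 1
    return failed, total

# Invented example records (only the first few columns matter here).
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t512\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",  # mapped, QC fail
    "read3\t516\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",        # unmapped, QC fail
]
print(count_qc_fail(example))
```

If the goal is simply to drop these reads, `samtools view -F 0x200` should exclude any record with that bit set.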