Thursday, December 8, 2011

Pipelines to simulate and detect indels

This post describes a pipeline to simulate short reads using wgsim, align the produce short reads using BWA or Bowtie2 and finally detects indels using GATK UnifiedGenotyper (Broad Institute) or Dindel (Welcome Trust Sanger Institute).

Step 1 - Get a reference genome

I chose to get the human reference genome v37 from 1000 Genomes Project: download link to human_g1k_v37.fasta.gz file (851MB). You can extract the chromosome of your choice (to avoid working on the whole genome) with samtools:

samtools faidx human_g1k_v37.fasta 20 > human_g1k_v37_chr20.fasta


Step 2 - Simulate short reads

There are several short reads simulators (e.g. MAQSimSeqFluxSimulator). Here we are going to use wgsim that was once included in samtools and later extracted as a standalone project.

Wgsim project homepage: https://github.com/lh3/wgsim

Before running the simulator we need to calculate the number of reads we need. Remember the formula to calculate the coverage:

Coverage = N x L / G
(where N: number of reads - L: read length - G: genome size)

To detect indels we aim to have a "good" coverage of the genome, let's aim to get a coverage around 20. Then, if we produce 70-bp reads on human chromosome 20 (63Mbp long), it is easy with the formula above to find that we need at least 18.000.000 reads to achieve this coverage.

Once installed one run wgsim with the following command to generate the two files of paired reads:

wgsim -N 20000000 -X 0.95 human_g1k_v37_chr20.fasta out.read1.fq out.read2.fq > wgsim.out


Note the use of the -X option to increase the probability to extend an indel. The file wgsim.out will contain the indels generated where the column 1 is the chromosome, column 2 is the position, column 3 is the original base, column 4 is the new base (following the IUPAC codes) and column 5 is genomic copy/haplotype. The files out.read1.fq and out.read2.fq will contains the two reads of the paired reads (remember we are working with PEM here).

A limitation of wgsim is that it generates bases with a fixed quality of Q17 (encoded as "2" in fastq file) and the UnifiedGenotyper to be used below throws away everything that is below Q20. Hence we can reprocess the fastq files to uniformly increase base qualities with a awk command (Credit: biostar contribution):

awk ' /222/ { gsub("2", "I"); print $0; next } { print } ' out.read1.fq > out.read1b.fq
awk ' /222/ { gsub("2", "I"); print $0; next } { print } ' out.read2.fq > out.read2b.fq


Step 3 - Aligning short reads
Whatever the alignment method you choose, be sure to apply "Post-processing of alignment file" before going ahead with next step.

3.1. Aligning using BWA
Please consult BWA Manual Reference pages to get more details on the commands used below to index database sequences in the FASTA format (index), Find the SA coordinates of the input reads (aln), and finally generate alignments in the SAM format given paired-end reads (sampe):

bwa index human_g1k_v37_chr20.fasta

Important: the default BWA algorithm for constructing index, IS, is fast but will not work on database (fasta file) larger than 2GB. For larger database, use the option "-a bwtsw". 

bwa aln human_g1k_v37_chr20.fasta out.read1b.fq -f out1.sai
bwa aln human_g1k_v37_chr20.fasta out.read2b.fq -f out2.sai



Note: if your computer has several CPUs, you probably want to use them all through the "-t" option (see BWA Manual Reference pages). 

bwa sampe -f out.sam -r '@RG\tID:foo\tSM:bar' human_g1k_v37_chr20.fasta out1.sai out2.sai out.read1b.fq out.read2b.fq
samtools view -hbS out.sam -o out.bam 

Note:  as an optimization the command "bowtie2..." could be pipedto the "samtools view..." command in order to generate directly the binary version of the alignment file (.bam instead of .sam) and avoid the creation of an intermediary .sam file. This is an optimization proposed by swbarnes2 in SEQanswers.

Note: as suggested by Heisman in SEQanswers, the previous command may also be optimized using the optional parameter -u to request samtools not to compress the .BAM file as this will be done by the "samtools sort" command below.

3.2. Aligning using Bowtie
As an alternative to BWA, we can use Bowtie 2 to align short reads to the genome. First, call bowtie-build to index our reference genome:

bowtie2-build human_g1k_v37_chr20.fasta homo_chr20

Then the call to actually create the alignment:

bowtie2 -t homo_chr20 -X 700 -1 out.read1b.fq -2 out.read2b.fq -S out.sam
samtools view -hbS out.sam -o out.bam

Note: as an optimization the command "bowtie2..." could be pipedto the "samtools view..." command in order to generate directly the binary version of the alignment file (.bam instead of .sam) and avoid the creation of an intermediary .sam file. This is an optimization proposed by swbarnes2 in SEQanswers.

Note: as suggested by Heisman in SEQanswers, the previous command may also be optimized using the optional parameter -u to request samtools not to compress the .BAM file as this will be done by the "samtools sort" command below.


3.3. Post-processing of alignment file
Before calling indels, the alignment file will be converted in binary format, sorted then indexed using samtools:

samtools sort out.bam out_sorted
samtools index out_sorted.bam




Step 4 - Calling indels

4.1. Calling indels with GATK UnifiedGenotyper

Important: running the local realignment as described here is highly recommended to improve accuracy prior to run UnifiedGenotyper.

java -jar /home/pmaugeri/tools/GenomeAnalysisTK-1.3-21-gcb284ee/GenomeAnalysisTK.jar -R human_g1k_v37_chr20.fasta -T UnifiedGenotyper -I out_sorted.bam -o gatk_indels.vcf -glm INDEL

Indels called by the UnifiedGenotyper can be found in the file specified by the -o option (file "gatk_pipeline_indels.vcf").

Here's an example of a VCF file generated with GATK with this pipeline: gatk.vcf (~2MB)

4.2. Calling indels with Dindel
Calling indels using Sanger software is done in 4 steps where the first and the third are the longest ones:

dindel --ref human_g1k_v37_chr20.fasta --outputFile 1 --bamFile out_sorted.bam --analysis getCIGARindels

python makeWindows.py --inputVarFile 1.variants.txt --windowFilePrefix 2 --numWindowsPerFile 20000

dindel --analysis indels --doDiploid --bamFile out_sorted.bam --ref human_g1k_v37_chr20.fasta --varFile 2.1.txt --libFile 1.libraries.txt --outputFile 3 > 3.out 2> 3.err

echo 3.glf.txt > 3.list
python mergeOutputDiploid.py -i 3.list -o dindel_indels.vcf -r human_g1k_v37_chr20.fasta


Indels called by Dindel are found in file specified by -o option.

Here's an example of a VCF file generated with Dindel with this pipeline: dindel.vcf (~300KB)


Thanks a lot to Alex Renwick, swbarnes2, Heisman and vyellapa for their great suggestions in   SEQanswers forum to improve this post.