Tuesday, November 29, 2011

Genome Structural Variation on simulated genome

Here's a brief memo to simulate a read alignment (and indels) based on a human genome chromosome and detect Structural Variations.

Step 1 - Get a reference genome
I chose to get the human reference genome v37 from 1000 Genomes Project: download link to human_g1k_v37.fasta.gz file (851MB).

You can extract the chromosome of your choice (to avoid working on the whole genome) with samtools:

samtools faidx human_g1k_v37.fasta 20 > human_g1k_v37_chr20.fasta

Step 2 - Simulate reads and indels
There are several short reads simulators (e.g. SimSeq, FluxSimulator). Here we are going to use wgsim that was once included in samtools and later extracted as a standalone project.

Wgsim project homepage: https://github.com/lh3/wgsim

Before running the simulator we need to calculate the number of reads we need. Remember the formula to calculate the coverage:

Coverage = N x L / G
(where N: number of reads - L: read length - G: genome size)

To detect indels we aim to have a "good" coverage of the genome, let's aim to get a coverage around 20. Then, if we produce 70-bp reads on human chromosome 20 (63Mbp long), it is easy with the formula above to find that we need at least 18.000.000 reads to achieve this coverage.

Once installed you run wgsim with the following command to generate the two files of paired reads:

wgsim -N 20000000 -X 0.95 human_g1k_v37_chr20.fasta out.read1.fq out.read2.fq > wgsim.out

Note the use of the -X option to increase the probability to extend an indel. The file wgsim.out will contain the indels generated where the column 1 is the chromosome, column 2 is the position, column 3 is the original base, column 4 is the new base (following the IUPAC codes) and column 5 is genomic copy/haplotype. The files out.read1.fq and out.read2.fq will contains the two reads of the paired reads (remember we are working with PEM here).

Step 3 - create the alignment file
With the paired reads files created in previous files we can align these reads against the reference genome and create an alignment file.

Again here, several tools are available and we will use bowtie2.

First, call bowtie-build to index our reference genome

bowtie2-build human_g1k_v37_chr20.fasta homo_chr20

Then the call to actually create the alignment:

bowtie2 -t homo_chr20 -X 700 -1 out.read1.fq -2 out.read2.fq -S homo_chr20.sam

Convert the alignment file to a binary alignment file (.SAM -> .BAM), sort and index using samtools:

samtools view -bS homo_chr20.sam > homo_chr20.bam
samtools sort homo_chr20.bam homo_chr20_sorted
samtools index homo_chr20_sorted.bam

Step 4 - detect Structural Variants
The following command calls shows how to detect indels generated in step 1:

dindel --ref human_g1k_v37_chr20.fasta --outputFile 1 --bamFile homo_chr20_sorted.bam --analysis getCIGARindels

python makeWindows.py --inputVarFile 1.variants.txt --windowFilePrefix 2 --numWindowsPerFile 20000

dindel --analysis indels --doDiploid --bamFile homo_chr20_sorted.bam --ref human_g1k_v37_chr20.fasta --varFile 2.1.txt --libFile 1.libraries.txt --outputFile 3 > 3.out 2> 3.err

echo 3.glf.txt > 3.list

python mergeOutputDiploid.py -i 3.list -o 4.vcf -r human_g1k_v37_chr20.fasta


  1. Nitpick: In the terminology of 1000 Genomes and the review paper "Genome structural variation discovery and genotyping", a structural variant is at least 50bp long. Wgsim's indels are almost certainly much shorter.

    1. Jesse if you increase the -X parameter, for instance, to 0.99 you will get big structural variants, far higher from 50bd