Here's a brief memo to simulate a read alignment (and indels) based on a human genome chromosome and detect Structural Variations.
Step 1 - Get a reference genome
Before running the simulator we need to calculate the number of reads we need. Remember the formula to calculate the coverage:
To detect indels we aim to have a "good" coverage of the genome, let's aim to get a coverage around 20. Then, if we produce 70-bp reads on human chromosome 20 (63Mbp long), it is easy with the formula above to find that we need at least 18.000.000 reads to achieve this coverage.
Step 1 - Get a reference genome
I chose to get the human reference genome v37 from 1000 Genomes Project: download link to human_g1k_v37.fasta.gz file (851MB).
You can extract the chromosome of your choice (to avoid working on the whole genome) with samtools:
samtools faidx human_g1k_v37.fasta 20 > human_g1k_v37_chr20.fasta
Step 2 - Simulate reads and indels
There are several short reads simulators (e.g. SimSeq, FluxSimulator). Here we are going to use wgsim that was once included in samtools and later extracted as a standalone project.
Wgsim project homepage: https://github.com/lh3/wgsim
Before running the simulator we need to calculate the number of reads we need. Remember the formula to calculate the coverage:
Coverage = N x L / G
(where N: number of reads - L: read length - G: genome size)
Once installed you run wgsim with the following command to generate the two files of paired reads:
wgsim -N 20000000 -X 0.95 human_g1k_v37_chr20.fasta out.read1.fq out.read2.fq > wgsim.out
Note the use of the -X option to increase the probability to extend an indel. The file wgsim.out will contain the indels generated where the column 1 is the chromosome, column 2 is the position, column 3 is the original base, column 4 is the new base (following the IUPAC codes) and column 5 is genomic copy/haplotype. The files out.read1.fq and out.read2.fq will contains the two reads of the paired reads (remember we are working with PEM here).
Step 3 - create the alignment file
With the paired reads files created in previous files we can align these reads against the reference genome and create an alignment file.
Again here, several tools are available and we will use bowtie2.
First, call bowtie-build to index our reference genome
bowtie2-build human_g1k_v37_chr20.fasta homo_chr20
Then the call to actually create the alignment:
bowtie2 -t homo_chr20 -X 700 -1 out.read1.fq -2 out.read2.fq -S homo_chr20.sam
Convert the alignment file to a binary alignment file (.SAM -> .BAM), sort and index using samtools:
samtools view -bS homo_chr20.sam > homo_chr20.bam
samtools sort homo_chr20.bam homo_chr20_sorted
samtools index homo_chr20_sorted.bam
Step 4 - detect Structural Variants
The following command calls shows how to detect indels generated in step 1:
dindel --ref human_g1k_v37_chr20.fasta --outputFile 1 --bamFile homo_chr20_sorted.bam --analysis getCIGARindels
python makeWindows.py --inputVarFile 1.variants.txt --windowFilePrefix 2 --numWindowsPerFile 20000
dindel --analysis indels --doDiploid --bamFile homo_chr20_sorted.bam --ref human_g1k_v37_chr20.fasta --varFile 2.1.txt --libFile 1.libraries.txt --outputFile 3 > 3.out 2> 3.err
echo 3.glf.txt > 3.list
python mergeOutputDiploid.py -i 3.list -o 4.vcf -r human_g1k_v37_chr20.fasta
wgsim -N 20000000 -X 0.95 human_g1k_v37_chr20.fasta out.read1.fq out.read2.fq > wgsim.out
Note the use of the -X option to increase the probability to extend an indel. The file wgsim.out will contain the indels generated where the column 1 is the chromosome, column 2 is the position, column 3 is the original base, column 4 is the new base (following the IUPAC codes) and column 5 is genomic copy/haplotype. The files out.read1.fq and out.read2.fq will contains the two reads of the paired reads (remember we are working with PEM here).
Step 3 - create the alignment file
With the paired reads files created in previous files we can align these reads against the reference genome and create an alignment file.
Again here, several tools are available and we will use bowtie2.
First, call bowtie-build to index our reference genome
bowtie2-build human_g1k_v37_chr20.fasta homo_chr20
Then the call to actually create the alignment:
bowtie2 -t homo_chr20 -X 700 -1 out.read1.fq -2 out.read2.fq -S homo_chr20.sam
Convert the alignment file to a binary alignment file (.SAM -> .BAM), sort and index using samtools:
samtools view -bS homo_chr20.sam > homo_chr20.bam
samtools sort homo_chr20.bam homo_chr20_sorted
samtools index homo_chr20_sorted.bam
Step 4 - detect Structural Variants
The following command calls shows how to detect indels generated in step 1:
dindel --ref human_g1k_v37_chr20.fasta --outputFile 1 --bamFile homo_chr20_sorted.bam --analysis getCIGARindels
python makeWindows.py --inputVarFile 1.variants.txt --windowFilePrefix 2 --numWindowsPerFile 20000
dindel --analysis indels --doDiploid --bamFile homo_chr20_sorted.bam --ref human_g1k_v37_chr20.fasta --varFile 2.1.txt --libFile 1.libraries.txt --outputFile 3 > 3.out 2> 3.err
echo 3.glf.txt > 3.list
python mergeOutputDiploid.py -i 3.list -o 4.vcf -r human_g1k_v37_chr20.fasta