Tuesday, March 13, 2012

Biotech companies map - Denmark and Sweden

A Google map displaying Biotech companies located in Sweden and in Denmark:

Thank you very much to @TommyCarstensen for his great help and advices in selecting the companies displayed on this map!


View Biotech Companies 2012 - Denmark and Sweden in a larger map
Use this form if you want to suggest a correction to your company information.

Legend:


the marker shows the precise location of the company 
the marker shows the location area of the company


Sunday, March 4, 2012

Biotech Companies map - Germany

A Google map displaying Biotech companies located in Germany (see also Biotechs in Spain and Biotechs in France maps):


View Biotech Companies 2012 - Germany in a larger map

Use this form if you want to suggest a correction to your company information.


Legend:


the marker shows the precise location of the company 
the marker shows the location area of the company


Sunday, February 19, 2012

Biotech Companies map - France

Here is a new map displaying Biotech companies located in France, following the previous post about Spanish Biotech companies.


View Biotech Companies 2012 - France in a larger map.
Use this form if you want to suggest a correction to your company information.



Legend:


the marker shows the precise location of the company 
the marker shows the location area of the company


Thursday, February 16, 2012

Biotechs companies map - Spain

This is a first step in building a map of European Biotech companies. Here we start with Spanish companies only. The info box (when you click a marker) should display the name, web, employees and location of the company. 


View Biotech Companies 2012 - Spain in a larger map.
Use this form if you want to suggest a correction to your company information

LEGEND:


the markers shows the exact location of the company,
the marker shows the area of the company






Thursday, January 19, 2012

Genomic Structural Variation: Methods & Protocols

Book of the month about Structural Variation:


Content:
1. What Have Studies of Genomic Disorders Taught Us About Our Genome?
2. Microdeletion and Microduplication Syndromes
3. Structural Genomic Variation in Intellectual Disability
4. Copy Number Variation and Psychiatric Disease Risk
5. Detection and Characterization of Copy Number Variation in Autism Spectrum Disorder
6. Structural Variation in Subtelomeres
7. Array-Based Approaches in Prenatal Diagnosis
8. Structural Variation and Its Effect on Expression
9. The Challenges of Studying Complex and Dynamic Regions of the Human Genome
10. Population Genetic Nature of Copy Number Variation
11. Detection and Interpretation of Genomic Structural Variation in Mammals
12. Structural Genetic Variation in the Context of Somatic Mosaicism
13. Online Resources for Genomic Structural Variation
14. Algorithm Implementation for CNV Discovery Using Affymetrix and Illumina SNP Array Data
15. Targeted Screening and Validation of Copy Number Variations
16. High-Resolution Copy Number Profiling by Array CGH Using DNA Isolated from Formalin-Fixed, Paraffin-Embedded Tissues
17. Characterizing and Interpreting Genetic Variation from Personal Genome Sequencing
18. Massively Parallel Sequencing Approaches for Characterization of Structural Variation

Thursday, December 8, 2011

Pipelines to simulate and detect indels

This post describes a pipeline to simulate short reads using wgsim, align the produce short reads using BWA or Bowtie2 and finally detects indels using GATK UnifiedGenotyper (Broad Institute) or Dindel (Welcome Trust Sanger Institute).

Step 1 - Get a reference genome

I chose to get the human reference genome v37 from 1000 Genomes Project: download link to human_g1k_v37.fasta.gz file (851MB). You can extract the chromosome of your choice (to avoid working on the whole genome) with samtools:

samtools faidx human_g1k_v37.fasta 20 > human_g1k_v37_chr20.fasta


Step 2 - Simulate short reads

There are several short reads simulators (e.g. MAQSimSeqFluxSimulator). Here we are going to use wgsim that was once included in samtools and later extracted as a standalone project.

Wgsim project homepage: https://github.com/lh3/wgsim

Before running the simulator we need to calculate the number of reads we need. Remember the formula to calculate the coverage:

Coverage = N x L / G
(where N: number of reads - L: read length - G: genome size)

To detect indels we aim to have a "good" coverage of the genome, let's aim to get a coverage around 20. Then, if we produce 70-bp reads on human chromosome 20 (63Mbp long), it is easy with the formula above to find that we need at least 18.000.000 reads to achieve this coverage.

Once installed one run wgsim with the following command to generate the two files of paired reads:

wgsim -N 20000000 -X 0.95 human_g1k_v37_chr20.fasta out.read1.fq out.read2.fq > wgsim.out


Note the use of the -X option to increase the probability to extend an indel. The file wgsim.out will contain the indels generated where the column 1 is the chromosome, column 2 is the position, column 3 is the original base, column 4 is the new base (following the IUPAC codes) and column 5 is genomic copy/haplotype. The files out.read1.fq and out.read2.fq will contains the two reads of the paired reads (remember we are working with PEM here).

A limitation of wgsim is that it generates bases with a fixed quality of Q17 (encoded as "2" in fastq file) and the UnifiedGenotyper to be used below throws away everything that is below Q20. Hence we can reprocess the fastq files to uniformly increase base qualities with a awk command (Credit: biostar contribution):

awk ' /222/ { gsub("2", "I"); print $0; next } { print } ' out.read1.fq > out.read1b.fq
awk ' /222/ { gsub("2", "I"); print $0; next } { print } ' out.read2.fq > out.read2b.fq


Step 3 - Aligning short reads
Whatever the alignment method you choose, be sure to apply "Post-processing of alignment file" before going ahead with next step.

3.1. Aligning using BWA
Please consult BWA Manual Reference pages to get more details on the commands used below to index database sequences in the FASTA format (index), Find the SA coordinates of the input reads (aln), and finally generate alignments in the SAM format given paired-end reads (sampe):

bwa index human_g1k_v37_chr20.fasta

Important: the default BWA algorithm for constructing index, IS, is fast but will not work on database (fasta file) larger than 2GB. For larger database, use the option "-a bwtsw". 

bwa aln human_g1k_v37_chr20.fasta out.read1b.fq -f out1.sai
bwa aln human_g1k_v37_chr20.fasta out.read2b.fq -f out2.sai



Note: if your computer has several CPUs, you probably want to use them all through the "-t" option (see BWA Manual Reference pages). 

bwa sampe -f out.sam -r '@RG\tID:foo\tSM:bar' human_g1k_v37_chr20.fasta out1.sai out2.sai out.read1b.fq out.read2b.fq
samtools view -hbS out.sam -o out.bam 

Note:  as an optimization the command "bowtie2..." could be pipedto the "samtools view..." command in order to generate directly the binary version of the alignment file (.bam instead of .sam) and avoid the creation of an intermediary .sam file. This is an optimization proposed by swbarnes2 in SEQanswers.

Note: as suggested by Heisman in SEQanswers, the previous command may also be optimized using the optional parameter -u to request samtools not to compress the .BAM file as this will be done by the "samtools sort" command below.

3.2. Aligning using Bowtie
As an alternative to BWA, we can use Bowtie 2 to align short reads to the genome. First, call bowtie-build to index our reference genome:

bowtie2-build human_g1k_v37_chr20.fasta homo_chr20

Then the call to actually create the alignment:

bowtie2 -t homo_chr20 -X 700 -1 out.read1b.fq -2 out.read2b.fq -S out.sam
samtools view -hbS out.sam -o out.bam

Note: as an optimization the command "bowtie2..." could be pipedto the "samtools view..." command in order to generate directly the binary version of the alignment file (.bam instead of .sam) and avoid the creation of an intermediary .sam file. This is an optimization proposed by swbarnes2 in SEQanswers.

Note: as suggested by Heisman in SEQanswers, the previous command may also be optimized using the optional parameter -u to request samtools not to compress the .BAM file as this will be done by the "samtools sort" command below.


3.3. Post-processing of alignment file
Before calling indels, the alignment file will be converted in binary format, sorted then indexed using samtools:

samtools sort out.bam out_sorted
samtools index out_sorted.bam




Step 4 - Calling indels

4.1. Calling indels with GATK UnifiedGenotyper

Important: running the local realignment as described here is highly recommended to improve accuracy prior to run UnifiedGenotyper.

java -jar /home/pmaugeri/tools/GenomeAnalysisTK-1.3-21-gcb284ee/GenomeAnalysisTK.jar -R human_g1k_v37_chr20.fasta -T UnifiedGenotyper -I out_sorted.bam -o gatk_indels.vcf -glm INDEL

Indels called by the UnifiedGenotyper can be found in the file specified by the -o option (file "gatk_pipeline_indels.vcf").

Here's an example of a VCF file generated with GATK with this pipeline: gatk.vcf (~2MB)

4.2. Calling indels with Dindel
Calling indels using Sanger software is done in 4 steps where the first and the third are the longest ones:

dindel --ref human_g1k_v37_chr20.fasta --outputFile 1 --bamFile out_sorted.bam --analysis getCIGARindels

python makeWindows.py --inputVarFile 1.variants.txt --windowFilePrefix 2 --numWindowsPerFile 20000

dindel --analysis indels --doDiploid --bamFile out_sorted.bam --ref human_g1k_v37_chr20.fasta --varFile 2.1.txt --libFile 1.libraries.txt --outputFile 3 > 3.out 2> 3.err

echo 3.glf.txt > 3.list
python mergeOutputDiploid.py -i 3.list -o dindel_indels.vcf -r human_g1k_v37_chr20.fasta


Indels called by Dindel are found in file specified by -o option.

Here's an example of a VCF file generated with Dindel with this pipeline: dindel.vcf (~300KB)


Thanks a lot to Alex Renwick, swbarnes2, Heisman and vyellapa for their great suggestions in   SEQanswers forum to improve this post.





Tuesday, November 29, 2011

Genome Structural Variation on simulated genome

Here's a brief memo to simulate a read alignment (and indels) based on a human genome chromosome and detect Structural Variations.


Step 1 - Get a reference genome
I chose to get the human reference genome v37 from 1000 Genomes Project: download link to human_g1k_v37.fasta.gz file (851MB).

You can extract the chromosome of your choice (to avoid working on the whole genome) with samtools:

samtools faidx human_g1k_v37.fasta 20 > human_g1k_v37_chr20.fasta

Step 2 - Simulate reads and indels
There are several short reads simulators (e.g. SimSeq, FluxSimulator). Here we are going to use wgsim that was once included in samtools and later extracted as a standalone project.

Wgsim project homepage: https://github.com/lh3/wgsim

Before running the simulator we need to calculate the number of reads we need. Remember the formula to calculate the coverage:

Coverage = N x L / G
(where N: number of reads - L: read length - G: genome size)

To detect indels we aim to have a "good" coverage of the genome, let's aim to get a coverage around 20. Then, if we produce 70-bp reads on human chromosome 20 (63Mbp long), it is easy with the formula above to find that we need at least 18.000.000 reads to achieve this coverage.

Once installed you run wgsim with the following command to generate the two files of paired reads:

wgsim -N 20000000 -X 0.95 human_g1k_v37_chr20.fasta out.read1.fq out.read2.fq > wgsim.out

Note the use of the -X option to increase the probability to extend an indel. The file wgsim.out will contain the indels generated where the column 1 is the chromosome, column 2 is the position, column 3 is the original base, column 4 is the new base (following the IUPAC codes) and column 5 is genomic copy/haplotype. The files out.read1.fq and out.read2.fq will contains the two reads of the paired reads (remember we are working with PEM here).

Step 3 - create the alignment file
With the paired reads files created in previous files we can align these reads against the reference genome and create an alignment file.

Again here, several tools are available and we will use bowtie2.

First, call bowtie-build to index our reference genome

bowtie2-build human_g1k_v37_chr20.fasta homo_chr20


Then the call to actually create the alignment:

bowtie2 -t homo_chr20 -X 700 -1 out.read1.fq -2 out.read2.fq -S homo_chr20.sam

Convert the alignment file to a binary alignment file (.SAM -> .BAM), sort and index using samtools:

samtools view -bS homo_chr20.sam > homo_chr20.bam
samtools sort homo_chr20.bam homo_chr20_sorted
samtools index homo_chr20_sorted.bam

Step 4 - detect Structural Variants
The following command calls shows how to detect indels generated in step 1:

dindel --ref human_g1k_v37_chr20.fasta --outputFile 1 --bamFile homo_chr20_sorted.bam --analysis getCIGARindels

python makeWindows.py --inputVarFile 1.variants.txt --windowFilePrefix 2 --numWindowsPerFile 20000

dindel --analysis indels --doDiploid --bamFile homo_chr20_sorted.bam --ref human_g1k_v37_chr20.fasta --varFile 2.1.txt --libFile 1.libraries.txt --outputFile 3 > 3.out 2> 3.err

echo 3.glf.txt > 3.list

python mergeOutputDiploid.py -i 3.list -o 4.vcf -r human_g1k_v37_chr20.fasta