Annotation Version 4 (TGAC v2) - on assembly BATG-0.5
Released 08/02/16 in collaboration with The Genome Analysis Centre
Annotation | File | Notes |
---|---|---|
Full Annotation | GFF | GFF file of all gene, mRNA, CDS, 3'UTR and 5'UTR annotations (all isoforms), excluding transposable elements. |
Transposable Elements | GFF | GFF file of transposable elements. |
cDNA Fasta | FASTA | FASTA file of all cDNA (transcript) sequences (all isoforms). |
CDS Fasta | FASTA | FASTA file of all CDS DNA sequences (all isoforms). |
Proteome | FASTA | FASTA file of all CDS peptide sequences (all isoforms). |
Longest transcript: Annotation | GFF | GFF file of all genes, with mRNA, CDS, 3'UTR and 5'UTR annotations for the longest isoform of each gene only. |
Longest transcript: cDNA | FASTA | FASTA file of cDNA (transcript) sequences (longest isoform per gene only). |
Longest transcript: CDS | FASTA | FASTA file of CDS DNA sequences (longest isoform per gene only). |
Longest transcript: Proteome | FASTA | FASTA file of CDS peptide sequences (longest isoform per gene only). |
Functional annotation | TSV | GO and InterProScan terms associated with each gene. |
Annotation Version 3 (TGAC v2) - on assembly BATG-0.4
Released 26/02/15 in collaboration with The Genome Analysis Centre
Annotation | File | Notes |
---|---|---|
Full Annotation | GFF | GFF file of all gene, mRNA, CDS, 3'UTR and 5'UTR annotations (all isoforms). |
cDNA Fasta | FASTA | FASTA file of all cDNA (transcript) sequences (all isoforms). |
CDS Fasta | FASTA | FASTA file of all CDS DNA sequences (all isoforms). |
Proteome | FASTA | FASTA file of all CDS peptide sequences (all isoforms). |
Functional annotation | TSV | GO and InterProScan terms associated with all genes (all isoforms). |
Longest transcript: Annotation | GFF | GFF file of all genes, with mRNA, CDS, 3'UTR and 5'UTR annotations for the longest isoform of each gene only. |
Longest transcript: cDNA | FASTA | FASTA file of cDNA (transcript) sequences (longest isoform per gene only). |
Longest transcript: CDS | FASTA | FASTA file of CDS DNA sequences (longest isoform per gene only). |
Longest transcript: Proteome | FASTA | FASTA file of CDS peptide sequences (longest isoform per gene only). |
Longest transcript: Functional annotation | TSV | GO and InterProScan terms associated with the longest isoform of each gene. |
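For readers who want to derive a longest-isoform subset from the full GFF themselves, the following is a minimal sketch, not the procedure used to build the files above. It assumes standard GFF3 columns with `ID` and `Parent` attributes, and it assumes that "longest" means the largest genomic span of an mRNA feature; the released files may use a different definition (for example, total CDS length). The function names are hypothetical.

```python
def parse_attributes(field):
    """Turn a GFF3 attribute column such as 'ID=x;Parent=y' into a dict."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


def longest_isoform_per_gene(gff_lines):
    """Return {gene_id: mRNA_id} for the longest mRNA of each gene.

    Assumption: length is the genomic span (end - start + 1) of the mRNA
    feature. On ties, the first mRNA encountered is kept.
    """
    best = {}  # gene_id -> (span, mRNA_id)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        gene_id, mrna_id = attrs.get("Parent"), attrs.get("ID")
        if gene_id is None or mrna_id is None:
            continue
        span = int(cols[4]) - int(cols[3]) + 1
        if gene_id not in best or span > best[gene_id][0]:
            best[gene_id] = (span, mrna_id)
    return {gene: mrna for gene, (_, mrna) in best.items()}


# Made-up records for illustration only; identifiers are not from the real annotation.
example = [
    "##gff-version 3",
    "contig1\tsketch\tgene\t100\t900\t.\t+\t.\tID=geneA",
    "contig1\tsketch\tmRNA\t100\t900\t.\t+\t.\tID=geneA.1;Parent=geneA",
    "contig1\tsketch\tmRNA\t100\t600\t.\t+\t.\tID=geneA.2;Parent=geneA",
]
print(longest_isoform_per_gene(example))
```

The returned mapping can then be used to keep only the matching mRNA, CDS and UTR lines from the GFF, or the matching records from the FASTA files.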
Methods V3 (BATG-0.4)
Genes were predicted ab initio with a three-way pipeline consisting of (1) MAKER, (2) Augustus without RNA-seq data, and (3) Augustus with RNA-seq data. The following data were fed into the pipeline: RNA-seq reads from the five samples listed below under Annotation Version 2, the Version 2 gene models produced by Lizzy Sollars, a repeat-masked genome produced by Laura Kelly at QMUL, and alignments of protein sequences from eight other plant species. The pipeline was run by Gemy Kaithokottil and David Swarbreck at TGAC. Because each of the three methods predicts slightly different gene structures, EVidenceModeler was then used to select the most accurate structure for each gene. Filtering was performed using PASA. The resulting genes were annotated using BLAST, GO terms and InterProScan. This version predicts more genes than the previous one, because Version 2 relied solely on RNA data and could therefore detect only those genes that were expressed at the time of RNA extraction.
This annotation can be visualised in the JBrowse tool on this website, and also in the GBrowse instance hosted at TGAC.
Annotation Version 2 - BATG-0.4:
Sample | GFF3 | FASTA | No. of genes/proteins | Notes |
---|---|---|---|---|
Selfed tree leaf | S_L1 | S_L1 | 27,360 | Assembled transcriptome of leaf tissue from the selfed tree (gz compressed). |
Mother tree leaf | M_L1 | M_L1 | 24,473 | Assembled transcriptome of leaf tissue from the mother tree (gz compressed). |
Mother tree cambium | M_C1 | M_C1 | 27,368 | Assembled transcriptome of cambium tissue from the mother tree (gz compressed). |
Mother tree root | M_R2 | M_R2 | 28,275 | Assembled transcriptome of root tissue from the mother tree (gz compressed). |
Mother tree flower | M_F1 | M_F1 | 29,562 | Assembled transcriptome of flower tissue from the mother tree (gz compressed). |
All samples combined | Bulk | Bulk | 36,944 | Tar archive of the five files above. |
All samples | Unigenes | 36,944 | The longest transcript per gene (gz compressed). Filtered on coverage. | |
All samples | Proteome | 36,893 | Predicted protein coding sequence for each unigene (gz compressed) | |
All samples unfiltered | Unigenes | 41,521 | Longest transcript per gene, before filtering. | |
All samples unfiltered | mRNA | 72,139 | All mRNA transcripts, before filtering. |
Methods (v2)
RNA was extracted by Jasmin Zohren from five samples: leaf, cambium, root and flower tissue of the 'mother' tree, and leaf tissue of the 'selfed' tree (the individual for which we have provided a reference genome sequence). These were sequenced using Illumina HiSeq paired-end technology. Adapter sequences were removed from the reads, which were then quality trimmed to a minimum Phred score of 20 and a minimum length of 50 bp. The transcriptome data were assembled by Lizzy Sollars using the CLC Transcript Discovery Plugin. RNA-seq reads were mapped to the BATG-0.4 reference genome using the Large Gap Read Mapper (which accounts for intron sequences in the reference), and the locations of genes and mRNA transcripts were predicted using the Transcript Discovery tool. Reads were then mapped back to the transcripts, and transcripts with an average coverage of less than 5 were filtered out.
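The two thresholds described above (Phred 20 with a 50 bp minimum length, and an average coverage of at least 5) can be illustrated with a small sketch. This is not the CLC implementation: it assumes simple trimming of low-quality bases from the 3' end only, and it assumes average coverage is the total number of mapped bases divided by transcript length. All function names and the example values are hypothetical.

```python
def trim_read(seq, quals, min_phred=20, min_len=50):
    """Trim bases below min_phred from the 3' end of a read.

    Returns (trimmed_seq, trimmed_quals), or None if the trimmed read is
    shorter than min_len. Assumption: 3'-end trimming only; the trimmer
    actually used may apply a different algorithm.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_phred:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], quals[:end]


def mean_coverage(mapped_read_lengths, transcript_length):
    """Average coverage, assumed here to be mapped bases / transcript length."""
    return sum(mapped_read_lengths) / transcript_length


def filter_by_coverage(transcripts, min_coverage=5):
    """Keep IDs of transcripts whose average coverage is at least min_coverage.

    transcripts: {transcript_id: (length, [lengths of reads mapped to it])}
    """
    return [
        tid
        for tid, (length, reads) in transcripts.items()
        if mean_coverage(reads, length) >= min_coverage
    ]


# Made-up values for illustration only.
read = trim_read("A" * 60, [30] * 55 + [10] * 5)
kept = filter_by_coverage({"t1": (1000, [100] * 60), "t2": (1000, [100] * 40)})
print(len(read[0]), kept)
```

In this example the read loses its five low-quality 3' bases but stays above 50 bp, and only the transcript with sixfold average coverage passes the filter.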
The GFF3 files above contain the locations of genes, transcripts, exons and introns for each tissue, and the FASTA files contain the sequence of each mRNA transcript. The 'bulk' files are tar archives of the five separate sample files. The unigenes file comprises one transcript per gene, with all samples combined, i.e. a 'complete' reference ash transcriptome. The longest gene in each location was selected for this file, regardless of which sample it originated from. The proteome file contains one protein sequence for each gene, predicted by running a command-line version of OrfPredictor. The inputs for this were the unigenes file and the output of a BLASTx search of the unigenes against all plant proteins in the RefSeq database.
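The selection of the longest gene in each location across samples can be sketched as an interval-clustering step. This is an illustration under stated assumptions, not the method actually used: it assumes that a "location" is a set of gene models that overlap on the same contig and strand, and that length is the genomic span. The function name and coordinates are hypothetical; the sample labels are borrowed from the table above only for readability.

```python
def longest_gene_per_locus(models):
    """Pick the longest gene model from each cluster of overlapping models.

    models: iterable of (sample, contig, strand, start, end) tuples.
    Assumption: models overlapping on the same contig and strand belong to
    the same locus; on ties, the model that sorts first is kept.
    """
    kept = []
    current = None  # (contig, strand, cluster_end, best_model)
    for model in sorted(models, key=lambda m: (m[1], m[2], m[3], m[4])):
        _, contig, strand, start, end = model
        same_locus = (
            current is not None
            and current[0] == contig
            and current[1] == strand
            and start <= current[2]
        )
        if same_locus:
            best = current[3]
            if end - start > best[4] - best[3]:
                best = model  # a longer model replaces the current best
            current = (contig, strand, max(current[2], end), best)
        else:
            if current is not None:
                kept.append(current[3])  # close the previous cluster
            current = (contig, strand, end, model)
    if current is not None:
        kept.append(current[3])
    return kept


# Made-up coordinates for illustration only.
models = [
    ("M_L1", "contig1", "+", 100, 500),
    ("M_R2", "contig1", "+", 150, 900),
    ("S_L1", "contig1", "+", 2000, 2400),
]
for chosen in longest_gene_per_locus(models):
    print(chosen)
```

Here the two overlapping models collapse to the longer one from the root sample, while the isolated model from the selfed-tree leaf sample is kept as its own locus.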