20080604 Joshua Stein

New files:
genes.v7.gff
blessed_genes.v7.gff

Description of genes.v7.gff:

Includes all genes in evidence_genes.v7.gff and fgenesh.v7.gff, but provides additional QC data as attributes in column 9.

Attribute "Class=" can have values of WH, NH, TE, and NULL. These designations are based on blastp alignments to NCBI nraa, and on comparison of the hits to a list of transposable elements that are frequently aligned to maize:

  TE    genes encoding proteins that align to those on the transposable element list
  WH    (with homology) genes that otherwise have significant hits
  NH    (no homology) genes with no significant hits
  NULL  a small number of Fgenesh models that did not have protein translations available

For genes having multiple transcripts, I based the classification on the longest translation produced by that gene.

These designations are similar to the "biotype=" designations, which are based on a similar method using the Ensembl pipeline. In most cases the two designations agree (WH=protein_coding; NH=protein_coding_hypothetical; TE=transposon_pseudogene). However, in manually checking discrepancies I found that the "Class=" designation is more reliable than the "biotype=" designation.

For Fgenesh genes I generally regard the NH genes as false positive gene calls. They tend to be short (often single-exon) and not conserved with rice in DNA-based genome alignments. I do not know yet how to regard the NH gene class in evidence-based gene calls (they are based on evidence, after all), but I have noted that they too are short as a group compared to the WH class. For now I think of them as hypotheticals that may or may not prove to be real genes. I encourage you to work with both the NH and WH sets and provide feedback on any observations you have regarding the legitimacy of NH genes.
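As a starting point for working with the "Class=" attribute, the following is a minimal Python sketch that tallies Class values per gene. It rests on several assumptions not stated in these notes: that column 9 holds ';'-separated key=value pairs (the usual GFF3 convention), that gene rows carry the type "gene" in column 3, and that comment lines start with "#". The function names and the sample rows are made up for illustration and are not taken from the real files.

```python
from collections import Counter


def parse_attributes(column9):
    """Split a column-9 string into a dict.

    Assumes ';'-separated key=value pairs (GFF3 convention).
    """
    attrs = {}
    for field in column9.strip().split(";"):
        field = field.strip()
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def count_classes(lines):
    """Tally Class= values over rows whose column 3 is 'gene' (an assumption)."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue  # skip malformed rows and non-gene features
        counts[parse_attributes(cols[8]).get("Class", "missing")] += 1
    return counts


# Hypothetical rows for illustration only; not real data from genes.v7.gff.
SAMPLE_ROWS = [
    "##gff-version 3",
    "chr\tsrc\tgene\t1\t900\t.\t+\t.\tID=geneA;Class=WH;biotype=protein_coding",
    "chr\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=geneA_T01;Parent=geneA",
    "chr\tsrc\tgene\t2000\t2300\t.\t-\t.\tID=geneB;Class=NH",
    "chr\tsrc\tgene\t5000\t7000\t.\t+\t.\tID=geneC;Class=TE",
]

if __name__ == "__main__":
    print(count_classes(SAMPLE_ROWS))
```

To run it on the real file, pass an open file handle instead of the sample list, for example count_classes(open("genes.v7.gff")). If the gene rows use a different feature type, adjust the column-3 test accordingly.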
Blessed genes:

An additional attribute added to column 9 of genes is "Blessed=", which can have values of "blessed" or "NULL". In comparing the WH classes of Fgenesh and evidence-based gene models I noticed that there is a population of Fgenesh models that do not overlap with the coordinates of evidence-based genes. These may be true genes that were not detected using the evidence-based gene-build. To generate a working set of genes that is as complete as possible, I combined the evidence-based WH genes with the non-overlapping Fgenesh WH genes and designated them as "Blessed=blessed". Specifically, there are 565 evidence_genes + 136 Fgenesh = 701 blessed genes.

You will find a small number of blessed genes that are not WH, as well as a small number of WH evidence-based genes that are not in the blessed set. This is because, when manually checking discrepancies between biotype and Class, those that appeared to be real genes (regardless of automated designation) were included as candidates to go into the blessed set.

An additional note: the method for gene-calling is not specifically noted in the GFF, but evidence-gene IDs start with "ZmAcc7" while Fgenesh gene IDs start with "Pseudomolecule_".

Description of blessed_genes.v7.gff:

This is the subset of genes and gene features that are "blessed".

For questions please contact steinj@cshl.edu.
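For readers who want to check the blessed set themselves, here is a rough Python sketch that counts blessed genes by gene-calling method, using the ID prefixes described above. It assumes that column 9 holds ';'-separated key=value pairs, that the gene ID is stored in an "ID=" attribute, and that gene rows have the type "gene" in column 3; none of these details are confirmed by these notes. The function names and the sample IDs are hypothetical.

```python
from collections import Counter


def parse_attributes(column9):
    """Split a column-9 string into a dict (assumes ';'-separated key=value pairs)."""
    attrs = {}
    for field in column9.strip().split(";"):
        field = field.strip()
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def gene_caller(gene_id):
    """Infer the gene-calling method from the ID prefix given in the notes."""
    if gene_id.startswith("ZmAcc7"):
        return "evidence"
    if gene_id.startswith("Pseudomolecule_"):
        return "fgenesh"
    return "unknown"


def count_blessed(lines):
    """Count genes with Blessed=blessed, grouped by inferred gene caller."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue  # assumes gene rows are typed 'gene' in column 3
        attrs = parse_attributes(cols[8])
        if attrs.get("Blessed") == "blessed":
            # Assumes the gene ID lives in an 'ID=' attribute.
            counts[gene_caller(attrs.get("ID", ""))] += 1
    return counts


# Hypothetical rows for illustration only; the IDs are invented.
SAMPLE_ROWS = [
    "chr\tsrc\tgene\t1\t900\t.\t+\t.\tID=ZmAcc7_example1;Class=WH;Blessed=blessed",
    "chr\tsrc\tgene\t2000\t2300\t.\t-\t.\tID=Pseudomolecule_example2;Class=WH;Blessed=blessed",
    "chr\tsrc\tgene\t5000\t5200\t.\t+\t.\tID=Pseudomolecule_example3;Class=NH;Blessed=NULL",
]

if __name__ == "__main__":
    print(count_blessed(SAMPLE_ROWS))
```

If these assumptions hold for genes.v7.gff, the "evidence" and "fgenesh" totals from count_blessed(open("genes.v7.gff")) should correspond to the 565 and 136 quoted above; a mismatch most likely means one of the assumptions about the file layout is wrong.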