Tools to facilitate annotation and comparison of genomes and metagenomes

seq-annot

About

seq-annot is a Python package for annotating and counting genomic features in genomes and metagenomes.

Licence

seq-annot is released under the open-source GPLv3 licence.

Requirements

Python 3.4+

Python Libraries:

  • arandomness
  • bio_utils
  • HTSeq
  • numpy

Installation

pip install seq-annot

Usage

reldb

Integral to the annotation process is a sequence database with high-quality supporting information. seq-annot will accept supporting data in the form of a JSON-formatted relational database that provides context for entries in the sequence database, such as function, alternative IDs, organism, etc. Database entries can contain any number of fields. reldb is a tool used to perform various manipulations on one or more relational databases, including: entry merging, database subsetting, and format conversion.

Examples

Find all entries with istA in the product description and output the results as a tabular CSV:
    reldb --csv --case -s "product:istA" in.mapping

Merge entries by value of the field 'gene':
    reldb -m gene in.mapping

screen_features

screen_features is for screening the results of a homology search using alternative phenotype-conferring SNPs, bitscore thresholds, and other scoring metrics.

Examples

Filter hits by bitscore threshold specified in the relational database:
    screen_features -m in.mapping -b bitscore in.b6
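
The bitscore screen can be pictured with the following Python sketch. It is not the screen_features implementation. It assumes the B6 file follows the standard 12-column BLAST tabular layout (subject ID in column 2, bitscore in column 12), that thresholds are looked up by subject ID in a field named `bitscore`, that a hit passes when its score is greater than or equal to the threshold, and that hits without a threshold are dropped. The function name `filter_hits` and the sample data are hypothetical.

```python
# Three made-up hits in 12-column BLAST tabular (B6) format.
b6_text = (
    "gene_1\tseq001\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t380.0\n"
    "gene_2\tseq001\t60.0\t150\t60\t2\t1\t150\t5\t154\t1e-20\t95.0\n"
    "gene_3\tseq003\t90.0\t300\t30\t0\t1\t300\t1\t300\t1e-150\t520.0\n"
)

# Hypothetical per-entry thresholds stored in the relational database.
mapping = {
    "seq001": {"bitscore": 200.0},
    "seq003": {"bitscore": 600.0},
}


def filter_hits(b6_lines, mapping, field="bitscore"):
    """Return query IDs of hits whose bitscore meets the subject's threshold."""
    kept = []
    for line in b6_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        query, subject, score = cols[0], cols[1], float(cols[11])
        threshold = mapping.get(subject, {}).get(field)
        # Assumption: hits with no threshold on record are discarded.
        if threshold is not None and score >= float(threshold):
            kept.append(query)
    return kept


print(filter_hits(b6_text.splitlines(), mapping))
```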

annotate_features

annotate_features is for annotating a GFF3 file of putative protein-coding genes using the results of a homology search to a database of genes with known or predicted function.

Examples

Annotate predicted protein-coding genes, adding fields Dbxref and gene 
from the relational database as additional attributes:
    annotate_features -m mapping.json -f Dbxref,gene -b in.b6 in.gff

combine_features

combine_features is for combining GFF3 files containing feature annotations. Overlap conflicts can be resolved through a hierarchy of precedence determined by file input order.

Examples

Merge two GFF3 files, using input file order to resolve overlap conflicts
but allowing overlaps if found on different strands:
    combine_features --precedence --stranded in.gff1 in.gff2
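
One way to picture precedence-based overlap resolution is sketched below in Python. This is an illustration, not the combine_features source. It assumes that earlier input files take precedence over later ones, that any positional overlap counts as a conflict, and that features within a single file are never compared with each other. The `Feature` tuple and the function names are hypothetical.

```python
from collections import namedtuple

# Minimal stand-in for a GFF3 record (coordinates are 1-based, inclusive).
Feature = namedtuple("Feature", "seqid start end strand")


def overlaps(a, b, stranded):
    """True if a and b conflict; strands must match when stranded is set."""
    if a.seqid != b.seqid:
        return False
    if stranded and a.strand != b.strand:
        return False
    return a.start <= b.end and b.start <= a.end


def combine(files, stranded=False):
    """Merge feature lists, giving earlier lists precedence (assumption)."""
    kept = []
    for features in files:
        # Compare only against features accepted from earlier files.
        accepted = [
            f for f in features
            if not any(overlaps(f, k, stranded) for k in kept)
        ]
        kept.extend(accepted)
    return kept


file1 = [Feature("contig1", 100, 500, "+")]
file2 = [
    Feature("contig1", 400, 800, "+"),    # overlaps file1 on the same strand
    Feature("contig1", 450, 900, "-"),    # overlaps file1 on the other strand
    Feature("contig1", 1000, 1200, "+"),  # no overlap
]

print(len(combine([file1, file2], stranded=True)))
print(len(combine([file1, file2], stranded=False)))
```

With `stranded=True` the opposite-strand feature survives, so three features remain; without it, only the file1 feature and the non-overlapping feature are kept.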

count_features

count_features is for estimating the abundance of genomic features using a GFF3 file of feature annotations and a read alignment file in SAM/BAM format.

Examples

Estimate abundance for features in GFF in units of FPKM:
    count_features --nonunique -u fpkm in.bam in.gff
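
For reference, FPKM (fragments per kilobase of feature per million mapped fragments) can be computed as in the Python sketch below. It shows the standard definition only; how count_features assigns fragments to features, including non-unique alignments, is not covered here. The function names are hypothetical.

```python
def feature_length(start, end):
    """Length in bp of a GFF3 feature (coordinates are 1-based, inclusive)."""
    return end - start + 1


def fpkm(fragment_count, length_bp, total_mapped_fragments):
    """Fragments per kilobase of feature per million mapped fragments."""
    if length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    return fragment_count * 1e9 / (length_bp * total_mapped_fragments)


# 100 fragments on a 1,000 bp feature in a library of 1,000,000 fragments.
length = feature_length(501, 1500)
print(fpkm(100, length, 1000000))
```

The factor of 1e9 combines the per-kilobase (1e3) and per-million (1e6) scaling into a single step.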

compare_features

compare_features is for comparing annotated features across multiple samples.

Examples

Create table of feature abundances by sample:
    compare_features -n sample1,sample2 in.csv1 in.csv2
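
The kind of table this produces can be sketched in Python as follows. The sketch is not based on the compare_features source. It assumes each sample provides a feature-to-abundance mapping and that a feature missing from a sample is reported as 0.0. The function name `abundance_table` and the numbers are made up.

```python
def abundance_table(samples):
    """Build a header and rows from (sample_name, {feature: abundance}) pairs."""
    names = [name for name, _ in samples]
    features = sorted({f for _, counts in samples for f in counts})
    header = ["feature"] + names
    rows = []
    for feature in features:
        # Assumption: features absent from a sample get an abundance of 0.0.
        rows.append([feature] + [counts.get(feature, 0.0) for _, counts in samples])
    return header, rows


samples = [
    ("sample1", {"istA": 12.5, "tetM": 3.0}),
    ("sample2", {"tetM": 7.5}),
]

header, rows = abundance_table(samples)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```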

colocate_features

colocate_features is for finding evidence of co-residence between two sets of genomic features in an assembly of mixed populations (such as a metagenome assembly).

Examples

Find all instances of co-residence between sets of features:
    colocate_features -a Alias -f in.fasta in.sets in.gff
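
As a rough illustration of co-residence, the Python sketch below reports pairs of features, one from each set, that are annotated on the same contig. This is a simplified interpretation and not the colocate_features algorithm. The input shapes, the function name `colocated_pairs`, and the sample identifiers are assumptions.

```python
from collections import defaultdict
from itertools import product


def colocated_pairs(features, set_a, set_b):
    """Return (contig, a, b) for features from both sets on the same contig.

    features is an iterable of (contig, alias) pairs, as might be read from
    a GFF3 attribute such as Alias (assumed input shape).
    """
    by_contig = defaultdict(list)
    for contig, alias in features:
        by_contig[contig].append(alias)

    pairs = []
    for contig in sorted(by_contig):
        in_a = [x for x in by_contig[contig] if x in set_a]
        in_b = [x for x in by_contig[contig] if x in set_b]
        pairs.extend((contig, a, b) for a, b in product(in_a, in_b))
    return pairs


features = [
    ("contig1", "tetM"),
    ("contig1", "istA"),
    ("contig2", "tetM"),
    ("contig3", "istA"),
]

print(colocated_pairs(features, {"tetM"}, {"istA"}))
```

Only contig1 carries a member of each set, so it is the only contig reported.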