Basic tools for preprocessing sequencing reads

seq-qc - Python package with modules for preprocessing sequencing reads

About

seq-qc is a Python package for performing various quality control tasks on sequencing reads. It currently provides three programs: a dereplicator for paired-end reads, a quality-based trimming tool, and a demultiplexer that uses the barcodes in sequence headers to partition sequences.

Requirements

Python 3.4+

Python Libraries:

  • arandomness
  • bio_utils

Installation

pip install seq-qc

Usage

filter_replicates takes FASTQ or FASTA files as input. Paired reads can either be split into separate files or stored in a single, interleaved file. filter_replicates currently supports searching for exact, 5’-prefix, and reverse-complement replicates.
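
To make the three replicate types concrete, here is a minimal Python sketch that classifies one sequence against another. It is an illustration only, not seq-qc's implementation: the function names (reverse_complement, replicate_type) are hypothetical, it compares single sequences rather than read pairs, and it assumes that a 5’-prefix replicate is a read identical to the beginning of a longer read.

```python
# Illustrative sketch only; these helpers are hypothetical and are not
# part of the seq-qc API.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def replicate_type(query, reference):
    """Classify how `query` replicates `reference`.

    Returns "exact", "prefix", "rev-comp", or None when `query` is not a
    replicate of `reference` under these (assumed) definitions.
    """
    if query == reference:
        return "exact"
    # Assumption: a 5'-prefix replicate matches the start of a longer read.
    if reference.startswith(query):
        return "prefix"
    if query == reverse_complement(reference):
        return "rev-comp"
    return None


print(replicate_type("ACGTTG", "ACGTTG"))  # exact
print(replicate_type("ACGT", "ACGTTG"))    # prefix
print(replicate_type("CAACGT", "ACGTTG"))  # rev-comp
print(replicate_type("ACGA", "ACGTTG"))    # None
```

For paired reads, a natural extension would be to require that both mates satisfy the same relationship, but how seq-qc handles pairs is not specified here.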

Examples

filter_replicates --prefix --rev-comp --reverse input_reverse.fastq \
    input_forward.fastq

filter_replicates --prefix --rev-comp -o output_forward.fastq.gz -v \
    output_reverse.fastq.gz --reverse input_reverse.fastq.gz \
    input_forward.fastq.gz

filter_replicates --interleaved -o output_interleaved.fasta \
    --format fasta --log output.log input_interleaved.fasta

filter_replicates -o output_singles.fastq input_singles.fastq

qtrim takes only FASTQ files as input. It can perform a variety of trimming steps, including trimming low-quality bases from the start and end of a read, trimming the read after the position of the first ambiguous base, and trimming a read using a sliding-window approach. It also supports cutting the read to a desired length by removing bases from the start or end of the read.
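
The trimming steps described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not qtrim's actual code: all function names are hypothetical, qualities are assumed to be Phred+33 encoded, the sliding window is assumed to cut at the start of the first window whose mean quality drops below the threshold, the ambiguous base itself is assumed to be removed, and the order in which the steps are applied is a guess.

```python
# Illustrative sketch only; names, step order, and cut conventions are
# assumptions and do not come from the seq-qc source.

def decode_phred33(qual_string):
    """Convert a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in qual_string]


def trim_leading(quals, threshold):
    """Index of the first base with quality >= threshold."""
    for i, q in enumerate(quals):
        if q >= threshold:
            return i
    return len(quals)


def trim_trailing(quals, threshold):
    """One past the index of the last base with quality >= threshold."""
    for i in range(len(quals) - 1, -1, -1):
        if quals[i] >= threshold:
            return i + 1
    return 0


def sliding_window_cut(quals, size, threshold):
    """Start of the first window whose mean quality is below threshold."""
    for start in range(len(quals) - size + 1):
        if sum(quals[start:start + size]) / size < threshold:
            return start
    return len(quals)


def trim_read(seq, qual_string, leading=0, trailing=0, window=None,
              trunc_n=False):
    """Apply leading/trailing, first-N, and sliding-window trimming."""
    quals = decode_phred33(qual_string)
    start = trim_leading(quals, leading)
    end = max(start, trim_trailing(quals, trailing))
    seq, quals = seq[start:end], quals[start:end]
    if trunc_n:
        n_pos = seq.upper().find("N")
        if n_pos != -1:
            # Assumption: the ambiguous base itself is dropped.
            seq, quals = seq[:n_pos], quals[:n_pos]
    if window is not None:
        size, threshold = window
        cut = sliding_window_cut(quals, size, threshold)
        seq, quals = seq[:cut], quals[:cut]
    return seq, quals


# '#' encodes quality 2 and 'I' encodes quality 40 in Phred+33.
print(trim_read("ACGTNACGTA", "##IIIIII##", leading=20, trailing=20,
                trunc_n=True))
```

A trimmed read that ends up shorter than a minimum length (compare --min-len in the examples below) would then simply be discarded by the caller.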

Examples

qtrim -1 -o output_forward.fastq.gz --qual-type phred33 \
    --sliding-window 10:20 -v out_reverse.fastq.gz -s out_singles.fastq.gz \
    --reverse in_reverse.fastq.gz in_forward.fastq.gz

qtrim --threads 4 --interleaved --leading 20 --trailing 20 --trunc-n \
    --min-len 60 -o out_interleaved.fastq in_interleaved.fastq

demultiplex_headers splits sequences into separate files by the barcode sequence found in the sequence headers. Headers must use the Casava 1.8 format for demultiplexing to work. An optional barcodes file can be provided that maps desired output filenames to the barcodes expected in the input FASTQ/FASTA file. See demultiplex_headers --help for formatting requirements.
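
As a rough illustration of header-based demultiplexing, the Python sketch below pulls the index sequence out of a Casava 1.8 header (the last colon-separated field after the space) and matches it against a list of expected barcodes. The function names are hypothetical, and treating the allowed distance as a Hamming distance is an assumption; see the program's help output for the actual behavior.

```python
# Illustrative sketch only; these helpers are hypothetical and the use of
# Hamming distance for barcode matching is an assumption.

def barcode_from_header(header):
    """Extract the index sequence from a Casava 1.8 header line.

    Expected layout:
    @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
    """
    parts = header.strip().split(" ", 1)
    if len(parts) != 2:
        raise ValueError("not a Casava 1.8 header: no description field")
    fields = parts[1].split(":")
    if len(fields) != 4:
        raise ValueError("not a Casava 1.8 header: expected 4 fields")
    return fields[3]


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(observed, expected, max_distance=0):
    """Return the unique expected barcode within max_distance, else None."""
    hits = [bc for bc in expected
            if len(bc) == len(observed)
            and hamming(observed, bc) <= max_distance]
    # Ambiguous matches (more than one hit) are left unassigned.
    return hits[0] if len(hits) == 1 else None


header = "@M00123:45:000000000-A1B2C:1:1101:15589:1332 1:N:0:ACGTAC"
observed = barcode_from_header(header)
print(observed)                                          # ACGTAC
print(assign_barcode(observed, ["ACGTAA", "TTTTTT"], 1))  # ACGTAA
```

Reads whose barcode matches no expected barcode, or more than one, would typically go to a separate "unassigned" output.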

Examples

demultiplex_headers --barcodes barcode_map.tsv --distance 2 \
    --hist barcodes.hist --reverse in_reverse.fastq.gz input_forward.fastq.gz

demultiplex_headers --bzip2 --interleaved input_interleaved.fastq.gz