A major goal of metagenomic studies is to characterize the bacterial composition of an environmental sample. MetaPhyler is a novel taxonomic classifier for metagenomic shotgun reads, which uses phylogenetic marker genes as a taxonomic reference. Our classifier, based on BLAST, uses different thresholds (automatically learned from the reference database) for each combination of taxonomic rank, reference gene, and sequence length. Our reference database includes marker genes from all complete genomes, several draft genomes and the NCBI nr protein database. Results on simulated metagenomic datasets demonstrate that MetaPhyler outperforms previous tools used in this context (CARMA, MEGAN and PhymmBL).


New version (05/23/2012): MetaPhylerV1.25.tar.gz (252 MB). This version allows training on your own data set and computes a confidence score for each classification. A tutorial is available: metaphyler_tutorial.pdf.

Short Illumina reads (NEW): MetaPhylerSRV0.115.tar.gz (30 MB). This version is suitable for short reads. It can analyze 10 million metagenomic Illumina reads in 50 minutes on a single processor. Parallelization using multiple threads is also available.

Publication - Please Cite

Bo Liu, Theodore Gibbons, Mohammad Ghodsi, Todd Treangen, Mihai Pop. Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics 2011, 12(Suppl 2):S4.
Bo Liu, Theodore Gibbons, Mohammadreza Ghodsi, Mihai Pop. MetaPhyler: Taxonomic profiling for metagenomic sequences. Proceedings of 2010 IEEE Bioinformatics and Biomedicine:95-100.


Installation

(1) The BLAST software package and Perl are required.
(2) Download MetaPhylerV1.13.tar.gz.
(3) Uncompress the archive with the command: $: tar -xzvf MetaPhylerV1.13.tar.gz
(4) In the './data/blastdb' directory, build the BLAST database with the command:
$: formatdb -p F -i markers.pfasta

Running MetaPhyler

There are two ways to run MetaPhyler: (1) run everything with a single command, OR (2) run MetaPhyler in two steps (mapping and classification).

(1) Run everything with a single command
$: ./ <input fasta file> <Output prefix>


(2) Run MetaPhyler in two steps
(2.1) Map metagenomic reads against the MetaPhyler reference marker genes
$: ./ <input fasta file> > "prefix".blastn
This step is computationally expensive, so users may want to parallelize it in whatever way suits their machines.
(2.2) Estimate the microbial composition and abundances from the BLAST results of step (2.1).
$: ./ "prefix".blastn "prefix"
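
One simple way to parallelize the mapping in step (2.1) is to split the input FASTA file into several chunks and map each chunk separately. The following is a minimal Python sketch of that idea, not part of MetaPhyler itself; the function names (read_fasta, split_round_robin, format_fasta) are hypothetical helpers introduced here for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line and header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_round_robin(records: Iterable[Record], n_chunks: int) -> List[List[Record]]:
    """Distribute records over n_chunks lists so the chunks have similar sizes."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks: List[List[Record]] = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


def format_fasta(records: Iterable[Record]) -> str:
    """Serialize records back to FASTA text, one sequence per line."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)
```

Each chunk can then be written to its own file and mapped with the step (2.1) command on a separate processor or machine. This sketch assumes that the per-chunk BLAST outputs can simply be concatenated into one "prefix".blastn file before step (2.2); that assumption should be checked against the MetaPhyler tutorial.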

Interpreting Results

6 output files will be generated per MetaPhyler run:

Classification of the shotgun reads that originate from phylogenetic marker genes. The format is:
column 1: query sequence id
column 2: phylogenetic marker gene name
column 3: best reference gene hit
column 4: % similarity with best hit
column 5: classification rule
column 6: taxonomic label at genus level
column 7: taxonomic label at family level
column 8: taxonomic label at order level
column 9: taxonomic label at class level
column 10: taxonomic label at phylum level
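
The ten-column layout above can be loaded into named records for downstream analysis. The Python sketch below assumes the file is tab-delimited, which is not stated in this document; the names ReadClassification, parse_classification and reads_per_phylum are hypothetical and not part of MetaPhyler.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class ReadClassification(NamedTuple):
    query_id: str              # column 1
    marker_gene: str           # column 2
    best_hit: str              # column 3
    percent_similarity: float  # column 4
    rule: str                  # column 5
    genus: str                 # column 6
    family: str                # column 7
    order: str                 # column 8
    class_: str                # column 9 ("class" is a Python keyword)
    phylum: str                # column 10


def parse_classification(lines: Iterable[str], sep: str = "\t") -> List[ReadClassification]:
    """Parse classification lines; the tab separator is an assumption."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(sep)
        if len(fields) != 10:
            raise ValueError(f"line {line_no}: expected 10 columns, found {len(fields)}")
        records.append(
            ReadClassification(fields[0], fields[1], fields[2], float(fields[3]), *fields[4:])
        )
    return records


def reads_per_phylum(records: Iterable[ReadClassification]) -> Counter:
    """Count classified reads for each phylum-level label."""
    return Counter(r.phylum for r in records)
```

For a real file, passing an open file handle to parse_classification works the same way as passing a list of lines.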

Bacterial composition at 5 taxonomic ranks. The format is:
column 1: taxonomic clade name
column 2: % relative abundance (based on column 3)
column 3: depth of coverage of genomes
column 4: number of sequences binned to this clade
column 5: similarity with best reference genes (only available at the genus level)
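
The relationship between columns 2 and 3 can be illustrated with a short Python sketch. It assumes a tab-delimited file and assumes that the relative abundance of a clade is its depth of coverage divided by the total depth at that rank, times 100. This is one plausible reading of "based on column 3", not a confirmed description of MetaPhyler's computation. The names CladeAbundance, parse_composition and relative_abundance_from_depth are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class CladeAbundance(NamedTuple):
    clade: str                    # column 1
    percent_abundance: float      # column 2
    depth: float                  # column 3
    n_sequences: int              # column 4
    similarity: Optional[float]   # column 5, genus level only


def parse_composition(lines: Iterable[str], sep: str = "\t") -> List[CladeAbundance]:
    """Parse composition lines; column 5 may be absent above the genus level."""
    rows = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        fields = line.split(sep)
        if len(fields) < 4:
            raise ValueError(f"line {line_no}: expected at least 4 columns")
        has_similarity = len(fields) > 4 and fields[4].strip() != ""
        rows.append(
            CladeAbundance(
                fields[0],
                float(fields[1]),
                float(fields[2]),
                int(fields[3]),
                float(fields[4]) if has_similarity else None,
            )
        )
    return rows


def relative_abundance_from_depth(rows: Iterable[CladeAbundance]) -> Dict[str, float]:
    """Recompute % abundance as 100 * depth / total depth (an assumed formula)."""
    rows = list(rows)
    total = sum(r.depth for r in rows)
    if total <= 0:
        return {r.clade: 0.0 for r in rows}
    return {r.clade: 100.0 * r.depth / total for r in rows}
```

Comparing the recomputed values with column 2 of a real output file is a quick way to check whether the assumed formula matches what MetaPhyler reports.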

Finding Novel Species

Because MetaPhyler uses different classification thresholds at different taxonomic ranks, it can avoid assigning an organism to a lower-level taxonomic group when the evidence does not support that assignment. Novel species in metagenomic samples are represented with the following naming rule: if a novel species (or a set of sequences) can be classified into the family Enterobacteriaceae but cannot be classified into any genus within that family, it is named Enterobacteriaceae{family}.
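
The naming rule can be expressed as a small Python sketch. It generalizes the family/genus example to all five ranks, which is an assumption about how MetaPhyler labels reads at higher ranks; label_at_rank and parse_label are hypothetical helper names, not MetaPhyler functions.

```python
import re
from typing import Dict, Optional, Tuple

# Ranks ordered from lowest to highest, matching the output columns.
RANKS = ("genus", "family", "order", "class", "phylum")

_NOVEL_LABEL = re.compile(r"^(?P<clade>.+)\{(?P<rank>[a-z]+)\}$")


def label_at_rank(lineage: Dict[str, Optional[str]], rank: str) -> Optional[str]:
    """Return the label for `rank`, falling back to 'Name{higher_rank}'.

    `lineage` maps rank names to assigned clade names; a missing or None
    entry means the read could not be classified at that rank.
    """
    start = RANKS.index(rank)
    for candidate in RANKS[start:]:
        name = lineage.get(candidate)
        if name:
            return name if candidate == rank else f"{name}{{{candidate}}}"
    return None  # unclassified at every rank from `rank` upward


def parse_label(label: str) -> Tuple[str, Optional[str]]:
    """Split 'Enterobacteriaceae{family}' into ('Enterobacteriaceae', 'family').

    Ordinary labels are returned as (label, None).
    """
    match = _NOVEL_LABEL.match(label)
    if match:
        return match.group("clade"), match.group("rank")
    return label, None
```

With parse_label, labels carrying a {rank} suffix can be filtered out of the genus-level output to list candidate novel organisms.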