
Introduction: MetaPhyler is a computational tool that characterizes the taxonomic diversity of a
metagenomic sample by identifying phylogenetic marker genes in metagenomic sequences and classifying
them into taxonomic groups. It uses 31 marker genes as a taxonomic reference, which include
marker genes from all complete genomes, several draft genomes and the NCBI nr protein database.
The MetaPhyler classifier, based on BLAST, uses a different threshold (automatically learned from the
reference database) for each combination of taxonomic level, reference marker gene, and sequence length.

Contact: Bo Liu, boliu@umiacs.umd.edu

1 System Requirements:

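  - Local installation of the standalone BLAST software package.
    This can be downloaded from ftp://ftp.ncbi.nih.gov/blast/executables/release/
    The commands in this README use blastall and formatdb, which belong to the
    legacy BLAST package (not BLAST+).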

  - MetaPhyler is implemented in Perl, so a Perl interpreter is required.

2 Installation:

  (1) Download MetaPhylerV1.07.tar.gz from http://cbcb.umd.edu/~boliu/metaphyler/
  
  (2) Uncompress the archive with the command: tar -xzvf MetaPhylerV1.07.tar.gz

  (3) In the 'markerGenes' directory, build the BLAST database with the command:
      formatdb -p T -i markers.pfasta

3 MetaPhyler Usage:

  (1) Map metagenomic reads against the reference marker genes using BLASTX or BLASTP.
      The reference marker gene database is 'MetaPhylerV1.07/markerGenes/markers.pfasta'.
      From the MetaPhylerV1.07 directory, run BLASTX on your input FASTA file (input.fasta here) with the command:
     
      blastall -p blastx -m 8 -b 30 -e 0.01 -d ./markerGenes/markers.pfasta -i input.fasta -o output.blastx
     
      Results are stored in the file output.blastx.
      This is a very computationally expensive step. If your input sequence file is very large,
      you can parallelize it, but be sure to keep the parameters: -m 8 -b 30 -e 0.01
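
One way to parallelize the BLAST step is to split the input FASTA file into several
smaller files and run the same blastall command on each. The Python sketch below is
only an illustration and is not part of MetaPhyler: the function name split_fasta and
the chunk_N.fasta naming are hypothetical. It distributes whole FASTA records
round-robin over a fixed number of chunk files.

```python
from itertools import cycle


def split_fasta(in_path, n_chunks, prefix="chunk"):
    """Distribute FASTA records round-robin over n_chunks files.

    Returns the list of chunk file paths (hypothetical naming scheme:
    <prefix>_<index>.fasta).
    """
    paths = [f"{prefix}_{i}.fasta" for i in range(n_chunks)]
    handles = [open(p, "w") for p in paths]
    try:
        targets = cycle(handles)
        current = None
        with open(in_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    # A header line starts a new record: move to the next chunk.
                    current = next(targets)
                if current is not None:
                    # Header and sequence lines stay together in one chunk.
                    current.write(line)
    finally:
        for handle in handles:
            handle.close()
    return paths
```

Each chunk can then be passed to the blastall command shown above with the same
-m 8 -b 30 -e 0.01 parameters. Because -m 8 output is plain tabular text without
header lines, the per-chunk results can be concatenated into a single output.blastx
before running MetaPhyler.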

  (2) Estimate the microbial diversity and abundances from the BLASTX results.
      Run MetaPhyler with the command:

      perl metaphyler.pl output.blastx result

      Here 'result' is the prefix of the output files. Two files will be generated: result.classify and result.tax.
      result.classify contains taxonomic classification results for reads that contain marker gene sequences.
      The format is:
        column 1: query read id
        column 2: reference gene id
        column 3: reference gene name
        column 4: taxonomic label at genus level
        column 5: taxonomic label at family level
        column 6: taxonomic label at order level
        column 7: taxonomic label at class level
        column 8: taxonomic label at phylum level
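
The eight columns listed above can be loaded for downstream analysis with a short
script. The Python sketch below is not part of MetaPhyler. It assumes the columns are
tab-separated, which this README does not state, so adjust the delimiter if your file
differs. The names Classification, read_classifications and count_by_level are
hypothetical.

```python
from collections import Counter, namedtuple

# One field per column of result.classify, in the documented order.
Classification = namedtuple(
    "Classification",
    ["read_id", "gene_id", "gene_name", "genus",
     "family", "order", "tax_class", "phylum"],
)


def read_classifications(path, delimiter="\t"):
    """Read result.classify into a list of Classification records.

    Assumption: columns are tab-separated. Lines with fewer than
    eight fields (for example blank lines) are skipped.
    """
    records = []
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split(delimiter)
            if len(fields) < 8:
                continue
            records.append(Classification(*fields[:8]))
    return records


def count_by_level(records, level="phylum"):
    """Count classified reads per taxonomic label at the given level."""
    return Counter(getattr(record, level) for record in records)
```

For example, count_by_level(read_classifications("result.classify"), "genus")
gives the number of classified reads per genus label.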

      result.tax contains the estimated taxonomic profile of the query sequences in input.fasta.
      The format is:
        >taxonomic_level
        taxonomic_name percent_abundance
      
      where taxonomic_level is one of genus, family, order, class or phylum.
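
The profile format described above can be read into a nested dictionary. The Python
sketch below is not part of MetaPhyler. It assumes the name and the abundance are
separated by whitespace and that the abundance is a plain number in the last field;
neither detail is stated in this README. The function name read_tax_profile is
hypothetical.

```python
def read_tax_profile(path):
    """Parse result.tax into {taxonomic_level: {taxonomic_name: percent}}.

    Assumptions: a line starting with '>' opens a new taxonomic level,
    and on every other line the last whitespace-separated field is the
    percent abundance as a plain number.
    """
    profile = {}
    level = None
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue
            if line.startswith(">"):
                level = line[1:].strip()
                profile[level] = {}
            elif level is not None:
                # Split from the right so names containing spaces stay intact.
                parts = line.rsplit(None, 1)
                if len(parts) != 2:
                    continue  # no abundance field on this line
                name, percent = parts
                profile[level][name] = float(percent)
    return profile
```

For example, read_tax_profile("result.tax")["phylum"] maps each phylum name to
its estimated percent abundance.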
   
4 Example:

  From the MetaPhylerV1.07 directory, run the following commands:

  blastall -p blastx -m 8 -b 30 -e 0.01 -d ./markerGenes/markers.pfasta -i ./test/test.fasta -o ./test/test.blastx

  perl metaphyler.pl ./test/test.blastx test

  Two files will be generated: test.classify and test.tax.
