Discrepancies between marker database and BLAST

Hello there,

First, I’d like to thank the Biobakery team for their workflows. I used MetaPhlAn and succeeded in everything I wanted to do, mainly thanks to the tutorials and forums available.

Let me briefly describe my issue: I ran MetaPhlAn on clean shotgun sequencing fastq files, exported the absolute abundance table to Phyloseq, and ran alpha diversity, beta diversity, and differential abundance analyses. MetaPhlAn allowed me to identify micro-organisms that are differentially abundant between my control and case groups.

The next step in my research is to investigate whether those micro-organisms can also be detected by qPCR. This is what I did:

  • I downloaded the database used by MetaPhlAn containing all the marker genes (MetaPhlAn 3 | Zenodo)
  • I extracted the 150 marker genes used to identify the bacterium Oscillibacter_sp_57_20.
  • I looked for some of the marker genes on UniProt (BHW41_05790, for instance).
  • I landed on ENSEMBL, where I found the gene sequence, for instance:
    https://bacteria.ensembl.org/Oscillibacter_sp_57_20_gca_001916835/Gene/Sequence?g=BHW41_03065;r=Ley3_66761_scaffold_296:32041-33135;t=OLA40997
  • Then, I BLASTed the obtained sequence to check whether this gene is specific to my bacterium of interest, Oscillibacter_sp_57_20.
  • For one of the 150 marker genes, I found a hit to Oscillibacter sp. with a very low e-value. For another gene, the sequence hit multiple bacteria. For a third gene, the sequence matched no bacterium at all, only viruses!
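The specificity check in the last two steps can be tabulated per marker. Below is a minimal Python sketch, assuming the BLAST results were saved as tab-separated text with three columns (query ID, subject scientific name, e-value), for example from a local run with `-outfmt "6 qseqid sscinames evalue"`. The column layout, the `1e-10` cutoff, the function name, and the example hits are all illustrative assumptions and are not part of MetaPhlAn.

```python
from collections import defaultdict

def classify_markers(blast_lines, target_genus="Oscillibacter", max_evalue=1e-10):
    """Label each query marker as 'specific', 'shared', or 'off-target'.

    Assumed input: lines of 'query<TAB>scientific name<TAB>evalue'.
    The genus is taken as the first word of the scientific name, which is
    a crude approximation (e.g. it mislabels 'uncultured ...' entries).
    Markers without any hit below the cutoff are not reported.
    """
    genera = defaultdict(set)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        query, sci_name, evalue = line.split("\t")[:3]
        if float(evalue) <= max_evalue:
            genera[query].add(sci_name.split()[0])

    labels = {}
    for query, found in genera.items():
        if found == {target_genus}:
            labels[query] = "specific"      # only the target genus was hit
        elif target_genus in found:
            labels[query] = "shared"        # target plus other genera
        else:
            labels[query] = "off-target"    # target genus not among the hits
    return labels

# Made-up hits, for illustration only (not real BLAST results).
example = [
    "markerA\tOscillibacter sp. 57_20\t1e-150",
    "markerB\tOscillibacter sp. 57_20\t1e-120",
    "markerB\tExamplebacter sp.\t1e-80",
    "markerC\tExamplevirus sp.\t1e-60",
]
labels = classify_markers(example)
print(labels)
```

With these made-up hits, markerA is labelled specific, markerB shared, and markerC off-target, which mirrors the three situations described in the last bullet.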

So I am wondering whether I misunderstood something in my protocol, or whether I understood it correctly. In the latter case, does it imply that there are multiple false positives when aligning against the MetaPhlAn database?

Any insight would be helpful :slight_smile:

Hi @Herve
If you want to retrieve the marker sequences for your species, the best thing would be to get them from the available FASTA file: http://cmprod1.cibio.unitn.it/biobakery3/metaphlan_databases/mpa_v30_CHOCOPhlAn_201901.tar using the information in this file: http://cmprod1.cibio.unitn.it/biobakery3/metaphlan_databases/mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2 to know which 150 markers belong to your species.
For the BLAST mapping, which database did you use as reference? Did you use the NCBI BLAST web service or did you run it locally?
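The marker lookup described above could be sketched in Python roughly as follows. This is only a sketch under assumptions: it assumes each line of the decompressed marker-info file has the form `marker_name<TAB>{...}` where the second field is a Python-style dict literal with a `'clade'` key, and that each FASTA header starts with the marker name. Please check a few lines of the real files first. The function names, the clade label, and the toy records are hypothetical.

```python
import ast

def markers_for_clade(info_lines, clade):
    """Return the set of marker names whose 'clade' field equals `clade`."""
    names = set()
    for line in info_lines:
        if not line.strip():
            continue
        name, meta = line.rstrip("\n").split("\t", 1)
        # Assumption: the second column is a Python dict literal.
        if ast.literal_eval(meta).get("clade") == clade:
            names.add(name)
    return names

def filter_fasta(fasta_lines, wanted):
    """Yield (marker_id, sequence) for records whose ID is in `wanted`.

    The ID is taken as the first word after '>' in the header line.
    """
    header, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if header in wanted:
                yield header, "".join(chunks)
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        elif line:
            chunks.append(line)
    if header in wanted:
        yield header, "".join(chunks)

# Toy data standing in for the real files (illustration only).
info = [
    "m1\t{'clade': 's__Oscillibacter_sp_57_20', 'len': 9}\n",
    "m2\t{'clade': 's__Other_species', 'len': 6}\n",
]
fasta = [">m1 toy record", "ATGAAA", "TAA", ">m2", "ATGTAA"]

wanted = markers_for_clade(info, "s__Oscillibacter_sp_57_20")
records = list(filter_fasta(fasta, wanted))
print(records)
```

For the real files, any file handle works as the line source; files ending in `.bz2` can be opened in text mode with `bz2.open(path, "rt")` from the standard library.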

Thank you for your answer !

First, I had not found the FASTA file that contains the sequences associated with my marker genes. It makes it much easier to BLAST each sequence, since I don’t have to retrieve the sequence from another database.

Second, I BLASTed against the “Nucleotide collection” database on NCBI.

I think I found a marker gene that is specific to my species of interest. I will order and try some primers.
I still want to understand how MetaPhlAn works. Does the algorithm require hits on multiple marker genes before it reports a taxon?

Hi @Herve
You can find a detailed description of how the marker gene database was built in our latest manuscript, "Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3" (eLife), section "Generation of MetaPhlAn 3 markers".

Regarding how MetaPhlAn detects a species: by default, your reads need to map against at least 20% of the markers of the species (this can be modified with the --stat_q parameter).
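To make that rule concrete, here is a deliberately simplified Python sketch of the threshold as described in this reply. It is not MetaPhlAn's actual code: the function name and the input (a mapping from marker name to mapped read count) are hypothetical, and `min_fraction=0.2` simply mirrors the 20% mentioned above.

```python
def species_detected(reads_per_marker, min_fraction=0.2):
    """Return True if at least `min_fraction` of the species' markers
    have one or more mapped reads (simplified illustration only)."""
    if not reads_per_marker:
        return False
    hit = sum(1 for n in reads_per_marker.values() if n > 0)
    return hit / len(reads_per_marker) >= min_fraction

# Toy example with 150 hypothetical markers.
many_hits = {f"marker_{i}": (5 if i < 40 else 0) for i in range(150)}
few_hits = {f"marker_{i}": (5 if i < 10 else 0) for i in range(150)}

print(species_detected(many_hits))  # 40 of 150 markers hit
print(species_detected(few_hits))   # 10 of 150 markers hit
```

In this toy case the species is called present when 40 of the 150 markers are hit, and absent when only 10 are, so a single marker with an off-target BLAST match would not be enough on its own to report the species.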

Ah, a big thanks for your answers, everything is fine now :slight_smile: