MetaPhlAn v4 output warning message

Hello biobakery team,
I just finished running my first sample with MetaPhlAn v4 and the “mpa_vJan21_CHOCOPhlAnSGB_202103” database. However, the log file shows the warning “WARNING: MetaPhlAn did not detect any microbial taxa in the sample.” The output file also does not contain any taxonomy data. I ran the same fastq data with MetaPhlAn v3.1 and the “mpa_v31_CHOCOPhlAn_201901” database without any problem. Do you know why this happened? Thank you!

Hi @hellofuture
How many species was MetaPhlAn 3.1 able to detect in your sample? Could you also check in the bowtie2out output how many reads are mapping against the database for 3.1 and 4.0 (or share both files so we can inspect them)?

Hi Aitor,
Thank you! I am attaching the outputs from both MetaPhlAn versions (v3.1 vs v4). It seems the output from v3.1 contains 100% Moraxella catarrhalis. I am sorry, I have already removed the bowtie2 output file from v3.1 in order to run v4. I only have the v4 bowtie2 output, and the file is too big to attach here. Is there another way to share it with you? Thank you!

Metaphlan version 3.1
#mpa_v31_CHOCOPhlAn_201901
#clade_name NCBI_tax_id relative_abundance additional_species
k__Bacteria 2 100.0
k__Bacteria|p__Proteobacteria 2|1224 100.0
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria 2|1224|1236 100.0
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pseudomonadales 2|1224|1236|72274 100.0
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pseudomonadales|f__Moraxellaceae 2|1224|1236|72274|468 100.0
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pseudomonadales|f__Moraxellaceae|g__Moraxella 2|1224|1236|72274|468|475 100.0
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pseudomonadales|f__Moraxellaceae|g__Moraxella|s__Moraxella_catarrhalis 2|1224|1236|72274|468|475|480 100.0 k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pseudomonadales|f__Moraxellaceae|g__Moraxella|s__Moraxella_sp_HMSC061H09

Metaphlan version 4
#mpa_vJan21_CHOCOPhlAnSGB_202103
#clade_name NCBI_tax_id relative_abundance additional_species
UNCLASSIFIED -1 100.0

Hi @hellofuture
You can upload the bowtie2out file here: Biobakery_forum_problems - Google Drive
Having only one species reported in the mpa3.1 profile leads me to two potential scenarios:

  1. In mpa3.1, only 58 marker genes are available for M. catarrhalis. For detection, mpa needs 20% of the markers to be hit by at least one read, so only 12 markers need to be hit to detect the species. In mpa4, M. catarrhalis has 200 marker genes, so at least 40 markers have to be hit to detect it. If the species is at the limit of detection, that can explain why you are not finding it in mpa4.
  2. A good number of the M. catarrhalis marker genes in mpa3.1 are actually quasi-markers (marker genes that are also present in a few genomes of other species, more info here: Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3 | eLife). In contrast, the 200 marker genes in mpa4 are perfectly unique markers. With this in mind, there is a possibility that M. catarrhalis is a false positive (FP) in mpa3.1.
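
The arithmetic behind scenario 1 can be sketched in a few lines of Python. This is only an illustration of the simplified "20% of the markers" rule described above, not MetaPhlAn's actual code; the function name `min_markers_required` is hypothetical.

```python
import math
from fractions import Fraction


def min_markers_required(n_markers, stat_q=0.2):
    """Smallest number of markers that reaches the stat_q fraction.

    Simplified model from this thread, not MetaPhlAn's implementation.
    Fraction(str(...)) avoids floating-point rounding surprises.
    """
    return math.ceil(Fraction(str(stat_q)) * n_markers)


# Marker counts for M. catarrhalis quoted in this thread
print(min_markers_required(58))   # mpa3.1: 12
print(min_markers_required(200))  # mpa4: 40
```

With the same default threshold, the mpa4 database needs more than three times as many markers to be hit before the species is reported.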

Checking the bowtie2out files from both mpa3.1 and mpa4 could help identify what is happening in your sample.
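
As a rough sketch of how such a check could be done, the Python snippet below counts how many reads hit each marker. It assumes the bowtie2out file is tab-separated with the read ID in the first column and the marker name in the second, and that header lines start with "#"; please verify this against your own file. The function name and the marker names in the example are made up for illustration.

```python
from collections import Counter


def reads_per_marker(lines):
    """Count reads per marker from bowtie2out-style lines.

    Assumed layout (check your file): read_id <TAB> marker_name,
    with optional header lines starting with '#'.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and headers
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip malformed lines
        counts[fields[1]] += 1
    return counts


# Tiny made-up example; for a real file, pass an open file handle instead
example = ["#header\n", "read1\tmarkerA\n", "read2\tmarkerA\n", "read3\tmarkerB\n"]
counts = reads_per_marker(example)
print(len(counts), "markers hit by", sum(counts.values()), "reads")
```

The number of distinct markers hit (rather than the total read count) is what matters for the detection threshold discussed above.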

Hi Aitor,
Thank you! I just uploaded the bowtie2out file from Metaphlan v4 to the Google Drive folder. In terms of v3.1, can you let me know how I can generate the bowtie2out file again without downgrading my Metaphlan v4? Thanks!

Hi @hellofuture
I think the best thing to do would be to create a new conda environment and install MetaPhlAn 3.1 there.

I had a look at the bowtie2out and it seems to me that this is scenario #1. Of the 200 markers available for M. catarrhalis, only 36 are being detected (18%, just short of the 40 needed at the default threshold). One solution for you might be to run MetaPhlAn with a lower --stat_q parameter, e.g. --stat_q 0.15.
To understand what the parameter does: --stat_q is the main QC threshold for reporting a species as present. In a really simplified way, it represents the proportion of markers that have to be present to detect a species (by default it is set to 0.2, so 20% of the markers). Reducing the parameter can increase the recall of the tool (thus reducing false negatives), but at the cost of decreasing the precision (increasing the false positives). On average, the default value produces a good balance between false positives and false negatives, but in specific cases, such as samples with low microbial biomass, it can be quite stringent.
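
To make the effect of the threshold concrete, here is a minimal Python sketch that applies the simplified "proportion of markers" reading of --stat_q to the numbers from this sample (36 of 200 markers hit). It is not how MetaPhlAn computes things internally, and the function name `is_detected` is hypothetical.

```python
from fractions import Fraction


def is_detected(markers_hit, n_markers, stat_q):
    """Simplified rule: the fraction of markers hit must reach stat_q."""
    return Fraction(markers_hit, n_markers) >= Fraction(str(stat_q))


# 36 of 200 M. catarrhalis markers were hit in this sample (18%)
for q in (0.2, 0.15):
    print(q, is_detected(36, 200, q))
```

Under this simplified rule the species is missed at the default 0.2 and reported at 0.15, which is why lowering the threshold may recover it, at the cost of more false positives in general.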

Thank you so much, Aitor! This makes great sense.