Drastically different profile when using --add_viruses?

Hello,

I’m trying to profile some metatranscriptomic stool samples with MetaPhlAn 3.0 and getting some strange results. If I run metaphlan3 with default settings, I get a profile that is 100% k__Bacteria. However, when I use the --add_viruses option, I get a profile that is 99.9% Virus (Cucumber green mottle mosaic virus, to be specific) and 0.1% Bacteria. If I run Kraken + Bracken on this sample, it reports only 0.7% of reads as belonging to this virus, which seems a lot more realistic for stool. BLASTX-ing the sample against this virus does not produce many convincing hits either.

I’ve made sure to filter out adapters and low complexity regions (bbduk), and I’m also filtering out human DNA, RNA, and 16S rRNA with kneaddata. Is there something else that could be causing this huge discrepancy between the two methods?

Thank you for your time.


Hi Stephen,
I would not advise using MetaPhlAn to estimate the relative abundance of viruses.
The 99% virus figure arises because MetaPhlAn reports relative abundance, not absolute abundance. If one clade dominates, all the other clades will show very small abundances.
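
To illustrate the point about relative versus absolute abundance, here is a minimal Python sketch. It is not MetaPhlAn code; the function name and all numbers are made up for illustration. It only shows that percentages are shares of whatever was detected, so adding one dominant clade shrinks every other clade without any change in its own signal.

```python
def relative_abundance(signal):
    """Convert per-clade signal (any non-negative unit) into percentages."""
    total = sum(signal.values())
    if total == 0:
        return {clade: 0.0 for clade in signal}
    return {clade: 100.0 * value / total for clade, value in signal.items()}


# Hypothetical signal values, for illustration only.
bacteria_only = relative_abundance({"k__Bacteria": 5.0})
with_virus = relative_abundance({"k__Bacteria": 5.0, "k__Viruses": 4995.0})

print(bacteria_only)  # bacteria take the whole profile
print(with_virus)     # same bacterial signal, now a tiny share
```

The bacterial signal is identical in both calls; only the denominator changes.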

Have you tried to look at the results using the --unknown_estimation parameter?

Could you please expand on why the add_viruses flag would cause the above issue? It’s not likely that read counts matching a viral biomarker would outnumber bacterial biomarkers in a stool sample, so I’m not sure how viral relative abundance would be larger than that of bacteria.

If it’s really not recommended to use the add_viruses flag, I would recommend removing it as an option and not advertising it on the GitHub page. I was very excited when I saw it and have been using it (without the above issue so far), and I only found this thread, which says it’s not a good idea to use it, by chance.

Thanks,
Samantha


The markers used to identify viruses come from the previous database version; we kept them just to maintain the possibility of identifying viruses. In the transcriptome analyzed, only a few reads may have mapped to bacterial markers and slightly more to the Cucumber green mottle mosaic virus, whose relative abundance could then be inflated by its small genome.
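
As a rough illustration of how a small genome can inflate relative abundance, here is a toy Python sketch. It assumes abundance is derived from per-base coverage (mapped bases divided by marker length), which is a simplification and not MetaPhlAn's actual statistic. The function names, read length, read counts, and marker lengths are all hypothetical.

```python
def coverage(reads_mapped, read_length, marker_length):
    """Average per-base coverage over a clade's markers (simplified)."""
    return reads_mapped * read_length / marker_length


def toy_profile(clades, read_length=100):
    """Return (read share %, coverage-based relative abundance %) per clade.

    `clades` maps a clade name to (reads_mapped, total_marker_length_bp).
    """
    total_reads = sum(reads for reads, _ in clades.values())
    cov = {
        name: coverage(reads, read_length, length)
        for name, (reads, length) in clades.items()
    }
    total_cov = sum(cov.values())
    return {
        name: (
            100.0 * clades[name][0] / total_reads,
            100.0 * cov[name] / total_cov,
        )
        for name in clades
    }


# Hypothetical inputs: few bacterial reads spread over long markers,
# slightly more viral reads concentrated on very short markers.
result = toy_profile({
    "k__Bacteria": (10, 100_000),
    "k__Viruses": (30, 1_000),
})

for name, (read_share, rel_abundance) in result.items():
    print(f"{name}: {read_share:.1f}% of mapped reads, "
          f"{rel_abundance:.2f}% relative abundance")
```

With these made-up inputs the virus has three times as many mapped reads as the bacteria, but after dividing by marker length it takes up almost the whole profile. That would be consistent with a read-based tool such as Kraken + Bracken reporting a much smaller viral fraction for the same sample.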

We are working on identifying markers for bacteriophages in order to include them in a future update of the database. Unfortunately, the current pipeline was not designed to identify viral markers.