HUMAnN Is Ignoring Viruses

Why does it only create search indexes for bacteria? Note that Human_erythrovirus was ignored.

06/22/2021 12:42:33 PM - humann.search.prescreen - INFO: Found g__Streptococcus.s__Streptococcus_mitis : 57.60% of mapped reads
06/22/2021 12:42:33 PM - humann.search.prescreen - INFO: Found g__Haemophilus.s__Haemophilus_haemolyticus : 27.09% of mapped reads
06/22/2021 12:42:33 PM - humann.search.prescreen - INFO: Found g__Erythroparvovirus.s__Human_erythrovirus_V9 : 15.31% of mapped reads
06/22/2021 12:42:33 PM - humann.search.prescreen - INFO: Total species selected from prescreen: 3
06/22/2021 12:42:33 PM - humann.search.prescreen - DEBUG: Adding file to database: g__Haemophilus.s__Haemophilus_haemolyticus.centroids.v296_v201901b.ffn.gz
06/22/2021 12:42:33 PM - humann.search.prescreen - DEBUG: Adding file to database: g__Streptococcus.s__Streptococcus_mitis.centroids.v296_v201901b.ffn.gz
06/22/2021 12:42:33 PM - humann.search.prescreen - INFO: Creating custom ChocoPhlAn database ........
06/22/2021 12:42:53 PM - humann.humann - INFO: Total bugs from nucleotide alignment: 2
06/22/2021 12:42:53 PM - humann.humann - INFO: 
g__Haemophilus.s__Haemophilus_haemolyticus: 490 hits
g__Streptococcus.s__Streptococcus_mitis: 120 hits

Hello, HUMAnN will only add the species identified in the prescreen that are also included as files in the ChocoPhlAn database. For any species that was not added, check whether the database is also missing a file with the same name.
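A minimal sketch of that check, in case it is useful. It assumes pangenome files are named with the species label followed by a dot, as in the log above (e.g. g__Haemophilus.s__Haemophilus_haemolyticus.centroids...). The function name split_by_pangenome is hypothetical and not part of HUMAnN.

```python
from pathlib import Path


def split_by_pangenome(species, chocophlan_dir):
    """Split prescreen species into those with and without a pangenome file.

    Assumption: each pangenome file name starts with "<species label>.",
    matching the file names printed in the HUMAnN log.
    """
    # Search recursively, like `find`, in case files sit in subfolders.
    filenames = [p.name for p in Path(chocophlan_dir).rglob("*") if p.is_file()]
    present, missing = [], []
    for label in species:
        has_file = any(name.startswith(label + ".") for name in filenames)
        (present if has_file else missing).append(label)
    return present, missing
```

Calling split_by_pangenome with the three species labels from the prescreen and the path to the ChocoPhlAn folder returns two lists; anything in the second list has no file for HUMAnN to add.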

Thank you,
Lauren

Hey, you’re right.

/verona/biostat/databases$ find Chocophlan3/ -name \*Erythroparvovirus\*
/verona/biostat/databases$

It raises the question of why MetaPhlAn 3 uses a different subset of species from ChocoPhlAn 3 than HUMAnN 3 does. Why can’t users search for Human_erythrovirus_V9 with HUMAnN 3?

Thank you for checking. Is it possible that the MetaPhlAn version (database) and the ChocoPhlAn version you have installed are slightly different (so maybe one is newer than the other)?

Thank you,
Lauren

Both databases are the latest available from their respective websites. For MetaPhlAn, it is mpa_v30_CHOCOPhlAn_201901, and for HUMAnN, it is full_chocophlan.v296_201901b. Why is the one named "full" the one that is missing the virus I noticed? The smaller marker gene database has the virus, but the full version does not.

I can chime in on this one - there are a few things going on:

  1. We’re in the process of re-evaluating methods for viral profiling given that viruses don’t conform to the same principles that MetaPhlAn uses for profiling cellular microbes (e.g. averaging signals from 100s of marker genes). An approximate method (inherited from MetaPhlAn 2) was retained in MetaPhlAn 3 as an “expert mode.”

  2. Partly as a consequence of 1, and partly due to the general “weirdness” of defining a pangenome for viral species (which don’t have quite the same flexible “bag of genes” biology compared with cellular microbes), we opted not to include approximated pangenomes for them in HUMAnN 3.

There are a couple of workarounds:

  1. If MetaPhlAn 3 detects a virus, you can treat the entire viral genome (including all of its proteins) as having been detected.

  2. Since HUMAnN 3 will map viral reads to proteins during the translated search phase, you could use the infer_taxonomy script to work out which unclassified UniRef90s are likely viral in origin.
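For the first workaround, a starting point is to list the viral species that MetaPhlAn reported. The sketch below is only illustrative: it assumes the default MetaPhlAn 3 tab-separated profile with the clade name in the first column and the relative abundance in the third, and it assumes viral clades start with k__Viruses. The function name viral_species is hypothetical.

```python
def viral_species(profile_lines):
    """Return {species label: relative abundance} for viral species rows.

    Assumptions: tab-separated rows of clade_name, NCBI_tax_id,
    relative_abundance, and viral lineages rooted at "k__Viruses".
    """
    found = {}
    for line in profile_lines:
        # Skip blank lines and the "#" header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        levels = fields[0].split("|")
        # Keep only rows whose deepest level is a species under Viruses.
        if levels[0] == "k__Viruses" and levels[-1].startswith("s__"):
            found[levels[-1]] = float(fields[2])
    return found
```

Passing an open profile file (for example, the bugs list that HUMAnN writes to its temp folder) yields the viral species whose genomes could then be treated as detected.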

Hope this helps!

Hi @franzosa
would you advise against passing --metaphlan-options="--add_viruses" to HUMAnN 3?

It shouldn’t really change things downstream of MetaPhlAn since HUMAnN 3 doesn’t have viral pangenomes to map against (as discussed above). I suppose if you detected A LOT of virus it might compress the relative abundance of a bacterial species in your sample below HUMAnN’s inclusion threshold, but the threshold is pretty lenient by default (0.01%).
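To put rough numbers on that compression effect, here is a small sketch. It assumes the profile is renormalized so that all detected clades, viruses included, sum to 100%, which means a bacterial species is scaled down by the viral share. Both function names are hypothetical; the 0.01% default is the threshold mentioned above.

```python
def abundance_with_viruses(bacterial_pct, viral_pct):
    """Relative abundance (%) of a bacterial species once viruses
    take up viral_pct percent of the renormalized profile."""
    return bacterial_pct * (1.0 - viral_pct / 100.0)


def viral_share_needed(bacterial_pct, threshold_pct=0.01):
    """Viral share (%) at which the species is compressed down to the threshold."""
    if bacterial_pct <= threshold_pct:
        return 0.0  # already at or below the threshold without any virus
    return 100.0 * (1.0 - threshold_pct / bacterial_pct)
```

Under that assumption, a species at 1% in a bacteria-only profile would need viruses to account for about 99% of the profile before it reached 0.01%, and a species at 0.02% would need about 50%.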
