High proportion of Unmapped reads in metagenomic data

Hello,

I am working with a lake water sample, and my focus is on harmful cyanobacteria. I ran the metagenomic sequences through HUMAnN 3.

The configuration I used for my sample is:

HUMAnN Configuration ( Section : Name = Value )
database_folders : nucleotide = /home/hassan/Desktop/hdatabases/chocophlan
database_folders : protein = /home/hassan/Desktop/hdatabases/uniref50ecf
database_folders : utility_mapping = /home/hassan/Desktop/b/utility_mapping
run_modes : resume = False
run_modes : verbose = False
run_modes : bypass_prescreen = False
run_modes : bypass_nucleotide_index = False
run_modes : bypass_nucleotide_search = False
run_modes : bypass_translated_search = False
run_modes : threads = 1
alignment_settings : evalue_threshold = 1.0
alignment_settings : prescreen_threshold = 0.01
alignment_settings : translated_subject_coverage_threshold = 50.0
alignment_settings : translated_query_coverage_threshold = 90.0
alignment_settings : nucleotide_subject_coverage_threshold = 50.0
alignment_settings : nucleotide_query_coverage_threshold = 90.0
output_format : output_max_decimals = 10
output_format : remove_stratified_output = False
output_format : remove_column_description_output = False

Outputs I got:

Removing spaces from identifiers in input file …

Running metaphlan …

Found g__GGB43952.s__GGB43952_SGB61317 : 24.13% of mapped reads ( )
Found g__GGB43067.s__GGB43067_SGB57480 : 13.42% of mapped reads ( )
Found g__GGB46342.s__GGB46342_SGB64120 : 13.10% of mapped reads ( )
Found g__GGB24856.s__GGB24856_SGB81948 : 8.95% of mapped reads ( )
Found g__GGB43951.s__GGB43951_SGB61315 : 6.47% of mapped reads ( )
Found g__GGB35689.s__GGB35689_SGB85076 : 6.18% of mapped reads ( )
Found g__Polynucleobacter.s__Polynucleobacter_sp_MWH_UH24A : 4.19% of mapped reads ( )
Found g__GGB25977.s__GGB25977_SGB37971 : 3.43% of mapped reads ( )
Found g__GGB59202.s__GGB59202_SGB80969 : 3.40% of mapped reads ( )
Found t__SGB13449 : 3.15% of mapped reads ( s__Cylindrospermopsis_raciborskii,g__Raphidiopsis.s__Raphidiopsis_brookii )
Found t__SGB24761 : 2.51% of mapped reads ( s__Cuspidothrix_issatschenkoi )
Found t__SGB24471 : 2.42% of mapped reads ( s__Phenylobacterium_sp_HYN0004 )
Found g__GGB43953.s__GGB43953_SGB61318 : 2.39% of mapped reads ( )
Found g__Candidatus_Methylopumilus.s__Candidatus_Methylopumilus_rimovensis : 1.20% of mapped reads ( )
Found g__GGB57651.s__GGB57651_SGB79249 : 0.84% of mapped reads ( )
Found g__GGB34754.s__GGB34754_SGB82226 : 0.65% of mapped reads ( )
Found t__SGB28829 : 0.58% of mapped reads ( s__Pelagibacterales_bacterium )
Found g__Pseudanabaena.s__Pseudanabaena_sp_FACHB_1050 : 0.54% of mapped reads ( )
Found t__SGB24760 : 0.36% of mapped reads ( s__Anabaena_sp_CRKS33,g__Dolichospermum.s__Dolichospermum_planctonicum,g__Dolichospermum.s__Dolichospermum_flos_aquae,g__Dolichospermum.s__Dolichospermum_sp_FACHB_1091,g__Anabaena.s__Anabaena_sp_FACHB_1250,g__Anabaena.s__Anabaena_sp_FACHB_1391 )
Found t__SGB13518 : 0.28% of mapped reads ( s__Microcystis_aeruginosa,g__Microcystis.s__Microcystis_viridis,g__Microcystis.s__Microcystis_wesenbergii,g__Microcystis.s__Microcystis_sp_0824,g__Microcystis.s__Microcystis_sp_T1_4,g__Microcystis.s__Microcystis_sp_LEGE_00066,g__Microcystis.s__Microcystis_sp_MC19,g__Microcystis.s__Microcystis_sp_LEGE_08355,g__Microcystis.s__Microcystis_flos_aquae )
Found g__Alphaproteobacteria_unclassified.s__alpha_proteobacterium_SCGC_AAA028_D10 : 0.28% of mapped reads ( g__Alphaproteobacteria_unclassified.s__alpha_proteobacterium_SCGC_AAA027_C06,g__Alphaproteobacteria_unclassified.s__alpha_proteobacterium_SCGC_AAA027_L15 )
Found g__GGB73741.s__GGB73741_SGB49722 : 0.27% of mapped reads ( )
Found g__GGB32003.s__GGB32003_SGB45716 : 0.16% of mapped reads ( )
Found g__GGB43055.s__GGB43055_SGB60296 : 0.14% of mapped reads ( )
Found g__GGB24725.s__GGB24725_SGB36612 : 0.13% of mapped reads ( )
Found g__Limnohabitans.s__Limnohabitans_sp_103DPR2 : 0.11% of mapped reads ( g__Limnohabitans.s__Limnohabitans_sp_Hippo4 )
Found g__Actinomycetia_unclassified.s__actinobacterium_SCGC_AAA028_A23 : 0.10% of mapped reads ( )
Found g__Pseudanabaena.s__Pseudanabaena_yagii : 0.10% of mapped reads ( )
Found t__SGB5711 : 0.09% of mapped reads ( s__Candidatus_Nanopelagicus_limnes )
Found t__SGB13423 : 0.08% of mapped reads ( s__Planktothrix_agardhii,g__Planktothrix.s__Planktothrix_rubescens,g__Planktothrix.s__Planktothrix_prolifica )
Found g__GGB56956.s__GGB56956_SGB78416 : 0.08% of mapped reads ( )
Found g__GGB32489.s__GGB32489_SGB48813 : 0.08% of mapped reads ( )
Found g__GGB62809.s__GGB62809_SGB85028 : 0.07% of mapped reads ( )
Found t__SGB24763 : 0.06% of mapped reads ( s__Sphaerospermopsis_kisseleviana,g__Sphaerospermopsis.s__Sphaerospermopsis_kisseleviana,g__Sphaerospermopsis.s__Sphaerospermopsis_sp_FACHB_1194,g__Sphaerospermopsis.s__Sphaerospermopsis_sp_LEGE_08334,g__Sphaerospermopsis.s__Sphaerospermopsis_sp_FACHB_1094,g__Sphaerospermopsis.s__Sphaerospermopsis_reniformis,g__Sphaerospermopsis.s__Sphaerospermopsis_sp_LEGE_00249 )
Found g__GGB46492.s__GGB46492_SGB64353 : 0.02% of mapped reads ( )
Found g__GGB43954.s__GGB43954_SGB61319 : 0.02% of mapped reads ( )
Found g__GGB44382.s__GGB44382_SGB61797 : 0.01% of mapped reads ( )

Total species selected from prescreen: 71

Selected species explain 99.99% of predicted community composition

Creating custom ChocoPhlAn database …

Running bowtie2-build …

Running bowtie2 …

Total bugs from nucleotide alignment: 13

g__Cuspidothrix.s__Cuspidothrix_issatschenkoi: 46989 hits
g__Anabaena.s__Anabaena_sp_CRKS33: 10985 hits
g__Pelagibacterales_unclassified.s__Pelagibacterales_bacterium: 6402 hits
g__Cylindrospermopsis.s__Cylindrospermopsis_raciborskii: 30055 hits
g__Phenylobacterium.s__Phenylobacterium_sp_HYN0004: 28182 hits
g__Microcystis.s__Microcystis_aeruginosa: 12437 hits
g__Raphidiopsis.s__Raphidiopsis_brookii: 47806 hits
g__Sphaerospermopsis.s__Sphaerospermopsis_kisseleviana: 3140 hits
g__Candidatus_Nanopelagicus.s__Candidatus_Nanopelagicus_limnes: 7823 hits
g__Planktothrix.s__Planktothrix_agardhii: 1498 hits
g__Microcystis.s__Microcystis_flos_aquae: 1067 hits
g__Microcystis.s__Microcystis_wesenbergii: 661 hits
g__Planktothrix.s__Planktothrix_rubescens: 835 hits

Total gene families from nucleotide alignment: 7904

Unaligned reads after nucleotide alignment: 99.2391962932 %

Running diamond …

Aligning to reference database: uniref50_201901b_ec_filtered.dmnd

Total bugs after translated alignment: 14

g__Cuspidothrix.s__Cuspidothrix_issatschenkoi: 46989 hits
g__Anabaena.s__Anabaena_sp_CRKS33: 10985 hits
g__Pelagibacterales_unclassified.s__Pelagibacterales_bacterium: 6402 hits
g__Cylindrospermopsis.s__Cylindrospermopsis_raciborskii: 30055 hits
g__Phenylobacterium.s__Phenylobacterium_sp_HYN0004: 28182 hits
g__Microcystis.s__Microcystis_aeruginosa: 12437 hits
g__Raphidiopsis.s__Raphidiopsis_brookii: 47806 hits
g__Sphaerospermopsis.s__Sphaerospermopsis_kisseleviana: 3140 hits
g__Candidatus_Nanopelagicus.s__Candidatus_Nanopelagicus_limnes: 7823 hits
g__Planktothrix.s__Planktothrix_agardhii: 1498 hits
g__Microcystis.s__Microcystis_flos_aquae: 1067 hits
g__Microcystis.s__Microcystis_wesenbergii: 661 hits
g__Planktothrix.s__Planktothrix_rubescens: 835 hits

unclassified: 1965349 hits

Total gene families after translated alignment: 55375

Unaligned reads after translated alignment: 92.6032063024 %
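
For anyone comparing mapping rates across samples, here is a minimal sketch that pulls the two "Unaligned reads after ..." lines out of a HUMAnN log and turns them into aligned percentages. The function names are my own, and it assumes both percentages are relative to the total input reads, so the translated figure is cumulative.

```python
import re

# Matches HUMAnN log lines such as
# "Unaligned reads after nucleotide alignment: 99.2391962932 %"
UNALIGNED_RE = re.compile(
    r"Unaligned reads after (nucleotide|translated) alignment:\s*([0-9.]+)\s*%"
)


def unaligned_percentages(log_text):
    """Return a dict like {'nucleotide': pct, 'translated': pct} for stages found."""
    return {stage: float(pct) for stage, pct in UNALIGNED_RE.findall(log_text)}


def aligned_summary(log_text):
    """Summarize how many reads were aligned at each stage.

    Assumption: both percentages refer to the total input reads, so the
    translated value already includes reads placed by the nucleotide step.
    """
    pcts = unaligned_percentages(log_text)
    summary = {}
    if "nucleotide" in pcts:
        summary["nucleotide_aligned"] = 100.0 - pcts["nucleotide"]
    if "translated" in pcts:
        summary["total_aligned"] = 100.0 - pcts["translated"]
    if len(pcts) == 2:
        # Reads rescued by the translated search alone.
        summary["translated_only"] = pcts["nucleotide"] - pcts["translated"]
    return summary


if __name__ == "__main__":
    example_log = (
        "Unaligned reads after nucleotide alignment: 99.2391962932 %\n"
        "Unaligned reads after translated alignment: 92.6032063024 %\n"
    )
    for key, value in aligned_summary(example_log).items():
        print(f"{key}: {value:.4f} %")
```

Running it on the full log file (read with `open(path).read()`) gives the same numbers as above in a form that is easier to tabulate per sample.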

After running MetaPhlAn, 71 species were identified, but most of them (58) are unclassified species. 13 species were classified by MetaPhlAn, which is consistent with the bowtie2 and diamond (uniref50 EC-filtered) results. These species are mostly related to algal blooms, which is also my area of interest. My questions are:

  1. Is there any way to increase the number of identified species and decrease the percentage of unaligned reads after nucleotide and translated alignment (currently approx. 99% and 92% respectively)?

  2. What could be the reason for finding only algal-bloom-related species in my sample, although a major portion of the reads is unaligned?

  3. What is the difference between the EC-filtered database and the unfiltered one?

  4. If there is no way to increase the percentage of aligned reads, do you think this percentage is usual for this kind of water sample?

(I have also attached the log file of my run for reference.)
run0043_lane9_read2_indexN726-S518=ENN-8-17-16.txt (113.8 KB)

Thanks,

Hassan

  1. The bottom line (according to your MetaPhlAn output) is that most of the species in these samples are just not very well characterized, so HUMAnN 3 has no pangenomes for them to align against. HUMAnN 4 will improve this by including pangenomes populated by MAGs from different environments, but we are always playing “catch-up” to some extent on the less well-studied environments. For translated search, are you using the EC-filtered database (per your question 3)? That only contains proteins that have an EC annotation (~10%), so mapping to that database will produce fewer translated hits, but is faster and more interpretable.

  2. I am not an expert on the biology here, but perhaps the species responsible for algal blooms have been better studied / sequenced than the background species? This is true for human pathogens (as an analogy).

  3. See comment in 1.

  4. I have not worked with this sort of sample enough to have a good % mapping rate in mind. You might check the forum for rates reported by other HUMAnN users with similar environmental communities for comparison?
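
To put a rough number on point 1 for a given sample, here is a minimal sketch that tallies the prescreen "Found ..." lines into taxa that only carry placeholder identifiers (`s__GGB..._SGB...`) versus taxa with a named species. The placeholder rule is my own heuristic for illustration, not something HUMAnN or MetaPhlAn defines.

```python
import re

# Prescreen lines in the HUMAnN output look like:
# "Found g__GGB43952.s__GGB43952_SGB61317 : 24.13% of mapped reads ( )"
FOUND_RE = re.compile(r"^Found (\S+) : ([0-9.]+)% of mapped reads \((.*)\)\s*$")

# Heuristic (an assumption): a species label of the form s__GGB<n>_SGB<n>
# means the SGB has no named species attached to it.
PLACEHOLDER_RE = re.compile(r"s__GGB\d+_SGB\d+$")


def tally_prescreen(log_text):
    """Count prescreen taxa and sum their percentages by category."""
    counts = {"placeholder": 0, "named": 0}
    percents = {"placeholder": 0.0, "named": 0.0}
    for line in log_text.splitlines():
        match = FOUND_RE.match(line.strip())
        if match is None:
            continue  # not a prescreen line
        taxon, pct = match.group(1), float(match.group(2))
        kind = "placeholder" if PLACEHOLDER_RE.search(taxon) else "named"
        counts[kind] += 1
        percents[kind] += pct
    return counts, percents


if __name__ == "__main__":
    example = "\n".join([
        "Found g__GGB43952.s__GGB43952_SGB61317 : 24.13% of mapped reads ( )",
        "Found g__Polynucleobacter.s__Polynucleobacter_sp_MWH_UH24A : 4.19% of mapped reads ( )",
        "Found t__SGB24761 : 2.51% of mapped reads ( s__Cuspidothrix_issatschenkoi )",
    ])
    counts, percents = tally_prescreen(example)
    print(counts)
    print(percents)
```

A large share of mapped reads in the placeholder category would be consistent with the community being dominated by poorly characterized species.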

Thanks a lot for your reply. It helped me a lot.

Currently I am facing an issue with the MetaPhlAn bugs list.

Log file of the run:
062615a_S9_L001_R1_001.txt (117.3 KB)

Here is the concerning portion of my output:

Creating output directory: /home/hassan/Desktop/bouy/withoutbypass
Output files will be written to: /home/hassan/Desktop/bouy/withoutbypass
Decompressing gzipped file …

Removing spaces from identifiers in input file …

Running metaphlan …

Found t__SGB24760 : 100.00% of mapped reads ( s__Anabaena_sp_CRKS33,g__Dolichospermum.s__Dolichospermum_planctonicum,g__Dolichospermum.s__Dolichospermum_flos_aquae,g__Dolichospermum.s__Dolichospermum_sp_FACHB_1091,g__Anabaena.s__Anabaena_sp_FACHB_1250,g__Anabaena.s__Anabaena_sp_FACHB_1391 )

Total species selected from prescreen: 7

062615a_S9_L001_R1_001_metaphlan_bugs_list.tsv (2.1 KB)

Although the total number of species selected from the prescreen is 7, the MetaPhlAn bugs list file contains only 1 species, shown below:

k__Bacteria|p__Cyanobacteria|c__Cyanobacteria_unclassified|o__Nostocales|f__Aphanizomenonaceae|g__Dolichospermum|s__Dolichospermum_circinale

What could be the reason for this? How can I get the complete bug list with their relative abundance?

Thanks

Hi @franzosa ,
I would appreciate your kind suggestion regarding this issue. Thanks

This is a function of the compatibility mode between HUMAnN 3 and MetaPhlAn 4. MetaPhlAn 4 is reporting an SGB that was historically grouped into multiple named species, so HUMAnN 3 will map against the pangenomes of all of those species. You can find the full taxonomic profile under the HUMAnN temp folder for your sample.
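
To see which named species a single SGB expands into, here is a minimal sketch that parses one prescreen "Found ..." line from the HUMAnN output. The function name is hypothetical, and the line format is taken only from the log shown in this thread.

```python
import re

# Format taken from the HUMAnN log in this thread:
# "Found t__SGB24760 : 100.00% of mapped reads ( s__A,g__X.s__B,... )"
FOUND_RE = re.compile(r"^Found (\S+) : ([0-9.]+)% of mapped reads \((.*)\)\s*$")


def species_in_found_line(line):
    """Return (taxon, percent, [species names]) for one prescreen line."""
    match = FOUND_RE.match(line.strip())
    if match is None:
        raise ValueError("not a prescreen 'Found' line")
    taxon, pct, names = match.group(1), float(match.group(2)), match.group(3)
    species = []
    for name in names.split(","):
        name = name.strip()
        if not name:
            continue  # "( )" means no species are listed
        # Drop a leading genus part: "g__X.s__Y" -> "s__Y"
        if ".s__" in name:
            name = "s__" + name.split(".s__", 1)[1]
        species.append(name)
    return taxon, pct, species


if __name__ == "__main__":
    line = (
        "Found t__SGB24760 : 100.00% of mapped reads ( s__Anabaena_sp_CRKS33,"
        "g__Dolichospermum.s__Dolichospermum_planctonicum,"
        "g__Dolichospermum.s__Dolichospermum_flos_aquae,"
        "g__Dolichospermum.s__Dolichospermum_sp_FACHB_1091,"
        "g__Anabaena.s__Anabaena_sp_FACHB_1250,"
        "g__Anabaena.s__Anabaena_sp_FACHB_1391 )"
    )
    taxon, pct, species = species_in_found_line(line)
    print(taxon, pct)
    for name in species:
        print("  ", name)
```

This only lists the names printed in parentheses on that line; it does not try to reproduce the "Total species selected from prescreen" count that HUMAnN reports.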


Screenshot from 2023-10-27 11-13-05
Hi @franzosa,
These are the outputs from HUMAnN.
Can you please tell me which file contains the full taxonomic profile?

062615a_S9_L001_R1_001_metaphlan_bugs_list.tsv (2.1 KB)
This file contains
k__Bacteria|p__Cyanobacteria|c__Cyanobacteria_unclassified|o__Nostocales|f__Aphanizomenonaceae|g__Dolichospermum|s__Dolichospermum_circinale

On the other hand,
062615a_S9_L001_R1_001_pathabundance.tsv (1.2 KB)

contains g__Anabaena.s__Anabaena_sp_CRKS33 in its stratified output, which is not the same species.

Can you please tell me why they differ?

Thanks

I think my above response addresses this?
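
For readers who want to check this on their own bugs list, here is a minimal sketch that lists the species-level rows of a MetaPhlAn-style profile together with any additional species recorded for the same row. It assumes a tab-separated file whose header line starts with `#clade_name` and includes `relative_abundance` and `additional_species` columns; please verify that against your own file. The example profile and its species names are made up for illustration.

```python
def read_species_rows(profile_text):
    """Return [(species, relative_abundance, [additional species])].

    Assumption: the header line starts with '#clade_name' and the columns
    are tab-separated; other lines starting with '#' are comments.
    """
    header = None
    rows = []
    for line in profile_text.splitlines():
        if line.startswith("#clade_name"):
            header = line.lstrip("#").split("\t")
            continue
        if not line.strip() or line.startswith("#") or header is None:
            continue
        fields = dict(zip(header, line.split("\t")))
        # The last element of the lineage tells us the rank of this row.
        leaf = fields["clade_name"].split("|")[-1]
        if not leaf.startswith("s__"):
            continue  # skip kingdom..genus rows and t__SGB rows
        extra = fields.get("additional_species", "")
        others = [name.strip() for name in extra.split(",") if name.strip()]
        rows.append((leaf, float(fields["relative_abundance"]), others))
    return rows


# Hypothetical profile, NOT real data: one SGB reported under one species
# name, with two other names carried in the additional_species column.
EXAMPLE_PROFILE = "\n".join([
    "#example_database_version",
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species",
    "k__Bacteria\t0\t100.0\t",
    "k__Bacteria|g__Example|s__Example_species_A\t0\t100.0\ts__Example_species_B,s__Example_species_C",
    "k__Bacteria|g__Example|s__Example_species_A|t__SGB00000\t0\t100.0\t",
])

if __name__ == "__main__":
    for species, abundance, others in read_species_rows(EXAMPLE_PROFILE):
        print(species, abundance, others)
```

If the species named in the bugs list and the species in the stratified output both belong to the same SGB, a row like the one above is where that relationship would be visible.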