Hello,
I am new to MetaPhlAn and am experimenting with its command-line options.
I have an environmental sample (wastewater) available as a FASTQ file, and I am wondering which command-line options you would suggest for MetaPhlAn in this case.
When I first ran MetaPhlAn with the --unclassified_estimation option, around 80% of the relative abundance was flagged as UNCLASSIFIED.
So I lowered the stat_q parameter and ignored the MAPQ value by adding the arguments --stat_q 0.1 --min_mapq_val -1, as suggested in the forum (e.g. in the threads "Understanding Parameters (stat_q) for Environmental sample" and "Which parameters to tweak to improve abundance calculations?").
In another post (e.g. "MetaPhlan3 –unknown_estimation"), I found the suggestion that for longer reads one can run the Bowtie2 alignment in local mode, i.e. with very-sensitive-local instead of the default very-sensitive preset, by adding --bt2_ps very-sensitive-local.
Interestingly, the relative abundances differ hugely between two MetaPhlAn runs that differ only in the Bowtie2 mode: the default very-sensitive mode versus the very-sensitive-local mode.
This confuses me, and I wonder which of the two you would suggest, i.e. which one makes more sense.
The raw reads of my samples have variable length. The average raw read length per sample is between 179bp and 326bp.
The minimum raw read length is 25bp, and the maximum raw read length varies between 707bp and 837bp.
Based on these read length stats, would you say it makes sense to use the local Bowtie2 alignment preset?
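As a side note, read-length figures like the ones above can be recomputed directly from a FASTQ file with a few lines of Python. This is only a sketch: it assumes plain four-line FASTQ records (no wrapped sequences), the function names are made up for illustration, and the default threshold of 60 simply mirrors the --read_min_len 60 used in the commands in this post.

```python
from statistics import mean


def fastq_read_lengths(lines):
    """Yield the sequence length of every record in a four-line FASTQ stream."""
    for index, line in enumerate(lines):
        # In four-line FASTQ, the sequence is the second line of each record.
        if index % 4 == 1:
            yield len(line.strip())


def summarize_lengths(lengths, min_len=60):
    """Return min, mean, max, and the share of reads shorter than min_len."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no reads found")
    short = sum(1 for n in lengths if n < min_len)
    return {
        "reads": len(lengths),
        "min": min(lengths),
        "mean": mean(lengths),
        "max": max(lengths),
        # Share of reads that a minimum-length filter of min_len would drop.
        "pct_below_min_len": 100.0 * short / len(lengths),
    }
```

Passing an open file handle for the sample, e.g. summarize_lengths(fastq_read_lengths(open("../raw/sample.fastq"))), would give the length summary for that sample together with the percentage of reads shorter than the chosen minimum length.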
Below I report the MetaPhlAn profiling results obtained with and without the local Bowtie2 mode.
One can see that in local mode the UNCLASSIFIED relative abundance is reported as 0.0, whereas it is around 63% in non-local mode. This gives a distorted view of the abundance estimation.
Is there an error in how the relative abundance is computed in local mode?
When using the local mode, I also see Eukaryota in the result, as well as some other bacterial phyla that are not present when using the non-local mode.
#mpa_vJan21_CHOCOPhlAnSGB_202103
#/Users/bernhard/miniconda3/envs/metaphlan4_py3.9/bin/metaphlan ../raw/sample.fastq --input_type fastq --nproc 10 --bowtie2db ../metaphlan4_db_mac/ --read_min_len 60 --bowtie2out ./sample_bowtie2out_local.txt -o sample_profiled_metagenome_local.txt -t rel_ab_w_read_stats --unclassified_estimation --stat_q 0.1 --min_mapq_val -1 --bt2_ps very-sensitive-local
#3109526 reads processed
#Metaphlan_Analysis
#estimated_reads_mapped_to_known_clades:1371133
#clade_name clade_taxid relative_abundance coverage estimated_number_of_reads_from_the_clade
UNCLASSIFIED -1 0.0 - 1738393
k__Bacteria 2 99.68134 0.41586 1331663
k__Archaea 2157 0.21061 0.00088 1769
k__Eukaryota 2759 0.10805 0.00045 37701
k__Bacteria|p__Planctomycetes 2|203682 83.63309 0.34891 1126429
k__Bacteria|p__Proteobacteria 2|1224 12.69694 0.05297 155939
k__Bacteria|p__Bacteroidetes 2|976 1.76021 0.00734 25752
k__Bacteria|p__Ignavibacteriae 2|1134404 0.82266 0.00343 13741
k__Archaea|p__Euryarchaeota 2157|28890 0.21061 0.00088 1769
k__Bacteria|p__Actinobacteria 2|201174 0.16971 0.00071 2452
k__Bacteria|p__Nitrospirae 2|40117 0.13247 0.00055 2283
k__Bacteria|p__Firmicutes 2|1239 0.12792 0.00053 1607
k__Eukaryota|p__Apicomplexa 2759|5794 0.10805 0.00045 37701
k__Bacteria|p__Tenericutes 2|544448 0.10707 0.00045 298
k__Bacteria|p__Spirochaetes 2|203691 0.1046 0.00044 1017
k__Bacteria|p__Chloroflexi 2|200795 0.08364 0.00035 1537
k__Bacteria|p__Verrucomicrobia 2|74201 0.02003 8e-05 406
k__Bacteria|p__Fusobacteria 2|32066 0.01314 5e-05 50
k__Bacteria|p__Chlamydiae 2|204428 0.00831 3e-05 104
k__Bacteria|p__Acidobacteria 2|57723 0.00153 1e-05 48
Below is the result for the non-local mode:
#mpa_vJan21_CHOCOPhlAnSGB_202103
#/Users/bernhard/miniconda3/envs/metaphlan4_py3.9/bin/metaphlan ../raw/sample.fastq --input_type fastq --nproc 10 --bowtie2db ../metaphlan4_db_mac/ --read_min_len 60 --bowtie2out ./sample_bowtie2out.txt -o sample_profiled_metagenome.txt -t rel_ab_w_read_stats --unclassified_estimation --stat_q 0.1 --min_mapq_val -1
#3109526 reads processed
#Metaphlan_Analysis
#estimated_reads_mapped_to_known_clades:996571
#clade_name clade_taxid relative_abundance coverage estimated_number_of_reads_from_the_clade
UNCLASSIFIED -1 63.27951 - 2112955
k__Bacteria 2 36.65945 0.31195 995533
k__Archaea 2157 0.06104 0.00052 1038
k__Bacteria|p__Planctomycetes 2|203682 34.34729 0.29227 939923
k__Bacteria|p__Proteobacteria 2|1224 1.83997 0.01566 43584
k__Bacteria|p__Bacteroidetes 2|976 0.34498 0.00294 9912
k__Archaea|p__Euryarchaeota 2157|28890 0.06104 0.00052 1038
k__Bacteria|p__Tenericutes 2|544448 0.04194 0.00036 238
k__Bacteria|p__Firmicutes 2|1239 0.03929 0.00033 902
k__Bacteria|p__Spirochaetes 2|203691 0.03652 0.00031 724
k__Bacteria|p__Nitrospirae 2|40117 0.00321 3e-05 120
k__Bacteria|p__Chlamydiae 2|204428 0.00272 2e-05 69
k__Bacteria|p__Fusobacteria 2|32066 0.00194 2e-05 15
k__Bacteria|p__Actinobacteria 2|201174 0.0016 1e-05 46
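To make the comparison between the two reports above easier to follow, here is a rough Python sketch that parses a rel_ab_w_read_stats report and relates the read count on the UNCLASSIFIED line to the total in the "reads processed" header. It is only a sanity check: the function names are hypothetical, whitespace-separated columns are assumed, and the read-based percentage is a simple ratio, not necessarily the formula MetaPhlAn uses for relative_abundance.

```python
def parse_profile(text):
    """Parse a rel_ab_w_read_stats report into (total_reads, {clade: (rel_ab, reads)})."""
    total_reads = None
    clades = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Header line of the form "#3109526 reads processed".
            if line.endswith("reads processed"):
                total_reads = int(line[1:].split()[0])
            continue
        # Columns: clade_name, clade_taxid, relative_abundance, coverage, reads.
        name, _taxid, rel_ab, _coverage, reads = line.split()[:5]
        clades[name] = (float(rel_ab), int(reads))
    return total_reads, clades


def unclassified_read_pct(total_reads, clades):
    """Percentage of processed reads listed on the UNCLASSIFIED line."""
    return 100.0 * clades["UNCLASSIFIED"][1] / total_reads


def only_in_first(clades_a, clades_b):
    """Clades reported in the first profile but missing from the second."""
    return sorted(set(clades_a) - set(clades_b))


# Kingdom-level excerpts of the two reports in this post.
LOCAL = """#3109526 reads processed
UNCLASSIFIED -1 0.0 - 1738393
k__Bacteria 2 99.68134 0.41586 1331663
k__Archaea 2157 0.21061 0.00088 1769
k__Eukaryota 2759 0.10805 0.00045 37701
"""
# Bowtie2's non-local alignment mode is called end-to-end.
END_TO_END = """#3109526 reads processed
UNCLASSIFIED -1 63.27951 - 2112955
k__Bacteria 2 36.65945 0.31195 995533
k__Archaea 2157 0.06104 0.00052 1038
"""

total_local, local = parse_profile(LOCAL)
total_e2e, e2e = parse_profile(END_TO_END)
print(round(unclassified_read_pct(total_local, local), 2))
print(round(unclassified_read_pct(total_e2e, e2e), 2))
print(only_in_first(local, e2e))
```

For the two reports here, this read-based ratio comes out at roughly 56% (local) and roughly 68% (non-local), so the 0.0 on the UNCLASSIFIED line in local mode does not match the read count reported on that same line.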
One more thing: according to the manual, the --bt2_ps flag is only applied when a FASTA file is provided. Nevertheless, when I supplied a FASTQ file, the local preset was also passed on when bowtie-align was called. I am not sure whether the manual is out of date, or whether it makes any difference that I do not supply a FASTA file. Could you clarify that for me, please?
--bt2_ps BowTie2 presets
Presets options for BowTie2 (applied only when a FASTA file is provided)
The choices enabled in MetaPhlAn are:
* sensitive
* very-sensitive
* sensitive-local
* very-sensitive-local
[default very-sensitive]
If it makes any difference, I am using MetaPhlAn version 4.0.3 (24 Oct 2022).
Which command-line options would you recommend for my environmental sample with its relatively long average read length? Does it make sense to use --stat_q 0.1 --min_mapq_val -1 --bt2_ps very-sensitive-local?
Thank you for your time!
Best regards,
Bernhard