Hi
My team has been using MetaPhlAn 3.0 for a metagenomics project. Our bioinformatician did not use the --unclassified_estimation option, so the output does not include an estimate of the percentage of unclassified reads in each specimen.
Here is an example of the command that was run:
/local/projects-t3/MSL/pipelines/packages/envs/.YAMP/biobakery3-69f203c8fe3e5852a83df3c7cc5e8c46/bin/metaphlan --input_type fastq --tmp_dir=. --biom MG_401888_589.biom --bowtie2out=MG_401888_589_bt2out.txt --bowtie2db metaphlan_databases --bt2_ps very-sensitive --add_viruses --sample_id MG_401888_589 --nproc 4 MG_401888_589_QCd.fq.gz MG_401888_589_metaphlan_bugs_list.tsv
Our question is: how is the relative abundance calculated when --unclassified_estimation is used, and how is it calculated when it is not used?
The original MetaPhlAn paper says the following about how relative abundances and the fraction of unclassified reads are calculated: “Moreover, for all clades except the leaves of the taxonomic tree (species), the two estimations are compared and an ‘unclassified’ subclade is added when the clade-specific read count is larger than the sum of descendant counts, and the difference between the two estimations is assigned to the new descendant. Relative abundances are estimated weighting read counts assigned using the direct method with the total nucleotide size of all the markers in the clade, and normalizing by the sum of all directly-estimated weighted read counts.”
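To check that I am reading the quoted passage correctly, here is a minimal Python sketch of my interpretation of the last sentence. It assumes that "weighting" means dividing each clade's read count by the total nucleotide length of its markers, and that the result is then normalized to sum to 100%. The function name and the example numbers are made up for illustration; this is not MetaPhlAn's actual code.

```python
def relative_abundances(read_counts, marker_lengths):
    """Sketch of my reading of the quoted passage (an assumption, not MetaPhlAn code).

    read_counts:    dict of clade -> reads assigned to that clade's markers
    marker_lengths: dict of clade -> total nucleotide length of that clade's markers
    Returns a dict of clade -> relative abundance in percent.
    """
    # Weight each read count by the total marker length of the clade.
    weighted = {
        clade: read_counts[clade] / marker_lengths[clade]
        for clade in read_counts
    }
    total = sum(weighted.values())
    if total == 0:
        # Nothing was detected, so every clade gets 0%.
        return {clade: 0.0 for clade in weighted}
    # Normalize by the sum of all weighted counts.
    return {clade: 100.0 * w / total for clade, w in weighted.items()}


# Hypothetical example: clade A has three times the reads of clade B,
# but also three times the marker length, so both end up at 50%.
example = relative_abundances(
    {"A": 300, "B": 100},
    {"A": 30000, "B": 10000},
)
```

If this reading is right, the values always sum to 100% across the detected clades, which is why I am unsure where an unclassified fraction would fit in.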
The MetaPhlAn 4 paper describes how the fraction of unclassified reads is calculated (my reading of it is the formula below), but I could not find a description of how the relative abundances are calculated with or without this unclassified fraction included. Is there a description of this somewhere, or can some guidance be provided on how this is done?
Unclassified fraction = (Total reads - (sum over all SGBs of (avg. nonzero marker coverage for an SGB x avg. genome length for that SGB) / avg. read length)) / Total reads
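To make the formula above concrete, here is a minimal Python sketch of how I read it. The function name, argument names, and example numbers are hypothetical and chosen only for illustration; this is my interpretation of the paper's description, not code taken from MetaPhlAn.

```python
def unclassified_fraction(total_reads, avg_read_length, sgbs):
    """Sketch of the unclassified-fraction formula as I understand it.

    total_reads:     total number of reads in the sample
    avg_read_length: average read length in nucleotides
    sgbs:            iterable of (avg_nonzero_marker_coverage, avg_genome_length)
                     tuples, one per detected SGB
    """
    # Estimated reads explained by the detected SGBs:
    # sum of (coverage x genome length), divided by the average read length.
    classified_reads = sum(cov * genome_len for cov, genome_len in sgbs) / avg_read_length
    # Whatever is left over is treated as unclassified.
    return (total_reads - classified_reads) / total_reads


# Hypothetical sample: 1,000,000 reads of 100 nt and two detected SGBs.
frac = unclassified_fraction(
    total_reads=1_000_000,
    avg_read_length=100,
    sgbs=[(2.0, 5_000_000), (0.5, 4_000_000)],
)
# Estimated classified reads: (2.0 * 5e6 + 0.5 * 4e6) / 100 = 120,000
# Unclassified fraction: (1,000,000 - 120,000) / 1,000,000 = 0.88
```

With these made-up numbers, 88% of the reads would be reported as unclassified. What I cannot work out is whether the remaining relative abundances are then rescaled to sum to 12%.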
Also, if we use the options --unclassified_estimation --ignore_eukaryotes --ignore_archaea and leave out --add_viruses so that we only get bacteria, is the unclassified fraction calculated as the percentage of unclassified bacterial reads only?
Thank you and best wishes.