How does using --unclassified_estimation affect relative abundances of the classified portion of reads

Hi

My team has been using MetaPhlAn 3.0 for a metagenomics project. Our bioinformatician did not use the --unclassified_estimation option, so the output does not include an estimate of the percentage of unclassified reads in a specimen.

Here is an example of the script that was run.


/local/projects-t3/MSL/pipelines/packages/envs/.YAMP/biobakery3-69f203c8fe3e5852a83df3c7cc5e8c46/bin/metaphlan --input_type fastq --tmp_dir=. --biom MG_401888_589.biom --bowtie2out=MG_401888_589_bt2out.txt --bowtie2db metaphlan_databases --bt2_ps very-sensitive --add_viruses --sample_id MG_401888_589 --nproc 4 MG_401888_589_QCd.fq.gz MG_401888_589_metaphlan_bugs_list.tsv

Our question is: how is the relative abundance calculated when --unclassified_estimation is used, and how is it calculated when it is not used?

The original Metaphlan paper has this to say about how relative abundances and the fraction of unclassified reads are calculated: “Moreover, for all clades except the leaves of the taxonomic tree (species), the two estimations are compared and an “unclassified” subclade is added when the clade-specific read count is larger than the sum of descendant counts, and the difference between the two estimations is assigned to the new descendant. Relative abundances are estimated weighting read counts assigned using the direct method with the total nucleotide size of all the markers in the clade, and normalizing by the sum of all directly-estimated weighted read counts.”

The MetaPhlAn 4 paper describes how the fraction of unclassified reads is calculated (my reading of it is given below), but I could not find a description of how the relative abundances are calculated with or without this unclassified fraction included. Is there a description of this somewhere? Or can some guidance be provided on how this is done?

Unclassified fraction = (total reads - (sum over all SGBs of (avg. nonzero marker coverage of the SGB x avg. genome length of the SGB) / avg. read length)) / total reads
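
To make the formula above concrete, here is a minimal Python sketch of that calculation as I read it. It is not MetaPhlAn's code: the function name, the argument names, and the example numbers are all hypothetical, and clamping the result at zero is my own assumption for the case where the estimated classified reads exceed the total.

```python
def unclassified_fraction(total_reads, avg_read_length, sgbs):
    """Estimate the unclassified read fraction from per-SGB summaries.

    sgbs: iterable of (avg_nonzero_marker_coverage, avg_genome_length) pairs.
    All names here are illustrative, not MetaPhlAn identifiers.
    """
    # Estimated reads per SGB = coverage x genome length / read length
    estimated_classified_reads = sum(
        coverage * genome_length / avg_read_length
        for coverage, genome_length in sgbs
    )
    fraction = (total_reads - estimated_classified_reads) / total_reads
    # Assumption: never report a negative unclassified fraction
    return max(0.0, fraction)


# Hypothetical sample: 1,000,000 reads of 100 bp and two detected SGBs
example = unclassified_fraction(
    total_reads=1_000_000,
    avg_read_length=100,
    sgbs=[(5.0, 4_000_000), (2.0, 5_000_000)],
)
print(f"Estimated unclassified fraction: {example:.2f}")
```

With these made-up numbers the two SGBs account for an estimated 300,000 reads, so about 70% of the reads would be reported as unclassified.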

Also, if we use the options --unclassified_estimation --ignore_eukaryotes --ignore_archaea and leave out --add_viruses so that we only get bacteria, is the unclassified fraction calculated as the % of unclassified bacterial reads only?

Thank you and best wishes.

Hi @ethankgough
Answering your questions:
Is there a description of this somewhere? Or can some guidance be provided on how this is done?
The relative abundance is calculated from the coverage of the species-specific marker genes. Then, using the median genome length of the species and the average coverage of the species' markers, MetaPhlAn estimates the number of reads mapping to the species. Finally, if the flag is specified, the relative abundance estimation accounts for the estimated unclassified fraction of the reads.
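
As a rough illustration of the two modes, here is a minimal Python sketch. It assumes that without the flag the marker coverages are normalized to sum to 100%, and that with the flag the same values are scaled by the classified share (1 minus the unclassified fraction) and an UNCLASSIFIED entry is added. That scaling is an assumption made for illustration, not a statement of the exact implementation, and the function name, taxon names, and numbers are hypothetical.

```python
def relative_abundances(coverage, unclassified_fraction=None):
    """Return relative abundances (%) from per-taxon marker coverage.

    coverage: dict mapping taxon name -> average marker coverage.
    unclassified_fraction: None mimics running without the flag;
        a value in [0, 1] mimics running with --unclassified_estimation.
    """
    total_coverage = sum(coverage.values())
    # Assumption: classified taxa share whatever is left after the
    # unclassified fraction is set aside
    classified_share = (
        1.0 if unclassified_fraction is None else 1.0 - unclassified_fraction
    )
    result = {
        taxon: 100.0 * cov / total_coverage * classified_share
        for taxon, cov in coverage.items()
    }
    if unclassified_fraction is not None:
        result["UNCLASSIFIED"] = 100.0 * unclassified_fraction
    return result


# Hypothetical coverages for two species
cov = {"species_A": 6.0, "species_B": 2.0}
without_flag = relative_abundances(cov)
with_flag = relative_abundances(cov, unclassified_fraction=0.6)
print(without_flag)
print(with_flag)
```

Under this assumption the ratios between classified taxa are identical in both modes (species_A stays three times as abundant as species_B). Only the absolute percentages shrink, so that together with the UNCLASSIFIED entry they sum to 100%.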

is the unclassified fraction calculated as the % of unclassified bacterial reads only?
Yes, it will use only the reads mapping against bacteria and consider the rest as unclassified.