Questions about calculate_diversity.R script

agavriilidou · June 12, 2024, 11:57am

MetaPhlAn 4.1.0
mpa_vJun23_CHOCOPhlAnSGB_202307

Hi,
I wanted to make a phyloseq object using the mpa_table and the nwk tree after filtering out the tips of the tree that did not match with the taxa labels in the mpa_table and I encountered different issues.
I did in the past with an older version of the chocophlan database (2019) but I didnt manage to do it again with this version. One issue was that the tip labels of the tree were numbers that I did not figure out what they were representing. So I couldn’t match them and filter the table and the tree based on the labels.
To solve my issues, I looked into the Rscript provided as utility to calculate a- and b-diversity based on the mpa output table and I have a few questions.

In my mpa table I have 3,036 entries classified to species level and 1,567 entries classified to taxon.
Given that I understood correctly if the taxon separator is ‘t__’ the script removes the rest of the label. If the taxon separatory is ‘s__’ , it does the same.

if(opt$taxon_separator == "t__"){
  mpa_table[,1] <- gsub(".+\\|t__SGB", "", mpa_table[,1])
} else { 
  mpa_table[,1] <- gsub(".+\\|s__", "", mpa_table[,1])
}

Then the rest of the taxa names annotated up to species level are filtered out of the table at a later point in the Rscript and we end up with a filt_mpa_table that is used for the calculation of unifrac distances. We also filter out the tree and keep only the matching tip labels. Then calculate the distances.
If I do this I will miss 98 entries (annotated up to species level and not taxon level) that I won’t include in the diversity calculation. Can you recommend any ways to avoid this?

Another thing I couldn’t figure out is how the matching of the labels happens since this version of the tree has numbers on the tips while the clade names in the mpa tables are characters.
I tried to filter the mpa_table at the beginning and kept only entries annotated up to species and taxon level. Then removed the strings ‘s__’ and ‘t__SGB’ and ended up having characters in the rownames and was not able to match the tip labels of tree. Any ideas how to overcome this?
I noticed for the calculation of all the diversity metrics (alpha and beta) the Rscript used the mpa_table and not the filtered one (filt_mpa_table), meaning that all taxa levels are included and not only the ones at the species level. This includes duplicates based on the way the metaphlan output is structured.

Example below:

# Bray-Curtis
  if (opt$metric == "bray-curtis"){
    mat <- rbiom::beta.div(as.matrix(mpa_table), method="bray-curtis", weighted=TRUE)
  }
  
  # Jaccard
  if (opt$metric == "jaccard"){
    mat <- rbiom::beta.div(as.matrix(mpa_table), method="jaccard", weighted=FALSE)
  }

Only in the case of unifrac distances the filtered mpa table is used.

 if (opt$metric == "weighted-unifrac"){
      mat <- rbiom::beta.div(as.matrix(filt_mpa_table), tree=filt_tree, method="unifrac", weighted=TRUE)
      
    } else if (opt$metric == "unweighted-unifrac"){
      mat <- rbiom::beta.div(as.matrix(filt_mpa_table), tree=filt_tree, method="unifrac", weighted=FALSE)
    }

Is this the case? I might be missing something so I would highly appreciate your help.
Thank you in advance.

Best,
AG

Topic		Replies	Views
calculate_diversity.R Issues MetaPhlAn	3	39	December 16, 2024
Inquiry regarding MetaPhlAn SGBs phylogenetic tree MetaPhlAn	18	1403	July 4, 2024
Metaphlan4 Diversity Calculation Error MetaPhlAn	2	499	April 21, 2023
Conceptual Gap?: diversity metrics from metaphlan metagenomic output MetaPhlAn	3	2902	March 5, 2021
Need CHOCOPhlAnSGB_202212.nwk MetaPhlAn	6	850	May 17, 2023

Questions about calculate_diversity.R script

Related topics