Questions about calculate_diversity.R script

MetaPhlAn 4.1.0
mpa_vJun23_CHOCOPhlAnSGB_202307

Hi,
I wanted to make a phyloseq object using the mpa_table and the nwk tree after filtering out the tips of the tree that did not match with the taxa labels in the mpa_table and I encountered different issues.
I did in the past with an older version of the chocophlan database (2019) but I didnt manage to do it again with this version. One issue was that the tip labels of the tree were numbers that I did not figure out what they were representing. So I couldn’t match them and filter the table and the tree based on the labels.
To solve my issues, I looked into the Rscript provided as utility to calculate a- and b-diversity based on the mpa output table and I have a few questions.

  1. In my mpa table I have 3,036 entries classified to species level and 1,567 entries classified to taxon.
    Given that I understood correctly if the taxon separator is ‘t__’ the script removes the rest of the label. If the taxon separatory is ‘s__’ , it does the same.
if(opt$taxon_separator == "t__"){
  mpa_table[,1] <- gsub(".+\\|t__SGB", "", mpa_table[,1])
} else { 
  mpa_table[,1] <- gsub(".+\\|s__", "", mpa_table[,1])
}

Then the rest of the taxa names annotated up to species level are filtered out of the table at a later point in the Rscript and we end up with a filt_mpa_table that is used for the calculation of unifrac distances. We also filter out the tree and keep only the matching tip labels. Then calculate the distances.
If I do this I will miss 98 entries (annotated up to species level and not taxon level) that I won’t include in the diversity calculation. Can you recommend any ways to avoid this?

  1. Another thing I couldn’t figure out is how the matching of the labels happens since this version of the tree has numbers on the tips while the clade names in the mpa tables are characters.
    I tried to filter the mpa_table at the beginning and kept only entries annotated up to species and taxon level. Then removed the strings ‘s__’ and ‘t__SGB’ and ended up having characters in the rownames and was not able to match the tip labels of tree. Any ideas how to overcome this?

  2. I noticed for the calculation of all the diversity metrics (alpha and beta) the Rscript used the mpa_table and not the filtered one (filt_mpa_table), meaning that all taxa levels are included and not only the ones at the species level. This includes duplicates based on the way the metaphlan output is structured.

Example below:

# Bray-Curtis
  if (opt$metric == "bray-curtis"){
    mat <- rbiom::beta.div(as.matrix(mpa_table), method="bray-curtis", weighted=TRUE)
  }
  
  # Jaccard
  if (opt$metric == "jaccard"){
    mat <- rbiom::beta.div(as.matrix(mpa_table), method="jaccard", weighted=FALSE)
  }

Only in the case of unifrac distances the filtered mpa table is used.

 if (opt$metric == "weighted-unifrac"){
      mat <- rbiom::beta.div(as.matrix(filt_mpa_table), tree=filt_tree, method="unifrac", weighted=TRUE)
      
    } else if (opt$metric == "unweighted-unifrac"){
      mat <- rbiom::beta.div(as.matrix(filt_mpa_table), tree=filt_tree, method="unifrac", weighted=FALSE)
    } 

Is this the case? I might be missing something so I would highly appreciate your help.
Thank you in advance.

Best,
AG