Differentiating between "kSGBs" and "uSGBs" in MetaPhlAn 4

Hi,

I ran MetaPhlAn 4 (version 4.0.1 (24 Aug 2022)) with the following command:
metaphlan SAMPLE_paired_1.fastq,SAMPLE_paired_2.fastq --bowtie2out SAMPLE_hostremoved.bowtie2.bz2 --nproc 5 --input_type fastq -t rel_ab_w_read_stats --unclassified_estimation --bowtie2db=/home/bmtalbo/miniconda3/envs/metaphlan4/lib/python3.7/site-packages/metaphlan/metaphlan_databases --index mpa_vOct22_CHOCOPhlAnSGB_202212 -o metaphlan4_output/$SAMPLE_profiled_metagenome.txt

I think it’s really interesting to have the SGBs identified in the output. However, I’m unclear about two things:

  1. How can you tell the difference between “kSGBs” and “uSGBs” in the output? As far as I can see, there is only “SGB”, sometimes preceded by “unclassified”. Is that what distinguishes them?

  2. How should I interpret the level of certainty for a lower taxonomic rank (i.e. a species) on lines where a higher taxonomic rank is reported as “unclassified”? I’m not interested in strain-level differences for my analysis, but I want to accurately identify all the unique species in the sample for alpha and beta diversity measures.

For example:
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Oscillospiraceae_unclassified|s__Oscillospiraceae_bacterium|t__SGB4247

The genus level is “unclassified”, and the species name seems like a generic call that is then used to assign an SGB strain. Would this line truly represent a unique species, or is the species-level tag redundant with the genus-level tag?

As a separate example of the same issue, I also have this kind of species classification:
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Oscillospiraceae_unclassified|s__Oscillospiraceae_unclassified_SGB15252|t__SGB15252

In this instance, I would likely consider the species level to be a “unique species” with an SGB tag, and would probably include it in my taxonomic diversity analysis at the species level.

Thanks for your help!

Dear @bmtalbot
Answering your questions:

  1. At the species level of the taxonomy (s__), you can distinguish kSGBs from uSGBs by whether the species name contains the “_SGB” string. E.g. s__Escherichia_coli → kSGB; s__Escherichia_SGB10061 → uSGB.
  2. In MetaPhlAn 4, the t__ level is not the strain level but the species-level genome bin (SGB) level, and thus you can treat the t__ level as species-level taxonomic units.
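
If it helps, the two rules above can be sketched in a few lines of Python. This is only an illustration, not part of MetaPhlAn: the function names (parse_lineage, sgb_kind, is_sgb_row) are made up for the example, and it assumes you pass in just the clade-name column of the profile, with ranks separated by "|" as in the lines quoted in the question.

```python
def parse_lineage(clade_name):
    """Split a clade string such as 'k__Bacteria|...|t__SGB4247' into {rank_prefix: name}."""
    ranks = {}
    for part in clade_name.strip().split("|"):
        prefix, sep, name = part.partition("__")
        if sep:  # skip anything that does not look like 'x__name'
            ranks[prefix] = name
    return ranks


def sgb_kind(clade_name):
    """Return 'uSGB' if the s__ name contains '_SGB', 'kSGB' if it does not,
    and None when the line has no species level at all."""
    species = parse_lineage(clade_name).get("s")
    if species is None:
        return None
    return "uSGB" if "_SGB" in species else "kSGB"


def is_sgb_row(clade_name):
    """True for rows that reach the t__ (SGB) level, i.e. species-level units."""
    return "t" in parse_lineage(clade_name)


# The two lines from the question, used here as example input.
examples = [
    "k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae"
    "|g__Oscillospiraceae_unclassified|s__Oscillospiraceae_bacterium|t__SGB4247",
    "k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae"
    "|g__Oscillospiraceae_unclassified|s__Oscillospiraceae_unclassified_SGB15252|t__SGB15252",
]

for clade in examples:
    print(sgb_kind(clade), is_sgb_row(clade), clade.split("|")[-1])
# kSGB True t__SGB4247
# uSGB True t__SGB15252
```

Under these assumptions, both example lines are t__ rows and would each count as one species-level unit in a diversity analysis; the first is a kSGB and the second a uSGB.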