Unable to extract unmapped reads from sam file

Hi,
We used MetaPhlAn version 4.0.4 to perform taxonomy profiling using the lastest vOct22 marker database. It worked well on the majority of our samples but there were 90% unclassified reads for 2 samples. It maybe because of the sample conditions but we were curious about the unclassified reads and wanted to investigate them separately.
I wanted to extract the unclassifed reads from the sam file created along with the bowtie2 output and Metaphlan profile using samtools but ran into errors below indicating I have a lot of duplicated sequence in my sam file:

Does this indicate something wrong in the alignment?

Do you have any other suggestions on how to extract unclassified reads from the input fastq?

Thank you.

Best,
Fangxi

Hi,
I looked at the sam file and I think the error occurred because there were duplicated @SQ header lines in the sam file, indicating many of the reference sequences have the same names.
I wonder how that happened as the maker genes should be unique in the database. Duplicates make the sam file not valid.

It seems that all the duplicated refs were prefixed with “VDB”. What are those? Can you please provide some insights on this?

Thank you so much.

Best,
Fangxi

Hi @Fangxi_Xu
The bowtie2 mapping in metaphlan is run with the --no-unal parameter, so the sam files do not contain the unaligned reads

Thank you so much for your reply Aitor!
I was thinking to extract the unaligned reads from the input fastq using sam but then I noticed that the unaligned reads had nothing to do with the “UNCLASSIFIED” category in the Metaphlan profile, is that right? Is that because the reads were mapped to marker gene sequences for each species and even though it aligned to certain markers, that may not be enough to call a species? Can you please confirm?
Thank you so much.

Best,
Fangxi

Hi @Fangxi_Xu
As MetaPhlAn only maps reads against a set of marker genes and not to full genomes, the UNCLASSIFIED category is calculated by estimating the number of reads that would be assigned to the detected clades (using the mean genome length of the species multiplied by the coverage of the markers). So the unclassified category would englobe both species that are still not present in our database as well as those species with not enough markers mapped