Unable to extract unmapped reads from sam file

Fangxi_Xu · May 1, 2023, 7:52pm

Hi,
We used MetaPhlAn version 4.0.4 to perform taxonomy profiling using the lastest vOct22 marker database. It worked well on the majority of our samples but there were 90% unclassified reads for 2 samples. It maybe because of the sample conditions but we were curious about the unclassified reads and wanted to investigate them separately.
I wanted to extract the unclassifed reads from the sam file created along with the bowtie2 output and Metaphlan profile using samtools but ran into errors below indicating I have a lot of duplicated sequence in my sam file:

Does this indicate something wrong in the alignment?

Do you have any other suggestions on how to extract unclassified reads from the input fastq?

Thank you.

Best,
Fangxi

Fangxi_Xu · May 2, 2023, 8:00pm

Hi,
I looked at the sam file and I think the error occurred because there were duplicated @SQ header lines in the sam file, indicating many of the reference sequences have the same names.
I wonder how that happened as the maker genes should be unique in the database. Duplicates make the sam file not valid.

It seems that all the duplicated refs were prefixed with “VDB”. What are those? Can you please provide some insights on this?

Thank you so much.

Best,
Fangxi

aitor.blancomiguez · May 9, 2023, 6:58am

Hi @Fangxi_Xu
The bowtie2 mapping in metaphlan is run with the --no-unal parameter, so the sam files do not contain the unaligned reads

Fangxi_Xu · May 9, 2023, 2:14pm

Thank you so much for your reply Aitor!
I was thinking to extract the unaligned reads from the input fastq using sam but then I noticed that the unaligned reads had nothing to do with the “UNCLASSIFIED” category in the Metaphlan profile, is that right? Is that because the reads were mapped to marker gene sequences for each species and even though it aligned to certain markers, that may not be enough to call a species? Can you please confirm?
Thank you so much.

Best,
Fangxi

aitor.blancomiguez · May 16, 2023, 9:46am

Hi @Fangxi_Xu
As MetaPhlAn only maps reads against a set of marker genes and not to full genomes, the UNCLASSIFIED category is calculated by estimating the number of reads that would be assigned to the detected clades (using the mean genome length of the species multiplied by the coverage of the markers). So the unclassified category would englobe both species that are still not present in our database as well as those species with not enough markers mapped

Topic		Replies	Views
Sam file & number of reads MetaPhlAn	1	681	April 12, 2021
Too many unclassified reads? MetaPhlAn	4	1161	May 29, 2023
MetaPhlAn 4.2.2 release (initial long-read sequencing support and database update) MetaPhlAn	2	616	June 11, 2025
Metaphlan marker level analysis MetaPhlAn	6	908	July 22, 2020
Large amount of unmapped reads metaphlan v4.0 soil community MetaPhlAn	2	105	June 18, 2024

Unable to extract unmapped reads from sam file

Related topics