I am trying to figure out if I can get with humann3 descent info out from shallow shotgun metagenomics sequencing data of human stool samples, Novaseq paired-end 250 bp reads . For now it looks like I have high rates of unmapped reads.
I’m using humann v3.6, with MetaPhlAn version 4.0.3.
Command on the server to run concatenated kneaddata cleaned reads:
sbatch --ntasks=1 --cpus-per-task=5 -p shared-cpu -t 02:15:00 --mem=30000
humann --threads 5 --input cat_reads/MCS.fastq --output humann3_out_MCS/
The sequencing depth ranges between 43000 reads (neg controls) to 5M reads per pair of reads, so in the end up to 10M reads per sample. This is a pilot of 40 samples, where we aimed to have ~2M reads per sample. Median per sample in total is 4327477 reads currently. After removing human reads etc with kneaddata median per sample 3834510.
After running humann, taking into consideration only samples that had in total (both read pairs) at least 2M reads, I have an median of 63.2% unmapped in the gene families file, ranging from 36 to 94 %.
I was thinking that the high percentage of unmapped could be linked to low read numbers, but there seems to be no clear correlation:
When I blasted (blastx) some of the diamond unaligned reads, I had various results, ranging from some reads not mapping anywhere, but many mapping just partially - as my reads are rather long 250bp :
I was wondering if I might have too long reads?? Or is it still potentially a problem of not having enough depth (though I saw somewhere mentioned that even with 1M you could get some results…) ?
Any advice on how to increase the mapping rate? Any advice on analyzing shallow shotgun data with humann3 - maybe changing some parameters would make sense?
Thank you very much in advance!!
And have lovely holidays
I’ve been having same issues, but with metatranscriptomic mice data. Are there any advices on how to adjust tool parameters in order to get better mapping rate?
Hi Both - For shallow sequencing in particular, you may need to lower the coverage thresholds that HUMAnN uses by default to determine if a gene is present. For example, if a read hits a gene, that gene won’t be included in the output (and the read won’t count as mapped) unless 50% of the sites in that gene were hit by reads (default value). You can remove that filter by setting:
I have a feeling that this is the main issue driving the low mapping rate. Separately, HUMAnN assumes that reads should align fully within target genes, and this becomes harder with longer reads. However, I would not expect the effect here to be extreme with reads of 250 nt vs. our more typical 100-150 nt (it will matter more when working with very long reads, say 1 knt or more).
That said, if you also want to relax this filter, you would probably need to use a local alignment mode in bowtie2 (e.g. passing
--very-sensitive-local via the
--bowtie2-options flag) and tone down the following two parameters:
The above setting would require HALF of a read to align within a target gene to count as mapped. The default is 90% (~all of the read).
Hi @franzosa ,
Thank you so much for you advice! I have now tried the recommendations and here are the results.
With the below command the unmapped dropped from original median 63.6% to 55.5%:
humann --threads 10 --translated-subject-coverage-threshold 0.0 --nucleotide-subject-coverage-threshold 0.0 --input cat_reads/Neg1.fastq --output humann3_out_20230217_query2/ --nucleotide-database /home/users/p/parkr/.local/lib/python3.10/site-packages/humann/data/chocophlan --protein-database /home/users/p/parkr/.local/lib/python3.10/site-packages/humann/data/uniref
And with this one the unmapped dropped from original median 63.6% to 17.3%:
humann --threads 10 --translated-subject-coverage-threshold 0.0 --nucleotide-subject-coverage-threshold 0.0 --bowtie-options=“–very-sensitive-local” --nucleotide-query-coverage-threshold 50.0 --translated-query-coverage-threshold 50.0 --input cat_reads/Neg1.fastq --output humann3_out_20230217_query2/ --nucleotide-database /home/users/p/parkr/.local/lib/python3.10/site-packages/humann/data/chocophlan --protein-database /home/users/p/parkr/.local/lib/python3.10/site-packages/humann/data/uniref
So here the second modification with lowering the percentage of how much read needs to align within a target gene to count as mapped, had a much bigger effect. Can I conclude that the problem was then mainly that the reads are long? What percentage would you advise in the final analyses? To keep the 50% or would it be wiser to make it a bit higher to have reliable results?
Thank you so much in advance!!
Sorry I missed this reply. Yes, it does seem that lowering the % of the read that needed to be used in the alignment improved your mapping rate a lot. Depending on how long your reads are, 50% of a long read could still be a very long (and highly confident) alignment, so I wouldn’t worry about that value being too low.