Low mapping rates, HUMAnN, shallow shotgun

Hi,
I am trying to figure out whether I can get decent information out of HUMAnN 3 from shallow shotgun metagenomic sequencing data of human stool samples (NovaSeq paired-end 250 bp reads). For now it looks like I have high rates of unmapped reads.
I'm using HUMAnN v3.6 with MetaPhlAn v4.0.3.
Command on the server to run HUMAnN on the concatenated, kneaddata-cleaned reads:

sbatch --ntasks=1 --cpus-per-task=5 -p shared-cpu -t 02:15:00 --mem=30000 \
humann --threads 5 --input cat_reads/MCS.fastq --output humann3_out_MCS/ \
    --nucleotide-database .local/lib/python3.10/site-packages/humann/data/chocophlan \
    --protein-database .local/lib/python3.10/site-packages/humann/data/uniref
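For reference, the concatenation step was roughly the following - HUMAnN takes a single FASTQ input, so I pool the kneaddata-cleaned mates per sample (the kneaddata output file names below are illustrative, not my exact files):

# Pool forward and reverse kneaddata-cleaned reads into one FASTQ for HUMAnN;
# file names are hypothetical examples for a single sample:
cat kneaddata_out/MCS_R1_kneaddata_paired_1.fastq \
    kneaddata_out/MCS_R1_kneaddata_paired_2.fastq \
    > cat_reads/MCS.fastq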

The sequencing depth ranges from 43,000 reads (negative controls) to 5M reads per mate, so up to 10M reads per sample counting both read pairs. This is a pilot of 40 samples, where we aimed for ~2M reads per sample. The current median total per sample is 4,327,477 reads; after removing human reads etc. with kneaddata, the median per sample is 3,834,510.
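In case anyone wants to reproduce such counts, something like this works for plain (uncompressed) FASTQ - the file name is just an example:

# Count reads in a FASTQ file (four lines per record):
awk 'END { print NR/4 }' cat_reads/MCS.fastq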

After running HUMAnN, and considering only samples with at least 2M reads in total (both read pairs), I have a median of 63.2% unmapped reads in the gene families file, ranging from 36% to 94%.
I thought the high percentage of unmapped reads could be linked to low read numbers, but there seems to be no clear correlation:
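For reference, I pulled the unmapped fraction out of the gene families output roughly like this (file names are mine; humann_renorm_table ships with HUMAnN):

# Renormalize RPKs to relative abundances, then inspect the UNMAPPED row:
humann_renorm_table --input humann3_out_MCS/MCS_genefamilies.tsv \
    --output MCS_genefamilies_relab.tsv --units relab
grep '^UNMAPPED' MCS_genefamilies_relab.tsv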
[image: percent unmapped vs. total read count per sample]

When I BLASTed (blastx) some of the DIAMOND-unaligned reads, I got mixed results: some reads did not map anywhere, but many mapped only partially, presumably because my reads are rather long (250 bp).
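The spot-check itself was along these lines - the temp-folder file name pattern is my best guess at HUMAnN's naming, and I only took a handful of reads since the search runs remotely against nr:

# Grab ~20 of the DIAMOND-unaligned reads from HUMAnN's temp folder
# (file name pattern is an assumption) and blastx them against nr remotely:
head -n 40 humann3_out_MCS/MCS_humann_temp/MCS_diamond_unaligned.fa > subset.fa
blastx -query subset.fa -db nr -remote -outfmt 6 -max_target_seqs 5 -out subset_blastx.tsv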

I was wondering whether my reads might be too long? Or is it still potentially a problem of not having enough depth (though I saw it mentioned somewhere that you can get some results even with 1M reads…)?

Any advice on how to increase the mapping rate? Or on analyzing shallow shotgun data with HUMAnN 3 in general - maybe changing some parameters would make sense?

Thank you very much in advance!!
And have lovely holidays! 🙂
Rahel

Hello,

I've been having the same issues, but with mouse metatranscriptomic data. Is there any advice on how to adjust tool parameters in order to get a better mapping rate?

Best,
Matija


Hi both - For shallow sequencing in particular, you may need to lower the coverage thresholds that HUMAnN uses by default to determine whether a gene is present. For example, if a read hits a gene, that gene won't be included in the output (and the read won't count as mapped) unless 50% of the sites in that gene were hit by reads (the default value). You can remove that filter by setting:

--translated-subject-coverage-threshold 0.0
--nucleotide-subject-coverage-threshold 0.0
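With your command above, that would look something like:

humann --threads 5 --input cat_reads/MCS.fastq --output humann3_out_MCS/ \
    --translated-subject-coverage-threshold 0.0 \
    --nucleotide-subject-coverage-threshold 0.0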

I have a feeling that this is the main issue driving the low mapping rate. Separately, HUMAnN assumes that reads should align fully within target genes, and this becomes harder with longer reads. However, I would not expect the effect here to be extreme with reads of 250 nt vs. our more typical 100-150 nt (it will matter more when working with very long reads, say ~1,000 nt or more).

That said, if you also want to relax this filter, you would probably need to use a local alignment mode in bowtie2 (e.g. passing --very-sensitive-local via the --bowtie2-options flag) and tone down the following two parameters:

--nucleotide-query-coverage-threshold 50.0
--translated-query-coverage-threshold 50.0

The above setting would require HALF of a read to align within a target gene to count as mapped. The default is 90% (~all of the read).
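Putting both relaxations together, a sketch of the full command (please treat this as a starting point to experiment with, not a blanket recommendation) would be:

humann --threads 5 --input cat_reads/MCS.fastq --output humann3_out_MCS/ \
    --translated-subject-coverage-threshold 0.0 \
    --nucleotide-subject-coverage-threshold 0.0 \
    --bowtie2-options "--very-sensitive-local" \
    --nucleotide-query-coverage-threshold 50.0 \
    --translated-query-coverage-threshold 50.0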