Hello! I wonder whether I understood the output provided by Humann3 correctly.
I’ve analysed a sequencing file containing 22 599 340 reads (pulled forward and reverse reads from pair-end data) using Humann3 and UniRef90 database.
As a result I’ve got genefamilies.tsv profile, where every gene family had a value represented in RPK (reads per kb) format.
Do I understand correctly, that if I multiply every RPK value to the length of the reference of corresponding gene family (in kb), sum up obtained values and add the value specified under “UNMAPPED” in genefamilies file, I should get my original number of reads from the fastq file? Because the value I am getting is 15 527 250, which is ~69% from the number of reads subjected for the analysis.
Thank you
That’s the right idea, but it is somewhat more subtle. A read doesn’t “see” the full length of the gene, it sees (gene_length - read_length + 1) positions where it could begin to align, so that’s the factor that the RPKs are adjusted by (on a per-gene and per-read basis). Hence it’s tricky to back-calculate from RPKs to counts perfectly. If you prefer counts, we have added a mode in HUMAnN 4 where you can directly output counts instead of RPKs. If you want to stick to the back-calculation approach, subtracting your average read length from the gene lengths should improve your estimate.
1 Like
Thank you very much for the explanation! However, when I do as you suggest, the total number of reads becoming even smaller, because the length of gene in kb becoming smaller per 0.15 kb (average read length) for each gene. Thus the discrepancy between the original number of read and the reconstructed one is getting larger. Are ambiguous alignment filtered out for example? if yes, would they be counted in the “UNMAPPED” field?
Talking about “UMAPPED” field, it was also not fully clear for me. In my case the number in “UNMAPPED” field is 4126011, which does not correspond to the number of reads recorded in the diamond_unaligned.fa file (6541903).
Thank you in advance for your response!