Invalid quantification from concatenating paired reads?

Thanks for creating HUMAnN2, it’s a fantastic tool

I would appreciated some insight on the rationale for ignoring paired read info and in particular concatenating paired reads. My concern is for the validity of the quantification. Am I correct in saying that this is the process used by HUMAnN2:

If both reads in a pair align to the same protein, then the protein family will receive a count for each read, and so 2 in total

If only one read in a pair aligns to a protein (and the other presumably would align to a non-coding region), then the protein family will receive a single count for the read

But quantitatively, there could be no difference in the abundance of the 2 families

It seems to avoid this, the best option at the moment would be to dump one of each read pair, rather than concatenate

However, it could be possible for HUMAnN2 to take into account paired read information without a strict requirement for both reads to align to the same protein. For example:

  1. Both reads in a pair map to same protein = 1 read count for the protein family
  2. One read in pair maps to a protein, the other maps to nothing = 1 read count for the protein family
  3. Reads in pair map to 2 different proteins. This could indicate misalignment (unlikely if the individual alignments are good) but also other situations, such as mapping of the reads to different domains of a multi-domain protein, or different transcripts in a polycistronic mRNA. In this case, each protein family could receive 1 read count

Any thoughts on this issue/suggestion are much appreciated

What you’re describing here is related to the difference between counting sequencing fragments that hit a gene (yielding FPK abundances) vs. sequencing reads that hit a gene (yielding RPK abundances). Both are reasonable; HUMAnN focuses on the latter. I could absolutely imagine a HUMAnN variant that uses your rule 1-3 logic to count in fragments rather than reads, the main challenge being that – to my knowledge – current translated search engines don’t implement paired alignment, so we’d need to do the checking ourselves after the fact. Not impossible, but also not very convenient.

To your initial point about counting single vs. paired reads hitting genes, this was something we evaluated. It happens that, as a gene gets longer, the likelihood of “absorbing” both reads in a pair gets higher, such that the abundance in RPK units remains unbiased.

In the analysis above, we sampled paired reads from a genome at constant coverage (i.e. constant RPK) and then computed the RPK-level abundance of genes within that genome, only counting reads that aligned fully within the gene. As desired, the genes have a constant RPK abundance ~250, with no bias for or against longer genes. We do observe more error in the abundances of shorter genes, but this makes sense as their abundances are based on smaller counts, and so adding or removing one read has a more dramatic effect on the estimated abundance.

All of this is to say that I think our current approach, while perhaps not perfect, does perform very well in practice.

Thanks very much for the informative explanation

So if I understand correctly, the rationale is that mapping one or both reads within the coding region is a fairly random process that affects all genes equally, and so shouldn’t introduce any biases

In which case I agree it makes sense to concatenate the reads into a single file to improve sensitivity

My only thought is: could there be a possibility for a negative bias for transcripts which have longer non-coding regions, which are therefore more likely to align one read in the non-coding region, and receive more counts of 1 rather than 2?

Many thanks for your help in understanding this

Hmmm… I’m not sure I understand your thought experiment? HUMAnN is specifically aligning to protein-coding sequences or protein sequences themselves, so non-coding lengths shouldn’t factor into the calculation. It’s possible to have a pair of reads where one of them maps to a protein-coding region while the other would map to an adjacent non-coding region (though HUMAnN would not find/consider the latter mapping). The main factor determining the number of reads/pairs/fragments recruited to a protein-coding sequence is the sequence’s alignable length, which is something we adjust for when computing RPK units.

Sorry for the confusing terminology. As you mention, I am referring to reads which hypothetically would align to (unavailable) non-coding regions upstream or downstream of the CDS.

I have no idea if it would be important or not, but if a transcript had a longer UTR portion, could it be more likely for a read to fall within this region, which would then not align within the CDS, and so would show a slightly reduced read count compared with a transcript with shorter UTRs, which is more likely to have both reads aligned within the CDS?

Of course the alignable length is more important, just playing devil’s advocate here to try to improve my understanding!

Thanks again for your thoughts on this

I see what you’re getting at. If you imagine a pathological case of a transcript A with no UTR, then you know an A fragment produces two reads that map to A’s coding region, whereas a transcript B with a UTR could produce a pair of reads that span the UTR and coding region. It’s possible this could have a weird effect in the small-counts limit (it would take an experiment to be sure).

However, if you picture both transcripts at constant coverage depth, then what we’re doing is estimating the coverage over the coding region by counting the number of reads that align completely within that region divided by its effective length. This should give the same value (agnostic to partially aligned reads or reads aligning outside the coding region).