Do genefamilies, pathabundance and pathcoverage work with split reads?

Hi there,

I am running HUMAnN on a number of fastq files from a public database. Each sample has R1, R2 and R3. For most of my samples, I concatenate R1, R2 and R3 to generate R4 and run HUMAnN on R4. For some of them, however, R1 and R2 are too big and HUMAnN takes a long time to run (my HPC job time limits do not extend that long, unfortunately). I was wondering whether the output would make sense if I split R1 and R2 into two equal halves (e.g. R1A and R1B, R2A and R2B), ran HUMAnN on each of them individually, and then combined the output files. I imagine that because the output is in RPKs, combining genefamilies should not be much of an issue, but I am more curious about the other two.

Thanks for your help in advance!

This sort of thing can work in principle. You are right that if you divide the file in half and compute gene RPKs separately, the RPKs can then be summed to get a total RPK value (since RPKs behave like sequencing coverage). The one drawback with this approach is that HUMAnN often uses sequence coverage to decide whether a sequence should be considered at all. It’s possible that a given sequence would fail to be sufficiently covered in either partial file but would be covered in the full file, thus making it a false negative under this approach.
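
As a rough illustration of the summing step, here is a minimal Python sketch. It is not part of HUMAnN: the function name `sum_rpk_tables` is made up for this example, and it assumes each gene families table is a two-column, tab-separated file (feature name, then RPK value) whose header line starts with `#`.

```python
import csv
from collections import defaultdict


def sum_rpk_tables(paths):
    """Sum per-feature RPK values across several two-column TSV tables.

    Hypothetical helper, not a HUMAnN function. Assumes column 1 is the
    feature name, column 2 is its RPK value, and header lines start with '#'.
    """
    totals = defaultdict(float)
    for path in paths:
        with open(path, newline="") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                # Skip blank lines and header/comment lines.
                if not row or row[0].startswith("#"):
                    continue
                feature, value = row[0], float(row[1])
                # A feature missing from one table simply contributes nothing there.
                totals[feature] += value
    return dict(totals)
```

With hypothetical file names, it could be called as `sum_rpk_tables(["sample_A_genefamilies.tsv", "sample_B_genefamilies.tsv"])`. Note that this only adds up what each partial run reported: a sequence that was dropped for insufficient coverage in both halves will be absent from the sum, which is the false negative described above. The sketch covers the gene family RPKs only and says nothing about the pathway tables.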