Hello fellow bioBakers! 
Apologies if this isn’t the right category for my question — please let me know if I should move it elsewhere.
I’ve recently started working on metatranscriptomics of the gut microbiota, and I’m currently using HUMAnN v4.0.0.alpha.1 together with MetaPhlAn v4.1.1 (11 Mar 2024) .
I’m benchmarking several microbial community profiling tools using synthetic communities as ground truth — meaning I know both the identities and abundances of the species present.
My goal is to assess each tool’s performance in terms of:
- Detection accuracy: how well the tool detects species that are truly present.
- Quantification fidelity: how closely the predicted abundance distributions match the ground truth.
For quantification fidelity, I’m using MetaPhlAn’s -t rel_ab_w_read_stats and HUMAnN 4’s new --count-normalization "Counts".
Now I’d like to compare these Counts with the known species abundances (in reads) from my synthetic datasets.
Since I’m providing HUMAnN 4 with a single concatenated FASTQ file containing both forward and reverse reads, my first thought was to divide the resulting Counts by two. However, I suspect this might not be the right approach because:
-
HUMAnN counts reads, not fragments. Concatenating R1 and R2 simply doubles the input read list; the tool doesn’t know which reads belong to the same fragment.
-
Paired reads aren’t always both informative. Sometimes only one mate maps, or they map to different targets. Dividing by two could undercount valid contributions.
-
The “two-reads-per-fragment” assumption rarely holds. You’ll get a mix of 2, 1, or 0 mapped reads — a uniform halving might introduce bias.
Moreover, normalization methods (e.g., CPM or relative abundance) might already correct for library size, so manual halving could be unnecessary.
Should I instead rely on humann_renorm_table for normalization (CPM or relative abundance), or is there a better way to compare Counts with the ground truth abundances?
Any clarification or suggestions would be greatly appreciated!
Best,
Fran 
If you have expected (paired) read counts for a list of species, and HUMAnN output in read counts per gene family, you ought to be able to add up the HUMAnN counts by species and then compare those with your expectation. For example, sum all the rows of the form Gene_X|Species_Y by species to get a total Species_Y read count.
If you then plot the observed vs. expected counts they ought to follow a nice linear relationship with a slope slightly below 1 (maybe ~0.8?). That’s due in part to the fact that HUMAnN won’t see 1) reads that spanned protein-coding genes, 2) reads from non-protein-coding genes, and 3) reads from intergenic regions. Hence we expect it to undercount species’ total reads.
1 Like
Thank you @franzosa for such a quick reply! 
Indeed, I’m already summing all the rows of the form Gene_X|Species_Y by species to get a total Species_Y read count.
However, since these results are based on the (single end) sample with double the number of reads of my original (paired-end) ground truth, should I divide by two these total Species_Y read counts?
Otherwise, HUMAnN 4 results would be based on a “duplicated” replica of my original sample, wouldn’t they? I could end up with HUMAnN 4 quantifying more (single end) reads than my original (paired-end) reads from the (synthetic) ground truth 
I guess I’m confused about the units that your gold standard is in, and maybe the structure of the gold standard itself. If the ground truth is actually in units of reads, then you would compare with the HUMAnN counts and expect the sort of trend I described above. If the gold standard is reported in fragments, then you’d want to multiply the gold standard counts by two before comparing with the HUMAnN result. If it’s in some other abundance units you could still do the comparison, but you would no longer be expecting a y = x trend.
As context, when we do simulations here, we select a set of genomes and assign them target coverage values. These coverage values are directly proportional to the genomes’ perceived relative abundances. We then sample sequencing fragments from each genome up to the target coverage, and then each of the fragments gets converted into a pair of sequencing reads. Hence at the end of the day we know the target relative abundance, coverage, fragment count, and read count for each genome.