Thanks for making the synthetic FASTQ datasets available that were used in the benchmarking, but I wonder if you could also make available the Uniref90 and Uniref50 gold standard values (i.e. the calculated quantities / RPKs) that were used as gold standard to compare to the HUMANN2 / 3 output?
I am having trouble recreating them from the synthetic fastqs alone.
Here is a TAR.GZ of all the gold standards from the paper:
The prefix of each file indicates the sample: “gutmain” for the main gut metagenome, “gutmain50” for the same metagenome analyzed at the UniRef50 level, “gutmtx” for the gut metatranscriptome, and “complex” for the 100-member even metagenome. The infix “pred_gsgf” means “predicted gold standard gene families”, with predictions coming from the source genomes’ average coverage values. The infix “samp_gsgf” refers to a sampled gold standard that accounts for random variation in read sampling. All of these are in RPK units (i.e. they have not been normalized to CPMs yet, but you can do that with the humann2_renorm_table utility).
Once again, many thanks for this reply and these data!
As I mentioned, we are interested in producing “gold standard” datasets for other species, and I am looking at this from the Methods of the HUMANN2 paper:
We produced a gene family abundance gold standard by incrementing the
abundance of each gene family found in a genome by the product of the
genome’s coverage [in reads per kilobase (RPK) units] and the gene
family’s copy number
For the life of me I cannot understand what this means (I am sorry!) nor why it was done in this way.
Is there a simple worked example that provides more detail on (1) what exactly was done, and (2) why each step was performed and what it produces?
Say you have genome A at 5x coverage with 1 X gene and 2 Y genes encoded.
You also have genome B at 2x coverage with 2 X genes and 1 Y gene encoded.
The community coverage of X = 1 * 5x coverage + 2 * 2x coverage = 9x coverage.
The community coverage of Y = 2 * 5x coverage + 1 * 2x coverage = 12x coverage.
We used that procedure except in RPK units (reads per kilobase) as opposed to coverage (read-nucleotides per site). Note that this procedure doesn’t account for variation in coverage along the length of a genome, which tends to overestimate some genes and underestimate others.
As I understand it, HUMANN3 first uses MetaPhlAn to select genomes, and then aligns reads to genes from those genomes.
Unaligned reads are then searched against UniRef proteins
So every time a read aligns to a gene/protein, either in nucleotide space or in protein space, you count that as “1x” coverage? i.e. you count the presence of the entire gene/protein once for each read that aligns to it?
I have a slight worry that the gene lengths include the stop codon and the protein lengths don’t, and so the protein lengths are 3bp shorter than the gene lengths for the same gene/protein… did you account for that?
Your description of the algorithm is close (in the first step we are aligning to pangenomes – i.e. bags of representative genes – rather than whole genomes).
The algorithm’s counting procedure is different from the procedure I described above for building a gold standard, however. When we see a read hit a gene, assuming the read has only one valid hit, we increment the gene’s abundance by 1 / (effective length in kb) to yield RPK units. If the genome is evenly covered, then after considering all reads this procedure will have produced the RPK values computed in the gold standard (assuming everything else is working ideally).
We do account for slight differences in gene length when normalizing. The “effective length” I noted above is (actual length - alignment length + 1), which gives the number of sites where the alignment could’ve started. This is the proper length to normalize against when aligning shorter sequences inside of longer sequences (here, reads into genes). The calculation accounts for the fact that stop codons are included in DNA-level pangenes but not AA-level UniProt proteins.
Could you comment on this a bit further? The second description is not all that clear to me. Also, for evaluating sensitivity, specificity, and accuracy which of these infix data sets were used?
Also, this accounts for 3 of the 5 gold standards: DNA gut, complex, RNA gut. Is there also this gold standard data for new species and new isolate synthetic microbiomes? Thank you so much for the thorough generation of these microbiomes btw. I think they are very useful for benchmarking across classifiers.
Predicted = if a species was expected to be at 5x coverage (based on the sample read depth and its desired relative abundance), then it contributes 5x expected coverage for all of the functions it encodes at single copy (or 10x for duplicated functions, and so forth).
Sampled = actually quantify the coverage of each function individually based on how reads were sampled. For example, two functions in the same genome could receive different coverage values by random read sampling. This can have a big impact on sensitivity in weakly covered genomes (if a read was never drawn from gene X, then you can’t quantify gene X, even if its source species had non-zero coverage).
We use the former (predicted) approach for calculating accuracy stats in all the main-text figures since it gives a LESS rosy view of performance. There is a figure in the supplement where we quantify how much of a hit to optimal performance is explained by not accounting for (typically) unknowable variation in read sampling. It was about 0.1 units of Bray-Curtis divergence in that example.
Here’s a link to an archive with the additional gold standards:
These were constructed at the COG level rather than UniRef90 level for the purposes of comparing with non-HUMAnN methods (as outlined in the paper). new_choco refers to the community of new isolate genomes for species that were known to HUMAnN, while non_choco refers to the community of new isolate genomes from species unknown to HUMAnN. There are three gold standards for each community: one inferred directly from IMG annotations, another inferred from UniRef90 annotations, and a third inferred from UniRef50 annotations.