The bioBakery help forum

HUMANN2 / HUMANN2 ground truth values

Hey

Thanks for making available the synthetic FASTQ datasets used in the benchmarking, but I wonder if you could also make available the UniRef90 and UniRef50 gold-standard values (i.e. the calculated quantities / RPKs) that were used for comparison against the HUMANN 2/3 output?

I am having trouble recreating them from the synthetic FASTQs alone.

Thanks
Mick

Here is a TAR.GZ of all the gold standards from the paper:

The prefix of each file indicates the sample:

- “gutmain”: the main gut metagenome
- “gutmain50”: the same metagenome analyzed at the UniRef50 level
- “gutmtx”: the gut metatranscriptome
- “complex”: the 100-member even metagenome

The infix “pred_gsgf” means “predicted gold standard gene families”, with predictions coming from the source genomes’ average coverage values. The infix “samp_gsgf” refers to a sampled gold standard that accounts for random variation in read sampling. All of these values are in RPK units (i.e. they have not been normalized to CPMs yet, but you can do that with the humann2_renorm_table utility).
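In case it helps anyone reading along, here is a minimal sketch of what that CPM renormalization amounts to, assuming CPM here simply means each RPK value divided by the sample’s total RPK and scaled to one million (the feature IDs and numbers below are made up for illustration; humann2_renorm_table is the supported way to do this on real output tables):

    # Minimal sketch: convert one sample's RPK values to CPM.
    # Assumption: CPM = RPK / (sum of all RPKs in the sample) * 1e6.
    # The UniRef90 IDs and values below are invented for illustration only.
    rpk = {
        "UniRef90_A0A000": 150.0,
        "UniRef90_A0A001": 50.0,
        "UniRef90_B1B234": 300.0,
    }
    total = sum(rpk.values())
    cpm = {family: value / total * 1e6 for family, value in rpk.items()}
    for family, value in cpm.items():
        print(family, round(value, 1))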


Hello!

Once again, many thanks for this reply and these data!

As I mentioned, we are interested in producing “gold standard” datasets for other species, and I am looking at this passage from the Methods of the HUMANN2 paper:

We produced a gene family abundance gold standard by incrementing the
abundance of each gene family found in a genome by the product of the
genome’s coverage [in reads per kilobase (RPK) units] and the gene
family’s copy number

For the life of me I cannot understand what this means (I am sorry!) nor why it was done in this way.

Is there a simple worked example that provides more detail on (1) what exactly was done, and (2) why each step was performed and what it produces?

Thanks
Mick

Say you have genome A at 5x coverage with 1 X gene and 2 Y genes encoded.
You also have genome B at 2x coverage with 2 X genes and 1 Y gene encoded.

The community coverage of X = 1 * 5x coverage + 2 * 2x coverage = 9x coverage.
The community coverage of Y = 2 * 5x coverage + 1 * 2x coverage = 12x coverage.

We used that procedure, except in RPK units (reads per kilobase) as opposed to coverage (read-nucleotides per site). Note that this procedure doesn’t account for variation in coverage along the length of a genome, so it tends to overestimate some genes and underestimate others.
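To make that arithmetic concrete, here is a small Python sketch of the same bookkeeping, using the two-genome example above (the genome names, coverages and copy numbers are just the illustrative values from this thread; in the paper the per-genome value would be its coverage expressed in RPK rather than fold coverage):

    # Sketch of the gold-standard bookkeeping described above:
    # each genome contributes (its coverage) x (copy number of the family).
    genome_coverage = {"A": 5.0, "B": 2.0}   # fold coverage per genome
    copy_number = {                          # gene family copies per genome
        "A": {"X": 1, "Y": 2},
        "B": {"X": 2, "Y": 1},
    }

    community = {}
    for genome, coverage in genome_coverage.items():
        for family, copies in copy_number[genome].items():
            community[family] = community.get(family, 0.0) + coverage * copies

    print(community)  # {'X': 9.0, 'Y': 12.0}, matching the 9x and 12x above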


OK great!

As I understand it, HUMANN3 first uses MetaPhlAn to select genomes, and then aligns reads to genes from those genomes.

Unaligned reads are then searched against UniRef proteins.

So every time a read aligns to a gene/protein, either in nucleotide space or in protein space, you count that as “1x” coverage? That is, do you count the presence of the entire gene/protein once for each read that aligns to it?

I have a slight worry that the gene lengths include the stop codon and the protein lengths don’t, so in nucleotide terms the protein is 3 bp shorter than the gene for the same gene/protein… did you account for that?