Loss of information when doing renorm then regroup

I recently noticed that many of my KO and PFAM tables contain large numbers of zeros. The cause seems to be that I ran humann2_regroup_table on a TSS-normalized genefamilies table, which loses a lot of information. Regrouping the unnormalized table first and renormalizing afterwards looks fine:

$ humann2_regroup_table -i output/humann2/main/C0200-3F-1A_S42_genefamilies.tsv -o ~/Desktop/test.tsv -g uniref90_pfam
#...

$ head ~/Desktop/test.tsv
# Gene Family	C0200-3F-1A_S42_Abundance-RPKs
UNMAPPED	6126697.0
UNGROUPED	3035995.28
UNGROUPED|g__Adlercreutzia.s__Adlercreutzia_equolifaciens	816.084
UNGROUPED|g__Akkermansia.s__Akkermansia_muciniphila	1854.831
UNGROUPED|g__Alistipes.s__Alistipes_finegoldii	3530.86
UNGROUPED|g__Alistipes.s__Alistipes_senegalensis	9883.201
UNGROUPED|g__Alistipes.s__Alistipes_shahii	5081.175
UNGROUPED|g__Anaerostipes.s__Anaerostipes_hadrus	33344.871
UNGROUPED|g__Bacteroides.s__Bacteroides_dorei	47247.771

$ humann2_renorm_table -i ~/Desktop/test.tsv -o ~/Desktop/test_relab.tsv -u relab -p
# ...
$ head ~/Desktop/test_relab.tsv
# Gene Family	C0200-3F-1A_S42_Abundance-RELAB
UNMAPPED	0.446235
UNGROUPED	0.221125
UNGROUPED|g__Adlercreutzia.s__Adlercreutzia_equolifaciens	5.94391e-05
UNGROUPED|g__Akkermansia.s__Akkermansia_muciniphila	0.000135096
UNGROUPED|g__Alistipes.s__Alistipes_finegoldii	0.000257168
UNGROUPED|g__Alistipes.s__Alistipes_senegalensis	0.000719838
UNGROUPED|g__Alistipes.s__Alistipes_shahii	0.000370085
UNGROUPED|g__Anaerostipes.s__Anaerostipes_hadrus	0.00242866
UNGROUPED|g__Bacteroides.s__Bacteroides_dorei	0.00344127

So far so good, but when I regroup from genefamilies_relab…

$ humann2_regroup_table -i output/humann2/main/C0200-3F-1A_S42_genefamilies_relab.tsv -o ~/Desktop/test2_relab.tsv -g uniref90_pfam
# ...

$ head ~/Desktop/test2_relab.tsv
# Gene Family	C0200-3F-1A_S42_Abundance-RELAB
UNMAPPED	0.495
UNGROUPED	0.245
UNGROUPED|g__Adlercreutzia.s__Adlercreutzia_equolifaciens	0.0
UNGROUPED|g__Akkermansia.s__Akkermansia_muciniphila	0.0
UNGROUPED|g__Alistipes.s__Alistipes_finegoldii	0.0
UNGROUPED|g__Alistipes.s__Alistipes_senegalensis	0.001
UNGROUPED|g__Alistipes.s__Alistipes_shahii	0.0
UNGROUPED|g__Anaerostipes.s__Anaerostipes_hadrus	0.003
UNGROUPED|g__Bacteroides.s__Bacteroides_dorei	0.004

… everything is rounded to 3 decimal places.

With the first approach (regroup, then renorm), none of the values are zero:

$ julia -e "count(line-> parse(Float64, split(line, '\t')[2]) == 0, readlines(\"/home/kevin/Desktop/test_relab.tsv\")[2:end]) |> println"
0

With the second approach (renorm, then regroup), tons are:

$ julia -e "count(line-> parse(Float64, split(line, '\t')[2]) == 0, readlines(\"/home/kevin/Desktop/test2_relab.tsv\")[2:end]) |> println"
50630

Not sure if it’s a precision error or some assumption made in the regrouping script, but I couldn’t find any mention of it in the documentation.

Looks like I added a --precision flag to this script that defaults to 3, probably assuming people were working with RPK/CPM units. Opening that up ought to rescue a bunch of the lost rare groups. I’ll make a note to change the default to NOT round to avoid this in the future.
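
To illustrate why 3-decimal rounding hurts relative abundances in particular: any value below 0.0005 rounds to 0.0. The following is a minimal Python sketch of that effect, not the regroup script's actual code. It uses the unrounded values from test_relab.tsv quoted above, and the names `relab`, `rounded` and `lost` are made up for the illustration.

```python
# Relative abundances copied from the unrounded test_relab.tsv output above.
relab = {
    "UNMAPPED": 0.446235,
    "UNGROUPED": 0.221125,
    "UNGROUPED|g__Adlercreutzia.s__Adlercreutzia_equolifaciens": 5.94391e-05,
    "UNGROUPED|g__Akkermansia.s__Akkermansia_muciniphila": 0.000135096,
    "UNGROUPED|g__Alistipes.s__Alistipes_finegoldii": 0.000257168,
    "UNGROUPED|g__Alistipes.s__Alistipes_senegalensis": 0.000719838,
    "UNGROUPED|g__Alistipes.s__Alistipes_shahii": 0.000370085,
    "UNGROUPED|g__Anaerostipes.s__Anaerostipes_hadrus": 0.00242866,
    "UNGROUPED|g__Bacteroides.s__Bacteroides_dorei": 0.00344127,
}

PRECISION = 3  # the default number of decimals mentioned above

# Round every value the way a fixed 3-decimal output would.
rounded = {name: round(value, PRECISION) for name, value in relab.items()}

# Features that were nonzero before rounding but are exactly zero afterwards.
lost = [name for name, value in rounded.items() if value == 0 and relab[name] > 0]

for name in relab:
    print(f"{name}\t{relab[name]}\t{rounded[name]}")
print(f"{len(lost)} of {len(relab)} nonzero values became 0.0")
```

With RPK or CPM units most values are far larger than 0.0005, so the same rounding would rarely produce zeros there.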


I see that UNMAPPED and UNGROUPED have a high relative abundance. Is that normal?

# Gene Family	C0200-3F-1A_S42_Abundance-RELAB
UNMAPPED	0.495
UNGROUPED	0.245

Thanks~

Having about a 50/50 mix of known/unknown reads is very typical. The ungrouped fraction depends on how rare the annotation you are grouping by is. For example, most proteins have a pfam domain, so grouping by pfam leaves a small ungrouped fraction. Conversely, only ~10% of proteins have an EC annotation, so grouping by EC leaves a larger ungrouped fraction.

Have you changed the default? I am running HUMAnN3 and still getting the same level of precision (3 decimals). As a result, I am getting tons of zeros. I have checked, and humann_renorm_table has no optional argument that changes the precision.

When I used uniref90_pfam for regrouping, I found 0.2-0.3 UNGROUPED and on average 0.5 UNMAPPED (I used kneaddata only for decontamination). Is this normal?

Thanks,
DC7

Hello - Yes, the default change from 3 decimals to None will be included in the next software release. For your current runs use the "--precision" option in the humann_regroup_table script (as I think this script and not the renorm table is where the precision changes occur). Sorry for any confusion about the precision settings.

Thank you,
Lauren
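
To confirm that a higher precision setting has rescued the rare groups, the zero count from earlier in the thread can be repeated on the new output. Below is a minimal Python sketch of that check. It assumes a HUMAnN-style TSV with one header line and the abundance in the second column. `count_zero_rows` is a hypothetical helper, not part of HUMAnN, and the demo table only reuses the rounded rows quoted above.

```python
import os
import tempfile


def count_zero_rows(path):
    """Count data rows whose second tab-separated column parses to 0."""
    zeros = 0
    with open(path) as handle:
        next(handle, None)  # skip the "# Gene Family ..." header line
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2 and float(fields[1]) == 0:
                zeros += 1
    return zeros


# Small demo table built from the rounded rows quoted earlier in the thread.
demo_rows = [
    "# Gene Family\tC0200-3F-1A_S42_Abundance-RELAB",
    "UNMAPPED\t0.495",
    "UNGROUPED\t0.245",
    "UNGROUPED|g__Adlercreutzia.s__Adlercreutzia_equolifaciens\t0.0",
    "UNGROUPED|g__Akkermansia.s__Akkermansia_muciniphila\t0.0",
    "UNGROUPED|g__Alistipes.s__Alistipes_finegoldii\t0.0",
    "UNGROUPED|g__Alistipes.s__Alistipes_senegalensis\t0.001",
    "UNGROUPED|g__Alistipes.s__Alistipes_shahii\t0.0",
    "UNGROUPED|g__Anaerostipes.s__Anaerostipes_hadrus\t0.003",
    "UNGROUPED|g__Bacteroides.s__Bacteroides_dorei\t0.004",
]

# Write the demo table to a temporary file so the helper reads a real path.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write("\n".join(demo_rows) + "\n")
    demo_path = tmp.name

demo_zeros = count_zero_rows(demo_path)
os.remove(demo_path)
print(demo_zeros)
```

Pointing `count_zero_rows` at a table regrouped with more decimals should give a much smaller count if rounding was the only cause of the zeros.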
