I recently noticed that a bunch of my KO and PFAM tables had large numbers of zeros in them. This seems to be caused by the fact that I used humann2_regroup_table on a TSS-normalized genefamilies table, which seems to cause loss of a ton of information:
$ humann2_regroup_table -i output/humann2/main/C0200-3F-1A_S42_genefamilies.tsv -o ~/Desktop/test.tsv -g uniref90_pfam
#...
$ head ~/Desktop/test.tsv
# Gene Family C0200-3F-1A_S42_Abundance-RPKs
UNMAPPED 6126697.0
UNGROUPED 3035995.28
UNGROUPED|g__Adlercreutzia.s__Adlercreutzia_equolifaciens 816.084
UNGROUPED|g__Akkermansia.s__Akkermansia_muciniphila 1854.831
UNGROUPED|g__Alistipes.s__Alistipes_finegoldii 3530.86
UNGROUPED|g__Alistipes.s__Alistipes_senegalensis 9883.201
UNGROUPED|g__Alistipes.s__Alistipes_shahii 5081.175
UNGROUPED|g__Anaerostipes.s__Anaerostipes_hadrus 33344.871
UNGROUPED|g__Bacteroides.s__Bacteroides_dorei 47247.771
$ humann2_renorm_table -i ~/Desktop/test.tsv -o ~/Desktop/test_relab.tsv -u relab -p
# ...
$ head ~/Desktop/test_relab.tsv
# Gene Family C0200-3F-1A_S42_Abundance-RELAB
UNMAPPED 0.446235
UNGROUPED 0.221125
UNGROUPED|g__Adlercreutzia.s__Adlercreutzia_equolifaciens 5.94391e-05
UNGROUPED|g__Akkermansia.s__Akkermansia_muciniphila 0.000135096
UNGROUPED|g__Alistipes.s__Alistipes_finegoldii 0.000257168
UNGROUPED|g__Alistipes.s__Alistipes_senegalensis 0.000719838
UNGROUPED|g__Alistipes.s__Alistipes_shahii 0.000370085
UNGROUPED|g__Anaerostipes.s__Anaerostipes_hadrus 0.00242866
UNGROUPED|g__Bacteroides.s__Bacteroides_dorei 0.00344127
So far so good, but when I regroup from genefamilies_relab …
$ humann2_regroup_table -i output/humann2/main/C0200-3F-1A_S42_genefamilies_relab.tsv -o ~/Desktop/test2_relab.tsv -g uniref90_pfam
# ...
$ head ~/Desktop/test2_relab.tsv
# Gene Family C0200-3F-1A_S42_Abundance-RELAB
UNMAPPED 0.495
UNGROUPED 0.245
UNGROUPED|g__Adlercreutzia.s__Adlercreutzia_equolifaciens 0.0
UNGROUPED|g__Akkermansia.s__Akkermansia_muciniphila 0.0
UNGROUPED|g__Alistipes.s__Alistipes_finegoldii 0.0
UNGROUPED|g__Alistipes.s__Alistipes_senegalensis 0.001
UNGROUPED|g__Alistipes.s__Alistipes_shahii 0.0
UNGROUPED|g__Anaerostipes.s__Anaerostipes_hadrus 0.003
UNGROUPED|g__Bacteroides.s__Bacteroides_dorei 0.004
… everything is rounded to 3 decimal places.
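To make the effect concrete, here is a small Python sketch (not HUMAnN2 code, only an illustration) that rounds the full-precision values from test_relab.tsv above to 3 decimal places. The function name round_abundances is made up for this example, and I'm assuming plain rounding is what happens, which I haven't confirmed in the regroup script. The two tables are on slightly different scales (UNMAPPED is 0.446 in one and 0.495 in the other), so the nonzero values won't match test2_relab.tsv exactly, but the same four species end up at 0.0.

```python
# Full-precision relative abundances copied from test_relab.tsv above.
full_precision = {
    "Adlercreutzia_equolifaciens": 5.94391e-05,
    "Akkermansia_muciniphila": 0.000135096,
    "Alistipes_finegoldii": 0.000257168,
    "Alistipes_senegalensis": 0.000719838,
    "Alistipes_shahii": 0.000370085,
    "Anaerostipes_hadrus": 0.00242866,
    "Bacteroides_dorei": 0.00344127,
}


def round_abundances(values, digits=3):
    """Round every abundance to `digits` decimal places (assumed behaviour)."""
    return {name: round(value, digits) for name, value in values.items()}


rounded = round_abundances(full_precision)

# Anything below 0.0005 collapses to 0.0 once only 3 decimals are kept.
zeroed = [name for name, value in rounded.items() if value == 0.0]

for name, value in rounded.items():
    print(name, value)
print("zeroed:", len(zeroed), "of", len(rounded))
```

Relative abundances of individual stratified features are mostly far below 0.0005, so with 3 decimals nearly all of them would disappear.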
The first way (regroup the RPK table, then renormalize), none of the values are zero:
$ julia -e "count(line-> parse(Float64, split(line, '\t')[2]) == 0, readlines(\"/home/kevin/Desktop/test_relab.tsv\")[2:end]) |> println"
0
The second way (regroup the already-normalized table), tons are:
$ julia -e "count(line-> parse(Float64, split(line, '\t')[2]) == 0, readlines(\"/home/kevin/Desktop/test2_relab.tsv\")[2:end]) |> println"
50630
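For anyone without Julia, the same check can be sketched in Python. This is a rough equivalent of the one-liners above, not anything from HUMAnN2; count_zero_rows is a name I made up, and unlike the Julia version it skips lines that have no second column instead of erroring.

```python
def count_zero_rows(path):
    """Count data rows whose second tab-separated column parses to 0.0."""
    zeros = 0
    with open(path) as handle:
        next(handle, None)  # skip the header line, like [2:end] in the Julia version
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Skip blank or malformed lines that lack an abundance column.
            if len(fields) >= 2 and float(fields[1]) == 0.0:
                zeros += 1
    return zeros


# Example call (path taken from the commands above):
# print(count_zero_rows("/home/kevin/Desktop/test2_relab.tsv"))
```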
Not sure if it’s a precision error or some assumption made in the regrouping script, but I couldn’t find any mention of it in the documentation.