Discrepancy between relative abundance and raw abundance files

Hello there,

I just ran the wmgx workflow (with default parameters), and the number of entries in “relative abundance” output files is systematically lower than in the “raw” abundance files:

wc output/humann/merged/*.tsv
9975 49877 588863 output/humann/merged/ecs.tsv
9914 49572 671235 output/humann/merged/ecs_relab.tsv
160259 801297 9380879 output/humann/merged/genefamilies.tsv
160258 801292 9105224 output/humann/merged/genefamilies_relab.tsv
789 7078 83367 output/humann/merged/pathabundance.tsv
757 6918 77632 output/humann/merged/pathabundance_relab.tsv

From my understanding, we should have the same number of lines in raw and relative abundance files. I suspect some sort of filtering, but could not find any info on that.

Could you please explain the reason for this discrepancy?

Thanks a lot!

My guess is that you are removing special features (UNMAPPED, UNGROUPED, etc.) during your normalization.
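
To illustrate what I mean, here is a minimal sketch on a made-up toy table. It is not the workflow's actual code: it only assumes that rows whose feature name starts with UNMAPPED, UNINTEGRATED or UNGROUPED are skipped, and that the remaining rows are divided by their column sum. The file names and values are hypothetical.

```shell
# Toy table (hypothetical values), tab-separated: feature name, one sample column.
demo=$(mktemp -d)
printf '# Pathway\tS1\nUNMAPPED\t50\nUNINTEGRATED\t30\nUNINTEGRATED|g__X.s__Y\t10\nPWY-A\t15\nPWY-B\t5\n' > "$demo/toy.tsv"

# Two passes over the same file: the first sums the non-special rows,
# the second prints each non-special row divided by that sum.
awk -F'\t' -v OFS='\t' '
  function special(name) { return name ~ /^(UNMAPPED|UNINTEGRATED|UNGROUPED)/ }
  FNR == 1  { if (NR != FNR) print; next }        # header: print once, on the second pass
  NR == FNR { if (!special($1)) total += $2; next }
  !special($1) { print $1, $2 / total }
' "$demo/toy.tsv" "$demo/toy.tsv" > "$demo/toy_relab.tsv"

# The normalized table has fewer lines than the raw one.
wc -l "$demo/toy.tsv" "$demo/toy_relab.tsv"
```

The raw toy table has 6 lines and the normalized one has 3, the same kind of gap as in your wc output. As far as I know, humann_renorm_table has a --special option (y/n) that controls whether these features are kept, but please check its --help output rather than take my word for it.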

Indeed, thanks for the quick and spot-on reply!

awk '{print $1}' tmp/humann/merged/pathabundance_relab.tsv > tmp/humann/merged/pathabundance_relab_names.list
grep -f tmp/humann/merged/pathabundance_relab_names.list -v tmp/humann/merged/pathabundance.tsv

UNMAPPED 29967.1437145868 115981.3770280872 102861.3404898724 100500.2299600781
UNINTEGRATED 42684.0053677411 107902.3717791706 110796.7696967103 129155.9278670251
UNINTEGRATED|g__Acidaminococcus.s__Acidaminococcus_intestini 0 5083.1125600959 0 0
UNINTEGRATED|g__Acidaminococcus.s__Acidaminococcus_intestini_CAG_325 0 2847.0181364410 0 0
UNINTEGRATED|g__Alistipes.s__Alistipes_putredinis 0 6579.9207693948 0 0
UNINTEGRATED|g__Alistipes.s__Alistipes_putredinis_CAG_67 0 1361.8073637478 0 0
UNINTEGRATED|g__Bacteroides.s__Bacteroides_caccae 0 1978.2316749417 0 0
UNINTEGRATED|g__Bacteroides.s__Bacteroides_cellulosilyticus 0 0 1622.6973434366 0
UNINTEGRATED|g__Bacteroides.s__Bacteroides_coprocola_CAG_162 1366.6531819421 0 0 0
UNINTEGRATED|g__Bacteroides.s__Bacteroides_dorei 0 2849.6180497356 26822.4802909961 5163.9951951313
UNINTEGRATED|g__Bacteroides.s__Bacteroides_eggerthii 0 0 0 10326.9369524682
UNINTEGRATED|g__Bacteroides.s__Bacteroides_faecis 0 0 1129.4210672802 0
UNINTEGRATED|g__Bacteroides.s__Bacteroides_fluxus 0 0 0 7654.4874354997
UNINTEGRATED|g__Bacteroides.s__Bacteroides_massiliensis 0 0 0 3754.9005810025
UNINTEGRATED|g__Bacteroides.s__Bacteroides_ovatus 0 1340.7765518493 0 6540.0046354173
UNINTEGRATED|g__Bacteroides.s__Bacteroides_plebeius 0 0 0 1995.6522102630
UNINTEGRATED|g__Bacteroides.s__Bacteroides_plebeius_CAG_211 0 0 0 2059.2582283209
UNINTEGRATED|g__Bacteroides.s__Bacteroides_thetaiotaomicron 0 0 0 655.8485785403
UNINTEGRATED|g__Bacteroides.s__Bacteroides_uniformis 0 6251.8385039334 11774.4850171448 12279.0012881862
UNINTEGRATED|g__Bacteroides.s__Bacteroides_uniformis_CAG_3 0 1134.2728263491 4359.1739231876 4395.3539403961
UNINTEGRATED|g__Bacteroides.s__Bacteroides_vulgatus 0 8920.3810815709 3297.2798884464 9724.9721189971
UNINTEGRATED|g__Bacteroides.s__Bacteroides_vulgatus_CAG_6 0 2725.4170554521 562.7256560792 3030.4270121272
UNINTEGRATED|g__Bacteroides.s__Bacteroides_xylanisolvens 0 3399.6100801369 1692.4641674226 3307.6268810915
UNINTEGRATED|g__Dialister.s__Dialister_invisus 0 2247.3598830230 0 0
UNINTEGRATED|g__Dialister.s__Dialister_invisus_CAG_218 0 4996.5301415372 0 0
UNINTEGRATED|g__Eubacterium.s__Eubacterium_eligens 0 1574.2077156635 0 0
UNINTEGRATED|g__Eubacterium.s__Eubacterium_eligens_CAG_72 0 1351.8872697552 0 0
UNINTEGRATED|g__Faecalibacterium.s__Faecalibacterium_prausnitzii 11138.7293161285 11796.5848588853 17861.9183857442 0
UNINTEGRATED|g__Lachnospira.s__Lachnospira_pectinoschiza 0 5085.6496177987 0 0
UNINTEGRATED|g__Prevotella.s__Prevotella_copri 12502.6280418438 0 0 0
UNINTEGRATED|g__Prevotella.s__Prevotella_copri_CAG_164 4266.5038170823 0 0 0
UNINTEGRATED|unclassified 11601.6028965991 22488.9431624300 24055.6542762881 17258.5426240774
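
A caveat for anyone reusing the awk/grep commands above: awk splits on whitespace by default, so feature names are cut at the first space, and grep -f treats each name as a regular expression and matches substrings. That was good enough here, but a stricter variant compares the full first column. A minimal sketch, assuming both tables are tab-separated with the feature name in column 1 (the toy files and their values are placeholders):

```shell
# Placeholder inputs (hypothetical values); in practice these would be the
# merged raw and relab tables.
work=$(mktemp -d)
printf '# Pathway\tS1\nUNMAPPED\t50\nPWY-1: some pathway\t15\nPWY-10: other pathway\t5\n' > "$work/raw.tsv"
printf '# Pathway\tS1\nPWY-1: some pathway\t0.75\nPWY-10: other pathway\t0.25\n' > "$work/relab.tsv"

# First file (relab): remember every feature name, using the whole first column.
# Second file (raw): print only the rows whose feature name was never seen.
awk -F'\t' 'NR == FNR { seen[$1] = 1; next } !($1 in seen)' \
    "$work/relab.tsv" "$work/raw.tsv" > "$work/only_in_raw.tsv"

cat "$work/only_in_raw.tsv"
```

On the toy data this prints only the UNMAPPED row. Because the comparison is an exact match on the whole name, a feature such as "PWY-1: some pathway" cannot be confused with "PWY-10: other pathway".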

May I suggest adding a sentence like this to the README of biobakery/humann on GitHub?
"Special" features (such as UNMAPPED) can be included or excluded in the normalization process (they are excluded by default in the wmgx workflow).