The bioBakery help forum

Using HUMAnN2 data in WHAM!

Hi there,

I am interested in visualizing HUMAnN2 output in WHAM! (https://ruggleslab.shinyapps.io/wham_v1/). WHAM! requires a certain type of input data. The sample file contains three columns called “Acc,” “Feature,” and “Taxa” and lists the gene content as read counts.

By contrast, the HUMAnN2 output has a single column that combines Acc, Feature, and Taxa, and it gives gene content as RPK.

I am interested in converting my HUMAnN2 output file into a format that can be analyzed in WHAM! or in other programs that use read counts as input. My understanding is that I would need to multiply each value by the length of the corresponding gene in kilobases (RPK = reads per kilobase). Is there a simple way to find each gene’s length in kilobases so that this conversion can be done?

Also, I could probably write an R script to split the single column from the HUMAnN2 output into three columns, but is there an easier way to accomplish this?

Best,
Noah

If you can provide an example line of output formatted the way that you need it, I can suggest ways to generate it efficiently (it might be possible with a command-line trick; otherwise you would probably need to write a script, as you suggest).

Generally speaking, it is not possible to recover raw counts from HUMAnN. The abundances you see in the output have potentially already been summed over entities with different lengths (e.g., multiple sequences mapping to the same UniRef90), so it’s not as simple as multiplying by a single sequence length to recover counts. You could get something count-like by directly tallying the hits to database sequences in the pangenome search output and translated search output.

That said, if you’re just aiming for visualization (and not, say, a statistical model that requires integer counts), it would seem that the RPK values ought to still work for you?

Hi Franzosa,

Thank you for your reply. I attached a 300-row sample file provided by WHAM as well as a 209-row sample file for my own samples. (Each has been reduced in size to make uploading easier.) You can see that the 1st column in my samples essentially needs to be split into 3 columns to match the WHAM input format.

Regarding doing statistics on HUMAnN2 output, I used the “humann2_renorm_table” command to normalize RPK to counts per million (CPM). To make these data acceptable for ALDEx2, I simply multiplied the CPM values by 1 million. Is this an appropriate transformation?

Are there specific statistical approaches that are recommended for RPK data? I am most interested in running differential abundance and correlation testing using genes and pathways abundance data compared to metadata categories and continuous variables. It would also be nice to do exploratory analyses like those provided by WHAM!

Thank you for your help!
Noah

WHAM_sample_input_300_rows.tsv (54.4 KB)
my_samples_209_rows.tsv (19.5 KB)

It looks like the format is FUNCTION_ID<tab>FUNCTION_NAME<tab>species. So, if you’ve already renamed your HUMAnN output, you could do something like…

grep "|" ORIGINAL | grep ": " | perl -pe "s/(: |\|)/\t/g" > NEW

to reformat the rows. That won’t handle the headers or any unnamed, unstratified rows, though, so it might be better to write a small script to capture that logic.
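That small script could be sketched in Python. This is a hypothetical helper, assuming (from the grep/perl pattern above) that renamed, stratified rows look like `Acc: Feature|Taxa<TAB>values`:

```python
def split_feature(line):
    """Split a renamed, stratified HUMAnN2 row of the (assumed) form
    'Acc: Feature|Taxa<TAB>value1<TAB>...' into separate columns.

    Returns None for headers and unnamed or unstratified rows, which
    would need their own handling (e.g., kept as-is or dropped).
    """
    feature, *values = line.rstrip("\n").split("\t")
    if "|" not in feature or ": " not in feature:
        return None  # header, unnamed, or unstratified row
    id_and_name, taxon = feature.split("|", 1)
    acc, name = id_and_name.split(": ", 1)
    return [acc, name, taxon] + values
```

Applied line by line to the original table, this produces the three-column Acc/Feature/Taxa layout that WHAM! expects.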


What format is ALDEx2 expecting? If it wants features to sum to 1.0, you can either use the renorm script in relab mode or divide your CPMs by 1 million.
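For concreteness, dividing CPMs by 1 million is just total-sum scaling back to relative abundances; a tiny sketch with made-up numbers:

```python
# Hypothetical CPM values for one sample (CPMs sum to 1,000,000 by definition)
cpm = [250000.0, 500000.0, 150000.0, 100000.0]

# Dividing by 1e6 yields relative abundances that sum to 1.0,
# matching what the renorm script's relab mode would produce
relab = [x / 1e6 for x in cpm]
```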

We analyze HUMAnN outputs (normalized RPKs) using linear models as implemented in MaAsLin 2. There are no special tricks for RPKs per se, but we do log-transform the data during modeling to help with variance stabilization (using half the smallest non-zero feature value as a pseudocount). There’s also some merit to considering the zero and non-zero values separately, but doing that with a single model tends to be tricky.
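As a sketch of that transform (the idea, not MaAsLin 2’s actual implementation), the half-minimum pseudocount approach looks like:

```python
import math

def log_transform(values):
    """Log-transform one feature's abundances across samples, using half
    the smallest non-zero value as a pseudocount so zeros stay defined.
    A sketch of the idea, not the MaAsLin 2 implementation."""
    pseudocount = min(v for v in values if v > 0) / 2.0
    return [math.log(v + pseudocount) for v in values]
```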

Thanks for your help! I’ll give that command a try, and if it does not do the trick, I’ll write an R script.

ALDEx2 expects raw count data and applies a centered log ratio transformation. I thought that if I multiplied CPMs by 1 million, this would approximate counts.

I have looked into using MaAsLin 2 to analyze the data. However, it looks like this program expects a linear combination of explanatory variables, and I get different results if I analyze each variable separately. Statistically speaking, would it be appropriate to use MaAsLin 2 to run several separate bivariate analyses?

Otherwise, would it also be acceptable to simply use Wilcoxon rank-sum tests and Spearman correlations on RPKs?

You should be able to apply a CLR transform to RPK data; it’s an alternative to sum-normalization, but I don’t think it’s specifically tied to count data.
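A minimal CLR sketch (pure Python, positive values assumed; in practice zeros would need a pseudocount first):

```python
import math

def clr(values):
    """Centered log-ratio transform: log of each value divided by the
    geometric mean of the composition. Nothing here requires integer
    counts, so it applies equally to RPKs or relative abundances."""
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [lv - mean_log for lv in logs]
```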

It would be OK (though not ideal) to use MaAsLin 2 to run separate single-covariate models in place of a single multi-covariate model. The advantage of the latter is being able to tease out independent signals from the data, though that comes at the cost of reduced statistical power.

The sort of non-parametric tests you mentioned are also OK (I use them all the time), but you would want to normalize your RPKs for sequencing depth before applying them (otherwise your results could be confounded by differences in sequencing depth).
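That depth normalization can be as simple as total-sum scaling each sample before ranking; a sketch:

```python
def depth_normalize(sample_rpks):
    """Total-sum scaling for one sample: divide each feature's RPK by
    the sample's total, so more deeply sequenced samples don't dominate
    rank-based tests like Wilcoxon or Spearman."""
    total = sum(sample_rpks)
    return [v / total for v in sample_rpks]
```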