I am interested in visualizing HUMANn2 output in WHAM! (https://ruggleslab.shinyapps.io/wham_v1/). WHAM! requires a certain type of input data. The sample file contains three columns called “Acc,” “Feature,” and “Taxa” and lists the gene content as read counts.
By contrast, the HUMANn2 output has one column that combines Acc, Feature, and Taxa and gives gene content as RPK.
I am interested in converting my HUMANn2 output file into a format that can be analyzed in WHAM! or other programs that use read counts as input. My understanding is that I would need to multiply each number by the kilobases in each gene (RPK = reads per kilobase). Is there a simple way to find out the kilobases in each gene so that this conversion can be done?
Also, I can probably write an R script to split up the one column from HUMANn2 output into three columns, but is there any easier way to accomplish this?
If you can provide an example line of output formatted in the way that you need I can suggest ways to generate it efficiently (it might be possible with a command-line trick, otherwise one would probably need to write a script, as you suggest).
Generally speaking it is not possible to get raw counts from HUMAnN. The abundances you see in the output have potentially already been summed over entities with different lengths (e.g. multiple sequences mapping to the same UniRef90), so it’s not as simple as multiplying by a single sequence length to recover counts. One could get something count-like by directly counting up the hits to database sequences in the pangenome search output and translated search output.
That said, if you’re just aiming for visualization (and not, say, a statistical model that requires integer counts), it would seem that the RPK values ought to still work for you?
Thank you for your reply. I attached a 300-row sample file provided by WHAM as well as a 209-row sample file for my own samples. (Each has been reduced in size to make uploading easier.) You can see that the 1st column in my samples essentially needs to be split into 3 columns to match the WHAM input format.
Regarding doing statistics on HUMAnN2, I used the “humann2_renorm_table” command to normalize RPK to counts per million (cpm). In order to make this data acceptable for ALDEx2, I simply multiplied these counts by 1 million. Is this an appropriate transformation?
Are there specific statistical approaches that are recommended for RPK data? I am most interested in running differential abundance and correlation testing using genes and pathways abundance data compared to metadata categories and continuous variables. It would also be nice to do exploratory analyses like those provided by WHAM!
Thank you for your help!
WHAM_sample_input_300_rows.tsv (54.4 KB)
my_samples_209_rows.tsv (19.5 KB)
It looks like the format is
FUNCTION_ID<tab>FUNCTION_NAME<tab>species. So, if you’ve already renamed your HUMAnN output, you could do something like…
grep "|" ORIGINAL | grep ": " | perl -pe "s/(: |\|)/\t/g" > NEW
To reformat the rows. That won’t handle the headers or any unnamed, unstratified rows though, so it might be better to just write a small script to capture that logic.
What format is ALDEx2 expecting? If it wants features to sum to 1.0 you can either use the renorm script in
relab mode (or) divide your CPMs by 1 million.
We analyze HUMAnN outputs (normalized RPKs) using linear models as implemented in MaAsLin 2. There are no special tricks for RPKs per se, but we do log-transform the data during modeling to help with variance stabilization (using half the smallest non-zero feature value as a pseudocount). There’s also some merit to considering the zero and non-zero values separately, but doing that with a single model tends to be tricky.
Thanks for your help! I’ll give that command a try, and if it does not do the trick, I’ll write an R script.
ALDEx2 expects raw count data and applies a centered log ratio transformation. I thought that if I multiplied CPMs by 1 million, this would approximate counts.
I have looked into using MaAsLin 2 to analyze the data. However, it looks like this program expects linear combinations of explanatory variables, and I get different results if I analyze each variable separately. Statistically speaking, would it be appropriate to use MaAsLin 2 to do several separate bivariate analyses?
Otherwise, would it also be acceptable to simply use Wilcoxon rank-sum tests and Spearman correlations on RPKs?
You should be able to apply a CLR transform to RPK data; it’s an alternative to sum-normalization, but I don’t think it’s specifically tied to count data.
It would be ok (though not ideal) to use MaAsLin 2.0 to run separate single-covariate models in place of a single multi-covariate model. The advantage of the latter is being able to tease out independent signals from the data. That said, doing so comes at the cost of reduced statistical power.
The sort of non-parametric tests you mentioned are also OK (I use them all the time), but you would want to normalize your RPKs for sequencing depth before applying them (otherwise your results could be confounded by differences in sequencing depth).
I found a script on GitHub that is linked on the WHAM! webpage that reformats HUMANn2 output to work with WHAM!
I want to analyze HUMAnN output with WHAM! Can you please tell me how should I normalize my data to make it compatible for WHAM? Should I convert it into
relative abundance out of 1? I cannot use RPKs as my samples are of different sequencing depths.
There is a python script that converts HUMANn2 output files to WHAM! format. Try the link indicated by the red arrow in the WHAM! home page. I think cpm or relative abundance would both be fine.
Thanks a lot Noah!!! It will really help me.
Actually, I was confused as:
- i found in the “Sample Input File” (the image you sent), they used data with RPKs unit and,
- in your post you have mentioned that “ALDEx2 expects raw count data”.
So, are you saying considering these two points I still can use CPM or relative abundance?
I think WHAM! takes care of the conversion for you. For example, I uploaded one of my files that was normalized to cpm, and I received the following message (see image). I have not tried using proportions (relative abundance out of 1), but I am guessing WHAM! would treat them the same way. I don’t know if this is the type of data ALDEx2 was designed for, as from my understanding it wants raw count data. However, Franzosa mentioned above that the CLR transformation should work for RPK, so maybe it would for CPM too.