I wonder if there way to get OTU table after all three stages of humann ( metaphlan, bowite, diamond )
I know how to get it after metaphlan stage (it called like $sample_metaphlan_bugs_list.txt), but couldn’t find one after whole pipeline. In my inderstanding it should be more correct and full since much more reads will be identified to end of pipilene ( comparing just to 1st metaphlan stage )
I would use the MetaPhlAn profiles produced by HUMAnN as the equivalent of an OTU table. The reads assigned to species-stratified functions in other parts of HUMAnN won’t be in units of species abundance. For example, if you were to sum over species in the HUMAnN gene families file, the output would favor species with larger genomes (contributing more genes).
Are you trying to get to a species x sample table of integer counts of reads assigned to each species? If so, the lines you pulled out from the log file are the closest we get to that from HUMAnN. That said, you could slice out those rows from each log and then merge them into a table to (e.g.) apply count-based statistics.
While it’s true that OTU tables are usually represented as integer counts, the concept of an OTU doesn’t really apply in HUMAnN since we are quantifying well-characterized species.
Yes, I know that I can do it by myself. I just expected some support from HUMAnN around this along with renormalization function (to RPK or CPM units), etc.
What does it mean? That’s not fit for HUMANN because I could lost species with small or medium abundances in my sample? Or I could lost species with big abundances but with bad mapping properties ( or smth like this) ? What the most dangerous issues could happen?
Sorry for the confusion. I just meant that OTUs (as a concept) arise as fuzzy taxonomic units in the field of amplicon sequencing. Whereas HUMAnN is focused on profiling characterized species from shotgun sequencing. Hence I was initially not sure what you meant by “OTU table” in the context of HUMAnN, but it sounds like what you actually want is a table of species read counts per sample. Both of these would be examples of “taxonomic profiles” or “count-based taxonomic profiles” more broadly.
Ohh… I got it ! Thanks for explanation and exuse me. I went from 16S analyzis and stolen standard terms…
Yes, from HUMAnN side I indeed need just such a table with species.
But as I understood there are no options in HUMAnN to generate such species list except parsing of log file.
Is it possible to renormalize table which I will construct from log files with HUMAnN tools? I would like transform “hits” to RPK or CPM units. And could you tell what units used by default for MetaPhlan ( I unfortunately didn’t find this information in Tutorial )
No worries. You’re right there is no “official” way to get this sort of table from HUMAnN, hence the hack with the log files. Once you have a sample x species read count table you could apply humann_renorm_table to convert to relative abundance or CPM.
That said, the power of extracting counts in this way is usually to directly work with the counts (e.g. using stats methods that are designed for count data, like DESeq2). If you sum-normalize total counts (to get to relative abundance) you will overestimate the abundance of species with bigger genomes. This is part of the problem MetaPhlAn is solving by only working with the subset of reads that aligns to a fixed set of marker genes per species.
You wouldn’t want to normalize by pangenome size because you don’t expect the entire pangenome to be present in any single sample. E. coli, for example, has a large pangenome compared to the size of a typical E. coli genome, so if you normalized E. coli hits to E. coli pangenome size you would underestimate E. coli abundance. You would need to do something like normalize the total hits to a species by the average isolate genome size for that species, but at that point you’re basically building a new taxonomic profiling method, and if that’s what you want I’d just use the MetaPhlAn profiles (since they already address these problems and more, e.g. handling non-unique genomic regions).
I don’t understand the problem. In my understanding for example there are species “A” and species “B”. And for example pangenome of A=300 Mbp, and pangenome B = 200 Mbp. It means that if we have equal amount of “A” and “B” in sample, the mapping proporiton “A/B” will be ~ 3/2 ( independently from genome size of “A” and “B” ) so, to get real relative counts - we need to divide bowtie hits “A” on 3 and bowtie hits “B” on 2.
As I mentioned early - I’m afraid of not high-pricision results of Metaphlan since of searching on only marker genes (which are very small part of whole genome )
Also I provided some initial tests on Metaphlan and seems to got not very good results - I wrote about this in topic Metaphlan target microbiomes
If A genomes are twice as big as B genomes, and A and B cells are present in equal number, then A will contribute twice as many reads as B, and the A pangenome will recruit twice as many reads as the B pangenome (regardless of the pangenome sizes). Thus in this scenario you would want to normalize the A and B read counts by 2 and 1 respectively to correctly estimate the cell fraction of the two species (which is usually the goal of a taxonomic profile).
Note that mapping to marker genes tends to be MORE precise than mapping to the whole (pan)genome for taxonomic profiling since it avoids errors from shared genes (for example, if reads from a third species C also map to the B genome you might overestimate B’s abundance by the approach above). The fact that MetaPhlAn is missing some of your species of interest in the other post is noteworthy though - I will let the MetaPhlAn developers address that post. I don’t think that mapping more reads would solve the problem (my guess is that some of those species just aren’t well represented in our databases).
I thought that genome and pangenome sizes are independent things isn’t it? Isn’t possbile that genome A twice as big as genome B but pangenomes are equal?
Anyway it seems that you confirmed my suggestion - just to divide counts by pangenome size, isn’t it?
That’s true, my only concern that if for example you marker gene size is only 100bp – you can never hit them on metaphlan stage. So I mean that for small marker genes result could be not very consistent
It is genome size that determines how many reads a species contributes to a metagenome per cell, not pangenome size, so that is what you want to normalize against in this scenario. Species A and B could have similar genome sizes (and contribute similar numbers of reads per cell), but B has a larger pangenome because B strains have fewer genes in common. The Wikipedia article on pangenomes has some great discussion of these ideas.
If we talking about proportion of reads in raw fastq file, than yes, it will be proportional to whole genome.
But I was talking particular about mapping to pangenomes at bowtie stage. After that stage it looks like correct normalization is to divide bowtie hits on pangenome size for that species. So again, I’m talking about normalization after bowtie, not normalization after diamond or normalization after 100% fastq correct mapping
Anyway it looks like our discussion is become off-topic, so probably would be better to create separate thread for this