Hi there,
I’ve run MetaPhlAn4 with the --ignore_eukaryotes --ignore_archaea -t rel_ab_w_read_stats options and got an absolute count table for the human gut microbiome.
For further analysis, I wonder if there is a well-defined quality control pipeline.
Do you recommend quality control on the raw read counts, such as removing samples or species with low read counts?
I guess the most common quality control for metagenomic data is filtering on minimum relative abundance and prevalence. However, do you recommend a specific threshold, for example keeping only species with a relative abundance of at least 0.0001 in more than 5% of the samples? If not, how can we choose one that fits our own data?
Finally, do you have any golden number for the number of species present in the gut microbiome?
Thanks in advance for your answers.
Hi! I’m just a fellow user, but I’ll weigh in anyway. In our experience, MetaPhlAn’s marker-based approach is pretty conservative. K-mer-based tools tend to have a long tail of rare bugs that may or may not be artifacts that need removal; MetaPhlAn doesn’t. I generally recommend that people trust the results out of the box. If your sample has insufficient read depth (which you can gauge to some extent from the coverage estimate you get when running with -t rel_ab_w_read_stats), you risk missing low-abundance organisms. However, the authors of MaAsLin 3 recommend accounting for this by including the read depth as a model feature, rather than making arbitrary filtering decisions beforehand.
There are no golden thresholds, but in our experience those thresholds are more relevant for things like 16S analysis or k-mer-based mgx profiles, both of which result in that long tail of low-abundance and/or spurious taxa. Everything will depend on the analyses you are trying to perform, but I’d start by taking the results as-is.
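If a particular downstream analysis does end up needing an abundance/prevalence filter, here is a minimal sketch of what that could look like in plain Python. Everything in it is an assumption for illustration: the function name filter_by_prevalence is made up, the toy numbers are invented, and the default thresholds are just the example values from the question, not recommendations. It also assumes the abundances and the threshold are in the same units (MetaPhlAn reports percentages, so rescale first if you think in fractions).

```python
def filter_by_prevalence(abundances, min_abundance=0.0001, min_prevalence=0.05):
    """Keep species reaching min_abundance in more than min_prevalence of samples.

    abundances: dict mapping species name -> list of relative abundances,
                one value per sample (hypothetical input layout).
    """
    kept = {}
    for species, values in abundances.items():
        if not values:
            continue  # no samples for this species, nothing to evaluate
        # Number of samples in which the species reaches the abundance threshold
        n_present = sum(1 for v in values if v >= min_abundance)
        prevalence = n_present / len(values)
        if prevalence > min_prevalence:
            kept[species] = values
    return kept


# Toy table with made-up numbers: 3 species x 4 samples
toy = {
    "s__Species_A": [0.30, 0.25, 0.40, 0.10],
    "s__Species_B": [0.0, 0.0, 0.0, 0.00005],
    "s__Species_C": [0.0, 0.002, 0.0, 0.0],
}

kept = filter_by_prevalence(toy)
print(sorted(kept))
```

Rerunning this with a few different thresholds and checking whether your conclusions change is one way to see how much the filter matters for your data.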
Good luck!