Hi,
I’m just wondering how HUMAnN3 handles paired-end reads. I’m pretty new to metagenomics data analysis and was wondering if you can help.
Thanks,
a
I’ve written a bit about this here:
In short, we recommend concatenating your paired reads upstream of MetaPhlAn and HUMAnN.
Hi… Is it necessary to run kneaddata before running MetaPhlAn 3.0 even if I use the --ignore_eukarya option with the metaphlan command?
--ignore_eukarya will only ignore reads recruited to eukaryotic microbial marker genes; it will not ignore/exclude generic human contaminant DNA. That said, MetaPhlAn itself is fairly robust to un-removed human contamination, but downstream steps (e.g. HUMAnN’s translated search) might not be.
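If you do want to deplete host contamination first, a minimal KneadData sketch on the raw pairs would be something like the following (the database path and file names are placeholders here, and the exact input flags vary between KneadData versions, e.g. newer releases use --input1/--input2 instead of a repeated --input):
# quality trimming + human read depletion on a raw read pair
kneaddata --input sample_R1.fastq --input sample_R2.fastq --reference-db /path/to/human_genome_db --output sample_kneaddata_out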
In other words, should I just merge the two files directly? For example:
cat sample_R1.fq sample_R2.fq > merge_sample.fq
Thanks!
Correct, this is what we mean when we say to concatenate paired reads.
Hi Eric,
I just wanted to follow up on the questions above about how HUMAnN 3 handles paired-end reads. If we pass concatenated reads to HUMAnN 3 (e.g., concatenated reads = cat forward.fastq reverse.fastq), how does it treat them? Will it treat the forward and reverse reads as two separate reads, or will it consider them as a pair (e.g., will a pair of reads hitting the same gene be counted as 1 hit instead of 2 hits)?
Thank you so much!
It will treat the forward and reverse reads as if they were independent unpaired reads. This is obviously an approximation, but from my simulations it doesn’t introduce any obvious biases in the downstream gene abundances.
Hi all, could I merge the forward and reverse files with PEAR (Paired-End reAd mergeR), and then run HUMAnN 3 on the merged file?
For example,
pear -f test_1.fq -r test_2.fq -o test
Many thanks
This style of merging tries to overlap the paired reads into a new single read. This is only helpful if the single reads are long enough / sequencing fragment is short enough for the single reads to overlap. When that happens, it can be useful for applications like amplicon profiling since having a longer read gives more mapping specificity.
I don’t have experience using this procedure upstream of MetaPhlAn and HUMAnN. When we “merge” the reads for MetaPhlAn and HUMAnN we are really just concatenating them into one long file of single-end reads.
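For contrast, with the file names from your PEAR example, the concatenation-style “merge” we mean is just:
cat test_1.fq test_2.fq > test_cat.fq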
Hi Franzosa, thanks for your reply. I have run HUMAnN 3 on some demo samples using the “cat” command to concatenate the forward and reverse files. This worked successfully.
Thanks,
Jun
Great, glad to hear that worked!
I recognize from the later comments that merging the paired read files into one is unlikely to introduce significant bias. Would there be a benefit to including the unmatched reads, since HUMAnN isn’t looking at the mates anyway?
Would it make sense to merge just the forward reads from the pairs with both sets of unmatched reads, to try to cover more of the genome rather than double-counting regions captured by both mates of a pair?
Our usual procedure is to merge the still-paired reads with both sets of orphaned reads as input to our methods. The procedure of throwing out one half of the still-paired reads seems weird to me - I’m not convinced it would do much to unbias the results, but it WILL hurt your effective sequencing depth / coverage.
I have occasionally seen people do things like analyze all read1s (paired + orphaned) vs. all read2s as a form of replication, although those ought to be nearly identical.
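As a rough sketch of that split (file names here are hypothetical, following the usual KneadData paired/unmatched naming), it amounts to something like:
cat sample_kneaddata_paired_1.fastq sample_kneaddata_unmatched_1.fastq > sample_all_R1.fastq
cat sample_kneaddata_paired_2.fastq sample_kneaddata_unmatched_2.fastq > sample_all_R2.fastq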
I wonder if it is OK to use just one mate from each pair for downstream analysis instead of concatenating the reads.
Another question: what are the disadvantages of using the trimmed reads from KneadData vs. the paired reads? I understand that the trimmed reads come from Trimmomatic and the paired reads from Bowtie 2, but would that introduce any bias into the gene analysis?
The main issue with analyzing the two mates separately is that combining them can allow features to reach a coverage that either mate file alone wouldn’t. Hence you will be less sensitive to low-abundance features in the split-analysis approach.
I don’t follow the second part of the question? Removing low-quality nucleotides (trimmed reads) tends to help alignment since the low-quality part might not align, which can limit the alignment of the read itself. KneadData uses Bowtie 2 to deplete reads that map to a reference database (e.g. host genome). Either of these processes can remove pairs of reads or a single read in a pair, resulting in orphans. Hence a typical KneadData run results in a set of intact pairs + read1s and read2s that have been orphaned from their original pairs. Most of our tools don’t need the pairing information at all, but we maintain it for processes that care, e.g. assembly.
Thank you franzosa!
For clarification: the part “Most of our tools don’t need the pairing information at all, but we maintain it for processes that care” confuses me. Are the paired files where the pairing is maintained for assemblies, and should they not go into the other tools?
Which output should go into HUMAnN 3? After the KneadData run, I have these files:
kneaddata.trimmed.1.fastq kneaddata.trimmed.2.fastq
kneaddata.repeats.removed.1.fastq kneaddata.repeats.removed.2.fastq
kneaddata_paired_1.fastq kneaddata_paired_2.fastq
A tangential question to understand the workflow better: if the paired files have already had the non-pairing reads removed, then concatenating them will not make a difference or improve sensitivity, because if read1 didn’t align to a coding region, its pair will not align either. Is that correct? Am I missing something here?
Sorry I was not being very clear there. I should’ve said that KneadData maintains the pairing information during QC so we don’t lose track of it. The final output of KneadData should be four files:
$SAMPLE_kneaddata_paired_1.fastq
$SAMPLE_kneaddata_paired_2.fastq
$SAMPLE_kneaddata_unmatched_1.fastq
$SAMPLE_kneaddata_unmatched_2.fastq
The larger table in the KneadData tutorial helps to explain the other files. There may also be an output file where the four files above have been concatenated (or this might be something specific to the bioBakery workflows).
Either way, that concatenated file is the typical input to tools like MetaPhlAn and HUMAnN. Those tools use the paired reads, they just don’t consider the pairing information as they work (treating all reads as independent, which is an approximation that works well in practice). Conversely, methods like metagenome assembly use the pairing information since it provides valuable structural details about the source genomes.
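As a minimal sketch of that workflow (using $SAMPLE as a placeholder, as in the file list above, and assuming the default MetaPhlAn/HUMAnN databases are already installed):
# concatenate intact pairs + orphans into one single-end file
cat $SAMPLE_kneaddata_paired_1.fastq $SAMPLE_kneaddata_paired_2.fastq $SAMPLE_kneaddata_unmatched_1.fastq $SAMPLE_kneaddata_unmatched_2.fastq > $SAMPLE_kneaddata_cat.fastq
# taxonomic profiling with MetaPhlAn, then functional profiling with HUMAnN, both on the same concatenated reads
metaphlan $SAMPLE_kneaddata_cat.fastq --input_type fastq -o $SAMPLE_profile.txt
humann --input $SAMPLE_kneaddata_cat.fastq --output $SAMPLE_humann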
To your final question, it’s not necessarily the case that read1 aligning or failing to align to a coding region means that read2 will do the same. That’s because read1 might be inside the region while read2 is outside (or vice versa). This is part of the reason why we don’t enforce paired mapping in tools like MetaPhlAn that work with restricted gene catalogs.
Thank you franzosa!
I understand better now. In the MetaPhlAn tutorial there is a section that uses *1.fastq,*2.fastq as the input. Instead of doing that, do you suggest concatenating all four files above and using that as the input? Or, in the case of MetaPhlAn, would that be paired_1.fastq, paired_2.fastq, unmatched_1.fastq and unmatched_2.fastq?
Correct. The four post-QC files for a given sample (paired 1/2 and unmatched 1/2) are all concatenated as a single input for MetaPhlAn and HUMAnN.