The bioBakery help forum

Paired-end files HUMAnN2


I am new to shotgun metagenomics and I want to use your bioBakery toolbox, starting with KneadData, MetaPhlAn2 and HUMAnN2, to automate the analysis of shotgun data. I am unsure about the best way to handle paired-end sequencing data.

Correct me if I’m wrong, but from what I understand, there is no benefit in providing paired-end files, as MetaPhlAn2 and HUMAnN2 will basically treat them as if they were two single-end files.

With that in mind, I am thinking about concatenating the forward and reverse files into a single file before performing any analysis. That way, I would have the same workflow whether my data is single-end or paired-end, and it would be technically easier to handle (no paired.1, paired.2, single.1, single.2 files to deal with). Is there any drawback to this approach?

Some other questions (related to the answer to my previous question):

  • When dealing with overlapping paired-end reads, do you merge them at the beginning of the process?
  • Is there any bioBakery tool that uses the paired-end info?

KneadData is the only tool that uses end-pairing information (optionally). If you’re working with host-contaminated reads, we can use the end-pairing information when aligning the reads back to the host genome.

The tools that align to isolated gene sequences (including MetaPhlAn and HUMAnN) do not consider end-pairing information, and so we concatenate our first- and second-end reads into a single input file for those.
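As a minimal sketch of that concatenation step (the `concat_reads` helper and the file names in the usage comment are hypothetical, not part of any bioBakery tool), the following Python does the same thing as a plain `cat`: it appends the second-end file to the first-end file byte for byte, without merging or reordering reads. It assumes all inputs use the same compression; gzip files joined this way still form a valid gzip stream.

```python
import shutil
from pathlib import Path


def concat_reads(inputs, output):
    """Append each input file to `output`, byte for byte, like `cat`.

    Hypothetical helper for illustration only. Reads are not merged,
    paired, or reordered; the files are simply written one after another.
    """
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                # Stream in chunks so large FASTQ files are not loaded into memory.
                shutil.copyfileobj(src, out)
    return Path(output)


# Example usage (hypothetical file names):
# concat_reads(["sample_R1.fastq", "sample_R2.fastq"], "sample_concat.fastq")
```

The resulting single file can then be passed to the downstream tool the same way a single-end file would be.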

Hi franzosa,

I am wondering what the point of merging paired-end sequences is if HUMAnN2 (or 3) doesn’t consider the end-pairing information. Is it to provide more coverage?

Also, there are several merging programs available (e.g. bbmerge, ngmerge, vsearch). If you had to recommend one, what would be your choice? Or do you think a simple cat would serve better in this context? Thank you so much!

Definitely cat - the goal is to convert the paired reads to a single input file, not to combine potentially overlapping reads.

In our experience, end-pairing information is very informative when aligning to longer sequences (contigs, genomes) when you expect both reads from a fragment to hit the same target in close proximity. When aligning to individual genes (as in most of our tools), it’s common for one read to align to a given gene while its mate overhangs that gene (aligning elsewhere or not at all). We’ve found it more straightforward to just align the reads separately rather than to check for and enforce concordant alignment in the fraction of cases where it would’ve been possible.
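A rough way to see why concordant alignment is often not possible for gene-sized targets is the toy calculation below. It is only an illustrative sketch under stated assumptions, not how any bioBakery tool works: the function name, the gene, fragment and read lengths are all made up, and a read counts as aligning only if it lies entirely within the gene (real aligners can also report partial alignments). For every fragment start position, it checks whether one or both mates fit inside the gene.

```python
def concordant_fraction(gene_len, fragment_len, read_len):
    """Among fragment placements where at least one mate fits fully inside
    the gene, return the fraction where both mates fit.

    Toy model: the gene spans [0, gene_len); a fragment starting at `start`
    has read 1 at its left end and read 2 at its right end.
    """
    either = 0
    both = 0
    for start in range(-fragment_len, gene_len + 1):
        # Read 1 covers [start, start + read_len).
        r1_inside = 0 <= start and start + read_len <= gene_len
        # Read 2 covers [start + fragment_len - read_len, start + fragment_len).
        r2_start = start + fragment_len - read_len
        r2_inside = 0 <= r2_start and start + fragment_len <= gene_len
        if r1_inside or r2_inside:
            either += 1
        if r1_inside and r2_inside:
            both += 1
    return both / either if either else 0.0


# Assumed example values: 900 bp gene, 400 bp fragments, 150 bp reads.
print(concordant_fraction(900, 400, 150))
# A gene shorter than the fragment can never contain both mates.
print(concordant_fraction(300, 400, 150))
```

With these assumed numbers, only about half of the placements that hit the gene at all could yield a concordant pair, and none could for a gene shorter than the fragment.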

This totally makes sense to me. Thank you for answering!