Paired-end files HUMAnN2

Hello,

I am new to shotgun metagenomics and I want to use your bioBakery toolbox, starting with KneadData, MetaPhlAn2 and HUMAnN2, to automate the analysis of shotgun data. I’m wondering about the best way to handle paired-end sequencing data.

Correct me if I’m wrong, but from what I understand, there is no benefit in providing paired-end files, as MetaPhlAn2 and HUMAnN2 will basically treat them as if they were two single-end files.

With that in mind, I am thinking about concatenating the forward and reverse files into a single file before performing any analysis. That way, I would have the same workflow whether my data is single-end or paired-end, and it would be technically easier to handle (no paired.1, paired.2, single.1, single.2 files to deal with). Is there any drawback to this approach?

Some other questions (related to the yes/no answer to my previous question):

  • When dealing with overlapping paired-end reads, do you merge them at the beginning of the process?
  • Is there any bioBakery tool that uses the paired-end info?

KneadData is the only tool that uses end-pairing information (optionally). If you’re working with host-contaminated reads, we can use the end-pairing information when aligning the reads back to the host genome.

The tools that align to isolated gene sequences (including MetaPhlAn and HUMAnN) do not consider end-pairing information, and so we concatenate our first- and second-end reads into a single input file for those.


Hi franzosa,

I am wondering what the point of merging paired-end sequences is if HUMAnN2 (or 3) doesn’t consider the end-pairing information. Is it to provide more coverage?

Also, there are several merging programs available (e.g. bbmerge, ngmerge, vsearch). If you had to recommend one, what would be your choice? Or do you think a simple cat would serve better in this context? Thank you so much!

Definitely cat - the goal is to convert the paired reads to a single input file, not to combine potentially overlapping reads.

In our experience, end-pairing information is very informative when aligning to longer sequences (contigs, genomes) when you expect both reads from a fragment to hit the same target in close proximity. When aligning to individual genes (as in most of our tools), it’s common for one read to align to a given gene while its mate overhangs that gene (aligning elsewhere or not at all). We’ve found it more straightforward to just align the reads separately rather than to check for and enforce concordant alignment in the fraction of cases where it would’ve been possible.
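For readers who prefer to script this step, here is a minimal Python sketch of what cat does here: it appends the second-end file to the first-end file byte for byte, with no attempt to merge overlapping mates. The function name and the file names (sample_R1.fastq, sample_R2.fastq, sample_combined.fastq) are hypothetical, and the sketch assumes uncompressed FASTQ inputs.

```python
import shutil
from pathlib import Path


def concatenate_reads(inputs, output):
    """Append each input file to `output`, in order, byte for byte (like `cat`)."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                # Stream in chunks so large FASTQ files are never fully in memory.
                shutil.copyfileobj(src, out)
    return Path(output)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    concatenate_reads(
        ["sample_R1.fastq", "sample_R2.fastq"],
        "sample_combined.fastq",
    )
```

The combined file is then used as the single input for the downstream run. Nothing about the reads themselves changes; each mate is simply treated as an independent read.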

This totally makes sense to me. Thank you for answering!

Is it ok to only use the forward reads when using humann3? Is there an advantage to concatenating the R1 and R2 files? Also, how do you concatenate the files? Sorry for all the questions; I’m really new to metagenomics and am trying to make sure I am on the right track.

Please see my reply on your other thread.

Hello franzosa, I used Bowtie to remove host reads, which produced a “sample#.sam” file. Should I use the concatenated forward/reverse reads, or pass the SAM file directly, since I also saw SAM input in the tutorial?
Thank you! -Yike

Sorry for the delayed reply. The SAM files that HUMAnN can accept are mappings that HUMAnN itself has generated; we don’t take SAM as a generic sequence input format. You would need to find a way to dump your reads from the SAM file to (e.g.) FASTQ to start a fresh HUMAnN run.

Also, if the SAM file you have is an alignment against the host genome, you’d presumably want to only dump the unmapped reads for analysis?
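As an illustration of dumping unmapped reads from a host alignment, here is a rough pure-Python sketch that reads SAM text and emits FASTQ records. It is not part of HUMAnN or any bioBakery tool; the function name is hypothetical, and in practice a dedicated tool such as samtools would usually do this job. The sketch keeps records with the "unmapped" flag bit (0x4) set, skips secondary and supplementary records, and appends /1 or /2 to mate names as an assumed naming choice.

```python
# SAM flag bits (from the SAM format specification).
PAIRED, UNMAPPED, REVERSE = 0x1, 0x4, 0x10
FIRST, LAST = 0x40, 0x80
SECONDARY, SUPPLEMENTARY = 0x100, 0x800

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_unmapped_to_fastq(sam_lines):
    """Yield one FASTQ record (as a string) per unmapped read in SAM text."""
    for line in sam_lines:
        if line.startswith("@"):  # header lines; read names cannot start with '@'
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if not flag & UNMAPPED or flag & (SECONDARY | SUPPLEMENTARY):
            continue
        if seq == "*" or qual == "*":  # sequence or qualities not stored
            continue
        if flag & REVERSE:  # restore the original read orientation
            seq = seq.translate(COMPLEMENT)[::-1]
            qual = qual[::-1]
        if flag & PAIRED:  # assumed naming choice: keep mates distinguishable
            name += "/1" if flag & FIRST else "/2" if flag & LAST else ""
        yield f"@{name}\n{seq}\n+\n{qual}\n"
```

Usage would look like `out.writelines(sam_unmapped_to_fastq(sam))` with `sam` and `out` being open text files. Note that for paired data this keeps a read whose mate did map to the host; requiring both mates to be unmapped is a stricter alternative.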

Hello Eric,
Thanks for your answer. Yes, in fact I tried passing the Bowtie output SAM files to humann3. It ran successfully, but it gave me odd, non-human-readable results. I then concatenated the forward and reverse reads with cat and ran the humann3 mapping on that.

Hi,
I understand that paired-end samples should be concatenated before being passed to the basic humann pipeline. We ran KneadData in paired-end mode for quality control. Should we include the single (unmatched) reads in the concatenation, i.e. concatenate all 4 output files from KneadData? (We ran it without decontamination because the host genome is unknown.)