Paired-end WGS sequence analysis

Hi all (@fbeghini, @NSegata)!
Please help me with a basic concept I’m struggling to understand. I want to profile previously submitted paired-end WGS data (SRR2155174_1.fastq, SRR2155174_2.fastq), but I’m confused by the following statement and have a few questions:

“MetaPhlAn 2 can also natively handle paired-end metagenomes (but does not use the paired-end information)”

  • What does “paired-end information” mean here? What is the difference between handling a paired-end metagenome and actually using the paired-end information?

  • Should I concatenate the forward and reverse read files, or just use the command:

    $ metaphlan2.py metagenome_1.fastq,metagenome_2.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq > profiled_metagenome.txt
    
  • Why don’t we make contigs from the two reads (as I did during analysis with the “MOTHUR” software)?

Thanks and Regards,
DC7

“What is the difference between handling a paired-end metagenome and actually using the paired-end information?”

The reads are treated as independent single reads when they are mapped to the reference. Paired-end reads are expected to align to the reference a more or less fixed distance apart and in opposite orientations. MetaPhlAn2 ignores these ‘expectations’.
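
To make those expectations concrete, here is a minimal Python sketch of the kind of check a pair-aware aligner could apply and that, per the explanation above, MetaPhlAn2 does not. The `Alignment` record, the `is_concordant` helper, and the 200 to 500 nt window are all hypothetical and chosen only for illustration; none of them comes from MetaPhlAn2 or Bowtie2.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    ref: str        # name of the reference sequence the read mapped to
    start: int      # 0-based leftmost aligned position on the reference
    length: int     # number of reference bases covered by the read
    reverse: bool   # True if the read aligned to the reverse strand


def is_concordant(a: Alignment, b: Alignment,
                  min_span: int = 200, max_span: int = 500) -> bool:
    """Return True if two mates look like a proper pair (illustrative rules)."""
    if a.ref != b.ref:
        return False                      # mates hit different references
    if a.reverse == b.reverse:
        return False                      # mates must be on opposite strands
    fwd, rev = (a, b) if not a.reverse else (b, a)
    if fwd.start > rev.start:
        return False                      # mates point away from each other
    span = rev.start + rev.length - fwd.start   # implied fragment length
    return min_span <= span <= max_span         # assumed insert-size window


mate1 = Alignment("genome_A", 100, 150, reverse=False)
mate2 = Alignment("genome_A", 300, 150, reverse=True)
print(is_concordant(mate1, mate2))
```

When the reads are treated as independent single reads, each mate is simply mapped on its own and no check like this is made.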

“Why don’t we make contigs from the two reads (as I did during analysis with the MOTHUR software)?”

In WGS, the reads are generally 150 nt or shorter, while the fragments being sequenced are often >300 nt. So the two reads of a pair don’t overlap enough to be assembled into contigs. In 16S sequencing, the 16S gene amplification primers and the sequencing method are designed to produce overlapping paired-end reads so that they can be assembled.
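
The arithmetic behind this can be sketched in a few lines of Python. The helper name `mate_overlap` and the example lengths are assumptions picked for illustration (a 350 nt shotgun fragment and a 253 nt amplicon are hypothetical values, not measurements from any dataset).

```python
def mate_overlap(read_len: int, fragment_len: int) -> int:
    """Bases shared by the two mates of a pair.

    A positive result is the overlap in nt; a negative result is the size
    of the unsequenced gap left in the middle of the fragment.
    """
    # A read cannot cover more bases than the fragment contains.
    covered = min(read_len, fragment_len)
    return 2 * covered - fragment_len


# Shotgun-style case: 150 nt reads from a 350 nt fragment.
print(mate_overlap(150, 350))   # -50, i.e. a 50 nt gap, nothing to merge

# Amplicon-style case: 250 nt reads from a 253 nt amplicon.
print(mate_overlap(250, 253))   # 247, i.e. the mates overlap almost fully
```

A negative value means the mates never meet, which is why merging them into a contig only works in the amplicon-style case.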

Wow… Such a nice explanation. Thanks @bigdoyle. Now it’s very clear to me.

Thanks @bigdoyle for the explanation, I had the same question.
What would be the recommended read length (75 or 150 nt) for paired-end WGS sequencing to use with the MetaPhlAn pipeline?

Thanks,
Reeba

Longer is generally better: longer reads are less likely to map ambiguously and potentially give greater classification accuracy. But longer reads cost more… We run PE 150; I think most labs do.
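
A toy Python sketch of why longer reads map less ambiguously. Everything here is made up for illustration: the two “genomes” are short hypothetical strings that share one region, and exact substring matching stands in for a real aligner such as Bowtie2, which works very differently.

```python
# Two hypothetical references that share a 13 nt region.
shared = "ACGTTTGACCAGG"
refs = {
    "genome_A": "TTCA" + shared + "ATCCA",
    "genome_B": "GGCT" + shared + "TTAAC",
}


def matching_refs(read: str, refs: dict) -> list:
    """Names of all references that contain the read as an exact substring."""
    return [name for name, seq in refs.items() if read in seq]


short_read = shared[2:10]            # 8 nt, lies entirely in the shared region
long_read = refs["genome_A"][2:17]   # 15 nt, extends into genome_A-specific flank

print(matching_refs(short_read, refs))  # ['genome_A', 'genome_B'] (ambiguous)
print(matching_refs(long_read, refs))   # ['genome_A'] (unique)
```

The short read fits inside the shared region and so matches both references, while the longer read reaches into sequence found only in one of them.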

What do you mean by “fragments being sequenced often >300nt”?

Before DNA is sequenced, it is first fragmented. Each fragment is then sequenced from the forward end and then from the opposite (reverse) end. He meant that the whole fragment is >300 nt long, while the read sequenced from each end is 150 nt or shorter.
Thanks,
DC7

@DEEPCHANDA7 Thanks dear! I got it. 🙂
