Same ID for paired-end reads

This section of the HUMAnN2 help says:

If you have paired-end Illumina sequencing data, processed by CASAVA v1.8+ the sequence identifiers for the pairs have the same id after truncating by space. To keep track of the individual reads throughout the HUMAnN2 workflow, the software will initially remove the spaces from the identifiers. This prevents read pairs from having the same id after being processed by bowtie2 and diamond. If this is the case for your read set, you will see an informative message printed to the screen and the log file indicating the removal of spaces from the sequence identifiers.

In some paired-end files from SRA, though, removing spaces from the identifiers isn't enough to keep read pairs from ending up with the same ID. When both mates have the same length, the forward and reverse files can carry identical headers, e.g.:


@SRRnnnnnnn.1 1 length=96


@SRRnnnnnnn.1 1 length=96

Would this be a problem when these files are concatenated?
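
To make the concern concrete, here is a minimal sketch of the collision. `strip_spaces` is a hypothetical stand-in for the space removal described in the help text, not HUMAnN2's actual code, and the CASAVA-style headers are made-up examples of the v1.8+ layout:

```python
def strip_spaces(header: str) -> str:
    """Hypothetical stand-in for the space removal described in the help text."""
    return header.replace(" ", "")


# Made-up CASAVA v1.8+ style headers: the mate number follows the space.
casava_r1 = "@M00001:1:000000000-A1B2C:1:1101:15589:1332 1:N:0:1"
casava_r2 = "@M00001:1:000000000-A1B2C:1:1101:15589:1332 2:N:0:1"

# SRA-style headers as shown above: nothing distinguishes the two mates.
sra_r1 = "@SRRnnnnnnn.1 1 length=96"
sra_r2 = "@SRRnnnnnnn.1 1 length=96"

# True means the two mates still share an ID after the spaces are removed.
casava_collides = strip_spaces(casava_r1) == strip_spaces(casava_r2)
sra_collides = strip_spaces(sra_r1) == strip_spaces(sra_r2)

print("CASAVA pair collides:", casava_collides)
print("SRA pair collides:", sra_collides)
```

With CASAVA headers the mate number survives the space removal, so the IDs stay distinct. With the SRA headers above there is nothing left to tell the mates apart.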

Indeed, if two distinct reads have identical headers, HUMAnN2 will not be able to tell them apart. Our assumption is that the pairing information is encoded somewhere in the read header, so that read 1 and read 2 end up with distinct headers.
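
One possible workaround, sketched here rather than taken from the HUMAnN2 codebase, is to tag the mates yourself before concatenating the files. The function names `tag_mates` and `concat_pairs` are hypothetical, the `/1` and `/2` suffixes are just a common convention, and the code assumes uncompressed FASTQ with strict four-line records:

```python
def tag_mates(lines, mate):
    """Yield FASTQ lines with '/<mate>' appended to the first field of each header.

    Assumes strict 4-line records, so headers are found by position and
    quality lines that happen to start with '@' are left alone.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            name, _, rest = line.partition(" ")
            line = f"{name}/{mate}" + (f" {rest}" if rest else "")
        yield line + "\n"


def concat_pairs(r1_path, r2_path, out_path):
    """Write the tagged forward reads followed by the tagged reverse reads."""
    with open(r1_path) as r1, open(r2_path) as r2, open(out_path, "w") as out:
        out.writelines(tag_mates(r1, 1))
        out.writelines(tag_mates(r2, 2))
```

For example, `concat_pairs("sample_1.fastq", "sample_2.fastq", "sample_concat.fastq")` (placeholder file names) would turn both copies of `@SRRnnnnnnn.1 1 length=96` into `@SRRnnnnnnn.1/1 1 length=96` and `@SRRnnnnnnn.1/2 1 length=96`, which remain distinct whether the identifier is truncated at the first space or has its spaces removed.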

Can you point to a specific file/sample that has this property? I’d be interested to evaluate how widespread it is.