Same ID for paired-end reads

Levi_Waldron · December 6, 2019, 9:13am

This section of the HUMAnN2 help says:

If you have paired-end Illumina sequencing data, processed by CASAVA v1.8+ the sequence identifiers for the pairs have the same id after truncating by space. To keep track of the individual reads throughout the HUMAnN2 workflow, the software will initially remove the spaces from the identifiers. This prevents read pairs from having the same id after being processed by bowtie2 and diamond. If this is the case for your read set, you will see an informative message printed to the screen and the log file indicating the removal of spaces from the sequence identifiers.

In some paired-end files from SRA, removing spaces from identifiers to prevent truncation isn’t enough to prevent read pairs from having the same ID if they have the same length, e.g.:

R1:

@SRRnnnnnnn.1 1 length=96

R2:

@SRRnnnnnnn.1 1 length=96

Would this be a problem when these files are concatenated?
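A quick way to check whether a given pair of files is affected is to compare the R1 and R2 headers after spaces are removed. The following is a minimal sketch, not HUMAnN2's own code: it assumes plain 4-line FASTQ records, models the space removal as a simple `replace(" ", "")`, and uses hypothetical helper names (`fastq_headers`, `colliding_ids`).

```python
from itertools import islice

def fastq_headers(lines):
    """Yield the header (every 4th line) of a 4-line-per-record FASTQ stream."""
    return (line.rstrip("\n") for line in islice(lines, 0, None, 4))

def colliding_ids(r1_lines, r2_lines):
    """Return R2 IDs that also occur in R1 once spaces are removed.

    Assumption: space removal is modeled as str.replace(" ", ""); the
    real tool may differ in detail.
    """
    def strip(header):
        return header.replace(" ", "")

    r1_ids = {strip(h) for h in fastq_headers(r1_lines)}
    return [strip(h) for h in fastq_headers(r2_lines) if strip(h) in r1_ids]

# The example headers from above, as in-memory records.
r1 = ["@SRRnnnnnnn.1 1 length=96\n", "ACGT\n", "+\n", "IIII\n"]
r2 = ["@SRRnnnnnnn.1 1 length=96\n", "TGCA\n", "+\n", "IIII\n"]
print(colliding_ids(r1, r2))  # ['@SRRnnnnnnn.11length=96']
```

For real files, open file handles can be passed in place of the lists; any non-empty result means the two mates would share an ID after concatenation.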

franzosa · December 9, 2019, 8:12pm

Indeed, if two distinct reads have identical headers, then HUMAnN2 will not be able to tell them apart. Our assumption is that the end-pairing information will be encoded in the read header in some manner, so that read1/read2 end up with distinct headers.

Can you point to a specific file/sample that has this property? I’d be interested to evaluate how widespread it is.
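If the headers do collide, one possible workaround is to encode the mate number in the header before concatenating, so that read1/read2 become distinct. The sketch below is an illustration under stated assumptions rather than an official HUMAnN2 step: it assumes 4-line FASTQ records, appends a `/1` or `/2` suffix to the first whitespace-delimited token of each header, and `tag_mate` is a hypothetical helper name.

```python
def tag_mate(lines, mate):
    """Yield FASTQ lines with '/<mate>' appended to each header's read ID.

    Assumption: records are exactly 4 lines, so headers sit at line
    indices 0, 4, 8, ... Only the first token of the header is changed.
    """
    for i, line in enumerate(lines):
        if i % 4 == 0:
            read_id, _, rest = line.rstrip("\n").partition(" ")
            line = f"{read_id}/{mate}" + (f" {rest}" if rest else "") + "\n"
        yield line

r1 = ["@SRRnnnnnnn.1 1 length=96\n", "ACGT\n", "+\n", "IIII\n"]
r2 = ["@SRRnnnnnnn.1 1 length=96\n", "TGCA\n", "+\n", "IIII\n"]

# Concatenate the tagged mates; the two headers now differ.
combined = list(tag_mate(r1, 1)) + list(tag_mate(r2, 2))
print(combined[0], combined[4], sep="")
# @SRRnnnnnnn.1/1 1 length=96
# @SRRnnnnnnn.1/2 1 length=96
```

Because the suffix is attached to the first token, the headers stay distinct whether a downstream tool truncates at the first space or removes spaces entirely.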
