Is it necessary to run Trimmomatic before HUMAnN? My read counts dropped from around 10,000,000 to 7,000,000 (forward) and 2,000,000 (reverse)

Hi @franzosa @lauren.j.mciver !!! I am using data already submitted to SRA. I ran Trimmomatic with default settings on paired-end reads. In the forward read file, the read count dropped from 10,000,000 to 7,000,000, and in the reverse file from 10,000,000 to 2,000,000. That is a huge reduction in the number of reads. What should I do? Should I skip the Trimmomatic step and run only decontamination? Will that be okay?

(Note: earlier I ran MetaPhlAn without any quality-control step, since Francesco Beghini suggested that the trimming and filtering done by MetaPhlAn is sufficient.)

Thanks

Hi - Thank you for the detailed post. Are you using kneaddata? If so, it looks like kneaddata is having an issue tracking pairs. My guess is that there are spaces in the SRA read identifiers, which bowtie2 truncates. We are working on adding a new feature to kneaddata to handle spaces in read identifiers. Would you check your input files for spaces? Also, the kneaddata log should have read counts at each step of the QC process. Can you double-check the log (apologies if you already did this) just to confirm the reads are being filtered at the Trimmomatic step? If so, check the options being passed to Trimmomatic; you might just want to rerun kneaddata with different Trimmomatic settings. And yes, it is a good idea to run quality filtering and host decontamination prior to running HUMAnN.
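
A minimal way to run the identifier check from the shell might look like the following (a sketch only; sample_1.fastq.gz is a placeholder for your actual input file):

# print the first few sequence identifiers (every 4th line, starting at line 1)
$ zcat sample_1.fastq.gz | awk 'NR % 4 == 1' | head -n 5

# count how many identifiers contain a space (0 means no spaces)
$ zcat sample_1.fastq.gz | awk 'NR % 4 == 1' | grep -c ' '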

Thank you,
Lauren

Hi @lauren.j.mciver - I am using kneaddata on paired-end reads for decontamination, and it seems the read number increased after this step. Here's my command and output; please note the read counts in the output.
Command:
for R1 in *_1.fastq.gz ; do
    sampleName=${R1%%_1.fastq.gz}
    kneaddata --input ${sampleName}_1.fastq.gz --input ${sampleName}_2.fastq.gz \
        --reference-db /home/deepchandagcp/kneaddata_db/ \
        --bypass-trim --threads 8 --cat-final-output --max-memory 20000m \
        --output kneaddata_output/
done

Output:

Decompressing gzipped file …
Initial number of reads ( /data/Desktop/prjeb2054_test/kneaddata_output/decompressed_jgdduff0_ERR011093_Metagenomics_of_human_gut_metagenome_2.fastq ): 10856019.0
Initial number of reads ( /data/Desktop/prjeb2054_test/kneaddata_output/decompressed_lyqwh2a8_ERR011093_Metagenomics_of_human_gut_metagenome_1.fastq ): 10856019.0
Bypass trimming
Decontaminating …
Running bowtie2 …
Total reads after removing those found in reference database ( /data/Desktop/prjeb2054_test/kneaddata_output/ERR011093_Metagenomics_of_human_gut_metagenome_1_kneaddata_hg37dec_v0.1_bowtie2_paired_clean_1.fastq ): 19540240.0
Total reads after removing those found in reference database ( /data/Desktop/prjeb2054_test/kneaddata_output/ERR011093_Metagenomics_of_human_gut_metagenome_1_kneaddata_hg37dec_v0.1_bowtie2_paired_clean_2.fastq ): 2171138.0
Total reads after merging results from multiple databases ( /data/Desktop/prjeb2054_test/kneaddata_output/ERR011093_Metagenomics_of_human_gut_metagenome_1_kneaddata_paired_1.fastq ): 19540240.0
Total reads after merging results from multiple databases ( /data/Desktop/prjeb2054_test/kneaddata_output/ERR011093_Metagenomics_of_human_gut_metagenome_1_kneaddata_paired_2.fastq ): 2171138.0
Total reads after removing those found in reference database ( /data/Desktop/prjeb2054_test/kneaddata_output/ERR011093_Metagenomics_of_human_gut_metagenome_1_kneaddata_hg37dec_v0.1_bowtie2_unmatched_1_clean.fastq ): 594.0
Total reads after merging results from multiple databases ( /data/Desktop/prjeb2054_test/kneaddata_output/ERR011093_Metagenomics_of_human_gut_metagenome_1_kneaddata_unmatched_1.fastq ): 594.0
Total reads after removing those found in reference database ( /data/Desktop/prjeb2054_test/kneaddata_output/ERR011093_Metagenomics_of_human_gut_metagenome_1_kneaddata_hg37dec_v0.1_bowtie2_unmatched_2_clean.fastq ): 0.0
Total reads after merging results from multiple databases ( /data/Desktop/prjeb2054_test/kneaddata_output/ERR011093_Metagenomics_of_human_gut_metagenome_1_kneaddata_unmatched_2.fastq ): 0.0

Final output files created:
/data/Desktop/prjeb2054_test/kneaddata_output/ERR011093_Metagenomics_of_human_gut_metagenome_1_kneaddata_paired_1.fastq
/data/Desktop/prjeb2054_test/kneaddata_output/ERR011093_Metagenomics_of_human_gut_metagenome_1_kneaddata_paired_2.fastq
/data/Desktop/prjeb2054_test/kneaddata_output/ERR011093_Metagenomics_of_human_gut_metagenome_1_kneaddata_unmatched_1.fastq
/data/Desktop/prjeb2054_test/kneaddata_output/ERR011093_Metagenomics_of_human_gut_metagenome_1_kneaddata_unmatched_2.fastq
/data/Desktop/prjeb2054_test/kneaddata_output/ERR011093_Metagenomics_of_human_gut_metagenome_1_kneaddata.fastq
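
(As a quick sanity check on these numbers, one could recount the reads in the final files directly, since each FASTQ record spans four lines. This is just a sketch, using the output directory shown above:)

# report the number of reads (lines divided by 4) in each final paired output file
$ for f in kneaddata_output/*_kneaddata_paired_*.fastq; do echo "$f: $(( $(wc -l < "$f") / 4 )) reads"; done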

Why is this happening? I have gone through a previous post on this but did not find a solution. Please help.

Thanks

Hi, Thank you for the detailed log post. I think kneaddata is having a hard time tracking the read pairs. Can you check whether the input files have “/1” and “/2” as the pair identifier at the end of each read’s sequence identifier? Also, double-check that there are no spaces in the sequence identifiers. I think it is likely one of these two cases that is causing kneaddata to lose track of each read’s pair id.
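
A minimal check for the pair suffix (a sketch; the file names are placeholders for your decompressed inputs):

# count identifiers ending in /1 in the forward file and /2 in the reverse file
$ awk 'NR % 4 == 1' sample_1.fastq | grep -c '/1$'
$ awk 'NR % 4 == 1' sample_2.fastq | grep -c '/2$'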

Thank you,
Lauren

Hi… there is no “/1” or “/2”, and there is also no space in the read identifiers. The identifiers look like: @ERR260205.10868.

Thanks

Hi, Thank you for checking. Yes, without the pair identifier kneaddata will not be able to track the read pairs. Is it possible for you to add the pair identifiers ("/1" to R1 and “/2” to R2) and then rerun kneaddata? On our end, we will work on adding this use case to the code so you will not have to edit your input files in the future.

Thank you,
Lauren

Thanks… but what should I do now? Should I rely on the current output? I don’t know how to add the pair identifiers, since I am working with data already submitted to SRA.

Hi, You could run sed to generate new files with the pair identifiers (see commands below). Sorry again for the inconvenience. We will have the use case of input files without pair identifiers covered in the next version of kneaddata.

$ sed '1~4 s/$/\/1/g' < original_R1.fastq > new_R1.fastq
$ sed '1~4 s/$/\/2/g' < original_R2.fastq > new_R2.fastq
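
(For reference, the 1~4 address is GNU sed syntax meaning every fourth line starting at line 1, i.e. the FASTQ header lines. A quick spot-check of the result, as a sketch:)

# the first line of each new file should now end in /1 or /2 respectively
$ head -n 4 new_R1.fastq
$ head -n 4 new_R2.fastq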

Thank you,
Lauren

Thanks a lot @lauren.j.mciver. I will let you know how it works.

Hi @lauren.j.mciver… sorry for bothering you so many times. In another dataset I am seeing that the problem is the space: the identifiers do contain “/1” and “/2”, but they also contain spaces. Can you please give a solution to this as well?

Thanks

Hi, No worries! Sorry again for this inconvenience. We will get both use cases, 1) missing pair ids and 2) spaces in sequence ids, covered in the next version of kneaddata.

For now use the following command to remove any spaces.

$  sed 's/ //g' < original.fastq > new.fastq
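
(If you would rather not touch the sequence and quality lines at all, a GNU sed variant restricted to the identifier lines of each record is sketched below; this is just an alternative sketch, not an official kneaddata recommendation:)

# remove spaces only from the @header (line 1) and + (line 3) lines of each 4-line record
$ sed -e '1~4 s/ //g' -e '3~4 s/ //g' < original.fastq > new.fastq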

Thanks,
Lauren

Hi @lauren.j.mciver… just to let you know regarding the space issue: your command worked like magic. I have not yet tried the other one you provided; I will let you know once I use it.

Many many thanks for your prompt help.

DC7