HUMAnN3 functional annotation doubts

Hello @franzosa !

1. Doubt 1:

Can we use assembled contigs for humann3 input?

2. Doubt 2:

Also, do I need to run KneadData before running HUMAnN3?
When running MetaPhlAn3 I did not use KneadData, because I assumed that, since the ChocoPhlAn database contains only microbial species-specific markers, it was not necessary to remove human reads. Please let me know if my understanding is incorrect. But should I run KneadData before running HUMAnN3? If yes, why?

3. Doubt 3:

I had already profiled the community with MetaPhlAn3, and I am now planning functional annotation of the same datasets. Since they are paired-end data, you explained in another topic that the paired-end files should be concatenated before running HUMAnN3. But would concatenation change the community profiling results I had already obtained by running MetaPhlAn3 on the paired-end files? At that time I did not concatenate the fastq files.

Hi @saras22

Doubt 1: I’m pretty sure HUMAnN was not designed for contigs, and @franzosa has answered this question here.

Doubt 2: you should always preprocess your reads when working with sequencing data, because base quality within each read varies by the nature of the sequencing process itself. Preprocessing paired-end data also lets you align the mates and perform quality trimming in a pairwise manner, yielding more confidence in each base call.

KneadData is a convenient wrapper that combines quality control (FastQC), tandem-repeat removal (TRF), quality trimming (Trimmomatic), and decontamination (i.e. host DNA removal; Bowtie2). fastp is also great, although it does not bundle Bowtie2, so you have to run the decontamination step separately.
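For reference, a typical KneadData run on paired-end reads looks something like the sketch below. All paths are placeholders, and note that the paired-input flags differ between KneadData releases (recent versions use `--input1`/`--input2`; older ones take `--input` twice):

```shell
# Preprocess one paired-end sample with KneadData (paths are placeholders;
# the host reference database must have been built/downloaded beforehand).
kneaddata \
    --input1 sample_R1.fastq.gz \
    --input2 sample_R2.fastq.gz \
    --reference-db /path/to/human_bowtie2_db \
    --output kneaddata_out \
    --threads 8
```

The output folder will contain the cleaned paired files plus the "contaminated" files holding reads that matched the host reference.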

Even though your reference database only contains microbial DNA, having human reads (or any host reads, depending on the source of your samples) in your data can increase your false positives, i.e. some reads that should align to human sequences could instead align to microbial DNA. It also lengthens processing time. An extreme example is saliva samples, which may contain up to 90% human DNA; in such cases the false-positive rate could be very high, and the computing time can be greatly reduced by removing human DNA upstream, which is exactly what preprocessing is for.

If you’re not sure how contaminated your samples are, run decontamination on a few of them (either directly with Bowtie2, or with KneadData using a couple of bypass options to speed things up) and look at the number of reads in the “contaminated” files it produces.
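As a sketch of the standalone-Bowtie2 route (the database path is a placeholder), you can keep only the pairs that do not align to the host and then count reads, since a FASTQ record is always 4 lines:

```shell
# Standalone decontamination sketch (database path is a placeholder):
#   bowtie2 -x /path/to/human_bowtie2_db \
#       -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
#       --un-conc-gz sample_clean_R%.fastq.gz \
#       -S /dev/null --threads 8

# Count reads in a gzipped FASTQ file: each record spans 4 lines.
count_reads () {
    echo $(( $(zcat "$1" | wc -l) / 4 ))
}
# Comparing count_reads on the raw vs. decontaminated files gives a
# quick estimate of the fraction of host reads in a sample.
```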

Doubt 3: HUMAnN does not use paired-end information and, as far as I know, neither does MetaPhlAn (as mentioned here), so concatenating the files will not change your profiling results. Keep in mind that the HUMAnN pipeline actually starts by running MetaPhlAn to build a community profile, so it’s two birds with one stone. Since you’ve already profiled your community, you can start HUMAnN with the --taxonomic-profile bypass mode detailed here, which will save you some time.
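Concretely, the concatenate-then-bypass workflow looks something like this (all file names and paths are placeholders for your own data):

```shell
# HUMAnN treats reads independently, so the two mates of each pair can
# simply be concatenated (gzip files can be concatenated directly).
cat sample_R1.fastq.gz sample_R2.fastq.gz > sample_concat.fastq.gz

# Reuse the MetaPhlAn profile you already have instead of re-profiling.
humann \
    --input sample_concat.fastq.gz \
    --taxonomic-profile sample_metaphlan_profile.tsv \
    --output humann_out \
    --threads 8
```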

Hope this helps!

Hey @jorondo1
Thank you for addressing all my queries. Now that I have installed HUMAnN version 3.0.0, I am encountering an error that says:

CRITICAL ERROR: The directory provided for ChocoPhlAn contains files ( mpa_v30_CHOCOPhlAn_201901.fna.bz2 ) that are not of the expected version. Please install the latest version of the database: 201901b

It is asking for a specific ChocoPhlAn database version, but what I had used for MetaPhlAn3 taxonomic profiling was this: mpa_v30_CHOCOPhlAn_201901.fna.bz2. Is it possible to run HUMAnN with my version of the database?

Thanks in advance

Getting caught up here. Many thanks @jorondo1 for your informative replies above! I agree with your responses.

@saras22 The error you are seeing there is because you have an unexpected file in your pangenomes (ChocoPhlAn) folder. The file appears to be a compressed MetaPhlAn marker database. If you remove this file then HUMAnN won’t raise that error. For running HUMAnN 3.0.0 you need to use the 201901b version of the HUMAnN databases, which offer some fixes and improvements relative to the original (201901) databases we released with the alpha lineage of the software. Your existing MetaPhlAn profiles should work just fine with the updated databases (passed via the --taxonomic-profile flag), though you could reprofile the samples taxonomically within HUMAnN just to be safe if desired.
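For clarity, the cleanup plus database refresh could look roughly like this (install paths are placeholders; `humann_databases` is the bundled download utility):

```shell
# Move the stray MetaPhlAn marker file out of the pangenome folder
# (paths are placeholders for your own layout).
mv /path/to/chocophlan/mpa_v30_CHOCOPhlAn_201901.fna.bz2 \
   /path/to/metaphlan_databases/

# Fetch the 201901b-era HUMAnN databases: the ChocoPhlAn pangenomes
# plus a full UniRef translated-search database.
humann_databases --download chocophlan full /path/to/humann_dbs
humann_databases --download uniref uniref90_diamond /path/to/humann_dbs
```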

Hi @franzosa
I tried installing the full version of the ChocoPhlAn database, which completed successfully. When I checked the database directory, it contains 11286 files with names like this:
g__Abditibacterium.s__Abditibacterium_utsteinense.centroids.v296_v201901b.ffn.gz

Although the download reported that the database was downloaded successfully, I still want to know whether it was downloaded correctly, because it does not contain any other kind of files except the ones mentioned above.
Also, when I run the command:
humann --input /home/rakesh/Subsampled_70bp_fastq/concat_fastq/A1_.fq.gz --nucleotide-database /home/rakesh/Humann_Database/chocophlan/ --output /home/rakesh/Humann_out/ --threads 10

it gives me this error:
Output files will be written to: /home/rakesh/Humann_out
Decompressing gzipped file …

Removing spaces from identifiers in input file …

CRITICAL ERROR: The metaphlan executable can not be found. Please check the install.

Can you help me resolve the error?

Also, I have one more question: after downloading the ChocoPhlAn database, do I need to download the other translated-search databases, or is the ChocoPhlAn database enough for finding the metabolic pathways?

thanks

@franzosa
When I run the following command:

humann --input /home/rakesh/Subsampled_70bp_fastq/concat_fastq/A1_.fq.gz --nucleotide-database /home/rakesh/Humann_Database/chocophlan/ --output /home/rakesh/Humann_out/ --threads 10

in the environment where I have installed MetaPhlAn3, I get this error:

Output files will be written to: /home/rakesh/Humann_out
Decompressing gzipped file …
Removing spaces from identifiers in input file …

WARNING: Can not call software version for bowtie2
ERROR: You are using the demo UniRef database with a non-demo input file. If you have not already done so, please run humann_databases to download the full UniRef database. If you have downloaded the full database, use the option --protein-database to provide the location. You can also run humann_config to update the default database location. For additional information, please see the HUMAnN User Manual.

Do you have any idea what is happening here… I am getting totally confused…

Please help