Eukaryotic Uniref90 Gene Families in Gene Family TSV Files

Hi, I’ve run HUMAnN2 with the following command: *humann2 -i input.fastq.gz -o $OUTPUT_DIR --threads 8* (default settings with the ChocoPhlAn and UniRef90 databases) on lung microbiome metatranscriptome samples that were quality trimmed and dehosted (reads mapping to hg19 were removed).

From the resulting output directory, I opened the gene_families.tsv file, took the UniRef90 gene family IDs, and manually looked them up in the UniRef90 database to see which organisms the gene families map to.

Some of the hits were to human and other eukaryotic proteins. A few examples are listed below:

UniRef90_B8YIA7 - Cluster: CYP1B1 protein (Fragment) – Organism: Homo sapiens (Human)

UniRef90_P38571 - Cluster: Lysosomal acid lipase/cholesteryl ester hydrolase – Organism: Homo sapiens (Human)

UniRef90_G7NZS0 - Cluster: Uncharacterized protein (Fragment) – Organism: Macaca fascicularis (Crab-eating macaque) (Cynomolgus monkey), Macaca mulatta (Rhesus macaque)

Are these false positives? Or is it possible to restrict the UniRef90 database to prokaryotes only?

Are these UniRef90s assigned to the unclassified stratification, meaning that their abundance was identified from translated search? If so, it is possible that they represent residual host contamination. Conversely, UniRef90 abundance assigned to specific species is much less likely to be host-derived.
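One quick way to check this is to list the gene families whose only stratum is "unclassified". The snippet below is a minimal sketch, not part of HUMAnN2 itself. It assumes the usual stratified gene families layout (a "#" header line, then tab-separated rows where stratified features look like `UniRef90_X|stratum`). The function name, the abundance values, and the ID `UniRef90_A0A000` are made up for illustration.

```python
def unclassified_only(lines):
    """Return gene families whose only stratum is 'unclassified'."""
    strata = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        feature = line.split("\t", 1)[0]
        if "|" in feature:
            family, stratum = feature.split("|", 1)
            strata.setdefault(family, set()).add(stratum)
        else:
            # Community-total row: register the family with no strata yet.
            strata.setdefault(feature, set())
    return sorted(f for f, s in strata.items() if s == {"unclassified"})


# Illustrative table; abundances and UniRef90_A0A000 are invented.
EXAMPLE = "\n".join([
    "# Gene Family\tsample_Abundance-RPKs",
    "UniRef90_P38571\t12.0",
    "UniRef90_P38571|unclassified\t12.0",
    "UniRef90_A0A000\t30.0",
    "UniRef90_A0A000|g__Bacteroides.s__Bacteroides_dorei\t25.0",
    "UniRef90_A0A000|unclassified\t5.0",
])

print(unclassified_only(EXAMPLE.splitlines()))
```

Families returned by this function were quantified only through translated search, so they are the ones worth checking for host origin first.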

Since you’re dealing with RNA, you’ll want to host-deplete against a human transcript database in addition to the human genome. It’s possible that you have host reads in your sample that cover fused exons, in which case the read might not map to the genome (where the exons are not adjacent).

The UniRef90s are assigned to unclassified stratification. Do you have a suggestion for which transcript database would be best to use?

Is there a way to make the uniref90 database prokaryotic only?

We use a human EST database from NCBI to remove host-derived RNA when performing quality control on metatranscriptomic sequences:

https://bitbucket.org/biobakery/kneaddata/wiki/Home

You can use the infer_taxonomy script to attach the original sources of UniRef90s to their identifiers. This might help you to weed out UniRef90s in the unclassified stratum that were due to host contamination.
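Once you have decided which UniRef90 IDs are host-derived, removing them from the table is straightforward. This is a rough sketch under the assumption that the file uses the standard tab-separated gene families layout with `family|stratum` feature names. `drop_families` is a hypothetical helper, not a HUMAnN2 function, and the abundances and the ID `UniRef90_A0A000` are invented.

```python
def drop_families(lines, host_ids):
    """Keep the header and every row whose gene family is not in host_ids."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # always keep the header
            continue
        feature = line.split("\t", 1)[0]
        family = feature.split("|", 1)[0]  # strip the stratum, if any
        if family not in host_ids:
            kept.append(line)
    return kept


# Illustrative table; values and UniRef90_A0A000 are made up.
TABLE = [
    "# Gene Family\tsample_Abundance-RPKs",
    "UniRef90_P38571\t12.0",
    "UniRef90_P38571|unclassified\t12.0",
    "UniRef90_A0A000\t30.0",
    "UniRef90_A0A000|unclassified\t30.0",
]

for row in drop_families(TABLE, {"UniRef90_P38571"}):
    print(row)
```

Both the community-total row and all stratified rows of a flagged family are dropped, so the remaining table stays internally consistent.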

https://bitbucket.org/biobakery/humann2/wiki/Home#markdown-header-humann2_infer_taxonomy

I tried to follow the link you posted for infer_taxonomy, but it leads to a page that says the link has no power. Has the page moved somewhere else on Bitbucket?

Hi Kying,
Yes, HUMAnN2 was migrated to GitHub and is now referred to as HUMAnN 2.0. You can find the same information using the link below.

Thanks,
Sagun

Thanks Sagun. I was able to use infer_taxonomy, and the GitHub page states:

“The modified gene families output files can then be reprocessed through HUMAnN 2.0 to compute pathway abundance/coverage using the inferred taxonomic stratifications.”

What function would I be using?

You can provide the resulting gene families file (or any gene families file) to HUMAnN as an --input. It will know that it is starting from genes rather than raw sequencing reads (the typical input) based on the file formatting.

I passed it to HUMAnN2 as --input, but when I tried to specify an --input-format, TSV was not an accepted format.

If you are using --input-format you would specify “genetable”, but the format should be automatically detected from --input with the TSV extension.

I ran the command humann2 --input file.tsv --input-format "genetable" --output output_folder, and it starts running through the HUMAnN2 workflow.

I get the following error message: CalledProcessError: Command '['python', $/humann2/quantify/MinPath12hmp.py', '-any', '$output', '-map', '$output', '-report', '$output', '-details', '$output', '-mps', $output']' returned non-zero exit status 1

It seems that there is an error with glpsol from MinPath when it tries to compute the pathway abundance and coverage from the new infer_taxonomy gene families file.

I’m not sure where the issue lies, so I don’t know how to fix the error.

Have you inspected the inferred gene families file to make sure it looks OK? If you’re able to share that file it might help us to diagnose this error.
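If it helps, here is a minimal sketch of the kind of checks I mean. It is not HUMAnN's own validation, just a few assumptions about what a well-formed gene table looks like: a header line starting with "#", the same number of tab-separated columns on every line, and numeric abundances. The function name `check_gene_table` is hypothetical.

```python
def check_gene_table(lines):
    """Return a list of human-readable problems found in a gene table."""
    problems = []
    expected_cols = None
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        fields = line.split("\t")
        if expected_cols is None:
            # Treat the first line as the header.
            if not line.startswith("#"):
                problems.append(f"line {number}: header does not start with '#'")
            if len(fields) < 2:
                problems.append(f"line {number}: fewer than two tab-separated columns")
            expected_cols = len(fields)
            continue
        if len(fields) != expected_cols:
            problems.append(
                f"line {number}: {len(fields)} columns, expected {expected_cols}"
            )
            continue
        for value in fields[1:]:
            try:
                float(value)
            except ValueError:
                problems.append(f"line {number}: non-numeric abundance {value!r}")
    return problems


# Illustrative input: line 2 uses a space instead of a tab, line 3 has no number.
BAD = [
    "# Gene Family\tsample_Abundance-RPKs",
    "UniRef90_P38571 12.0",
    "UniRef90_B8YIA7\tNA",
]
for problem in check_gene_table(BAD):
    print(problem)
```

An empty list means none of these particular problems were found; it does not guarantee that the downstream pathway step will succeed.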

I have looked, and the inferred gene families look fine. I would be happy to share the inferred gene families file. What would be the best way to send it to you?

If you’re not able or don’t want to attach it here, you can email it to me at franzosa@hsph.harvard.edu. If it’s too large to email, you can attach just the first ~1000 lines or so.

I have just sent it to your email. Thanks.

Sorry for the long delay! Your genes file looked fine to me. I renamed it to test.tsv and ran it through HUMAnN with the following command:

humann2 --input test.tsv --output . --input-format genetable

And it produced pathway-level output files successfully. It seems like there might be something wrong with your installation?

I think there might be. The error message seems to come from running glpsol in MinPath.

Is there a way to update MinPath and all its prerequisites, or a way to update all the packages in HUMAnN2?

You could have a look at this thread from the archived forum which dealt with upgrading glpk to fix minpath problems:

https://groups.google.com/d/topic/humann-users/MSVmXC7DSW0/discussion

Thank you Eric. I will report back when I find the answer.

I was able to fix the issue by reinstalling HUMAnN2 using: conda install -c biobakery humann2

I am now able to get the same two files that you produced.