Eukaryotic Uniref90 Gene Families in Gene Family TSV Files

kying · February 5, 2020, 6:11pm

Hi, I’ve run HUMAnN2 using the following command: humann2 -i input.fastq.gz -o $OUTPUT_DIR --threads 8 (default settings with the ChocoPhlAn and UniRef90 databases) on lung microbiome metatranscriptome samples that were quality-trimmed and dehosted (reads mapping to hg19 were removed).

From the resulting output directory, I opened the gene_families.tsv file, took the UniRef90 gene family IDs, and manually looked them up in the UniRef90 database to see which organisms the gene families map to.

Some of the results mapped to human and other eukaryotic organisms. A few examples are listed below:

UniRef90_B8YIA7 - Cluster: CYP1B1 protein (Fragment) – Organism: Homo sapiens (Human)

UniRef90_P38571 - Cluster: Lysosomal acid lipase/cholesteryl ester hydrolase – Organism: Homo sapiens (Human)

UniRef90_G7NZS0 - Cluster: Uncharacterized protein (Fragment) – Organism: Macaca fascicularis (Crab-eating macaque) (Cynomolgus monkey), Macaca mulatta (Rhesus macaque)

Are these false positives? Or is it possible to make the uniref90 database unique for only prokaryotes?

franzosa · February 5, 2020, 6:29pm

Are these UniRef90s assigned to the unclassified stratification, meaning that their abundance was identified from translated search? If so, it is possible that they represent residual host contamination. Conversely, UniRef90 abundance assigned to specific species is much less likely to be host-derived.

Since you’re dealing with RNA, you’ll want to host-deplete against a human transcript database in addition to the human genome. It’s possible that you have host reads in your sample that cover fused exons, in which case the read might not map to the genome (where the exons are not adjacent).
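As a rough illustration of the check described above, the sketch below lists the UniRef90 IDs that appear in the unclassified stratum of a gene families table. It assumes the usual single-sample layout (a header line starting with "#", then tab-separated rows whose first field is either an ID or ID|stratum). The function name, the placeholder IDs, and the abundance values are made up for the example.

```python
import io


def unclassified_gene_families(handle):
    """Return {gene family ID: abundance} for rows in the 'unclassified' stratum."""
    hits = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = line.split("\t")
        # Stratified rows look like "UniRef90_X|stratum"; totals have no "|".
        feature, _, stratum = fields[0].partition("|")
        if stratum == "unclassified":
            hits[feature] = float(fields[1])
    return hits


# Toy table: the IDs and abundance values are placeholders, not real results.
example = io.StringIO(
    "# Gene Family\tsample_Abundance-RPKs\n"
    "UniRef90_EXAMPLE1\t12.5\n"
    "UniRef90_EXAMPLE1|unclassified\t12.5\n"
    "UniRef90_EXAMPLE2\t3.0\n"
    "UniRef90_EXAMPLE2|g__Prevotella.s__Prevotella_copri\t3.0\n"
)
print(unclassified_gene_families(example))
```

With a real file, the same function can be called on open("gene_families.tsv") in place of the toy table.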

kying · February 11, 2020, 3:54pm

The UniRef90s are assigned to unclassified stratification. Do you have a suggestion for which transcript database would be best to use?

Is there a way to make the uniref90 database prokaryotic only?

franzosa · February 12, 2020, 3:28pm

We use a human EST database from NCBI to remove host-derived RNA when performing quality control on metatranscriptomic sequences:

https://bitbucket.org/biobakery/kneaddata/wiki/Home

You can use the infer_taxonomy script to attach the original sources of UniRef90s to their identifiers. This might help you to weed out UniRef90s in the unclassified stratum that were due to host contamination.

https://bitbucket.org/biobakery/humann2/wiki/Home#markdown-header-humann2_infer_taxonomy
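Once taxonomy has been attached, one way to weed out suspected host rows is to separate them by stratum label. This is only a sketch: it assumes the inferred taxon appears after the "|" in the first column, and the label "s__Homo_sapiens" used below is a hypothetical example, so check what your own infer_taxonomy output actually contains. It also leaves the unstratified community totals untouched, which means they would still include any host-derived abundance.

```python
def split_by_stratum(lines, host_labels):
    """Split gene-table lines into (kept, flagged) by stratum label.

    A line is flagged when the text after '|' in its first column
    contains any of the strings in host_labels.
    """
    kept, flagged = [], []
    for line in lines:
        feature = line.split("\t", 1)[0]
        _, _, stratum = feature.partition("|")
        if stratum and any(label in stratum for label in host_labels):
            flagged.append(line)
        else:
            kept.append(line)  # header, totals, and non-host strata
    return kept, flagged


# Toy rows with placeholder IDs and a hypothetical host label.
rows = [
    "# Gene Family\tsample_Abundance-RPKs",
    "UniRef90_EXAMPLE1\t12.5",
    "UniRef90_EXAMPLE1|s__Homo_sapiens\t12.5",
    "UniRef90_EXAMPLE2|g__Prevotella.s__Prevotella_copri\t3.0",
]
kept, flagged = split_by_stratum(rows, ["s__Homo_sapiens"])
print(len(kept), "kept;", len(flagged), "flagged")
```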

kying · March 2, 2020, 5:44pm

I tried to follow the link you posted for infer_taxonomy, but it leads to a page that says the link has no power. Has the page moved elsewhere on Bitbucket?

sagunmaharjann · March 2, 2020, 6:04pm

Hi Kying,
Yes, humann2 was migrated to GitHub and is now referred to as HUMAnN 2.0. You can find the same information using the link below.

Thanks,
Sagun

kying · March 3, 2020, 7:30pm

Thanks Sagun. I was able to use infer_taxonomy, and in the github it states:

“The modified gene families output files can then be reprocessed through HUMAnN 2.0 to compute pathway abundance/coverage using the inferred taxonomic stratifications.”

What function would I be using?

franzosa · March 3, 2020, 7:42pm

You can provide the resulting gene families file (or any gene families file) to HUMAnN as an --input. It will know that it is starting from genes rather than raw sequencing reads (the typical input) based on the file formatting.

kying · March 3, 2020, 10:34pm

I passed it to HUMAnN2 as --input, but when asked to provide an --input-format, TSV is not an accepted value.

franzosa · March 3, 2020, 10:46pm

If you are using --input-format you would specify “genetable”, but the format should be automatically detected from --input with the TSV extension.
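For scripted reruns, the call could be wrapped roughly as follows. This sketch uses only the options named in this thread (--input, --output, --input-format genetable). The helper names are made up for the example, and it assumes the humann2 executable is on the PATH.

```python
import subprocess


def genetable_command(gene_table, output_dir, executable="humann2"):
    """Build the argument list for rerunning HUMAnN2 from a gene families table."""
    return [
        executable,
        "--input", gene_table,
        "--output", output_dir,
        "--input-format", "genetable",  # explicit; normally auto-detected
    ]


def run_from_gene_table(gene_table, output_dir):
    """Run the command and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(genetable_command(gene_table, output_dir), check=True)


print(" ".join(genetable_command("file.tsv", "output_folder")))
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.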

kying · March 10, 2020, 8:53pm

When I run the command humann2 --input file.tsv --input-format "genetable" --output output_folder, it starts running through the humann2 workflow.

I get the following error message: CalledProcessError: Command '['python', '$/humann2/quantify/MinPath12hmp.py', '-any', '$output', '-map', '$output', '-report', '$output', '-details', '$output', '-mps', '$output']' returned non-zero exit status 1

It seems that there is an error with glpsol from MinPath when it tries to compute pathway abundance and coverage from the new infer_taxonomy gene families file.

I’m not sure where the issue lies or how to fix the error.

franzosa · March 16, 2020, 4:19pm

Have you inspected the inferred gene families file to make sure it looks OK? If you’re able to share that file it might help us to diagnose this error.
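A quick structural check along these lines can help with that inspection. This is a minimal sketch, not HUMAnN's own validation: it assumes a single-sample table (one header line starting with "#", then exactly two tab-separated fields per row, the second being numeric), and the function name is made up for the example.

```python
def check_gene_table(lines):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    lines = [line.rstrip("\n") for line in lines]
    if not lines or not lines[0].startswith("#"):
        problems.append("line 1: expected a header starting with '#'")
    # Every line after the first is treated as a data row.
    for number, line in enumerate(lines[1:], start=2):
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(
                f"line {number}: expected 2 tab-separated fields, found {len(fields)}"
            )
            continue
        try:
            float(fields[1])
        except ValueError:
            problems.append(f"line {number}: abundance {fields[1]!r} is not a number")
    return problems


# Toy input with one space-separated row and one non-numeric abundance.
sample = [
    "# Gene Family\tsample_Abundance-RPKs",
    "UniRef90_EXAMPLE1|unclassified\t12.5",
    "UniRef90_EXAMPLE2 3.0",
    "UniRef90_EXAMPLE3\tNA?",
]
for problem in check_gene_table(sample):
    print(problem)
```

Rows that were accidentally re-saved with spaces instead of tabs, or with non-numeric abundances, are the kind of thing this would surface.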

kying · March 16, 2020, 4:43pm

I have looked, and the inferred gene families look fine. I would be happy to share the inferred gene families file. What would be the best way to send it to you?

franzosa · March 16, 2020, 4:54pm

If you’re not able or willing to attach it here, you can email it to me at franzosa@hsph.harvard.edu. If it’s too large to email you can attach just the first ~1000 lines or so.

kying · March 16, 2020, 5:02pm

I have just sent it to your e-mail. Thanks.

franzosa · April 1, 2020, 3:37pm

Sorry for the long delay! Your genes file looked fine to me. I renamed it to test.tsv and ran it through HUMAnN with the following command:

humann2 --input test.tsv --output . --input-format genetable

And it produced pathway-level output files successfully. It seems like there might be something wrong with your installation?

kying · April 1, 2020, 4:28pm

I think there might be, and the error message seems to come from running glpsol in MinPath.

Is there a way to update Minpath and all its prerequisites or a way to update all the packages in Humann2?

franzosa · April 1, 2020, 4:30pm

You could have a look at this thread from the archived forum which dealt with upgrading glpk to fix minpath problems:

https://groups.google.com/d/topic/humann-users/MSVmXC7DSW0/discussion
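Before upgrading anything, it may be worth confirming which glpsol (the GLPK solver that MinPath calls) is actually on the PATH. The sketch below is one way to do that. The function name is made up for the example, and it relies on glpsol's --version option.

```python
import shutil
import subprocess


def glpsol_version(executable="glpsol"):
    """Return the first line of `glpsol --version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


version = glpsol_version()
print("glpsol not found on PATH" if version is None else version)
```

If this reports that glpsol is missing, or shows a different version from the one you expected, the environment HUMAnN2 runs in is probably the place to look.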

kying · April 1, 2020, 4:32pm

Thank you Eric. Will report back when I find out the answer

kying · April 2, 2020, 3:44am

Was able to fix the issue by reinstalling Humann2 using: conda install -c biobakery humann2

I am now able to get the same 2 files that you produced.
