The bioBakery help forum

Only E. coli detected

Hi,
I am running a WGS dataset through humann3, and in all 5 of the first 5 samples I detect only E. coli. I don’t expect a lot of bacteria in these samples since they are blood samples, but I wonder why E. coli turns up in most of them. Is something wrong with my annotations? I updated the databases following the tutorial.

Gene Family SRR3163377_Abundance-RPKs

UNMAPPED 192889548.0000000000
UniRef90_A0A329XME7 5588517.9572251709
UniRef90_A0A329XME7|g__Escherichia.s__Escherichia_coli 5588517.9572251709
UniRef90_UPI000BB85CB4 2435365.2982355496
UniRef90_UPI000BB85CB4|g__Escherichia.s__Escherichia_coli 2435365.2982355496
UniRef90_A0A1X3K551 2374320.5282771089
UniRef90_A0A1X3K551|g__Escherichia.s__Escherichia_coli 2374320.5282771089
UniRef90_A0A2I2ZPI1 1890437.5465917028
UniRef90_A0A2I2ZPI1|g__Escherichia.s__Escherichia_coli 1890437.5465917028
UniRef90_A0A2B3TG24 1570004.5673570558
UniRef90_A0A2B3TG24|g__Escherichia.s__Escherichia_coli 1570004.5673570558
UniRef90_A0A1X3K7C5 1564615.7956286967
UniRef90_A0A1X3K7C5|g__Escherichia.s__Escherichia_coli 1564615.7956286967
UniRef90_UPI000928192A 1214206.6654682336
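
For anyone reading along, here is a rough Python sketch for summarizing a table like the one above: how much of the signal is UNMAPPED, and how the stratified rows split across species. It assumes the usual tab-separated layout with a single sample column. The function name `summarize_genefamilies` is made up for illustration and is not part of HUMAnN, and the example rows are just the first two gene families from the post.

```python
import io
from collections import defaultdict


def summarize_genefamilies(handle):
    """Summarize a single-sample gene families table (tab-separated).

    Returns (unmapped, community_total, per_species), where community_total
    sums the unstratified rows (excluding UNMAPPED) and per_species sums the
    stratified rows by the label after the '|'.
    """
    unmapped = 0.0
    community_total = 0.0
    per_species = defaultdict(float)
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        feature = fields[0]
        # Skip the header row, with or without a leading '#'.
        if feature.lstrip("# ").startswith("Gene Family"):
            continue
        value = float(fields[1])
        if feature == "UNMAPPED":
            unmapped = value
        elif "|" in feature:
            species = feature.split("|", 1)[1]
            per_species[species] += value
        else:
            community_total += value
    return unmapped, community_total, dict(per_species)


# Example rows copied from the post (tabs assumed between the two columns).
example = (
    "# Gene Family\tSRR3163377_Abundance-RPKs\n"
    "UNMAPPED\t192889548.0000000000\n"
    "UniRef90_A0A329XME7\t5588517.9572251709\n"
    "UniRef90_A0A329XME7|g__Escherichia.s__Escherichia_coli\t5588517.9572251709\n"
    "UniRef90_UPI000BB85CB4\t2435365.2982355496\n"
    "UniRef90_UPI000BB85CB4|g__Escherichia.s__Escherichia_coli\t2435365.2982355496\n"
)

unmapped, total, by_species = summarize_genefamilies(io.StringIO(example))
print("unmapped fraction:", unmapped / (unmapped + total))
for species, value in sorted(by_species.items(), key=lambda kv: -kv[1]):
    print(species, value)
```

A very high unmapped fraction combined with a single species owning every stratified row is the pattern described in this thread, so it is a reasonable first thing to look at before trusting the species call.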

All of these proteins are annotated as “Uncharacterized”, some with high similarity to macaque/human proteins. It is plausible that they are coming from an assembly slightly contaminated by host sequences.

I like @fbeghini’s suggestion a lot. Have you already run host-read depletion on these samples? Another possibility with E. coli is reagent contamination, but based on the nature of the proteins hit I think the previous suggestion is more likely.

I am running the raw reads, without any depletion. Is depletion recommended?

Best,

Highly recommended - for host-associated metagenomes there is almost always host DNA present in the sample, and it can range from low (in environments like the gut) to extremely high (in environments like the skin) depending on the microbial biomass present. I would guess that blood would be in the latter group.

We offer a pipeline for general QC and depletion here:

https://huttenhower.sph.harvard.edu/kneaddata/

In addition, any post-QC measurements that correlate with the amount of host reads removed should be treated with suspicion (they may represent uncharacterized host elements).
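
One way to act on that last point is to correlate each feature's abundance with the number of host reads removed per sample and flag the features that track it closely. Below is a minimal Python sketch of that check. Everything in it is an assumption for illustration: the function names, the toy numbers, and the 0.8 cutoff are made up, and plain Pearson correlation is just one reasonable choice (a rank-based correlation may suit skewed abundance data better).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def flag_host_correlated(features, host_reads_removed, threshold=0.8):
    """Return names of features whose abundance rises with host reads removed.

    features: dict mapping feature name -> per-sample abundances, in the same
    sample order as host_reads_removed. The threshold is an arbitrary cutoff.
    """
    return sorted(
        name
        for name, values in features.items()
        if pearson(values, host_reads_removed) >= threshold
    )


# Hypothetical toy data: four samples, two features.
host_removed = [10, 20, 30, 40]
toy_features = {
    "feature_a": [1.0, 2.0, 3.0, 4.0],   # tracks host reads removed
    "feature_b": [4.0, 1.0, 3.0, 2.0],   # does not
}
print(flag_host_correlated(toy_features, host_removed))
```

With only a handful of samples a high correlation can easily arise by chance, so a flag from a check like this is a prompt to inspect the feature, not proof that it is host-derived.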