Results have variable name as value

Hi MaAsLin2 team,

I’m running into a similar problem as this post, but the Value column of the output is being filled in with the metadata variable instead of a number. I installed MaAsLin2 using the GitHub release as mentioned there and I’m still having the problem.

Here’s a snippet of the output table:
[image: snippet of the output table]

The variable how_born only has two options (vaginal and caesarean) and I set vaginal to be the reference. The fed variable has four options with breast_milk being the reference.

The issue also translates to the heatmap. In addition to the how_born and fed variables, it just says Site on the heatmap, but the significant results table correctly shows that as Site/Milwaukee.

Any information to fix this would be great!
Thank you!
~Samantha

Here are my version/code information

packageVersion("Maaslin2")
[1] '1.18.0'

fit_data <- Maaslin2(
  input_data = df_counts,
  input_metadata = df_meta,
  min_abundance = 10,
  min_prevalence = 0.1,
  normalization = "TSS",
  output = "out_covars_ST_minAbund10_meanPrev0.1",
  fixed_effects = c("Site","Age_Years", "Sex","how_born",
                    "fed","adi_2013_natl_rank","Subject_Type"),
  random_effects = "Family",
  reference = "Subject_Type,Unrelated_Control;how_born,vaginal;fed,breast_milk"
)

Hi @Samantha,

First off, thanks for using our tool. I’m looking at your data and trying to figure out what might be going wrong, but it’s tricky without an example of how the metadata is encoded.

In general the “value” column will show the level within the covariate that the model coefficient is associated with. So it shouldn’t ever be a number unless you have purposely encoded numbers as levels within a covariate.

That being said, I’m not sure how you are getting both “how_born” and “caesarean” in the value column if there are indeed only two levels within the how_born column. The variable name usually appears in the value column only when you are testing a continuous variable.

If you could provide an example of how the metadata is encoded (the format would be enough) that would help with trying to figure out what’s going on. Perhaps running str(df_meta) might give some hints into what’s happening at the very least.
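
Alongside str(), a quick count of blank strings per categorical column can surface hidden empty levels. This is a minimal sketch on toy data; `toy_meta` and its values are hypothetical stand-ins for df_meta, not the real file.

```r
# Toy metadata standing in for df_meta (hypothetical values).
toy_meta <- data.frame(
  how_born  = factor(c("caesarean", "vaginal", "", "vaginal")),
  fed       = factor(c("", "both", "breast_milk", "formula")),
  Age_Years = c(4.42, 12.25, 15, 4.67)
)

# Pick out the factor and character columns.
is_categorical <- vapply(
  toy_meta,
  function(x) is.factor(x) || is.character(x),
  logical(1)
)

# Count empty-string entries in each of those columns.
blank_counts <- vapply(
  toy_meta[is_categorical],
  function(x) sum(!is.na(x) & as.character(x) == ""),
  integer(1)
)
print(blank_counts)
```

Any column with a non-zero count has a blank level that is worth relabelling or converting to NA before modelling.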

Thanks,
Jacob

Hi @nearinj!

Thanks for your response and sorry for my delay. Here is a snippet of the mapping file and the output of str(df_meta).

Metadata file:

Sample_Name PatientID Timepoint_Type Site Subject_Type Subject_Type_2 Age_Years Sex adi_2013_natl_rank Family how_born weeks33 fed
580-108-BL 580-108 Baseline Milwaukee Unrelated_Control Control 4.42 Male 89 KK caesarean After
580-118-BL 580-118 Baseline Milwaukee case case 12.25 Female 33 PP vaginal After both
580-119-BL 580-119 Baseline Milwaukee case case 15 Female 33 PP vaginal After both
580-11-BL 580-11 Baseline Milwaukee case case 4.67 Female 95 D vaginal Before breast_milk

Output of str(df_meta):

str(df_meta)
'data.frame':	240 obs. of  15 variables:
 $ SampleID.1        : chr  "580-108-BL_S20" "580-118-BL_S155" "580-119-BL_S75" "580-11-BL_S1" ...
 $ Sample_Name       : chr  "580-108-BL" "580-118-BL" "580-119-BL" "580-11-BL" ...
 $ PatientID         : chr  "580-108" "580-118" "580-119" "580-11" ...
 $ Timepoint_Type    : chr  "Baseline" "Baseline" "Baseline" "Baseline" ...
 $ Site              : chr  "Milwaukee" "Milwaukee" "Milwaukee" "Milwaukee" ...
 $ Subject_Type      : Factor w/ 3 levels "Unrelated_Control",..: 1 3 3 3 3 3 3 2 3 2 ...
 $ Subject_Type_2    : chr  "Control" "case" "case" "case" ...
 $ Age_Years         : num  4.42 12.25 15 4.67 3.25 ...
 $ Sex               : Factor w/ 2 levels "Female","Male": 2 1 1 1 1 2 1 1 2 1 ...
 $ adi_2013_natl_rank: int  89 33 33 95 94 94 60 66 34 54 ...
 $ Family            : Factor w/ 149 levels "A","AAAAA","AAAAAAA",..: 66 96 96 20 103 103 109 75 27 70 ...
 $ how_born          : Factor w/ 3 levels "","caesarean",..: 2 3 3 3 2 2 3 3 3 2 ...
 $ weeks33           : chr  "After" "After" "After" "Before" ...
 $ fed               : Factor w/ 5 levels "","both","breast_milk",..: 1 2 2 3 2 2 4 4 3 4 ...

The only thing potentially weird in the str output is that blank/empty cells are being counted as a factor level in how_born and fed. It seems the software correctly excludes those samples that are missing metadata (the results file has N as 233, which is less than the 240 samples in the dataset), but could the inclusion of a missing/blank value as a factor level be the issue? If yes, should I not explicitly make those variables factors?

These are the commands I ran to load/format the metadata:

df_meta <- read.csv("mapping_file_maaslin.txt", header = TRUE, sep = "\t", row.names = 1,
                    stringsAsFactors = FALSE)
rownames(df_meta) <- gsub("-",".", rownames(df_meta), fixed = TRUE)
df_meta[1:5,1:5]

df_meta$Subject_Type <- factor(df_meta$Subject_Type, 
                            levels = c("Unrelated_Control", "Related_Control", "case"))
df_meta$Sex <- as.factor(df_meta$Sex)
df_meta$how_born <- as.factor(df_meta$how_born)
df_meta$fed <- as.factor(df_meta$fed)
df_meta$Family <- as.factor(df_meta$Family)

Thanks!
~Samantha

Hello,

It’s certainly possible that leaving blanks in as a factor level is what is causing the issue (when the level is blank, the output then defaults to the factor’s name rather than the name of the level). Could you try either giving them explicit labels or filtering them out, and see if that fixes your issue?
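
A minimal base R sketch of the likely mechanism, on hypothetical toy data: R names each dummy column by pasting the variable name and the level together, so a blank level yields a column called just "how_born". That MaAsLin2 derives its value column from these coefficient names is an assumption here, not something confirmed in this thread.

```r
# Toy data (hypothetical values) with one blank entry in how_born.
toy <- data.frame(
  how_born = c("caesarean", "vaginal", "", "vaginal", "caesarean"),
  stringsAsFactors = FALSE
)

# Blank kept as a level: its dummy column is named paste0("how_born", ""),
# which is just "how_born".
with_blank <- factor(toy$how_born, levels = c("vaginal", "caesarean", ""))
cols_blank <- colnames(
  model.matrix(~ how_born, data.frame(how_born = with_blank))
)
print(cols_blank)

# Fix: turn blanks into real missing values before creating the factor.
toy$how_born[toy$how_born == ""] <- NA
fixed <- factor(toy$how_born, levels = c("vaginal", "caesarean"))
cols_fixed <- colnames(
  model.matrix(~ how_born, data.frame(how_born = fixed))
)
print(cols_fixed)
```

Another option is to convert blanks at load time by passing na.strings = c("", "NA") to read.csv, so they never become a level in the first place.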

Jacob

Hi Jacob,

I ended up giving the blank cells a value of “NA” and that appears to have fixed the problem! The only curious thing is that “NA” isn’t a significant value in the table, which I thought would be the case since the blank cells were apparently coming back as significant? Is this weird/wrong?

Also, is replacing blank cells with “NA” an acceptable workaround with respect to the statistics/math behind the model? The missing values aren’t the same across all variables, so I don’t necessarily want to throw them out fully.

Thanks!
Samantha

Hi Samantha,

I’m going to link you to this explanation by @himel.mallick (one of the original authors).

In essence, during model fitting the models will not use any samples that have missing values in the covariates. If you want to avoid this, you could try imputing them in a sensible manner or modeling the lack of data directly (although this could be messy depending on the data, etc.).
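
As a rough sketch of both points, using hypothetical toy data: complete.cases shows how many samples would survive when incomplete rows are dropped, and an explicit level is one way to model the lack of data directly. The column name `fed_explicit` and the label "not_recorded" are made up for illustration, and whether such a level is statistically sensible depends on why the data are missing.

```r
# Toy metadata (hypothetical values) with NAs in different covariates.
toy <- data.frame(
  how_born  = c("caesarean", "vaginal", NA, "vaginal", "caesarean", "vaginal"),
  fed       = c(NA, "both", "breast_milk", "formula", "both", "breast_milk"),
  Age_Years = c(4.42, 12.25, 15, 4.67, 3.25, 8),
  stringsAsFactors = FALSE
)

# Samples with every covariate present; only these remain when
# rows with missing covariates are dropped.
n_complete <- sum(complete.cases(toy[, c("how_born", "fed", "Age_Years")]))
print(n_complete)

# Alternative: keep the samples by giving missing entries an explicit level.
toy$fed_explicit <- toy$fed
toy$fed_explicit[is.na(toy$fed_explicit)] <- "not_recorded"
toy$fed_explicit <- relevel(factor(toy$fed_explicit), ref = "breast_milk")
print(levels(toy$fed_explicit))
```

Because a sample is dropped if any one covariate is missing, the number of complete samples can be noticeably smaller than the count of non-missing values in any single column.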

thanks,
Jacob

Hi Jacob,

Thank you so much for the link, that was very helpful!

I think everything is running properly now, so thank you again for the troubleshooting help as well :)

Best,
Samantha
