When we use:
df_input_metadata$CD_dysbiosis = (df_input_metadata$diagnosis_modified == "CD") *
The values for all entries other than CD will be 0, which I don't think is fair. Shouldn't they be "NA" instead? Am I right?
It should be zero. From a statistical point of view, the model is Y ~ beta_0 + beta_1 * CD + beta_2 * dysbiosis + beta_3 * (CD * dysbiosis). If we set the variable CD * dysbiosis to zero for non-CD samples, beta_3 is exactly the estimate of the effect of dysbiosis interacting with CD. Setting it to NA instead only makes R treat those entries as missing values.
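To see the difference in practice, here is a minimal sketch using plain `lm()` on a small made-up data frame. The column names (`diagnosis`, `dysbiosis`, `y`) and all values are hypothetical and only illustrate the point; they are not taken from the original data. It assumes R's default `na.action` (`na.omit`), under which rows containing NA are dropped before fitting.

```r
# Hypothetical toy data: 3 CD samples and 3 non-CD samples (values are made up).
df <- data.frame(
  diagnosis = c("CD", "CD", "CD", "nonIBD", "nonIBD", "nonIBD"),
  dysbiosis = c(0.2, 0.5, 0.9, 0.1, 0.4, 0.8),
  y         = c(4.5, 6.4, 9.3, 1.2, 2.3, 3.4)
)

# 0/1 indicator for CD.
df$CD <- as.numeric(df$diagnosis == "CD")

# Interaction term coded as zero for non-CD samples.
df$CD_dysbiosis_zero <- df$CD * df$dysbiosis

# Same term, but coded as NA for non-CD samples.
df$CD_dysbiosis_na <- ifelse(df$CD == 1, df$dysbiosis, NA)

fit_zero <- lm(y ~ CD + dysbiosis + CD_dysbiosis_zero, data = df)
fit_na   <- lm(y ~ CD + dysbiosis + CD_dysbiosis_na,   data = df)

# Number of samples each fit actually used.
nobs(fit_zero)
nobs(fit_na)

# With NA coding only CD samples remain, so CD is constant and the
# interaction column duplicates dysbiosis; some coefficients cannot be estimated.
coef(fit_na)
```

With zero coding, all six samples contribute to the fit. With NA coding, the non-CD samples are dropped entirely, so the fit no longer compares CD against non-CD at all.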