StrainPhlan4: No markers were found for the clade

I am running the following code in order to extract species specific markers for Strainphlan:

extract_markers.py -c s__Haemophilus_parainfluenzae -o $METAPHLAN_4_DIR/db_markers

Here is the output:

Fri Nov  4 18:28:13 2022: Start extract markers execution
Fri Nov  4 18:28:17 2022:       Extracting markers from the Bowtie2 database...
Fri Nov  4 18:47:08 2022:       Done.
Fri Nov  4 18:47:08 2022:       Loading MetaPhlAn mpa_vJan21_CHOCOPhlAnSGB_202103 database...
Fri Nov  4 18:48:00 2022:       Done.
[e] No markers were found for the clade "s__Haemophilus_parainfluenzae".
Traceback (most recent call last):
  File "/bioware/metaphlan-4.0.2/bin/extract_markers.py", line 33, in <module>
    sys.exit(load_entry_point('MetaPhlAn==4.0.2', 'console_scripts', 'extract_markers.py')())
  File "/bioware/metaphlan-4.0.2/lib/python3.10/site-packages/metaphlan/utils/extract_markers.py", line 138, in main
    extract_markers(args.database, args.clades, args.output_dir)
  File "/bioware/metaphlan-4.0.2/lib/python3.10/site-packages/metaphlan/utils/extract_markers.py", line 122, in extract_markers
    return output_file
UnboundLocalError: local variable 'output_file' referenced before assignment

As you can see, I am using the mpa_vJan21_CHOCOPhlAnSGB_202103 database. I checked the marker_info text file and there are indeed 200 marker genes for Haemophilus parainfluenzae.

cat mpa_vJan21_CHOCOPhlAnSGB_202103_marker_info.txt | grep "s__Haemophilus_parainfluenzae" | wc -l # 200

Any have any suggestions? I don’t know what the is meant by “UnboundLocalError: local variable ‘output_file’ referenced before assignment” in the traceback or why no markers were found for the clade “s__Haemophilus_parainfluenzae”.

Hi @JJGiacomini
With the adoption of the SGB system in the MetaPhlAn / StrainPhlAn version 4, now the clade selection should be done at the SGB level (t__SGB…). This is due mainly to the reason that some species (e.g. F. praus or P. copri) span multiple SGBs.

1 Like

Thanks @aitor.blancomiguez

Please note that specifying ‘t__SGB9712’ did not work, where as ‘t__SGB9712_group’ did work.

If we look for that clade in the marker info text file, we see that the clade is specified as ‘t__SGB9712_group’

cat mpa_vJan21_CHOCOPhlAnSGB_202103_marker_info.txt | grep "t__SGB9712" | awk '{print $1, $3}'

And if we look for that clade in the list of detected clades in the metagenomes found by running strainphlan with the --print_clades_only flag, we see that again the clade is specified as ‘t__SGB9712_group’.

cat $METAPHLAN_4_DIR/db_markers/print_clades_only.tsv | grep "t__SGB9712"

However, if we look for that clade in Table S1 from the MetaPhlAn 4 paper, it is only ever specified as ‘t__SGB9712’ and does not have the ‘__group’ suffix.

Is the ‘__group’ suffix always required, or is that only for certain clades?

Thank you!

Hi @JJGiacomini
In table S1 of the preprint, we report the initial SGB clustering of all the ~700k M-HQ non-redundant genomes in the initial genomic database. However, for the MetaPhlAn profiling, we perform a further quality control based on the number of genomes assigned to uSGBs as well as the grouping of some of them due to their closeness and impossibility to find markers for them separately. For those latest cases, the _group specification is necessary, but not for all the SGBs. When using the print_clades_only option you can check which of the SGBs need the _group specification