VSG sequence duplications found in MetaPhlAn4 vOct22 pre-built bowtie2 index file

Hi, I downloaded the pre-built bowtie2 index mpa_vOct22_CHOCOPhlAnSGB_202212 (on March 17th, 2023) and found that the VSG sequences (i.e., sequences prefixed with ‘VDB’) appear to have been included twice in the ‘fna’ file used to build the index. The following commands inspect the pre-built bt2 index and show the duplication:

bowtie2-inspect -n [path_to_prebuilt_bt2_index]/mpa_vOct22_CHOCOPhlAnSGB_202212 \
    > mpa_vOct22_CHOCOPhlAnSGB_202212.bt2.ref_name

# Keep only names that occur more than once (the first field of `uniq -c` output is the count)
sort mpa_vOct22_CHOCOPhlAnSGB_202212.bt2.ref_name | uniq -c | awk '$1 > 1' | less

With these commands, one can see that each VSG sequence (prefixed with ‘VDB’) is present in two copies. I’m not sure whether this was done on purpose or inadvertently, since the VSG sequences are already included in the SGB FASTA file ‘mpa_vOct22_CHOCOPhlAnSGB_202212_SGB.fna.bz2’.
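To check whether the duplication really is limited to the VSG entries, the name list can be summarised by prefix. This is only a minimal sketch: the function name `summarize_vdb_duplicates` is hypothetical, and it assumes the input is the one-name-per-line file written by `bowtie2-inspect -n` and that VSG names start with `VDB`, as described above.

```shell
# Hypothetical helper (not part of MetaPhlAn or bowtie2).
# Input: a file with one reference name per line.
# Output: how many distinct VDB names exist, how many of them occur more
# than once, and how many non-VDB names occur more than once.
summarize_vdb_duplicates() {
    sort "$1" | uniq -c | awk '
        # $1 is the count from uniq -c, $2 is the first token of the name
        { is_vdb = ($2 ~ /^VDB/) }
        is_vdb            { vdb_total++ }
        $1 > 1 && is_vdb  { vdb_dup++ }
        $1 > 1 && !is_vdb { other_dup++ }
        END {
            printf "VDB names: %d\n", vdb_total
            printf "VDB names duplicated: %d\n", vdb_dup
            printf "other names duplicated: %d\n", other_dup
        }
    '
}

# Usage, with the name file produced above:
# summarize_vdb_duplicates mpa_vOct22_CHOCOPhlAnSGB_202212.bt2.ref_name
```

If the first two numbers are equal and the third is zero, every VSG name is duplicated and nothing else is.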

The pre-built bt2 index was downloaded from:

Hi @ProsperP
Thanks for noticing this; it was not done on purpose. Since VSG profiling is not available and we only run bowtie2 profiling with the best hit, it should not affect the MetaPhlAn 4 results. We will fix the files anyway so that there is no redundancy in the database.
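For readers who want a non-redundant FASTA in the meantime, one possible approach is to keep only the first record for each header. This is a sketch under assumptions, not necessarily how the database files will be fixed: the function name `dedup_fasta_by_header` is hypothetical, and it treats two records as duplicates only when their header lines are identical, without comparing the sequences themselves.

```shell
# Hypothetical helper: print a FASTA file, keeping only the first record
# for each distinct header line. Sequence content is NOT compared.
dedup_fasta_by_header() {
    awk '
        # On a header line, keep the record only if this header is new
        /^>/ { keep = !seen[$0]++ }
        # Print header and sequence lines of records that are kept
        keep
    ' "$1"
}

# Usage (file names are placeholders):
# dedup_fasta_by_header input.fna > input.dedup.fna
```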