Hi, I downloaded the pre-built bowtie2 index mpa_vOct22_CHOCOPhlAnSGB_202212 (on March 17th, 2023) and found that the VSG sequences (i.e., those prefixed with ‘VDB’) appear to have been included twice in the ‘fna’ file used to build the bowtie2 index. Specifically, the following commands inspect the pre-built bt2 index and show what is going on:
bowtie2-inspect -n [path_to_prebuilt_bt2_index]/mpa_vOct22_CHOCOPhlAnSGB_202212 \
> mpa_vOct22_CHOCOPhlAnSGB_202212.bt2.ref_name
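# Keep only names that occur more than once (the count is the first field of the `uniq -c` output).
# A plain grep ' 2' would also match counts such as 20 or names that contain ' 2'.
sort mpa_vOct22_CHOCOPhlAnSGB_202212.bt2.ref_name | uniq -c | awk '$1 > 1' | less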
With these commands, one should see that there are two copies of each VSG sequence (prefixed with ‘VDB’). I’m not sure whether this was done on purpose or inadvertently (the VSG sequences are already included in the SGB fasta file ‘mpa_vOct22_CHOCOPhlAnSGB_202212_SGB.fna.bz2’).
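To check whether the duplicates are limited to the VSG sequences, the name list can also be summarized rather than paged through. The following is only a rough sketch: `summarize_dups` is a helper name I made up for illustration, and it assumes the input is the one-name-per-line file written by `bowtie2-inspect -n` above.

```shell
# Hypothetical helper (not part of MetaPhlAn or bowtie2): summarize the
# duplicated reference names in a one-name-per-line file.
# Usage: summarize_dups <ref_name_file>
summarize_dups() {
    # `uniq -d` prints one copy of every name that occurs more than once
    # in the sorted input; awk then counts how many of those start with VDB.
    sort "$1" | uniq -d | awk '
        { total++; if ($0 ~ /^VDB/) vdb++ }
        END {
            printf "duplicated names: %d\n", total
            printf "of which VDB-prefixed: %d\n", vdb
        }
    '
}
```

Calling `summarize_dups mpa_vOct22_CHOCOPhlAnSGB_202212.bt2.ref_name` prints the two counts. If they are equal, only the VDB-prefixed sequences are duplicated.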
The pre-built bt2 index was downloaded from:
http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/bowtie2_indexes/mpa_vOct22_CHOCOPhlAnSGB_202212_bt2.tar