Gzip fastq as input

Hi, I am using MetaPhlAn version 3.0.14. The help does not mention the gzip file format. May I ask whether gzipped FASTQ files can be used directly as input for MetaPhlAn? If so, should --input_type be set to “fastq”? For example, part of the command is as below:
metaphlan xxx_1.fastq.gz,xxx_2.fastq.gz --input_type fastq

When the run completes successfully, some samples produce “WARNING: The metagenome profile contains clades that represent multiple species merged into a single representant.
An additional column listing the merged species is added to the MetaPhlAn output.” May I ask whether this is normal and can be ignored, or whether I should do something about it?


Hi @farmer2020
Yes, it is possible to run MetaPhlAn 3.0.14 with FASTQ files compressed with gzip, and the --input_type should still be fastq.
The warning you are describing is normal. MetaPhlAn 3 includes markers describing species groups (for 1328 species, as they were unlikely to be distinguishable in metagenomic samples; see “Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3”, eLife). When some of these species groups are detected, MetaPhlAn reports that warning so that the user is aware of it.

Hi Aitor, many thanks for your fast reply.

Hi Aitor,
I am moving from MetaPhlAn version 3.0.14 to version 4.0.2. May I further ask whether the same warning is still produced? If so, can the warning still be ignored without any influence on the MetaPhlAn output? In addition, can the compressed forward and reverse fastq.gz files be used as input files in version 4.0.2?

Hi @farmer2020
Yes, in MetaPhlAn 4 the input files are managed in the same way as in version 3.

Hi Aitor,
Thanks for your reply. So the warning, which is the same as in MetaPhlAn 3, can still be ignored and has no influence on the taxonomic results, right?
Thanks again.

Hi @farmer2020
Yes, exactly!

Hi Aitor, thank you very much for your confirmation.

Hi Aitor,
I noticed that in the output profile from MetaPhlAn version 4.0.2, the number of reads processed is sometimes lower than the total number of paired reads. For example, I get “#4807601 reads processed” from FASTQ files with 3104148 reads in each of the forward and reverse files. May I ask whether this is normal, and what the reason is?
The headers in the forward and reverse FASTQ files are exactly the same. Is that fine for MetaPhlAn version 4.0.2?
The command used is:
metaphlan fastq_forward,fastq_reverse --input_type fastq --bowtie2db bowtie2db/ --index mpa_vJan21_CHOCOPhlAnSGB_202103 --bowtie2out bowtie2.bz2 --unclassified_estimation --nproc 6 --output_file profile.txt

Hi @farmer2020
Yes, this is normal. Before mapping, MetaPhlAn performs a quality control step on the reads passed as input, removing short reads (<70 bp in length).

MetaPhlAn does not take pairing information into account when using paired-end reads, and will count each read of the pair as a single-end read.
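
To check how many reads would survive such a length filter, here is a minimal Python sketch. It is not MetaPhlAn's own code: the function names are made up for illustration, it assumes standard 4-line FASTQ records, and it takes the 70 bp threshold from the reply above. Each mate is counted as an independent read, as described.

```python
import gzip


def count_fastq_reads(path, min_len=70):
    """Return (total_reads, reads_kept) for one FASTQ file, gzipped or plain.

    Assumes 4-line FASTQ records; reads shorter than min_len are not kept.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    total = kept = 0
    with opener(path, "rt") as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 1:  # second line of each record is the sequence
                total += 1
                if len(line.strip()) >= min_len:
                    kept += 1
    return total, kept


def count_pair(forward, reverse, min_len=70):
    """Sum the counts over both mates, treating each mate as a single read."""
    counts = [count_fastq_reads(p, min_len) for p in (forward, reverse)]
    return sum(t for t, _ in counts), sum(k for _, k in counts)
```

Calling count_pair("xxx_1.fastq.gz", "xxx_2.fastq.gz") returns the total number of reads in both files and the number that are at least 70 bp long. The second value is the one to compare with the “reads processed” line, although it may not match exactly if MetaPhlAn applies further checks.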

Hi Aitor, thanks for your replies to both questions, which make sense to me. So for the second question, whether the headers of the forward and reverse FASTQ files are the same or not will make no difference to the output of MetaPhlAn, am I right?

It will not, as internally MetaPhlAn appends an auto-incrementing number to the end of each read ID.

So it will not produce any differences?

Whether they are the same or not will not produce any difference in the output; they will still be counted as independent reads.
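
As an illustration of why identical headers are harmless, the idea of appending a running number to each read ID can be sketched as below. This is only a sketch of the idea, not MetaPhlAn's actual implementation: the function name, the "__" separator and the starting index are all assumptions chosen for the example.

```python
def make_unique_ids(read_ids, sep="__"):
    """Append a running counter to every read ID so duplicates become distinct.

    The separator and the 1-based counter are hypothetical choices.
    """
    return [f"{rid}{sep}{i}" for i, rid in enumerate(read_ids, start=1)]


# The same header appearing in the forward and the reverse file
ids = make_unique_ids(["read_A", "read_B", "read_A", "read_B"])
print(ids)
```

After this step every ID is distinct, so the two mates of a pair are simply treated as two separate reads.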

That’s great to learn more about MetaPhlAn. Thank you very much.

Hi Aitor,

So, to continue on this (somewhat)... If I want to know how many reads actually mapped to the database (with the idea of multiplying the relative abundance by this number to get absolute counts), which number should I look at? I’m guessing not the “reads processed”, but the ‘#estimated_reads_mapped_to_known_clades’?

Hi @MalbertR
You can extract that information from either the SAM output file or the bowtie2out file.
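
For the bowtie2out route, a minimal Python sketch could look like the following. It rests on assumptions about the file layout that you should check against your own file: one tab-separated line per mapped read with the read ID in the first column, bz2 compression when the file name ends in .bz2, and any lines starting with "#" being metadata rather than reads. The function name is made up for the example.

```python
import bz2


def count_mapped_reads(bowtie2out_path):
    """Count distinct read IDs in a bowtie2out-style file (plain or .bz2).

    Assumed layout: "<read_id>\t<marker>" per line; blank lines and lines
    starting with "#" are skipped.
    """
    opener = bz2.open if str(bowtie2out_path).endswith(".bz2") else open
    read_ids = set()
    with opener(bowtie2out_path, "rt") as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue
            read_ids.add(line.split("\t", 1)[0])
    return len(read_ids)
```

With the command shown earlier in this thread, this would be called as count_mapped_reads("bowtie2.bz2"). Note that this counts reads with an alignment in the file, which is not necessarily the same as the number of reads finally assigned to clades in the profile.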