Phylum only abundance table

Hi,
I am new to this. I used the following command to get the specie only abundance table as mentioned in the tutorial.

grep -E "s__|clade" merged_abundance_table.txt | sed 's/^.*s__//g' | cut -f1,3-8 | sed -e 's/clade_name/body_site/g' > merged_abundance_table_species.txt

But I am also interested in getting the phylum only abundance table. Can someone with expertise in shell help me to modify the above script in order to get phylum only abundance table?

many thanks in advance.

The way I do this is to first grep the desired level, and then grep -v the next lowest level. The -v does an “inverted” match. So for example, if you want just phylum, you’d do

$ grep "p__|clade" merged_abundance_table.txt | grep -v "c__" | # other stuff...

The grep -E "p__|clade" matches every line that contains the phylum (and the header, which is what |clade is for), and then grep -v "c__" matches every line that doesn’t have the class.

I am so sorry… can you please also write the remaining part of the code? like what #other stuff… i should write? I am new to shell. I tried the first part but it doesn’t work for me. I get an empty file…

I was thinking all the stuff you had before (from sed on), though if you’re getting an empty file, there’s some other problem, I think.

With these sorts of built-up shell commands, it’s typically a good idea to do them in pieces. The | is passing the output from the previous command into the following command. So for example, if you start with cat merged_abundance_table.txt, you’ll see the whole file printed to the screen (if the output is really long, you can do something like | head -20 at the end of each of these commands, which will only take the first 20 lines). Then add | grep "p__|clade"*, and the output should change (in this case, you will only see lines that have p__ or clade). Then add on | grep -v "c__", and you should see any line with c__ go away. Just keep tacking stuff on…


*In writing this, I think I found the problem, which is that I didn’t include the -E flag in my example above, which might cause it to look for lines matching literally "p__|clade" rather than p__ OR clade. The -E flag gives grep more regex symbols to work with. Anyway, if you follow my procedure above, you’d see no output when you add that piece on, which should be a clue that that’s where the problem is.

$ grep -E "p__|clade" merged_abundance_table.txt | grep -v "c__" | sed 's/^.*s__//g' | cut -f1,3-8 | sed -e 's/clade_name/body_site/g' > merged_abundance_table_species.txt

Should do it? But regardless, I recommend you build it up piece by piece and see how each pipe changes things. Also - https://explainshell.com/ is a godsend. Plus, keep in mind that even people that use the shell regularly are constantly looking up syntax. You’ll get there, just keep practicing!

1 Like

Great. This was very helpful. Thank you sooo much kescobo for your help!

Cheers!

1 Like