Bioinformatics skills – How to get them and not get scared

by: Samridhi Chaturvedi
If you work in Ecology and Evolution, it is important to build a skills toolbox that helps you visualize, analyze and work through your data. These skills are standard practices that you can start developing now to make your work easier and smoother. Here are some skills which can make up your toolbox:

1) Learn a language

If you are dealing with big data (ecology or evolution), you will eventually need programming skills to manipulate, visualize and analyze your data. This can be quite overwhelming if you have no prior programming experience. To start building these skills, you generally have to learn a programming language. Several languages are in use in the field at present, and each has its pros and cons. There is Perl, which is older and is great for working with regular expressions (read more about regular expressions here). There is Python, which is newer and more intuitive. Then there are C and C++, which are compiled languages with a steeper learning curve. C/C++ are used extensively for theoretical/mathematical modelling and for developing packages. These are all valuable skills, but recognize that learning and implementing these languages can take a lot of time.

I say choose one language, either Perl or Python, to start learning the ropes of programming. There are some really helpful interactive learning spaces for both of these languages.

Here are the ones I have found helpful:

  • Jupyter (http://jupyter.org/). This is a browser-based notebook environment which lets you write and run Python code interactively.
  • Learn perl (http://www.learn-perl.org/). This is my go-to online environment for learning and testing my Perl code.
  • Regex (https://regexr.com/). If you work on genomic data and use regular expressions in scripts to detect specific sample IDs, lines etc. in order to modify your data, or to count mapped and unmapped reads in SAM alignment files (just an example), this tool really helps you learn how to write regular expressions for your data. What's more, you can paste in a piece of your own text and write a regex to detect specific matches.
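To make the regex use case above concrete, here is a minimal Python sketch that pulls sample IDs out of FASTA header lines and counts the records. The header layout (">sampleNN_lane...") and the function names are made up for illustration, so adapt the pattern to however your own IDs are written.

```python
import re

# Hypothetical FASTA text; the "sampleNN_laneN" naming scheme is an assumption.
fasta_text = """>sample01_lane1 read1
ACGTACGT
>sample02_lane1 read1
TTGACCA
>control_lane2 read7
GGCATTA
"""

# Header lines start with ">"; capture IDs of the form "sample" + digits.
# re.MULTILINE makes "^" match at the start of every line, not just the text.
header_pattern = re.compile(r"^>(sample\d+)_", re.MULTILINE)


def find_sample_ids(text):
    """Return the sample IDs found in FASTA header lines, in file order."""
    return header_pattern.findall(text)


def count_records(text):
    """Count FASTA records by counting the lines that begin with '>'."""
    return len(re.findall(r"^>", text, re.MULTILINE))


print(find_sample_ids(fasta_text))  # ['sample01', 'sample02']
print(count_records(fasta_text))    # 3
```

Pasting the same headers into regexr.com is a quick way to check a pattern like this before you put it in a script.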

Other languages widely used for statistical analysis and data visualization are R (free) and MATLAB (paid). I use R extensively, so here are my pointers for learning R:

The more you practice, the more you learn. The earlier you can start learning R, the better! This will save you a lot of time in dealing with your data. There are many different online resources for getting started in R, but here is a basic one: http://tryr.codeschool.com

Even after working with R for a long time now, I still find myself Googling for specific commands and options. Here are the three most trustworthy sites which always have an answer to my questions:

These websites also have defined sections which walk you through simple R commands. For example, the sections on plots tell you how to make different kinds of plots, how to modify them and customize them to your data.

Beyond all of these resources, a simple Google search can always help you, and StackOverflow usually has some amazing solutions to problems. It was only in my second year that I realized a PhD in Evolutionary Genomics requires some kick-ass googling skills!

2) Choose your favorite script editor/text editor

While you learn programming, it is ideal to pick one script/text editor and fall in love with it! I say this because switching between various script editors can be confusing and time consuming. Script/text editors help you edit your code and keep everything organized. Here are some of my suggestions:

  • VIM/VI (http://www.openvim.com/). If you are a programming nerd and like to work in a terminal, this might be a good choice. However, recognize that Vim takes some time to learn, and it is not unusual to find yourself trapped in the editor, wondering how to exit! Having said that, once you get the hang of it, you really can do a lot with just one editor.
  • Gedit. This comes preinstalled on many Linux systems such as Ubuntu and is very easy to learn. Probably the simplest text editor to start working with and to use every day.
  • EMACS (http://emacs.sexy/). Similar to Vim in that it requires some learning, but again really powerful.
  • ATOM/SUBLIME/NOTEPAD++. Editors which are more user friendly and feel almost like MS Word.

All of these are great tools, so choose the one which works best with your work environment. I personally use Vim because I work in a terminal and usually work on a remote computer cluster.

3) Managing data – scripts, sequences

When it comes to managing your data, and especially your analyses, I find it useful to keep detailed notes on my workflow for every project I am working on. I use my lab’s google notes website to do this, but you can use alternative note-taking tools (Endnote, Google Drive). Each project page mainly consists of: a) my folder and file details, where I literally write out each folder and the details of each file in it. It is also good to keep a “readme” file in the folder to help you remember what each file in the folder is about; b) step by step notes on each analysis, with a description of the scripts I used. If specific options were used with command line tools, I describe each of these options; and c) a list of analyses I hope to do for the project in the future.

Believe me, you can forget where your data lives within a week or a month if you step away from the computer. I store all my scripts remotely on the institution's computer clusters. Another way to archive important scripts which will be reused or modified in the future is to push them to a repository on GitHub (https://github.com/). This is also a good way to make your scripts public if you think they will help a broader audience in your field.
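If writing out every folder and file by hand sounds tedious, a small script can draft the “readme” for you. The sketch below is just one possible way to do it, not a standard tool: the function names, the "README.txt" file name and the `notes` dictionary are all my own assumptions for illustration.

```python
from pathlib import Path

README_NAME = "README.txt"  # assumed file name; use whatever your lab prefers


def build_inventory(project_dir, notes=None):
    """Return readme text listing every file under project_dir.

    notes is an optional dict mapping a relative file path to a short
    description; files without a note get a TODO placeholder.
    """
    notes = notes or {}
    root = Path(project_dir)
    lines = [f"Project folder: {root.name}", ""]
    for path in sorted(root.rglob("*")):
        # Skip directories and the readme itself.
        if path.is_file() and path.name != README_NAME:
            rel = path.relative_to(root).as_posix()
            description = notes.get(rel, "TODO: describe this file")
            lines.append(f"{rel}: {description}")
    return "\n".join(lines) + "\n"


def write_readme(project_dir, notes=None):
    """Write the inventory to README.txt inside project_dir and return its path."""
    readme_path = Path(project_dir) / README_NAME
    readme_path.write_text(build_inventory(project_dir, notes))
    return readme_path
```

For example, `write_readme("my_project", {"reads.fasta": "raw reads from run 1"})` (a hypothetical folder and file) would list every file in `my_project`, with the description filled in for `reads.fasta` and a TODO next to everything else.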

4) LaTeX

I learnt about LaTeX after starting my PhD and I cannot emphasize the importance of this tool enough. It is very helpful for organizing your text (read: manuscripts), saves the time you would otherwise spend resizing images and tables with repeated clicks, and definitely gives a cleaner result. LaTeX basically gives you the power to design and manipulate your text the way you like and to stay in control of it. This can be amazing and definitely saves you a ton of time when adjusting to page limits. I highly recommend learning to use LaTeX. I even reuse most of my manuscript outlines, so I only have to make minimal adjustments for different texts.

You can basically use any of the editors above to write LaTeX documents (LaTeX is itself a markup language). There are also several desktop editors for Linux, Mac and Windows which are user friendly and help you preview the PDF built from the source TeX document. My favorite of them all, though, is the online editor Overleaf (https://www.overleaf.com/). This is like Google Docs but for LaTeX. It is online, autosaves and has many, many templates for various journals and documents (CV, thesis etc.). In addition, it is easy to collaborate on Overleaf, as many people can work on and edit the document at the same time. You can also see the PDF update in real time as you write your LaTeX code (super cool!).

5) Important resources

Beyond these tips, here are some important resources and articles which helped me learn these skills better: