As a bioinformatician who works mostly inside Emacs, I strongly dislike having to open a web browser, type in a gene identifier, and search for information on that gene just to view a simple summary of what it does. I’d rather be able to see that right inside my text editor, fast. Since Emacs is so customizable, I decided to see if I could find a solution by hacking together some elisp, bash scripting and R.
My first solution
My first attempt used the GeneCards website. The function below gets the thing at point and plugs it into the URL to open the page with the appropriate gene information.
So if my point was on a gene (say NAT2), I could hit C-c g and this page would open in my default web browser. This saved me some time, but it still opened in a browser and took a few seconds to load. That is not usually a big deal, but I often work on a remote server, where opening a GUI program like Firefox over ssh with X11 forwarding is even slower (~ 10 seconds).
A more elegant solution
I was browsing the gene information on NIH’s NCBI website recently and discovered that the Display Settings in the top left corner include a plain text output. When I see “plain text”, my first thought is Emacs, so I started looking for a way to get those files inside of Emacs.
My first idea was to use the browse-url-emacs function to browse the plain text site inside of Emacs. I discovered (and I’m not sure why) that this was even slower than opening a browser.
Not quite discouraged enough to stop, I tried my last idea: download all the files and store local copies to view inside of Emacs.
Creating local database of HTML files
So I did that. I saved the HTML files for all Homo sapiens genes from the website in ~/.emacs.d/spa-find-gene-info/database/.
This is obviously a very naive way to do it (and took about 5 hours) but I figured that I wouldn’t need to update the database too often. Now I basically had a local version of all the gene information from their website and I could simply open the files on my machine if they existed and (otherwise) still use the browser solution.
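A naive download loop along these lines could be sketched in bash as follows. This is only an illustration, not the script I actually ran: the URL pattern in gene_url, the function names and the input file of Entrez IDs (one per line) are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch: fetch one plain-text report per Entrez ID into the local database.
# ASSUMPTIONS (not from the original setup): the URL pattern in gene_url,
# the function names, and an input file listing one Entrez ID per line.

DB_DIR="${DB_DIR:-$HOME/.emacs.d/spa-find-gene-info/database}"

# Build the (assumed) plain-text report URL for one Entrez ID.
gene_url() {
    printf 'https://www.ncbi.nlm.nih.gov/gene/%s?report=full_report&format=text\n' "$1"
}

# Download every ID listed in the file given as $1 into DB_DIR.
fetch_genes() {
    local id_file="$1" id
    mkdir -p "$DB_DIR"
    while IFS= read -r id; do
        # Ignore blank lines and anything that is not purely numeric.
        case "$id" in
            ''|*[!0-9]*) continue ;;
        esac
        # Skip files that already exist and are non-empty, so the loop can be resumed.
        [ -s "$DB_DIR/$id.html" ] && continue
        curl -fsS -o "$DB_DIR/$id.html" "$(gene_url "$id")" || echo "failed: $id" >&2
        sleep 1   # pause between requests to go easy on the server
    done < "$id_file"
}

# Run only when an ID file is passed on the command line.
if [ "$#" -gt 0 ]; then
    fetch_genes "$1"
fi
```

Saved as, say, fetch-genes.sh (a hypothetical name), it would be run as `bash fetch-genes.sh entrez_ids.txt`. Because existing non-empty files are skipped, an interrupted run can simply be restarted.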
Here is an example of the first few lines of the HTML file for the gene NAT2 (with Entrez ID 10). You can see the online version here.
Modifying my elisp function to work
Then I modified my Emacs Lisp function to open this local HTML file if it exists, and otherwise still open a browser.
So, in the end, I can hit three keys (C-c g) and practically instantaneously see, inside of Emacs, not only the gene summary I was after but also the additional details NCBI provides for the gene (Entrez ID or gene symbol) at point. Pretty cool!
So…do I need to know the Entrez Gene ID?
Well, this is great if you see Entrez Gene IDs, but what about other gene identifiers like symbols? Since the gene symbol is written in the HTML file (see above for NAT2), I wrote a simple bash script to create symbolic links from files like NAT2.html to 10.html.
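A minimal sketch of such a linking script is below. It assumes each file contains a line of the form "Official Symbol: NAT2 ..."; that pattern and the function name link_gene_symbols are illustrative guesses and would need adjusting to the actual file contents.

```shell
#!/usr/bin/env bash
# Sketch: create SYMBOL.html -> ENTREZID.html symlinks inside a database directory.
# ASSUMPTION: each file has a line like "Official Symbol: NAT2 and Name: ...";
# the pattern and the function name are illustrative, not the original script.

link_gene_symbols() {
    local db_dir="$1" file id symbol target
    for file in "$db_dir"/*.html; do
        # Skip existing symlinks and the unexpanded glob of an empty directory.
        if [ ! -f "$file" ] || [ -L "$file" ]; then
            continue
        fi
        id="$(basename "$file" .html)"
        # Only files named by a purely numeric Entrez ID are link targets.
        case "$id" in
            ''|*[!0-9]*) continue ;;
        esac
        # Take the first token after "Official Symbol:" on its first occurrence.
        symbol="$(sed -n 's/^Official Symbol: *\([^ ]*\).*/\1/p' "$file" | head -n 1)"
        [ -n "$symbol" ] || continue
        target="$db_dir/$symbol.html"
        # Never overwrite a regular file; only create or refresh symlinks.
        if [ -e "$target" ] && [ ! -L "$target" ]; then
            continue
        fi
        # Relative link, so the database directory can be moved as a whole.
        ln -sf "$id.html" "$target"
    done
}

# Run only when a database directory is passed on the command line.
if [ "$#" -gt 0 ]; then
    link_gene_symbols "$1"
fi
```

Called with ~/.emacs.d/spa-find-gene-info/database as its argument, this would leave NAT2.html pointing at 10.html, so the same lookup works for either identifier.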
This could be extended to other identifiers as well (one-to-many mappings might be tricky, so I only used the official gene symbol).
I called this an “elegant solution” but really what I should say is this is the “most efficient way I’ve found so far.” I would love to hear feedback on alternative strategies from others!