Random Conversations Between Me and Dot B. Com: 2007

Monday, December 17, 2007

Rrrrrrrr

After doing a fresh install of OS X Leopard on my Mac and having to re-familiarize myself with some of my Biostatistics/Microarray education background, it was time to do a fresh install of R on my machine.

R is a freely available statistical programming language modeled after the S programming language. Matthew Keller, a founder of the Richmond R group, describes some of R's advantages over other statistical packages as the following:

It's fast and free

State of the art: Researchers provide their methods as R packages

It's second only to MATLAB for graphics

Some points that I would like to add to the list are:

Its graphs and tables can be used with the ever popular LaTeX

Its programming/scripting-like interface makes it rather extensible

It's a great starting point for people interested in Biostatistics and Microarrays

The BioConductor project is primarily based on R

Did I mention that it's free?

R is a rather powerful tool and I consider it to be a must-have for any Biostatistician's or Bioinformatician's toolset. So check it out, and let me know what you think.

Other Important R Links:

CRAN, the Comprehensive R Archive Network (similar to Perl's CPAN)

The R Help Mailing List, a good place for getting answers to R-related questions

R Seek, a useful tool as Google isn't designed to just search for "R"

Tuesday, December 11, 2007

MIA... the trilogy

This is my third MIA post and I'd like to think that taking time to finish my Master of Bioinformatics degree, get a job, move, and get engaged would be suitable excuses for my lack of posting. But who knows, maybe I'm just not cut out to be a blogger. But as with all good/bad trilogies, things must come to an end so expect this to be my last MIA post.

Currently on the queue of posts that I have begun to write about are the following:

~~The R Programming Language~~

Links for free Bioinformatics and Biology classes

Consed's .ace file format

NCBI's OMIM

If there is anything else that anyone is interested in, leave the topic ideas in the responses.

Thursday, June 21, 2007

BIND and SOAP

I'm in the process of developing an application that utilizes information from protein-protein interaction databases. One specific database I am working with is BIND, the Biomolecular Interaction Network Database. As my application will be looking at a large number of genes I had to figure out how to write an application that interacted with the database.

The solution I found was BIND SOAP, an API designed to help developers interface with BIND using either C, Perl, Java, or VB .NET. SOAP, or the Simple Object Access Protocol, provides a basic messaging framework which allows for communication between applications across the internet.

After some research I decided that Perl was the best way to go as there was already a Perl module available for use with SOAP, conveniently named SOAP::Lite. If you have the CPAN module installed on your system, the best way to get the SOAP::Lite module on Linux systems is to start the CPAN shell:

[root]# perl -MCPAN -e shell

Once the shell is started, run:

cpan> install SOAP::Lite

If you are having problems installing the module or installing from another system you can go here [soaplite.com] for additional instructions.
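To give a feel for what the "basic messaging framework" looks like on the wire, here is a minimal Python sketch that builds a SOAP 1.1 request envelope. It is only an illustration and not the SOAP::Lite code used for this project: the operation name `getInteractions`, the parameter `geneId`, and the service namespace are hypothetical placeholders and are not taken from the BIND SOAP API.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace.
SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace, made up for this sketch (not BIND's).
SERVICE_NS = "urn:example:interactions"


def build_soap_request(operation, params):
    """Wrap one remote call and its parameters in a SOAP 1.1 envelope."""
    ET.register_namespace("soapenv", SOAP_ENV_NS)
    envelope = ET.Element(f"{{{SOAP_ENV_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV_NS}}}Body")
    # The operation element names the remote procedure being called.
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical operation and parameter names, for illustration only.
request_xml = build_soap_request("getInteractions", {"geneId": "12345"})
print(request_xml)
```

A SOAP client library such as SOAP::Lite builds an envelope like this, posts it over HTTP, and parses the XML response, so you normally never write the XML by hand.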

Sunday, June 17, 2007

LAMP and Wordpress

As I mentioned in a previous post, I'm currently working on a new look for the site. This entire week I was constantly changing the design. Realizing that I may be causing some inconvenience to my readers, I decided to go "old school" and set up a local LAMP (Linux, Apache, MySQL, Perl or PHP) server that ran Wordpress.

I just happened to have an old Dell Dimension L400c lying around so I went ahead and did a FC6 (Fedora Core 6) LAMP install as covered at howtoforge.com. The install was pretty straightforward and I ended up skipping steps 8 (Chrooted DNS Server) and 10 (Postfixas) as I felt I had no need for those options since the machine is just for testing purposes.

The WordPress Install was even simpler taking only about four commands and the modification of only one file. More information can be found at linuxjournal.com and wordpress.org. Now I have a reliable system to test the design of my page without wasting too much bandwidth. Sorry if I caused an inconvenience to anyone. If you have questions about the install feel free to contact me.

Tuesday, June 12, 2007

2007 Systems Biology Summit

Last week, I attended the Systems Biology Summit in Richmond, Virginia. The opening session in the Summit was entitled "the Systems Biology Challenge in 21st Century Biomedical Research". It consisted of speakers from the Research Institute, the National Institutes of Health, Academia, and the Pharmaceutical industry providing their various viewpoints of Systems Biology.

Dr. Leroy Hood began the session with his keynote lecture on systems approaches in Biology and Medicine. The following are Dr. Hood's thoughts on where we are in systems biology:

The information we are finding represents the "parts" of the system, when we move into the realm of establishing functionality of the system we are determining the blueprints for these parts.

A later speaker, Dr. Keith Elliston of Genstruct, expanded the discussion with his research on biological causal networks and their use for diagnostic reasoning or predictive inference. The following was his entertaining quote on networks and pathways that was repeated throughout the weekend:

System biology is not pathways but networks...stupid. A pathway is a specific path through the network.

Another entertaining quote was from Dr. Burt Adelman, representing Industry's perspective and their thoughts on the transition of animal research to human treatments.

We treat humans. They're very complex not inbred... mostly. We have to find what aspects of human biology are animal research reproducing.

The session ended with a panel discussion on systems biology. The most intriguing of the topics covered was the current problems in systems biology:

The peer review system for grant applications in the United States.

Researchers' fear of failure.

Lack of effective collaborations.

The lack of tools for non-elite scientists.

The need for better leadership in the scientific community.

Overall I thought the summit was a great experience and I would go again if another opportunity arose. I got to network with different people and learned some new things that I will discuss on this blog in the next couple of weeks. My biggest gripe with the summit was that it was 90% presentations and 10% workshop. As a programmer coming into biology I know I should not expect anything like the WWDC, but if we are to build better collaborations and novel tools I just think the summit could have spent more time with people working together rather than gathering in a room and listening to one person talk. It would be interesting to put something like that together one day, what does everyone think?

Monday, June 11, 2007

Genetic Discrimination

As covered by Nature magazine a couple weeks ago, the full genome of James D Watson, one of the fathers of DNA, has been sequenced. The article also describes how Watson's DNA sequence revealed his predisposition to cancer. This revelation brings on several important questions. Will people come forward to see what diseases they are prone to have? More importantly how can future employers, health providers or insurance companies use this information to genetically discriminate against you?

As covered by Slashdot a month ago, there is a bill addressing genetic discrimination that is currently awaiting the approval of one senator before it can be passed. This bill would make it illegal for US citizens to be denied jobs or insurance because of a disease predisposition implied by their genetic code.

I just hope this bill passes soon as it is essential to the use of novel Bioinformatic practices in the medical field.

Sunday, June 10, 2007

MIA Once Again

Sorry to be MIA once again. I got busy with wrapping up my course work, fending off hackers from this site and attending last week's Systems Biology Summit (more on this in a follow up post). But besides that, I was privileged enough to be asked to contribute some thoughts on working in Bioinformatics with regards to Academia at Bioinformatics Zen for their 11th Bio:blogs. You should check out the article; there is an assortment of information provided by some of the more prominent bloggers in the Bioinformatics community.

Wednesday, April 11, 2007

Bridging the Gap: Alcohol Deprivation Effect

The biologists in my lab study the effects of ethanol (alcohol) on the brain. To do this they have to come up with animal based experiments to model various alcohol based conditions. One of these models is known as the alcohol deprivation effect (ADE). What it models is the possible increase in alcohol craving or consumption after a period of withdrawal (deprivation).

One such experiment may give mice voluntary access to ethanol. Then after a measured amount of time (e.g. two weeks) the ethanol is taken away (e.g. for another two weeks); this is known as the deprivation period. Once the deprivation period is over the mouse is reintroduced to choice bottle drinking of ethanol versus a plain solution. This gives the researcher a variety of things to study (e.g. average amount of ethanol consumed, ratio of ethanol versus plain solution consumed, etc).

Monday, March 26, 2007

St. Baldrick's Day

This past weekend I said goodbye to all my hair to support kids with Cancer. I was fortunate enough to exceed my goal of $500 (donations are still welcome and appreciated)! By bundling my hair I also got to donate to Locks of Love.

As someone who is in academics and benefits from funds raised by this type of event, I felt it was necessary to participate. Not only did I have the opportunity to give back but it also helped me remember why we do the research we do; why we participate in science. I came back to school so I could pursue a career that benefited other people's well being. Prior to that I was just twiddling my fingers away working for a credit card company. I enjoyed this event very much and encourage anyone else who has the opportunity to participate in various things outside of the lab to remember why we do what we do.

For everyone who donated:

THANK YOU!

Tuesday, March 20, 2007

Bridging the Gap: Stem Cells

My "bridging the gap" posts were intended to help teach other computer scientists biology jargon. If you've been here for a while you know I haven't really followed through (only two posts) with this concept, but starting today I'll give it another run.

Today I attended a seminar and found myself looking up various terms related to stem cell research. I'm sure you have all heard the buzz about stem cell research over the past couple of years. But I'm sure you didn't know that there are two kinds of stem cells. More specifically, depending on the mature cell types it can differentiate into, a stem cell is classified as either pluripotent or multipotent [Stem Cell Research Foundation].

As I have very little knowledge in this field, does anyone care to share what they know about stem cells and the research?

Looking Up Genes

I attended a seminar today where the speaker mentioned a gene whose name and function I had never heard of before. I used to use Wikipedia to look up a gene but that source is frowned upon by the scientific community due to its unreliability. Now I use NCBI's Online Mendelian Inheritance in Man (OMIM), which gives a nice condensed summary of common knowledge on a gene.

Another one of the graduate students in my lab suggested iHOP, which not only has a cool looking monkey on the front page but also presents a page describing a gene that is loaded with links to various abstracts contained within PubMed.

What tools does everyone else out there use?

Friday, March 16, 2007

Vertical text selection

Want to select that list of genes without the pain of closing and reopening the file in Excel? Diana Higgins at Windows Fanatics reminds us that this can be done in most text editors (it can also be done in Word). Simply hold down the Alt key (or the option key on Macs) when making a selection.

Unfortunately this little trick doesn't work in Microsoft's Notepad and I wasn't able to find such a key in Gnome for Linux either. Does anyone else out there know?

UPDATE: Here are some tips from the comments

In editplus (at least in windows) the same can be done with Alt+C.
In Vim you can use the Ctrl-v combination and then HJKL (or arrow) keys to adjust your selection.
Alt-mouse drag will select columns in TextPad
Option Drag works with TextWrangler on Mac too. Cool Tip. I was looking for similar.

Go Rams!

Sorry that this wanders off the purpose of the blog but I feel it's necessary.

Congratulations Rams!

Thursday, March 15, 2007

Eye Color

I found an interesting post today explaining the genetic properties of eye color. The article describes how eye color is a polygenic trait (i.e. more than one gene is involved) and how, of the genes involved, one particular gene, OCA2, has more of an influence than the rest.

It's a brief article but I thought it would be useful as it has some jargon that is commonly used in biology and bioinformatics.

Key Terms: single nucleotide polymorphisms (SNPs), gene expression [Wikipedia]

Wednesday, March 14, 2007

"The Iguana"

This actually happened about three months ago but I never got time to share.

Since I came into my lab two years ago I have become involved in an age long battle between biologists and computer scientists (when I say age long I really mean never existing). The biologists and technicians called me "Neo", asked me regularly if I was hacking into the FBI's website, and complained when they couldn't use my computer (I'm running Fedora Core 6). I called their bench work fancy biological hand waving.

An exceptionally good prank of theirs was wrapping my computer, mouse, monitor and keyboard all unplugged in bubble wrap. I actually thought we were moving labs. Pure comedy. I got the tech back by remotely logging into his Mac and having it sing a little tune for him. It seemed like this kind of fun would last forever but alas all good things come to an end and our lab technician was offered a position somewhere else. On his last day as a truce at work I allowed him to go ahead and fool around with my workstation one last time. The following pictures are his depiction of his thoughts on Linux.

Photo Sharing and Video Hosting at Photobucket

Hello Me!

It seems like my public ramblings are not a waste. Not only am I talking to myself and former classmates but people from other blogs as well. Even Google's first search result for "bioinformatics blogs", nodalpoint, references this site. Although I don't like the heading "perl hacking", I shouldn't complain about being acknowledged by my own peers. Plus it's my fault that I don't post more often.

Who knows maybe this recognition will inspire me to post more when Spring Break ends. You'll just have to tune in and see =).

Tuesday, March 13, 2007

Network Theory

This network made digg about a month ago. I thought it was interesting because I actually saw it a year ago. You'll find that network/graph theory is a big topic in Bioinformatics.

I personally find the use of networks in Bioinformatics to be a bit of a double-edged sword. Their importance has emerged as these networks are used to present a systematic overview of various biological processes (e.g. all the gene interactions at a given time in the cell), which is one of the overall goals of Systems Biology, as I briefly touched on in my previous post.

But at the same time their novelty has also led to their misuse in the biological community. You may find biologists who want to include these networks in their study but have no knowledge of how they are constructed. One of the Ph.D. students in my lab calls this use of networks fancy bioinformatic "hand waving", which is what it is some of the time. The point is, these theoretical networks should be taken for what they are: a tool that facilitates further interpretation, not a concrete view of how a system works.

Systems Biology

Here is an interesting quote from my school's site that a professor recently pointed out in class:

... systems are more than a sum of the parts, and that nonlinear interactions of components and processes result in emergent properties that can not be predicted from knowledge of the individual components and their behavioral processes.

In layman's terms, the study of entire biological systems (e.g. looking at all the genes of a cell at once) provides more insight into properties of the system that could not be seen or identified with the old biological dogma of single gene studies.

This is what Bioinformatics has done to the study of Biology. It has shifted the field from a micro exploration of individual gene function to the macro examination of the system as a whole by observing all the parts simultaneously.

Beginner's Guide to Bioinformatics

As a computer scientist coming into Bioinformatics I was faced with the heavy task of catching up on my Biology and Chemistry (I was a Physics minor in undergrad but that wasn't applicable to my Bioinformatics catch up). This meant two semesters of General Chemistry, a semester of Organic Chemistry and a semester of Cell Biology. Though all this course work was very educational and useful for my degree, I don't think it's all that necessary for someone who may be interested in fooling around with Bioinformatics problems on the side.

Here is a very general overview of cell biology for Non-Biologists wanting to get involved in Bioinformatics:

Proteins are the essential part of all living organisms. Proteins have a variety of functions and are involved in every process within our cells. [Wikipedia]

DNA is the blueprint for proteins. Segments of DNA (genes) are transcribed into RNA, which is then translated into proteins. For more detail look into the Transcription and Translation of DNA to proteins.

Cell function is determined by which proteins are expressed and their quantity. This means that some kind of gene regulation must take place. Also, one can argue that if you know which genes are expressed in a cell, and in what amounts, you can possibly infer that cell's function.

For a more specific overview, the following are some of the essential key points for biology and bioinformatics:

Genome - all the DNA in a cell.

DNA - a string of nucleotides (e.g. GATCACTT…ATCG).

Gene - a substring of DNA that encodes proteins.

Proteins - a string of amino acids (e.g. ACDEF…RSTY).

Gene expression is regulated by the product of other genes. It is a network of interactions.

Post-translational modifications are an important regulation mechanism for gene expression.

You may notice that the above deals quite a bit with string manipulation, hence the strong emphasis for Perl experience in Bioinformatic job postings. You will find that string manipulation is not the only driving force for computer science in Bioinformatics. I will try to explain other topics in subsequent posts.
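As a small illustration of the string manipulation point, here is a minimal Python sketch (Perl would work just as well) that treats a DNA sequence as a plain string. The toy sequence and the helper names `reverse_complement` and `gc_content` are made up for this example and are not from any particular library.

```python
# Map each base to its complement: A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read in the conventional direction."""
    return dna.translate(COMPLEMENT)[::-1]


def gc_content(dna):
    """Fraction of bases in the sequence that are G or C."""
    return (dna.count("G") + dna.count("C")) / len(dna)


genome = "GATCACTTATCG"  # toy sequence invented for this sketch
gene = "CACTT"           # toy "gene": just a substring of the genome

print(genome.find(gene))           # 3, the 0-based position of the gene
print(reverse_complement(genome))  # CGATAAGTGATC
print(gc_content(genome))          # 5 of 12 bases are G or C
```

Finding a gene in a genome is a substring search, and many everyday tasks (complementing a strand, counting bases) are one-liners once the sequence is a string.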

As for Biologists wanting to do Bioinformatics I can not provide the best advice since I didn't come into Bioinformatics from that direction but I would imagine that you may want to look into the following:

Learn how to program. You want to know how to use a scripting language (preferably Perl) for smaller everyday tasks and a compiled language such as C, C++, or Java (the latter two are object-oriented) for larger projects.

Learn how to use databases. Bioinformatics deals with very large datasets. At some point you are going to have to deal with either retrieving information from databases or building your very own database, so you might as well begin playing with them now.

Install and run a Unix/Linux OS (Optional). This might be my personal bias but I believe if you are going to be working in Bioinformatics and its large data sets eventually you will find yourself either maintaining a server or SSHing into one so you might as well become familiar with that type of environment. At the very least XP users should install Cygwin.

Useful Links:

Bioinformatics intro offered at my university.
Graduate level of the Bioinformatics intro course.

Library of videos that cover a wide range of biological topics (theoretical and practical).

RT-PCR a common molecular biology method practiced in the lab.

Virtual lab that lets non-biologists actually work through basic molecular biology techniques.

Finally I must say that I am far from an expert, so any constructive suggestions to help clarify or expand the above are welcome and appreciated.

Monday, March 12, 2007

MIA...

Sorry I've been missing in action. I'm completing my last semester of classes.

Since I'm on Spring Break (AKA catch up with all my work break) there will be quite a few posts popping up on information I've gained throughout the semester but have not had the time to post on.

Sunday, January 14, 2007

Getting the transpose of a CSV

Today my Boss/P.I. approached me with an application problem he was having. He had several large comma separated value files that needed to be transposed (i.e. switching data that are in a row to a column) to work with an application known as Jqtl. Now typically this would be no problem for him as he would simply pop the file into Excel or Datadesk, but he was dealing with files that had about 45,000 rows and 30 columns. Now if any of you have worked with Excel and large datasets you would know that Excel used to have a limit of 256 columns (until Excel 12 according to this blog), and the transposed data would have about 45,000 columns, so using that as a method was definitely not a solution.

So I simply wrote a quick Perl script for this as I didn't see any available in my 10 minute search online. I'm sure there is probably a module for it, but I thought it would be easy enough.

It took around three seconds to transpose the 45,000 by 30 dataset without any fancy code optimization. Here's the script.

If you're running in a Unix/Linux environment make sure you chmod to make the file executable. To run the script on, let's say, a file called foo, simply run the following from a terminal:

$ ./transpose_csv.pl foo

You'll end up with a file with "tr_" prepended to the original file name, such as tr_foo.
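For anyone who wants to see the idea without the Perl script, here is a minimal Python sketch of the same kind of transpose. It is not the script linked above; the function names are made up for this example, and it assumes the whole file fits in memory and that every row has the same number of fields.

```python
import csv


def transpose_rows(rows):
    """Turn a list of rows into a list of columns (assumes equal-length rows)."""
    return [list(column) for column in zip(*rows)]


def transpose_csv(input_path, output_path):
    """Read a whole CSV into memory, transpose it, and write the result."""
    with open(input_path, newline="") as src:
        rows = list(csv.reader(src))
    with open(output_path, "w", newline="") as dst:
        csv.writer(dst).writerows(transpose_rows(rows))


# Tiny in-memory example: 2 rows by 3 columns becomes 3 rows by 2 columns.
example = [["gene", "s1", "s2"], ["a", "1", "2"]]
print(transpose_rows(example))
```

A 45,000 by 30 table is small enough to hold in memory, which is why the read-everything-then-write approach is a reasonable choice here.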

Wednesday, January 3, 2007

Perl and different text file formats

I recently ran into a text file format problem while writing a Perl script in OS X. I had been testing the script and it worked fine with test text files but did not work with the text file I was given. For instance, I was scanning the text file for a particular Affymetrix gene ID and would never come up with a match using Perl's "eq" string comparison. I believed it was not a text file issue as I usually see carriage returns or "^M" at the end of lines when inspecting data in Vi.

What I discovered was what anyone who has ever worked with data from multiple OSs might know: carriage returns are not the only thing that might be carried over from an application exporting text files on another platform. What should have tipped me off was the little "[dos]" message at the bottom of the screen when I opened the file in Vi. This is why I couldn't see the extra characters carried over from the Windows export. To work around this, you can open the file in binary mode using Vi's -b option.

So in my case I saw all the additional null characters (^@) after every character in the file I was using. The file was actually encoded in UTF-16-LE format, which includes a null high-order byte after each ASCII byte (Allan from the Richmond Perl Mongers group explained this to me). This explained why the "eq" comparison was not working in my Perl script. To solve this I tried three different approaches:

Go back to the original application and ensure that data is exported in UTF-8 format, which will look like plain ASCII. While this may work, it's rather inconvenient, especially if you're working on data from a client.

Use a regular expression in Vi to replace the null characters with nothing.
In Vi's command mode you would type ":%s/^@//g", where ^@ is the null character (entered by pressing Ctrl-V and then Ctrl-@).

While this is a great solution it can be rather slow depending on the size of the file you are working with.

Use Perl's nifty encoding capability in their open function.

open (INPUT_FILE, "<:encoding(UTF-16)", "$input_path") or die;

While good, this assumes your Perl script is only going to work with that specific file encoding.

All three solutions worked out perfectly fine for me, and it's just a matter of preference as to which one you use.
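To make the problem concrete, here is a minimal Python sketch of why an exact string comparison fails on UTF-16-LE data and how the fixes above behave. The probe set ID is just an example value chosen for this sketch, and the variable names are made up.

```python
# Example ID used only for this sketch.
probe_id = "1007_s_at"

# What a UTF-16-LE export puts on disk: each ASCII byte is followed by a null byte.
raw = (probe_id + "\n").encode("utf-16-le")

# Reading the bytes as single-byte text keeps the nulls,
# so an exact comparison against the plain ID fails.
naive = raw.decode("latin-1").strip()
naive_matches = naive == probe_id

# Decoding with the right encoding restores the original text.
decoded = raw.decode("utf-16-le").strip()
decoded_matches = decoded == probe_id

# Stripping the null bytes also works, but only for pure ASCII content.
stripped = raw.replace(b"\x00", b"").decode("ascii").strip()
stripped_matches = stripped == probe_id

print(naive_matches, decoded_matches, stripped_matches)  # False True True
```

In Python the equivalent of the third approach is passing the encoding when opening the file, for example `open(path, encoding="utf-16-le")`, with the same caveat that the script then expects that one encoding.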

As a side note, since I always forget this myself: if you are in Linux/Unix and working with text files from some OS X applications, you'll discover that ^M (carriage return) is the end of line character carried over from OS X. On first instinct you might want to use "\n" for your newline character in your Vi regular expression ":%s/^M/\n/g" (where ^M is entered by pressing Ctrl-V and then Ctrl-M), but this won't work; the actual line feed to use with this method is "\r". So your regular expression would look like ":%s/^M/\r/g"
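If you would rather script the conversion than do it in Vi, here is a minimal Python sketch of the same clean-up. The function name `normalize_newlines` is made up for this example, and it works on raw bytes so that Python's own newline handling does not get in the way.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows (CR LF) and carriage-return-only line endings to Unix LF."""
    # Handle CR LF first so a Windows line ending does not become two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


sample = b"gene_a\rgene_b\r\ngene_c\n"
print(normalize_newlines(sample))  # b'gene_a\ngene_b\ngene_c\n'
```

To use it on a file, read the contents with `open(path, "rb")` and write the result back with `open(path, "wb")`.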