Monday, October 30, 2006

Average CEL Perl Script

Today I wrote a Perl script to average out Affymetrix .cel files. In the middle of my Perl hacking I ran into some issues concerning Affymetrix's file format.

In a .cel file, under a section marked [INTENSITY], Affymetrix stores one probe intensity per row. Each row holds the x-coordinate, y-coordinate, mean (i.e. the intensity), standard deviation, and number of pixels.

Originally, I tried using the split function to break up each line in the intensity section. What I tried was as follows:
($dummy, $dummy, $dummy, $total_intensity[$i]) = split(/\s+/, $line);

Notice the three dummy variables ahead of the actual intensity capture. I used three because each line begins with whitespace before the coordinate values, so split returns an empty leading field and the line breaks into six fields instead of the five mentioned above. This caused a problem: once you get to coordinates three digits long (e.g. y=100), that leading space is no longer there, the empty field disappears, and every value shifts one position to the left. What I ended up doing was using a regular expression instead, as follows:
# Capture x, y, and mean; \s* tolerates a missing leading space.
if ($line =~ m/^\s*(\d+)\s+(\d+)\s+(\d+\.\d+)/) {
    $total_intensity[$i] = $3;
}
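
To make the field shift concrete, here is a small sketch. The two sample rows are made up for illustration (they are not copied from a real .cel file), and fields_from_split is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: split a row the way the first attempt did.
sub fields_from_split {
    my ($line) = @_;
    return split(/\s+/, $line);
}

# Made-up rows in the x, y, mean, stdv, npixels layout.
my $padded   = "  5   7   130.5   20.1   16";   # leading whitespace
my $unpadded = "100 100 130.5 20.1 16";         # no leading whitespace

my @padded_fields   = fields_from_split($padded);
my @unpadded_fields = fields_from_split($unpadded);

# With leading whitespace, split yields an empty first field,
# so the mean sits at index 3. Without it, index 3 is the
# standard deviation and the mean has moved to index 2.
print "padded,   index 3: $padded_fields[3]\n";
print "unpadded, index 3: $unpadded_fields[3]\n";
print "unpadded, index 2: $unpadded_fields[2]\n";
```

So with the three-dummy version, an unpadded row would silently store the standard deviation in place of the mean.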

I know this is a "hack", but it works. I'm sure there is a way to get rid of the leading spaces, but none came to mind. I tried the chomp function, but all it does is remove a trailing newline. Does anyone have an idea how to get rid of these leading spaces?
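
One possible answer, offered as a sketch and not as something tested against real .cel files: Perl's split treats a pattern consisting of a single space string (' ') specially, skipping any leading whitespace before splitting on runs of whitespace. Another option is to strip the leading whitespace first with s/^\s+//. The helper names parse_intensity_row and parse_with_strip are hypothetical, and the sample rows are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: return the mean (third column) of an intensity row.
sub parse_intensity_row {
    my ($line) = @_;
    # split ' ' skips leading whitespace, so the fields never shift.
    my (undef, undef, $mean) = split(' ', $line);
    return $mean;
}

# Alternative: strip leading whitespace explicitly, then split as before.
sub parse_with_strip {
    my ($line) = @_;
    $line =~ s/^\s+//;
    my (undef, undef, $mean) = split(/\s+/, $line);
    return $mean;
}

# Made-up rows, one with leading whitespace and one without.
for my $row ("  5   7   130.5   20.1   16", "100 100 130.5 20.1 16") {
    print parse_intensity_row($row), " ", parse_with_strip($row), "\n";
}
```

Either way the mean always lands in the third field, so no dummy for the empty leading field is needed.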
