What I discovered is something anyone who has worked with data from multiple operating systems probably knows: carriage returns are not the only thing that can be carried over when an application exports text files on another platform. What should have tipped me off was the little "[dos]" message at the bottom of the screen when I opened the file in Vi. That is why I couldn't see the extra characters carried over from the Windows export. To work around this, open the file with Vi's -b option, which opens it in binary mode.
So in my case I saw the additional null characters (^@) after every character in the file I was using. The file was actually encoded in UTF-16LE, which stores a null high-order byte after each ASCII byte (Allan from the Richmond Perl Mongers group explained this to me). This explained why the "eq" comparison was not working in my Perl script. To solve this I tried three different approaches:
- Go back to the original application and make sure the data is exported in UTF-8, which for ASCII-only text looks just like plain ASCII. While this may work, it's rather inconvenient, especially if you're working on data from a client.
- Use a regular expression in Vi to replace the null characters with nothing.
In Vi's navigation mode you would type ":%s/^@//g", entering the ^@ as Ctrl-V followed by Ctrl-@.
While this is a great solution, it can be rather slow depending on the size of the file you are working with.
- Use Perl's nifty encoding capability in its open function.
open(INPUT_FILE, "<:encoding(UTF-16)", $input_path) or die "Cannot open $input_path: $!";
While this works well, it assumes your Perl script will only ever work with that specific file encoding.
All three solutions worked perfectly fine for me, and it's just a matter of preference which one you use.
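One possible way around that limitation is to peek at the first two bytes of the file and only apply the UTF-16 layer when a byte-order mark is present. The sketch below is not what I actually ran; the helper name guess_layer is made up for illustration, and it assumes that any file without a UTF-16 byte-order mark can be read as UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: choose an input layer from the byte-order mark.
sub guess_layer {
    my ($path) = @_;

    # Read the first two bytes without any decoding.
    open(my $fh, "<:raw", $path) or die "Cannot open $path: $!";
    my $got = read($fh, my $head, 2);
    die "Cannot read $path: $!" unless defined $got;
    close($fh);

    # FF FE or FE FF at the start marks UTF-16; the UTF-16 layer reads
    # the mark itself to work out the byte order.
    return ":encoding(UTF-16)"
        if $head eq "\xFF\xFE" or $head eq "\xFE\xFF";

    # Assumption: everything else is treated as UTF-8.
    return ":encoding(UTF-8)";
}

# Usage: pass the input file as the first command-line argument.
if (@ARGV) {
    my $input_path = $ARGV[0];
    my $layer      = guess_layer($input_path);

    open(my $in, "<$layer", $input_path)
        or die "Cannot open $input_path: $!";
    binmode(STDOUT, ":encoding(UTF-8)");

    while (my $line = <$in>) {
        $line =~ s/\r?\n\z//;    # drop LF or CRLF line endings
        print "$line\n";
    }
    close($in);
}
```

A file exported without a byte-order mark would still need its encoding named explicitly, so this only covers the common case.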
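For anyone curious why the "eq" comparison failed in the first place, here is a minimal sketch using the core Encode module. The sample string "abc" is made up for illustration; it is encoded as UTF-16LE to mimic the Windows export, then compared before and after decoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample value, encoded the way the Windows export was.
my $expected = "abc";
my $raw      = encode("UTF-16LE", $expected);

# Show each byte in hex: every ASCII byte is followed by a 00 byte.
my $hex = join " ", map { sprintf("%02x", ord($_)) } split //, $raw;
print "raw bytes: $hex\n";    # raw bytes: 61 00 62 00 63 00

# The raw bytes do not match the plain string...
print "raw eq expected:     ", ($raw eq $expected ? "yes" : "no"), "\n";

# ...but the decoded text does.
my $decoded = decode("UTF-16LE", $raw);
print "decoded eq expected: ", ($decoded eq $expected ? "yes" : "no"), "\n";
```

The first comparison prints "no" and the second prints "yes", which is the same behaviour I saw before and after fixing the encoding.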
As a side note, since I always forget this myself: if you are on Linux/Unix and working with OS X text files, you'll discover that ^M is the end-of-line character from OS X. On first instinct you might want to use "\n" for your newline character in your Vi regular expression ":%s/\