A while back, I put together a Perl script to read in some large files, do some data mining, and dump out the results I wanted to see (well, maybe they weren’t the results I was hoping for, but results nonetheless). When the script ran, it would go through its first few routines nice and quick-like, but when it reached the next routine, it would become unbearably slow. The first couple of times, I didn’t think much of it; I liked feeling that the script was hard at work (of course).
However, after a few more runs, it was getting kind of ridiculous: the routine in question was the one that did reads from the file handle. Now, I can understand that on a multi-gigabyte file it might take a while (read in each line, parse it through a regex, and populate a data structure). However, it was causing the entire system to hang, and that made no sense at all given its relatively simple instructions. Here’s an example of what I was doing:
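```perl
use IO::File;    # also exports the O_* constants such as O_RDONLY

my $fd = IO::File->new("data.out", O_RDONLY);
# foreach evaluates <$fd> in list context: every line is read first
foreach my $line (<$fd>) {
    @data = $line =~ /regex_pattern/;
}
```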
Simple enough. So, on the next run, I started tracking the PID. Now, I saw two things: 1) the CPU was pinned, and 2) memory utilization was going through the roof. Bingo. Basically, my multi-gigabyte file was being read into memory, eating up all the remaining free physical memory, which was forcing the box to start swapping (which was eating up the CPU). Now it made sense why things were going ridiculously slow.
Now, what didn’t make sense was the fact this was happening at all. For each line of data, I was only pulling about 10% of it into my data structure (bytes parsed vs bytes per line), so I shouldn’t see memory utilization match that of the file. Additionally, I had worked with large files like this before, on systems with less memory, and never had an issue. What was different?
I took lunch, thought about it, and somewhere in there thought to myself, “hrmm, I wonder if it has to do with the foreach loop.” In the past, I had always used a while-loop, but didn’t really consider there might be an extreme difference between how each was interpreted when reading a file handle. I went back to my desk and modified the code to use a while loop instead:
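```perl
use IO::File;    # also exports the O_* constants such as O_RDONLY

my $fd = IO::File->new("data.out", O_RDONLY);
# while evaluates <$fd> in scalar context: one line per iteration
while (my $line = <$fd>) {
    @data = $line =~ /regex_pattern/;
}
```
Sure enough, it tore right through the file, did its parsing, and was done very shortly after. Interesting! So, what happened?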
Basically, it’s a major difference in how Perl handles while loops vs. foreach loops.
In a while loop, the condition my $line = <$fd> is evaluated in scalar context, so Perl reads exactly one line from the file handle per iteration and stops when the read returns undef at EOF. Only the current line is held in memory at any time. This is how it should work (and does exactly what we expect it to do), so that makes sense.
A foreach loop, on the other hand, iterates over a list, so the expression in its parentheses is evaluated in list context before the first iteration runs. If the data is already an array, that’s not a problem. However, if the data is a file handle, <$fd> in list context reads every remaining line up to EOF and returns them all at once. The whole file ends up in memory as a temporary list before the loop body ever sees the first line.
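To make the difference visible without a multi-gigabyte file, here is a minimal sketch. It assumes a five-line in-memory file handle as stand-in data, and the two sub names are made up for illustration. Each sub runs the loop body once and then asks the handle how many lines it still has left to give:

```perl
use strict;
use warnings;

# Hypothetical sample data; an in-memory handle stands in for data.out.
my $text = join '', map { "line $_\n" } 1 .. 5;

# Run the body of a foreach loop once, then count the unread lines.
sub remaining_after_first_foreach {
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    my $left;
    foreach my $line (<$fh>) {      # list context: all lines read up front
        my @rest = <$fh>;           # whatever the handle still has to offer
        $left = scalar @rest;
        last;
    }
    close $fh;
    return $left;
}

# Same experiment with a while loop.
sub remaining_after_first_while {
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    my $left;
    while (my $line = <$fh>) {      # scalar context: one line read so far
        my @rest = <$fh>;
        $left = scalar @rest;
        last;
    }
    close $fh;
    return $left;
}

printf "foreach: %d lines left unread after the first iteration\n",
    remaining_after_first_foreach();    # 0
printf "while:   %d lines left unread after the first iteration\n",
    remaining_after_first_while();      # 4
```

With foreach, the handle is already drained by the time the body runs, which is exactly the behavior that hurts on a huge file.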
Most of the time, people aren’t running through multi-gigabyte files, so doing a foreach or a while on a file handle won’t really be noticeable, even though there are two completely different logic paths being followed behind the scenes.
Ultimately, use a while loop when reading off of a file handle. It’s cleaner, uses less overhead, and has obviously proven itself to be more fit for this task.
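For reference, here is one way the streaming pattern might look as a small reusable routine. This is only a sketch under a few assumptions: extract_fields is a hypothetical helper name, the id=... line format and the regex are invented sample data, and a temporary file stands in for the real data.out.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: stream a file line by line and collect the first
# capture group of every line that matches $pattern.
sub extract_fields {
    my ($path, $pattern) = @_;
    open my $fh, '<', $path or die "cannot open $path: $!";
    my @fields;
    while (my $line = <$fh>) {          # one line in memory at a time
        chomp $line;
        if (my ($field) = $line =~ $pattern) {
            push @fields, $field;       # keep only the part we care about
        }
    }
    close $fh;
    return @fields;
}

# Small temporary file with invented contents, in place of a huge data.out.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "id=1 payload=aaa\n", "noise\n", "id=2 payload=bbb\n";
close $tmp_fh;

my @ids = extract_fields($tmp_path, qr/^id=(\d+)/);
print "ids: @ids\n";    # ids: 1 2
```

Memory use here grows only with what gets pushed onto @fields, not with the size of the file, which matches the "only about 10% of each line" expectation from earlier.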
There are a lot of Perl hackers out there (many of whom I have worked with) who may end up commenting or shedding some deeper technical insight. By all means, I would love to hear it! If you think I’ve done a good job summarizing and detailing the issue, that works just as well! Enjoy!




Thanks for the words, it’s helpful.
for one… F perl…
two. How do you get the syntax highlighting with line numbers? Is that a WP plugin?
hahah.. don’t hate! and, yeah, it’s the WP-Syntax plugin. i think there’s one that is a little bit “prettier”, but i haven’t had the time to look around yet. it gets the job done for now
Thank you,
very interesting article