School’s Out is a RoR app running on a cluster of Mongrels. Every so often, on high-traffic days, one of the Mongrel processes would go goofy: it would start chewing through memory and become unresponsive. When that happened, I would run top and sort the processes by memory usage, and the bad process would be at the top, using over 20% of system memory. This only happened once or twice a month, and I could never reproduce it in my development environment.
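
The same check can be scripted. Here is a minimal sketch in Ruby, assuming a ps that accepts -A -o pid,pmem,args (the Linux and BSD versions both do, as far as I know). The helper names parse_ps and memory_hogs are made up for illustration, and the 20% default simply mirrors the figure above.

```ruby
# Parse `ps -A -o pid,pmem,args` output into one hash per process.
# parse_ps and memory_hogs are hypothetical helper names, not a library API.
def parse_ps(ps_output)
  lines = ps_output.lines.drop(1)                    # skip the header row
  lines = lines.reject { |line| line.strip.empty? }  # ignore blank lines
  lines.map do |line|
    pid, pmem, command = line.strip.split(/\s+/, 3)
    { pid: pid.to_i, pmem: pmem.to_f, command: command.to_s }
  end
end

# Return the processes above the memory threshold (in percent), biggest first.
def memory_hogs(ps_output, threshold = 20.0)
  parse_ps(ps_output)
    .select { |row| row[:pmem] > threshold }
    .sort_by { |row| -row[:pmem] }
end
```

On the server, something like memory_hogs(%x(ps -A -o pid,pmem,args)) would then list any process over the threshold, which is handy for a cron job that only needs to raise an alert.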

Mongrel has a neat feature built in: you can turn on debug mode by sending a USR1 signal to the process. In this case the only information I got from debug mode was that some requests were hung up somewhere in my Rails code. Not super useful, but it was a start. Now I needed to figure out where and why the hangup was happening.
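
To show the signal mechanism itself, here is a small self-contained Ruby sketch. The forked child is only a stand-in that flips a flag when it receives USR1; it is not Mongrel's code, and the method names are made up. Against a real Mongrel you would just send the signal to its PID.

```ruby
# Fork a stand-in "server" that flips a debug flag whenever it receives USR1.
# This only illustrates the signal mechanism; it is not Mongrel's code.
def spawn_toggle_process(writer)
  fork do
    debug = false
    Signal.trap("USR1") do
      debug = !debug
      writer.syswrite(debug ? "debug on\n" : "debug off\n")
    end
    writer.syswrite("ready\n")      # tell the parent the handler is installed
    loop { sleep 1 }                # idle until killed
  end
end

# Send USR1 twice and collect what the child reports each time.
def demo_usr1_toggle
  reader, writer = IO.pipe
  pid = spawn_toggle_process(writer)
  writer.close                      # the parent only reads
  reader.gets                       # wait for "ready"
  replies = 2.times.map do
    Process.kill("USR1", pid)       # same effect as `kill -USR1 <pid>` in a shell
    reader.gets.chomp
  end
  Process.kill("KILL", pid)         # stop the stand-in process
  Process.wait(pid)
  reader.close
  replies
end

puts demo_usr1_toggle
```

The demo waits for the child's "ready" line before signalling, so the handler is guaranteed to be installed by the time USR1 arrives.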

Normally I would restart all the Mongrels to get the site running again. This time I just took the offending process out of the load balancer (I am using Apache httpd + mod_proxy_balancer), so I could examine it while it was still hung.

Now the question is, how do I try and debug a running process?

I had no clue, but luckily Jamis Buck did.

I attached gdb to the bad mongrel and followed Jamis’ instructions to figure out where the process was stuck.

[root@www iwarshak]# gdb /opt/local/bin/ruby 9489
...
Attaching to program: /opt/local/bin/ruby, process 9489
...
(gdb) set $ary = (int)backtrace(-1)
(gdb) set $count = *($ary+8)
(gdb) set $index = 0
(gdb) while $index < $count
 >x/1s *((int)rb_ary_entry($ary, $index)+12)
 >set $index = $index + 1
 >end
0x9653a50: "/opt/ruby/lib/ruby/gems/1.8/gems/postgres-pr-0.4.0/lib/buffer.rb:64:in `read'"
(gdb)
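
For anyone who does not read gdb: as far as I can tell, backtrace(-1) is the internal C function in Ruby 1.8 that builds the same array of frame strings that Kernel#caller returns, and the loop simply prints each entry. The offsets 8 and 12 appear to be the length field of the array and the character pointer of each string on a 32-bit build of Ruby 1.8, so they are specific to that kind of build. A rough Ruby-level equivalent, as a sketch rather than what gdb actually executes (the helper names are made up):

```ruby
# Ruby-level counterpart of backtrace(-1): an array of frame strings,
# innermost frame first. (current_backtrace is a hypothetical helper name.)
def current_backtrace
  caller(0)
end

# Mirrors the gdb loop: walk the array by index and print each entry.
def print_frames(frames, out = $stdout)
  index = 0
  while index < frames.length   # gdb: while $index < $count
    out.puts frames[index]      # gdb: x/1s on the entry's string data
    index += 1                  # gdb: set $index = $index + 1
  end
end

print_frames(current_backtrace)
```

The difference, of course, is that the gdb version works from outside the process, which is the whole point when the process is too stuck to run any Ruby you hand it.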

OK, it looked like it had something to do with the postgres-pr driver: the process was stuck in a read in buffer.rb. I ran this several times and always got the same result. Just to compare, I did the same thing with the working Mongrels. All of the good ones looked like this:

[root@www iwarshak]# gdb /opt/local/bin/ruby 9498
...
Attaching to program: /opt/local/bin/ruby, process 9498
...
(gdb) set $ary = (int)backtrace(-1)
(gdb) set $count = *($ary+8)
(gdb) set $index = 0
(gdb) while $index < $count
 >x/1s *((int)rb_ary_entry($ary, $index)+12)
 >set $index = $index + 1
 >end
0x9bc2768: "/opt/ruby/lib/ruby/gems/1.8/gems/mongrel-0.3.14/lib/mongrel/configurator.rb:274:in `sleep'"
(gdb)

I am no Mongrel expert, but the responsive Mongrels looked like they were sleeping, waiting for a request to come in.

After digging around, the only potential solution I found was to use the native postgres gem, instead of the pure Ruby postgres-pr driver. So that’s what I did.

[root@www iwarshak]# gem install postgres
[root@www iwarshak]# gem uninstall postgres-pr

I am hoping that this solves the problem.