Tracking Down Slow System Calls

We use a number of asynchronous servers at SmartFile. Asynchronous servers, as opposed to forking or threaded servers have a single thread of execution, and handle multiple clients simultaneously by using an event loop and polling.

In general async servers scale better than their forking/threaded counterparts because they consume less resources per client. The downside to async servers is that because all clients are served by the same thread of execution, any blocking calls will cause the entire server to “hang” or not respond to any client connection. Forking or threaded servers handle blocking more gracefully since any blocking will affect the thread of execution for just a single client connection. Other clients can be serviced normally.

Some of our async servers depend on external resources. These external resources can block during interaction, which results in our server being unavailable for the duration of the interaction. We use both MySQL and Redis as data stores. We also use a distributed file system for storing file contents. The first step in troubleshooting such behaviour is to identify the blocking operations. The next step would be to eliminate blocking by tuning the service, but that is for another post. This post gives a simple way to identify blocking system calls.

Because our async servers are all single process / single threaded, it is fairly easy to observe their behaviour using strace. What is needed is a way to filter out fast system calls from the strace output, so that we can locate the blocking calls. Below is a command that does just this. It took me a bit of time reading man pages to put this together, so I present it here to save you some time.

strace -Tp 9342 2>&1 | awk ' BEGIN { FS="[<>]" }; { if ( >= 0.5) print /bin/sh } '

First of all, we execute strace and instruct it to attach to our server process (pid 9342) and to include timing information for each syscall (-T). Strace will output lines like:

gettimeofday({1365787784, 837500}, NULL) = 0 <0.000005>

Where <0.000005> is the number of seconds the call took to complete. We then combine stderr and stdout into a single stream (strace outputs to stderr) using 2>&1, then pipe the output to awk using |.

The awk program is fairly simple, the first phase is BEGIN { FS=”[<>]” }; which runs one time, it sets the field separator to a regular expression specifying that < and > should be used to split the strace output into fields. Below are the separated fields (| delimits fields).

gettimeofday({1365787784, 837500}, NULL) = 0 | 0.000005 |

The next phase { if ( >= 0.5) print /bin/sh } runs for each input line, and simply performs a numeric comparison between the second to last field (syscall time) and half a second. If the syscall takes more than a half second to complete, the original strace output line (/bin/sh) is printed to the console all other output is not printed.

You could then monitor this output using your console, or write it to a log file for later review. This should make it quite easy to track down where your async server is blocking, so that you can focus your efforts on optimizing the correct resource to eliminate blocking.

SmartFile is a business file mangement platform that gives you more control, compliance and security.