Saturday, November 5, 2011

Details about Apache Benchmarking Tool

The last couple of days were a real struggle; the Apache Benchmarking tool (ab) gave me a really hard time. There is only a one-page manual explaining the parameters and literally no documentation on how to interpret the results.

The usual ab output looks something like this:
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd,
Copyright (c) 1998-2002 The Apache Software Foundation,

Benchmarking (be patient)

Server Software: Jetty(8.0.4.v20111024)
Server Hostname: HOST_NAME
Server Port: 8080

Document Path: YOUR_URL
Document Length: 1397 bytes

Concurrency Level: 4000
Time taken for tests: 1.649018 seconds
Complete requests: 12000
Failed requests: 0
Write errors: 0
Non-2xx responses: 15868
Total transferred: 19339416 bytes
HTML transferred: 16801719 bytes
Requests per second: 7277.06 [#/sec] (mean)
Time per request: 549.673 [ms] (mean)
Time per request: 0.137 [ms] (mean, across all concurrent requests)
Transfer rate: 11452.88 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 23 38.1 1 120
Processing: 2 134 140.0 125 861
Waiting: 1 131 140.3 118 860
Total: 3 158 163.0 175 962

Percentage of the requests served within a certain time (ms)
50% 175
66% 225
75% 244
80% 266
90% 361
95% 434
98% 481
99% 617
100% 962 (longest request)
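For reference, output like the above comes from an invocation along these lines (the host and path are placeholders from the sample run; adjust for your setup):

```shell
# 12000 requests total, 4000 at a time, against the service under test
ab -n 12000 -c 4000 http://HOST_NAME:8080/YOUR_URL
```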
ab's results seem simple at first look, but when we started understanding the numbers a little more, we started noticing issues with them. For example, the mean response time reported by ab is essentially "total time taken / total number of requests", whereas response time should be calculated from the time taken by each individual request, since in a multiprocessor environment more than one request is being processed at any given point in time.
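To see exactly what ab is averaging, here is the arithmetic behind the summary lines in the sample run above, reproduced with awk (the input numbers come straight from that run):

```shell
# Requests per second = complete requests / total time taken
awk 'BEGIN { printf "%.2f\n", 12000 / 1.649018 }'                # 7277.06
# "Time per request" (mean) = concurrency * total time / requests, in ms
awk 'BEGIN { printf "%.3f\n", 4000 * 1.649018 / 12000 * 1000 }'  # 549.673
# "Time per request, across all concurrent requests" = total time / requests, in ms
awk 'BEGIN { printf "%.3f\n", 1.649018 / 12000 * 1000 }'         # 0.137
```

Note how the 0.137 ms figure divides wall-clock time across requests that were actually in flight simultaneously, which is why it looks impossibly fast.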

By ab's logic, if a woman delivers triplets in 9 months, then a baby can be delivered in 3 months, which you and I know is not possible. :)
ab's formula for calculating the mean gives deceptive results, so instead we relied on gnuplot to get meaningful numbers. The connect time (with its standard deviation) came close to what Firefox was showing for each request. After plotting the gnuplot data, things started to make a little more sense.

So if you want to know how much time each request is really taking, do plot the per-request data, which can be written out with ab's -g option. Those numbers will be much closer to the real world.
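A minimal sketch of that workflow (file names are my own; ab writes one tab-separated row per request with the columns starttime, seconds, ctime, dtime, ttime, wait):

```shell
# Capture per-request timings alongside the normal summary
ab -n 12000 -c 4000 -g timings.tsv http://HOST_NAME:8080/YOUR_URL

# Plot the total time of each individual request
gnuplot <<'EOF'
set terminal png size 800,600
set output "timings.png"
set xlabel "request"
set ylabel "total time (ms)"
# column 5 (ttime) is the total time each request took;
# 'every ::1' skips the header row
plot "timings.tsv" every ::1 using 5 with lines title "per-request total time"
EOF
```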

Failures also had interesting behavior. If you look closely, I fired a total of 12K requests but there were 15K non-2xx responses, i.e. responses that failed for some reason, yet the failure section shows zero. I was so confused.

ab reports failures based on length and exceptions. It takes the length of the first response as the reference; if the length of a later response does not match it, that request is recorded as a failure. In my case the REST service should always return the same response, hence this worked.
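A related note: if your responses legitimately vary in length (dynamic pages), newer versions of ab have a -l flag to accept variable document length instead of counting every mismatch as a Length failure (my 2.0.41-dev build predates it):

```shell
# -l: do not treat differing response sizes as Length failures (ab 2.4+)
ab -l -n 12000 -c 4000 http://HOST_NAME:8080/YOUR_URL
```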

Another source of failures: any request that does not come back within the specified timeout is also recorded as a failure. Hence I often noticed failures but had no exceptions or errors logged anywhere.
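Recent ab versions let you raise that timeout with -s (the default is 30 seconds); a sketch:

```shell
# -s: maximum seconds to wait for each response before recording a failure
ab -s 60 -n 12000 -c 4000 http://HOST_NAME:8080/YOUR_URL
```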

I still could not work out why a run sometimes shows failures and then passes the next time I try it.

I was also getting all kinds of weird errors most of the time: "Connection reset by peer", "Can't assign requested address", "Connection timeout". I would not see any exceptions in the logs, neither Jetty's nor the application's. To prove that these were OS-related, I decided to look into the source code.
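For what it's worth, "Can't assign requested address" on the client side usually means the benchmarking box ran out of ephemeral ports. On Linux the relevant knobs can be inspected and widened like this (the values are illustrative, not a recommendation):

```shell
# Current ephemeral port range used for outgoing connections
sysctl net.ipv4.ip_local_port_range
# Widen the range and allow reuse of TIME_WAIT sockets for new
# outgoing connections (illustrative values; tune for your own machine)
sudo sysctl -w net.ipv4.ip_local_port_range="15000 65000"
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
```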

The ab tool uses the Apache Portable Runtime (APR) library for all of its calls; here is a very nice link which helped me a lot.

After going through the code it was clear that all those errors were socket-level, i.e. OS errors, which ab was wrapping and returning in the apr_status variable.

That's a lot of information for one blog post; I will try splitting it up in the future.