Why Application Servers Crash and How to Avoid It

Building a solid solution so your enterprise will succeed on the Web.
But, My Server Is Faster than That!

These days, a well-tuned relational database (without deadlocks, no I/O contention, separate I/O paths for data and transaction log, etc.) might have a mean service time of 2, 5 or 10ms. That is in the same order of magnitude as the specs of the best disk drives themselves. And, that is also the level of performance announced in press releases from manufacturers on new TPC performance records: 30,000, 12,000 and 6,000 TPM (transactions per minute), respectively.

Table 3 shows what can be the highest volume of requests per hour while preserving a tolerable, yet slow response time (tq <= 20 seconds) for some typical mean service times.

Table 3. Service Time vs. Highest Volume of Requests with Acceptable Response Time (tq <= 20 Seconds)

You can see in this table that speed is a very valuable feature. With a very small mean service time (5ms or less), a web application system can support an important load without reaching the limit (<rho> = 1). Not only this, but the almost flat part of the curve in graphs, such as in Figures 1 and 2, is much longer. Also, Table 3 shows that a much longer mean service time (200, 500ms or more) allows a very limited number of requests per hour.

Back to the Bottleneck Question

If the database back end easily can be as fast as 5 or 10ms, and therefore process roughly half a million transactions per hour, then why does a web application system die when it has not even one hundred thousand (or less)? In other words, if the database takes 5ms, why does the application server take 50 or 100ms (or more) to complete a request?

Once again, we are reminded of the keep it simple principle. As we will see, its antithesis, complexity, is a monster for which we will pay dearly.

How many tiers does the server side have? With a two-tier architecture, there is only a web server and a database server. With three-tier architecture, we have the two previous tiers and a third one, the application server, in between. Some installations have another tier: they split the work in two phases between two application servers in cascade, perhaps because it fits the Model-Controller-View principle of development methodology. Some other installations have yet another tier, a security server, which must answer access-control requests coming from the various servers in the system.

In all of these architectures, each additional server adds its own service time to the total mean service time of the simple model we have been looking at. Also, there is a communication delay between each server, whether a server is only another program on the same machine or another process on another host. Some of the modern communication protocols used to communicate between object-oriented languages are known to introduce significant overhead. In my experience, the smallest delay I have seen using a rather lightweight TCP protocol on the same machine was around 3ms. If you have many hops because you have three or four servers in cascade to serve one request, the sum of all these delays could easily be 20 or 30ms. And you pay that price (30ms) for each and every request, even if your last application server does a 5ms query to the database that takes only 5ms.

Then there is the question, how fast are the various application servers? Each vendor might have their claims. If you want the real answers, you have to let your application compute the elapsed time at the end of each request and write it to a message log. This will tell you what really happens with your hardware and software configuration. If you have such measurements for each server program, you will be able to identify bottlenecks quickly. The lightest application server I have seen used a few milliseconds to process any request. Your mileage will vary. It would be interesting to compile real-life statistics like this about various products as these numbers are as difficult to find as marble-sized gold nuggets in a stream.

Why the Server Crashes, Even at Moderate Loads

Let's say we started with a high-performance RDBMS crunching at 5ms, and we end up with a web service that has a total average service time of 50ms. Well, that's not so bad actually, as we should be able to process 70,000 dynamic requests per hour.

And, let's say that the main application server of MyBestSport.com crashes at less than 10,000 requests per hour. Why? We know it happens often on evenings when there is a baseball or football game on TV, 15 minutes before the game (the load gets much lighter once the game begins).

This is very likely to happen if certain categories of requests take more time and tend to occur together, without being well distributed in time. This is not like the exponential distribution that we used in our model. Certain types of unequal distributions are especially painful. Let's look at this example (not a real service) in more detail.

Suppose the server side has four tiers: HTTP server, application server, security server and database server. Also, suppose the total mean service time is 50ms: 10ms to leave the HTTP server, cross a firewall and get to the application server; 25ms in the application server itself; 10ms to do an access-control request to the security server; and 5ms to do an SQL request.

But, all these measures are averages. We know from the log files that one server has a less equal distribution: the security server does the access control checks very quickly, in less than 4ms usually, but it takes normally about 500ms to perform a login check. The average is still less than 5ms because there are hundreds of access-control requests for each login request. When we look in the application log, we see that a large number of new logins were beginning before the application server ran out of RAM and started swapping, tied its shoelaces together and jumped out of the window.

Also, we know that the programming power offered by the very high-level, object-oriented features of the application server has its price: each incoming request uses at least 800KB even before it has been forwarded to the security server or the database server. Also, each session object (used to simulate a browser-to-app-server persistent connection, something that the HTTP protocol does not provide, per se) takes 200KB. After more analysis of the log, we find out that just before the crash there was an average of 200 logins per minute, much more than the average. Looking at the 15 minutes of log before the crash, we find about one thousand login requests and only a few logouts. The application server got bogged down with too many concurrent login requests, and it exhausted the RAM. The machine started paging out virtual memory, requests took too long, users got fed up with waiting and clicked on Stop, pressed Esc and tried logging in again until the operator pressed the big red button.

______________________

Comments

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Re: Why Application Servers Crash and How to Avoid It

Anonymous's picture

Where can I find out more about "request processor resources (RPRs)" ?

As a developer how do I add them ?

Re: Why Application Servers Crash and How to Avoid It

johntfrancois's picture

RPR are not famous or off-the-shelf software components.
A developer has to create a "pool" of resources -- a little bit like a "connections pool", as already found in some app server products -- where these resources are not connections or anything concrete or visible. A "RPR" pool is simply there to limit the number of concurrent threads (requests) competing for memory and CPU inside the app server.

Webinar
One Click, Universal Protection: Implementing Centralized Security Policies on Linux Systems

As Linux continues to play an ever increasing role in corporate data centers and institutions, ensuring the integrity and protection of these systems must be a priority. With 60% of the world's websites and an increasing share of organization's mission-critical workloads running on Linux, failing to stop malware and other advanced threats on Linux can increasingly impact an organization's reputation and bottom line.

Learn More

Sponsored by Bit9

Webinar
Linux Backup and Recovery Webinar

Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.

In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.

Learn More

Sponsored by Storix