Stress Testing an Apache Application Server in a Real World Environment
We've all had an experience in which the software is installed on the servers, the network is connected and the application is running. Naturally, the next step is to think, "I wonder how much traffic this system can support?" Sometimes the question lingers and sometimes it passes, but it always presents itself. So, how do we figure out how much traffic our server and application can handle? Can it handle only a few active clients or can it withstand a proper Slashdotting? To appreciate fully the challenges one faces in trying to answer these questions, we must first understand the dynamic application and how it works.
A traditional dynamic application has five main components: the application server, the database server, the application, the database and the network. In the open-source world, the application server usually is Apache. And, often, Apache is running on Linux.
The database server can be almost anything that can do the job; for most smaller applications, this tends to be MySQL. In this article, I highlight the open-source PostgreSQL server, which also runs on Linux.
The application itself can be almost anything that fits the project requirements. Sometimes it makes sense to use Perl, sometimes PHP, sometimes Java. It is beyond the scope of this article to determine the benefits or liability of a particular platform, but a firm understanding of the best tool for the job is necessary to plan properly for adequate performance in a running application.
The database itself can mean the difference between a maximum load of one user and 5,000 users. A bad schema can be the death of an application, while a good schema can make up for a multitude of other shortcomings.
The network tends to be the forgotten part of the equation, but it can be as detrimental as bad application code or a bad schema. A noisy network can slow intra-server communications dramatically. It also can introduce errors and other unknowns into communications that, in turn, have unknown results on the running code.
As you have probably guessed, finding where our optimal performance lies and pushing those limits is more than a minor challenge. Like the formula-one race car that runs with almost absolute technical efficiency, the five main components of the Web-based application determine whether the system can handle its load optimally. By looking at those components and measuring how they react under certain circumstances, we can use that data to better tune the system as a whole.
To begin the testing, we need to create an environment that facilitates micro-management of the five components. Being as most enterprise class applications are based on large proprietary hardware configurations, setting up a testing configuration often is prohibitive in cost. But, one of the advantages of the open-source model is a lot of the configurations are based on commodity hardware. The commodity hardware configuration, therefore, is the basic assumption used throughout the testing setup. This is not to say that a setup based on large proprietary hardware is not as valid or that the methods outlined are not compatible; it simply is more expensive.
We first need to set up a testing network. For this we use three computers on a private network segment. The systems should be exact replicas of the servers going into production or ones that already exist in the production environment. This, in a simple sense, accounts for the application/Web server and the database server, with the third system being a traffic generator and monitor. These three computers are connected through a hub for testing, because the shared nature of the hub facilitates monitoring network traffic. A better but more expensive solution would replace the hub with a switch and introduce an Ethernet tap into the configuration. The testing network we use, though, is a fairly accurate representation of the network topology that exists in the DMZ or behind the firewall of a live network.
Accurately monitoring the activity of the network and the systems involved in serving the applications requires some software, the first of which is the operating system. In this article, I use Red Hat 7.3, although there are few Red Hat-isms that are specific to these setups and tests. To get the best performance from the server machines, it is a good idea to make sure only the most necessary services are running. On the application server, this list includes Apache and SSH (if necessary); on the database server the list normally includes PostgreSQL and SSH (again, if necessary). As a general preference, I like to make sure all critical services, including Apache, PostgreSQL and the kernel itself are compiled from source. The benefit of doing this is ensuring only the necessary options are activated and nothing extraneous is active and taking up critical memory or processor time.
On the application and database servers, a necessary component that should be included is the sysstat package. This package normally is installed on the default Red Hat installation. For other distributions, the sysmon package can be found here and compiled from source. Sysstat is a good monitoring tool for most activities, as it can display at a glance almost all of the relevant information about a running system, including network activity, system loads and much more. This package works by polling data at specified intervals and is useful for general system monitoring. For our tests, we run sysstat in a more active mode, from the command line--a topic discussed in more depth later in this article.
It is a good idea to be familiar with the tools collected in the sysstat package, especially the sar and sadc programs. The man page for both of these programs provides a wealth of details. One of the limitations of the sysstat package is it has a minimum data sampling duration of one second. In my experience with this type of testing, a one-second sample is adequate for assessing where problems begin to creep into the configuration.
As we move to a different testing tool, we also are moving to a different portion of our testing network, the network itself. One of the best tools for this task is tcpdump. Tcpdump is a general purpose network data collection tool and, like sysstat, is available in binary form for most distributions, as well as in source code from www.tcpdump.org.
About now you may be asking why we are looking at raw network data. On occasion, I have errors be introduced into the communications between servers. For instance, sometimes data packets can become mangled in transit. Raw network data, then, is a great resource to have to refer back to in the event of a problem that cannot be diagnosed easily.
Tcpdump could be an article unto itself due to the depth and complexity of the subject of networking as well as the program itself. Specific usage examples follow in the next section, in which the actual testing procedure is explained. For now, tcpdump should be installed on our traffic generator system.
The last major component we need for our testing is a piece of software named flood, which is written by the Apache Group and available at www.apache.org. Flood still is considered alpha software and, therefore, is not well documented. On-line support also is limited, as few people seem to use it.
To begin, we need to download the flood source. We can get the source from here. A nice and simple document on how to build the flood source can be found there as well. If the Web application to be tested runs over https, reading this document is a must.
In it's most simple form, the method to build the software is:
tar -zxvf flood-0.x.tar.gz cd flood-0.x/ ./buildconf ./configure --disable-shared make all
Flood is executed and run from its source directory using the newly created ./flood executable.
The "./flood" syntax is quite simple. It generally follows the format:
./flood configuration-file > output.file
The configuration file is where the real work and power of flood is revealed, and several example files are provided in the ./examples directory in the flood source. It is a good idea to have a working knowledge of their construction, as well as some knowledge of XML. See Listing 1 for an example configuration file.
The general form of the configuration file is:
<flood> <urllist></urllist> <profile></profile> <farmer></farmer> <farm></farm> <seed></seed> </flood>
The <urllist> is where the specific URLs are placed that flood uses to step through and access the application. Due to the way flood processes these URLs under certain configurations, it is possible to simulate a complete session a visitor may make to the Web application.
The <profile> section is where specifics are set about how the file should be processed as well as which URLs should be used. This section uses several tags to define the behavior of the flood process. They are:
<name> <description> <useurllist> <profiletype> <socket> <report> <verify_resp>
These seven 7 tags are relatively well defined in the configuration file examples. The other main sections--farmer, farm and seed--set the parameters of how many times to run through the list, how often and the seed number for easy test duplication.
A real world note about flood from my own experience: if the application has rigidly defined URLs that reference individual pages, the stock flood report is useful with little modification. If, however, the Web application uses a few pages that refresh depending on variables and change accordingly, as is the case with most dynamic Web applications, flood results can be difficult to use. In the latter case, flood's primary usefulness comes in the scripting of traffic to a test environment for the purpose of simulating traffic. It is important to understand the benefits and the shortcomings of any applications being used; testing a Web application is no different.
|Designing Electronics with Linux||May 22, 2013|
|Dynamic DNS—an Object Lesson in Problem Solving||May 21, 2013|
|Using Salt Stack and Vagrant for Drupal Development||May 20, 2013|
|Making Linux and Android Get Along (It's Not as Hard as It Sounds)||May 16, 2013|
|Drupal Is a Framework: Why Everyone Needs to Understand This||May 15, 2013|
|Home, My Backup Data Center||May 13, 2013|
- Designing Electronics with Linux
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Dynamic DNS—an Object Lesson in Problem Solving
- Using Salt Stack and Vagrant for Drupal Development
- New Products
- What's the tweeting protocol?
- A Topic for Discussion - Open Source Feature-Richness?
- Validate an E-Mail Address with PHP, the Right Way
- Mediated Reality: University of Toronto RWM Project
- Home, My Backup Data Center
1 hour 26 min ago
- Kernel Problem
11 hours 28 min ago
- BASH script to log IPs on public web server
15 hours 55 min ago
19 hours 31 min ago
- Reply to comment | Linux Journal
20 hours 4 min ago
- All the articles you talked
22 hours 27 min ago
- All the articles you talked
22 hours 30 min ago
- All the articles you talked
22 hours 32 min ago
1 day 2 hours ago
- Keeping track of IP address
1 day 4 hours ago
Enter to Win an Adafruit Pi Cobbler Breakout Kit for Raspberry Pi
It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Pi Cobbler Breakout Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- 5-21-13, Prototyping Pi Plate Kit: Philip Kirby
- Next winner announced on 5-27-13!
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?