Stress Testing an Apache Application Server in a Real World Environment
We've all had an experience in which the software is installed on the servers, the network is connected and the application is running. Naturally, the next step is to think, "I wonder how much traffic this system can support?" Sometimes the question lingers and sometimes it passes, but it always presents itself. So, how do we figure out how much traffic our server and application can handle? Can it handle only a few active clients or can it withstand a proper Slashdotting? To appreciate fully the challenges one faces in trying to answer these questions, we must first understand the dynamic application and how it works.
A traditional dynamic application has five main components: the application server, the database server, the application, the database and the network. In the open-source world, the application server usually is Apache. And, often, Apache is running on Linux.
The database server can be almost anything that can do the job; for most smaller applications, this tends to be MySQL. In this article, I highlight the open-source PostgreSQL server, which also runs on Linux.
The application itself can be almost anything that fits the project requirements. Sometimes it makes sense to use Perl, sometimes PHP, sometimes Java. It is beyond the scope of this article to weigh the benefits or liabilities of a particular platform, but a firm understanding of the best tool for the job is necessary to plan properly for adequate performance in a running application.
The database itself can mean the difference between a maximum load of one user and 5,000 users. A bad schema can be the death of an application, while a good schema can make up for a multitude of other shortcomings.
The network tends to be the forgotten part of the equation, but it can be as detrimental as bad application code or a bad schema. A noisy network can slow intra-server communications dramatically. It also can introduce errors and other unknowns into communications that, in turn, have unknown results on the running code.
As you have probably guessed, finding where our optimal performance lies and pushing those limits is more than a minor challenge. Like the Formula One race car that runs with almost absolute technical efficiency, the five main components of the Web-based application determine whether the system can handle its load optimally. By looking at those components and measuring how they react under certain circumstances, we can use that data to better tune the system as a whole.
To begin the testing, we need to create an environment that facilitates micro-management of the five components. Because most enterprise-class applications are based on large proprietary hardware configurations, setting up a testing configuration often is prohibitive in cost. One of the advantages of the open-source model, however, is that many configurations are based on commodity hardware. The commodity hardware configuration, therefore, is the basic assumption used throughout the testing setup. This is not to say that a setup based on large proprietary hardware is not as valid or that the methods outlined are not compatible; it simply is more expensive.
We first need to set up a testing network. For this we use three computers on a private network segment. The systems should be exact replicas of the servers going into production or ones that already exist in the production environment. This, in a simple sense, accounts for the application/Web server and the database server, with the third system being a traffic generator and monitor. These three computers are connected through a hub for testing, because the shared nature of the hub facilitates monitoring network traffic. A better but more expensive solution would replace the hub with a switch and introduce an Ethernet tap into the configuration. The testing network we use, though, is a fairly accurate representation of the network topology that exists in the DMZ or behind the firewall of a live network.
Accurately monitoring the activity of the network and the systems involved in serving the applications requires some software, the first of which is the operating system. In this article, I use Red Hat 7.3, although there are few Red Hat-isms that are specific to these setups and tests. To get the best performance from the server machines, it is a good idea to make sure only the most necessary services are running. On the application server, this list includes Apache and SSH (if necessary); on the database server, the list normally includes PostgreSQL and SSH (again, if necessary). As a general preference, I like to make sure all critical services, including Apache, PostgreSQL and the kernel itself, are compiled from source. The benefit of doing this is ensuring that only the necessary options are activated and nothing extraneous is taking up critical memory or processor time.
On the application and database servers, a necessary component that should be included is the sysstat package. This package normally is installed on the default Red Hat installation. For other distributions, the sysstat package can be downloaded and compiled from source. Sysstat is a good monitoring tool for most activities, as it can display at a glance almost all of the relevant information about a running system, including network activity, system loads and much more. This package works by polling data at specified intervals and is useful for general system monitoring. For our tests, we run sysstat in a more active mode, from the command line--a topic discussed in more depth later in this article.
It is a good idea to be familiar with the tools collected in the sysstat package, especially the sar and sadc programs. The man pages for both programs provide a wealth of details. One limitation of the sysstat package is that its minimum sampling interval is one second. In my experience with this type of testing, a one-second sample is adequate for assessing where problems begin to creep into the configuration.
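As a rough sketch of what one-second sampling from the command line might look like, the following helper starts four sar collectors in parallel and waits for them to finish. The function name, the output directory and the default duration are assumptions made for illustration; only the sar options themselves (-u, -r, -q and -n DEV, followed by an interval and a count) come from the sysstat man page.

```shell
#!/bin/sh
# Sketch only: OUTDIR, DURATION and collect_sar are placeholder names.
OUTDIR=${OUTDIR:-./sar-logs}
DURATION=${DURATION:-60}   # number of one-second samples to take

collect_sar() {
    mkdir -p "$OUTDIR"
    # One sample per second, the smallest interval sar accepts.
    sar -u 1 "$DURATION" > "$OUTDIR/cpu.log" &      # CPU utilization
    sar -r 1 "$DURATION" > "$OUTDIR/mem.log" &      # memory usage
    sar -q 1 "$DURATION" > "$OUTDIR/load.log" &     # run queue and load averages
    sar -n DEV 1 "$DURATION" > "$OUTDIR/net.log" &  # per-interface network traffic
    wait    # block until all four collectors have finished
}
```

Calling collect_sar on the application and database servers just before a test run would leave one log per metric in the output directory, which makes it easier to line the samples up against each other afterward.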
As we move to a different testing tool, we also are moving to a different portion of our testing network, the network itself. One of the best tools for this task is tcpdump. Tcpdump is a general purpose network data collection tool and, like sysstat, is available in binary form for most distributions, as well as in source code from www.tcpdump.org.
About now you may be asking why we are looking at raw network data. On occasion, I have seen errors introduced into the communications between servers. For instance, data packets sometimes become mangled in transit. Raw network data, then, is a great resource to refer back to in the event of a problem that cannot be diagnosed easily.
Tcpdump could be an article unto itself due to the depth and complexity of the subject of networking as well as the program itself. Specific usage examples follow in the next section, in which the actual testing procedure is explained. For now, tcpdump should be installed on our traffic generator system.
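As a small illustration of the kind of capture meant here, the sketch below saves raw packets to a file and reads them back later. The interface name, the application server address, the file name and both function names are placeholders assumed for this example; the tcpdump options used (-i, -n, -s, -w and -r) are documented in its man page.

```shell
#!/bin/sh
# Sketch only: IFACE, APP_HOST and CAPTURE are placeholder values.
IFACE=${IFACE:-eth0}
APP_HOST=${APP_HOST:-192.168.1.10}
CAPTURE=${CAPTURE:-capture.pcap}

capture_traffic() {
    # -n: skip name resolution; -s 0: keep whole packets;
    # -w: write raw packets to a file instead of printing them.
    tcpdump -i "$IFACE" -n -s 0 -w "$CAPTURE" host "$APP_HOST"
}

review_capture() {
    # -r: read packets back from a previously saved file.
    tcpdump -n -r "$CAPTURE"
}
```

Capturing normally requires root privileges, and capture_traffic keeps running until it is interrupted, so it would be started before the traffic generator and stopped once the run is over.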
The last major component we need for our testing is a piece of software named flood, which is written by the Apache Group and available at www.apache.org. Flood still is considered alpha software and, therefore, is not well documented. On-line support also is limited, as few people seem to use it.
To begin, we need to download the flood source from the Apache site. A nice and simple document on how to build flood can be found there as well. If the Web application to be tested runs over https, reading this document is a must.
In its simplest form, the method to build the software is:
    tar -zxvf flood-0.x.tar.gz
    cd flood-0.x/
    ./buildconf
    ./configure --disable-shared
    make all
Flood is run from its source directory using the newly created ./flood executable.
The "./flood" syntax is quite simple. It generally follows the format:
./flood configuration-file > output.file
The configuration file is where the real work and power of flood are revealed, and several example files are provided in the ./examples directory in the flood source. It is a good idea to have a working knowledge of their construction, as well as some knowledge of XML. See Listing 1 for an example configuration file.
The general form of the configuration file is:
    <flood>
      <urllist></urllist>
      <profile></profile>
      <farmer></farmer>
      <farm></farm>
      <seed></seed>
    </flood>
The <urllist> is where the specific URLs are placed that flood uses to step through and access the application. Due to the way flood processes these URLs under certain configurations, it is possible to simulate a complete session a visitor may make to the Web application.
The <profile> section is where specifics are set about how the file should be processed as well as which URLs should be used. This section uses several tags to define the behavior of the flood process. They are:
    <name>
    <description>
    <useurllist>
    <profiletype>
    <socket>
    <report>
    <verify_resp>
These seven tags are relatively well defined in the configuration file examples. The other main sections--farmer, farm and seed--set the parameters of how many times to run through the list, how often and the seed number for easy test duplication.
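To make the sections described above more concrete, here is a minimal sketch that writes a small configuration to a file. It is modeled on the round-robin example in the flood source rather than on any particular production setup: the host name, the file name, the function name and every name value are placeholders, and the values given for profiletype, socket, verify_resp and report are assumptions that should be checked against the files in ./examples before use.

```shell
#!/bin/sh
# Sketch only: CONFIG, the host name and all <name> values are placeholders.
CONFIG=${CONFIG:-./stress-test.xml}

write_flood_config() {
    cat > "$CONFIG" <<'EOF'
<?xml version="1.0"?>
<flood configversion="1">
  <!-- URLs are requested in order, approximating one visitor session -->
  <urllist>
    <name>Test URLs</name>
    <description>Pages a visitor might step through</description>
    <url>http://app.example.test/</url>
    <url>http://app.example.test/login</url>
    <url>http://app.example.test/report</url>
  </urllist>
  <!-- The seven profile tags; values assumed from the shipped examples -->
  <profile>
    <name>Test Profile</name>
    <description>Round-robin pass over the URL list</description>
    <useurllist>Test URLs</useurllist>
    <profiletype>round_robin</profiletype>
    <socket>generic</socket>
    <verify_resp>verify_200</verify_resp>
    <report>relative_times</report>
  </profile>
  <!-- Each farmer runs through the profile this many times -->
  <farmer>
    <name>Joe</name>
    <count>5</count>
    <useprofile>Test Profile</useprofile>
  </farmer>
  <!-- The farm decides how many farmers run -->
  <farm>
    <name>Bingo</name>
    <usefarmer count="5">Joe</usefarmer>
  </farm>
  <!-- A fixed seed makes a run repeatable -->
  <seed>1</seed>
</flood>
EOF
}
```

After calling write_flood_config, the resulting file could be passed to flood in the usual way, for example ./flood ./stress-test.xml > output.file.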
A real-world note about flood from my own experience: if the application has rigidly defined URLs that reference individual pages, the stock flood report is useful with little modification. If, however, the Web application uses a few pages that refresh depending on variables and change accordingly, as is the case with most dynamic Web applications, flood results can be difficult to use. In the latter case, flood's primary usefulness comes in the scripting of traffic to a test environment for the purpose of simulating traffic. It is important to understand the benefits and the shortcomings of any applications being used; testing a Web application is no different.