Are Your Web Services Working Correctly?

by Massimiliano Panichi

In December of 1999, I started to work for an Italian firm as a system administrator. The firm, Infogroup S.p.A., is a provider of banking and financial applications for Italian banks. Its core business is to provide web portals and data sharing for web-trading systems. We get data from the satellite, put it in the databases and present it to internet customers for trading via a web application.

One of the most critical aspects of these activities is time. It matters less to an ordinary bank account holder, but for those who invest money in the stock markets, time is a primary concern. If our web applications hang even for a few seconds, it's a big problem. Time is also important in the development cycle: time to market needs to be short, so not all of the application components may work correctly at launch time.

When I started my job, we had several problems with applications. In particular, the ones using old Oracle Application Server (OAS) were problematic. Two or three times in a day the server hung, forcing a restart. I think everyone can understand that we needed something more robust to build our applications. Even with support from Oracle, we didn't reach the goal of no more than one restart in a day.

Moreover, restarts didn't always execute successfully, especially when the system didn't stop correctly. So our solution was to put two application servers behind a load balancer, for load balancing and high availability. To make this work, we needed an automated control system that could rapidly identify an application server that had stopped working, or an application error in building a web page (e.g., the wrong market data). When one of the application servers hangs, we can restart it while the other server keeps working.

To help with this identification process, I started to think about an application that would periodically perform a series of checks on URLs to alert us in case of problems. I'd previously found that the perfect language for me was Perl. I'd learned it writing some little CGI scripts, and I've enough confidence with it to prefer it to other languages.

The following is the list of necessities we came up with for the identification application:

  • periodically check URLs

  • maintain history data for analysis

  • administer via the Web

  • quickly view all the checks and statuses

This list led to an application made up of three components: the web administrator (whatsdown.admin), the analyzer (whatsdown.robot) and the data viewer and monitor (whatsdown.archive). To build the application, I used a lot of open-source software: Apache, PostgreSQL (for the data repository), Festival (for voice alerts) and Perl (with a lot of modules). These are the tools I used to build the application:

  • Net-DNS-0.12

  • Digest-MD5-2.09

  • GD-1.19

  • HTML-Parser-3.05

  • HTML-Template-1.8

  • libnet-1.0607

  • libwww-perl-5.47

  • MD5-1.7

  • MIME-Base64-2.11

  • Net_SSLeay-1.08

  • SNMP

  • URI-1.04

  • CGI

  • Net::Ping

  • DBI-1.14

  • DBD-Pg

  • DBD-Oracle-1.03

During the development cycle of the application, other functions were added to the URL checks via HTTP and HTTPS: server disk space, UPS status, availability of servers, space on Oracle databases and generic service availability. For every check it is now possible to define the frequency (e.g., the check is run every 4 hours); to define an e-mail address to which alert messages are sent; to define a mobile phone number to alert an administrator via SMS; to disable a server; and to create groups of checks. All checks return a status of OK, ALARM or PROBLEM.

I'm not a developer, and my core work is system administration, so I probably haven't written perfect code. The purpose of this article, however, is to explain how I achieved our goal in a short time frame by using Perl and other useful software I found on the Net. I won't spend time explaining how to use Perl to interact with PostgreSQL, nor how I installed Festival, because these topics are covered abundantly elsewhere.

As for the application itself, it has four functional parts: a web administrator, a web monitor, a robot and an archiver/viewer via the Web. All of these parts were built around a PostgreSQL database, so I wrote a little module to separate the program's business logic from the database. If the application needs the list of active tests to execute, the calling code can initialize the database connection with use Database ; and request the data with my @tests = Database::getAllTests() ;. This function returns an array of hashes containing all the details about the tests.

Web Administration

Software is especially useful when even an occasional user can use it without the support of the developer; easy administration is a must. To help make the application easy to use, then, I created GUIs by using two Perl modules, one to get data from the browser and one to put data into the browser. The first module is CGI, which lets me obtain data passed via forms. Here is a little example of the module, together with the HTML form that posts to the script:

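# myperl.pl: read the value submitted by the form below
use CGI ;
my $form = CGI->new ;
my $form_name = $form->param('name') ;

<!-- HTML form that posts to myperl.pl -->
<FORM METHOD=POST ACTION=myperl.pl>
.... <INPUT TYPE=TEXT NAME=name VALUE=""> ....
</FORM>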

In order to speed up development and maintain the separation between presentation and business logic, I used templates from the Perl module HTML::Template. With this module it is simple to build dynamic content for web applications. All you need to do is build a template with particular tags that the module understands, and pass it the data from your application. For example, if you need to display data in table format you can use code like that in Listing 1. With this in place, building a web administration interface for the application was a quick procedure.

Listing 1

The Web Monitor

To continually check for problems and to go directly to the details, I've grouped all the checks on a single web page. It's a global view of the checks, where every group of checks can have one of three statuses, each with a corresponding color: green for OK, yellow for ALARM and red for PROBLEM. To build the page, the result of every check is stored both in an archive table and in a last-test table. The monitor then reads the last-test table and lists only the latest result of each test in the browser.
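The way a group's color can be derived from the latest results might be sketched as follows. This is only an illustration under stated assumptions: the hash keys group_id and status, the sample rows and the numeric code 1 for the intermediate status are hypothetical (0 and 2 are the values used by the check modules shown later in this article), and a group is simply given the worst status among its tests.

```perl
use strict;
use warnings;

# Numeric status codes: 0 and 2 match the check modules in this article;
# 1 for the intermediate status is an assumption.
my %COLOR = ( 0 => 'green', 1 => 'yellow', 2 => 'red' );

# Return the worst status found among the latest results of each group.
# Each row is a hash ref with the hypothetical keys group_id and status.
sub group_statuses {
    my @last_tests = @_;
    my %worst;
    for my $row (@last_tests) {
        my $g = $row->{group_id};
        $worst{$g} = $row->{status}
            if !defined $worst{$g} || $row->{status} > $worst{$g};
    }
    return %worst;
}

# Made-up rows, as they might be read from the last-test table.
my @rows = (
    { group_id => 'portal', status => 0 },
    { group_id => 'portal', status => 2 },
    { group_id => 'ups',    status => 0 },
);
my %status = group_statuses(@rows);
for my $g (sort keys %status) {
    print "$g: $COLOR{ $status{$g} }\n";
}
```

Taking the worst status per group means a single failed test is enough to turn the whole group red, which is what makes the drill-down click worthwhile.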

In the case of a red alert on a particular group, we can click on the color image and view which test of the group has failed. Not all the tests defined in the database are active at the same time, so the monitor shows only the activated tests. The monitor refreshes its status every n seconds, and I can modify this interval on-line, without restarting the application.

The Web Archiver

In order to analyze the status of an application or a server, it is useful to get all the data about old tests from the database. These old results can be useful for communicating with our clients about expected downtimes of particular applications. Every time our application is down, we have to deal with our clients about the length of downtime. This data can be useful for problem determination and statistics. We also can determine where the most problems occur and concentrate fixes on those areas.

The Robot

The robot is the core of the entire application. Its duties are to:

  • perform all the active checks

  • populate the database with results

  • alert via e-mail

  • alert via SMS

  • alert via a text-to-speech system (Festival)

The robot has to do all these tasks quickly so that we can rapidly restart an application that is not working correctly. The first thing the robot does, then, is retrieve global preferences about the tests (timeouts, retries, etc.) and the list of activated tests. Next it forks a process for every group of tests and, inside it, performs the tests (see Listing 2). The robot then gathers previous statuses from the database and compares them with the current status.

Listing 2

The robot sends alerts only when a status changes, so it reports both new problems and returns to normalcy. Every process forked from the parent process owns a connection to the database and works independently from the others, so there isn't a single process for HTTP tests or for pings. Each forked process can perform all the different tasks. This is important because some tests require more time than others. Think about a URL test and an Oracle database space check. These two tests can require up to 10 seconds in particular cases, so by working with different processes I can obtain the results in parallel. Currently we can perform 200 tests in two minutes on an HP LC3.
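The combination of one process per group and alerts only on status changes could look roughly like the sketch below. It is not the robot's actual code: run_group is a hypothetical stand-in for the real checks, the group names and statuses are made up, and the children report back through pipes rather than through their own database connections, purely to keep the example self-contained.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical stand-in for the real checks: returns a status per group.
sub run_group {
    my ($group) = @_;
    return $group eq 'oracle' ? 2 : 0;
}

# Fork one child per group; each child reports "group status" on a pipe.
sub run_all_groups {
    my @groups = @_;
    my %reader;    # pid => read end of that child's pipe
    for my $group (@groups) {
        pipe(my $rd, my $wr) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                 # child process
            close $rd;
            print {$wr} "$group ", run_group($group), "\n";
            close $wr;                   # flushes the report
            POSIX::_exit(0);             # leave without running parent cleanup
        }
        close $wr;                       # parent keeps only the read end
        $reader{$pid} = $rd;
    }
    my %current;
    for my $pid (keys %reader) {
        my $fh   = $reader{$pid};
        my $line = <$fh>;
        close $fh;
        waitpid($pid, 0);                # reap the child
        next unless defined $line;
        chomp $line;
        my ($group, $status) = split ' ', $line;
        $current{$group} = $status;
    }
    return %current;
}

# Return the groups whose status differs from the previous cycle.
sub changed_groups {
    my ($previous, $current) = @_;
    return grep {
        !defined $previous->{$_} || $previous->{$_} != $current->{$_}
    } sort keys %$current;
}

my %previous = ( http => 0, oracle => 0, ping => 2 );
my %current  = run_all_groups(qw(http oracle ping));
for my $group (changed_groups(\%previous, \%current)) {
    print "ALERT: $group changed from $previous{$group} to $current{$group}\n";
}
```

Because the comparison looks only at differences, a group that recovers (here ping going from 2 to 0) triggers a message just as a new failure does.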

When all the processes have finished their tests, an entire cycle is complete. The robot waits for a moment, then restarts from the beginning: reading application parameters and active tests, forking processes and so on. Because every cycle rereads this information, we can add, delete and modify tests, and even change basic parameters of the application such as timeouts, directly on-line without restarting the application.

The robot uses small external modules to perform the checks, and each test returns the same set of values: test id, test group id, timestamp, time taken to perform the check, status and a descriptive message. I've built six modules for these tests.
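One way to picture this common result is a small wrapper that times any check and packs its outcome into a single record. The sketch below is an assumption about shape only: the hash keys, the run_check helper and the sample values are hypothetical, and the check itself is passed in as a code reference that returns a status and a message.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Run a check (a code ref returning status and message) and wrap the
# outcome in the common record. The hash keys are hypothetical names.
sub run_check {
    my ($test_id, $group_id, $check) = @_;
    my $start = time();
    my ($status, $message) = $check->();
    return {
        test_id   => $test_id,
        group_id  => $group_id,
        timestamp => int($start),
        elapsed   => time() - $start,   # seconds, with sub-second precision
        status    => $status,
        message   => $message,
    };
}

# A dummy check that always succeeds, for illustration.
my $result = run_check(17, 3, sub { return (0, 'page loaded') });
printf "test %d (group %d): status %d in %.3fs, %s\n",
    @{$result}{qw(test_id group_id status elapsed message)};
```

Keeping the record identical across modules means the code that stores results and sends alerts never needs to know which kind of check produced them.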

1. Module WdSys

This is the one module that needs a server counterpart, because it queries the server to read the disk usage. Therefore, I've written a little dæmon in C to install on the server. The agent reports the disk usage through a port and only to one IP address. From the Perl module WdSys, I open the port on the server and get the data on the standard input. This data contains the disk device, the mount point and the usage. You can see the part of the module that connects to the agent in Listing 3. The array @remote_data is where the disk information is read.
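Independently of the listing, the handling of the agent's report might be sketched like this. The line format ("device mountpoint usage%"), the threshold and the helper names are assumptions, since the article does not spell them out, and an in-memory filehandle stands in for the connection to the agent.

```perl
use strict;
use warnings;

# Read the agent's report from a filehandle and parse it.
# Assumed line format (not from the article): "device mountpoint usage%".
sub read_disk_report {
    my ($fh) = @_;
    my @remote_data = <$fh>;
    my @disks;
    for my $line (@remote_data) {
        my ($device, $mount, $usage) = split ' ', $line;
        next unless defined $usage && $usage =~ /^(\d+)%?$/;
        push @disks, { device => $device, mount => $mount, usage => $1 };
    }
    return @disks;
}

# Status 2 if any filesystem exceeds the (hypothetical) threshold, else 0.
sub disk_status {
    my ($threshold, @disks) = @_;
    my $over = grep { $_->{usage} > $threshold } @disks;
    return $over ? 2 : 0;
}

# Made-up sample data; an in-memory filehandle replaces the socket.
my $sample = "/dev/sda1 / 42%\n/dev/sdb1 /u01 97%\n";
open my $fh, '<', \$sample or die "cannot open sample: $!";
my @disks = read_disk_report($fh);
close $fh;
print "status: ", disk_status(90, @disks), "\n";
```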

Listing 3

2. Module WdUps

In our server farm we have three Chloride Silectron UPSes, so we need an alert every time there is a problem on any of them. To poll the status of the UPSes, we installed the SNMP module, a little box that connects to the serial interface and to the network. I then built a little piece of code that retrieves the useful information from the SNMP module, including the battery status and the battery minutes. SNMP is rather complex, so it's beneficial to poll only the data you are interested in, as shown in Listing 4.

Listing 4

3. Module WdOracle

The most important thing for us regarding the Oracle status is the space allocation and, relatedly, the free space in each tablespace. So I've written a module to test whether there are Oracle objects that can no longer allocate another extent. We prefer to control the disk allocation status in this way instead of permitting autoextension of tablespaces. If the test alerts us to a situation like this, we alter the tablespace by allocating another file for it. I use the query shown in Listing 5 to test the space. In order to execute this query, a user is created on the database during configuration. This user can read data from the tables DBA_DATA_FILES, DBA_EXTENTS, DBA_FREE_SPACE, DBA_SEGMENTS and DBA_TABLESPACES. The user is created only if it does not already exist in the database, so we need DBA access only for the creation and not for any further operations.

Listing 5

4. Module WdConn

In order to test a generic network service, I built a module to test IP addresses, ports and protocols. This module permits us to check DNS, FTP, SMTP, POP and so on: all those services for which the simple possibility of connecting guarantees the correct functionality of the service, at least 99.999% of the time. Here is the code for this module:

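use Socket ;
my $check = 0 ;   # stays 0 if the service answers, becomes 2 if not
my $iaddr = inet_aton($hostname) ;
if ( !defined $iaddr ) {
  $check = 2 ;    # the host name could not be resolved
}
else {
  my $paddr = sockaddr_in($port, $iaddr) ;
  my $proto = getprotobyname($protocol) ;
  socket( SOCK, PF_INET, SOCK_STREAM, $proto)
                  or $check = 2 ;
  # try to connect only if the socket was created
  if ( $check == 0 ) {
    connect(SOCK, $paddr) or $check = 2 ;
    close(SOCK) ;
  }
}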
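The low-level code above has no timeout, so a silent host can block the check for a long time. A variant with a timeout, using the core IO::Socket::INET module instead of raw sockets, might look like the following sketch. The check_tcp name and the 5-second timeout are illustrative assumptions, and the demonstration uses a throwaway listener on the loopback interface rather than a real service.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Return 0 if a TCP connection succeeds within $timeout seconds, else 2.
sub check_tcp {
    my ($hostname, $port, $timeout) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $hostname,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => $timeout,
    );
    return 2 unless $sock;
    close $sock;
    return 0;
}

# Demonstration against a throwaway listener on the loopback interface.
my $listener = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    LocalPort => 0,          # let the system pick a free port
    Proto     => 'tcp',
    Listen    => 1,
) or die "cannot listen: $!";
my $port = $listener->sockport;
print "open port:   ", check_tcp('127.0.0.1', $port, 5), "\n";
close $listener;
print "closed port: ", check_tcp('127.0.0.1', $port, 5), "\n";
```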

5. Module WdPing

This module executes a simple ping to the server using the Net::Ping module, which takes the server's address and a timeout:

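use Net::Ping ;
my $check = 0 ;
# "tcp" mode with a 5-second timeout
my $p = Net::Ping->new("tcp", 5) ;
# ping returns a false value if the host is unreachable or unknown
if ( !$p->ping($server) ) {
  $check = 2 ;
}
$p->close() ;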

6. Module WdHttp

Last but not least, here's the module that checks the current state of the web portals. It can accept URLs in HTTP and HTTPS format, and it can log in to the application and test other URLs using the HTTP::Cookies module. See Listing 6 to view the source code for this module. After getting the page for the URL into the $content variable, the module analyzes the HTML in order to check that it contains the correct words and timestamps. It is simple to test for the presence of multiple words with logical functions (and, or and not), thanks to regular expressions.
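The and/or/not word tests can be illustrated with a few regular expressions. The sketch below is independent of the listing and makes its own assumptions: the check_content helper, the three rule names (all, any, none) and the sample page are hypothetical, and matching is done case-insensitively on literal words.

```perl
use strict;
use warnings;

# Check page content against three hypothetical word lists:
#   all  - every word must appear (and)
#   any  - at least one word must appear (or)
#   none - no word may appear (not)
# Returns 0 when the page passes and 2 otherwise.
sub check_content {
    my ($content, %rule) = @_;
    for my $word (@{ $rule{all} || [] }) {
        return 2 unless $content =~ /\Q$word\E/i;
    }
    for my $word (@{ $rule{none} || [] }) {
        return 2 if $content =~ /\Q$word\E/i;
    }
    my @any = @{ $rule{any} || [] };
    if (@any) {
        return 2 unless grep { $content =~ /\Q$_\E/i } @any;
    }
    return 0;
}

# A made-up page body standing in for the fetched $content.
my $content = '<html><body>MIB30 quotes updated 10:42</body></html>';
my $status  = check_content(
    $content,
    all  => ['quotes'],
    any  => ['MIB30', 'Nasdaq'],
    none => ['ORA-', 'Internal Server Error'],
);
print "status: $status\n";
```

The \Q...\E quoting keeps words such as "ORA-" from being interpreted as regular expression syntax.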

Listing 6

Conclusion

With this application we reached our goals of being able to quickly test the functionality of our services, to be alerted when something goes wrong and to manage the checks via a web interface. The next step will be to optimize the speed of alerting and to port the majority of the tests to SNMP mode (with trap functions). Then we'll work on the possibility of restarting services directly from the monitor, as well as adding PDA connectivity to the application.

Resources

Perl: www.perl.com

Perl Modules: www.cpan.org

Festival: www.cstr.ed.ac.uk/projects/festival

"Programming Silence OUT!", LJ July 2001

PostgreSQL: postgresql.org

Massimiliano Panichi has been a UNIX system administrator since 1995 and has a degree in electronic engineering. When he's not working, he's reading books, listening to music or running.
