xldlas—A Program for Statistics
Linux is a virtually unparalleled platform for using freely distributable software. The kernel source is free, the standard utilities are free and so is the X Window System. The whole concept of using free software is incredibly appealing, and many users are tempted to try running their systems without any commercial products whatsoever. Yet this desire is often thwarted by a single missing application; desktop publishing and presentation software are commonly cited as current “holes” in the Linux arsenal.
I was faced with such a problem when I decided to abandon the MS-DOS partition on my hard drive and go all Linux. Since I work with a fair amount of statistical information, I needed a straightforward way to summarize data, plot it and perform regression as needed. gnuplot is great for plotting, but that's all it does. Octave and MuPAD have powerful numerical features, but they are overkill for simple statistical chores. Unable to find a program that fit this niche, I decided to write one. The result is xldlas, a program for statistics. In the grand Unix tradition, its name is a pseudo-acronym which stands for “x lies, damned lies, and statistics.” The first public release in October 1996 met with quite positive feedback from users, and one of the beta testers (Hans Zoebelein) suggested an article in Linux Journal might be a good way to introduce xldlas to a wider audience. The people at LJ agreed and asked me to write this overview. The program runs under the X Window System and is built using the XForms library. You'll find information on how to download xldlas and associated software at the end of this article.
The philosophy behind xldlas is to offer standard statistical tools via an easy-to-use point-and-click interface. To facilitate this approach, common commands are grouped together into a set of menus. In addition, frequently used commands are available via buttons (see Figure 1).
Like most statistics packages, xldlas handles a random variable as a vector of values. So a single variable name can refer to dozens, hundreds or thousands of observations.* By grouping data points together under variable names, it is easy to perform relatively complex operations by selecting a few variables and clicking on the relevant command.
*By default, xldlas has a limit of 100 variables of 10,000 observations each. These constraints can easily be adjusted by changing the values for MAX_VARS and MAX_OBS in the source code file xldlas.h.
Of course, before you can perform any kind of statistical operations, you have to get data into xldlas. Since ASCII is the de facto standard for exchanging information under Linux, xldlas allows you to read in space-delimited data from a text file by using the Import command. You supply a file name, and tell xldlas whether the data is in column or row format. The import routine automatically figures out how many variables and observations there are, and reads in the data. To take a concrete example, suppose you have a file which contains space-delimited data on rainfall, temperature and barometric pressure for a single location. After importing this file, xldlas will have three variables in memory, which will be called unknown0, unknown1 and unknown2. You can change these names to anything you like using the Rename command, which is accessible from the Data menu.
In addition to this simple ASCII format, xldlas can read and write sets of data in its own proprietary file format. By convention, these files have an .lda extension. Since variable names, descriptions and other useful information are stored in these files, it's generally a good idea to save all your data this way if you plan on using xldlas frequently. The Load, Save and Import commands can all be found in the File menu.
To input data by hand, erase variables or perform any kind of editing, there are a number of related commands grouped together in the Data menu. Of these, the most frequently used is probably the Describe command, which generates a table in the main xldlas window showing you the name, number of observations, and a description of every variable currently in memory. In addition to changing observation values, the Edit command can also be used to enter a description for a variable.
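To make the Import step concrete, here is a minimal Python sketch of how space-delimited text in column or row layout could be turned into named variables. It is an illustration only, not xldlas's own code (which is written in C); the function name import_ascii, the layout argument and the sample numbers are assumptions made for this example.

```python
def import_ascii(text, layout="column"):
    """Parse space-delimited numbers into named variables.

    layout="column": each column of the file is one variable.
    layout="row":    each row of the file is one variable.
    """
    rows = [[float(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]
    if not rows:
        return {}
    if layout == "column":
        width = len(rows[0])
        if any(len(r) != width for r in rows):
            raise ValueError("rows have differing numbers of columns")
        vectors = [list(col) for col in zip(*rows)]
    elif layout == "row":
        vectors = rows
    else:
        raise ValueError("layout must be 'column' or 'row'")
    # Default names follow the unknown0, unknown1, ... scheme described above.
    return {f"unknown{i}": v for i, v in enumerate(vectors)}


# Made-up rainfall, temperature and pressure readings, one observation per line.
sample = """12.5 18.2 1012.0
0.0 21.4 1015.5
3.1 19.9 1009.8
"""
data = import_ascii(sample, layout="column")

# Equivalent of the Rename command: give the first variable a real name.
data["rainfall"] = data.pop("unknown0")
```

The number of variables and observations is never passed in; it falls out of the shape of the text, which is the same convenience the Import command offers.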
Another frequently used item in the Data menu is the Generate command. This routine allows you to perform mathematical transformations on existing data. To continue with the weather example from above, suppose we want to convert our rainfall variable from millimeters to centimeters. With a few clicks of the mouse, we can easily accomplish this task. We could also add some random noise, find the log of the data, or what have you. It's a far cry from Mathematica, but for simple operations the Generate command is quick and easy to use.
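The kind of transformation Generate performs can be sketched in a few lines of Python. This is only an illustration of the idea of applying one formula to every observation of a variable; the helper name generate, the sample values and the noise level are assumptions, and xldlas's own formula syntax is not shown here.

```python
import math
import random


def generate(source, transform):
    """Return a new variable by applying transform to every observation."""
    return [transform(x) for x in source]


rainfall_mm = [12.5, 0.0, 3.1, 40.2]   # made-up observations

# Unit conversion: there are 10 millimeters in a centimeter.
rainfall_cm = generate(rainfall_mm, lambda x: x / 10.0)

# Log transform. log(0) is undefined, so log1p computes log(1 + x) instead.
rainfall_log = generate(rainfall_mm, math.log1p)

# Additive Gaussian noise, seeded so that repeated runs give the same values.
rng = random.Random(0)
rainfall_noisy = generate(rainfall_mm, lambda x: x + rng.gauss(0.0, 0.5))
```

Each call leaves the original variable untouched and produces a new one of the same length, so the raw data stays available for later analysis.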
Once you have your data loaded, edited and transformed, the next logical step is to perform some kind of statistical work on it. To get a tabular summary of a single variable, including mean, variance, skewness and kurtosis, there's the Summarize command. If you want to check multiple variables for linear relationships, the Correlation command will produce a table of Pearson coefficients. Similarly, the ANOVA command lets you perform one-way and two-way analyses of variance by simply selecting variable names with your mouse and clicking the Go button.
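For readers who want to see what sits behind such tables, the following Python sketch computes the same kinds of quantities. It rests on assumptions: xldlas may use different conventions (for example, dividing by n rather than n - 1, or reporting raw rather than excess kurtosis), and the function names are invented for this example.

```python
import math


def summarize(x):
    """Mean, sample variance, skewness and excess kurtosis of one variable."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in x) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in x) / n   # fourth central moment
    return {
        "n": n,
        "mean": mean,
        "variance": m2 * n / (n - 1),     # unbiased (n - 1) estimator
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,   # zero for a normal distribution
    }


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


summary = summarize([1, 2, 3, 4, 5])
r = pearson([1, 2, 3], [2, 4, 6])
```

A correlation table is simply pearson applied to every pair of selected variables.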
The workhorse of statistical techniques, ordinary least squares regression, is available via the Regress command. Just select a single variable from the dependent browser, any number from the independent browser, and press Go. If you want to store fitted values, you can enter a new variable name in the regression window. The output of the regression command is a set of three tables, which summarize the fit of the regression, break down the sum of squared deviations and list coefficient estimates. Relevant t-statistics and their associated probabilities are automatically included, as are the F statistic and significance level for a joint test of all the estimates.
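The arithmetic behind those regression tables can be sketched in plain Python. This is a teaching sketch under stated assumptions, not xldlas's implementation: it solves the normal equations by Gaussian elimination, always includes a constant term, and leaves out the probabilities attached to the t and F statistics because they require distribution functions that the standard library does not provide. The function names and sample data are invented for the example.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(y, regressors):
    """Regress y on a constant plus the given list of regressor vectors."""
    n = len(y)
    cols = [[1.0] * n] + [list(r) for r in regressors]   # design matrix columns
    k = len(cols)
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(beta[j] * cols[j][i] for j in range(k)) for i in range(n)]
    ybar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual sum of squares
    sst = sum((yi - ybar) ** 2 for yi in y)                  # total sum of squares
    ssr = sst - sse                                          # explained sum of squares
    sigma2 = sse / (n - k)                                   # residual variance
    # Standard errors come from the diagonal of sigma2 * inverse(X'X).
    se = []
    for j in range(k):
        unit = [1.0 if i == j else 0.0 for i in range(k)]
        se.append((sigma2 * solve(xtx, unit)[j]) ** 0.5)
    t_stats = [b / s if s > 0 else float("inf") for b, s in zip(beta, se)]
    f_stat = (ssr / (k - 1)) / sigma2 if sigma2 > 0 else float("inf")
    return {"beta": beta, "fitted": fitted, "r2": ssr / sst,
            "sse": sse, "ssr": ssr, "sst": sst,
            "se": se, "t": t_stats, "f": f_stat}


x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
result = ols(y, [x])   # result["beta"] holds the intercept, then the slope
```

The three groups of values in the returned dictionary correspond loosely to the three tables: the overall fit (r2, f), the sum of squares breakdown (sst, ssr, sse) and the coefficient estimates with their standard errors and t-statistics.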
xldlas also offers two experimental data fitting routines that use techniques borrowed from artificial intelligence. The first, GA Fit, uses genetic algorithms to build a fit equation that minimizes the sum of squared differences between fitted values and actual observations of a given dependent variable. The second, NN Fit, creates a back-propagation neural network using selected independent variables for the input layer, and a single dependent variable for the output layer. In both cases, the fitted values from these techniques can be stored under a supplied variable name. These routines are sometimes useful for exploring non-linear relationships in data that are generally difficult to examine using standard OLS regression.*
*Although not part of the “standard” statistical toolkit, these sorts of AI techniques are becoming increasingly common in various contexts and are great for data mining. Their implementations in xldlas are fairly rudimentary, but more sophisticated modifications are likely if users request them.
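To give a feel for the genetic algorithm approach, here is a deliberately tiny Python sketch. It is not how GA Fit is implemented: the form of the fit equation (a straight line, a + b*x), the population size, the mutation scale and all the names are assumptions chosen to keep the example short. What it does share with the description above is the objective, namely the sum of squared differences between fitted and observed values.

```python
import random


def sse(coeffs, xs, ys):
    """Sum of squared differences between a + b*x and the observed y."""
    a, b = coeffs
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def ga_fit(xs, ys, pop_size=40, generations=200, seed=0):
    """Evolve (a, b) pairs toward a small sum of squares."""
    rng = random.Random(seed)
    pop = [(rng.uniform(-10, 10), rng.uniform(-10, 10)) for _ in range(pop_size)]
    history = []   # best sum of squares seen at the start of each generation
    for _ in range(generations):
        pop.sort(key=lambda c: sse(c, xs, ys))
        history.append(sse(pop[0], xs, ys))
        survivors = pop[: pop_size // 2]   # elitism: the best half survives
        children = []
        while len(survivors) + len(children) < pop_size:
            p, q = rng.sample(survivors, 2)
            # Uniform crossover: each coefficient comes from one parent or the other.
            child = tuple(rng.choice(pair) for pair in zip(p, q))
            # Mutation: a small random nudge to every coefficient.
            child = tuple(g + rng.gauss(0.0, 0.1) for g in child)
            children.append(child)
        pop = survivors + children
    best = min(pop, key=lambda c: sse(c, xs, ys))
    return best, history


xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]   # generated from y = 1 + 2x
best, history = ga_fit(xs, ys)
```

Because the best half of each generation is carried over unchanged, the best sum of squares can never get worse from one generation to the next. For a straight line this is a roundabout way to reproduce OLS, but the same loop works with any equation whose coefficients can be scored.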
In addition to manipulating data and performing analysis, xldlas allows you to graph variables. All of xldlas's graphical output is actually performed by gnuplot, an application which is included in all major Linux distributions. Two graphing commands are implemented: Plot and Histogram. The former lets you create line and scatter plots, while the latter generates a histogram describing a variable's distribution. Both sorts of graphs can be titled and labeled, and they can be saved in any format supported by whatever version of gnuplot is installed on your system. In addition, you can set point and line styles, and the Histogram routine includes an optional feature which will superimpose a normal distribution with the same mean and variance as the data being graphed.
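The numbers behind a histogram with a superimposed normal curve are easy to compute by hand. The Python sketch below shows one way to do it, assuming equal-width bins and a normal density scaled to match raw counts; the bin count, the scaling and the function name are assumptions for illustration, and the actual drawing (which xldlas hands off to gnuplot) is left out.

```python
import math


def histogram(x, bins=10):
    """Return (bin_centers, counts, normal_curve) for an equal-width histogram.

    Assumes x contains at least two distinct values.
    """
    n = len(x)
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in x:
        # The maximum value belongs in the last bin, not in a bin of its own.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / (n - 1)
    # Normal density with the data's mean and variance, scaled by n * width
    # so that its height is comparable to the raw bin counts.
    curve = [n * width * math.exp(-(c - mean) ** 2 / (2 * var))
             / math.sqrt(2 * math.pi * var) for c in centers]
    return centers, counts, curve


centers, counts, curve = histogram([1, 2, 2, 3, 3, 3, 4, 4, 5], bins=4)
```

Writing centers, counts and curve out as three space-delimited columns gives a data file that a plotting program can draw as boxes with a line on top.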
xldlas also provides fairly powerful logging facilities. The Log command allows you to echo all of xldlas's output to an ASCII file. A more powerful tool is the TeXLog command, which allows you to create a plain TeX log file with a user-supplied name. All subsequent output, such as regression tables, is written to this file in TeX format. Under xldlas's default configuration, all saved graphs are also included as Encapsulated PostScript insertions. This makes writing statistical papers (such as homework assignments) quite fast and efficient, since much of the time-consuming TeX markup is done automatically.
Finally, all xldlas commands are documented on-line in the Help menu. There are also a number of on-line tutorials, which many users of xldlas have found to be a very useful introduction.