Let Linux Speak

by David Sugar

“User root is now on-line.” These are words to dread when one is away from the terminal and not logged in anywhere else. But how does one know what is going on with one's machine when not in front of it? If only the machine could tell you. In this article I discuss a tool that enables your machine to do just that.

It all started a year back when, thumbing through one of those odd electronics magazines, I came across an ad for a little speech synthesizer. This device was essentially a low-cost, serial-based text-to-speech synthesizer using the SP0256-AL2 chip. I believe this was the same chip used in the original Mattel “Speak & Spell” toy.

After a couple of months, I thought about it again and decided I just had to have it. Certainly, the price was right (about US $50.00), and serial ports grew on my main Linux machine like branches on a tree. So I ordered one. After a few weeks, I called and was told my order had just been hand-made and would be out in a few days. It is a delight to find hand-made electronics in these modern times—almost like the days when furniture manufacturing involved real craftsmanship.

In any case, the unit arrived as promised, complete with schematics, a disk filled with DOS programs and a thin manual. The disk I have yet to look at; after all, this was for use with a Linux machine. The board slid into a PC slot easily enough. The card uses the PC slot for power only. An RS-232 connector in the back connects to a serial port. A separate stand-alone power unit and case are available for $29.00 more, but the thought of having another power pack to plug in was enough to keep me awake at night. A slot I could afford, though I now foresee the time when I will fill up all eight slots in the machine.

The board has its own built-in speaker and an RCA jack. The RCA jack I quickly adapted to feed the background music (BGM) source on my PBX at home. (Okay, so it's really a Panasonic digital hybrid key system, to be technical, although it has ambitions.) I connected the serial port and got a brief noise as DTR was raised. I shortly learned this was supposed to say “Okay”, but the impedance-matching on the RCA jack was poor.

Next, I changed the stty settings on the port to match the speed I had selected for the device via dip switches, and, with high expectations, I tried a simple test:

echo "Hello, my name is Rochester" >/dev/ttyS2

The monotone response I received back sounded a little like “Hewlo, my name is Rokheestar” and reminded me of my last visit to Atlanta, where they use a deliberately harsh-sounding cybernetic voice on the inter-terminal shuttle trains. Hmmm, maybe it is time to look at the manual, and maybe even that disk...

Several limitations and problems became immediately obvious. The first was that the text-to-speech algorithm handles words only. Numbers are simply spoken as a series of digits; hence 91 becomes “nine one” instead of “ninety-one”. This can be solved with some simple look-up tables and text substitution.
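As an illustration only (this is not the server's actual code), the look-up table approach to numbers might be sketched in Python as follows. The function names and the 0 to 999,999 range are assumptions made for the example; longer digit strings simply fall back to digit-by-digit reading.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 999,999 in English words."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999999 is handled in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        words = ONES[hundreds] + " hundred"
    else:
        thousands, rest = divmod(n, 1000)
        words = number_to_words(thousands) + " thousand"
    return words + (" " + number_to_words(rest) if rest else "")


def speak_numbers(text):
    """Replace each run of digits in text with its spoken form."""
    def spell(match):
        digits = match.group()
        if len(digits) > 6:  # too long for the table: read digit by digit
            return " ".join(ONES[int(d)] for d in digits)
        return number_to_words(int(digits))
    return re.sub(r"\d+", spell, text)


print(speak_numbers("There are 91 users on-line"))
# prints: There are ninety-one users on-line
```

The output of speak_numbers would then be written to the serial port in place of the raw text.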

Second, while the device technically acts as a text-to-phonetic speech device, it offers no special means, such as control or escape sequences, of directly accessing the phonetic elements and sounds it can produce; the text-to-speech code hides them. This limitation can be worked around by using alternate spellings, not necessarily phonetic ones, that steer the internal algorithm toward different phonetic choices. A little experimentation was required to get a good idea of how the device actually translated text to speech.
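The alternate-spelling trick amounts to a word-level substitution table applied before the text reaches the device. The Python sketch below shows the idea; the respellings in the table are invented for illustration and have not been tried on the hardware, so real entries would have to be found by the experimentation described above.

```python
import re

# Hypothetical respellings; real entries must be found by ear on the device.
RESPELL = {
    "rochester": "rotchester",
    "hello": "hallo",
    "linux": "linnucks",
}


def respell(text, table=RESPELL):
    """Swap each word found in the table for its alternate spelling."""
    def swap(match):
        word = match.group()
        # Look words up case-insensitively; unknown words pass through.
        return table.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", swap, text)
```

Calling respell("Hello, my name is Rochester") leaves punctuation and unknown words alone and replaces only the two words listed in the table.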

Since extensive table substitution was now needed, I considered the next logical step: developing a driver as a front end for the device. Ideally, any driver should be able to read straight text the way a person normally would. First, numbers should be pronounced as numbers and not as digits. Similarly, many common numeric constructs used in normal text (currency amounts, standard formatted date and time fields, percentages, telephone numbers and so on) have pronunciation rules I wished to encapsulate and emulate properly. The Internet has its own idioms, like x@y.z, which should be pronounced as “x at y dot z”. I decided to cover all of these, as well as in-line text substitution for correct word pronunciation.
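A few of these rules can be expressed as regular-expression rewrites. The following Python sketch covers only e-mail addresses, simple dollar amounts and percentages; the patterns and function names are assumptions made for the example, not the rules the server actually implements, and the numerals it leaves behind would still need a numbers-to-words pass.

```python
import re


def _address(match):
    """Rewrite x@y.z as 'x at y dot z'."""
    user, host = match.group(1), match.group(2)
    return (user + " at " + host).replace(".", " dot ")


def _dollars(match):
    """Rewrite $1.25 as '1 dollar and 25 cents', $50.00 as '50 dollars'."""
    dollars = int(match.group(1))
    cents = int(match.group(2) or 0)
    spoken = f"{dollars} dollar" + ("" if dollars == 1 else "s")
    if cents:
        spoken += f" and {cents} cent" + ("" if cents == 1 else "s")
    return spoken


def expand_idioms(text):
    """Expand a few written idioms into the words a reader would say."""
    # E-mail addresses first, so their dots are not mistaken for decimals.
    text = re.sub(r"\b([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)", _address, text)
    # Dollar amounts with an optional two-digit cents part.
    text = re.sub(r"\$(\d+)(?:\.(\d\d))?", _dollars, text)
    # Percentages.
    text = re.sub(r"(\d+)%", r"\1 percent", text)
    return text
```

Dates, times and telephone numbers would need further rules of the same shape, each one a pattern plus a small rewriting function.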

In the end, I decided on a server sitting on a TCP socket. The server would accept a connection from the user application on a known port and pronounce any text received according to a reasonable set of rules (as stated above). I added an escape mode to allow for spelling words out and single-digit announcement modes. I could establish a simple telnet session with the server, then test the device by typing text.

The TCP server offered another advantage. The device can service only one application at a time; otherwise, speech from multiple sources would be garbled together. Using a TCP session ensures that the server accepts only one connection and keeps it active until the client closes it. Other client applications block in the listen backlog while waiting for the current application to finish talking. The simplicity of backlogging, compared with lock files, was the reason I chose a full server instead of a task initiated by inetd.
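The one-client-at-a-time behavior falls out of a plain accept loop: the server handles a connection to completion before calling accept again, and the kernel holds later connections in the listen backlog. The Python sketch below shows that pattern under stated assumptions; it is not the original server, the function names are made up for the example, and the text is copied to the device unmodified rather than passed through any pronunciation rules.

```python
import socket


def make_listener(host="127.0.0.1", port=0, backlog=5):
    """Create a listening TCP socket; port 0 lets the OS pick a free port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(backlog)  # later clients queue here until accepted
    return srv


def serve(listener, device, max_clients=None):
    """Serve clients strictly one at a time, copying their text to device.

    device is any binary file-like object, for example the opened serial
    port. max_clients exists only so the loop can end in a demonstration.
    """
    served = 0
    while max_clients is None or served < max_clients:
        conn, _addr = listener.accept()
        with conn:
            while True:
                data = conn.recv(1024)
                if not data:  # client closed: the next one may now speak
                    break
                device.write(data)
                device.flush()
        served += 1
```

A call such as serve(make_listener(port=5150), open("/dev/ttyS2", "wb", buffering=0)) would tie the loop to the serial port; the port number 5150 here is arbitrary and not taken from the original server.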

With the server in place, it was only a matter of time before speech synthesis would pervade other system services. The first use I made of the server was to monitor my BBS system. By connecting it to the user login quota manager, I could have the device announce as users logged in and out. Similarly, the traditional sysop page can be carried over this device.

Eventually I tied the SPO server into my implementation of the wall command and then created other utilities to provide verbal monitoring of my Internet server. Verbal monitoring would watch for and announce new e-mail for me, as well as basic system stats such as uptime and disk usage every hour. As all this speech can be annoying at night, I added a simple muting schedule to the server. Most curious and entertaining is my replacement for shutdown, called simply “down”.
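A muting schedule can be as small as a check of the current time against a quiet window. The Python sketch below is one way to write it; the 22:00 to 07:00 window and the function name are assumptions for the example, since the actual schedule is not given here. The only subtlety is a window that wraps past midnight.

```python
from datetime import time


def is_muted(now, start=time(22, 0), end=time(7, 0)):
    """Return True if the time of day now falls inside the quiet window.

    The window runs from start (inclusive) to end (exclusive) and may
    wrap past midnight, as the default 22:00 to 07:00 window does.
    """
    if start <= end:
        return start <= now < end
    # Wrapped window: muted late in the evening or early in the morning.
    return now >= start or now < end
```

A server would call is_muted(datetime.now().time()) before each announcement and silently drop the text while it returns True.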

For system monitoring, the speech device has proven to be quite a useful tool—not a nuisance. The server was developed for the ability to read written text and properly pronounce common usages and conventions, and while I use this capability minimally, others might have more occasion for it. The pronunciation dictionary can be expanded as needed to cover a wider range of words as they are identified in everyday use.

One use for the device which was suggested to me is as a screen reader for visually-impaired computer users. Another application I am looking at is in parking incoming phone calls and paging or announcing calls through the telephone system. I have often wished the board included a DTMF tone generator and a SLIC (subscriber line interface circuit), so I may look at modifying the schematics provided.

The SP0256-AL2 text-to-speech board described here may be purchased through B.G. Micro, P.O. Box 280298, Dallas, TX 75228, (214) 271-5546. The Computalker lists for around $50.00 (U.S.) as a PC card or $80.00 (U.S.) stand-alone with a power adapter. Chips are available separately, and I believe the Computalker may be purchased in kit form.

While the SPO is serial-based and can be used on almost any machine or OS, I originally obtained it for use on my main server, which runs Linux. For this reason, the speech server was developed and tested under Linux. The server was originally developed using libraries and part of the code base of my BBS package, so these are included as part of the published source. I am working on a more portable public source implementation that should be more easily and widely compatible with non-Linux systems as well. I must go now, as I am being paged...

Code for the Synthesizer

David Sugar is best known for WorldVU, a public BBS system for Linux. He is currently employed as director of software engineering for Fortran Corp. and uses Linux for commercial telephony development. He maintains his own Internet server under Linux.
