HOW-TOs

Data Manipulation with Sprog

on July 3, 2008

I don't think I know anyone who relishes the task of data manipulation, and I'm certainly not different. Some of the more complicated manipulations pose a briefly satisfying technical challenge, but in the end, data manipulation is boring. Sometimes, I'm able to import a dataset into OpenOffice's spreadsheet, but usually, I have to write a Perl script or a Bash script to do what needs to be done. Usually, the programs aren't difficult, and I usually have small snippets of code laying around to take care of common tasks. Even so, it's just not... fun.

I happened to come across Sprog a few years ago; at the time I was looking for something completely unrelated, but thanks to the wonders of the Internet, I noticed the Sprog program and investigated it further.

Sprog allows you to solve data manipulation problems by dragging and connecting various gears together to build a “machine.” Sprog provides gears for reading files, fetching and parsing web pages, handling CSV files, running small Perl snippets, and finally displaying or writing the results.

Sprog does have a few software requirements:

GTK+ libraries
libgnomecanvas*
libglade
Perl - preferably version 5.8, along with the following CPAN modules:
Gtk2 Perl bindings
Gnome2::Canvas*
Gtk2::GladeXML
YAML
Pod::Simple

Once these are satisfied, installing Sprog is as:
perl Makefile.PL make install

Finally, you start the program with the sprog command. Sprog presents you with a blank machine canvas.

Figure 1.

As you can see, there is an assortment of gears available and all you have to do is drag them onto the canvas, configure them, and connect them in order. Each gear has an input tab and an output tab, as appropriate. Each tab is keyed so that you can't connect gears together in ways that don't make sense. For example, you can't connect a “Retrieve URL” gear to an “Add Field Names” gear; the gears simply don't “fit” and the results wouldn't make sense. I've started building a simple machine in Figure 2.

Figure 2.

Figure 3.

By right-clicking and selecting Properties for each gear, I can tell the machine which file to open and which pattern to look for. Once all of the gears are connected, as in Figure 3, the machine will read the file, find all of the lines in the file that match the pattern I supplied, convert the line to upper case and display the results in a text window.

As you can see, the gears fit together like puzzle pieces. A machine starts out with an input gear such as a “Read File” gear, or a “Retreive URL” gear. From there, the data flows into the next gear in the machine. Each gear performs a particular function on it's input and passes the results to the next gear. Finally, the data gets to the machine's output gear. Sprog has output gears for displaying the results in a text window, writing to a data file, or piping the results to a command.

Let's consider another rather trivial example. In Figure 4, I've created a machine that takes the out put of the ls -la command and prints it out in tab-delimited format with just the filename and permissions field. Sure, this is a simple task, but it lets us discuss various features of the Sprog machine.

Figure 4.

I configured the “Run Command” gear to run the “ls -la” command. The first “Perl Code” gear simply had this snippet of Perl code in it:
s/\ +/,/g;
This code took the input, from the default ($_) variable, and changed each group of one or more spaces into a single comma, and outputted a CSV data stream.

The “CSV Split” gear accepted that data stream and split it out for use by the “Select Columns” gear. The “Select Columns” gear was configured to select columns 1 and 9 and send the results to the next gear.

The “List to CSV” gear converts the input for use by the “Perl Code” that simply converts the comma to a tab.

Finally, the results are displayed in a text window.

Now sure, I could have coded this up in Perl and been done with it in less than a minute. But, someone who didn't know Perl, or didn't want to learn it would be able to assemble these, and other gears, into a machine that accomplished a given data manipulation goal in a way that they could understand. In fact, it's not too unreasonable to expect to be able to store an entire library of gears and leave it to someone else to assemble them to accomplish a given task. For example, one might consider building a gear that collects the Apache log files from a given server. Then one might create a gear that uses a Perl routine to parse each log entry into CSV
format. From there, a person could assemble a machine to out put reports in almost any format they wanted.

The concept of having ready-made gears aimed at solving common problems implies that we can capture these gears and reuse them. Sprog allows us to save a given machine, of course, but it also allows us to save completely unconnected gears and essentially build a library of “puzzle pieces” that solve various problems. The end user simply assembles the parts needed to solve a given problem. To make this concept more clear, lets look at the content of a saved machine file.

- Sprog
- 1
- run_on_drop: 0
-
- CLASS: Sprog::Gear::CommandIn
ID: 6
NEXT: 14
X: 334
Y: 173
prop:
command: 'ls -la '
title: Run Command
- CLASS: Sprog::Gear::CSVSplit
ID: 11
NEXT: 12
X: 334
Y: 253
prop:
title: CSV Split
- CLASS: Sprog::Gear::TextWindow
ID: 7
NEXT: ~
X: 334
Y: 413
prop:
auto_scroll: ''
clear_on_run: 1
show_end_events: ''
show_start_events: ''
title: Text Window
- CLASS: Sprog::Gear::SelectColumns
ID: 12
NEXT: 10
X: 334
Y: 293
prop:
base: 1
columns: '1,9'
title: Select Columns
- CLASS: Sprog::Gear::PerlCode
ID: 14
NEXT: 11
X: 334
Y: 213
prop:
perl_code: 's/\ +/,/g;'
title: Perl Code
- CLASS: Sprog::Gear::PerlCode
ID: 16
NEXT: 7
X: 334
Y: 373
prop:
perl_code: 's/,/\t/g;'
title: Perl Code
- CLASS: Sprog::Gear::ListToCSV
ID: 10
NEXT: 16
X: 334
Y: 333
prop:
title: List to CSV

Fortunately, the file format is ASCII, and fairly intuitive. Essentially, it defines 7 gears, giving them unique ID numbers. For example, if you look at the gear known as ID 14, you see that it's a “Perl Code” gear and that it executes 's/\ +/,/g;' on it's input. The title of the gear is “Perl Code,” but I'm sure we can come up with something more imaginative, perhaps “Strip
out all spaces and convert them to commas.” Changing the name of a gear is as easy as changing it's title in the saved machine file. The next gear in the sequence is gear 11. So, continuing our thoughts from above, we could create a library of saved gears, then modify the save file so that each gear is well described. Finally, we could load the library in Sprog and assemble
machines to accomplish whatever data manipulation we need to do.

Once a library of gears has been created, we can distribute them to other people to assemble in order to solve recurring problems. It's kind of nice to create tools that other people can use to solve business problems without having to understand how they work. On the other hand, I don't want to rewrite the same snippets of code to solve common problems. As I see it, everybody wins!

I can easily imagine creating Sprog gears that access SQL databases or Apache log files. I can imagine incorporating Perl's filtering capabilities into a gear aimed at analyzing an Apache log file or an e-mail log file. I can even see creating a gear to output a spreadsheet in native Excel format.

After using Sprog for a bit, I've come up with a few hints that will make it easier for you to use. While Sprog does implement a snap-on function, it seems easier to grab a given gear by the “gear” icon and slightly overlap it with the previous gear; you will see it snap into place. Otherwise, it is often difficult to get gears to attach to each other. Also, it seems that Sprog inflicts a strict Top-down approach to solving problems. There is no
branching in Sprog. Sprog implements a series of gears, not a transmission. Finally, I've found that if I make a change to a given gear, I need to re-attach it to the gear before it. I guess this makes sense, but it lead to a lot of initial frustration until I realized what was happening.

I'm not sure I think that Sprog is easier than simply writing a Perl script to perform a particular data manipulation task, but it is certainly a lot more interesting and it's something that can be delegated to the actual consumers of the data and their results, thus empowering them to fulfill their own data manipulation requirements. Not everyone can write in Perl and Sprog is a nice way of empowering people to manipulate data in a transparent and repeatable fashion.

Load Disqus comments

HOW-TOs

Data Manipulation with Sprog

Recent Articles