Published on Linux Journal (http://www.linuxjournal.com)
Play Ball: Introducing Fungoes
By Mat Kovach
Created 2006-04-20 01:00

When I was growing up and didn't have a car, I always was involved in two activities--working on my four-seam fastball from my Luis Tiant wind-up or peeking and poking with BASIC. Now that I am older, I have moved beyond BASIC and dived into baseball statistics, perhaps because I never got a good feel for a curveball. Nearly all of my statistical investigations have used open-source tools. I have tested the limits of OpenOffice.org and Gnumeric. I have put numbers produced from g77-compiled programs into Perl and TCL, and I have created graphs with gnuplot. Until now, I have kept my work and the strange collection of tools I use to myself.

How un-open source of me.

Around 1856, ex-cricket reporter Henry Chadwick started playing around with box scores and numerical representations of a baseball game. Since then, box scores have become a free data source for fans, a public recording of the facts. Retrosheet [1] is an excellent place to find old box scores and download the data. Although Retrosheet offers DOS tools on its site, the open-source tool Chadwick [2] also can be used to work with the data.

More recently, Bill James looked at baseball stats in different ways and wrapped them in a different philosophy. He was not the first one to do so, however. Branch Rickey, who also brought a guy named Jackie Robinson to the big leagues, used statistical analysis when he was general manager of the Brooklyn Dodgers. James' work grew into "sabermetrics". Sabermetrics, now taught and used in college courses, has its proponents and opponents. This parallels, in many ways, GNU and free software.

With Linux, Linus Torvalds arguably brought the first serious attention to free and open-source software. Billy Beane, General Manager of the Oakland A's and central figure of the book Moneyball, is the technical architect who helped bring sabermetrics into the public eye. Whereas Linus used free and open-source software to build Linux, Billy Beane used sabermetrics to build the Oakland A's, with equally successful results.

Success has built slowly around sabermetrics, and many teams have come to integrate it into their decision-making. The Cleveland Indians, with Mark Shapiro as the General Manager, rely on statistical analysis in drafting, signing and trading players. Theo Epstein helped lead the Boston Red Sox to a championship, with consultant Bill James, using sabermetrics in player decisions. Likewise, Linux and free and open-source software have integrated themselves slowly into larger companies and their decision-making processes.

Baseball Prospectus [3] is a Web site that publishes daily articles about baseball, using sabermetrics as the foundation. Baseball Prospectus also publishes a book or two throughout the year. The site relates to sabermetrics in the same way that Linux Journal relates to Linux.

In February of this year, the good folks at O'Reilly published a book called Baseball Hacks, penned by Joseph Adler. Although not specifically about open-source tools--Excel and Access get some ink--most of the tools he discusses to collect baseball stats and mine the data are open source. He devotes a good chunk of the book to using MySQL and The R Project--a great project too few people have heard about--so people can fulfill their need for baseball statistical analysis. He also includes a few great sections on collecting live data, including pulling box score information off Major League Baseball's Web site [4] and shoving the information into a database.

Baseball Hacks removed my final excuse for not moving much of my baseball and football stats work into an open-source project. So I found acceptable replacements for my greenies, passed my drug test and started Fungoes [5].

Fungoes will have several parts, including a display of historical statistics. The site will offer different kinds of stats and allow people to sort and filter on various ones. Users will be able to find out who hit the most home runs, as well as who hit the fewest.


baseball=# select name,yearid,ab,hr
baseball-# from batting_career_totals
baseball-# where ab > 3000 and debut > 1900
baseball-# order by hr asc
baseball-# limit 5;

      name       | yearid |  ab  | hr
-----------------+--------+------+----
 Duane Kuiper    |     12 | 3379 |  1
 Bill Bergen     |     11 | 3028 |  2
 Al Bridwell     |     11 | 4169 |  2
 Johnny Cooney   |     20 | 3372 |  2
 Frank Taveras   |     11 | 4043 |  2

Note: the query above is the one that finally forced me to start this project. I was writing an article called "Ironic Announcers for MLB's Home Run Derby" [6] while researching a book I hope to write someday. I could not find a good way to include any statistical information beyond cutting and pasting.
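For readers who want to experiment with a query like this one without a PostgreSQL instance, here is a minimal sketch using Python's standard sqlite3 module. The article does not define batting_career_totals, so the view below is an assumption: it simply sums per-season rows from a toy batting table. The column name seasons and the sample rows are hypothetical, not real statistics, and the debut filter from the original query is left out.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a toy batting table and a totals view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE batting (name TEXT, yearid INTEGER, ab INTEGER, hr INTEGER);

        -- Assumed definition: one row per player with summed counting stats.
        CREATE VIEW batting_career_totals AS
            SELECT name,
                   COUNT(DISTINCT yearid) AS seasons,
                   SUM(ab) AS ab,
                   SUM(hr) AS hr
            FROM batting
            GROUP BY name;
        """
    )
    # Made-up sample rows, for illustration only.
    rows = [
        ("Player A", 1990, 500, 0),
        ("Player A", 1991, 520, 1),
        ("Player B", 1990, 600, 30),
        ("Player B", 1991, 610, 25),
        ("Player C", 1990, 100, 0),
    ]
    conn.executemany("INSERT INTO batting VALUES (?, ?, ?, ?)", rows)
    return conn


def fewest_home_runs(conn, min_ab, limit):
    """Return (name, seasons, ab, hr) rows, fewest career home runs first."""
    return conn.execute(
        """
        SELECT name, seasons, ab, hr
        FROM batting_career_totals
        WHERE ab > ?
        ORDER BY hr ASC, name ASC
        LIMIT ?
        """,
        (min_ab, limit),
    ).fetchall()


if __name__ == "__main__":
    for row in fewest_home_runs(build_demo_db(), min_ab=1000, limit=5):
        print(row)
```

The at-bat threshold keeps part-time players, such as the hypothetical Player C, from crowding the bottom of the list, which is the same job the ab > 3000 condition does in the query above.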

The Fungoes project offers many interesting challenges for me. Many statistics will need to be sorted and calculated. The range for sorting, the number of columns and the filtering possibilities will test my humble programming and database skills. In addition, because players can switch teams during the year, finding a way in which to display their data in an informative and easy-to-view manner increases the challenge.

Baseball Reference [7] is an excellent site that provides team and player statistics. It currently does a great job of displaying static data. For example, here is the site's display for Jody Gerut [8].


Year Ag Tm  Lg  G   AB    R    H   2B 3B  HR  RBI  SB CS  BB  SO   BA   OBP   SLG  
+--------------+---+----+----+----+---+--+---+----+---+--+---+---+-----+-----+-----
 2003 25 CLE AL 127  480   66  134  33  2  22   75   4  5  35  70  .279  .336  .494
 2004 26 CLE AL 134  481   72  121  31  5  11   51  13  6  54  59  .252  .334  .405
 2005 27 TOT     59  170   15   43  11  1   1   14   1  1  20  20  .253  .330  .347
         TOT NL  15   32    3    5   2  0   0    2   0  0   2   6  .156  .206  .219
         CLE AL  44  138   12   38   9  1   1   12   1  1  18  14  .275  .357  .377
         CHC NL  11   14    1    1   1  0   0    0   0  0   2   3  .071  .188  .143
         PIT NL   4   18    2    4   1  0   0    2   0  0   0   3  .222  .222  .278
+--------------+---+----+----+----+---+--+---+----+---+--+---+---+-----+-----+-----
 3 Seasons      320 1131  153  298  75  8  34  140  18 12 109 149  .263  .334  .434 
+--------------+---+----+----+----+---+--+---+----+---+--+---+---+-----+-----+-----
 162 Game Avg        573   77  151  38  4  17   71   9  6  55  75  .263  .334  .434 
 Career High    134  481   72  134  33  5  22   75  13  6  54  70  .279  .336  .494

In 2005, Jody Gerut played for three teams: the Cleveland Indians, Chicago Cubs and Pittsburgh Pirates. As shown above, his totals are split in several ways: by season, by team and by league. Fungoes will strive to display stats in the same way, but using dynamic pages.
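The per-team, per-league and season lines in that display can all be derived from one row per team stint. The following Python sketch shows the idea using the games, at-bats and hits from the 2005 lines in the table above. The function and field names are my own illustration, not part of Fungoes or Baseball Reference.

```python
from collections import defaultdict

# One row per team stint, copied from the 2005 lines in the table above.
STINTS_2005 = [
    {"team": "CLE", "league": "AL", "g": 44, "ab": 138, "h": 38},
    {"team": "CHC", "league": "NL", "g": 11, "ab": 14, "h": 1},
    {"team": "PIT", "league": "NL", "g": 4, "ab": 18, "h": 4},
]

COUNTING_STATS = ("g", "ab", "h")


def total(stints):
    """Sum the counting stats, then recompute batting average from the sums."""
    out = {stat: sum(s[stat] for s in stints) for stat in COUNTING_STATS}
    # Rate stats must be recomputed from totals, never averaged across stints.
    out["ba"] = round(out["h"] / out["ab"], 3) if out["ab"] else None
    return out


def totals_by(stints, key):
    """Group stints by a field such as 'league' or 'team' and total each group."""
    groups = defaultdict(list)
    for stint in stints:
        groups[stint[key]].append(stint)
    return {name: total(rows) for name, rows in groups.items()}


if __name__ == "__main__":
    print("season:", total(STINTS_2005))
    print("by league:", totals_by(STINTS_2005, "league"))
```

Summing the three stints gives 59 games, 170 at-bats and 43 hits, which matches the TOT line in the table. A dynamic page could apply the same grouping to whichever columns the user chooses.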

First, I have to get the data or spend time doing data entry. Luckily for me, the good people at The Baseball Databank [9] offer historical baseball data in two formats, MySQL and comma-separated values (CSV). To get an idea of the size of the data, here are the current row counts for all the tables the site offers [10].


TABLE                     =>   ROWS
Master                    =>  16566
Teams                     =>   2505
TeamsFranchises           =>    120
TeamsHalf                 =>     52
Batting                   =>  87308
Pitching                  =>  36898
Fielding                  => 126130
FieldingOF                =>  21603
Salaries                  =>  17277
Managers                  =>   3067
ManagersHalf              =>     93
Allstar                   =>   4115
AwardsPlayers             =>   2383
AwardsSharePlayers        =>   5930
AwardsManagers            =>     47
AwardsShareManagers       =>    282
HallOfFame                =>   3369
HOFold                    =>    260
BattingPost               =>   9069
FieldingPost              =>   8981
PitchingPost              =>   3597
SeriesPost                =>    229
Schools                   =>    724
SchoolsPlayers            =>   5684
xref_stats                =>  16413
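Row counts like these make a handy sanity check after loading the CSV files into a database. Here is a rough sketch of such a check in Python, with SQLite standing in for the real target database. The load_csv helper, the column names and the two sample rows are hypothetical, and every value is stored as untyped text to keep the example short.

```python
import csv
import io
import sqlite3


def quote_ident(name):
    """Quote an SQL identifier so header names cannot break the statement."""
    return '"{}"'.format(name.replace('"', '""'))


def load_csv(conn, table, csv_file):
    """Create `table` from the CSV header, insert every row, return the row count."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(quote_ident(col) for col in header)
    marks = ", ".join("?" for _ in header)
    target = quote_ident(table)
    conn.execute("CREATE TABLE {} ({})".format(target, columns))
    conn.executemany("INSERT INTO {} VALUES ({})".format(target, marks), reader)
    return conn.execute("SELECT COUNT(*) FROM {}".format(target)).fetchone()[0]


if __name__ == "__main__":
    # Hypothetical sample standing in for a downloaded CSV file.
    sample = io.StringIO(
        "playerID,yearID,AB,HR\n"
        "player01,1990,500,0\n"
        "player02,1990,600,30\n"
    )
    conn = sqlite3.connect(":memory:")
    print("Batting =>", load_csv(conn, "Batting", sample))
```

Comparing the returned count against the published figure for each table is a quick way to spot a truncated download or a failed import.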

The next purpose of the Fungoes site will be to download box score information from mlb.com and store the information in the Fungoes database. At the end of the year, the data will be verified and added to the historical data. I also want to display box score data and offer sortable statistics for the current season.

Finally, what would be the point of having all this data if I didn't offer my humble opinion on the season and my excellent analysis of the numbers? Several records could fall this season. Barry Bonds' quest to topple Babe Ruth's and Hank Aaron's home run records will cause plenty of debate. Therefore, the Fungoes site will need to have an area for posting articles that use data from the site.

To help make it easier for these articles to refer to data on the site, I am going to add the functionality of My T Url [11], a Tinyurl clone, to all of the pages on the site. Doing so will allow each page to have a small URL, for example, http://fungoes.mek.cc/link/000a1 instead of http://fungoes.mek.cc/baseball-stats/player/aaronha01?batting%5forderby=rbi%2casc.
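How My T Url generates its short links is not described here, so the following Python sketch is only one plausible approach: number each long URL as it is stored and encode that number in base 36, padded to five characters. The ShortLinks class and the in-memory list are illustrative assumptions. A real service would keep the mapping in a database table.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # 36 symbols: 0-9, then a-z


def encode(number, width=5):
    """Encode a non-negative integer as a zero-padded base-36 string."""
    if number < 0:
        raise ValueError("id must be non-negative")
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits)).rjust(width, "0")


def decode(code):
    """Turn a base-36 short code back into its integer id."""
    return int(code, 36)


class ShortLinks:
    """Toy in-memory store: a URL's position in the list is its id."""

    def __init__(self):
        self._urls = []

    def add(self, url):
        self._urls.append(url)
        return encode(len(self._urls) - 1)

    def resolve(self, code):
        return self._urls[decode(code)]


if __name__ == "__main__":
    links = ShortLinks()
    code = links.add("http://fungoes.mek.cc/baseball-stats/player/aaronha01")
    print(code, "->", links.resolve(code))
```

Under this scheme, five base-36 characters cover about 60 million links, and a code such as 000a1 would correspond to id 361.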

Behind the scenes, open-source tools running on a Linux box will power the site. I looked at many different Web toolkits and development platforms. The current flavor of the month is Ruby on Rails, but it just doesn't seem ready for the abuse I would be inflicting on it. Plone has its merits but not enough of them to win me over. Drupal and many of the LAMP applications don't handle large, complex SQL statements in a simple manner. In the end, I decided to use OpenACS, with AOLserver as my Web server, PostgreSQL for SQL and TCL as the scripting language.

On a side note, I currently maintain Uptime [12] and My T Url [13]. Both Uptime and My T Url are free services--Web site monitoring and short URLs, respectively--with GPLed source code. They both run under AOLserver, using TCL and PostgreSQL. I also am involved in the OpenACS community from time to time.

OpenACS, AOLserver, PostgreSQL and TCL are not what most people consider to be their standard toolset. I am not sure how many open-source packages use PostgreSQL as their first database choice. Typically, people are shoehorning PostgreSQL support into an application that originally used MySQL.

AOLserver [14] has an undeserved bad rap, mainly due to the "AOL" in its name. AOLserver, originally called NaviServer, is a multithreaded Web server built on top of TCL. AOL eventually bought NaviSoft, the company behind NaviServer, and renamed the server AOLserver. AOL currently uses AOLserver for many of its Web sites.

TCL, the ugly duckling of scripting languages, too often is overlooked, and people sometimes avoid looking at AOLserver and OpenACS because of TCL. TCL is an easy language to learn; it sports only 90 or so core commands. AOLserver adds about 20 commands on top of TCL. OpenACS contains about 2,000 procedures in its core packages, but you need only about 10% of them to build sites. The hardest part of learning OpenACS, in fact, is not using AOLserver and TCL but understanding that somebody probably already wrote a procedure or function that does what you need to do.

To be completely honest, OpenACS suffers from two main problems. First, it can be hard to install. Second, as with many packages that have been around for a while, it suffers from bloat. I suffer from a little "bloating" myself, though, so I don't hold that against OpenACS.

My plan for extended spring training, which I'll discuss in a follow-up article, consists of the following:

  1. Install two instances of OpenACS, one for development and one for production.

  2. Use Subversion for source control.

  3. Complete the change-over of the Baseball Databank to PostgreSQL.

  4. Integrate Hack 27 from Baseball Hacks for collection of the current year's baseball stats.

  5. Create a baseball stats package for OpenACS that creates all of the database tables and loads the data.

__________________________

Source URL: http://www.linuxjournal.com/article/8986

Links:
[1] http://www.retrosheet.org
[2] http://chadwick.sourceforge.net/
[3] http://www.baseballprospectus.com
[4] http://www.mlb.com
[5] http://fungoes.mek.cc
[6] http://mek.cc/node/2
[7] http://www.baseball-reference.com
[8] http://www.baseball-reference.com/g/gerutjo01.shtml
[9] http://www.baseball-databank.org
[10] http://www.baseball-databank.org/files/tables.txt
[11] http://www.myturl.com
[12] http://uptime.openacs.org
[13] http://www.myturl.com
[14] http://www.aolserver.com