Play Ball: Introducing Fungoes

April 20th, 2006 by Mat Kovach

Follow along as a life-long baseball fan turns his hobby into an open-source baseball stats Web site.

When I was growing up and didn't have a car, I always was involved in one of two activities--working on my four-seam fastball from my Luis Tiant wind-up or peeking and poking with BASIC. Now that I am older, I have moved beyond BASIC and dived into baseball statistics, perhaps because I never got a good feel for a curveball. Nearly all of my statistical investigations have used open-source tools. I have tested the limits of OpenOffice.org and Gnumeric. I have put numbers produced from g77-compiled programs into Perl and TCL, and I have created graphs with gnuplot. Until now, I have kept my work and the strange collection of tools I use to myself.

How un-open source of me.

Around 1856, ex-cricket reporter Henry Chadwick started playing around with box scores and numerical representations of a baseball game. Since then, box scores have become a free data source for fans, a public recording of the facts. Retrosheet is an excellent place to find old box scores and download the data. Although Retrosheet uses DOS tools on its site, the open-source tool Chadwick is used on the data.

More recently, Bill James looked at baseball stats in different ways and wrapped them in a different philosophy. He was not the first one to do so, however. Branch Rickey, who also brought a guy named Jackie Robinson to the big leagues, used statistical analysis when he was general manager of the Brooklyn Dodgers. James' work grew into "sabermetrics". Sabermetrics, now taught and used in college courses, has its proponents and opponents. This parallels, in many ways, GNU and free software.

With Linux, Linus Torvalds arguably brought the first serious attention to free and open-source software. Billy Beane, General Manager of the Oakland A's and central figure of the book Moneyball, is the technical architect who helped bring sabermetrics into the public eye. Whereas Linus used free and open-source software to build Linux, Billy Beane used sabermetrics to build the Oakland A's, with equally successful results.

Success has built slowly around sabermetrics, and many teams have come to integrate it into their decision-making. The Cleveland Indians, with Mark Shapiro as the General Manager, rely on statistical analysis in drafting, signing and trading players. Theo Epstein helped lead the Boston Red Sox to a championship, with consultant Bill James, using sabermetrics in player decisions. Likewise, Linux and free and open-source software have integrated themselves slowly into larger companies and their decision-making processes.

Baseball Prospectus is a Web site that publishes daily articles about baseball, using sabermetrics as the foundation. Baseball Prospectus also publishes a book or two throughout the year. The site relates to sabermetrics in the same way that Linux Journal relates to Linux.

In February of this year, the good folks at O'Reilly published a book called Baseball Hacks, penned by Joseph Adler. Although the book is not specifically about open-source tools--Excel and Access get some ink--most of the tools he discusses to collect baseball stats and mine the data are open source. He devotes a good chunk of the book to using MySQL and The R Project--a great project too few people have heard about--so people can fulfill their need for baseball statistical analysis. He also includes a few great sections on collecting live data, including pulling box score information off Major League Baseball's Web site and shoving the information into a database.

Baseball Hacks removed my final excuse for not moving much of my baseball and football stats work into an open-source project. So I found acceptable replacements for my greenies, passed my drug test and started Fungoes.

Fungoes will have several parts, including a display of historical statistics. The site will offer different kinds of stats and allow people to sort and filter on various ones. Users will be able to find out who hit the most home runs, as well as who hit the fewest.


baseball=# select name,yearid,ab,hr
baseball-# from batting_career_totals
baseball-# where ab > 3000 and debut > 1900
baseball-# order by hr asc
baseball-# limit 5;

      name       | yearid |  ab  | hr
-----------------+--------+------+----
 Duane Kuiper    |     12 | 3379 |  1
 Bill Bergen     |     11 | 3028 |  2
 Al Bridwell     |     11 | 4169 |  2
 Johnny Cooney   |     20 | 3372 |  2
 Frank Taveras   |     11 | 4043 |  2

Note: the query above is the one that finally forced me to start this project. I was writing an article called "Ironic Announcers for MLB's Home Run Derby" while researching a book I hope to write someday, and I could not find a good way to include any statistical information beyond cutting and pasting.
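The query above reads from batting_career_totals, which is not defined in this article. As a rough sketch only, here is one way such a rollup could be built from per-season rows. It uses Python's bundled sqlite3 module as a stand-in for PostgreSQL, the batting table layout is hypothetical, the rows are made-up placeholders rather than real statistics, and treating yearid as a count of seasons played is an assumption based on the output shown.

```python
import sqlite3

# Toy per-season rows: (name, debut year, season, at-bats, home runs).
# These are made-up placeholders, not real statistics.
SEASONS = [
    ("Player A", 1974, 1975, 1800, 0),
    ("Player A", 1974, 1976, 1600, 1),
    ("Player B", 1962, 1963, 2000, 30),
    ("Player B", 1962, 1964, 1500, 25),
    ("Player C", 1895, 1901, 3500, 0),  # debuted before 1900: filtered out
    ("Player D", 1990, 1991, 900, 2),   # too few at-bats: filtered out
]


def fewest_home_runs(rows, min_ab=3000, debut_after=1900, limit=5):
    """Roll seasons up into career totals, then list the fewest home runs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE batting "
        "(name TEXT, debut INTEGER, season INTEGER, ab INTEGER, hr INTEGER)"
    )
    conn.executemany("INSERT INTO batting VALUES (?, ?, ?, ?, ?)", rows)
    # Hypothetical view definition; yearid is assumed to count seasons.
    conn.execute(
        "CREATE VIEW batting_career_totals AS "
        "SELECT name, debut, COUNT(*) AS yearid, "
        "SUM(ab) AS ab, SUM(hr) AS hr "
        "FROM batting GROUP BY name, debut"
    )
    result = conn.execute(
        "SELECT name, yearid, ab, hr FROM batting_career_totals "
        "WHERE ab > ? AND debut > ? "
        "ORDER BY hr ASC, name ASC LIMIT ?",
        (min_ab, debut_after, limit),
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in fewest_home_runs(SEASONS):
        print(row)
```

The secondary sort on name is an addition of this sketch; it only keeps ties, such as the four players with two home runs above, in a stable order.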

The Fungoes project offers many interesting challenges for me. Many statistics will need to be sorted and calculated. The range for sorting, the number of columns and the filtering possibilities will test my humble programming and database skills. In addition, because players can switch teams during the year, finding a way in which to display their data in an informative and easy-to-view manner increases the challenge.
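One small piece of the sorting problem can be sketched ahead of time: turning a visitor's sort request into an ORDER BY clause without passing unchecked text into SQL. The snippet below is a minimal sketch in Python rather than the TCL the site will use. The "column,direction" request format and the list of sortable columns are assumptions made for illustration.

```python
# Columns a visitor may sort on. The names are hypothetical and mirror
# the columns in the psql query shown earlier.
SORTABLE_COLUMNS = {"name", "yearid", "ab", "hr"}
DIRECTIONS = {"asc": "ASC", "desc": "DESC"}


def build_order_by(spec, default="hr ASC"):
    """Turn a 'column,direction' string into a safe ORDER BY clause."""
    if not spec:
        return "ORDER BY " + default
    column, _, direction = spec.partition(",")
    column = column.strip().lower()
    direction = direction.strip().lower() or "asc"
    if column not in SORTABLE_COLUMNS or direction not in DIRECTIONS:
        # Fall back to the default rather than trust unknown input.
        return "ORDER BY " + default
    return "ORDER BY {} {}".format(column, DIRECTIONS[direction])


if __name__ == "__main__":
    print(build_order_by("hr,desc"))
    print(build_order_by("ab"))
    print(build_order_by("hr; drop table batting"))
```

Checking the column against a fixed set matters because column names cannot be passed as bound parameters the way values can.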

Baseball Reference is an excellent site that provides team and player statistics. It currently does a great job of displaying static data. For example, here is the site's display for Jody Gerut.


Year Ag Tm  Lg  G   AB    R    H   2B 3B  HR  RBI  SB CS  BB  SO   BA   OBP   SLG  
+--------------+---+----+----+----+---+--+---+----+---+--+---+---+-----+-----+-----
 2003 25 CLE AL 127  480   66  134  33  2  22   75   4  5  35  70  .279  .336  .494
 2004 26 CLE AL 134  481   72  121  31  5  11   51  13  6  54  59  .252  .334  .405
 2005 27 TOT     59  170   15   43  11  1   1   14   1  1  20  20  .253  .330  .347
         TOT NL  15   32    3    5   2  0   0    2   0  0   2   6  .156  .206  .219
         CLE AL  44  138   12   38   9  1   1   12   1  1  18  14  .275  .357  .377
         CHC NL  11   14    1    1   1  0   0    0   0  0   2   3  .071  .188  .143
         PIT NL   4   18    2    4   1  0   0    2   0  0   0   3  .222  .222  .278
+--------------+---+----+----+----+---+--+---+----+---+--+---+---+-----+-----+-----
 3 Seasons      320 1131  153  298  75  8  34  140  18 12 109 149  .263  .334  .434 
+--------------+---+----+----+----+---+--+---+----+---+--+---+---+-----+-----+-----
 162 Game Avg        573   77  151  38  4  17   71   9  6  55  75  .263  .334  .434 
 Career High    134  481   72  134  33  5  22   75  13  6  54  70  .279  .336  .494

In 2005, Jody Gerut played for three teams: the Cleveland Indians, Chicago Cubs and Pittsburgh Pirates. As shown above, his totals are split in several ways: by season, by team and by league.
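The TOT rows in the table above can be rebuilt from the per-team rows. Here is a minimal sketch in Python, using only the counting stats needed for batting average and slugging percentage; on-base percentage is left out because it needs hit-by-pitch and sacrifice fly counts that the table does not show. The dictionary layout and function names are illustrative assumptions, not the Fungoes schema.

```python
from collections import defaultdict

# Counting stats that can simply be added across stints.
COUNTING = ("g", "ab", "h", "2b", "3b", "hr")

# Jody Gerut's 2005 stints, copied from the table above.
STINTS_2005 = [
    {"team": "CLE", "lg": "AL", "g": 44, "ab": 138, "h": 38,
     "2b": 9, "3b": 1, "hr": 1},
    {"team": "CHC", "lg": "NL", "g": 11, "ab": 14, "h": 1,
     "2b": 1, "3b": 0, "hr": 0},
    {"team": "PIT", "lg": "NL", "g": 4, "ab": 18, "h": 4,
     "2b": 1, "3b": 0, "hr": 0},
]


def total(stints):
    """Sum the counting stats, then derive BA and SLG from the sums."""
    line = {stat: sum(s[stat] for s in stints) for stat in COUNTING}
    singles = line["h"] - line["2b"] - line["3b"] - line["hr"]
    total_bases = (singles + 2 * line["2b"] + 3 * line["3b"]
                   + 4 * line["hr"])
    ab = line["ab"]
    # Rate stats come from the summed counts, never from averaging rates.
    line["ba"] = round(line["h"] / ab, 3) if ab else 0.0
    line["slg"] = round(total_bases / ab, 3) if ab else 0.0
    return line


def by_league(stints):
    """Group stints by league and total each group."""
    groups = defaultdict(list)
    for stint in stints:
        groups[stint["lg"]].append(stint)
    return {lg: total(rows) for lg, rows in groups.items()}


if __name__ == "__main__":
    print("TOT", total(STINTS_2005))
    for league, line in sorted(by_league(STINTS_2005).items()):
        print(league, line)
```

With the three stints above, the season total comes to 59 games and 170 at-bats, and the National League subtotal to 15 games and 32 at-bats, matching the TOT rows in the table.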

Fungoes will strive to display stats in the same way as Baseball Reference, but using dynamic pages. First, I have to get the data or spend time doing data entry. Luckily for me, the good people at The Baseball Databank offer historical baseball data in two formats, MySQL and comma-separated values (CSV). To get an idea of the size of the data, here are the current row counts for all the tables the site offers.


TABLE                     =>   ROWS
Master                    =>  16566
Teams                     =>   2505
TeamsFranchises           =>    120
TeamsHalf                 =>     52
Batting                   =>  87308
Pitching                  =>  36898
Fielding                  => 126130
FieldingOF                =>  21603
Salaries                  =>  17277
Managers                  =>   3067
ManagersHalf              =>     93
Allstar                   =>   4115
AwardsPlayers             =>   2383
AwardsSharePlayers        =>   5930
AwardsManagers            =>     47
AwardsShareManagers       =>    282
HallOfFame                =>   3369
HOFold                    =>    260
BattingPost               =>   9069
FieldingPost              =>   8981
PitchingPost              =>   3597
SeriesPost                =>    229
Schools                   =>    724
SchoolsPlayers            =>   5684
xref_stats                =>  16413

The next purpose of the Fungoes site will be to download box score information from mlb.com and store the information in the Fungoes database. At the end of the year, the data will be verified and added to the historical data. I also want to display box score data and offer sortable statistics for the current season.

Finally, what would be the point of having all this data if I didn't offer my humble opinion on the season and my excellent analysis of the numbers? Several records could fall this season. Barry Bonds' quest to topple Babe Ruth's and Hank Aaron's home run records will cause plenty of debate. Therefore, the Fungoes site will need to have an area for posting articles that use data from the site.

To help make it easier for these articles to refer to data on the site, I am going to add the functionality of My T Url, a Tinyurl clone, to all of the pages on the site. Doing so will allow each page to have a small URL, for example, http://fungoes.mek.cc/link/000a1 instead of http://fungoes.mek.cc/baseball-stats/player/aaronha01?batting%5forderby=rbi%2casc.
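How My T Url actually generates its codes is not covered here, so the following is only a sketch of the general idea: give each long URL a sequential id and render that id as a short, zero-padded code. The base-36 alphabet, the five-character width and the in-memory LinkTable class are all assumptions chosen to resemble the example link above; a real service would keep the mapping in a database table.

```python
from urllib.parse import parse_qs, urlsplit

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # assumed base-36 digits


def encode_id(link_id, width=5):
    """Render a non-negative integer id as a zero-padded base-36 code."""
    if link_id < 0:
        raise ValueError("link_id must be non-negative")
    digits = ""
    while link_id:
        link_id, remainder = divmod(link_id, 36)
        digits = ALPHABET[remainder] + digits
    return digits.rjust(width, "0")


def decode_code(code):
    """Turn a base-36 code back into its integer id."""
    return int(code, 36)


class LinkTable:
    """In-memory stand-in for a database table of long URLs."""

    def __init__(self):
        self._urls = []   # position in the list doubles as the id
        self._ids = {}    # long URL -> id, so repeats reuse one code

    def shorten(self, long_url):
        if long_url not in self._ids:
            self._ids[long_url] = len(self._urls)
            self._urls.append(long_url)
        return "/link/" + encode_id(self._ids[long_url])

    def resolve(self, short_path):
        code = short_path.rsplit("/", 1)[-1]
        return self._urls[decode_code(code)]


def query_params(url):
    """Decode the percent-escaped query string of a long URL."""
    return parse_qs(urlsplit(url).query)


if __name__ == "__main__":
    long_url = ("http://fungoes.mek.cc/baseball-stats/player/"
                "aaronha01?batting%5forderby=rbi%2casc")
    table = LinkTable()
    short = table.shorten(long_url)
    print(short, "->", table.resolve(short))
    print(query_params(long_url))
```

Under this scheme, a code such as 000a1 would simply be id 361 written in base 36. The escaped query string in the long URL decodes to batting_orderby=rbi,asc, which is the kind of detail a short link hides from readers.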

Behind the scenes, open-source tools running on a Linux box will power the site. I looked at many different Web toolkits and development platforms. The current flavor of the month is Ruby on Rails, but it just doesn't seem ready for the abuse I would be inflicting on it. Plone has some good merits but not enough to win me over. Drupal and many of the LAMP applications don't handle large, complex SQL statements in a simple manner. In the end, I decided to use OpenACS, with AOLserver as my Web server, PostgreSQL for SQL and TCL as the scripting language.

On a side note, I currently maintain Uptime and My T Url. Both Uptime and My T Url are free services--Web site monitoring and short URLs, respectively--with GPLed source code. They both run under AOLserver, using TCL and PostgreSQL. I also am involved in the OpenACS community from time to time.

OpenACS, AOLserver, PostgreSQL and TCL are not what most people consider to be their standard toolset. I am not sure how many open-source packages use PostgreSQL as their first database choice. Typically, people are shoe-horning PostgreSQL support into an application that originally used MySQL.

AOLserver has an undeserved bad rap, mainly due to the "AOL" in its name. AOLserver, originally called NaviServer, is a multithreaded Web server built on top of TCL. AOL eventually bought NaviServer and renamed it AOLserver. AOL currently uses AOLserver for many of its Web sites.

TCL, the ugly duckling of scripting languages, too often is overlooked, and people sometimes avoid looking at AOLserver and OpenACS because of TCL. TCL is an easy language to learn; it sports only 90 or so core commands. AOLserver adds about 20 commands on top of TCL. OpenACS contains about 2,000 procedures in its core packages, but you need only about 10% of them to build sites. The hardest part of learning OpenACS, in fact, is not using AOLserver and TCL but understanding that somebody probably already wrote a procedure or function that does what you need to do.

To be completely honest, OpenACS suffers from two main problems. First, it can be hard to install. Second, as with many packages that have been around for a while, it suffers from bloat. I suffer from a little "bloating" myself, though, so I don't hold that against OpenACS.

My plan for extended spring training, which I'll discuss in a follow-up article, consists of the following:

  1. Install two instances of OpenACS, one for development and one for production.

  2. Use Subversion for source control.

  3. Complete the change-over of the Baseball Databank to PostgreSQL.

  4. Integrate Hack 27 from Baseball Hacks for collection of the current year's baseball stats.

  5. Create a baseball stats package for OpenACS that creates all of the database tables and loads the data.

__________________________



recommend good baseball freeware for dad/coach

On April 23rd, 2007 Nicholas (not verified) says:

Looking for a recommendation of baseball boxscore freeware to track son/team - can you recommend?

Basic for Baseball Buffs

On October 18th, 2006 Anonymous (not verified) says:

Why not combining baseball and programming in basic? Years ago I read a book called "Basic for Baseball Buffs", "An Introduction to Programming in Basic especially for Baseball Fans" by an author named Bob Spear...
