Recipy for Science
More and more journals are demanding that published science be reproducible. Ideally, publishing your code should be enough for someone else to reproduce your results. But anyone who has done actual computational science knows that this is rarely true. The number of times you twiddle bits of your code to test different hypotheses, or change the specific bits of data you use first to test your code and then to do your actual analysis, grows quickly over the course of a research program. It becomes very difficult to keep track of all of those changes and variations over time.
Because more and more scientific work is being done in Python, a new tool is available to help automate this record keeping. Recipy is a Python module that you can use within your code to track and manage the history of your computational work.
Recipy exists in the Python module repository, so installation can be as easy as:
pip install recipy
The code resides in a GitHub repository, so you always can get the latest and greatest version by cloning the repository and installing it manually. If you do decide to install manually, you also can install the requirements with the following command, using the requirements file from the recipy source code:
pip install -r requirements.txt
Once you have it installed, using it is extremely easy. You can alter your scripts by adding this line to the top of the file:
import recipy
It needs to be the very first line of Python executed in order to capture everything else that happens within your program. If you don't want to alter your files even that much, you can run your code through Recipy with the command:
python -m recipy my_script.py
All of the reporting data is stored within a TinyDB database, kept as a JSON file in the ~/.recipy directory.
Once you have collected the details of a run, you can start to play around with the stored results. To explore this module, let's use the sample code from the recipy documentation. A short example is the following, saved in the file my_script.py:
import recipy

import numpy

arr = numpy.arange(10)
arr = arr + 500
numpy.save('test.npy', arr)
The recipy module includes a script called recipy that can process the stored data. As a first look, you can use the following command, which will pull up details about the run:
recipy search test.npy
On my Cygwin machine (the power tool for Linux users forced to use a Windows machine), the results look like this:
Run ID: eb4de53f-d90c-4451-8e35-d765cb82d4f9
Created by berna_000 on 2015-09-07T02:18:17
Ran /cygdrive/c/Users/berna_000/Dropbox/writing/lj/science/recipy/my_script.py using /usr/bin/python
Git: commit 1149a58066ee6d2b6baa88ba00fd9effcf434689, in repo /cygdrive/c/Users/berna_000/Dropbox/writing, with origin https://github.com/joeybernard/writing.git
Environment: CYGWIN_NT-10.0-2.2.0-0.289-5-3-x86_64-64bit, python 2.7.10 (default, Jun 1 2015, 18:05:38)
Inputs: none
Outputs: /cygdrive/c/Users/berna_000/Dropbox/writing/lj/science/recipy/test.npy
Every time you run your program, a new entry is added to the database. When you run the search command again, you will get a message like the following to let you know that more than one recorded run produced this output file:
** Previous runs creating this output have been found. Run with --all to show. **
If using a text interface isn't your cup of tea, there is a GUI available with the following command, which gives you a potentially nicer interface (Figure 1):
recipy gui
This GUI is actually Web-based, so once you run this command, you can open the interface in the browser of your choice.
Figure 1. Recipy includes a GUI that provides a more intuitive way to work with your run data.
Recipy stores its configuration and the database files within the directory ~/.recipy. The configuration is stored in the recipyrc file in this folder. The database files also are located here by default. But, you can change that by using the configuration option:
[database] path = /path/to/file.json
This way, you can store these database files in a place where they will be backed up and potentially versioned.
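Because TinyDB stores its data as plain JSON, you even can inspect the run log directly with nothing but the standard library. The sketch below assumes TinyDB's usual {table: {doc_id: record}} file layout and field names like 'script' and 'date' inferred from the search output shown earlier, so treat those names as assumptions rather than a documented schema:

```python
import json
from pathlib import Path


def list_runs(db_file):
    """Read a TinyDB JSON file and return (script, date) pairs.

    TinyDB lays the file out as {table_name: {doc_id: record}},
    so two nested loops walk every stored run. The 'script' and
    'date' field names are assumptions based on recipy's search
    output.
    """
    data = json.loads(Path(db_file).read_text())
    runs = []
    for table in data.values():
        for record in table.values():
            runs.append((record.get('script'), record.get('date')))
    return runs
```

Pointing list_runs at the database file in ~/.recipy (or wherever your configured path puts it) would then give you one entry per recorded run.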
You can change the amount of information being logged with a few different configuration options. In the [general] section, you can use the debug option to include debugging messages or quiet to not print any messages.
By default, all of the metadata around git commands is included within the recorded information. You can ignore some of this metadata selectively with the configuration section [ignored metadata]. If you use the diff option, the output from a git diff command won't be stored. If instead you wanted to ignore everything, you could use the git option to skip everything related to git commands.
You can ignore specific modules on either the recorded inputs or the outputs by using the configuration sections [ignored inputs] and [ignored outputs], respectively. For example, if you want to skip recording any outputs from the numpy module, you could use:
[ignored outputs]
numpy
If you want to skip everything, you could use the special all option for either section. If these options are stored in the main configuration file mentioned above, they will apply to all of your recipy runs. If you want to use different options for different projects, you can use a file named .recipyrc within the current directory with the specific options for the project.
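Pulling together the options discussed in this section, a project-local .recipyrc might look like the following. The section and option names are taken from the text above; treat the exact spellings as best-effort rather than a verified schema:

```ini
[general]
debug

[ignored metadata]
diff

[ignored outputs]
numpy
```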
The way that recipy works is that it ties into the Python system for importing modules. It does this by using wrapping classes around the modules that you want to record. Currently, the supported modules are numpy, scikit-learn, pandas, scikit-image, matplotlib, pillow, GDAL and nibabel.
The wrapper function is extremely simple, however, so it is an easy matter to add wrappers for your favorite scientific module. All you need to do is implement the PatchSimple interface and add lists of the input and output functions that you want logged.
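Recipy's actual PatchSimple machinery lives inside the recipy package and hooks the import system, so the snippet below is only a self-contained sketch of the underlying technique: replacing a module's output functions with logging wrappers. The logged_outputs list and wrap_output helper are names invented for this illustration, not part of recipy's API:

```python
import functools

import numpy

# Stand-in for recipy's run database; the name is invented for this sketch.
logged_outputs = []


def wrap_output(func, name):
    """Return a wrapper that records the target file name before
    delegating to the real function, mimicking how an output
    function gets logged."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logged_outputs.append((name, args[0]))
        return func(*args, **kwargs)
    return wrapper


# Patch numpy.save in the spirit of recipy's module wrappers.
numpy.save = wrap_output(numpy.save, 'numpy.save')

arr = numpy.arange(10) + 500
numpy.save('test.npy', arr)
print(logged_outputs)  # [('numpy.save', 'test.npy')]
```

A real PatchSimple implementation does essentially this for each function named in the input and output lists you provide.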
After reading this article, you never should lose track of how you reached your results. You can configure recipy to record the details you find most important and be able to redo any calculation you did in the past. Techniques for reproducible research are going to be more important in the future, so this is definitely one method to add to your toolbox. Seeing as it is only at version 0.1.0, it will be well worth following this project to see how it matures and what new functionality is added to it in the future.