Sorting Photos


We all know the right way to sort photos is to do it right after you take them. We also know that the right time to back up a disk is before the drive fails. But we don't always do things the right way. Enter my situation. I have close to 10,000 photos taken with my digital camera over the last seven years. Yes, the same camera. This could probably be an ad for the Canon A20, which has been abused, dropped by me and others, and used by tens of kids who had never used a camera before, some of whom had never even used a flush toilet. In any case, the photos from the camera are spread over a few CF cards, CDs, two different computers and who knows where else.

Sometimes the same photos have been saved multiple times. The photo number sequence has been reset when I changed CF cards. Or, put in other terms, I have a disaster to clean up. I regularly get asked for a particular photo and spend a bunch of time looking for it. This time, I decided to take the time to write a program to help solve the problem.

Yes, there are lots of programs to sort and thumbnail photos but when you have 10,000 or so images to start with, some sort of pre-sorting makes sense. Here is what I want that presort to do.

  1. Read a list of possible photo files.
  2. Build a database with creation date, some source information and an organized place to store each photo.
  3. Be able to tag each photo. In this case, any of a number of letters will work.
  4. Optionally add a description.
  5. Allow me to say "forget it" for obviously bad photos.
  6. Allow me to incrementally add to this collection.

Yeah, that's just a start but it makes sense considering the magnitude of the problem. Source information, for example, would be which computer or CD the photos came from. For most photos, the EXIF information from the camera will give me the actual date and time the photo was taken. But, if that isn't available (edited photo, for example), I will settle for the Linux filesystem timestamp.

I see getting this stuff organized as a four-step process.

  1. Find all the photos. This is a combination of physical work and then building lists of filenames. A find command can do the dirty work. For example,
    find /home/tux/Pix -iname "*.jpg" >file.list
    can do most of the work. Multiple lists can be built on a directory or directory tree basis.
  2. Process each list with fotosort (the program I am talking about here) and my time. It lets me skip a photo or add some tagging information and save a copy. All the "processed" photos will end up in one big tree with the database pointing to them.
  3. Toss duplicates. This will be the next programming project. With an MD5 digest and the file size (in bytes) in the database, it will be easy to find files that are duplicates.
  4. Create photo galleries.
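
Step 3 is not written yet, but the idea can be sketched. The following is only an illustration in current Python 3 with the standard sqlite3 module (not the pysqlite2 import used in the listing). The function name find_duplicates, the cut-down three-column table and the sample rows are all made up for the example.

```python
import sqlite3

def find_duplicates(con):
    """Group photo ids by (md5, size) and keep only groups with more than one id."""
    groups = {}
    for pid, md5, size in con.execute("select id, md5, size from pix order by id"):
        groups.setdefault((md5, size), []).append(pid)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

# Throwaway in-memory database with invented rows, just to exercise the function.
con = sqlite3.connect(":memory:")
con.execute("create table pix (id integer primary key, md5 text, size integer)")
con.executemany("insert into pix (md5, size) values (?, ?)",
                [("aaa", 100), ("bbb", 200), ("aaa", 100), ("aaa", 999)])
dups = find_duplicates(con)
print(dups)
```

Rows 1 and 3 share both digest and size, so they are reported together. Row 4 has the same digest but a different size and is left alone.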

Whether I elect to do the final step (creating the galleries) manually, with one of the many existing programs, or by writing something myself, I am already heading in the right direction. All the information I need is in a database and the photos are all in one place.

The Code

Let's look at what I have created. It is far from a work of art as it has experienced the typical evolution sequence that most programs go through. But, it works. If I were going to use it regularly, I would invest a bit of time to clean it up and add error handling, but it is pretty much a one-shot for me.

Class Rec is not much more than a comment that shows what data I will need. When used to create an instance, it is passed the source_info string, which is common to all the records created in a single run of fotosort.

The main program opens the filename list passed as a command line argument and opens the database (or creates it and the file tree if it doesn't exist). It then loops through those filenames, displaying each one with GraphicsMagick's display function, and checks to see if you want to save it. If you say skip, it moves on to the next file.

If you elect to save the file it gets the file timestamp, bytecount and MD5 digest, prompts for the flags and description, inserts the information into the database and copies the image file over to the new tree. No matter whether you picked save or not, the image display is terminated by calling kill with the pid returned when it was started. All the nitty gritty is handled by functions. Here is a quick look at the important ones.
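
The start-then-kill handling of the viewer can also be expressed with the subprocess module. This is a sketch in current Python 3, not the code from the listing. The name close_pix and the viewer parameter are inventions for the example, and it assumes gm is on the search path.

```python
import subprocess

def show_pix(path, viewer=("gm", "display", "-geometry", "240x240")):
    """Start the viewer without waiting for it and return the process handle."""
    return subprocess.Popen(list(viewer) + [path])

def close_pix(proc):
    """Stop the viewer, whether or not the photo was saved."""
    proc.terminate()    # sends SIGTERM on Unix, the same signal a plain kill sends
    proc.wait()         # collect the exit status so no zombie process is left behind

# Typical use inside the main loop (hypothetical path):
#     proc = show_pix("/home/tux/Pix/img_0001.jpg")
#     ... ask the user what to do with the photo ...
#     close_pix(proc)
```

Keeping the process handle rather than a bare pid means the program does not need to shell out to kill.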

tree_setup() creates 100 sub-directories named 00 through 99. As I have 10,000 files to play with, I certainly don't want to put them all in one directory. They will be stored in the 100 different directories, selected by the last two digits of their filenames. For example, pictures z_000021, z_000121, z_099921, ... will all be stored in sub-directory 21.
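
As a rough illustration of the tree layout, here is a Python 3 sketch that builds the 100 sub-directories in a throwaway temporary directory. Taking the base directory as a parameter is a change made for the example; the listing uses the global dataloc.

```python
import os
import tempfile

def tree_setup(base):
    """Create base (if needed) plus 100 sub-directories named 00 through 99."""
    for x in range(100):
        os.makedirs(os.path.join(base, "%02d" % x), mode=0o755, exist_ok=True)

# Demo in a temporary location so nothing real is touched.
base = os.path.join(tempfile.mkdtemp(), "PIXTREE")
tree_setup(base)
names = sorted(os.listdir(base))
print(len(names), names[0], names[-1])
```

Because of exist_ok=True, running tree_setup() a second time on the same base is harmless.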

store_open() checks to see if the data directory is accessible. If so, it opens the database and returns the sqlite3 connection id. Otherwise, with your permission, it creates a new file tree (using tree_setup()) and initializes the database.

store_add() adds a record to the database. It returns the last row id (auto-increment id field) which is also the numeric part of the filename. We use this to copy the file to the data tree.
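
The auto-increment behavior is easy to see in isolation. The sketch below uses Python 3's standard sqlite3 module and an in-memory database. Passing the connection in and using a dictionary for the record are choices made for the example, and the record values are invented.

```python
import sqlite3

def store_add(con, rec):
    """Insert one record and return the row id SQLite assigned to it."""
    cur = con.execute(
        """insert into pix (flags, md5, size, description, source_info,
                            source_path, timestamp)
           values (?, ?, ?, ?, ?, ?, ?)""",
        (rec["flags"], rec["md5"], rec["size"], rec["description"],
         rec["source_info"], rec["source_path"], rec["timestamp"]))
    con.commit()
    return cur.lastrowid

con = sqlite3.connect(":memory:")
con.execute("""create table pix (id integer primary key, flags text, md5 text,
               size integer, description text, source_info text,
               source_path text, timestamp integer)""")

# Invented sample record, only for the demonstration.
rec = {"flags": "fk", "md5": "0" * 32, "size": 123456,
       "description": "sample", "source_info": "old laptop",
       "source_path": "/tmp/sample.jpg", "timestamp": 1198564980}
first = store_add(con, rec)
second = store_add(con, rec)
print(first, second)
```

Since id is declared integer primary key and is left out of the insert, SQLite picks the next value itself, and lastrowid reports it.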

file_ts() is, well, ugly. The clean part is stat is used to get the bytecount. The ugly part is getting the picture creation time from the EXIF info if it exists. I found references to multiple EXIF packages in Python but each seemed to have a problem. I elected to use the exiv2 program which is included in Kubuntu. I read the results until I find the "Image timestamp" line and hackishly convert it into a real Linux-ish timestamp (seconds since the epoch). It was a pain but this is the best choice for later data comparisons.

If there is no EXIF information or the timestamp is missing, I settle for the last modify time in the filesystem. stat easily supplies this information.
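
The timestamp conversion can be done a little less hackishly with time.strptime. This Python 3 sketch assumes the exiv2 line looks like "Image timestamp : 2007:12:25 06:43:00", which is what the slicing in the listing implies. The helper names parse_exif_line and pick_timestamp are made up, and the exiv2 output is passed in as a list of lines rather than read from the program.

```python
import os
import time

def parse_exif_line(line):
    """Return seconds since the epoch for an "Image timestamp" line, else None."""
    if not line.startswith("Image timestamp"):
        return None
    stamp = line.split(":", 1)[1].strip()        # e.g. "2007:12:25 06:43:00"
    try:
        return int(time.mktime(time.strptime(stamp, "%Y:%m:%d %H:%M:%S")))
    except ValueError:                           # field is empty or unparsable
        return None

def pick_timestamp(exiv2_lines, path):
    """Prefer the EXIF timestamp; fall back to the file's last modify time."""
    for line in exiv2_lines:
        ts = parse_exif_line(line)
        if ts is not None:
            return ts
    return int(os.stat(path).st_mtime)

sample = ["File name       : sample.jpg",
          "Image timestamp : 2007:12:25 06:43:00"]
print(pick_timestamp(sample, "."))
```

The result is local time converted to seconds since the epoch, so the printed number depends on the machine's time zone.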

img_save() creates a filename consisting of z_ and a six digit number. That number is the database record id with leading zeros added. It then computes the actual destination path with the same mod 100 trick for directory name as tree_setup() used.
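
To make the naming rule concrete, here is the path computation on its own as a small Python 3 sketch. The helper name dest_path is made up; the base directory is the dataloc value from the listing.

```python
def dest_path(base, photo_id):
    """Build <base>/<last two digits of id>/z_<six-digit id>."""
    return "%s/%02d/z_%06d" % (base, photo_id % 100, photo_id)

print(dest_path("/home/fyl/PIXTREE", 21))      # /home/fyl/PIXTREE/21/z_000021
print(dest_path("/home/fyl/PIXTREE", 99921))   # /home/fyl/PIXTREE/21/z_099921
```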

img_hash() uses hashlib to create an MD5 digest for the file. There is no magic here, other than that hashlib is new and replaces the older digest creation routines.
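
For reference, a chunked digest in current Python 3 might look like the sketch below. The chunk_size parameter and its value are arbitrary choices for the example, not something taken from the original program.

```python
import hashlib

def img_hash(image_file, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it a chunk at a time."""
    m = hashlib.md5()
    with open(image_file, "rb") as fo:           # binary mode: photos are not text
        for chunk in iter(lambda: fo.read(chunk_size), b""):
            m.update(chunk)
    return m.hexdigest()
```

Reading in chunks keeps memory use flat no matter how large the image file is.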

That's the end of the story. As I said, the program evolved and it shows. It's actually a good example of why programs should be written twice. One serious (ok, irritating) problem remains. When the image display is opened, the focus switches to it. Thus, you need a mouse click to get back to the console window to communicate with the main loop. There is probably a right way to fix this but, for now, setting the Focus stealing prevention level to high in the KDE Control Module (click on the icon in the task bar, select Configure Window Behavior and then Advanced) solves the problem. Unfortunately, that isn't the general policy I want. I am sure it is easy to fix under program control. I just haven't figured out how yet.

Now, I guess I need to actually spend the next few days using the program. I do need a bunch of photos for the Geek Ranch web site.

Ed Note: The code below will NOT work if you copy and paste it. Get the code here.

# Takes a list of photo files and lets you play with them
# What it did goes in a database including user supplied flags and description
# Phil Hughes 25 Dec 2007@0643

import sys
import os
import time
import shutil
import hashlib
from pysqlite2 import dbapi2 as sqlite

dataloc ="/home/fyl/PIXTREE"    # where to build the tree

connection = 0                  # will be connection ID

class Rec():            # what we will put in the db
        def __init__(self, source_info):
                self.source_info = source_info  # where it came from
        # id integer primary key        # will be filename
        # flags text            # letters used for selection
        # md5 text              # MD5 hex digest
        # size integer          # file byte count
        # description text      # caption information
        # source_path text      # path we got it from
        # timestamp integer     # creation timestamp (from image or fs date)

def tree_setup():
        os.mkdir(dataloc, 0755)         # tree base
        for x in range(100):            # build 100 sub-directories
                os.mkdir("%s/%02i" % (dataloc, x), 0755)

def show_pix(path):     # runs display, returns display_pid so kill can work
        return os.spawnv(os.P_NOWAIT, "/usr/bin/gm",
                ["display", "-geometry", "240x240", path])

def store_open():       # opens, returns the connection or -1 on error
        # create data store if it doesn't exist
        if not os.access(dataloc, os.R_OK|os.W_OK|os.X_OK):
                print "can't open %s\n" % dataloc
                if raw_input("Create data structures (y/n): ") == 'y':
                        tree_setup()            # build the file tree
                        # initialize the database
                        con = sqlite.connect(dataloc + "/pix.db")
                        cur = con.cursor()
                        cur.execute('''create table pix
                                (id integer primary key,
                                flags text,
                                md5 text,
                                size integer,
                                description text,
                                source_info text,
                                source_path text,
                                timestamp integer)''')
                        con.commit()
                else:           # the boss said forget it
                        return -1
        else:                   # tree is already there, just open the db
                con = sqlite.connect(dataloc + "/pix.db")
        return con

def store_close(con):   # flush anything pending and close the database
        con.commit()
        con.close()

def store_add(data): # assigns next id, saves, returns id
        cur = connection.cursor()
        cur.execute('''
        insert into pix (flags, md5, size, description, source_info,
                source_path, timestamp) values (?, ?, ?, ?, ?, ?, ?)''',
                (data.flags, data.md5, data.size, data.description,
                data.source_info, data.source_path, data.timestamp))
        connection.commit()     # make sure the record hits the disk
        return cur.lastrowid

def openfl(path):       # open a file list, returns file object
        return open(path, 'r')

def getfn(rec): # gets the next filename
        return lfo.readline()

def form_fill(rec):             # pass record to fill in
        rec.flags = raw_input("Flags: ")
        rec.description = raw_input("Desc.: ")

def file_ts(path):      # returns creation timestamp, file size in bytes
        size = os.stat(path).st_size
        # look for EXIF info but, if not found, uses filesystem timestamp
        exiv2fo = os.popen('/usr/bin/exiv2 "%s"' % path, 'r')
        for line in exiv2fo:
                if line[0:15] == "Image timestamp":
                        cl = line.index(':')
                        try:
                                ts = time.mktime((int(line[cl+2:cl+6]),
                                        int(line[cl+7:cl+9]), int(line[cl+10:cl+12]),
                                        int(line[cl+13:cl+15]), int(line[cl+16:cl+18]),
                                        int(line[cl+19:cl+21]), 0, 0, 0))
                        except ValueError:      # timestamp field empty or garbled
                                continue
                        break                   # got it, so skip the else below
        else:                   # use filesystem timestamp
                ts = os.stat(path).st_mtime
        exiv2fo.close()
        return (long(ts), size)

def img_save(image_file, id):   # copy image file to store
        # store location is built from id and some other fun stuff

        fname = "z_%06d" % int(id)
        dest = dataloc + "/" + "%02d" % (int(id) % 100) + '/' + fname
        # print dest
        shutil.copyfile(image_file, dest)
        return dest

def img_hash(image_file):       # returns MD5 hash for a file
        fo = open(image_file, 'rb')
        m = hashlib.md5()
        stuff = fo.read(8192)           # chunk size is an arbitrary choice
        while len(stuff) > 0:
                m.update(stuff)
                stuff = fo.read(8192)
        fo.close()
        return (m.hexdigest())

### This is where the action starts ###

if len(sys.argv) != 2:
        print "usage %s names_file\n" % sys.argv[0]
        sys.exit(1)

lfo = openfl(sys.argv[1])               # filename list file
connection = store_open()
if connection < 0:
        print "%s: unable to initialize database" % sys.argv[0]
        sys.exit(1)

# let's get the string to use for source info
rec = Rec(raw_input("Enter source info: "))

for f in lfo:
        f = f.strip()                           # toss possible newline
        display_pid = show_pix(f)
        disp = raw_input("s[ave]/d[iscard]/q[uit]: ")
        if disp != 'q' and disp != 'd':
                rec.timestamp, rec.size = file_ts(f)
                rec.source_path = f
                rec.md5 = img_hash(f)   # hash

                form_fill(rec)                          # get user input
                id = store_add(rec)                     # insert in db
                savedloc = img_save(f, id)      # copy the image

                print "Photo saved as %s\n" % savedloc
        os.system("kill %s" % display_pid)
        if disp == 'q':
                break

Phil Hughes


Doesn't seem to work - get an error

TenSigh's picture

I copied the script, saved it to a text file, and ran python

I'm getting the following error:
File "", line 17
class Rec(): # what we will put in the db
SyntaxError: invalid syntax

What do I need to get the script to run?


Phil Hughes's picture

The most likely problem is indenting. I should have put the code in a separate file that could be downloaded. I have asked our Webmistress to do that.

Python blocks are indicated by indentation levels. While the code may appear ok, spaces and tabs that make things look lined up are not necessarily treated the same.

Phil Hughes


Webmistress's picture

There is a link to the code in the article above. You can also get it right here.

Katherine Druckman is webmistress at You might find her on Twitter or at the Southwest Drupal Summit

have you tried kphotoalbum or f-spot or others?

Anonymous's picture

does your script do anything that one of the existing photo management applications can't do as well?

i am quite happy managing my 12000 photos in kphotoalbum.
it spreads them out on a scalable timeline and allows you to tag photos one by one or in bulk. it does not bother about the filenames or even the path as it tracks photos by checksum, so even if you move/rename them later, tags won't be lost.

i started using kphotoalbum after i had about 10000 photos and just walked through the timeline spending a couple of hours each day tagging batches of photos for few months.

before using kphotoalbum i just sorted my photos by date, putting them in a year/month/day sort of path. i still do that as kphotoalbum does not care.

i don't import the original photos into kphotoalbum but only a smaller version (800x600) of them to make handling easier.
(kphotoalbum does have a feature to handle offline storage though)

as for backups, i don't delete a photo from the cf-card until it is copied to at least two locations. one on my notebook and one on an external usb disk. that external disk contains disk-images in dvd-sized files. photos get placed into those disk images, and disk-images are written to a dvd as they fill up. once a dvd is written, i delete the directory that corresponds to that dvd from my notebook, so that i end up with one copy of the photo on dvd and one on the external usb disk. i also keep a second set of dvd-disks at my grandmother's place.

greetings, eMBee.


Anonymous's picture

that should have been a reply to the main article, not to the first comment.

greetings, eMBee.