Sorting Photos

We all know the right way to sort photos is to do it right after you take them. We also know that doing a disk backup before your drive fails is the right way to do backups. But we don't always do things the right way. Enter my situation. I have close to 10,000 photos taken with my digital camera over the last seven years. Yes, same camera. This could probably be an ad for the Canon A20, which has been abused, dropped by me and others, and used by tens of kids who had never used a camera before, some of whom had never even used a flush toilet. In any case, the photos from the camera are spread over a few CF cards, CDs, two different computers and who knows where else.

Sometimes the same photos have been saved multiple times. The photo number sequence has been reset when I changed CF cards. Or, put in other terms, I have a disaster to clean up. I regularly get asked for a particular photo and spend a bunch of time looking for it. This time, I decided to take the time to write a program to help solve the problem.

Yes, there are lots of programs to sort and thumbnail photos but when you have 10,000 or so images to start with, some sort of pre-sorting makes sense. Here is what I want that presort to do.

  1. Read a list of possible photo files.
  2. Build a database with creation date, some source information and an organized place to store each photo.
  3. Be able to tag each photo. In this case, a tag is simply one or more letters.
  4. Optionally add a description.
  5. Allow me to say "forget it" for obviously bad photos.
  6. Allow me to incrementally add to this collection.

Yeah, that's just a start but it makes sense considering the magnitude of the problem. Source information, for example, would be which computer or CD the photos came from. For most photos, the EXIF information from the camera will give me the actual date and time the photo was taken. But, if that isn't available (edited photo, for example), I will settle for the Linux filesystem timestamp.

I see getting this stuff organized as a four-step process.

  1. Find all the photos. This is a combination of physical work and then building lists of filenames. For example, a find command such as
    find /home/tux/Pix -iname "*.jpg" >file.list
    
    can do most of the work. Multiple lists can be built on a directory or directory tree basis.
  2. fotosort (the program I am talking about here) and my time can be used to process each list. It will allow me to skip a photo or add some tagging information and save a copy. All the "processed" photos will end up in one big tree with the database pointing to them.
  3. Toss duplicates. This will be the next programming project. With an MD5 digest and the file size (in bytes) in the database, it will be easy to find files that are duplicates.
  4. Create photo galleries.
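
The duplicate hunt in step 3 could be sketched as a single query that groups rows by MD5 digest and byte count. This is only a sketch, written for Python 3 with the standard sqlite3 module rather than the pysqlite2 binding the listing imports. The helper name find_duplicates is hypothetical, and it assumes the pix table that store_open() creates.

```python
import sqlite3

def find_duplicates(con):
    """Return lists of row ids that share both md5 and size (hypothetical helper)."""
    cur = con.cursor()
    cur.execute("""
        select md5, size, group_concat(id)
        from pix
        group by md5, size
        having count(*) > 1
        order by min(id)
    """)
    # group_concat returns text such as "1,3"; sort each group ourselves
    # because the order inside the concatenated string is not guaranteed.
    return [sorted(int(i) for i in ids.split(",")) for _md5, _size, ids in cur]
```

Each inner list is one set of identical files, so everything after the first id in a list would be a candidate for tossing.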

Whether I elect to do the final step (create the galleries) manually, use one of the many existing programs or write something to do it myself, I am already heading in the right direction. All the information I need is in a database and the photos are all in one place.

The Code

Let's look at what I have created. It is far from a work of art as it has experienced the typical evolution sequence that most programs go through. But, it works. If I were going to use it regularly, I would invest a bit of time to clean it up and add error handling but it is pretty much a one-shot for me.

Class Rec is not much more than a comment that shows what data I will need. When used to create an instance, it is passed the source_info string. It will be common for all the records created in a single run of fotosort.

The main program opens the filename list passed as a command line argument and opens the database (or creates it and the file tree if it doesn't exist). It then loops through those filenames, displaying each one using GraphicsMagick's display function and asking whether you want to save it. If you say discard, it moves on to the next file.

If you elect to save the file, it gets the file timestamp, byte count and MD5 digest, prompts for the flags and description, inserts the information into the database and copies the image file over to the new tree. Whether you picked save or not, the image display is terminated by calling kill with the pid returned when it was started. All the nitty gritty is handled by functions. Here is a quick look at the important ones.
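
The start-the-viewer-then-kill-it pattern could also be written with the subprocess module. What follows is a Python 3 sketch, not what the listing does (it uses os.spawnv and the shell's kill). The names start_viewer, stop_viewer and STAND_IN are hypothetical, and a sleeping Python process stands in for the viewer so the sketch runs without GraphicsMagick installed.

```python
import subprocess
import sys

def start_viewer(cmd):
    """Start the viewer without waiting for it; returns the Popen handle."""
    return subprocess.Popen(cmd)

def stop_viewer(proc):
    """Terminate the viewer and reap it; returns its exit status."""
    proc.terminate()          # sends SIGTERM on Linux, like a plain kill
    return proc.wait()

# Assumption: a sleeping Python process stands in for the image viewer.
# For real use the command might be
#   ["gm", "display", "-geometry", "240x240", path]
STAND_IN = [sys.executable, "-c", "import time; time.sleep(60)"]
```

Keeping the Popen handle instead of a bare pid means the program can wait for the viewer to exit, so no zombie processes pile up over a long sorting session.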

tree_setup() creates 100 sub-directories named 00 through 99. As I have 10,000 files to play with, I certainly don't want to put them all in one directory. They will be stored in the 100 different directories, selected by the last two digits of their filename. For example, pictures z_000021, z_000121, ..., z_099921 will all be stored in sub-directory 21.
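
As a rough Python 3 illustration of the tree layout (the listing itself is Python 2 and uses os.mkdir on a fixed path), the sketch below builds the 100 sub-directories under whatever base directory it is given. Taking the base as a parameter is my assumption for the example.

```python
import os

def tree_setup(base):
    """Create base plus 100 sub-directories named 00 through 99."""
    os.makedirs(base, mode=0o755, exist_ok=True)
    for x in range(100):
        # "%02d" pads single digits, so 7 becomes "07"
        os.makedirs(os.path.join(base, "%02d" % x), mode=0o755, exist_ok=True)
```

With exist_ok=True the function can be run again on an existing tree without raising an error.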

store_open() checks to see if the data directory is accessible. If so, it opens the database and returns the sqlite connection. Otherwise, with your permission, it creates a new file tree (using tree_setup()) and initializes the database.

store_add() adds a record to the database. It returns the last row id (auto-increment id field) which is also the numeric part of the filename. We use this to copy the file to the data tree.
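
To see how the row id comes back from an insert, here is a cut-down Python 3 sketch using the standard sqlite3 module. The two-column table and the demo_connection helper are assumptions made to keep the example short; the real table has more columns.

```python
import sqlite3

def demo_connection():
    """In-memory database with a cut-down pix table (illustration only)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table pix (id integer primary key, flags text, description text)")
    return con

def store_add(con, flags, description):
    """Insert one row and return the id sqlite assigned to it."""
    cur = con.cursor()
    cur.execute("insert into pix (flags, description) values (?, ?)",
                (flags, description))
    con.commit()
    return cur.lastrowid      # value of the integer primary key for this row
```

Because id is declared integer primary key and never supplied in the insert, sqlite picks the next number itself, and that number is what ends up in the filename.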

file_ts() is, well, ugly. The clean part is that stat is used to get the byte count. The ugly part is getting the picture creation time from the EXIF info if it exists. I found references to multiple EXIF packages in Python but each seemed to have a problem. I elected to use the exiv2 program which is included in Kubuntu. I read the results until I find the "Image timestamp" line and hackishly convert it into a real Linux-ish timestamp (seconds since the epoch). It was a pain but this is the best choice for later data comparisons.

If there is no EXIF information or the timestamp is missing, I settle for the last modify time in the filesystem. stat easily supplies this information.
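
A less hackish conversion might lean on time.strptime instead of slicing out each field by position. The Python 3 sketch below assumes the exiv2 line looks like "Image timestamp : 2007:12:25 06:43:00", which is the layout the slicing in file_ts() implies. The names parse_exif_line and timestamp_for are hypothetical, and the exiv2 output is passed in as a list of lines so the sketch does not need exiv2 to run.

```python
import os
import time

def parse_exif_line(line):
    """Return epoch seconds from an 'Image timestamp' line, else None."""
    if not line.startswith("Image timestamp"):
        return None
    value = line.split(":", 1)[1].strip()      # e.g. "2007:12:25 06:43:00"
    try:
        parsed = time.strptime(value, "%Y:%m:%d %H:%M:%S")
    except ValueError:                         # empty or malformed timestamp
        return None
    return int(time.mktime(parsed))            # interpreted as local time

def timestamp_for(path, exiv2_lines):
    """EXIF time if one of the lines carries it, else the file's mtime."""
    for line in exiv2_lines:
        ts = parse_exif_line(line)
        if ts is not None:
            return ts
    return int(os.stat(path).st_mtime)
```

An "Image timestamp" line with nothing after the colon falls through to the filesystem time, which matches the fallback described above.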

img_save() creates a filename consisting of z_ and a six digit number. That number is the database record id with leading zeros added. It then computes the actual destination path with the same mod 100 trick for directory name as tree_setup() used.
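
The naming rule is easy to check in isolation. This Python 3 sketch only computes the destination path and copies nothing; the helper name dest_path is hypothetical.

```python
import os

def dest_path(base, rec_id):
    """Build the storage path for a database row id (hypothetical helper)."""
    fname = "z_%06d" % rec_id                 # z_ plus six digits, zero padded
    subdir = "%02d" % (rec_id % 100)          # last two digits pick the directory
    return os.path.join(base, subdir, fname)
```

For example, on Linux dest_path("/home/fyl/PIXTREE", 121) gives /home/fyl/PIXTREE/21/z_000121.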

img_hash() uses hashlib to create an MD5 digest for the file. No magic here, other than that hashlib is new and replaces the older digest creation routines.
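
For reference, the same chunked digest can be sketched in Python 3 as follows. The file is opened in binary mode so the raw bytes are hashed, and the chunk_size parameter is my addition (the listing hard-codes 8192).

```python
import hashlib

def img_hash(path, chunk_size=8192):
    """Return the MD5 hex digest of a file, read in fixed-size chunks."""
    m = hashlib.md5()
    with open(path, "rb") as fo:              # binary mode: hash the raw bytes
        # iter() with a sentinel keeps calling read() until it returns b""
        for chunk in iter(lambda: fo.read(chunk_size), b""):
            m.update(chunk)
    return m.hexdigest()
```

Reading in chunks keeps memory use flat no matter how large the image file is.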

That's the end of the story. As I said, the program evolved and it shows. It's actually a good example of why programs should be written twice. One serious (ok, irritating) problem remains. When the image display is opened, the focus switches to it. Thus, you need a mouse click to get back to the console window to communicate with the main loop. There is probably a right way to fix this but, for now, just setting the Focus stealing prevention level in the KDE Control Module (click on the icon in the task bar, select Configure Window Behavior and then Advanced) to high solves the problem. Unfortunately, that isn't the general policy I want. I am sure it is easy to fix under program control; I just haven't figured out how yet.

Now, I guess I need to actually spend the next few days using the program. I do need a bunch of photos for the Geek Ranch web site.

Ed Note: The code below will NOT work if you copy and paste it. Get the code here.

# fotosort.py
# Takes a list of photo files and lets you play with them
# What it did goes in a database including user supplied flags and description
# Phil Hughes 25 Dec 2007@0643

import sys
import os
import time
import shutil
import hashlib
from pysqlite2 import dbapi2 as sqlite

dataloc ="/home/fyl/PIXTREE"    # where to build the tree

connection = 0                  # will be connection ID

class Rec():            # what we will put in the db
        def __init__(self, source_info):
                self.source_info = source_info  # where it came from
        # id integer primary key        # will be filename
        # flags text            # letters used for selection
        # md5 text              # MD5 hex digest
        # size integer          # file byte count
        # description text      # caption information
        # source_path text      # path we got it from
        # timestamp integer     # creation timestamp (from image or fs date)

def tree_setup():
        os.mkdir(dataloc, 0755)         # tree base
        for x in range(100):            # build 100 sub-directories
                os.mkdir("%s/%02i" % (dataloc, x), 0755)

def show_pix(path):     # runs display, returns display_pid so kill can work
        return os.spawnv(os.P_NOWAIT, "/usr/bin/gm",
                ["display", "-geometry", "240x240", path])

def store_open():       # opens (or creates) the store, returns connection or -1 on error
        # create data store if it doesn't exist
        if not os.access(dataloc, os.R_OK|os.W_OK|os.X_OK):
                print "can't open %s\n" % dataloc
                if raw_input("Create data structures (y/n): ") == 'y':
                        tree_setup()
                        # initialize the database
                        con = sqlite.connect(dataloc + "/pix.db")
                        cur = con.cursor()
                        cur.execute('''create table pix
                                (id integer primary key,
                                flags text,
                                md5 text,
                                size integer,
                                description text,
                                source_info text,
                                source_path text,
                                timestamp integer)
                                ''')
                else:           # the boss said forget it
                        exit(1)
        else:
                con = sqlite.connect(dataloc + "/pix.db")
        if con > 0:
                return con
        else:
                return -1

def store_close(con):
        con.close()

def store_add(data): # assigns next id, saves, returns id
        cur = connection.cursor()
        cur.execute('''
        insert into pix (flags, md5, size, description, source_info,
                source_path, timestamp) values (?, ?, ?, ?, ?, ?, ?)''',
                (data.flags, data.md5, data.size, data.description,
                data.source_info, data.source_path, data.timestamp)
        )
        connection.commit()
        return cur.lastrowid

def openfl(path):       # open a file list, returns file object
        return open(path, 'r')

def getfn(rec): # gets the next filename from the list file (currently unused)
        return lfo.readline()

def form_fill(rec):             # pass record to fill in
        rec.flags = raw_input("Flags: ")
        rec.description = raw_input("Desc.: ")

def file_ts(path):      # returns creation timestamp, file size in bytes
        size = os.stat(path).st_size
        # look for EXIF info but, if not found, uses filesystem timestamp
        # quote the path so filenames containing spaces survive the shell
        exiv2fo = os.popen('/usr/bin/exiv2 "%s"' % path, 'r')
        for line in exiv2fo:
                if line[0:15] == "Image timestamp":
                        cl = line.index(':')
                        ts_str = line[cl+2:cl+21]
                        ts = time.mktime((int(line[cl+2:cl+6]),
                                int(line[cl+7:cl+9]), int(line[cl+10:cl+12]),
                                int(line[cl+13:cl+15]), int(line[cl+16:cl+18]),
                                int(line[cl+19:cl+21]), 0, 0, 0))
                        break
        else:                   # use filesystem timestamp
                ts = os.stat(path).st_mtime
        exiv2fo.close()
        return (long(ts), size)

def img_save(image_file, id):   # copy image file to store
        # store location is built from id and some other fun stuff

        fname = "z_%06d" % int(id)
        dest = dataloc + "/" + "%02d" % (int(id) % 100) + '/' + fname
        # print dest
        shutil.copyfile(image_file, dest)
        return dest

def img_hash(image_file):       # returns MD5 hash for a file
        fo = open(image_file, 'rb')     # binary mode: hash the raw bytes
        m = hashlib.md5()
        stuff = fo.read(8192)
        while len(stuff) > 0:
                m.update(stuff)
                stuff = fo.read(8192)
        fo.close()
        return (m.hexdigest())

###
### This is where the action starts ###

if len(sys.argv) != 2:
        print "usage %s names_file\n" % sys.argv[0]
        exit(1)

lfo = openfl(sys.argv[1])               # filename list file
connection = store_open()
if connection < 0:
        print "%s: unable to initialize database" % sys.argv[0]
        exit(1)

# let's get the string to use for source info
rec = Rec(raw_input("Enter source info: "))

for f in lfo:
        f = f.strip()                           # toss possible newline
        display_pid = show_pix(f)
        disp = raw_input("s[ave]/d[iscard]/q[uit]: ")
        if disp != 'q' and disp != 'd':
                rec.timestamp, rec.size = file_ts(f)
                rec.source_path = f
                rec.md5 = img_hash(f)   # hash

                form_fill(rec)                          # get user input
                id = store_add(rec)                     # insert in db
                savedloc = img_save(f, id)      # copy the image

                print "Photo saved as %s\n" % savedloc
        os.system("kill %s" % display_pid)
        if disp == 'q':
                break
