Sorting Photos
We all know the right way to sort photos is to do it right after you take them. We also know that doing a disk backup before your drive fails is the right way to do backups. But we don't always do things the right way. Enter my situation. I have close to 10,000 photos taken with my digital camera over the last seven years. Yes, same camera. This could probably be an ad for the Canon A20, which has been abused, dropped by me and others, and used by tens of kids who have never used a camera before, some of whom have never even used a flush toilet. In any case, the photos from the camera are spread over a few CF cards, CDs, two different computers and who knows where else.
Sometimes the same photos have been saved multiple times. The photo number sequence has been reset when I changed CF cards. Or, put in other terms, I have a disaster to clean up. I regularly get asked for a particular photo and spend a bunch of time looking for it. This time, I decided to take the time to write a program to help solve the problem.
Yes, there are lots of programs to sort and thumbnail photos but when you have 10,000 or so images to start with, some sort of pre-sorting makes sense. Here is what I want that presort to do.
- Read a list of possible photo files.
- Build a database with creation date, some source information and an organized place to store each photo.
- Be able to tag each photo. In this case, any of a number of letters will work.
- Optionally add a description.
- Allow me to say "forget it" for obviously bad photos.
- Allow me to incrementally add to this collection.
Yeah, that's just a start but it makes sense considering the magnitude of the problem. Source information, for example, would be which computer or CD the photos came from. For most photos, the EXIF information from the camera will give me the actual date and time the photo was taken. But, if that isn't available (edited photo, for example), I will settle for the Linux filesystem timestamp.
I see getting this stuff organized as a four-step process.
- Find all the photos. This is a combination of physical work and then building lists of filenames. A find command can do the dirty work. For example,
find /home/tux/Pix -iname "*.jpg" >file.list
can do most of the work. Multiple lists can be built on a directory or directory tree basis.
- Process each list with fotosort (the program I am talking about here) and my time. It will allow me to skip a photo or add some tagging information and save a copy. All the "processed" photos will end up in one big tree with the database pointing to them.
- Toss duplicates. This will be the next programming project. With an MD5 digest and the file size (in bytes) in the database, it will be easy to find files that are duplicates.
- Create photo galleries.
Whether I do the final step (creating the galleries) manually, with one of the many existing programs, or by writing something myself, I am already heading in the right direction. All the information I need is in a database and the photos are all in one place.
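The duplicate-tossing step is not written yet, but with the md5 and size columns in the database, a query along these lines would probably find the candidates. This is only a sketch, assuming Python 3 and the standard sqlite3 module rather than the pysqlite2 import used in the listing. The helper name find_duplicates is hypothetical, and the table layout follows the create table statement in the listing.

```python
import sqlite3

def find_duplicates(con):
    """Return a list of id lists; each inner list shares one (md5, size) pair.

    Hypothetical helper: assumes a table named pix with id, md5 and size
    columns, as created by store_open() in the listing.
    """
    cur = con.cursor()
    # Any (md5, size) pair that appears more than once marks a set of duplicates.
    cur.execute("""
        select md5, size from pix
        group by md5, size
        having count(*) > 1
    """)
    groups = []
    for md5, size in cur.fetchall():
        cur.execute(
            "select id from pix where md5 = ? and size = ? order by id",
            (md5, size),
        )
        groups.append([row[0] for row in cur.fetchall()])
    return groups
```

Each inner list holds the record ids of files that look identical. Which copy to keep (the lowest id, say) is a separate decision.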
The Code
Let's look at what I have created. It is far from a work of art as it has experienced the typical evolution sequence that most programs go through. But, it works. If I were going to use it regularly, I would invest a bit of time to clean it up and add error handling but it is pretty much a one-shot for me.
Class Rec is not much more than a comment that shows what data I will need. When used to create an instance, it is passed the source_info string. It will be common for all the records created in a single run of fotosort.
The main program opens the filename list passed as a command line argument and opens the database (or creates it and the file tree if it doesn't exist). It then loops through those filenames, displaying each one using GraphicsMagick's display function, and asks whether you want to save it. If you say skip, it moves on to the next file.
If you elect to save the file it gets the file timestamp, bytecount and MD5 digest, prompts for the flags and description, inserts the information into the database and copies the image file over to the new tree. No matter whether you picked save or not, the image display is terminated by calling kill with the pid returned when it was started. All the nitty gritty is handled by functions. Here is a quick look at the important ones.
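The start-then-kill handling of the viewer could also be done with the subprocess module instead of spawnv plus a shell kill. A minimal sketch, assuming Python 3; start_viewer and stop_viewer are hypothetical names, not functions from fotosort.

```python
import subprocess

def start_viewer(command):
    """Start an image viewer (or any command) without waiting for it to finish."""
    return subprocess.Popen(command)

def stop_viewer(proc, timeout=5):
    """Ask the process to exit, and force it if it ignores the request."""
    if proc.poll() is None:          # still running
        proc.terminate()             # polite request (SIGTERM on Linux)
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()              # not so polite
            proc.wait()
    return proc.returncode
```

With the same arguments the listing passes to gm, the call would look like start_viewer(["gm", "display", "-geometry", "240x240", path]), followed by stop_viewer(proc) once the save or discard decision has been made.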
tree_setup() creates 100 sub-directories named 00 through 99. As I have 10,000 files to play with I certainly don't want to put them all in one directory. They will be stored in the 100 different directories selected by the last two digits of their filename. For example, pictures z_000021, z_000121, z_099921, ... will all be stored in sub-directory 21.
store_open() checks to see if the data directory is accessible. If so, it opens the database and returns the sqlite3 connection. Otherwise, with your permission, it creates a new file tree (using tree_setup()) and initializes the database.
store_add() adds a record to the database. It returns the last row id (auto-increment id field) which is also the numeric part of the filename. We use this to copy the file to the data tree.
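The insert-and-return-the-id idea can be tried on its own against an in-memory database. This is a sketch under assumptions: Python 3, the standard sqlite3 module, and an explicit connection argument instead of the global the listing uses. init_store is a hypothetical helper that just repeats the create table statement.

```python
import sqlite3

def init_store(con):
    """Create the pix table with the same columns as the listing."""
    con.execute("""create table pix
        (id integer primary key,
         flags text, md5 text, size integer, description text,
         source_info text, source_path text, timestamp integer)""")
    con.commit()

def store_add(con, flags, md5, size, description,
              source_info, source_path, timestamp):
    """Insert one record and return its auto-assigned id."""
    cur = con.cursor()
    cur.execute(
        "insert into pix (flags, md5, size, description, source_info,"
        " source_path, timestamp) values (?, ?, ?, ?, ?, ?, ?)",
        (flags, md5, size, description, source_info, source_path, timestamp),
    )
    con.commit()
    # id is an integer primary key, so lastrowid is the id of the new row
    return cur.lastrowid
```

Because the id column is an integer primary key, SQLite assigns it automatically, and the value returned here is what becomes the numeric part of the stored filename.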
file_ts() is, well, ugly. The clean part is that stat is used to get the bytecount. The ugly part is getting the picture creation time from the EXIF info if it exists. I found references to multiple EXIF packages in Python but each seemed to have a problem. I elected to use the exiv2 program which is included in Kubuntu. I read the results until I find the "Image timestamp" line and hackishly convert it into a real Linux-ish timestamp (seconds since the epoch). It was a pain but this is the best choice for later data comparisons.
If there is no EXIF information or the timestamp is missing, I settle for the last modify time in the filesystem. stat easily supplies this information.
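The slicing in file_ts() could probably be replaced with time.strptime. The sketch below assumes Python 3 and that the exiv2 line looks like "Image timestamp : 2007:12:25 06:43:00", which is the layout the slice offsets in the listing imply. parse_exiv2_timestamp and photo_timestamp are hypothetical names.

```python
import os
import time

def parse_exiv2_timestamp(line):
    """Turn 'Image timestamp : YYYY:MM:DD HH:MM:SS' into epoch seconds.

    Returns None if the line is something else or the date is unreadable.
    """
    if not line.startswith("Image timestamp"):
        return None
    # Everything after the first colon is the date and time
    value = line.split(":", 1)[1].strip()
    try:
        parsed = time.strptime(value, "%Y:%m:%d %H:%M:%S")
    except ValueError:
        return None
    return int(time.mktime(parsed))   # interpreted as local time

def photo_timestamp(exiv2_lines, path):
    """Prefer the EXIF timestamp; fall back to the filesystem modify time."""
    for line in exiv2_lines:
        ts = parse_exiv2_timestamp(line)
        if ts is not None:
            return ts
    return int(os.stat(path).st_mtime)
```

Here exiv2_lines is whatever iterable of text lines the exiv2 run produced. Keeping the parsing separate from the process handling makes it easy to try on a single sample line.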
img_save() creates a filename consisting of z_ and a six digit number. That number is the database record id with leading zeros added. It then computes the actual destination path with the same mod 100 trick for directory name as tree_setup() used.
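The naming rule is small enough to show by itself. A minimal sketch, assuming Python 3; dest_path is a hypothetical helper that only computes the path and copies nothing.

```python
import os

def dest_path(base, rec_id):
    """Build the storage path for a record id: base/NN/z_NNNNNN."""
    fname = "z_%06d" % rec_id          # id padded to six digits
    subdir = "%02d" % (rec_id % 100)   # last two digits pick the directory
    return os.path.join(base, subdir, fname)
```

With a base of /home/fyl/PIXTREE, ids 21, 121 and 99921 all land in sub-directory 21, as z_000021, z_000121 and z_099921.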
img_hash() uses hashlib to create an MD5 digest for the file. No magic, other than that hashlib is new and replaces the older digest creation routines.
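For reference, the same chunked digest in Python 3 might look like the sketch below. It opens the file in binary mode, which Python 3 requires for hashing, and the chunk_size parameter is an addition that is not in the listing.

```python
import hashlib

def img_hash(path, chunk_size=8192):
    """Return the MD5 hex digest of a file, reading it in small chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fo:
        # iter() keeps calling fo.read() until it returns empty bytes (EOF)
        for chunk in iter(lambda: fo.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat no matter how large the image is.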
That's the end of the story. As I said, the program evolved and it shows. It's actually a good example of why programs should be written twice. One serious (ok, irritating) problem remains. When the image display is opened, the focus switches to it. Thus, you need a mouse click to get back to the console window to communicate with the main loop. There is probably a right way to fix this but, for now, just setting the Focus stealing prevention level in the KDE Control Module (click on the icon in the task bar, select Configure Window Behavior and then Advanced) to high solves the problem. Unfortunately, that isn't the general policy I want. I am sure it is easy to fix under program control. I just haven't figured out how yet.
Now, I guess I need to actually spend the next few days using the program. I do need a bunch of photos for the Geek Ranch web site.
Ed Note: The code below will NOT work if you copy and paste it. Get the code here.
# fotosort.py
# Takes a list of photo files and lets you play with them
# What it did goes in a database including user supplied flags and description
# Phil Hughes 25 Dec 2007@0643
import sys
import os
import time
import shutil
import hashlib
from pysqlite2 import dbapi2 as sqlite
dataloc ="/home/fyl/PIXTREE" # where to build the tree
connection = 0 # will be connection ID
class Rec(): # what we will put in the db
    def __init__(self, source_info):
        self.source_info = source_info # where it came from
        # id integer primary key # will be filename
        # flags text # letters used for selection
        # md5 text # MD5 hex digest
        # size integer # file byte count
        # description text # caption information
        # source_path text # path we got it from
        # timestamp integer # creation timestamp (from image or fs date)

def tree_setup():
    os.mkdir(dataloc, 0755) # tree base
    for x in range(100): # build 100 sub-directories
        os.mkdir("%s/%02i" % (dataloc, x), 0755)

def show_pix(path): # runs display, returns display_pid so kill can work
    return os.spawnv(os.P_NOWAIT, "/usr/bin/gm",
        ["display", "-geometry", "240x240", path])

def store_open(): # opens the db, returns the connection or -1 on error
    # create data store if it doesn't exist
    if not os.access(dataloc, os.R_OK|os.W_OK|os.X_OK):
        print "can't open %s\n" % dataloc
        if raw_input("Create data structures (y/n): ") == 'y':
            tree_setup()
            # initialize the database
            con = sqlite.connect(dataloc + "/pix.db")
            cur = con.cursor()
            cur.execute('''create table pix
                (id integer primary key,
                flags text,
                md5 text,
                size integer,
                description text,
                source_info text,
                source_path text,
                timestamp integer)
                ''')
        else: # the boss said forget it
            exit(1)
    else:
        con = sqlite.connect(dataloc + "/pix.db")
    if con > 0:
        return con
    else:
        return -1

def store_close(con):
    con.close()

def store_add(data): # assigns next id, saves, returns id
    cur = connection.cursor()
    cur.execute('''
        insert into pix (flags, md5, size, description, source_info,
        source_path, timestamp) values (?, ?, ?, ?, ?, ?, ?)''',
        (data.flags, data.md5, data.size, data.description,
        data.source_info, data.source_path, data.timestamp)
        )
    connection.commit()
    return cur.lastrowid

def openfl(path): # open a file list, returns file object
    return open(path, 'r')

def getfn(rec): # gets the next filename (not used by the main loop)
    return lfo.readline()

def form_fill(rec): # pass record to fill in
    rec.flags = raw_input("Flags: ")
    rec.description = raw_input("Desc.: ")

def file_ts(path): # returns creation timestamp, file size in bytes
    size = os.stat(path).st_size
    # look for EXIF info but, if not found, uses filesystem timestamp
    exiv2fo = os.popen("/usr/bin/exiv2 %s" % path, 'r')
    for line in exiv2fo:
        if line[0:15] == "Image timestamp":
            cl = line.index(':')
            ts_str = line[cl+2:cl+21]
            ts = time.mktime((int(line[cl+2:cl+6]),
                int(line[cl+7:cl+9]), int(line[cl+10:cl+12]),
                int(line[cl+13:cl+15]), int(line[cl+16:cl+18]),
                int(line[cl+19:cl+21]), 0, 0, 0))
            break
    else: # loop ended without a break: use filesystem timestamp
        ts = os.stat(path).st_mtime
    exiv2fo.close()
    return (long(ts), size)

def img_save(image_file, id): # copy image file to store
    # store location is built from id and some other fun stuff
    fname = "z_%06d" % int(id)
    dest = dataloc + "/" + "%02d" % (int(id) % 100) + '/' + fname
    # print dest
    shutil.copyfile(image_file, dest)
    return dest

def img_hash(image_file): # returns MD5 hash for a file
    fo = open(image_file, 'rb') # binary mode: we are hashing raw bytes
    m = hashlib.md5()
    stuff = fo.read(8192)
    while len(stuff) > 0:
        m.update(stuff)
        stuff = fo.read(8192)
    fo.close()
    return (m.hexdigest())

###
### This is where the action starts ###
if len(sys.argv) != 2:
    print "usage %s names_file\n" % sys.argv[0]
    exit(1)
lfo = openfl(sys.argv[1]) # filename list file
connection = store_open()
if connection < 0:
    print "%s: unable to initialize database" % sys.argv[0]
    exit(1)
# let's get the string to use for source info
rec = Rec(raw_input("Enter source info: "))
for f in lfo:
    f = f.strip() # toss possible newline
    display_pid = show_pix(f)
    disp = raw_input("s[ave]/d[iscard]/q[uit]: ")
    if disp != 'q' and disp != 'd':
        rec.timestamp, rec.size = file_ts(f)
        rec.source_path = f
        rec.md5 = img_hash(f) # hash
        form_fill(rec) # get user input
        id = store_add(rec) # insert in db
        savedloc = img_save(f, id) # copy the image
        print "Photo saved as %s\n" % savedloc
    os.system("kill %s" % display_pid) # close the viewer, saved or not
    if disp == 'q':
        break
Phil Hughes
Comments
Doesn't seem to work - get an error
I copied the script, saved it to a text file, and ran python fotosort.py.
I'm getting the following error:
-------------------------------------------------------------
File "fotosort.py", line 17
class Rec(): # what we will put in the db
^
SyntaxError: invalid syntax
-------------------------------------------------------------
What do I need to get the script to run?
Indenting
The most likely problem is indenting. I should have put the code in a separate file that could be downloaded. I have asked our Webmistress to do that.
Python blocks are indicated by indentation levels. While the code may appear ok, spaces and tabs that make things look lined up are not necessarily treated the same.
Phil Hughes
Code
There is a link to the code in the article above. You can also get it right here.
Katherine Druckman is webmistress at LinuxJournal.com. You might find her on Twitter or at the Southwest Drupal Summit
have you tried kphotoalbum or f-spot or others?
does your script do anything that one of the existing photomanagment applications can't do as well?
i am quite happy managing my 12000 photos in kphotoalbum.
it spreads them out on a scalable timeline and allows you to tag photos one by one or in bulk. it does not bother about the filenames or even the path as it tracks photos by checksum, so even if you move/rename them later, tags won't be lost.
i started using kphotoalbum after i had about 10000 photos and just walked through the timeline spending a couple of hours each day tagging batches of photos for few months.
before using kphotoalbum i just sorted my photos by date, putting them in a year/month/day sort of path. i still do that as kphotoalbum does not care.
i don't import the original photos into kphotoalbum but only a smaller version (800x600) of them to make handling easier.
(kphotoalbum does have a feature to handle offline storage though)
as for backups, i don't delete a photo from the cf-card until it is copied to at least two locations. one on my notebook and one on an external usb disk. that external disk contains disk-images in dvd-sized files. photos get placed into those disk images, and disk-images are written to a dvd as they fill up. once a dvd is written, i delete the directory that corresponds to that dvd from my notebook, so that i end up with one copy of the photo on dvd and one on the external usb disk. i also keep a second set of dvd-disks at my grandmothers place.
greetings, eMBee.
ooops...
that should have been a reply to the main article, not to the first comment.
greetings, eMBee.