Anyone with an email address can expect to receive attachments in a multitude of formats. Unfortunately, some formats cannot be read using free software. This is especially true if our email buddies are still involved in the arguably risky practice of using proprietary programs in conjunction with their email readers.
Many free software advocates adopt a policy of ignoring all email with attachments dependent on closed source software, opting instead to lecture the sender on the importance of open standards. Others may not like missing out on the fun to be had from attachments being forwarded amongst their peers. If you find yourself in this situation, the techniques outlined in this article may serve as a partial solution.
There is not much a Linux user can do if the entire contents of an attachment are encoded using a jealously guarded secret algorithm. Very often, however, the problematic file is merely a thin proprietary envelope enclosing a loose collection of data objects that use well-known encoding standards. For instance, some MS Word documents being forwarded around the Net contain ordinary JPG and PNG images embedded within the file. If we can find a way to remove the envelope, reading these enclosed files would be a straightforward matter. The following sections describe how this can be accomplished using a little Python scripting together with a few image viewing and manipulation tools available on most Linux distributions.
Before tackling the problem of the embedded images we can easily
view any readable text using the strings utility:
strings proprietary.file | less
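The same utility can also hint at whether any images are embedded in the file, since each format carries a recognisable marker near its start. The -n 3 option lowers the minimum string length from the default of four characters so that the three-letter PNG marker is reported:
strings proprietary.file | grep JFIF
strings -n 3 proprietary.file | grep PNG
strings proprietary.file | grep GIF8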
We need to find where exactly each image is located within the
file. A little Python will help to find possible embedded images and
report their positions as a byte offset:
# read in the proprietary data as raw bytes
with open("proprietary.file", "rb") as fh:
    dat = fh.read()

# search for every occurrence of the JFIF marker
x = -1
while True:
    x = dat.find(b"JFIF", x + 1)
    if x < 0:
        break
    # the file actually started 6 bytes earlier
    print(x - 6)
#!/usr/bin/env python3
from sys import argv

# (marker, distance from the start of the image to the marker)
headers = [(b"GIF8", 0), (b"PNG", 1), (b"JFIF", 6)]

filepath = "proprietary.file"
if len(argv) > 1:
    filepath = argv[1]

# read the whole file as raw bytes
with open(filepath, "rb") as fh:
    dat = fh.read()

for kw, off in headers:
    x = -1  # start at -1 so that a marker at byte 0 is not skipped
    while True:
        x = dat.find(kw, x + 1)
        if x < 0:
            break
        print(kw.decode(), "file begins at byte", x - off)
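Short markers such as PNG can also turn up by chance in unrelated binary data, so some of the reported offsets will be false alarms. The following is a minimal sketch, not part of the original scripts, of one way to filter candidates by checking the fuller file signatures: the 8-byte PNG signature, the JPEG start-of-image bytes followed by JFIF at offset 6, and the two GIF version strings. The helper name looks_like_image is hypothetical.

```python
# Hypothetical helper: names and structure are illustrative only.
PNG_SIG = b"\x89PNG\r\n\x1a\n"      # full 8-byte PNG signature
JPEG_SOI = b"\xff\xd8"              # JPEG start-of-image marker
GIF_SIGS = (b"GIF87a", b"GIF89a")   # the two GIF versions


def looks_like_image(dat, beg):
    """Return 'png', 'jpg' or 'gif' if dat[beg:] starts with a full
    signature, otherwise None."""
    if beg < 0:
        return None
    if dat.startswith(PNG_SIG, beg):
        return "png"
    if dat.startswith(JPEG_SOI, beg) and dat[beg + 6:beg + 10] == b"JFIF":
        return "jpg"
    if dat.startswith(GIF_SIGS, beg):
        return "gif"
    return None


# A fake envelope: 4 junk bytes, then the first bytes of a JFIF header.
sample = b"junk" + b"\xff\xd8\xff\xe0\x00\x10JFIF\x00" + b"more"
print(looks_like_image(sample, 4))  # jpg
print(looks_like_image(sample, 0))  # None
```

A check like this could be applied to each candidate offset before trying to display or extract anything, at the cost of rejecting images whose headers deviate from these common forms.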
Now that we know where each image is likely to start, how do we display
them? ImageMagick's display utility can help here. Suppose
our proprietary file contains a JPEG image beginning at byte 1000.
We can use tail to remove all the bytes that precede it and pipe
the rest to display. Note that tail counts from 1, so byte offset 1000 becomes +1001:
tail -c +1001 proprietary.file | display -
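For readers who would rather stay inside Python, the same step can be sketched without tail. This is only an illustration under the assumption that ImageMagick's display is installed and reads an image from standard input when given "-", as in the command above. The function names read_from_offset and show_image are hypothetical.

```python
import subprocess


def read_from_offset(path, offset):
    """Return every byte of the file from zero-based `offset` onward,
    the same data that `tail -c +(offset+1)` would print."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.read()


def show_image(data):
    """Feed raw image bytes to ImageMagick's display on standard input.
    Assumes the display program is available on the PATH."""
    subprocess.run(["display", "-"], input=data, check=False)
```

With these two helpers, show_image(read_from_offset("proprietary.file", 1000)) would play the role of the pipeline shown above.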
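#!/usr/bin/env python3
# Display every candidate image found in the file
from sys import argv
from os import system
from shlex import quote

# (marker, distance from the start of the image to the marker)
headers = [(b"GIF8", 0), (b"PNG", 1), (b"JFIF", 6)]

filepath = "proprietary.file"
if len(argv) > 1:
    filepath = argv[1]

# read the whole file as raw bytes
with open(filepath, "rb") as fh:
    dat = fh.read()

for kw, off in headers:
    x = -1  # start at -1 so that a marker at byte 0 is not skipped
    while True:
        x = dat.find(kw, x + 1)
        if x < 0:
            break
        # tail counts bytes from 1, hence the +1; quote() protects odd file names
        system("tail -c +%d %s | display -" % (x - off + 1, quote(filepath)))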
ImageMagick throws away any excess data fed to it after reading to
the end of the image segment. If we want to separate the image data
completely for storage as individual files, we also need to find the
end of each image. One way to do this is to use a modified binary chop
algorithm.
Listing 3
#!/usr/bin/env python3
from sys import argv
from shlex import quote
from subprocess import getstatusoutput

# (marker, distance from image start to marker, decoder command, extension)
headers = [(b"GIF8", 0, "giftopnm", "gif"),
           (b"PNG", 1, "pngtopnm", "png"),
           (b"JFIF", 6, "djpeg", "jpg")]

filepath = "proprietary.file"
if len(argv) > 1:
    filepath = argv[1]

# read the whole file as raw bytes
with open(filepath, "rb") as fh:
    dat = fh.read()

inum = 0
for kw, off, conv, ext in headers:
    x = -1
    while True:
        x = dat.find(kw, x + 1)
        if x < 0:
            break
        beg = x - off
        # possible image located -- find end by binary chop
        s1 = len(dat) - beg  # upper bound: everything up to the end of the file
        s0 = 1               # lower bound: a single byte
        sz = s1
        while s0 < s1:
            # ask the decoder whether the first sz bytes form a complete image
            (stat, output) = getstatusoutput(
                "tail -c +%d %s | head -c %d | %s >/dev/null"
                % (beg + 1, quote(filepath), sz, conv))
            if stat:
                # failed -- possibly too small
                if sz == s1:
                    # failed -- probably invalid data
                    print("failed... no image here")
                    break
                elif sz == s0:
                    # we've found the length -- write out image
                    imgname = "image%03d.%s" % (inum, ext)
                    print("writing", imgname)
                    with open(imgname, "wb") as fh:
                        fh.write(dat[beg:beg + s1])
                    inum = inum + 1
                    break
                s0 = sz
            else:
                # might be too big -- try smaller
                s1 = sz
            sz = (s0 + s1) // 2
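The search in Listing 3 depends on one property of the decoders: they fail when given a truncated image and succeed once they have been fed at least the whole of it. Stripped of the shell pipeline, the idea can be sketched as follows. This is an illustrative reduction rather than the listing itself. The function name smallest_accepted and the toy test function are hypothetical.

```python
def smallest_accepted(lo, hi, accepts):
    """Binary chop for the smallest size n with lo < n <= hi that
    accepts(n) approves. Assumes accepts(lo) is False and that once a
    size is accepted every larger size is accepted too.
    Returns None if even hi is rejected."""
    if not accepts(hi):
        return None
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if accepts(mid):
            hi = mid   # mid bytes are enough, try smaller
        else:
            lo = mid   # mid bytes are too few, try larger
    return hi


# Toy stand-in for a decoder: it "succeeds" once it sees at least 1234 bytes.
print(smallest_accepted(0, 10_000, lambda n: n >= 1234))  # 1234
```

Each halving costs one decoder run, so even a candidate region of several megabytes needs only a couple of dozen runs to pin down the image length.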
This article has shown how to write scripts that extract data objects, encoded using platform-independent open standards, from within proprietary files. It should be a simple task to extend these scripts for handling other image formats and even other types of data objects, such as sound and music files. Note that there are many file formats that frustrate the techniques described here via a layer of simple encryption and/or obfuscation.
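As a hedged illustration of such an extension, the sketch below scans for a few audio markers instead of image markers. The signature table is an assumption added here, not something taken from the scripts above: Ogg pages begin with OggS, Standard MIDI files begin with MThd, and RIFF/WAVE files carry WAVE eight bytes after the start. The names EXTRA_SIGNATURES and scan are hypothetical.

```python
# Assumed extra signatures (not from the article):
# (label, marker, distance from the start of the object to the marker)
EXTRA_SIGNATURES = [
    ("ogg", b"OggS", 0),    # Ogg container pages start with "OggS"
    ("midi", b"MThd", 0),   # Standard MIDI files start with an "MThd" chunk
    ("wav", b"WAVE", 8),    # RIFF/WAVE: "RIFF", 4-byte size, then "WAVE"
]


def scan(dat, signatures):
    """Return sorted (start_offset, label) pairs for every marker hit."""
    hits = []
    for label, marker, off in signatures:
        x = dat.find(marker)
        while x >= 0:
            if x >= off:   # ignore hits too close to the start of the data
                hits.append((x - off, label))
            x = dat.find(marker, x + 1)
    return sorted(hits)


# Fake envelope: padding, a MIDI header marker, padding, a WAVE header.
sample = b"\x00" * 5 + b"MThd" + b"\x00" * 3 + b"RIFF\x24\x00\x00\x00WAVEfmt "
print(scan(sample, EXTRA_SIGNATURES))  # [(5, 'midi'), (12, 'wav')]
```

Finding the end of each object would still need a format-specific check, just as the image scripts rely on a decoder for each image type. An Ogg stream also repeats OggS at the start of every page, so only the first hit of a run marks the beginning of the stream.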
Even if one has access to the appropriate proprietary application for reading a particular email attachment, the scripts outlined above can be useful for avoiding any possible macro viruses or security exploits specific to that application.
And finally, a word of warning. The legislatures of some countries have passed vaguely worded laws that can be interpreted in such a way that these scripts may be considered illegal copyright circumvention devices. This may or may not be relevant to you, depending on the country where you reside. As is always the case when mixing open and closed source systems, your mileage may vary.
[Editor's note: The Python Imaging Library (PIL) provides a way to work with images from within a larger program. You can open an image and read its type and dimensions, transform it, create thumbnails, etc. -Iron.]