Sunday, April 11, 2010

software: slicing up PDFs

I wanted to combine all the statements that I downloaded for my audit into a single PDF and then exclude all the cover pages plus the pages of boilerplate disclaimer, "how to reconcile your account", etc. PDF is a standard page presentation format, so you would expect there to be software to do this besides paying Adobe $119 for Adobe Acrobat.

There is, but it's the usual onion: a load of crap surrounding a simple idea.
  • Googling for "split PDF" finds the usual mess of sites and shareware and paid utilities
    • So I restrict to "linux split PDF", which points me to pdftk, which Sid Steward wrote in support of his book. But installing that requires 50MB of supporting GCJ packages. It's really cool that this runs as a standalone program, but I already have a Java interpreter installed, so this approach is 20× bigger and more complicated.
      • So I google some more and find joinPDF, supposedly a simple script and a Java library written by Gerard Briscoe, but the directory to download for this is defunct.
        • There are tons of other search results for this, on Mac shareware sites (someone bundled a graphical user interface for the Mac for users who don't know how to enter command lines), but their links are broken as well. (As an aside, why can't Google be smart? If I Google "download joinPDF" and a page with that text has a broken link, then don't waste my time with that search result!! I need a decision engine, not a search engine.)
          • I finally find a web site that has the simple original joinPDF for download. Follow the README.txt's instructions to manually copy the Java library and two scripts to the right location, and I'm set!
            • It turns out the actual core of this onion is a Java library, iText, written by Bruno Lowagie, that can slice and dice PDFs: both joinPDF and the bloated pdftk simply include this library and provide a wrapper around it.

Now enter the command line
joinPDF combined_statements.pdf checking*.pdf
and I get combined_statements.pdf! But the files use stupid date naming, so they're in the wrong order. Rename them with ISO 8601-style file names (2007-01, 2007-02, etc.) and repeat.
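The renaming step could be scripted along these lines. This is only a sketch: the post doesn't record what the downloaded names looked like, so the `checking_MM-DD-YYYY.pdf` pattern, `iso_name`, and `rename_all` are all made up for illustration. It also assumes one statement per month, since the day is dropped.

```python
import re
from pathlib import Path

# Hypothetical input pattern: <prefix>MM-DD-YYYY.pdf, e.g. checking_01-31-2007.pdf.
# The real statement names are not given in the post; adjust the regex to match.
US_DATE = re.compile(
    r"^(?P<prefix>.*?)(?P<month>\d{2})-(?P<day>\d{2})-(?P<year>\d{4})\.pdf$"
)

def iso_name(name):
    """Return name with its date rewritten as YYYY-MM, or unchanged if no date is found."""
    m = US_DATE.match(name)
    if m is None:
        return name
    return f"{m['prefix']}{m['year']}-{m['month']}.pdf"

def rename_all(directory="."):
    """Rename every checking*.pdf in directory so a shell glob sorts them by date."""
    for path in sorted(Path(directory).glob("checking*.pdf")):
        new_name = iso_name(path.name)
        if new_name != path.name:
            path.rename(path.with_name(new_name))
```

With year-then-month names, the plain alphabetical order that `checking*.pdf` expands to is also chronological order, which is all joinPDF needs.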

Now I have to excise the pages I don't want. joinPDF provides another command, splitPDF, to split a PDF into individual pages, but this does not remove particular ranges of pages. (I should have used splitPDF to split each statement into _page1, _page2, etc. files, then glued a subset of these together, but that seemed to mess up the thumbnail display.) I could probably get the source code and write my own simple wrapper around the iText library for an excise command; how hard can Java programming be? But that seems silly. Surely a Portable Document Format should make it easy to cut out pages I don't want.

I bring up combined_statements.pdf in the awesome vim text editor. It understands PDF files and color-codes certain words in them: obj, /Type, Kids, stream, etc. Looks promising, but there's no obvious Start of page 39... End of page 39 to chop out. I just need a little guidance as to what these mean. Back to Google for "PDF file format". But all of the articles show graphical tools or describe the format from the bottom up instead of telling me at a high level what to look for. So I add one of the words in the file, endobj, to my Google search, and find Introduction to PDF! That's what I need!

For reference, in a particular PDF produced by printing a Quicken document in Wine...

You need to delete the page object and optionally things it references. The PDF is full of flattened objects. Each object starts with NN 0 obj, where NN is the object's number and 0 is its generation number (almost always 0 in a freshly generated file), and ends with endobj. Delete from one to the other and you've removed an object.
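The "delete from NN 0 obj to endobj" idea can be sketched in a few lines of Python. This is a rough illustration rather than a real PDF parser: `list_objects` and `delete_object` are names invented here, and the regular expression assumes the word endobj never turns up inside a stream. Deleting bytes also leaves the byte offsets in the file's cross-reference table stale, so the result relies on a viewer that is willing to rebuild them.

```python
import re

# "NN G obj ... endobj", matched lazily so each object ends at its own endobj.
# Assumption: no stream in the file happens to contain the bytes "endobj".
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b.*?\bendobj\b", re.DOTALL)

def list_objects(pdf_bytes):
    """Map (object number, generation) to the (start, end) byte span of that object."""
    return {
        (int(m.group(1)), int(m.group(2))): (m.start(), m.end())
        for m in OBJ_RE.finditer(pdf_bytes)
    }

def delete_object(pdf_bytes, num, gen=0):
    """Return a copy of pdf_bytes with object `num gen` cut out."""
    start, end = list_objects(pdf_bytes)[(num, gen)]
    return pdf_bytes[:start] + pdf_bytes[end:]
```

Reading the file with `open(name, "rb").read()` and passing the bytes to `list_objects` gives a quick inventory of what is in there before cutting anything.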

One object in the file is:
2 0 obj
<< /Type /Pages /Kids [ 3 0 R
4 0 R
5 0 R
46 0 R
] /Count 44 >>
This lists all 44 pages in the file, using their object numbers. I think they're in the order you see them, so delete the Nth line inside the brackets and the PDF will no longer have an Nth page. Done! (My PDF viewer Okular doesn't seem to mind that the /Count 44 is no longer accurate.)
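The same edit can be done programmatically, keeping /Count honest along the way. A minimal sketch, assuming a flat page tree like the one above (every /Kids entry is a page, not a nested /Pages node); `drop_page` is a made-up helper that works on the bytes of the /Pages object only.

```python
import re

KIDS_RE = re.compile(rb"/Kids\s*\[(.*?)\]", re.DOTALL)
REF_RE = re.compile(rb"\d+\s+\d+\s+R")
COUNT_RE = re.compile(rb"/Count\s+\d+")

def drop_page(pages_obj, n):
    """Remove the n-th (1-based) reference from /Kids and update /Count to match."""
    m = KIDS_RE.search(pages_obj)
    if m is None:
        raise ValueError("no /Kids array found")
    refs = REF_RE.findall(m.group(1))
    if not 1 <= n <= len(refs):
        raise IndexError("page number out of range")
    del refs[n - 1]
    # Rebuild the array one reference per line, as in the original file.
    new_kids = b"/Kids [ " + b"\n".join(refs) + b"\n]"
    out = pages_obj[:m.start()] + new_kids + pages_obj[m.end():]
    # Assumes a flat tree, so /Count equals the number of remaining kids.
    return COUNT_RE.sub(b"/Count " + str(len(refs)).encode(), out, count=1)
```

For example, calling `drop_page` on a three-page /Pages object with `n=2` leaves the first and third references in place and sets /Count to 2.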

You can go on to actually get rid of the page object you removed from the page list:
46 0 obj
<< /Type /Page /Parent 2 0 R
/Contents 137 0 R
is the page itself. But that page object is only 12 lines long; where's the actual massive text block with the contents of the page? Well, any time you see NN 0 R, it's a reference to another object. Sure enough, /Contents 137 0 R points to another object with a huge stream of stuff:
137 0 obj
<< /Length 138 0 R >>
q 0.240000 0 0 0.240000 0 0 cm /R0 gs 0 w 1 J ... ...
So you can delete this as well. There are more objects you don't need, but they're small enough to leave around.
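Following the /Contents reference and deleting both objects can be sketched like this. The helper names are invented for the example, and it assumes generation 0 everywhere, a /Contents entry that is a single reference rather than an array, and no stray endobj inside the stream data. As with any byte-level deletion, the cross-reference offsets end up stale, so this leans on a forgiving viewer.

```python
import re

def object_span(pdf_bytes, num):
    """Return the (start, end) byte span of object `num 0 obj ... endobj`, or None."""
    # The lookbehind stops a search for 46 from matching inside "146 0 obj".
    pattern = re.compile(
        rb"(?<!\d)%d\s+0\s+obj\b.*?\bendobj\b\s*" % num, re.DOTALL
    )
    m = pattern.search(pdf_bytes)
    return (m.start(), m.end()) if m else None

def remove_page_and_contents(pdf_bytes, page_num):
    """Delete a page object plus the content stream object it points to."""
    span = object_span(pdf_bytes, page_num)
    if span is None:
        raise KeyError(page_num)
    page = pdf_bytes[span[0]:span[1]]
    targets = [page_num]
    # Assumption: /Contents is a single "NN 0 R" reference, as in the excerpt above.
    m = re.search(rb"/Contents\s+(\d+)\s+0\s+R", page)
    if m:
        targets.append(int(m.group(1)))
    for num in targets:
        span = object_span(pdf_bytes, num)
        if span:
            pdf_bytes = pdf_bytes[:span[0]] + pdf_bytes[span[1]:]
    return pdf_bytes
```

Small objects such as the /Length object that the stream refers to are left alone, in line with the "small enough to leave around" approach.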

Update: The joinPDF author's web site actually does exist and you can click through (Software > joinPDF) to his software, but incredibly, Google search results show all those broken links in preference to this! Maybe because he's using frames, but c'mon Google, be smart!
