Sunday, April 11, 2010

software: slicing up PDFs

I wanted to combine all the statements that I downloaded for my audit into a single PDF and then exclude all the cover pages plus the pages of boilerplate disclaimer, "how to reconcile your account", etc. PDF is a standard page presentation format, so you would expect there to be software to do this besides paying Adobe $119 for Adobe Acrobat.

There is, but it's the usual onion: a load of crap surrounding a simple idea.
  • Googling for "split PDF" finds the usual mess of sites and shareware and paid utilities
    • So I restrict to "linux split PDF", which points me to pdftk, which Sid Steward wrote in support of his book. But installing that requires 50MB of supporting GCJ packages. It's really cool that it runs as a standalone program, but I already have a Java interpreter installed, so this approach is 20× bigger and more complicated than it needs to be.
      • So I google some more and find joinPDF, supposedly a simple script and a Java library written by Gerard Briscoe, but the download directory for it is defunct.
        • There are tons of other search results for this, on Mac shareware sites (someone bundled a graphical user interface for the Mac for users who don't know how to enter command lines), but their links are broken as well. (As an aside, why can't Google be smart? If I Google "download joinPDF" and a page with that text has a broken link, then don't waste my time with that search result!! I need a decision engine, not a search engine.)
          • I finally find a web site that has the simple original joinPDF for download. Follow the README.txt's instructions to manually copy the Java library and two scripts to the right location, and I'm set!
            • It turns out the actual core of this onion is a Java library, iText, written by Bruno Lowagie, that can slice and dice PDFs: both joinPDF and the bloated pdftk simply include this library and provide a wrapper around it.

Now enter the command line

joinPDF combined_statements.pdf checking*.pdf

and I get combined_statements.pdf! But the files use stupid date naming, so they combine in the wrong order. Rename them with ISO8601 date format file names (2007-01, 2007-02, etc.) and repeat.
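The renaming step is mechanical enough to script. Here's a sketch in Python (the original filenames are made up for illustration; adjust the pattern to whatever your bank actually emits):

```python
import re

# Map month abbreviations to numbers so names rebuild in ISO8601 order.
MONTHS = {m: f"{i:02d}" for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

def iso_name(name):
    """Turn a hypothetical 'checking_Jan07.pdf' into 'checking_2007-01.pdf'."""
    m = re.match(r"(.*)_([A-Z][a-z]{2})(\d{2})\.pdf$", name)
    if not m:
        return name  # leave unrecognized names alone
    stem, mon, yy = m.groups()
    return f"{stem}_20{yy}-{MONTHS[mon]}.pdf"

names = ["checking_Jan07.pdf", "checking_Feb07.pdf", "checking_Dec06.pdf"]
print(sorted(iso_name(n) for n in names))
# → ['checking_2006-12.pdf', 'checking_2007-01.pdf', 'checking_2007-02.pdf']
```

With ISO8601 names, plain lexicographic sorting is chronological, so the shell glob hands joinPDF the statements in the right order.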

Now I have to excise the pages I don't want. joinPDF provides another command, splitPDF, to split a PDF into individual pages, but it cannot remove particular ranges of pages. (I should have used splitPDF to split each statement into _page1, _page2, etc. files, then glued a subset of these back together, but that seemed to mess up the thumbnail display.) I could probably get the source code and write my own simple wrapper around the iText library for an excise command; how hard can Java programming be? But that seems silly. Surely a Portable Document Format should make it easy to cut out pages I don't want.
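If I ever did write that excise wrapper, the page-selection logic at its core is simple. A sketch in Python (the function name and interface are my own invention, not anything from iText):

```python
def pages_to_keep(total, exclude):
    """Given a page count and a list of 1-based (first, last) inclusive
    ranges to excise, return the 1-based page numbers that survive."""
    cut = set()
    for first, last in exclude:
        cut.update(range(first, last + 1))
    return [p for p in range(1, total + 1) if p not in cut]

# Drop the cover page and two pages of boilerplate from a 6-page statement:
print(pages_to_keep(6, [(1, 1), (5, 6)]))  # [2, 3, 4]
```

A real wrapper would then copy only those pages into a new document; the hard part is the PDF plumbing, not the arithmetic.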

I bring up combined_statements.pdf in the awesome vim text editor. It understands PDF files and color-codes certain keywords in them: obj, /Type, Kids, stream, etc. Looks promising, but there's no obvious "Start of page 39 ... End of page 39" to chop out. I just need a little guidance as to what these mean. Back to Google for "PDF file format". But all of the articles show graphical tools or describe the format from the bottom up instead of telling me at a high level what to look for. So I add one of the words in the file, endobj, to my Google search, and find Introduction to PDF! That's what I need!

For reference, in a particular PDF produced by printing a Quicken document in Wine...

You need to delete the page object and, optionally, the things it references. The PDF is full of flattened objects. Each object starts with NN 0 obj, where NN is the object number and 0 is its version (0 for most generated objects), and ends with endobj. Delete from one to the other and you've removed an object.
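To make that structure concrete, here's a toy Python sketch that finds the objects in a text-only PDF fragment of my own making. (Real PDFs are binary and contain compressed streams, so a regex like this is illustration only, not a robust parser.)

```python
import re

# A made-up, text-only fragment in the NN 0 obj ... endobj shape.
pdf_text = """\
2 0 obj
<< /Type /Pages /Kids [ 3 0 R ] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R >>
endobj
"""

# Each object runs from "NN 0 obj" to its matching "endobj".
obj_re = re.compile(r"(\d+) 0 obj\n(.*?)\nendobj\n", re.DOTALL)
objects = {int(m.group(1)): m.group(2) for m in obj_re.finditer(pdf_text)}
print(sorted(objects))  # [2, 3]
```

Deleting object NN is just deleting that whole span, which is exactly the hand edit described above.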

One object in the file is:
2 0 obj
<< /Type /Pages /Kids [ 3 0 R
4 0 R
5 0 R
46 0 R
] /Count 44 >>
This lists all 44 pages in the file by their object numbers. I think they're in the order you see them, so delete the Nth line inside the brackets and the PDF will no longer have an Nth page. Done! (My PDF viewer, Okular, doesn't seem to mind that the /Count 44 is no longer accurate.)
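The delete-a-kid edit is mechanical too. A toy Python sketch over a self-consistent 4-page /Kids array of my own making (like the hand edit in vim, it touches nothing else in the file, though here I keep /Count accurate for tidiness):

```python
import re

kids_block = ("<< /Type /Pages /Kids [ 3 0 R\n"
              "4 0 R\n"
              "5 0 R\n"
              "46 0 R\n"
              "] /Count 4 >>")

def drop_nth_page(block, n):
    """Remove the Nth page reference (1-based) from a /Kids array
    and decrement /Count to match."""
    refs = re.findall(r"(\d+) 0 R", block)
    del refs[n - 1]
    count = int(re.search(r"/Count (\d+)", block).group(1)) - 1
    kids = "\n".join(f"{r} 0 R" for r in refs)
    return f"<< /Type /Pages /Kids [ {kids}\n] /Count {count} >>"

print(drop_nth_page(kids_block, 4))  # the 46 0 R line is gone, /Count is 3
```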

You can go on to actually get rid of the page object you removed from the page list:
46 0 obj
<< /Type /Page /Parent 2 0 R
/Contents 137 0 R
is the page itself. But that page object is only 12 lines long; where's the actual massive text block with the contents of the page? Well, any time you see NN 0, it's probably a reference to another object. Sure enough, /Contents 137 0 points to another object with a huge stream of stuff:
137 0 obj
<< /Length 138 0 R >>
q 0.240000 0 0 0.240000 0 0 cm /R0 gs 0 w 1 J ... ...
So you can delete this as well. There are more objects you don't need, but they're small enough to leave around.
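The NN 0 R indirection can be followed the same way. Another toy Python sketch, with abbreviated object bodies made up to match the snippets above:

```python
import re

# Made-up, abbreviated object bodies keyed by object number.
objects = {
    46: "<< /Type /Page /Parent 2 0 R /Contents 137 0 R >>",
    137: "<< /Length 138 0 R >> stream q 0.24 0 0 0.24 0 0 cm ... endstream",
}

def resolve_contents(objects, page_num):
    """Follow a page object's /Contents NN 0 R reference to the object
    number holding the page's content stream."""
    m = re.search(r"/Contents (\d+) 0 R", objects[page_num])
    return int(m.group(1))

print(resolve_contents(objects, 46))  # 137
```

So deleting page 46 means deleting object 46 from the /Kids list, then object 46 itself, then chasing /Contents to object 137 and deleting that too.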

Update: The joinPDF author's web site actually does exist and you can click through (Software > joinPDF) to his software, but incredibly, Google search results show all those broken links in preference to this! Maybe because he's using frames, but c'mon Google, be smart!


web: it's my data on their web site, let me get at it

I'm being audited, the IRS says "Please bring cancelled checks and deposit slips", how quaint. It's more like 250+ pages of electronic statements and electronic check images to print out. I wish the IRS let you bring a directory of hyperlinked PDFs.

Fortunately my financial institutions provide online records going back far enough, though one (whose name rhymes with "smells cargo") cuts off after a pathetically short two years.

==> Save your own PDF copies of your statements! Don't rely on your bank.

Unfortunately, all financial institutions make it difficult to grab this information. The URL to download my January 2007 statement is invariably an impenetrable mess. It should be just https://secure.thebank.com/records/internalUserID/2007/statements/checking_1234_2007-01.pdf, where internalUserID is what refers to me internally. Then I can just change the end of the URL to 2007-02, -03, etc. You might think it's more secure to have a meaningless jumbled URL with token IDs and session IDs and crap, but that's confusing a secured session with a complicated name, and it's guaranteeing the URLs will change when they rethink their web site.
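Under the clean URL scheme I'm wishing for, grabbing a whole year of statements would be trivial to script. A sketch in Python (the URL layout is my hypothetical one from above, not any real bank's):

```python
# Hypothetical clean layout: one predictable path per monthly statement.
BASE = ("https://secure.thebank.com/records/{uid}/{year}"
        "/statements/checking_1234_{year}-{month:02d}.pdf")

def statement_urls(uid, year):
    """Generate the twelve monthly statement URLs for one year."""
    return [BASE.format(uid=uid, year=year, month=m) for m in range(1, 13)]

urls = statement_urls("internalUserID", 2007)
print(urls[0])
# https://secure.thebank.com/records/internalUserID/2007/statements/checking_1234_2007-01.pdf
```

Twelve predictable URLs, one loop, done; that's the whole argument against jumbled token-laden URLs in a nutshell.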

(The same really holds true for any other data on the web. I can't get my pictures out of Sprint PictureMail because there isn't a simple URL for each one.)

Also, the institutions do the usual crappy job of naming the downloaded files. When I repeatedly click to download my statements, I get files in ^$#@! random order because the institution didn't use ISO8601 date format. BANKSTMT_1234_2008-04.pdf sorts in the right order; why do people persist in using stupid date formats?
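The sort problem takes two lines to demonstrate (filenames invented):

```python
# Month-name dates sort alphabetically, not chronologically.
stupid = ["BANKSTMT_1234_Apr08.pdf", "BANKSTMT_1234_Jan08.pdf",
          "BANKSTMT_1234_Feb08.pdf"]
# ISO8601 dates sort chronologically for free.
iso = ["BANKSTMT_1234_2008-04.pdf", "BANKSTMT_1234_2008-01.pdf",
       "BANKSTMT_1234_2008-02.pdf"]

print(sorted(stupid))  # Apr, Feb, Jan: alphabetical, not chronological
print(sorted(iso))     # 2008-01, 2008-02, 2008-04: chronological
```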

The really interesting issue is what happens if I'm no longer a customer of Tells Margo. The moment you're not a customer, you lose access. But that's not fair: a former customer should still have the right to access old data. Again, that's why simple URLs are so important. An institution should let me access /records/internalUserID/correspondence/2010/some_old_record.pdf even if my accounts are defunct. And again, until the world works as it should, save those records in your own well-organized system, despite the hassle.


Monday, August 4, 2008

web: still no date love

I complained about the complete and utter randomness of date formats. Going through receipts from travel:
Jul'06 08
06JUL 2008
And, Yankee imperialist scum, none of those are June: they're all July 6th, 2008.

How much time do people waste comprehending the 20 different ways of representing a date? What does it take to get companies to stop this nonsense and use ISO8601 dates?

2008-07-06. Done. All ambiguity gone, for anyone on Earth.
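All of those formats can at least be normalized mechanically. A Python sketch (the format list is illustrative, with one US-style ambiguous entry added; the strptime patterns are per-format guesses):

```python
from datetime import datetime

# Formats of the kind seen on travel receipts, paired with guessed
# strptime patterns. The slash format is the ambiguous US-order one.
samples = {
    "Jul'06 08": "%b'%d %y",
    "06JUL 2008": "%d%b %Y",
    "07/06/08": "%m/%d/%y",
}

for raw, fmt in samples.items():
    d = datetime.strptime(raw, fmt)
    print(raw, "->", d.strftime("%Y-%m-%d"))
# every line prints -> 2008-07-06
```

Of course the catch is that you have to know each source's format to decode it, which is exactly the waste ISO8601 eliminates.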
