web: book reviews again

I have a substantial pile of books I’ve read that will injure me in an earthquake. I ought to write perspicacious pithy reviews of them. I could write them on Amazon, but why should Amazon own and profit from my words? I could write them on https://lib.reviews/, “a free, open and not-for-profit platform for reviewing absolutely anything, in any language,” but it seems a bit moribund. Instead I have this web site! Putting book reviews here will ensure they live forever in complete obscurity.

Oh no, not the semantic web again!

A long time ago I just wrote a definition list in HTML in Blogger with each book title followed by a paragraph underneath. Then the idea of a semantic web came along: the web page should unambiguously tell machines that a chunk of writing is a review of a particular book rather than me advertising some books for sale, or writing about the author. And it should tell the machines it’s a review by skierpage, of a book with a particular title and ISBN, who gives it a rating of 3 out of 5 stars, etc.

Why bother?

Disclaimer: all the semantic web work below is probably irrelevant. If your web page is important according to Google’s PageRank algorithm, then Google will devote AI to figuring out what it says, even if it has no, or incorrect, semantic markup. So most of those making the effort to do this semantic markup are shady SEO (search engine optimization) sites, trying to convince you that if you jump through all these hoops or pay them to do it, then your site on topic X will somehow rise in search results from utter obscurity on the 20th page of results to mostly ignored on the 4th page.

hReview microformat

Back in 2011 the leading implementation of this idea for plain web pages was microformats: you probably already have the bits of text in your human-readable book review, so put additional markup (the ‘M’ in Hypertext Markup Language) around them identifying the bit that’s the rating, the summary, and so on, using invisible HTML attributes like class=reviewer, class=rating, class=summary, etc. So I wrote a few reviews using an online tool to generate the necessary HTML, which I pasted into WordPress.

So many schemas

The hReview microformat is still going and supposedly Google still parses it when it crawls web pages. Some big guns of Web 2.0 (Google, Microsoft, Yahoo, and Yandex) came up with their own standard for structured data, similar but different, at the poorly named schema.org: “a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond.” This got more detailed and complicated than microformats: there are separate related schemas for a review by the person skierpage about a book authored by another person. And there are three ways you can put the machine-readable information into your web pages (two too many!).

Google provides a structured data markup helper to guide me in creating this markup, and then its structured data testing tool to see if I got it right. (There was another schema generator at tools.seochat.com, now defunct, and another checker at linter.structured-data.org.) If you choose to put invisible markup in the page surrounding the text of your review (schema.org calls this “microdata,” different from “microformat”), the HTML looks something like:

<!-- Microdata markup added by Google Structured Data Markup Helper. -->
  <div itemscope itemtype="http://schema.org/Book" id="hreview-Sprawling,-very-good!">
  <meta itemprop="isbn" content="03-5091234-034">
  <meta itemprop="genre" content="Science Fiction">
  <meta itemprop="datePublished" content="2017-06-04">
  <h3>Sprawling, very good!</h3>
  <p>
    <img itemprop="image" class="photo" src="http://ecx.images-amazon.com/images/I/51Gvu3UlqGL.jpg" width="167" height="250" alt="cover of 'River of Gods'" align="left" style="margin-right: 1em"/>
  </p>

  <div class="item">
    <a title="paperback at Amazon" href="http://www.amazon.com/River-Gods-Ian-McDonald/dp/1591025958" class="fn url">
      <span itemprop="name">River of Gods</span>
    </a>
    by
    <a href="http://en.wikipedia.org/wiki/Ian_McDonald_%28British_author%29">
      <span itemprop="author" itemscope itemtype="http://schema.org/Person">
        <span itemprop="name">Ian McDonald</span>
      </span>
    </a>
  </div>
  <p itemprop="review" itemscope itemtype="http://schema.org/Review" class="description">
    <abbr itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating" class="rating" title="4">
      <span itemprop="ratingValue">4</span>
      5
    </abbr>
    <span itemprop="reviewBody">This does a fantastic job of presenting the foreign culture of ... !</span>
    <meta itemprop="datePublished" content="2007-08-01">
    <span itemprop="author" itemscope itemtype="http://schema.org/Person">
      <meta itemprop="name" content="skierpage">
      <meta itemprop="sameAs" content="http://www.skierpage.com/about/">
    </span>
  </p>
</div>

The problem is, if I copy and paste this complicated HTML into WordPress’s post editor, it throws away much of the HTML markup, for example all the <meta> tags for information I don’t want to display, like <meta itemprop="datePublished" content="2007-08-01">. There are any number of dubious plug-ins to WordPress that support parts of schema.org schemas and want money for a professional version from desperate non-technical web site owners who see their traffic dropping and will clutch at straws hoping to appear higher in Google search results, but I don’t understand what these plug-ins do or don’t do.

Another representation for this structured data is JSON-LD, a completely separate copy of the semantic information that you stick in your web page where the reader never sees it. So maybe just sticking in a block of JSON-LD will work better (a guide to supporting it in WordPress is in the section “Implementing Structured Data Using JSON-LD” of the schema article at torquemag.io). Hmmm…, instead of copying and pasting twice, can I put this inside WordPress myself? Maybe try the Markup (JSON-LD) Structure in schema.org plug-in for WordPress? A wpengine article has JSON-LD generators, but they’re not much good.

Tracking data

The problem with JSON-LD is I have to put the same information into the web page twice, first as HTML to display to human readers, and then again in this invisible data format. Or maybe use Handlebars or something to spit out both the block of JSON and the HTML. A spreadsheet may be best to track most of this information. It sucks for entering formatted text, but probably OK just for a pithy two-sentence review.

Generated HTML

Each book review in the spreadsheet should generate both the JSON-LD that web crawlers should read, and a human-readable book review. In the latter, I want things to link to something useful.

The ISBN should probably link to https://en.wikipedia.org/wiki/Special:BookSources/{ISBN}, for example https://en.wikipedia.org/wiki/Special:BookSources/0060932902. Or I could accept that Jeff Bezos owns us and have it link to Amazon’s ASIN? Wikipedia’s Special:BookSources above creates a query https://www.amazon.com/s?k=0060932902; note how the dashes are removed in the query, otherwise it doesn’t work. Spam-filled https://kindlepreneur.com/amazon-search-url-isbn-ref/ says you can use a 10-digit ISBN in place of the ASIN, e.g. https://www.amazon.com/dp/0060932902, but you still have to remove the dashes.
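
The dash-stripping and link-building could be sketched in Python like this. The function names are mine, not from any library, and the URL patterns are just the ones quoted above; whether Amazon keeps honoring them is an assumption.

```python
import re


def normalize_isbn(isbn: str) -> str:
    """Strip dashes and spaces; accept 10 characters (last may be X) or 13 digits."""
    cleaned = re.sub(r"[\s-]", "", isbn).upper()
    if not re.fullmatch(r"\d{9}[\dX]|\d{13}", cleaned):
        raise ValueError(f"not a 10- or 13-character ISBN: {isbn!r}")
    return cleaned


def book_sources_url(isbn: str) -> str:
    """Link to Wikipedia's Special:BookSources page for this ISBN."""
    return "https://en.wikipedia.org/wiki/Special:BookSources/" + normalize_isbn(isbn)


def amazon_search_url(isbn: str) -> str:
    """The Amazon search query that Special:BookSources generates."""
    return "https://www.amazon.com/s?k=" + normalize_isbn(isbn)


def amazon_dp_url(isbn: str) -> str:
    """Direct product link; the article cited above only claims this for 10-digit ISBNs."""
    cleaned = normalize_isbn(isbn)
    if len(cleaned) != 10:
        raise ValueError("the /dp/ shortcut needs a 10-digit ISBN")
    return "https://www.amazon.com/dp/" + cleaned
```

This only checks length and character set, not the ISBN check digit, so a typo in the spreadsheet would still produce a link that goes nowhere.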

Other items in the review, like the author name and book title, should link to Wikipedia pages if available. There’s no easy way to know that Ian McDonald’s English Wikipedia page is at https://en.wikipedia.org/wiki/Ian_McDonald_(British_author), so the spreadsheet needs to have columns for Author URL and Book URL. (The alternative would be to store the Wikidata ‘Q’ numbers for each of these and work backwards from the wikidata info to the English Wikipedia pages, if any, for them.)
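
For the Wikidata alternative, here is a rough sketch of working backwards from a ‘Q’ number. It assumes the JSON served at Special:EntityData has the shape entities → Q-id → sitelinks → enwiki → title, which is my understanding of that endpoint rather than something I have checked against every entity. The sample data is hand-written and trimmed, and the function names are made up for illustration. No network call is made here; fetching the URL is left to whatever HTTP client you like.

```python
from typing import Optional
from urllib.parse import quote


def entity_data_url(qid: str) -> str:
    """URL that should return the entity's JSON, including its sitelinks."""
    return f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def enwiki_url(entity_data: dict, qid: str) -> Optional[str]:
    """Return the English Wikipedia URL for qid, or None if it has no enwiki page."""
    sitelinks = entity_data.get("entities", {}).get(qid, {}).get("sitelinks", {})
    link = sitelinks.get("enwiki")
    if link is None:
        return None
    # Wikipedia URLs use underscores for spaces; leave parentheses readable.
    title = link["title"].replace(" ", "_")
    return "https://en.wikipedia.org/wiki/" + quote(title, safe="/_(),:")


# Hand-written, trimmed sample in the assumed response shape.
SAMPLE = {
    "entities": {
        "Q42": {"sitelinks": {"enwiki": {"site": "enwiki", "title": "Douglas Adams"}}}
    }
}
```

With this approach the spreadsheet would hold one Q number per author or book instead of a URL column, at the cost of a lookup per item when generating the page.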

Coding it

Uh, scripting… Python? I quickly found a library to read a spreadsheet, and everyone seems to use jinja2 for HTML templating in Python. Adding these libraries means dealing with all the ways to manage the Python libraries in a project; I have used pip and virtualenv in the past, but now teh hotness is pipenv, so install that and then add pyexcel-ods and jinja2. I’m rocking! In two hours I’ve read a line of my book reviews spreadsheet and generated some HTML.
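
To give the flavor of that step without any dependencies, here is a standard-library stand-in: csv instead of pyexcel-ods and string.Template instead of jinja2. It is not the script described above. The column names and the HTML layout are assumptions of mine, and the sample row reuses values from the examples earlier in this post.

```python
import csv
import html
import io
from string import Template
from typing import List

# Hypothetical layout for one human-readable review.
REVIEW_TEMPLATE = Template(
    '<div class="review">\n'
    '  <h3>$summary</h3>\n'
    '  <p><a href="$book_url">$title</a> by <a href="$author_url">$author</a>: '
    '$rating out of 5</p>\n'
    '  <p>$review</p>\n'
    '</div>'
)

# Spreadsheet exported as CSV; the first row names the (assumed) columns.
SAMPLE_CSV = (
    "summary,title,author,book_url,author_url,rating,review\n"
    '"Sprawling, very good!",River of Gods,Ian McDonald,'
    "http://www.amazon.com/River-Gods-Ian-McDonald/dp/1591025958,"
    "https://en.wikipedia.org/wiki/Ian_McDonald_(British_author),"
    "4,The book has a nice cover.\n"
)


def render_reviews(csv_text: str) -> List[str]:
    """Render one HTML fragment per spreadsheet row."""
    rendered = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Escape every cell so stray < or & in a review cannot break the page.
        safe = {key: html.escape(value or "") for key, value in row.items()}
        rendered.append(REVIEW_TEMPLATE.substitute(safe))
    return rendered
```

Calling render_reviews(SAMPLE_CSV) returns a list with one HTML string. jinja2 would add loops, conditionals, and autoescaping on top of this, which starts to matter once some columns are optional.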

Then I upgraded to Fedora 32, and nothing works because its Python is now python3.8, so I have to coerce pipenv to rebuild everything. Guessing what to do, I run pipenv check and it tells me “In order to get an API Key you need a monthly subscription on pyup.io, starting at $14.99”. Guess I won’t run that command then.

Writing JSON-LD

JSON (JavaScript Object Notation) is a simple file and data format to represent data. JSON-LD takes this and makes it slightly more complicated to represent Linked Data: on this Web page a person authored this review of a book which has its own author, another person(s). The details quickly degenerate into semantic triples, contexts, more three-letter acronyms like RDF, etc. Schema.org has fairly simple examples of JSON-LD for a review, but they leave it unclear if just writing "author": "skierpage" is enough for computers to figure out that the person writing the review is the person who runs this web site, or whether I have to go highly complicated:

"author": [
  {
    "@type": "Person",
    "name": "skierpage",
    "sameAs": "https://www.skierpage.com/people/skierpage/foaf.rdf"
  }
],

To have multiple book reviews on a web page, you can put them in a top-level “graph” object. What’s unclear is if the page should have a graph of books, each with a single review, or a graph of reviews, each of a single itemReviewed that’s a book.

{
	"@context": "http://schema.org/",
	"@graph": [{
		"@type": "Review",
		"author": {
			"@type": "Person",
			"name": "skierpage",
			"sameAs": "https://www.skierpage.com/people/skierpage/foaf.rdf"
		},
		"datePublished": "2011-04-01",
		"reviewBody": "The book has a nice cover.",
		"itemReviewed": {
			"@type": "Book",
			"name": "River of Gods",
			"isbn": "03-5091234-0344",
			"author": "Ian McDonald"
		},
		"reviewRating": {
			"@type": "Rating",
			"ratingValue": 4,
			"worstRating": 1,
			"bestRating": 5
		}
	},
	{
		... another review
	}]
}

Google’s validator doesn’t like the above: it complains the review is missing a description, publisher, and url. Isn’t this all obvious from the web page?

Maybe I don’t need author: https://schema.org/Review says “Please note that author is special in that HTML 5 provides a special mechanism for indicating authorship via the rel tag. That is equivalent to this and may be used interchangeably.”

You can dump a Python object with just json.dumps(bookReview), or there is a fancy pyld Python module that outputs JSON-LD. So I could have stock Python dictionaries with some of the unchanging stuff (me, worst/bestRating, etc.) to which I add review-specific info, or I could try and build the LinkedData structures and feed them into pyld.
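
The stock-dictionary route might look something like the sketch below, using only json.dumps and not pyld. The helper names are hypothetical, the property names are the ones from the JSON-LD example above, and it picks the “graph of reviews” shape without claiming that is the one crawlers prefer. It also wraps the result in the script element that JSON-LD normally lives in.

```python
import json

# Unchanging pieces shared by every review (values from the examples above).
REVIEWER = {
    "@type": "Person",
    "name": "skierpage",
    "sameAs": "https://www.skierpage.com/people/skierpage/foaf.rdf",
}
RATING_SCALE = {"@type": "Rating", "worstRating": 1, "bestRating": 5}


def make_review(title, book_author, isbn, rating, body, date_published):
    """Combine the stock dictionaries with the review-specific info."""
    return {
        "@type": "Review",
        "author": REVIEWER,
        "datePublished": date_published,
        "reviewBody": body,
        "itemReviewed": {
            "@type": "Book",
            "name": title,
            "isbn": isbn,
            "author": book_author,
        },
        "reviewRating": {**RATING_SCALE, "ratingValue": rating},
    }


def to_json_ld_script(reviews):
    """Serialize reviews as one @graph inside a JSON-LD script element."""
    graph = {"@context": "http://schema.org/", "@graph": list(reviews)}
    payload = json.dumps(graph, indent=2, ensure_ascii=False)
    # "\/" is a legal JSON escape; this keeps a "</script>" inside review
    # text from closing the element early.
    payload = payload.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"
```

Each spreadsheet row would become one make_review(...) call, and the whole list goes through to_json_ld_script once per page. This says nothing about whether Google’s validator will be any happier with the result.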

Summary: still working on this.
