Sunday, June 11, 2006

web: knowledge and semantics

The things that pass for knowledge, I can't understand
(Steely Dan, "Reelin' in the Years")

I'm working a little bit with the Semantic MediaWiki project. It's already useful, and I hope Wikipedia itself picks it up so you can query for [[produced by::Thomas Dolby]] and get a list of records. You can see Semantic MediaWiki in action on its home wiki. Try its San Diego page: look at the summary "factbox" at the bottom of the page and click the magnifying glass icon next to relations and attributes.
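To make the [[relation::target]] notation concrete, here is a minimal Python sketch that pulls such pairs out of a string of wiki text. The regular expression, the function name and the sample sentence are all my own assumptions for illustration; this is not how Semantic MediaWiki itself parses pages.

```python
import re

# Hypothetical pattern for the [[relation::target]] form quoted above.
# It is an illustration only, not Semantic MediaWiki's own parser.
ANNOTATION = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)\]\]")


def extract_relations(wikitext):
    """Return (relation, target) pairs found in a string of wiki text."""
    return [(relation.strip(), target.strip())
            for relation, target in ANNOTATION.findall(wikitext)]


# Made-up sample sentence; an ordinary [[link]] without "::" is ignored.
sample = "This record was [[produced by::Thomas Dolby]], see also [[San Diego]]."
print(extract_relations(sample))
```

A plain wiki link carries no relation name, which is why only the annotated link comes back as a pair.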

There's information out there. People can't possibly digest it all so they need help finding it. Meanwhile machines can digest it all, but don't understand it. I hope that they can meet in the middle, and Semantic MediaWiki is one of the best bridges.

I've worked on a string of so-called "Knowledge Bases", a term I've never liked.
  • First, capturing support interactions and follow-ups in Lotus Notes, and providing exports of them to customers.
  • Then, writing simple Perl scripts to capture keywords like product and version, output them in HTML meta tags, and teach Verity and Atomz search to search on these.
  • Last, a Kanisa knowledge base that actually got fairly smart with topic maps and nested categorizations.
All of these fall down at the point where a human has to input something useful in a field like Keywords. Many people can't give a good title, let alone a synopsis, even though they're subject experts.

From the machine-understanding end, one of the most inspiring ideas I've come across is the Cyc project, Doug Lenat's 21-year-old effort to teach a computer enough common-sense facts that it can comprehend a newspaper story. Cyc predates the World-Wide Web, and lots of others and I figured that, combined with a Web crawler, it could read everything online in three weeks and then come up with amazing insights. Alas, Cyc just isn't happening; "common sense" is as elusive and meaningless as everything else in hard AI.

Semantic MediaWiki is poised to bring those two ends closer. The wiki effect solves the bad input problem, because interested strangers will make editing passes to improve the semantics, just like other editing passes (it's impressive to see sets of Wikipedia pages get consistent and acquire more elaborate infoboxes and categorization over time).

The querying and exploration made possible by semantic relations help people find relevant articles and information. Imagine trying to find cities in Europe with a population over one million. Even though Wikipedia has articles with that data, the best you can do is hope for a category of [Large European Cities] and then read each article in turn, searching for the word "population" and reading the figure that comes after it. Or imagine trying to find every current president. Semantic MediaWiki identifies the facts in articles and the relations between them, so you're not guessing at a good set of search terms and then hoping the Google snippets have the answer.
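As a rough sketch of why structured facts beat keyword search for the cities question, here is a small Python example. The dictionary layout, the function name and the population figures are my own hand-typed assumptions (the numbers are approximate and only for illustration); Semantic MediaWiki's real storage and query language look different.

```python
# Hand-typed sample facts; the figures are rough and only for illustration.
# Storing "located in": "Europe" directly on a city is a simplification.
pages = {
    "Berlin": {"located in": "Europe", "population": 3_400_000},
    "Paris": {"located in": "Europe", "population": 2_100_000},
    "Bruges": {"located in": "Europe", "population": 117_000},
    "San Diego": {"located in": "North America", "population": 1_200_000},
}


def cities_over(pages, region, minimum):
    """Return the names of pages located in `region` with population above `minimum`."""
    return sorted(
        name
        for name, facts in pages.items()
        if facts.get("located in") == region
        and facts.get("population", 0) > minimum
    )


print(cities_over(pages, "Europe", 1_000_000))
```

The point is that the question becomes a filter over named facts, with no need to scan article text for the word "population".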

The relations between articles can be exported as RDF, which, after a lot of big-word effort involving ontologies and predicate vocabularies and OWL, is amenable to simple "reasoning", such as: "Berlin is located in Germany" and "Germany is located in Europe" together imply "Berlin is located in Europe". That's still a long way from machine understanding, but I'm not sure anyone even knows what it would mean for a machine to understand a Web page.
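The Berlin example is a transitive inference, and it can be sketched in a few lines of Python. This is a hand-rolled loop over (place, container) pairs that I made up for illustration, under the assumption that "located in" is treated as transitive; it is not an OWL reasoner and not what an RDF toolkit would actually run.

```python
def located_in_closure(facts):
    """Expand direct (place, container) pairs with every pair implied by transitivity."""
    inferred = set(facts)
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for a, b in list(inferred):
            for c, d in list(inferred):
                if b == c and (a, d) not in inferred:
                    inferred.add((a, d))
                    changed = True
    return inferred


direct = {("Berlin", "Germany"), ("Germany", "Europe")}
print(("Berlin", "Europe") in located_in_closure(direct))
```

The direct facts never say that Berlin is in Europe; the loop derives it by chaining the two pairs that share "Germany".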

Of course, any attempt to tell machines the subject of a Web page is immediately subject to abuse by fake and parasite Web sites, as I've noted. I remember when you could clearly identify the subject of a Web page by adding it to the dmoz Open Directory Project and putting KEYWORDS and DESCRIPTION META tags in the HTML head section. Now there are far richer ways to express information in Web pages, like microformats and RSS summaries, but search engines like Google seem to deliberately ignore them and stick to brute-force indexing. Again, the wiki editing effect will keep such abuse in check. Already I usually skip searching Google in favor of guessing Wikipedia article names (I have a Firefox bookmark for http://en.wikipedia.org/wiki/%s with the keyword 'w', so I just type "w Cyc" in the address bar).
