The Read the Web team at CMU has come up with NELL, a computer program that learns by reading web pages, extracting facts and improving its reading ability as it does so.
As I blogged a few years ago, we’ve been here before, starting with the Cyc project. After 26 years, that’s struggled to get to 170,000 facts. The NELL team is proud to have reached 440,000 beliefs, but I suspect the whole project will get sludgy and confused as it tries to turn the richness of language into a set of cut-and-dried belief statements. As I commented elsewhere, the more meanings that you patiently explain to these systems, the less they know. Very quickly they learn “New York” is a “city”, which is a geographical and governmental entity, but what the hell do they do with the sentence “I’m in a New York state of mind”?! The context for meaning is vast; it takes years of hard living in the real world to gain meaning from many sentences about the world. (Thus the AI researchers trying to raise a robot like a baby.)
Also, many of the things that NELL has figured out are already well-explained and codified in information sources on the web, such as the mighty Wikipedia. The New York Times article about the project gives this example:
Peyton Manning is a football player (category). The Indianapolis Colts is a football team (category). By scanning text patterns, NELL can infer with a high probability that Peyton Manning plays for the Indianapolis Colts.
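To make the mechanism concrete, here’s a toy sketch of that kind of pattern-based extraction. This is my illustration, not NELL’s actual architecture (the real system couples many learners with category and relation constraints); the patterns and sentences are made up:

```python
import re

# Seed textual patterns for a playsFor(athlete, team) relation.
# Real systems bootstrap: confident extractions suggest new patterns
# and vice versa. Here the patterns are fixed, for illustration only.
PATTERNS = [
    re.compile(r"(?P<athlete>[A-Z][\w. ]+?), quarterback (?:of|for) the (?P<team>[A-Z][\w ]+)"),
    re.compile(r"(?P<athlete>[A-Z][\w. ]+?) plays for the (?P<team>[A-Z][\w ]+)"),
]

def extract_plays_for(sentences):
    """Count pattern matches; more matching sentences = higher confidence."""
    counts = {}
    for s in sentences:
        for pat in PATTERNS:
            m = pat.search(s)
            if m:
                key = (m.group("athlete").strip(), m.group("team").strip())
                counts[key] = counts.get(key, 0) + 1
    return counts

sentences = [
    "Peyton Manning plays for the Indianapolis Colts.",
    "Peyton Manning, quarterback for the Indianapolis Colts, threw for 300 yards.",
]
print(extract_plays_for(sentences))
# {('Peyton Manning', 'Indianapolis Colts'): 2}
```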
But just go to Wikipedia, and in the source of Peyton Manning’s page you see [[Category:American football quarterbacks]] and, in the infobox template, {{Infobox NFLactive ... |currentteam=Indianapolis Colts}}! Almost every fact that’s simple enough for NELL to learn and notable enough to matter is already codified on Wikipedia! Already the DBpedia project takes such semi-structured data from Wikipedia pages and maps it to computer-readable semantic statements using standard vocabularies like rdf:Description, skos:subject, dbprop:currentteam, etc. DBpedia has millions of such bits of information about the millions of things that have Wikipedia pages. Does it know more than NELL’s 440,000 facts? Well, what does it mean to “know” something anyway? What does it mean to mean something? What is what? Huh?
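Philosophy aside, DBpedia’s version of “knowing” is at least queryable. Here’s a minimal sketch against its public SPARQL endpoint; the endpoint is real, but treat the exact property URI as an assumption, since DBpedia’s vocabularies have shifted across releases:

```python
import json
import urllib.parse
import urllib.request

# Ask DBpedia for Peyton Manning's team. The property mirrors the
# Wikipedia infobox field currentteam; the exact URI is my guess.
query = """
SELECT ?team WHERE {
  <http://dbpedia.org/resource/Peyton_Manning>
      <http://dbpedia.org/property/currentteam> ?team .
}
"""
url = "http://dbpedia.org/sparql?" + urllib.parse.urlencode(
    {"query": query, "format": "application/sparql-results+json"})
with urllib.request.urlopen(url) as resp:
    results = json.load(resp)
for row in results["results"]["bindings"]:
    print(row["team"]["value"])
```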
It’s cool that NELL has amassed so many beliefs by reading, but that’s dwarfed by the millions of machine-readable “facts” already out there. NELL knows enough to get confused and require human correction, but that’s a weak kind of intelligence. If a NELL or DBpedia can’t do original research or come up with insights, then is either system better than Googling “What is Peyton Manning’s football team?” and scanning the results for an answer?
When Doug Lenat started Cyc in 1984 it was all about reading a “newspaper”, a quaint set of articles printed on a dead tree. Now it’s all social and webified. You can read NELL’s tweet stream and see it get things right, and wrong:
I think “Steel Mobile Phone” is a #buildingmaterial (http://bit.ly/b4V0HU)
I think “dutchtown high school” is a #sportsteam (http://bit.ly/bCh4YY)
I think “doubletree hotel and waltham” is a #hotel (http://bit.ly/aanr6j)
Scanning NELL’s apparent knowledge representation, two things stand out:
- The team’s scrunchedtogethernames willfully ignore Wikipedia’s friendlier naming approach, e.g. Building_material.
- I wonder how NELL will handle naming conflicts. Again, Wikipedia has a fine approach: Steel (band), Steel (comics), Steel (film), etc. NELL needs to learn to disambiguate its names by matching them to Wikipedia articles (see the sketch after this list); otherwise it’s going to wind up terribly confused by all the Michael Jacksons out there.
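A plausible first step (my sketch, not anything NELL actually does): run a candidate name through Wikipedia’s real opensearch API and see how many parenthesized variants come back. Lots of variants means the bare name is ambiguous and needs a qualifier:

```python
import json
import urllib.parse
import urllib.request

def wikipedia_candidates(name, limit=10):
    """Return Wikipedia article titles matching a name, e.g. 'Steel (band)'."""
    url = "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(
        {"action": "opensearch", "search": name,
         "limit": limit, "format": "json"})
    req = urllib.request.Request(url, headers={"User-Agent": "nell-demo/0.1"})
    with urllib.request.urlopen(req) as resp:
        # opensearch returns [query, titles, descriptions, urls]
        _, titles, _, _ = json.load(resp)
    return titles

print(wikipedia_candidates("Steel"))
# Likely includes 'Steel', 'Steel (band)', 'Steel (comics)', ...
```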