music/web: Sundays spent disambiguating

The Google Now screen on my phone does a good job presenting news relevant to me. It struck gold when it displayed “new album out now” by The Sundays, zOMG!! After 22 years, out of nowhere they deliver a new album around Harriet Wheeler’s astounding voice and David Gavurin’s chiming guitar work!

Oh noes, it’s actually an unrelated Japanese band.

SUNDAYS in 2016
These fine folk are SUNDAYS(サンデイズ)
press pic of The Sundays
but I don’t think they’re The Sundays

2nd example: Google Play Music, YouTube, and Genius show Korean songs by 코듀로이 (“corduroy”) as by the English acid jazz group Corduroy, of whom I’m a big fan. Yes, her name translates as “corduroy,” but she’s a different artist!

3rd example: GPM found a sweet cover of Burt Bacharach’s light adult pop song “Knowing When to Leave,” made in 1998 by Casino, which Google and English Wikipedia agree is a rock/alternative band from Birmingham. Crazy genre-defying work? I finally figured out that the British Casino didn’t even form until 2003 and the song is actually by an obscure Icelandic band also called “Casino” together with Páll Óskar Hjálmtýsson. It’s part of an entire album of sincere/camp/tongue-in-cheek recordings of late 1960s/early 1970s hip music that is not just in stereo, it’s called “Stereo.”

The band “Casino” but not THE band “Casino”

4th example, then I’ll stop: Google Now then alerted me to the new album by progressive rock masters Yes, named “Chet.” Well, my hero Steve Howe is a fan of Chet Atkins, so it’s possible…


Nope, it’s obviously a different band. Come on, punctuation matters! Just because the band name has a comma in it is no excuse to get it wrong. I’m going to release music by “The Bea[Unicode ZERO-WIDTH NO-BREAK SPACE]tles” to see how many people I can scam 😉

Spotify is also confused about who made this:

search results for 'Yes Plis Chet'...
Comma? Ampersand? Confusion!

And though Amazon seems to know it’s by “Yes & Plis,” if you ask for more about the band you can tell Amazon is commingling the songs, like a shelf of widgets in its warehouse that are ostensibly sold by different companies:

Amazon Unlimited when you click the band name for 'Chet'...
All the classic rock plus one sore thumb

Get a Q!

I believe Google Play Music, YouTube, and these other services rely on what the music labels provide, and/or then just do a string search. But this doesn’t work when band names are translated into English, or have weird punctuation, or the band name contains another group’s name, or the band lazily/intentionally reuses an existing band name, …
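To make the failure concrete, here is a minimal sketch of the kind of name matching I'm guessing at. It is not any service's actual code: `naive_key`, `naive_match`, and the catalog IDs are all made up for illustration. It lowercases names, throws away punctuation and a leading "the," and treats a query as a hit when it matches the start of an artist name.

```python
import unicodedata

# Hypothetical catalog: the IDs are invented, the names come from this post.
CATALOG = [
    ("artist-1", "The Sundays"),
    ("artist-2", "SUNDAYS"),
    ("artist-3", "Yes"),
    ("artist-4", "Yes, Plis"),
    ("artist-5", "Yes & Plis"),
]


def naive_key(name):
    """Fold case, replace punctuation with spaces, drop a leading 'the'."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    kept = "".join(ch if ch.isalnum() else " " for ch in folded)
    words = kept.split()
    if words and words[0] == "the":
        words = words[1:]
    return " ".join(words)


def naive_match(query, catalog):
    """Return every catalog entry whose normalized name starts with the query's words."""
    wanted = naive_key(query).split()
    hits = []
    for artist_id, name in catalog:
        words = naive_key(name).split()
        if wanted and words[: len(wanted)] == wanted:
            hits.append((artist_id, name))
    return hits


if __name__ == "__main__":
    print(naive_match("The Sundays", CATALOG))
    print(naive_match("Yes", CATALOG))
```

With this matcher a search for "The Sundays" also returns SUNDAYS, and a search for "Yes" also returns both spellings of the Plis duo. Nothing in a bare string says which artist is meant.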

If only there were a vendor-neutral way to identify and disambiguate entities in the world. Of course there is: Wikidata! “The Sundays” are the entity Q3122789 in Wikidata that is an instance of a band, and then some person or bot added another entity Q17231144 that is also an instance of a band, also labeled “The Sundays” (until I edited it, see below). Same name, two different things.
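For anyone who wants to poke at those two entities, here is a rough sketch that asks Wikidata's public API (the `wbgetentities` action) for their English labels and descriptions. The helper names `entities_url`, `summarize`, and `fetch` are mine, the User-Agent string is a placeholder, and what the live site returns today may differ from what I describe in this post.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://www.wikidata.org/w/api.php"


def entities_url(qids, lang="en"):
    """Build a wbgetentities URL for the given Q numbers."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(qids),
        "props": "labels|descriptions",
        "languages": lang,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def summarize(response, lang="en"):
    """Reduce a wbgetentities response to {qid: (label, description)}."""
    summary = {}
    for qid, entity in response.get("entities", {}).items():
        label = entity.get("labels", {}).get(lang, {}).get("value", "")
        desc = entity.get("descriptions", {}).get(lang, {}).get("value", "")
        summary[qid] = (label, desc)
    return summary


def fetch(qids, lang="en"):
    """Fetch and summarize entities; this part needs network access."""
    request = Request(
        entities_url(qids, lang),
        headers={"User-Agent": "qid-demo/0.1 (placeholder contact)"},
    )
    with urlopen(request, timeout=10) as resp:
        return summarize(json.load(resp), lang)


if __name__ == "__main__":
    for qid, (label, desc) in fetch(["Q3122789", "Q17231144"]).items():
        print(qid, label, "-", desc)
```

The point is that the key is the Q number, not the label: two entities can carry an identical name and still come back as separate records.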

So these bands can be distinguished, but actually doing it is a hard problem as long as humans are in the loop: I don’t see Japanese press release writers writing “FOR IMMEDIATE RELEASE: The Sundays (Q17231144) release new album!” so that Google can disambiguate, nor will the people who translate that press release into English (whence I assume Google got excited on my behalf) add a note “not the Q3122789 English band.” 🙂 Moreover, as I’ve written before, I’m convinced Google doesn’t actually want a semantic web where web pages tell computers what they mean; it wants a messy confused bunch of pages so that it can apply massive AI to this kind of disambiguation, so that only Google can provide good context-specific answers to questions like “What’s the last album from the Sundays?” (Also, the moment you make it easier for pages to say what they’re about, immediately a bunch of boner and diet pill pages will semantically identify themselves as “Latest news about Kardashian family” or whatever is a popular search term.) But then it’s frustrating to see Google itself get it wrong.

What’s also frustrating is that others fixed the Genius lyrics site to distinguish “Corduroy”, “Corduroy. (band)” [note the period, sic(k)!], and “Corduroy (Korea)”, but the same cleanup has to be repeated on every data-driven web site. Q numbers to rule them all!

Cleaning up Wikidata

As with Wikipedia, anyone can edit Wikidata. The Japanese Wikipedia article that seems to have generated the duplicate “The Sundays” entity in Wikidata is actually titled SUNDAYS, so I changed the English label of Q17231144 to “SUNDAYS” and added the English description “Japanese rock band”; I also added the English description “1990s English alternative rock band” to Q3122789 to help avoid further errors.

What’s odd is the Japanese band’s Wikidata page includes a bunch of identifiers for the English band in other online databases: the VIAF identifier 126826622, the Bibliothèque nationale de France identifier 13926837j, the International Standard Name Identifier (ISNI) 0000 0001 1087 4877, the Library of Congress authority ID n91122952, etc. All of the data in these other databases seems to apply to the English band, but all were missing from the English band’s Wikidata page. I suspect some automated bot found the Japanese “The Sundays,” incorrectly linked it to the VIAF identifier for the English “The Sundays,” and that in turn prompted other bots to add all those other identifiers to the wrong band. It seems poor design that an entity that obviously conflicts with another “band named ‘The Sundays’” entity gets all these automated identifiers for the other thing added to it.
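A guard against this could be quite small. The sketch below is only my idea of one, not how any Wikidata bot actually works: `label_collisions` and `safe_to_auto_link` are hypothetical helpers, and the dictionary is a hand-typed stand-in for the two entities as they were labeled before my edit.

```python
from collections import defaultdict

# Hand-typed stand-in for the two entities before I relabeled one of them.
BEFORE_EDIT = {
    "Q3122789": {"label": "The Sundays", "instance_of": "band"},
    "Q17231144": {"label": "The Sundays", "instance_of": "band"},
}


def label_collisions(entities):
    """Group Q numbers that share both a type and a case-folded label."""
    groups = defaultdict(list)
    for qid, info in entities.items():
        key = (info["instance_of"], info["label"].casefold())
        groups[key].append(qid)
    return {key: sorted(qids) for key, qids in groups.items() if len(qids) > 1}


def safe_to_auto_link(qid, entities):
    """False when the entity has a same-type namesake, so a human should check first."""
    return all(qid not in qids for qids in label_collisions(entities).values())


if __name__ == "__main__":
    print(label_collisions(BEFORE_EDIT))
    print(safe_to_auto_link("Q17231144", BEFORE_EDIT))
```

An exact label match is the bare minimum. A real check would presumably also compare aliases and labels in other languages, but even this much would have sent the VIAF link to a person instead of straight onto the wrong band.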

The Icelandic group Casino doesn’t seem to have a Wikidata page… meanwhile Wikidata already has two “Casino” instances of a band, the well-known 2000s Casino from English Wikipedia and another from Dutch Wikipedia that describes a one-off British band. As with the two “Sundays”, they have overlapping external identifiers; in fact, some bot mistakenly linked both of them to the Billboard artist page for an unrelated rapper who calls himself “Casino.” And on Google Play Music, the artist “Casino” identifies as the 2000s English “rock/alternative band,” but most of the songs and tracks are clearly by black rapper(s) who adopted the moniker “Ca$ino” without caring about existing European bands.
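Overlaps like that are easy to spot mechanically, since one external identifier should normally point at one entity. Here is a minimal sketch under that assumption; `shared_identifiers` is a made-up helper, and the entity keys and identifier values in the example are placeholders, not real Wikidata or Billboard IDs.

```python
from collections import defaultdict


def shared_identifiers(claims):
    """Return {(database, value): [entities]} for identifiers claimed by more than one entity."""
    owners = defaultdict(set)
    for entity, identifiers in claims.items():
        for database, value in identifiers.items():
            owners[(database, value)].add(entity)
    return {key: sorted(ents) for key, ents in owners.items() if len(ents) > 1}


if __name__ == "__main__":
    # Placeholder data: two same-named bands linked to one artist page.
    claims = {
        "casino-from-enwiki": {"billboard_artist": "PLACEHOLDER-RAPPER-PAGE"},
        "casino-from-nlwiki": {"billboard_artist": "PLACEHOLDER-RAPPER-PAGE"},
    }
    for (database, value), entities in shared_identifiers(claims).items():
        print(database, value, "is claimed by", entities)
```

A report like this can't say which entity deserves the identifier (here, neither does), but it does give a human a short list to look at.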

