music: how to contribute scanned lyrics to the web

Lyrics are everywhere on the web, yet I regularly come across popular songs whose lyrics are nowhere to be found. Sometimes I have a CD or LP on the shelf with the missing lyrics printed in it! Time to make a little more of the sum of human knowledge available… Here are my notes on the process.

Where to contribute lyrics?

Many sites let users contribute and update lyrics. Ideally there would be a non-commercial, user-supported über-repository of lyrics, but if there is one I can’t find it. All lyrics sites seem to be ad-supported (I don’t see the ads because I use the uBlock Origin ad blocker). The worst are the sites that optimize their pages to game Google search so they show up high in search results for e.g. “Linx You’re Lying lyrics,” but when you visit them the page’s only content is “Be the first to contribute the missing lyrics of You’re Lying by Linx!” Kthxbye.

LyricWiki? (No)

The obvious contender is lyrics.wikia.com. It uses the same underlying MediaWiki software as Wikipedia, but it’s on the ad-supported Wikia platform that Wikipedia founder Jimmy Wales co-founded. It has a genuine community trying to do a good job. I made many cleanup edits and added a few songs in 2008-2011. The problem with LyricWiki is that nothing is created for you: you have to build page after page by hand in wikitext. So (using the example of adding the lyrics of Max Tundra’s Mastered by Guy at The Exchange album): first you have to add a bit of fiddly markup to the band’s page for the album:

==[[Max Tundra:Mastered By Guy At The Exchange (2002)|Mastered by Guy at The Exchange (2002)]]==
 {{Album Art|Max Tundra - Mastered By Guy at the Exchange.jpg|Mastered by Guy at The Exchange}}
# '''[[Max Tundra:Merman|Merman]]'''
# '''[[Max Tundra:Mbgate|Mbgate]]'''
...

then you have to create the album’s page with more fiddly markup listing each song all over again:

{{AlbumHeader
 |artist    = Max Tundra
 |album     = Mastered by Guy at The Exchange
 |genre     = Electronic
 ...
 }}
# '''[[Max Tundra:Merman|Merman]]'''
# '''[[Max Tundra:Mbgate|Mbgate]]'''
...

then you have to create a page for each song that points back to the album with even more fiddly markup, and then you provide the actual value, the lyrics themselves:

{{SongHeader
 |song     = Merman
 |artist   = Max Tundra
 |album1   = Max Tundra:Mastered By Guy At The Exchange (2002)
 |language = English
 |star     = Bronze
 }}
<lyrics> 
I'm feeling flirty
Must be you heard me
My knee is hurty
...

Even if you’re fluent in MediaWiki markup and templates, it is pointless error-prone duplication to keep repeating the artist, album, and track name on every page. Instead, adding a lyric should be a single database action that automatically adds the song to the artist’s page and the album’s page.
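
To make the duplication concrete, here is a minimal sketch (not anything LyricWiki provides) that generates all three kinds of page text from one record. The `Album` class and the function names are hypothetical. It emits only the template fields shown in the excerpts above, skips the Album Art line, and the closing `</lyrics>` tag is an assumption because the excerpt above is cut off before it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Album:
    artist: str
    page_title: str     # wiki page name, e.g. "Mastered By Guy At The Exchange (2002)"
    display_title: str  # title as shown to readers
    year: int
    genre: str
    tracks: List[str]


def track_list(album: Album) -> str:
    # The same numbered list appears on both the artist page and the album page.
    return "\n".join(f"# '''[[{album.artist}:{t}|{t}]]'''" for t in album.tracks)


def artist_page_section(album: Album) -> str:
    heading = (
        f"==[[{album.artist}:{album.page_title}|"
        f"{album.display_title} ({album.year})]]=="
    )
    return heading + "\n" + track_list(album)


def album_page(album: Album) -> str:
    header = "\n".join([
        "{{AlbumHeader",
        f" |artist    = {album.artist}",
        f" |album     = {album.display_title}",
        f" |genre     = {album.genre}",
        " }}",
    ])
    return header + "\n" + track_list(album)


def song_page(album: Album, song: str, lyrics: str,
              language: str = "English", star: str = "Bronze") -> str:
    # Default language and star values are copied from the example in the post.
    return "\n".join([
        "{{SongHeader",
        f" |song     = {song}",
        f" |artist   = {album.artist}",
        f" |album1   = {album.artist}:{album.page_title}",
        f" |language = {language}",
        f" |star     = {star}",
        " }}",
        "<lyrics>",
        lyrics,
        "</lyrics>",  # assumed closing tag, not shown in the excerpt
    ])


album = Album(
    artist="Max Tundra",
    page_title="Mastered By Guy At The Exchange (2002)",
    display_title="Mastered by Guy at The Exchange",
    year=2002,
    genre="Electronic",
    tracks=["Merman", "Mbgate"],
)
print(artist_page_section(album))
```

The artist, album, and track names are typed once in the `Album` record, and every page is derived from it, which is roughly what a "single database action" would do on the server side.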

So Genius!

Genius came out of annotating rap lyrics. It has a nice interface for adding song lyrics to albums, a solid community, and lets people comment on songs and individual lines. So I went there.

Scanning and converting to text

On my all-in-one printer I scanned the record sleeves and CD booklets with the lyrics at high resolution and saved them as PDFs. Then I used gImageReader-qt5 for Linux to do optical character recognition. This works impressively well! It handled blue-on-pink text, and it automatically identifies each block of text. You then delete the blocks you don’t want recognized, such as image captions and “Thanks to Kev and Fender guitars,” trigger OCR, and it gives you a big chunk of recognized text.

Case conversion

Some lyrics that I scanned were printed entirely in UPPER CASE. There are many ways to convert case, but the wrinkle is that I want the first word of each line to remain capitalized; a bit of smarts about proper names, the word “I”, and such would also be nice. I found that the web page https://convertcase.net does the right thing in its Sentence case mode; it saved me from hacking together my own tool. The other nice thing about web-based converters is that the textarea with the converted text is in the browser, and Firefox highlights many misspellings due to mis-recognition, such as “allbi” instead of “alibi.”
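
For the curious, the tool I didn’t have to write might look something like the sketch below. This is my own guess at the behavior, not how convertcase.net works: it lowercases each line, restores the pronoun “I”, restores any proper names you pass in (the `proper_names` list is a hypothetical stand-in for real “smarts”), and capitalizes the first letter of every line.

```python
import re


def capitalize_first_letter(text):
    # Upper-case the first alphabetic character, skipping leading punctuation.
    for i, ch in enumerate(text):
        if ch.isalpha():
            return text[:i] + ch.upper() + text[i + 1:]
    return text


def sentence_case_line(line, proper_names=()):
    result = line.lower()
    # Restore the pronoun "I", including contractions such as "I'm" and "I'll".
    # The lookbehind avoids touching an "i" that follows a letter or apostrophe.
    result = re.sub(r"(?<![\w'’])i\b", "I", result)
    # Restore caller-supplied proper names, matched case-insensitively.
    for name in proper_names:
        result = re.sub(rf"\b{re.escape(name)}\b", name, result, flags=re.IGNORECASE)
    return capitalize_first_letter(result)


def sentence_case(text, proper_names=()):
    # Each lyric line is treated separately so every line starts with a capital.
    return "\n".join(sentence_case_line(l, proper_names) for l in text.splitlines())


print(sentence_case("I'M FEELING FLIRTY\nMUST BE YOU HEARD ME"))
```

It won’t catch mis-recognized words, so the converted text still needs a pass through a spell checker.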

De-Unicode-ization

Genius wants simple ASCII for lyrics: straight quotation marks, hyphens not em-dashes, no ligatures like “ﬁ” (a single character), etc. Unfortunately gImageReader doesn’t have an option to only output simple ASCII. To find the problematic characters, I used this command line to search for any character that isn’t ASCII:

rg '\P{ascii}' *lyrics.txt

(rg is ripgrep, a better text search program than the venerable grep.)
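
Finding the characters is only half the job; they still have to be replaced. Here is a rough sketch of doing both in Python. The replacement table is my own assumption about what counts as “simple ASCII”, not a Genius specification, and the function names are made up. Anything it can’t map (accented letters, for example) is left alone so that the finder can flag it for a manual decision.

```python
import unicodedata

# Assumed mapping from typographic characters to plain ASCII equivalents.
REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
}


def find_non_ascii(text):
    """Yield (line_number, column, character) for each non-ASCII character."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                yield line_no, col, ch


def to_ascii(text):
    text = text.translate(str.maketrans(REPLACEMENTS))
    # NFKC expands compatibility characters such as the U+FB01 ligature
    # into plain letters while keeping accented letters intact.
    return unicodedata.normalize("NFKC", text)


cleaned = to_ascii("\u201calibi\u201d \u2014 \ufb01ne\u2026")
print(cleaned)
print(list(find_non_ascii(cleaned)))
```

Running `find_non_ascii` on the cleaned text plays the same role as re-running the rg command: whatever it still reports needs a human decision.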

My IQ goes up, the kudos roll in

To keep us unpaid suckers working, Genius has gamified (horrible word) contributions in the form of “IQ points.” When you add a wanted song, you get points. When you identify the song parts (verse, chorus, bridge, etc.) you get more points. More points give you more rights – I can now add a new song and edit a track list, but I still can’t add an entire new album or state that Peter Martin is commonly known as Sketch.

One of the problems I had with the lyrics for the band Corduroy is that Genius already listed other songs by “Corduroy” that are actually by a Korean singer, 코듀로이 (which translates to Corduroy), and by a wannabe band that reused the name. Renaming artists is very tricky and way above my IQ level, but the forum participants are very helpful. “I have to say I am really impressed with the research you have done here. I will disambiguate the artists to fix this.” Awww.

(Elsewhere I blogged about the semantic confusion of translated band names matching other bands, names containing other bands, and straight up multiple bands with the same name.)

This entry was posted in music, web.

2 Responses to music: how to contribute scanned lyrics to the web

  1. Note that ripgrep does not respect your locale settings, so things like `LC_ALL=C` have no impact on ripgrep’s behavior.

    Also, an equivalent but perhaps more declarative way to do this is `rg '\P{ascii}'` or even `rg '[^\p{ascii}]'`.

    • skierpage says:

      BurntSushi in da house! Thanks for ripgrep. That’ll teach me not to monkeypatch instructions I find on the web without understanding them fully. The two produce the same result: There’s More Than One Way to Negate.
