music: how to contribute scanned lyrics to the web

Lyrics are everywhere on the web, yet I regularly come across popular songs whose lyrics are nowhere to be found. Sometimes I have an LP on the shelf with the missing lyrics printed in it! Time to make a little more of the sum of human knowledge available… Here are my notes on the process.

Where to contribute lyrics

Many sites let users contribute and update lyrics. Ideally there would be a non-commercial, user-supported über-repository of lyrics, but if there is one I can’t find it. All lyrics sites seem to be ad-supported (I don’t see the ads because I use the uBlock Origin ad-blocker). The worst are the sites that optimize their pages to fake out Google search so they show up high in search results for e.g. “Linx You’re Lying lyrics,” but when you visit them the page content is just “Be the first to contribute the missing lyrics of You’re Lying by Linx!” Kthxbye.

LyricWiki? (No)

The obvious contender is lyrics.wikia.com. It uses the same underlying MediaWiki software as Wikipedia, but it’s on the ad-supported Wikia platform. It has a genuine community trying to do a good job. I made many cleanup edits and added a few songs from 2008 to 2011. The problem with LyricWiki is that you really are sticking wiki text on pages, so you have to add a bit of fiddly markup to the band’s page for the album:

==[[Max Tundra:Mastered By Guy At The Exchange (2002)|Mastered by Guy at The Exchange (2002)]]==
{{Album Art|Max Tundra - Mastered By Guy at the Exchange.jpg|Mastered by Guy at The Exchange}}
# '''[[Max Tundra:Merman|Merman]]'''
# '''[[Max Tundra:Mbgate|Mbgate]]'''

then you have to create the album’s page with more fiddly markup listing each song:

{{AlbumHeader
|artist = Max Tundra
|album = Mastered by Guy at The Exchange
|genre = Electronic
|length = 37:28
|cover = Max Tundra - Mastered By Guy at the Exchange.jpg
|wikipedia = Mastered by Guy at The Exchange
|star = Bronze
}}
# '''[[Max Tundra:Merman|Merman]]'''
# '''[[Max Tundra:Mbgate|Mbgate]]'''
...

then you have to create a page for each song with even more fiddly markup and the actual lyrics:

{{SongHeader
|song = Merman
|artist = Max Tundra
|album1 = Max Tundra:Mastered By Guy At The Exchange (2002)
|language = English
|star = Bronze
}}
<lyrics>
I'm feeling flirty
Must be you heard me
...

Even if you’re fluent in MediaWiki markup and templates, it’s pointless, error-prone duplication to keep repeating the artist, album, and track name in every place. Instead, adding a lyric should be a database action that automatically adds the song to the artist and album.
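As a rough sketch of that idea, here is a tiny in-memory model in which adding a song is one call and the artist and album listings fall out of it automatically. The class and method names (`LyricsDB`, `add_song`) are made up for illustration; this is not how LyricWiki or Genius actually store anything.

```python
from dataclasses import dataclass, field


@dataclass
class Album:
    title: str
    tracks: list[str] = field(default_factory=list)


@dataclass
class Artist:
    name: str
    albums: dict[str, Album] = field(default_factory=dict)


class LyricsDB:
    """Hypothetical in-memory store, not any real site's schema."""

    def __init__(self) -> None:
        self.artists: dict[str, Artist] = {}
        self.lyrics: dict[tuple[str, str], str] = {}

    def add_song(self, artist: str, album: str, song: str, text: str) -> None:
        # Create the artist and album records the first time they are seen.
        artist_rec = self.artists.setdefault(artist, Artist(artist))
        album_rec = artist_rec.albums.setdefault(album, Album(album))
        # The track list is derived from the insert, never typed by hand.
        if song not in album_rec.tracks:
            album_rec.tracks.append(song)
        self.lyrics[(artist, song)] = text


db = LyricsDB()
db.add_song("Max Tundra", "Mastered by Guy at The Exchange", "Merman",
            "I'm feeling flirty\nMust be you heard me\n...")
db.add_song("Max Tundra", "Mastered by Guy at The Exchange", "Mbgate", "...")
```

The artist, album, and song names are each typed once per call, and the album's track list can never drift out of sync with the song pages because it is built from them.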

So Genius!

Genius came out of annotating rap lyrics. It has a nice interface for adding songs, a solid community, and lets people comment on songs and individual lines. So I went there.

Scanning and converting to text

I scanned the sleeves with the lyrics at high resolution and saved them as PDFs. Then I used gImageReader-qt5 for Linux to do optical character recognition. This works impressively well! It handled blue-on-pink text, and it automatically identifies each block of text. You delete the blocks you don’t want recognized, such as image captions and “Thanks to Kev and Fender guitars”, then trigger OCR, and it gives you a big chunk of text.

Case conversion

Some lyrics that I scanned are entirely in UPPER CASE. There are many ways to convert case, but the wrinkle is that I want the start of each line to remain capitalized; a bit of smarts about proper names, the word “I”, and such would also be nice. I found that the web page https://convertcase.net does the right thing in its Sentence case mode, which saved me hacking my own tool. The other nice thing about web-based converters is that the textarea with the converted text is in the browser, and Firefox highlights many misspellings due to mis-recognition, such as “allbi” instead of “alibi.”
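For anyone who would rather script this step, the following is a minimal sketch of per-line sentence casing. It is not what convertcase.net does internally (that is unknown here); the `PROPER_NAMES` set and the function names are assumptions for illustration, and the name list has to be filled in by hand for each album.

```python
import re

# Hypothetical list of names to keep capitalized; extend it per album.
PROPER_NAMES = {"kev", "fender"}


def sentence_case_line(line: str) -> str:
    """Lower-case an ALL CAPS lyric line, then restore selected capitals."""
    lowered = line.lower()

    def fix_word(match: re.Match) -> str:
        word = match.group(0)
        # "i", "i'm", "i'll", "i've" and "i'd" get their capital I back.
        if word == "i" or word.startswith(("i'", "i\u2019")):
            return "I" + word[1:]
        if word in PROPER_NAMES:
            return word.capitalize()
        return word

    fixed = re.sub(r"[a-z]+(?:['\u2019][a-z]+)?", fix_word, lowered)
    # Capitalize the first letter of the line, skipping leading punctuation.
    return re.sub(r"^([^A-Za-z]*)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(), fixed)


def sentence_case(text: str) -> str:
    """Apply sentence_case_line to every line of a lyric sheet."""
    return "\n".join(sentence_case_line(line) for line in text.splitlines())
```

Calling `sentence_case("I'M FEELING FLIRTY\nMUST BE YOU HEARD ME")` returns the two lines as "I'm feeling flirty" and "Must be you heard me". The sketch only capitalizes the start of each line, so a second sentence on the same line stays lower-case, and it does no spell checking.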

De-Unicode-ization

Genius wants simple ASCII for lyrics: simple quotation marks, hyphens not em-dashes, no ligatures like ﬁ, etc. Unfortunately gImageReader doesn’t have an option to only output simple ASCII. To find the problematic characters, I used this command line to search for any character that isn’t ASCII:

LC_ALL=C rg '[^\x00-\x7f]' *lyrics.txt

(LC_ALL=C turns off any localization that might mess around with output, and rg is ripgrep, a better text search program than the venerable grep.)
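Once the offending characters are found, the replacement itself can also be scripted. The sketch below is one possible approach rather than the one used for this post: it maps the usual typographic characters to ASCII and relies on Unicode NFKC normalization to split ligatures. The names `ASCII_MAP`, `to_ascii`, and `non_ascii_chars` are invented for the example.

```python
import unicodedata

# Typographic characters that OCR commonly emits, mapped to plain ASCII.
ASCII_MAP = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
})


def to_ascii(text: str) -> str:
    # NFKC expands compatibility characters: the U+FB01 ligature becomes
    # "fi" and the U+2026 ellipsis becomes three periods.
    return unicodedata.normalize("NFKC", text).translate(ASCII_MAP)


def non_ascii_chars(text: str) -> list[tuple[int, str]]:
    """Return (line number, character) for anything still outside ASCII."""
    return [(n, ch)
            for n, line in enumerate(text.splitlines(), start=1)
            for ch in line if ord(ch) > 0x7F]
```

Anything `non_ascii_chars` still reports after `to_ascii` (an accented letter in a name, say) is left alone on purpose, so a human can decide whether it is an OCR error or legitimate text.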

My IQ goes up, the kudos roll in

To keep us unpaid suckers working, Genius has gamified (horrible word) contributions in the form of “IQ points.” When you add a wanted song, you get points. When you identify the song parts (verse, chorus, bridge, etc.) you get more points. More points give you more rights – I can add a new song and edit a track list, but I still can’t add an entire new album or set that Peter Martin is aka Sketch.

One of the problems I had with the lyrics for the band Corduroy is that Genius already listed other songs by “Corduroy” that are by a Korean singer 코듀로이 (which translates to Corduroy) and a wannabe band that reused the name. Renaming artists is very tricky and way above my IQ level, but the forum participants are very helpful. “I have to say I am really impressed with the research you have done here. I will disambiguate the artists to fix this.” Awww.
