In Search of the Perfect Music Dataset

In this article I review and compare the best freely available music datasets and APIs.

How did I get into this?

Recently I set out on a side project to find all the records that my favourite musicians had played on. It’s common for musicians to play on a record and not get artist credit. Often the only way to know who played on a record is to look at the cover or liner notes of the album. But of course, the physical product has long since disappeared from all but the most ardent and affluent collectors’ hands.

Since we all consume so much music digitally now, our view of who is playing the music we’re listening to has been reduced to this:

Spotify playing Jackie McLean - It’s time

Guess what? It’s Herbie Hancock on piano, Roy Haynes on drums and Cecil McBee on bass!

OK, it has always been hard to find out who played on a record. The Wrecking Crew were a group of session musicians who often got paid a day rate and received no release credit, yet were responsible for recording and sometimes writing numerous hit recordings ultimately credited to the Beach Boys, Simon and Garfunkel, Cher, The Mamas and The Papas, Frank Sinatra, etc.

I’m looking to find recordings that feature my favourite musicians, so I need data that includes who was playing on each record. For example: who played bass on Roy Ayers’ “Everybody Loves the Sunshine”? Who played guitar on Sonny Rollins’ “The Bridge”? I want to be able to find more of these people’s work, easily.

The Wrecking Crew at Gold Star Studios

I’m most interested in finding this credit meta-data for jazz artists, but if you consider classical music the plot thickens. Take, for example, a recording of Beethoven’s 9th Symphony: you have the composer (Beethoven), the orchestra (e.g. Berlin Philharmonic), the conductor (e.g. Herbert von Karajan) and the soloists (e.g. Anna Tomowa-Sintow, Agnes Baltsa, Peter Schreier and José van Dam). This all gets reduced to a single “Artist” field if you’re looking on Spotify. Then of course there are four movements, which tend to be split into tracks. Oh, and the movement names might be in German or English. Somebody really needs to improve the UX of searching for classical music.
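To make the problem concrete, here is a minimal sketch of the difference between role-based credits and a single flat “Artist” field. The Credit and Recording classes and their method names are my own illustration, not the schema of any of the databases reviewed here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Credit:
    name: str
    role: str  # e.g. "Composer", "Orchestra", "Conductor", "Soloist"


@dataclass
class Recording:
    title: str
    credits: List[Credit] = field(default_factory=list)

    def flat_artist(self) -> str:
        """Collapse every credit into one string, losing the roles."""
        return ", ".join(credit.name for credit in self.credits)

    def by_role(self, role: str) -> List[str]:
        """Return the names credited with the given role."""
        return [credit.name for credit in self.credits if credit.role == role]


# Hypothetical record built from the example in the text above.
ninth = Recording(
    title="Symphony No. 9",
    credits=[
        Credit("Ludwig van Beethoven", "Composer"),
        Credit("Berlin Philharmonic", "Orchestra"),
        Credit("Herbert von Karajan", "Conductor"),
        Credit("Anna Tomowa-Sintow", "Soloist"),
        Credit("Agnes Baltsa", "Soloist"),
        Credit("Peter Schreier", "Soloist"),
        Credit("José van Dam", "Soloist"),
    ],
)

print(ninth.flat_artist())          # one long string, roles gone
print(ninth.by_role("Conductor"))   # roles kept, so this can be answered
```

Once the roles are flattened away there is no reliable way to ask “who conducted?”, which is exactly the question I want a dataset to answer.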

Data Sources

I set out to find the most comprehensive music databases and review them for the quality and quantity of their discography and credits data. Here’s what I found:

Wikipedia

Echonest / Spotify

Allmusic

Musicbrainz

Discogs

Others

There are many other music databases that are worth a mention. I haven’t taken an in-depth look at these but it appears that they are less suitable for my purposes than the databases reviewed above.

In Summary

Discogs is the best for our purposes of obtaining detailed credit meta-data.

I’ve opted to import the Discogs XML dumps into a Postgres database, where I can link together the tables that I’m interested in with a hideous SQL query. I then push the tracks, together with their credit meta-data, into an Elasticsearch index, allowing for much faster full-text search.
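As a rough illustration of that pipeline, here is a minimal sketch rather than my actual import code. It makes several assumptions: sqlite3 stands in for Postgres so that it runs anywhere, the table and column names (releases, tracks, track_credits) are hypothetical, the rows are hand-typed from the example earlier in this article, and the index name "tracks" is made up. It only builds the newline-delimited body that Elasticsearch's bulk API expects; it does not send it anywhere.

```python
import json
import sqlite3

# sqlite3 is a stand-in for Postgres; the schema below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE releases (id INTEGER PRIMARY KEY, title TEXT, artist TEXT);
    CREATE TABLE tracks (id INTEGER PRIMARY KEY, release_id INTEGER, title TEXT);
    CREATE TABLE track_credits (track_id INTEGER, artist_name TEXT, role TEXT);
""")

# Illustrative rows only; the ids are invented.
conn.execute("INSERT INTO releases VALUES (?, ?, ?)", (1, "It's Time", "Jackie McLean"))
conn.execute("INSERT INTO tracks VALUES (?, ?, ?)", (10, 1, "It's Time"))
conn.executemany(
    "INSERT INTO track_credits VALUES (?, ?, ?)",
    [
        (10, "Herbie Hancock", "Piano"),
        (10, "Roy Haynes", "Drums"),
        (10, "Cecil McBee", "Bass"),
    ],
)

# One row per (track, credit); LEFT JOIN keeps tracks that have no credits.
JOIN_SQL = """
    SELECT t.id, t.title, r.title, r.artist, c.artist_name, c.role
    FROM tracks t
    JOIN releases r ON r.id = t.release_id
    LEFT JOIN track_credits c ON c.track_id = t.id
    ORDER BY t.id, c.artist_name
"""


def build_documents(connection):
    """Group the joined rows into one document per track."""
    docs = {}
    for track_id, track, release, artist, name, role in connection.execute(JOIN_SQL):
        doc = docs.setdefault(
            track_id,
            {"track": track, "release": release, "artist": artist, "credits": []},
        )
        if name is not None:
            doc["credits"].append({"name": name, "role": role})
    return docs


def to_bulk_body(docs, index_name="tracks"):
    """Render documents as action/source line pairs for the bulk API."""
    lines = []
    for track_id, doc in docs.items():
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(track_id)}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


documents = build_documents(conn)
bulk_body = to_bulk_body(documents)
print(bulk_body)
```

Keeping the credits as a list of name and role pairs on each track means a single search can match on the musician. If you need to match a name and a role together (say, Cecil McBee specifically on bass), the credits field would probably need a nested mapping in Elasticsearch.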

If you’re interested in building data-driven applications, data pipelines or visualisations for your organisation, then get in touch. It’s what we do for a living, and for fun!