Publicly Available Datasets

Below are several publicly available datasets for audio version identification:

The Covers80 Dataset A dataset of low-quality audio consisting of 160 songs split into two disjoint subsets, A and B, each holding exactly one version from every pair of songs, for a total of 80 pairs. Mostly '80s and early '90s pop music. This was one of the first publicly available datasets used by researchers.
Da-TACOS A dataset with pre-extracted features and metadata, comprising a "benchmarking subset" of 15,000 songs and a "cover analysis subset" of 10,000 songs.
CoversBR A database with pre-extracted features (similar to Da-TACOS) for 102,298 songs, distributed into 26,366 groups of covers, of mostly Brazilian music.
Covers1000 A dataset of pre-extracted features of 395 groups of songs, along with a live demo of some alignment algorithms.
Kara1k Karaoke Songs Dataset A dataset with features for 2,000 songs: 1,000 originals and 1,000 corresponding karaoke versions. Also a great dataset for singing voice analysis.
The SecondHandSongs Dataset Another dataset based on annotations from secondhandsongs.com. It is a subset of the Million Song Dataset consisting of about 20,000 tracks with Echo Nest features.
The YouTubeCovers Dataset A collection of chroma, CRP, and CENS features for 350 songs of various genres.
https://secondhandsongs.com/ A community project of annotations of cover songs.