Description
After some digging, I'm reasonably confident that a fair number of files have at least one exact duplicate in the fma_full zipfile. This came up while I was troubleshooting some weird behavior: I noticed that the ID3 metadata associated with a track didn't match the corresponding row in the CSV file of track metadata, but did match a different row.
Metadata matching is at best a wicked pain, so instead I took a look at which files match based on a hash of the byte stream:
```python
import hashlib, glob, os
from joblib import Parallel, delayed

def hash_one(fname):
    """Return the hex SHA-384 digest of a file's raw bytes."""
    hsh = hashlib.sha384()
    with open(fname, 'rb') as fd:
        hsh.update(fd.read())
    return hsh.digest().hex()

pool = Parallel(n_jobs=-2, verbose=20)
dfx = delayed(hash_one)
fnames = glob.glob('fma_full/*/*mp3')
fhashes = pool(dfx(fn) for fn in fnames)  # takes approx 20min w/64 cores :oD

# Group track IDs (filename stems) by their content hash.
groups = dict()
for fh, fn in zip(fhashes, fnames):
    if fh not in groups:
        groups[fh] = []
    groups[fh].append(os.path.splitext(os.path.basename(fn))[0])
```

This produces 105,637 unique file hashes from 106,574 files, with 105,042 hashes pointing to a single file.
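Pulling the actual duplicate clusters out of `groups` is then a one-liner filter. A small sketch of that step, shown with synthetic hashes and track IDs so it runs standalone (the real run would use the `fhashes`/`fnames` lists from above):

```python
from collections import defaultdict

def duplicate_groups(hashes, names):
    """Group track IDs by hash and keep only hashes shared by 2+ files."""
    groups = defaultdict(list)
    for fh, fn in zip(hashes, names):
        groups[fh].append(fn)
    return {fh: fns for fh, fns in groups.items() if len(fns) > 1}

# Synthetic example: two "files" share the same digest.
hashes = ['aaa', 'bbb', 'aaa', 'ccc']
names = ['000001', '000002', '000003', '000004']
print(duplicate_groups(hashes, names))  # {'aaa': ['000001', '000003']}
```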
I've reproduced this twice by decompressing the zipfile, so I'm pretty sure it's nothing I did. That said, I downloaded the dataset a long time ago (last summer, maybe?), and I'm curious whether it's been updated since.
I'm curious what might have caused this, and whether the ~105k tracks without duplicates map to accurate metadata in the raw_tracks.csv file. I haven't had a chance to check the ID3 tag coverage yet, but that should be an easy thing to look into.
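For the consistency check itself, the core comparison could look something like the helper below. This is only a sketch under assumptions: the field names and example values are hypothetical, and the tag/row dicts would come from an ID3 reader and the CSV in practice.

```python
def tags_match(id3_tags, csv_row, fields=('title', 'artist', 'album')):
    """Return the fields where the ID3 tags disagree with the CSV row.

    An empty dict means the metadata is consistent for the compared fields;
    fields missing on either side are skipped rather than flagged.
    """
    return {f: (id3_tags.get(f), csv_row.get(f))
            for f in fields
            if id3_tags.get(f) and csv_row.get(f)
            and id3_tags[f].strip().lower() != csv_row[f].strip().lower()}

# Hypothetical example values:
id3 = {'title': 'Food', 'artist': 'AWOL'}
row = {'title': 'Food', 'artist': 'Someone Else'}
print(tags_match(id3, row))  # {'artist': ('AWOL', 'Someone Else')}
```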
For what it's worth, I also haven't looked at the smaller partitions, so I'm not sure whether or how this might affect other uses of the dataset. Will follow up if/when I learn more.