This is a dataset of Movies. The original data from the UCI database have been cleaned up and converted from HTML to CSV.
- Syntax errors in the original HTML files have been cleaned up.
- Script for converting HTML files into CSV was added (see file
data/make). - Some additional columns (eg.
Ref:) were omitted from the CSV output as they can not be parsed easily by a computer (the HTML files were intact, you can continue working on this easily). - The following list of files should be “ready”:
actors.csvcasts.csvremakes.csvstudios.csvsynonyms.csvmain.htmlpeople.html
Despite the effort, there are many ways this dataset can be enhanced:
- Characterize what type of films the database contains. Currently, the description only says over 10000 films including many older, odd, and cult films, which is not a very precise definition...
- Clean up one of these files:
awtypes.html,locales.html,quotes.htmlorsayings.html. - Devise a way of making the non-machine-readable columns (usually the last one) machine-readable.
- Make the date columns machine-readable (lot of noise currently).
- Check for errors in the data.