CLI utility to download and compile IMDb datasets into an SQLite database.
Rackfocus (or rackfocus) is a CLI tool to compile IMDb datasets into a convenient SQLite database for easy consumption.
IMDb provides open access to an assortment of their data for non-commercial use. This data is packaged in TSV files that are available to download without any authentication. See their Datasets page for more information.
This data can be super nifty for data visualization projects and similar use cases, but it’s hard to “peek” into, because answering even simple questions usually involves joins. For instance, cast, crew and titles are referred to by unique IDs, and looking up their names requires joining one dataset against another. Rackfocus makes it easy to surf through the data (assuming you speak SQL) by compiling it all into a compact SQLite file.
Disclaimer: Rackfocus is not owned or maintained by IMDb or Amazon.
All that’s required to run Rackfocus is Python 3 (I’m pretty certain as low as 3.4 should work, possibly lower). It does not depend on any third-party packages.
Installing + Running
You can install Rackfocus from PyPI using pip:
pip install rackfocus
That should get you set up with a rackfocus command that can be invoked from anywhere like so:
rackfocus ./path/to/working/dir ./path/to/output/dir
Rackfocus will use the working directory (first argument) to place datasets temporarily. Working data will be contained in a new directory that Rackfocus creates, which will be destroyed after the compilation is complete.
Rackfocus’s output is a neat little SQLite database in a file named rackfocus_out.db, which is placed under the path specified as the second argument.
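To surf through the database generated by Rackfocus, open it using an SQLite command line utility like so:
sqlite3 ./path/to/output/dir/rackfocus_out.db
At the SQLite prompt, view all tables using:
.tables
View schema for a table (title_basics, for example) using:
.schema title_basics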
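If you would rather inspect the database from a script, the same two lookups can be sketched with Python's built-in sqlite3 module. This is only an illustration, not part of Rackfocus: the helper names list_tables and table_schema are made up here, and both simply read SQLite's own sqlite_master catalog.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def table_schema(conn, table):
    """Return the CREATE TABLE statement stored for `table`, or None if absent."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row[0] if row else None
```

To try it, open a connection with sqlite3.connect("rackfocus_out.db") (adjust the path to wherever your output directory is) and pass it to either helper.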
Tip - Schedule It!
During its development, Rackfocus was intended to be scheduled. Say you enjoy data visualization and always want the latest IMDb data ready to dip into. Simply set up the job to run periodically on any computer, perhaps a Raspberry Pi. Use cron or another scheduling mechanism to update data daily or weekly!
The database file that gets generated as output has a reliable, unchanging file name. That way, it gets overwritten when scheduled, without requiring any cleanup.
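As a rough sketch of what a scheduled job might call, the script below wraps the documented rackfocus invocation in Python. The function names build_command and refresh are hypothetical, and using a throwaway temporary directory as the working directory is just one reasonable choice, not something Rackfocus requires.

```python
import subprocess
import tempfile


def build_command(working_dir, output_dir):
    """Build the documented CLI call: rackfocus <working dir> <output dir>."""
    return ["rackfocus", str(working_dir), str(output_dir)]


def refresh(output_dir, runner=subprocess.run):
    """Run Rackfocus once, using a temporary directory as the working directory."""
    with tempfile.TemporaryDirectory() as working_dir:
        # check=True makes subprocess.run raise if rackfocus exits with an error,
        # so a failed refresh is visible to whatever scheduler runs this script.
        return runner(build_command(working_dir, output_dir), check=True)
```

A scheduler such as cron could then run a small script that calls refresh("./path/to/output/dir") daily or weekly.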
The output SQLite database file includes tables that more or less mirror the TSV files that IMDb provides, with a few quality-of-life joining and browsing enhancements. These enhancements are mainly some extra unpacking of rows into multiple rows.
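To give a feel for what “unpacking rows into multiple rows” can look like, here is a hedged sketch rather than Rackfocus's actual code. It assumes IMDb's crew dataset, where the directors and writers fields each hold a comma-separated list of nconst values, and turns one such row into one row per person. The function name unpack_crew_row and the (tconst, nconst, job) layout are assumptions made for illustration, and the IDs in the usage note are placeholders.

```python
def unpack_crew_row(tconst, directors, writers):
    """Split packed ID lists into one (tconst, nconst, job) tuple per person."""
    rows = []
    for job, packed in (("director", directors), ("writer", writers)):
        # IMDb's TSV files mark a missing value with a backslash followed by N.
        if packed == r"\N":
            continue
        for nconst in packed.split(","):
            rows.append((tconst, nconst, job))
    return rows
```

For example, unpack_crew_row("tt0000001", "nm0000001,nm0000002", r"\N") yields two director rows, which is what lets a query join on a single nconst column instead of searching inside a packed string.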
Much of the data model described on IMDb’s datasets page applies. Any overrides to this can be deduced from the rackfocus/models.py file in this repo.
There are two key facts about the datasets to get you started:
tconst values represent unique titles. Entities like movies, shorts, video games, TV series, and even individual TV episodes share this namespace.
nconst values represent unique people.
These values also happen to appear in URLs for people and titles on IMDb. For example, Nicolas Cage is nm0000115.
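Since the IDs double as URL fragments, a query result can be turned into a clickable link with a few lines of Python. The helper below is a small sketch (imdb_url is a made-up name) that assumes the tt prefix for titles, the nm prefix for people, and IMDb's usual /title/ and /name/ URL paths.

```python
def imdb_url(const):
    """Return the IMDb page URL for a tconst (tt...) or nconst (nm...) value."""
    if const.startswith("tt"):
        return "https://www.imdb.com/title/{}/".format(const)
    if const.startswith("nm"):
        return "https://www.imdb.com/name/{}/".format(const)
    raise ValueError("not a tconst or nconst: {!r}".format(const))
```

With the example above, imdb_url("nm0000115") points at Nicolas Cage's page.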
Get all feature films starring Evangeline Lilly with an IMDb rating of 7.0 or higher:
select distinct primaryTitle
from name_basics
join title_principals on name_basics.nconst = title_principals.nconst
join title_basics on title_basics.tconst = title_principals.tconst
join title_ratings on title_basics.tconst = title_ratings.tconst
where primaryName = 'Evangeline Lilly'
  and titleType = 'movie'
  and category = 'actress'
  and averageRating >= 7.0;
Get all feature films directed by Michael Bay:
select distinct primaryTitle
from title_basics
join title_crew on title_basics.tconst = title_crew.tconst
join name_basics on title_crew.nconst = name_basics.nconst
where primaryName = 'Michael Bay'
  and job = 'director'
  and titleType = 'movie';
Get all comedy feature films with an IMDb rating between 4.0 and 6.0, released in 2000 or later:
select distinct primaryTitle
from title_basics
join title_ratings on title_basics.tconst = title_ratings.tconst
where averageRating > 4.0
  and averageRating < 6.0
  and genres like '%Comedy%'
  and startYear >= 2000
  and titleType = 'movie';
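Queries like these can also be run from a script. The sketch below uses Python's built-in sqlite3 module and turns the genre, rating range and start year into query parameters. The function name titles_by_genre is hypothetical, and the table and column names are taken from the example queries above rather than checked against the generated schema.

```python
import sqlite3

QUERY = """
select distinct primaryTitle
from title_basics
join title_ratings on title_basics.tconst = title_ratings.tconst
where titleType = 'movie'
  and genres like ?
  and averageRating > ?
  and averageRating < ?
  and startYear >= ?
order by primaryTitle
"""


def titles_by_genre(conn, genre, low, high, since):
    """Return feature film titles in `genre` rated between low and high (exclusive)."""
    # Placeholders keep user-supplied values out of the SQL text itself.
    params = ("%{}%".format(genre), low, high, since)
    return [title for (title,) in conn.execute(QUERY, params)]
```

Calling titles_by_genre(conn, "Comedy", 4.0, 6.0, 2000) on a connection to rackfocus_out.db should mirror the comedy query above, with the results sorted by title.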
Improvements are very welcome, and contributing them is easy! Rackfocus is a very straightforward application, with zero package dependencies outside of Python itself. Simply fork and/or clone this repo and you should be good to go.
From the top-level directory of this repo, run:
pip install -e .
That should install your local version of Rackfocus in your environment. Run the same command after making each change and it should refresh the local install to reflect changes.
To restore your system to the original state, run:
pip uninstall rackfocus