
CLI utility to download and compile IMDb datasets into an SQLite database.


Rackfocus (or rackfocus) is a CLI tool to compile IMDb datasets into a convenient SQLite database for easy consumption.

IMDb provides open access to an assortment of their data for non-commercial use. This data is packaged in TSV files that are available to download without any authentication. See their Datasets page for more information.

This data can be super nifty for data visualization projects and other similar use cases, but it’s hard to “peek” into, as doing so often involves joins. For instance, cast, crew and titles are referred to by unique IDs, and looking up their names requires joining one dataset with another. Rackfocus makes it easy to surf through the data (assuming you speak SQL) by compiling it all into a compact SQLite file.

Disclaimer: Rackfocus is not owned or maintained by IMDb or Amazon.

Usage

Requirements

All that’s required to run Rackfocus is Python 3 (I’m pretty certain as low as 3.4 should work, possibly lower). It does not depend on any third-party packages.

Installing + Running

Rackfocus is available on PyPI and can be installed using pip:

pip install rackfocus

That should get you set up with a rackfocus command that can be invoked from anywhere like so:

rackfocus ./path/to/working/dir ./path/to/output/dir

Rackfocus will use the working directory (first argument) to place datasets temporarily. Working data will be contained in a new directory that Rackfocus creates, which will be destroyed after the compilation is complete.

Rackfocus’s output is a neat little SQLite database in a file named rackfocus_out.db, which is placed under the path specified as the second argument.
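If you would rather kick off a run from a Python script (handy for the scheduling tip further down), the invocation above can be wrapped roughly as follows. This is only a sketch: `build_command`, `expected_db_path` and `compile_database` are hypothetical helper names, not part of Rackfocus, and the only things taken from this README are the argument order and the `rackfocus_out.db` file name.

```python
import subprocess
from pathlib import Path

# Output file name as stated in this README.
DB_NAME = "rackfocus_out.db"


def build_command(working_dir, output_dir):
    """Return the argv list: working directory first, output directory second."""
    return ["rackfocus", str(working_dir), str(output_dir)]


def expected_db_path(output_dir):
    """Return where the compiled database should end up."""
    return Path(output_dir) / DB_NAME


def compile_database(working_dir, output_dir):
    """Hypothetical wrapper: run the CLI and return the database path.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    subprocess.run(build_command(working_dir, output_dir), check=True)
    return expected_db_path(output_dir)
```

Calling `compile_database("./work", "./out")` assumes the `rackfocus` command is on your PATH.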

To surf through the database generated by Rackfocus, open it using an SQLite command line utility like so:

sqlite3 rackfocus_out.db

At the SQLite prompt, view all tables using:

.tables

View schema for a table using:

.schema title_akas
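The same kind of peek works from Python using the standard library's `sqlite3` module, by reading SQLite's built-in `sqlite_master` table. The sketch below is an approximation of `.tables` and `.schema`, not a replica of them. The helper names are made up for illustration, and the demo uses an in-memory database with a stand-in `title_akas` table whose columns are invented; point `sqlite3.connect` at your own `rackfocus_out.db` to inspect the real thing.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables, similar to `.tables`."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


def table_schema(conn, table):
    """Return the CREATE statement for a table, similar to `.schema`."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row[0] if row else None


# Demo on a throwaway in-memory database (columns are invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE title_akas (tconst TEXT, title TEXT)")
print(list_tables(conn))
print(table_schema(conn, "title_akas"))
```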

Tip - Schedule It!

During its development, Rackfocus was intended to be scheduled. Say you enjoy data visualization and always want the latest IMDb data ready to dip into. Simply set up the job to run periodically on any computer, perhaps a Raspberry Pi. Use cron or another scheduling mechanism to update data daily or weekly!

The database file that gets generated as output has a reliable, unchanging file name. That way, it gets overwritten when scheduled, without requiring any cleanup.

Data Model

The output SQLite database file includes tables that more or less mirror the TSV files that IMDb provides, with a few quality-of-life joining and browsing enhancements. These enhancements are mainly some extra unpacking of rows into multiple rows.
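To picture what “unpacking” means here: IMDb's TSV files pack some multi-valued fields into a single comma-separated string and use `\N` for missing values. The sketch below shows the general idea of turning one such row into several. It is not Rackfocus's actual implementation (see rackfocus/models.py for that); `unpack_row` is a hypothetical helper and the IDs are made up.

```python
def unpack_row(tconst, packed, null_marker=r"\N"):
    """Expand one (tconst, 'id1,id2') row into one (tconst, id) row per ID."""
    # A missing or empty value produces no rows at all.
    if not packed or packed == null_marker:
        return []
    return [(tconst, value) for value in packed.split(",")]


# Made-up IDs, purely for illustration.
for row in unpack_row("tt9000001", "nm9000001,nm9000002"):
    print(row)
```

With one row per ID, looking up names becomes a plain equality join instead of string matching on a packed column.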

Much of the data model described on IMDb’s datasets page applies. Any overrides to this can be deduced from the rackfocus/models.py file in this repo.

There are two key facts about the datasets to get you started:

  1. tconst values represent unique titles. Entities like movies, shorts, video games, TV series, and even individual TV episodes share this namespace.

  2. nconst values represent unique people.

These values also happen to appear in URLs for people and titles on IMDb. For example, Nicolas Cage is nm0000115.
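Since the IDs double as URL components, a tiny helper can turn query results into links. This is a sketch, with `imdb_url` being a made-up name; it assumes IMDb's usual `/name/` and `/title/` URL layout and that people IDs start with `nm` and title IDs with `tt`.

```python
def imdb_url(const):
    """Build an IMDb page URL from an nconst (person) or tconst (title)."""
    if const.startswith("nm"):
        return f"https://www.imdb.com/name/{const}/"
    if const.startswith("tt"):
        return f"https://www.imdb.com/title/{const}/"
    raise ValueError(f"not an nconst or tconst: {const!r}")


print(imdb_url("nm0000115"))  # prints https://www.imdb.com/name/nm0000115/
```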

Examples

Get all feature films starring Evangeline Lilly with an IMDb rating of 7.0 or higher:

select
    distinct primaryTitle
from
    name_basics
    join title_principals on name_basics.nconst=title_principals.nconst
    join title_basics on title_basics.tconst=title_principals.tconst
    join title_ratings on title_basics.tconst=title_ratings.tconst
where
    primaryName='Evangeline Lilly'
    and titleType='movie'
    and category='actress'
    and averageRating >= 7.0;

Get all feature films directed by Michael Bay:

select
    distinct primaryTitle
from
    title_basics
    join title_crew on title_basics.tconst=title_crew.tconst
    join name_basics on title_crew.nconst=name_basics.nconst
where
    primaryName='Michael Bay'
    and job='director'
    and titleType='movie';

Get all comedy feature films with an IMDb rating between 4.0 and 6.0 (exclusive), released in 2000 or later:

select
    distinct primaryTitle
from
    title_basics
    join title_ratings on title_basics.tconst=title_ratings.tconst
where
    averageRating > 4.0
    and averageRating < 6.0
    and genres like '%Comedy%'
    and startYear >= 2000
    and titleType='movie';
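Queries like these can also be run from Python with bound parameters, which saves editing the SQL by hand for each genre or rating range. The sketch below reuses the comedy query against a toy in-memory database. The table and column names come from the query above, but the column types, the `find_titles` helper and every row of data are made up for illustration; connect to your own `rackfocus_out.db` for real results.

```python
import sqlite3

QUERY = """
select distinct primaryTitle
from title_basics
    join title_ratings on title_basics.tconst = title_ratings.tconst
where averageRating > :low
    and averageRating < :high
    and genres like :genre
    and startYear >= :year
    and titleType = 'movie'
order by primaryTitle
"""


def find_titles(conn, genre, low, high, year):
    """Return titles in a genre, rated strictly between low and high, from year on."""
    params = {"genre": f"%{genre}%", "low": low, "high": high, "year": year}
    return [row[0] for row in conn.execute(QUERY, params)]


# Toy stand-in for the real database: invented rows, assumed column types.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table title_basics (
    tconst text, titleType text, primaryTitle text,
    startYear integer, genres text
);
create table title_ratings (tconst text, averageRating real);
insert into title_basics values
    ('tt9000001', 'movie', 'Made-Up Comedy', 2005, 'Comedy,Romance'),
    ('tt9000002', 'movie', 'Made-Up Drama', 2010, 'Drama'),
    ('tt9000003', 'movie', 'Old Made-Up Comedy', 1995, 'Comedy');
insert into title_ratings values
    ('tt9000001', 5.2), ('tt9000002', 5.0), ('tt9000003', 5.5);
""")

print(find_titles(conn, "Comedy", 4.0, 6.0, 2000))  # prints ['Made-Up Comedy']
```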

Contributing

Improvements are very welcome, and contributing them is easy! Rackfocus is a very straightforward application, with zero package dependencies outside of Python itself. Simply fork and/or clone this repo and you should be good to go.

From the top-level directory of this repo, run:

pip install -e .

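That should install your local version of Rackfocus in your environment. Since this is an editable install, changes to the source should generally take effect without reinstalling; if something looks stale (for example, after changing packaging metadata), run the same command again to refresh the local install.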

To restore your system to the original state, run:

pip uninstall rackfocus
