build a database from IMDb datasets
The IMDb datasets are only available for personal and non-commercial use. For details refer to the previous link.
Pimdb is available from PyPI and can be installed using:
$ pip install pimdb
To download the current IMDb datsets to the current folder, run:
pimdb download all
(This downloads about 1 GB of data and might take a couple of minutes).
Transferring datasets into tables
To import them in a local SQLite database
pimdb.db located in the current
pimdb transfer all
(This will take a while. On a reasonably modern laptop with a local database you can expect about 2 hours).
The resulting database contains one tables for each dataset. The table names
are PascalCase variants of the dataset name. For example, the date from the
title.basics are stored in the table
TitleBasics. The column names
in the table match the names from the datasets, for example
TitleBasics.primaryTitle. A short description of all the datasets and
columns can be found at the download page for the
Optionally you can specify a different database using the
SQLAlchemy engine configuration.
For simple queries you can also use
pimdb and look at the result as
UTF-8 encoded TSV. For example, here are the details of the top 10 oldest
people alive according to IMDb:
pimdb query "select * from NameBasics where birthYear is not null and deathYear is null order by birthYear limit 10" >oldest_people_alive.tsv
You can also run an SQL statement stored in a file:
pimdb query --file some.sql
Building normalized tables
The tables so far are almost verbatim copies of the IMDb datasets with the exception that possible duplicate rows have been removed. This data model already allows to perform several kinds of queries quite easily and efficiently.
However, the IMDb datasets do not offer a simple way to query N:M relations.
For example, the column
NameBasics.knownForTitles contains a comma separated
list of tconsts like "tt2076794,tt0116514,tt0118577,tt0086491".
To perform such queries efficiently you can build strictly normalized tables derived from the dataset tables by running:
If you did specify a
--database for the
transfer command before, you have to
specify the same value for
build in order to find the source data. These tables
generally use snake_case names for both tables and columns, for example
Querying normalized tables
N:M relations are stored in tables using the naming template
name_to_known_for_title. These relation tables contain only the
numeric ID's to the respective actual data and a numeric column
remember the sort order of the comma separated list in the IMDb dataset column.
For example, here is an SQL query to list the titles Alan Smithee is known for:
select title.primary_title, title.start_year from name_to_known_for_title join name on name.id = name_to_known_for_title.name_id join title on title.id = name_to_known_for_title.title_id where name.primary_name = 'Alan Smithee'
For more information on which tables are available on how they are related read the chapter about the pimdb data model.
Where to go from here
Pimdb's online documentation describes all aspects in further detail. You might find the following chapters of particular interest:
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.