
build a database from IMDb datasets

Project description

pimdb

Pimdb is a Python package and command line utility to maintain a local copy of the essential parts of the Internet Movie Database (IMDb), based on the TSV files available from IMDb datasets.

License

The IMDb datasets are only available for personal and non-commercial use. For details, refer to the IMDb datasets link above.

Pimdb is open source and distributed under the BSD license. The source code is available from https://github.com/roskakori/pimdb.

Installation

Pimdb is available from PyPI and can be installed using:

$ pip install pimdb

Quick start

Downloading datasets

To download the current IMDb datasets to the current folder, run:

pimdb download all

(This downloads about 1 GB of data and might take a couple of minutes.)

Transferring datasets into tables

To import them into a local SQLite database pimdb.db in the current folder, run:

pimdb transfer all

This will take several hours; on a MacBook Pro M1, about 11 hours.

The resulting database contains one table for each dataset. The table names are PascalCase variants of the dataset names. For example, the data from the dataset title.basics is stored in the table TitleBasics. The column names in the tables match the names from the datasets, for example TitleBasics.primaryTitle. A short description of all the datasets and columns can be found at the download page for the IMDb datasets.
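
Because table and column names follow the datasets, a query for the ten earliest movies might look like the sketch below; titleType and startYear are further columns of the title.basics dataset:

select
    primaryTitle,
    startYear
from
    TitleBasics
where
    titleType = 'movie'
    and startYear is not null
order by
    startYear
limit 10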

Optionally you can specify a different database using the --database option with an SQLAlchemy engine configuration.
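
For example, to write the tables to a PostgreSQL database instead of the default SQLite file, a call might look like this; the user, password and database name are placeholders, and any SQLAlchemy engine URL works:

pimdb transfer all --database "postgresql://pimdb:secret@localhost/pimdb"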

Querying tables

To query the tables, you can use any database tool that supports SQLite, for example the freely available and platform-independent community edition of DBeaver or the command line shell for SQLite (sqlite3).
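
For example, with the SQLite shell and the default pimdb.db in the current folder, a quick sanity check might be:

sqlite3 pimdb.db "select count(*) from TitleBasics"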

For simple queries you can also use pimdb and look at the result as UTF-8 encoded TSV. For example, here are the details of the top 10 oldest people alive according to IMDb:

pimdb query "select * from NameBasics where birthYear is not null and deathYear is null order by birthYear limit 10" >oldest_people_alive.tsv

You can also run an SQL statement stored in a file:

pimdb query --file some.sql
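
For example, you could save the query for the oldest people alive from above in a file named oldest_people_alive.sql (the name is just an example) and redirect the output as before:

pimdb query --file oldest_people_alive.sql >oldest_people_alive.tsv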

Building normalized tables

The tables so far are almost verbatim copies of the IMDb datasets, with the exception that possible duplicate rows have been removed. This data model already allows you to perform several kinds of queries quite easily and efficiently.

However, the IMDb datasets do not offer a simple way to query N:M relations. For example, the column NameBasics.knownForTitles contains a comma separated list of tconsts like "tt2076794,tt0116514,tt0118577,tt0086491".
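
Without additional tables, finding everyone who is known for a certain title boils down to substring matching on that comma separated list, roughly like the sketch below (tt0116514 is one of the tconsts from the example above); such a query cannot use an index and quickly gets clumsy:

select
    primaryName,
    knownForTitles
from
    NameBasics
where
    knownForTitles like '%tt0116514%'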

To perform such queries efficiently you can build strictly normalized tables derived from the dataset tables by running:

pimdb build

If you specified a --database for the transfer command before, you have to specify the same value for build in order to find the source data. These tables generally use snake_case names for both tables and columns, for example title_alias.is_original.

This will take some time; on a MacBook Pro M1, about 30 minutes.

Querying normalized tables

N:M relations are stored in tables using the naming template some_to_other, for example name_to_known_for_title. These relation tables contain only the numeric IDs of the respective actual data and a numeric column ordering that remembers the sort order of the comma separated list in the IMDb dataset column.

For example, here is an SQL query to list the titles Alan Smithee is known for:

select
    title.primary_title,
    title.start_year
from
    name_to_known_for_title
    join name on
        name.id = name_to_known_for_title.name_id
    join title on
        title.id = name_to_known_for_title.title_id
where
    name.primary_name = 'Alan Smithee'
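
If the order of the original comma separated list matters, sort by the ordering column mentioned above, for example:

select
    title.primary_title
from
    name_to_known_for_title
    join name on
        name.id = name_to_known_for_title.name_id
    join title on
        title.id = name_to_known_for_title.title_id
where
    name.primary_name = 'Alan Smithee'
order by
    name_to_known_for_title.ordering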

For more information on which tables are available and how they are related, read the chapter about the pimdb data model.

Where to go from here

Pimdb's online documentation describes all aspects in further detail. You might find the following chapters of particular interest:

  • Usage: all command line options explained
  • Data model: available tables and example SQL queries
  • Contributing: obtaining the source code and building the project locally

Download files

Download the file for your platform.

Source Distribution

pimdb-0.3.1.tar.gz (154.8 kB)

Uploaded Source

Built Distribution


pimdb-0.3.1-py3-none-any.whl (23.0 kB)

Uploaded Python 3

File details

Details for the file pimdb-0.3.1.tar.gz.

File metadata

  • Download URL: pimdb-0.3.1.tar.gz
  • Upload date:
  • Size: 154.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.7.12

File hashes

Hashes for pimdb-0.3.1.tar.gz

  • SHA256: 6707eab08cab8f0c9dfee5d8af912e28b10fe5c84fb59ce2254bd297e39874b9
  • MD5: 850f11fc4286b7354fa445e4ca0cfbe6
  • BLAKE2b-256: a00a7e929901ad38366400653e55166434ce20f7646000007d72269d845fb66b


File details

Details for the file pimdb-0.3.1-py3-none-any.whl.

File metadata

  • Download URL: pimdb-0.3.1-py3-none-any.whl
  • Upload date:
  • Size: 23.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.7.12

File hashes

Hashes for pimdb-0.3.1-py3-none-any.whl

  • SHA256: aa9f0a71b4bb5c521539d05c924e77188a3ad22782e1845eb5bdb64a64e5f7ee
  • MD5: e5bbc1625f11e5a73b0248ec075d4a6e
  • BLAKE2b-256: 5a2438b6c580a0aa5aed2044b75a46cc4d6dbcec6edbf82fbb26b20e12c8ee50

