An application that scrapes data from IMDB and adjusts rating based on some rulesets.

IMDB rating classifier

This is a simple IMDB rating classifier application that penalizes reviews in accordance with a set of rules.

Overview

The application scrapes data from IMDB and adjusts the ratings according to a set of validation rules.

The data is scraped from the IMDB charts and parsed with the BeautifulSoup library.

The data structure of the parsed payload is as follows (example):

{
  "rank": "1",
  "title": "The Shawshank Redemption",
  "year": "1994",
  "rating": "9.2",
  "votes": "2,223,000",
  "url": "/title/tt0111161/",
  "poster_url": "https://m.media-amazon.com/images/wNjQ5NjU3MjE@._V1_SX300.jpg",
  "penalized": false
}

We then extract the following fields into a dataframe, converting each to the type shown:

- rank (int)
- title (str)
- year (int)
- rating (float)
- votes (int)
- url (str)
- poster_url (str)
- penalized (bool)
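
The type conversion implied by the list above can be sketched as follows. This is a minimal illustration, not the package's actual code: the names `Movie` and `parse_movie` are hypothetical, and the handling of the thousands separators in `votes` is an assumption based on the example payload.

```python
from dataclasses import dataclass


@dataclass
class Movie:
    """Typed view of one scraped record (hypothetical name, for illustration)."""

    rank: int
    title: str
    year: int
    rating: float
    votes: int
    url: str
    poster_url: str
    penalized: bool = False


def parse_movie(raw: dict) -> Movie:
    """Convert the string-valued payload into typed fields."""
    return Movie(
        rank=int(raw["rank"]),
        title=raw["title"],
        year=int(raw["year"]),
        rating=float(raw["rating"]),
        # Vote counts arrive with thousands separators, e.g. "2,223,000".
        votes=int(raw["votes"].replace(",", "")),
        url=raw["url"],
        poster_url=raw["poster_url"],
        penalized=bool(raw.get("penalized", False)),
    )


raw = {
    "rank": "1",
    "title": "The Shawshank Redemption",
    "year": "1994",
    "rating": "9.2",
    "votes": "2,223,000",
    "url": "/title/tt0111161/",
    "poster_url": "https://m.media-amazon.com/images/wNjQ5NjU3MjE@._V1_SX300.jpg",
    "penalized": False,
}
movie = parse_movie(raw)
```

A list of such `Movie` objects can then be turned into rows (for example with `dataclasses.asdict`) before loading them into a dataframe.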

Using dataclasses, we can then preprocess the data against a set of schema definition rules.

The schema definition rules are as follows:

schema = {
    "rank": {
        "type": "int",
        "min": 1,
        "max": 250,
        "required": True,
    },
    "title": {
        "type": "str",
        "required": True,
    },
    "year": {
        "type": "int",
        "min": 1900,
        "max": 2023,
        "required": True,
    },
    "rating": {
        "type": "float",
        "min": 0.0,
        "max": 10.0,
        "required": True,
    },
    "votes": {
        "type": "int",
        "min": 0,
        "required": True,
    },
    "url": {
        "type": "str",
        "required": True,
    },
    "poster_url": {
        "type": "str",
        "required": True,
    },
    "penalized": {
        "type": "bool",
        "required": True,
    },
}
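
One way such a schema could be applied to a record is sketched below. This is an assumption about how the rules might be checked, not the package's implementation: the `validate` function, the `TYPES` mapping, and the error message wording are all hypothetical, and only a trimmed copy of the schema is used to keep the example short.

```python
# Maps the schema's type names to Python types (assumed convention).
TYPES = {"int": int, "str": str, "float": float, "bool": bool}

# Trimmed copy of the schema above, for illustration only.
schema = {
    "rank": {"type": "int", "min": 1, "max": 250, "required": True},
    "title": {"type": "str", "required": True},
    "rating": {"type": "float", "min": 0.0, "max": 10.0, "required": True},
}


def validate(record: dict, schema: dict) -> list:
    """Return a list of error messages; an empty list means the record is valid."""
    errors = []
    for field, rule in schema.items():
        if field not in record:
            if rule.get("required", False):
                errors.append(f"{field}: missing required field")
            continue
        value = record[field]
        expected = TYPES[rule["type"]]
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        wrong_bool = expected is not bool and isinstance(value, bool)
        if wrong_bool or not isinstance(value, expected):
            errors.append(f"{field}: expected {rule['type']}")
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{field}: {value} is below the minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{field}: {value} is above the maximum {rule['max']}")
    return errors


ok = validate({"rank": 1, "title": "The Shawshank Redemption", "rating": 9.2}, schema)
bad = validate({"rank": 251, "rating": 9.2}, schema)
```

Here `ok` is an empty list, while `bad` reports both the out-of-range `rank` and the missing `title`. Collecting all errors rather than stopping at the first one makes it easier to report every problem with a scraped record at once.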

Requirements

  • Python>=3.10
  • BeautifulSoup4
  • requests
  • pytest
  • tox
  • click
  • pre-commit
  • flake8
  • black
  • isort

and more...

Installation

For development purposes:

  • Clone the repository

    foo@bar:~$ git clone git@github.com:marouenes/imdb-rating-classifier.git
    
  • Create a virtual environment

    foo@bar:~/imdb-rating-classifier$ virtualenv .venv
    
  • Activate the virtual environment

    foo@bar:~/imdb-rating-classifier$ source .venv/bin/activate
    
  • Install the dev dependencies

    foo@bar:~/imdb-rating-classifier$ pip install -r requirements-dev.txt
    
  • Install the pre-commit hooks

    foo@bar:~/imdb-rating-classifier$ pre-commit install
    

For usage:

  • Install the package and its dependencies in editable mode from the cloned repository

    foo@bar:~/imdb-rating-classifier$ pip install -e .
    

The application is also published on PyPI and can be installed using pip:

foo@bar:~$ pip install imdb-rating-classifier

Usage

  • Display the help message and the available commands

    foo@bar:~$ imdb-rating-classifier generate --help
    Usage: imdb-rating-classifier generate [OPTIONS]

      Generate the output dataset containing both the original and adjusted
      ratings.

      An extra JSON file will be generated alongside the csv file

    Options:
      --output FILE               The path to the output file.
      --number-of-movies INTEGER  The number of movies to scrape.
      -h, --help                  Show this message and exit.

  • Run the application with the default number of movies (20) and the default output file (data.csv)

    foo@bar:~$ imdb-rating-classifier generate

  • Run the application with a specific number of movies

    foo@bar:~$ imdb-rating-classifier generate --number-of-movies 100

  • Run the application with a specific number of movies and a specific output file

    foo@bar:~$ imdb-rating-classifier generate --number-of-movies 100 --output some_name.csv

Testing

  • Run tests and pre-commit hooks

    foo@bar:~/imdb-rating-classifier$ tox

CI/CD

The application is automatically packaged and distributed to PyPI. It is also automatically tested with GitHub Actions, using tox as the environment orchestrator.

TODO

  • Add more tests
  • Add more validation rules
  • Add more documentation
  • Add more features
  • Publish the package on PyPI
  • Add oscar awards or nominations for the movies
  • Add a version switch for the cli

License

MIT License

Author

Marouane Skandaji
