PIMMI: Python IMage MIning

PIMMI is a software tool that performs visual mining on a corpus of images. Its main objective is to find all copies, total or partial, in large volumes of images and to group them together. Our initial goal is to study the reuse of images on social networks (our first use case is the propagation of memes on Twitter). However, we believe that it can be used much more widely and easily adapted to other studies. The main features of PIMMI are therefore:

  • the ability to process large image corpora, up to several million files
  • robustness to the modifications typical of image reuse on social networks (crop, zoom, composition, addition of text, ...)
  • enough flexibility to adapt to different use cases (mainly the nature and volume of the image corpora)

PIMMI currently focuses only on visual mining and therefore does not manage the metadata attached to images. Metadata is specific to each study and is therefore outside our scope. A study using PIMMI will thus generally be broken down into several steps:

  1. constitution of a corpus of images (jpg and/or png files) and their metadata
  2. choice of PIMMI parameters according to the criteria of the corpus
  3. indexing the images with PIMMI and obtaining clusters of reused images
  4. exploitation of the clusters by combining them with the descriptive metadata of the images
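
Step 4 is essentially a join between the cluster assignments and the study's own metadata. The sketch below only illustrates that idea: the column names (path, cluster_id, author) and the inline CSV content are hypothetical and are not PIMMI's documented output format.

```python
import csv
import io
from collections import defaultdict

# Hypothetical cluster assignments, one row per image. The column names are
# assumptions made for this sketch, not PIMMI's documented output schema.
CLUSTERS_CSV = """path,cluster_id
a.jpg,0
b.jpg,0
c.jpg,1
"""

# Hypothetical study-specific metadata, keyed by the same image path.
METADATA_CSV = """path,author
a.jpg,alice
b.jpg,bob
c.jpg,alice
"""


def authors_per_cluster(clusters_text, metadata_text):
    """Map each cluster id to the sorted list of authors who posted its images."""
    author_by_path = {
        row["path"]: row["author"]
        for row in csv.DictReader(io.StringIO(metadata_text))
    }
    authors = defaultdict(set)
    for row in csv.DictReader(io.StringIO(clusters_text)):
        author = author_by_path.get(row["path"])
        if author is not None:  # skip images that have no metadata
            authors[row["cluster_id"]].add(author)
    return {cluster: sorted(names) for cluster, names in authors.items()}


result = authors_per_cluster(CLUSTERS_CSV, METADATA_CSV)
print(result)
```

With real data, the two inline strings would be replaced by the files produced by the study, read with the same csv.DictReader calls.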

PIMMI relies on existing technologies and integrates them into a simple data pipeline:

  1. Well-established local image descriptors (Scale-Invariant Feature Transform, SIFT) represent each image as a set of keypoints. Geometric consistency verification is also used. Both rely on the OpenCV implementation.
  2. To scale to large volumes of images, the database of keypoints is queried through FAISS, a well-known vector indexing library that provides some of the most efficient algorithm implementations.
  3. Similar images are grouped together using standard community detection algorithms on the similarity graph.
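
To make the last step concrete, here is a minimal sketch of grouping images from a list of pairwise matches. It uses connected components computed with a union-find structure, which is the simplest possible grouping strategy; it is an illustration in pure Python, not PIMMI's actual implementation, and the image names are made up.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes that are linked, directly or indirectly, by an edge."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root.
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node directly at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(list)
    for node in parent:
        groups[find(node)].append(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical similarity graph: each pair means "these two images matched".
edges = [("a.jpg", "b.jpg"), ("b.jpg", "c.jpg"), ("d.jpg", "e.jpg")]
clusters = connected_components(edges)
print(clusters)
```

Connected components merge everything reachable through a chain of matches, so a single spurious match can fuse two unrelated groups; community detection algorithms are one way to limit that effect.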

PIMMI is a multithreaded Python library that can be used through a command-line interface. A rudimentary web interface for visualizing the results, Pimmi-ui, is also provided, but more as an example than for intensive use.

The development of this software is still in progress: we warmly welcome beta testers, feedback, proposals for new features and even pull requests!

Authors

Installation

Pimmi requires Python 3 to be installed. If Python 3 is not installed on your computer, we recommend installing the distribution provided by Miniconda: https://docs.conda.io/projects/miniconda/en/latest/#quick-command-line-install

We recommend installing Pimmi in a virtual environment. The installation scenarios below provide instructions for installing Pimmi with conda (if you have Miniconda or Anaconda installed), with venv or with pyenv-virtualenv. If you are using another virtual environment management system, simply create a new environment, activate it and run:

pip install pimmi

Install with conda

conda create --name pimmi-env python=3.8
conda activate pimmi-env
pip install -U pip
pip install pimmi

Install with venv

python3 -m venv /tmp/pimmi-env
source /tmp/pimmi-env/bin/activate
pip install -U pip
pip install pimmi

Install with pyenv-virtualenv

pyenv virtualenv 3.8.0 pimmi-env
pyenv activate pimmi-env
pip install -U pip
pip install pimmi

Demo

# --- Play with the demo dataset
# Download the demo dataset; it will be saved in the folder demo_dataset.
# You can choose between small_dataset and dataset1.
# small_dataset contains 10 images and dataset1 contains 1000 images, which take about 2 minutes to download.

pimmi download_demo dataset1

# Create a default index structure and fill it with the demo dataset. A directory named my_index will be created;
# it will contain the 2 files of the pimmi index: index.faiss and index.meta
pimmi fill demo_dataset/dataset1 my_index

# Query the same dataset on this index, the results will be stored in
# result_query.csv
pimmi query demo_dataset/dataset1 my_index -o result_query.csv

# Post process the mining results in order to visualize them
pimmi clusters my_index result_query.csv clusters.json

# You can also play with the configuration parameters. First, generate a default configuration file
pimmi create-config my_pimmi_conf.yml

# Then simply use this configuration file to relaunch the mining steps (erasing without prompt the
# previous data)
pimmi fill --erase --force --config-path my_pimmi_conf.yml demo_dataset/dataset1 dataset1
pimmi query --config-path my_pimmi_conf.yml demo_dataset/dataset1 dataset1
pimmi clusters --config-path my_pimmi_conf.yml dataset1

Test on the Copydays dataset

Unfortunately, the data files and the dataset explanations are no longer available from their original location. Both can still be retrieved from the Web Archive.

Download the dataset

Download the following 4 gzipped tar archives: copydays_crop.tar.gz, copydays_jpeg.tar.gz, copydays_original.tar.gz, copydays_strong.tar.gz. Create a project structure and extract all the downloaded archives into the same images directory.

images
   └───copydays_crop
   └───original
   └───jpegqual
   └───copydays_strong
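
If you prefer to script the extraction, the sketch below unpacks a list of .tar.gz archives into a single directory using only the standard library. The function name extract_archives is ours, and the sketch assumes the archives have already been downloaded to the local disk.

```python
import tarfile
from pathlib import Path


def extract_archives(archive_paths, destination):
    """Extract each .tar.gz archive into the same destination directory."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    for archive_path in archive_paths:
        with tarfile.open(archive_path, mode="r:gz") as archive:
            archive.extractall(destination)
    # Return the top-level entries so the resulting layout can be checked.
    return sorted(entry.name for entry in destination.iterdir())
```

For example, extract_archives(["copydays_crop.tar.gz", "copydays_jpeg.tar.gz", "copydays_original.tar.gz", "copydays_strong.tar.gz"], "images") would populate the images directory; the names of the resulting subdirectories are whatever the archives themselves contain.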

Clone the repository

The script that evaluates your results is not included in the command-line interface, so you need to clone the repository to access it. It is located in scripts/copydays_groundtruth.py.

git clone https://github.com/nrv/pimmi.git

Commands to reproduce the results

pimmi --sift-nfeatures 1000 --index-type IVF1024,Flat fill images/ my_index_folder
pimmi --query-sift-knn 1000 --query-dist-ratio-threshold 0.8 --index-type IVF1024,Flat query images my_index_folder -o result_query.csv
pimmi --index-type IVF1024,Flat --algo components clusters my_index_folder result_query.csv -o clusters.csv

# Run the script to create the groundtruth file
python scripts/copydays_groundtruth.py images/ clusters.csv

# Compare the results to the groundtruth
pimmi eval groundtruth.csv --query-column image_status

Results:

cluster precision: 0.98650288140734
cluster recall: 0.7441110375823754
cluster f1: 0.7838840961245362
query average precision: 0.839968152866242
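
Note that the reported cluster f1 is not the harmonic mean of the reported cluster precision and recall (which would be approximately 0.848). One plausible explanation, which is our assumption and not something stated by the documentation, is that the scores are computed per item and then averaged. The sketch below shows, with made-up numbers, why an averaged F1 differs from the F1 of the averaged precision and recall.

```python
def harmonic_mean(precision, recall):
    """Return the F1 score of a single (precision, recall) pair."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def averaged_scores(per_item):
    """Average precision, recall and per-item F1 over (precision, recall) pairs."""
    count = len(per_item)
    mean_precision = sum(p for p, _ in per_item) / count
    mean_recall = sum(r for _, r in per_item) / count
    mean_f1 = sum(harmonic_mean(p, r) for p, r in per_item) / count
    return mean_precision, mean_recall, mean_f1


# Made-up scores for two items: one perfectly retrieved, one mostly missed.
per_item = [(1.0, 1.0), (1.0, 0.2)]
mean_precision, mean_recall, mean_f1 = averaged_scores(per_item)
f1_of_means = harmonic_mean(mean_precision, mean_recall)
print(mean_precision, mean_recall, mean_f1, f1_of_means)
```

In this toy case the averaged F1 is lower than the F1 of the averages, the same direction as the gap seen in the results above.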

Play with the parameters

You can then play with the different parameters and re-evaluate the results. If you want to loop over several parameters to optimize your settings, you may have a look at scripts/eval_copydays.sh.
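
As an alternative to a shell loop, a parameter sweep can be prepared in Python. The sketch below only builds the command lines for each combination, reusing the flags shown in the commands above; the parameter values and the index and result file names are hypothetical, and the content of scripts/eval_copydays.sh may differ.

```python
import itertools
import shlex


def sweep_commands(nfeatures_values, ratio_values, index_type="IVF1024,Flat"):
    """Yield (fill, query) command lines for every parameter combination."""
    for nfeatures, ratio in itertools.product(nfeatures_values, ratio_values):
        tag = f"n{nfeatures}_r{ratio}"
        index_dir = f"index_{tag}"      # hypothetical naming scheme
        results = f"result_{tag}.csv"   # hypothetical naming scheme
        fill = ["pimmi", "--sift-nfeatures", str(nfeatures),
                "--index-type", index_type, "fill", "images/", index_dir]
        query = ["pimmi", "--query-dist-ratio-threshold", str(ratio),
                 "--index-type", index_type, "query", "images", index_dir,
                 "-o", results]
        yield fill, query


# Print the commands instead of running them, so they can be reviewed first.
for fill, query in sweep_commands([500, 1000], [0.7, 0.8]):
    print(shlex.join(fill))
    print(shlex.join(query))
```

Each command is kept as a list of arguments, so it could be passed as is to subprocess.run once the printed commands look right. Giving every combination its own index directory and result file keeps the runs from overwriting each other.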

Troubleshooting

Error while installing faiss-cpu for macOS > 12

error: command '/usr/local/bin/swig' failed with exit code 1

Installing pimmi requires the faiss-cpu package. However, on macOS > 12, this package cannot be installed with pip (see https://github.com/facebookresearch/faiss/issues/2868). To work around this issue, follow these steps:

Install Miniconda : https://docs.conda.io/projects/miniconda/en/latest/#quick-command-line-install

Create and activate a virtual environment:

conda create --name testenv1
conda activate testenv1

In this virtual environment, install faiss-cpu:

conda install -c pytorch faiss-cpu

And then you should be able to install pimmi:

pip install pimmi

I have another error

Please submit an issue on the project's GitHub repository: https://github.com/nrv/pimmi

Contribute

Pull requests are welcome! Please find below the instructions to install a development version.

Install from source

python3 -m venv /tmp/pimmi-env
source /tmp/pimmi-env/bin/activate
pip install -U pip
git clone git@github.com:nrv/pimmi.git
cd pimmi
pip install -r requirements.txt
pip install -e .

Linting and tests

To lint the code and run the unit tests you can use the following commands:

# Only linter
make lint

# Only unit tests
make test

# Both
make
