Tool to download large datasets from a list of URLs

Dataset Downloader

Preview

dataset_downloader lets you download large datasets from multiple lists of URLs, for example from ImageNet. You can split the download into two folders, one for training and one for testing. Files are saved into a folder named after their class, which suits model training. The result looks something like this:

root:.
|
├───test
│   ├───accerola
│   ├───apple
│   └───lemon
├───train
│   ├───accerola
│   ├───apple
│   └───lemon
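
As a quick sanity check on a layout like the one above, the following sketch counts the files in each class folder of one split. It is not part of dataset_downloader; the helper name count_images_per_class is hypothetical and only the standard library is used.

```python
from collections import Counter
from pathlib import Path


def count_images_per_class(split_dir):
    """Return {class_name: file_count} for one split folder such as train/.

    Hypothetical helper for illustration; it only assumes the
    one-folder-per-class layout shown in the tree above.
    """
    counts = Counter()
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            # Count regular files only; nested folders are ignored.
            counts[class_dir.name] = sum(1 for p in class_dir.iterdir() if p.is_file())
    return dict(counts)
```

Calling count_images_per_class on the train folder and then on the test folder gives a rough view of whether each class ended up with the expected share of images.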

Installation

Simply install from pip:

pip install dataset_downloader

Config

Create a dataset.json file with the following content:

{
  "outputTrain": "...",
  "outputTest": "...",
  "ratio": ...,
  "classes": {
    "class1": [
      "http://url1",
      "http://url2"
    ],
    "class2": [
      "http://url1",
      "http://url2"
    ],
    "class3": "list_images.txt"
  }
}
  • outputTrain: Output folder for the training images
  • outputTest: Output folder for the testing images
  • ratio: The fraction of images used for training. 0.8 means 80% of the images go to training and the remaining 20% to testing.
  • classes: The classes and their URLs. The value for each class can be a list of URLs, a file containing a list of URLs, or a URL pointing to a list of URLs
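
To make the three accepted forms of a class entry and the meaning of ratio concrete, here is a minimal sketch. It is not the tool's actual implementation: the function names resolve_urls and split_by_ratio are hypothetical, the one-URL-per-line file format is an assumption, and the real tool may shuffle or split differently.

```python
import random
from pathlib import Path
from urllib.request import urlopen


def resolve_urls(source):
    """Turn one "classes" entry into a flat list of image URLs.

    Assumption: list files (local or remote) hold one URL per line.
    """
    if isinstance(source, list):
        return list(source)
    if source.startswith(("http://", "https://")):
        # A URL that points to a text file containing the list.
        with urlopen(source) as response:
            text = response.read().decode("utf-8")
    else:
        # A local text file containing the list.
        text = Path(source).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_by_ratio(urls, ratio, seed=0):
    """Shuffle the URLs, then send the first `ratio` share to training."""
    shuffled = list(urls)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]
```

With ten URLs and a ratio of 0.8, split_by_ratio returns eight training URLs and two testing URLs. The fixed seed is only there to make the sketch reproducible.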

An example file on a Windows computer:

{
  "outputTrain": "D:/dataset/train",
  "outputTest": "D:/dataset/test",
  "ratio": 0.8,
  "classes": {
    "accerola": [
      "http://tiachea.files.wordpress.com/2008/10/acerolas.jpg",
      "http://www.jardimdeflores.com.br/floresefolhas/JPEGS/A56acerola5.JPG",
      "http://farm2.staticflickr.com/1353/4602150961_177e096984_z.jpg"
    ],
    "apple": [
      "http://www.naturalhealth365.com/images/apple.jpg",
      "http://urbanext.illinois.edu/fruit/images/apple1.jpg",
      "https://www.aroma-zone.com/cms//sites/default/files/plante-acerola.jpg"
    ],
    "lemon": "list_images.txt",
    "watermelon": "https://gist.githubusercontent.com/johnrazeur/645787bc08a5aedd82da9573fbfa169a/raw/49cea1ee1438cecef8ac213b20f24e5ae02d4d78/watermelon.txt"
  }
}
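
Before running the tool, it can help to check that dataset.json parses and has the documented keys. The sketch below is an assumption-laden helper, not something dataset_downloader provides: check_config and load_config are hypothetical names, and the rules only mirror the field descriptions above. Python's json module follows strict JSON, so a trailing comma after the last item of a list makes loading fail.

```python
import json

# Keys documented in the Config section.
REQUIRED_KEYS = ("outputTrain", "outputTest", "ratio", "classes")


def check_config(config):
    """Return a list of readable problems; an empty list means it looks fine."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    if "ratio" in config:
        ratio = config["ratio"]
        is_number = isinstance(ratio, (int, float)) and not isinstance(ratio, bool)
        if not is_number or not 0 <= ratio <= 1:
            problems.append("ratio should be a number between 0 and 1")
    classes = config.get("classes", {})
    if not isinstance(classes, dict):
        problems.append("classes should be an object mapping class names to URLs")
        return problems
    for name, source in classes.items():
        # Each class is either a list of URLs or a string (file path or URL).
        if not isinstance(source, (str, list)):
            problems.append(f"class {name!r}: expected a list of URLs or a string")
    return problems


def load_config(path="dataset.json"):
    """Parse the config file; raises json.JSONDecodeError on invalid JSON."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A typical use would be to call check_config(load_config()) in the directory that holds dataset.json and print any reported problems.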

Run

Simply call the dataset_downloader command:

cd yourdirectory
# The dataset.json file must already exist in this directory
dataset_downloader
