A package for preprocessing data in the Arboretum project.

Overview

Before using this package, download the Arboretum metadata from Hugging Face.

arbor_process contains scripts to generate machine learning-ready image-text pairs from the downloaded metadata in four steps:

  1. Processing metadata files to get category and species distribution.
  2. Filtering metadata based on user-defined thresholds and generating shuffled chunks.
  3. Downloading images based on URLs in the metadata.
  4. Generating text labels for the images.

To Use

Installation

Use any of the following methods:

  • Clone the Arboretum repository, navigate to Arbor-preprocess, and run the scripts directly.
  • Clone the Arboretum repository, navigate to Arbor-preprocess, and run pip install . to use it as a package.
  • Install from PyPI with pip install arbor-process.

To Test the Installation

Run python example.py. This will process the sample metadata in the data folder.

To Run on the Downloaded Metadata

1. Adjust the config.json File

The config.json file contains the arguments for the different classes. Ensure it is updated with the correct paths and parameters before running the script. Detailed descriptions of the arguments can be found under Classes and Their Descriptions below.

{
    "metadata_processor_info": {
        "source_folder": "data/Arboretum-Full-samples/",
        "destination_folder": "data/v0/species_count_data",
        "categories": ["Aves", "Arachnida", "Insecta", "Plantae", "Fungi", "Mollusca", "Reptilia"]
    },

    "metadata_filter_and_shuffle_info": {
        "species_count_data": "data/v0/species_count_data/combined_sample_counts_per_species.csv",
        "directory": "data/Arboretum-Full-samples/",
        "rare_threshold": 10,
        "cap_threshold": 12,
        "part_size": 50,
        "rare_dir": "data/v0/rare_cases",
        "cap_filtered_dir_train": "data/v0/tmp_cap_filtered",
        "capped_dir": "data/v0/overthecap_cases",
        "merged_dir": "data/v0/processed_metadata",
        "files_per_chunk": 10,
        "random_seed": 42
    },

    "image_download_info": {
        "processed_metadata_folder": "data/v0/processed_metadata",
        "output_folder": "data/v0/img_txt",
        "start_index": 0,
        "end_index": 2,
        "concurrent_downloads": 1000
    },

    "img_text_gen_info": {
        "processed_metadata_folder": "data/v0/processed_metadata",
        "img_folder": "data/v0/img_txt",
        "generate_tar": true
    }
}
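
The load_config helper used in example.py below is presumably a thin wrapper around json.load; a minimal equivalent sketch (an assumption, not the package's verbatim code), in case you want to load the file yourself:

import json

def load_config(path):
    # Read the JSON config file into a plain dict.
    with open(path, 'r') as f:
        return json.load(f)

config = load_config('config.json')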

2. Running as a Python Script

To run the entire pipeline sequentially, use the provided example.py script. Comment out any step you do not want to run.

from arbor_process import *
import asyncio
import json

# Load configuration
config = load_config('config.json')

# Step 1: Process metadata
params = config.get('metadata_processor_info', {})
mp = MetadataProcessor(**params)
mp.process_all_files()

# Step 2: Generate shuffled chunks of metadata
params = config.get('metadata_filter_and_shuffle_info', {})
gen_shuffled_chunks = GenShuffledChunks(**params)
gen_shuffled_chunks.process_files()

# Step 3: Download images
params = config.get('image_download_info', {})
gi = GetImages(**params)
asyncio.run(gi.download_images())

# Step 4: Generate text pairs and create tar files (optional)
params = config.get('img_text_gen_info', {})
textgen = GenImgTxtPair(**params)
textgen.create_image_text_pairs()

3. Running from the Command Line

Update config.json with the correct paths, then use the following commands (run from the repository root) to execute each step individually.

# Step 1: Process metadata
python arbor_process/metadata_processor.py --config config.json

# Step 2: Generate shuffled chunks of metadata
python arbor_process/gen_filtered_shuffled_chunks.py --config config.json

# Step 3: Download images
python arbor_process/get_imgs.py --config config.json

# Step 4: Generate text pairs and create tar files (optional)
python arbor_process/gen_img_txt_pair.py --config config.json

Classes and Their Descriptions

1. MetadataProcessor

  • Description: Processes metadata files in parquet format. Filters the metadata based on categories and counts the number of species and categories. Saves the results in CSV files.
  • Inputs:
    • source_folder: The folder containing the parquet files.
    • destination_folder: The folder where the results will be saved.
    • categories: A list of categories to filter the metadata.
  • Outputs:
    • CSV files containing the counts of species and categories.
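
For reference, step 1 can also be run on its own; the constructor arguments mirror the metadata_processor_info block of config.json (the category list here is just a shortened example):

from arbor_process import MetadataProcessor

# Arguments mirror the metadata_processor_info block in config.json.
mp = MetadataProcessor(
    source_folder='data/Arboretum-Full-samples/',
    destination_folder='data/v0/species_count_data',
    categories=['Aves', 'Insecta'],  # shortened example list
)
mp.process_all_files()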

2. GenShuffledChunks

  • Description: Processes data files by filtering rare cases, capping frequent cases, and shuffling the data into specified parts.
  • Inputs:
    • species_count_data: Path to the species count data file.
    • directory: Path to the directory containing the original parquet files.
    • rare_threshold: Threshold for rare cases (default: 10).
    • cap_threshold: Threshold for frequent cases (default: 1000).
    • part_size: Size of each part after shuffling (default: 500).
    • rare_dir: Directory to save rare cases (default: 'rare_cases').
    • cap_filtered_dir_train: Directory to save capped and filtered cases (default: 'cap_filtered_train').
    • capped_dir: Directory to save capped cases (default: 'capped_cases').
    • merged_dir: Directory to save merged shuffled files (default: 'merged_cases').
    • files_per_chunk: Number of files to merge into a single chunk (default: 5).
    • random_seed: Random seed for shuffling (default: 42).
  • Outputs:
    • Saves rare cases, capped cases, and shuffled parts in specified directories.
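
Choosing rare_threshold and cap_threshold is easier after inspecting the species count CSV produced in step 1. A minimal pandas sketch; the 'count' column name is an assumption, so check the actual CSV header first:

import pandas as pd

counts = pd.read_csv(
    'data/v0/species_count_data/combined_sample_counts_per_species.csv')
print(counts.head())  # check the real column names first

# 'count' is a hypothetical column name for per-species sample counts.
print(counts['count'].describe())
print((counts['count'] < 10).sum(), 'species below rare_threshold=10')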

3. GetImages

  • Description: Downloads images from URLs stored in parquet files asynchronously.
  • Inputs:
    • processed_metadata_folder: Path to the folder containing the processed parquet files.
    • output_folder: Path to the folder where images will be saved.
    • start_index: Index of the first parquet file to process (default: 0).
    • end_index: Index of the last parquet file to process (default: None).
    • concurrent_downloads: Number of concurrent downloads (default: 1000).
  • Outputs:
    • Downloads images to output_folder.
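
Since start_index and end_index select which processed metadata files to download from, a subset can be (re)processed without rerunning everything; a sketch reusing the config (the override values are illustrative):

import asyncio
import json

from arbor_process import GetImages

with open('config.json') as f:
    params = json.load(f)['image_download_info']

# Illustrative override: download images for a subset of metadata files only.
params.update(start_index=2, end_index=4)

asyncio.run(GetImages(**params).download_images())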

4. GenImgTxtPair

  • Description: Generates text labels for downloaded images.
  • Inputs:
    • processed_metadata_folder: Path to the directory containing the processed parquet files.
    • img_folder: Path to the directory containing the downloaded images in subfolders.
    • generate_tar: Whether to create tar files from each image-text subfolder.
  • Outputs:
    • Generates 10 text labels in .txt and .json format for each image and saves them with each image.
    • Creates tar files from each image-text subfolder.
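
When generate_tar is enabled, the resulting archives can be spot-checked with Python's standard tarfile module (the archive path below is hypothetical; point it at a tar file actually created under your image folder):

import tarfile

# Hypothetical path; use a tar file created under data/v0/img_txt.
with tarfile.open('data/v0/img_txt/0.tar') as tar:
    for member in tar.getmembers()[:5]:
        print(member.name, member.size)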
