# arbor-process

A package for preprocessing data in the Arboretum project.

## Overview

Before using this package, download the metadata from Hugging Face. `arbor_process` contains scripts that generate machine-learning-ready image-text pairs from the downloaded metadata in four steps:
- Processing metadata files to get category and species distribution.
- Filtering metadata based on user-defined thresholds and generating shuffled chunks.
- Downloading images based on URLs in the metadata.
- Generating text labels for the images.
## To Use

### Installation

Use any of the following methods:

- Clone the Arboretum repository, navigate to `Arbor-preprocess`, and run the scripts directly.
- Clone the Arboretum repository, navigate to `Arbor-preprocess`, and run `pip install .` to install it as a package.
- Install from PyPI with `pip install arbor-process`.
### To Test the Installation

Run `python example.py`. This will process the sample metadata in the `data` folder.
### To Run on the Downloaded Metadata

#### 1. Adjust the `config.json` File

The `config.json` file contains the arguments for the different classes. Ensure it is updated with the correct paths and parameters before running the script. A detailed description of the arguments can be found below.
```json
{
  "metadata_processor_info": {
    "source_folder": "data/Arboretum-Full-samples/",
    "destination_folder": "data/v0/species_count_data",
    "categories": ["Aves", "Arachnida", "Insecta", "Plantae", "Fungi", "Mollusca", "Reptilia"]
  },
  "metadata_filter_and_shuffle_info": {
    "species_count_data": "data/v0/species_count_data/combined_sample_counts_per_species.csv",
    "directory": "data/Arboretum-Full-samples/",
    "rare_threshold": 10,
    "cap_threshold": 12,
    "part_size": 50,
    "rare_dir": "data/v0/rare_cases",
    "cap_filtered_dir_train": "data/v0/tmp_cap_filtered",
    "capped_dir": "data/v0/overthecap_cases",
    "merged_dir": "data/v0/processed_metadata",
    "files_per_chunk": 10,
    "random_seed": 42
  },
  "image_download_info": {
    "processed_metadata_folder": "data/v0/processed_metadata",
    "output_folder": "data/v0/img_txt",
    "start_index": 0,
    "end_index": 2,
    "concurrent_downloads": 1000
  },
  "img_text_gen_info": {
    "processed_metadata_folder": "data/v0/processed_metadata",
    "img_folder": "data/v0/img_txt",
    "generate_tar": true
  }
}
```
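Each top-level section of the config maps onto the keyword arguments of one pipeline class (e.g. `MetadataProcessor(**params)` in `example.py`). A minimal sketch of that mapping, using a trimmed inline copy of the config rather than the full file:

```python
import json

# Trimmed, inline stand-in for config.json (structure as shown above).
config_text = '''
{
  "metadata_processor_info": {
    "source_folder": "data/Arboretum-Full-samples/",
    "destination_folder": "data/v0/species_count_data",
    "categories": ["Aves", "Insecta"]
  }
}
'''

config = json.loads(config_text)

# Each section's keys become keyword arguments for the matching class.
params = config.get('metadata_processor_info', {})
print(sorted(params))  # ['categories', 'destination_folder', 'source_folder']
```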
#### 2. Running as a Python Script

To run the entire pipeline sequentially, use the provided `example.py` script. Comment out any step you do not want to run.
```python
from arbor_process import *
import asyncio
import json

# Load configuration
config = load_config('config.json')

# Step 1: Process metadata
params = config.get('metadata_processor_info', {})
mp = MetadataProcessor(**params)
mp.process_all_files()

# Step 2: Generate shuffled chunks of metadata
params = config.get('metadata_filter_and_shuffle_info', {})
gen_shuffled_chunks = GenShuffledChunks(**params)
gen_shuffled_chunks.process_files()

# Step 3: Download images
params = config.get('image_download_info', {})
gi = GetImages(**params)
asyncio.run(gi.download_images())

# Step 4: Generate text pairs and create tar files (optional)
params = config.get('img_text_gen_info', {})
textgen = GenImgTxtPair(**params)
textgen.create_image_text_pairs()
```
#### 3. Running from the Command Line

Update the `config.json` file with the correct paths and use the following commands to run each step individually.
```shell
# Step 1: Process metadata
python arbor_process/metadata_processor.py --config config.json

# Step 2: Generate shuffled chunks of metadata
python arbor_process/gen_filtered_shuffled_chunks.py --config config.json

# Step 3: Download images
python arbor_process/get_imgs.py --config config.json

# Step 4: Generate text pairs and create tar files (optional)
python arbor_process/gen_img_txt_pair.py --config config.json
```
## Classes and Their Descriptions

### 1. MetadataProcessor

- Description: Processes metadata files in parquet format, filters the metadata based on categories, counts the number of species and categories, and saves the results in CSV files.
- Inputs:
  - `source_folder`: The folder containing the parquet files.
  - `destination_folder`: The folder where the results will be saved.
  - `categories`: A list of categories to filter the metadata.
- Outputs:
  - CSV files containing the counts of species and categories.
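The counting logic can be sketched as follows. This is illustrative only: the real class reads parquet files, while here plain dicts stand in for metadata rows, and the field names are assumptions.

```python
from collections import Counter

# Stand-in metadata rows (field names "category"/"species" are assumed).
rows = [
    {"category": "Aves", "species": "Corvus corax"},
    {"category": "Aves", "species": "Corvus corax"},
    {"category": "Insecta", "species": "Apis mellifera"},
    {"category": "Mammalia", "species": "Canis lupus"},  # not in the filter list
]

categories = ["Aves", "Insecta", "Plantae"]
kept = [r for r in rows if r["category"] in categories]

# Per-category and per-species counts, as written to the CSV outputs.
category_counts = Counter(r["category"] for r in kept)
species_counts = Counter(r["species"] for r in kept)

print(category_counts["Aves"])           # 2
print(species_counts["Corvus corax"])    # 2
```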
### 2. GenShuffledChunks

- Description: Processes data files by filtering rare cases, capping frequent cases, and shuffling the data into specified parts.
- Inputs:
  - `species_count_data`: Path to the species count data file.
  - `directory`: Path to the directory containing the original parquet files.
  - `rare_threshold`: Threshold for rare cases (default: 10).
  - `cap_threshold`: Threshold for frequent cases (default: 1000).
  - `part_size`: Size of each part after shuffling (default: 500).
  - `rare_dir`: Directory to save rare cases (default: 'rare_cases').
  - `cap_filtered_dir_train`: Directory to save capped and filtered cases (default: 'cap_filtered_train').
  - `capped_dir`: Directory to save capped cases (default: 'capped_cases').
  - `merged_dir`: Directory to save merged shuffled files (default: 'merged_cases').
  - `files_per_chunk`: Number of files to merge into a single chunk (default: 5).
  - `random_seed`: Random seed for shuffling (default: 42).
- Outputs:
  - Rare cases, capped cases, and shuffled parts saved in the specified directories.
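A rough sketch of the filter-and-shuffle idea on toy species counts. The exact comparison operators at the thresholds and the record layout are assumptions; the real class works from the species-count CSV and the parquet files.

```python
import random

# Toy per-species record counts (assumed shape).
counts = {"sp_a": 3, "sp_b": 40, "sp_c": 2000}
rare_threshold, cap_threshold = 10, 1000

# Species below the rare threshold are set aside; species over the cap
# are noted and their surviving records truncated to the cap.
rare = [s for s, n in counts.items() if n < rare_threshold]
capped = [s for s, n in counts.items() if n > cap_threshold]
kept = {s: min(n, cap_threshold) for s, n in counts.items() if n >= rare_threshold}

# Deterministic shuffle of the surviving records before chunking.
random.seed(42)
records = [s for s, n in kept.items() for _ in range(n)]
random.shuffle(records)

print(rare)          # ['sp_a']
print(capped)        # ['sp_c']
print(len(records))  # 40 + 1000 = 1040
```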
### 3. GetImages

- Description: Asynchronously downloads images from URLs stored in parquet files.
- Inputs:
  - `input_folder`: Path to the folder containing parquet files.
  - `output_folder`: Path to the folder where images will be saved.
  - `start_index`: Index of the first parquet file to process (default: 0).
  - `end_index`: Index of the last parquet file to process (default: None).
  - `concurrent_downloads`: Number of concurrent downloads (default: 1000).
- Outputs:
  - Images downloaded to `output_folder`.
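The concurrency pattern behind `concurrent_downloads` can be sketched with an `asyncio.Semaphore` that caps the number of in-flight downloads. This is a pattern sketch only, not the class's actual code: `asyncio.sleep` stands in for the real HTTP fetch, and the URLs are made up.

```python
import asyncio

async def download_one(url, sem):
    # The semaphore limits how many coroutines run this body at once.
    async with sem:
        await asyncio.sleep(0.01)  # placeholder for the actual request
        return f"saved {url}"

async def download_all(urls, concurrent_downloads=1000):
    sem = asyncio.Semaphore(concurrent_downloads)
    return await asyncio.gather(*(download_one(u, sem) for u in urls))

urls = [f"https://example.com/img_{i}.jpg" for i in range(5)]
results = asyncio.run(download_all(urls, concurrent_downloads=2))
print(len(results))  # 5
```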
### 4. GenImgTxtPair

- Description: Generates text labels for downloaded images.
- Inputs:
  - `metadata`: Path to the directory containing processed parquet files.
  - `img_folder`: Path to the directory containing downloaded images in subfolders.
  - `output_base_folder`: Path to the directory where the image-text pair data is saved in tar files.
- Outputs:
  - 10 text labels in `.txt` and `.json` format for each image, saved alongside each image.
  - Tar files created from each image-text subfolder.
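The output layout can be sketched with the standard library alone: a `.txt` and `.json` label written next to each image, then the subfolder packed into a tar. File names, label text, and folder layout here are assumptions for illustration; the real class emits 10 label variants per image.

```python
import json
import tarfile
import tempfile
from pathlib import Path

# Build a throwaway image-text subfolder.
root = Path(tempfile.mkdtemp())
sub = root / "chunk_00"
sub.mkdir()
(sub / "img_0001.jpg").write_bytes(b"")              # placeholder image
labels = ["a photo of Corvus corax"]                 # real class: 10 variants
(sub / "img_0001.txt").write_text(labels[0])
(sub / "img_0001.json").write_text(json.dumps(labels))

# Pack the subfolder into a tar, then list its members.
with tarfile.open(root / "chunk_00.tar", "w") as tar:
    tar.add(sub, arcname=sub.name)

with tarfile.open(root / "chunk_00.tar") as tar:
    names = sorted(tar.getnames())
print(names)
```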