An active learning package for experimental design in chemistry and materials science.
Project description
ActiveSampler: An Active Learning Package for Experimental Design in Chemistry and Materials Science
ActiveSampler is a Python package designed to facilitate active learning workflows specifically tailored for experimental design in chemistry and materials science. By intelligently selecting the most informative data points for labeling, ActiveSampler aims to optimize experiments, reduce costs, and accelerate discovery in these fields.
Features
- Model Training and Prediction: Supports both classification and regression tasks using models like Logistic Regression, Random Forest, and XGBoost.
- Uncertainty Calculation: Computes uncertainty for classification using entropy and for regression using variance.
- Objective Function Evaluation: Allows custom objective functions to guide the selection of samples.
- Diversity and Acquisition: Incorporates diversity measures and acquisition functions to balance exploration and exploitation.
- Grid Sampling and Constraints: Generates sampling grids and applies constraints to ensure valid experimental designs.
- Active Learning Selection: Selects the most informative samples to enhance model performance with customizable weights for objective, uncertainty, and diversity.
Installation
To install ActiveSampler, clone the repository and install the dependencies:
git clone https://github.com/yourusername/active_sampler.git # Replace with your repository URL
cd active_sampler
pip install -r requirements.txt
Usage
Example
This is an example input data to select new data points in a LARP synthesis, full data on examples/example1_LARP/input.csv:
ligand_quantity,ligand_ii_quantity,halogen_alloy_quantity,antisolvent_quantity,structural_response
10.0,300,0,3000,1
5.0,300,0,3000,1
...
Here is the code to sample these new points:
from active_sampler import active_sampling, load_and_preprocess_data
# Define the path to your data file
filepath = 'input.csv'
# Specify target columns and their types
target_columns = ['structural_response']
target_types = {
'structural_response': 'classification',
}
num_classes_dict = {
'structural_response': 3
}
# Define the objective function as a string
obj_fn_str = 'structural_response_class_2'
# Load and preprocess data
X, y_dict = load_and_preprocess_data(
filepath,
target_columns,
target_types,
)
# Start active learning selection
active_sampling(
X,
y_dict,
target_types,
obj_fn_str,
num_classes_dict=num_classes_dict,
num_sampling=25,
alpha=0.25, # Objective weight
beta=0.25, # Uncertainty weight
gamma=0.5, # Diversity weight
sufix='LARP',
)
Input Data Format
The input data should be in CSV format:
ligand_quantity,ligand_ii_quantity,halogen_alloy_quantity,antisolvent_quantity,structural_response
10.0,300,0,3000,1
5.0,300,0,3000,1
...
load_and_preprocess_data Function
The load_and_preprocess_data function loads, cleans, and prepares your data. It handles renaming, missing values, removing rows/columns, and splitting data into features (X) and targets (y_dict). See the examples for detailed usage.
Parameters: filepath, target_columns, target_types, column_mapping (optional), categorical_cols (optional), missing_value_strategy (optional), imputation_values (optional), rows_to_remove (optional), columns_to_remove (optional), regex_columns_to_remove (optional).
active_sampling Function Parameters
X: Feature DataFrame.y_dict: Dictionary mapping target names to their Series.target_types: Dictionary mapping target names to 'classification' or 'regression'.obj_fn_str: String defining the objective function. References:- Classification:
target_class_i(e.g.,'structure_type_class_2'). - Regression:
target(e.g.,'contact_angle'). - Normalized Regression:
norm_target(e.g.,norm_contact_angle).
- Classification:
sufix: Suffix for output files.categorical_cols: List of categorical columns.num_classes_dict: Dictionary mapping classification targets to number of classes.initial_train_size: Initial training set size (orNonefor all data).num_sampling: Number of samples to select.alpha,beta,gamma: Weights for objective, uncertainty, and diversity.user_num_grid_points: Custom grid points per numerical variable (int, 'unique', or dict).variable_constraints: Constraints to filter the sampling grid (list of dicts). Each dict hasconditions,assignments, and optionalmutual_constraint.unc_fn_str: Custom formula for combining uncertainties. References:target_unc,norm_target_unc.diversity_settings: Settings for diversity:neighbor_distance_metric(default: 'euclidean'),same_cluster_penalty(default: 0.5),number_of_clusters(default: 'num_sampling').
Output
The active_sampling function generates a .txt file and a .csv file containing the coordinates of the selected samples, sorted by all columns. See the examples folder for detailed output formats.
Examples
The package includes several examples demonstrating different use cases, located in the examples folder. The structure is as follows:
├── README.md
├── active_sampler
│ ├── __init__.py
│ ├── core.py
│ └── utils.py
├── examples
│ ├── example1_LARP
│ │ ├── example1.py
│ │ ├── input.csv
│ │ ├── selected_samples_LARP.csv
│ │ └── selected_samples_LARP.txt
│ ├── example2_PhobicSurfaces
│ │ ├── example2.py
│ │ ├── input.csv
│ │ ├── selected_samples_PhobicSurfaces.csv
│ │ └── selected_samples_PhobicSurfaces.txt
│ ├── example3_BatteryOptimization
│ │ ├── example3.py
│ │ ├── input.csv
│ │ ├── selected_samples_BatteryOptimization.csv
│ │ └── selected_samples_BatteryOptimization.txt
│ └── example4_ProcessingAndConstraints
│ ├── example4.py
│ ├── input.csv
│ ├── selected_samples_LARP_advanced_features.csv
│ └── selected_samples_LARP_advanced_features.txt
Each example folder contains:
example[N].py: The Python script implementing the active learning workflow.input.csv: The input data used for the example.
Pre-generated output files are provided for each example:
selected_samples_[sufix].csv: The CSV file with the selected samples.selected_samples_[sufix].txt: The text file with the selected samples and run information.
Here's a breakdown of each example:
-
example1_LARP: A basic example focused on optimizing a LARP (Ligand-Assisted Reprecipitation) synthesis. It uses a single classification target (structural_response) to predict the structural outcome of the synthesis. -
example2_PhobicSurfaces: This example deals with predicting the contact angle of surfaces, a regression problem. It also demonstrates the use of categorical features (metal_precursor,surface_coating_material). -
example3_BatteryOptimization: A more complex, multi-output example focused on battery material optimization. It involves multiple regression targets (specific capacity, capacity retention, etc.) and uses custom objective and uncertainty functions to guide the selection process. It also uses categorical features. -
example4_ProcessingAndConstraints: This example showcases advanced features like custom grid points (restricting the sampling space for certain variables), variable constraints (ensuring logical relationships between variables), and more detailed data preprocessing options. It uses a combination of classification and regression targets.
Run them directly (e.g., python example1_LARP/example1.py) after ensuring the active_sampler package is installed and the input.csv files are present.
Contributing
Contributions are welcome! Please submit a Pull Request.
License
This project is licensed under the MIT License.
Contact
For questions or issues, please contact [rogeriog.em@gmail.com].
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file active_sampler-0.1.0.tar.gz.
File metadata
- Download URL: active_sampler-0.1.0.tar.gz
- Upload date:
- Size: 21.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
730a5e62f274c3a53c38c979930b882bdbfb50c7e400f89d5925bbf4d0e0878e
|
|
| MD5 |
a20679531eb278c0d7cc0a3cafd49fb1
|
|
| BLAKE2b-256 |
58af91e81cb178150fad3869d6457da0bc14d508a00a7b96be3582090d9595f2
|
File details
Details for the file active_sampler-0.1.0-py3-none-any.whl.
File metadata
- Download URL: active_sampler-0.1.0-py3-none-any.whl
- Upload date:
- Size: 18.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c3e1f0d485bf686d928bccadb32d619cd88353aa67c9637012e03500cbb3d67e
|
|
| MD5 |
65929e9da2f1b74be7a606f7a26be020
|
|
| BLAKE2b-256 |
9d4fbdc34073ab38732380862c25fea7285afc0c642b8cc4bcf22562a4c2f77f
|