LLM Feature Gen
LLM Feature Gen is a Python library for discovering and generating interpretable features from unstructured data with large language models.
It helps you:
- discover human-interpretable features from images, text, tabular data, and video
- turn model outputs into structured JSON artifacts
- generate feature values from raw multimodal inputs for downstream models
- export per-class CSVs that are ready for analysis or modeling
Quickstart
This quickstart takes you from install to a first output with as little setup as possible:
pip install llm-feature-gen
Create a .env file in your working directory:
OPENAI_API_KEY=your_api_key
OPENAI_MODEL=gpt-4.1-mini
OPENAI_AUDIO_MODEL=whisper-1
python3 - <<'PY'
from pathlib import Path
from llm_feature_gen.discover import discover_features_from_texts
from llm_feature_gen.generate import generate_features_from_texts
samples = {
    "demo_discover_texts/sample1.txt": "The dish was rich, spicy, and served in a deep bowl.",
    "demo_discover_texts/sample2.txt": "The dessert was light, creamy, and topped with fresh fruit.",
    "demo_texts/positive/review1.txt": "The meal was vibrant, aromatic, and beautifully plated.",
    "demo_texts/negative/review1.txt": "The service was slow and the food arrived cold.",
}
for file_name, text in samples.items():
    path = Path(file_name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
discovered = discover_features_from_texts("demo_discover_texts")
csv_paths = generate_features_from_texts(
    root_folder="demo_texts",
    merge_to_single_csv=True,
)
print(discovered)
print(csv_paths)
PY
This creates outputs/discovered_text_features.json, one CSV per class folder, and outputs/all_feature_values.csv.
If you want the fuller walkthrough, including provider switching and other modalities, see the tutorial notebook. If you are working from a repository checkout and want to use editable installs or the bundled sample folders, see the development setup below.
If you want one polished, citeable example that runs end to end from raw inputs to a downstream classifier, see examples/text_to_tabular_pipeline.py and the accompanying examples/README.md. It defaults to the real configured provider stack and also includes an offline replay mode for reproducible tests.
How It Works
The library supports a two-step workflow:
- Discover features from a dataset and save them as JSON in outputs/.
- Generate feature values for each file or row using the discovered feature schema.
Supported Inputs
Discovery
- Images: .jpg, .jpeg, .png
- Text: .txt, .md, .pdf, .docx, .html
- Tabular: .csv, .xlsx, .xls, .parquet, .json
- Video: .mp4, .mov, .avi, .mkv
Generation
- Images, text, tabular files, and videos are supported through the same folder-based pipeline.
- Generation expects a root folder with one subfolder per class, for example images/hotpot/ and images/vase/.
Optional Parser Dependencies
The base install covers the core package, but some formats need extra packages at runtime:
- .pdf: pypdf
- .docx: python-docx
- .html: beautifulsoup4
- .xlsx: openpyxl
- .xls: xlrd
- .parquet: pyarrow or fastparquet
For video audio extraction, you also need the ffmpeg system binary available on your machine.
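Because missing parsers and a missing ffmpeg binary only surface at runtime, it can help to check for them up front. A minimal sketch using only the standard library (the package names match the list above; the helper functions themselves are illustrative, not part of llm-feature-gen):

```python
import importlib.util
import shutil

# Map of file extensions to the importable modules that parse them
# (matching the optional dependencies listed above).
PARSER_MODULES = {
    ".pdf": ["pypdf"],
    ".docx": ["docx"],   # installed as python-docx
    ".html": ["bs4"],    # installed as beautifulsoup4
    ".xlsx": ["openpyxl"],
    ".xls": ["xlrd"],
    ".parquet": ["pyarrow", "fastparquet"],  # either one works
}

def missing_parsers(extensions):
    """Return the extensions whose parser modules are not importable."""
    missing = []
    for ext in extensions:
        modules = PARSER_MODULES.get(ext, [])
        if modules and not any(importlib.util.find_spec(m) for m in modules):
            missing.append(ext)
    return missing

def has_ffmpeg():
    """True if the ffmpeg binary is on PATH (needed for video audio extraction)."""
    return shutil.which("ffmpeg") is not None

print(missing_parsers([".pdf", ".txt"]))
print(has_ffmpeg())
```

Running this before a large batch job avoids discovering a missing dependency halfway through.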
Project Structure
llm-feature-gen/
├─ src/
│ ├─ llm_feature_gen/
│ │ ├─ __init__.py
│ │ ├─ discover.py
│ │ ├─ generate.py
│ │ ├─ providers/
│ │ │ ├─ local_provider.py
│ │ │ └─ openai_provider.py
│ │ ├─ prompts/
│ │ │ ├─ image_discovery_prompt.txt
│ │ │ ├─ image_generation_prompt.txt
│ │ │ ├─ text_discovery_prompt.txt
│ │ │ └─ text_generation_prompt.txt
│ │ └─ utils/
│ │ ├─ image.py
│ │ ├─ text.py
│ │ └─ video.py
│ └─ tests/
│ ├─ conftest.py
│ ├─ test_discover_more.py
│ ├─ test_discovery.py
│ ├─ test_generation.py
│ ├─ test_providers.py
│ └─ test_utils_and_prompts.py
├─ outputs/
├─ pyproject.toml
├─ tutorial.ipynb
└─ README.md
Installation
Install from PyPI:
pip install llm-feature-gen
Supported Python versions and operating systems are documented in SUPPORT.md.
Development
If you are working in this repository, use an editable install:
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
If you need non-core document or tabular formats:
pip install pypdf python-docx beautifulsoup4 openpyxl xlrd pyarrow
Environment Setup
Create a .env file in the directory where you run the library.
OpenAI API
OPENAI_API_KEY=your_api_key
OPENAI_MODEL=your_model_name
OPENAI_AUDIO_MODEL=whisper-1
Azure OpenAI
AZURE_OPENAI_API_KEY=your_api_key
AZURE_OPENAI_API_VERSION=your_api_version
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_GPT41_DEPLOYMENT_NAME=your_chat_deployment
AZURE_OPENAI_WHISPER_DEPLOYMENT=your_audio_deployment
If AZURE_OPENAI_ENDPOINT is set, the provider automatically uses Azure OpenAI. Otherwise it falls back to the standard OpenAI API.
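The endpoint-based fallback can be sketched as follows (the environment variable name is the one documented above; the selection helper itself is illustrative, not the library's actual code):

```python
import os

def choose_backend(env=None):
    """Pick Azure OpenAI when an endpoint is configured, else standard OpenAI."""
    if env is None:
        env = os.environ
    if env.get("AZURE_OPENAI_ENDPOINT"):
        return "azure"
    return "openai"

# With AZURE_OPENAI_ENDPOINT set, Azure wins; without it, standard OpenAI is used.
print(choose_backend({"AZURE_OPENAI_ENDPOINT": "https://x.openai.azure.com/"}))  # azure
print(choose_backend({"OPENAI_API_KEY": "sk-your-key"}))                         # openai
```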
LocalProvider
LocalProvider supports OpenAI-compatible local servers such as Ollama, vLLM, and LM Studio.
LOCAL_OPENAI_BASE_URL=http://localhost:11434/v1
LOCAL_OPENAI_API_KEY=ollama
LOCAL_MODEL_TEXT=llama3
LOCAL_MODEL_VISION=llava
LOCAL_WHISPER_MODEL_SIZE=base
LOCAL_WHISPER_DEVICE=cpu
Use it by passing an explicit provider instance:
from llm_feature_gen.discover import discover_features_from_texts
from llm_feature_gen.providers.local_provider import LocalProvider
provider = LocalProvider()
result = discover_features_from_texts(
    texts_or_file="discover_texts",
    provider=provider,
)
For local video transcription, install faster-whisper. If you do not need audio, set use_audio=False in video discovery or generation instead.
Input Layout and Outputs
Discovery inputs
- discover_features_from_texts accepts a raw string, a list of raw strings, a single file, or a folder of supported text documents.
- The other discover_features_from_* helpers accept a single file, a folder, or a list of raw file paths.
- Discovery defaults to as_set=True, so folder-based discovery compares the full batch together and usually writes one shared feature schema JSON file.
- The default discovery outputs are:
  - outputs/discovered_image_features.json
  - outputs/discovered_text_features.json
  - outputs/discovered_tabular_features.json
  - outputs/discovered_video_features.json
Generation inputs
- Generation expects a root folder with one subfolder per class, such as images/hotpot/ and images/vase/.
- If you do not pass classes=..., class names are inferred from those subfolder names.
- Tabular generation reads one row at a time from text_column and can optionally use label_column to override the class written to the CSV.
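For tabular generation, that means an input file needs at least a text column and, optionally, a label column. A minimal sketch building such a CSV with the standard library (the file name and column names here are illustrative):

```python
import csv

# Two rows with a text column and an optional label column.
rows = [
    {"text": "The meal was vibrant and aromatic.", "label": "positive"},
    {"text": "The food arrived cold.", "label": "negative"},
]

with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "label"])
    writer.writeheader()
    writer.writerows(rows)

# text_column="text" and label_column="label" would then point at these columns.
```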
Generation outputs
- Generation writes one CSV per class to outputs/.
- If merge_to_single_csv=True, it also writes outputs/all_feature_values.csv.
- Each generated CSV includes File, Class, one column per discovered feature, and raw_llm_output so you can inspect the original provider response.
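Those CSVs can be inspected with any CSV reader. A sketch assuming the documented column layout, with a tiny stand-in file (the "color tone" feature column and its values are made up for illustration):

```python
import csv

# A stand-in for outputs/all_feature_values.csv using the documented
# layout: File, Class, one column per feature, raw_llm_output.
header = ["File", "Class", "color tone", "raw_llm_output"]
rows = [
    ["hotpot/img1.jpg", "hotpot", "earthy", '{"color tone": "earthy"}'],
    ["vase/img1.jpg", "vase", "metallic", '{"color tone": "metallic"}'],
]
with open("all_feature_values.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([header] + rows)

# Group one feature column by class for a quick sanity check.
with open("all_feature_values.csv", newline="", encoding="utf-8") as f:
    by_class = {}
    for row in csv.DictReader(f):
        by_class.setdefault(row["Class"], []).append(row["color tone"])
print(by_class)  # {'hotpot': ['earthy'], 'vase': ['metallic']}
```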
Discovery Examples
Discover Features from Images
from llm_feature_gen.discover import discover_features_from_images
result = discover_features_from_images(
    image_paths_or_folder="discover_images",
    as_set=True,
)
print(result)
This reads all supported images in discover_images/, sends them as a joint set to the provider, and saves the result to outputs/discovered_image_features.json.
Example output:
{
  "proposed_features": [
    {
      "feature": "has visible handle",
      "description": "Some objects include handles, while others do not.",
      "possible_values": ["present", "absent"]
    },
    {
      "feature": "color tone",
      "description": "Objects vary between metallic, earthy, and bright palettes.",
      "possible_values": ["metallic", "earthy", "bright", "dark"]
    }
  ]
}
Discover Features from Text
from llm_feature_gen.discover import discover_features_from_texts
result = discover_features_from_texts(
    texts_or_file="discover_texts",
    as_set=True,
)
print(result)
This loads all supported text documents in discover_texts/, extracts raw text, and saves the result to outputs/discovered_text_features.json.
If you already have text in memory, you can also pass it directly:
result = discover_features_from_texts(
    "The dish was smoky, rich, and served family-style.",
    as_set=True,
)
Discover Features from Tabular Data
from llm_feature_gen.discover import discover_features_from_tabular
result = discover_features_from_tabular(
    file_or_folder="discover_tabular",
    text_column="text",
    as_set=True,
)
print(result)
This loads supported tabular files, reads the text column, and saves the result to outputs/discovered_tabular_features.json.
Example output:
{
  "proposed_features": [
    {
      "feature": "overall sentiment",
      "description": "Rows differ in whether they express favorable or unfavorable opinions.",
      "possible_values": ["positive", "negative", "mixed"]
    },
    {
      "feature": "focus of the review",
      "description": "Some rows focus on performance, others on plot, visuals, or general quality.",
      "possible_values": ["performance", "plot", "visuals", "general quality"]
    }
  ]
}
Discover Features from Videos
from llm_feature_gen.discover import discover_features_from_videos
result = discover_features_from_videos(
    videos_or_folder="discover_videos",
    as_set=True,
    num_frames=5,
    use_audio=True,
    random_seed=7,
)
print(result)
This extracts key frames, optionally transcribes audio, and saves the result to outputs/discovered_video_features.json.
When a folder contains more than max_videos_to_sample videos, the helper samples a subset before frame extraction. Pass random_seed if you want that subset to be reproducible. With as_set=False, the return value contains one result per extracted frame after pooling frames across all sampled videos.
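Seeded subsampling of this kind can be illustrated with the standard library (max_videos_to_sample and the file names are placeholders; this is not the library's internal code):

```python
import random

videos = [f"clip_{i}.mp4" for i in range(10)]
max_videos_to_sample = 4

def sample_videos(paths, k, random_seed=None):
    """Sample k paths; the same seed always yields the same subset."""
    if len(paths) <= k:
        return list(paths)
    return random.Random(random_seed).sample(paths, k)

a = sample_videos(videos, max_videos_to_sample, random_seed=7)
b = sample_videos(videos, max_videos_to_sample, random_seed=7)
print(a == b)  # True: same seed, same subset
```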
Generation Example
After discovery, you can generate feature values for each class folder.
from llm_feature_gen.generate import generate_features_from_images
csv_paths = generate_features_from_images(
    root_folder="images",
    discovered_features_path="outputs/discovered_image_features.json",
    merge_to_single_csv=True,
)
print(csv_paths)
With a folder layout like this:
images/
├─ hotpot/
└─ vase/
the command writes per-class CSVs such as outputs/hotpot_feature_values.csv and outputs/vase_feature_values.csv. If merge_to_single_csv=True, it also creates outputs/all_feature_values.csv.
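Conceptually, the merge step concatenates the per-class CSVs while keeping a single header row. A sketch with the standard library (the file contents here are illustrative, and this is not the library's own merging code):

```python
import csv

# Two per-class CSVs with the same header (illustrative contents).
per_class = {
    "hotpot_feature_values.csv": [["File", "Class"], ["a.jpg", "hotpot"]],
    "vase_feature_values.csv": [["File", "Class"], ["b.jpg", "vase"]],
}
for name, rows in per_class.items():
    with open(name, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

# Concatenate, keeping the header from the first file only.
merged = []
for i, name in enumerate(sorted(per_class)):
    with open(name, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    merged.extend(rows if i == 0 else rows[1:])

with open("all_feature_values.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(merged)
print(len(merged))  # 3: one header plus two data rows
```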
The same workflow is available for other modalities:
from llm_feature_gen.generate import (
generate_features_from_images,
generate_features_from_tabular,
generate_features_from_texts,
generate_features_from_videos,
)
Running Tests
From the repository root:
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
pytest
Useful commands:
pytest -vv
pytest src/tests/test_discovery.py
Tests use fake providers and temporary directories, so they do not require OpenAI or Azure credentials.
Contributing and Documentation
If you want to contribute or need project maintenance details, start here:
- CONTRIBUTING.md for local setup, test workflow, pull request expectations, and issue-reporting guidance
- CHANGELOG.md for user-visible changes and the current release history
- GitHub issue templates under .github/ISSUE_TEMPLATE/ for bug reports and feature requests