Feature discovery and generation utilities

Project description

LLM_feature_gen

LLM Feature Gen is a Python library for discovering and generating interpretable features from unstructured data using Large Language Models (LLMs).
The library provides high-level utilities for:

Discovering human-interpretable features from sets of images,
Integrating prompts and model outputs into structured JSON representations,
Generating new feature representations automatically from raw multimodal data, e.g., creating structured tables for downstream models.

Module: `discover`

The `discover` module focuses on feature discovery: identifying interpretable, discriminative visual or textual properties using an LLM.

Supported Data Types

Images (.jpg, .png)
Text documents (.txt, .pdf, .docx, .md, .html)
Tabular datasets (.csv, .xlsx, .parquet, .json)
Videos (.mp4)

✅ What it does

Given a folder of images and a prompt, the library:

Converts each image into Base64 format,
Sends them to an LLM,
Receives a structured JSON response describing the discovered features,
Automatically saves the output to a JSON file in outputs/.
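
The first and last steps above can be sketched with the standard library alone. This is a minimal illustration, not the library's internal code: the helper names `image_to_base64` and `save_features` are hypothetical, the output file name is arbitrary, and the LLM call in between is left out.

```python
import base64
import json
import tempfile
from pathlib import Path


def image_to_base64(path: Path) -> str:
    """Read a file's bytes and return them as an ASCII Base64 string."""
    return base64.b64encode(path.read_bytes()).decode("ascii")


def save_features(features: dict, out_dir: Path, name: str) -> Path:
    """Write the feature dictionary as indented JSON and return the file path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / name
    out_path.write_text(json.dumps(features, indent=2), encoding="utf-8")
    return out_path


# Demonstration in a temporary directory, using stand-in bytes instead of a real image.
with tempfile.TemporaryDirectory() as tmp:
    fake_image = Path(tmp) / "example.png"
    fake_image.write_bytes(b"\x89PNG not a real image")
    encoded = image_to_base64(fake_image)
    decoded_back = base64.b64decode(encoded)

    saved = save_features({"proposed_features": []}, Path(tmp) / "outputs", "features.json")
    saved_text = saved.read_text(encoding="utf-8")
```

Base64 keeps the payload as plain text, which is what most LLM HTTP APIs expect for inline images.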

📂 Project Structure

LLM_feature_gen/
├─ src/
│  └─ LLM_feature_gen/
│     ├─ __init__.py
│     ├─ discover.py                # High-level orchestration for feature discovery
│     ├─ generate.py                # Feature value generation
│     ├─ providers/
│     │   ├─ openai_provider.py     # OpenAI / Azure OpenAI API wrapper
│     │   └─ local_provider.py      # Local LLM wrapper
│     ├─ prompts/
│     │   ├─ image_discovery_prompt.txt
│     │   ├─ text_discovery_prompt.txt
│     │   ├─ image_generation_prompt.txt
│     │   └─ text_generation_prompt.txt
│     ├─ utils/
│     │   ├─ image.py               # Image → base64 conversion
│     │   ├─ video.py               # Video frame and audio extraction
│     │   └─ text.py                # Text extraction (txt, pdf, docx, etc.)
│     └─ tests/
│        └─ test_discover.py
├─ outputs/                         # Automatically generated feature JSONs
├─ pyproject.toml
└─ README.md

⚙️ Installation

Clone or download the repository, then install in editable mode:

pip install -e .

🧪 Running Tests

The project uses pytest. You don’t need external services (no network calls are made during tests), and heavy video tooling is stubbed out.

Quick start from the repository root (Windows PowerShell shown, works similarly on macOS/Linux):

# 1) (Recommended) Create and activate a virtual environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1   # On macOS/Linux: source .venv/bin/activate

# 2) Install the package in editable mode
pip install -e .

# 3) Install test runner
pip install -U pytest

# 4) Run the test suite
python -m pytest -q

Useful commands:

Run a single test file:

python -m pytest -q src\LLM_feature_gen\tests\test_discover.py

Run tests with verbose output:
```
python -m pytest -vv
```

Notes:

Tests create and use temporary directories; they do not modify your repository files.
Video-related utilities are monkeypatched/stubbed in tests, so ffmpeg is not required to run the suite.
Environment variables for Azure OpenAI are not required for tests because a fake provider is used.
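
As a rough illustration of the pattern these notes describe, a test can pair a fake provider with a temporary directory. Everything named here (`FakeProvider`, its `complete` method, and `discover_with`) is hypothetical and defined inside the sketch; the project's real test doubles and function signatures may differ.

```python
import json
import tempfile
from pathlib import Path


class FakeProvider:
    """Stand-in for an LLM provider: returns a canned JSON string, no network."""

    def __init__(self):
        self.calls = 0

    def complete(self, prompt: str) -> str:  # hypothetical method name
        self.calls += 1
        return json.dumps({"proposed_features": [{"feature": "stub"}]})


def discover_with(provider, prompt: str, out_dir: Path) -> dict:
    """Hypothetical function under test: query the provider and save the parsed result."""
    result = json.loads(provider.complete(prompt))
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "features.json").write_text(json.dumps(result), encoding="utf-8")
    return result


def test_discover_uses_fake_provider():
    provider = FakeProvider()
    # A temporary directory keeps the test from touching repository files.
    with tempfile.TemporaryDirectory() as tmp:
        result = discover_with(provider, "describe", Path(tmp))
        assert (Path(tmp) / "features.json").exists()
    assert provider.calls == 1
    assert result["proposed_features"][0]["feature"] == "stub"


# pytest would collect the function above; calling it directly also works.
test_discover_uses_fake_provider()
```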

🔑 Environment Setup for OpenAI API

Create a .env file in the project root to hold your OpenAI / Azure OpenAI credentials.
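
The variable names the library reads are not listed in this section, so the names in the sketch below are placeholders, not the real ones. A .env file is a list of KEY=VALUE lines; the `python-dotenv` package is a common way to load one, but to stay dependency-free this sketch parses the lines by hand with a hypothetical `parse_env` helper.

```python
import os


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Placeholder names and values: replace them with whatever your provider requires.
example = """
# credentials for the LLM provider
EXAMPLE_API_KEY="sk-placeholder"
EXAMPLE_ENDPOINT=https://example.invalid/
"""

settings = parse_env(example)
# Only set variables that are not already defined in the real environment.
for key, value in settings.items():
    os.environ.setdefault(key, value)
```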

Example: Discover Features from Images

from LLM_feature_gen.discover import discover_features_from_images

# Folder with your example images
image_folder = "discover_images"

# Run feature discovery
result = discover_features_from_images(
    image_paths_or_folder=image_folder,
    as_set=True,  # analyze all images jointly
)

print(result)

This will:

Read all .jpg/.png images from discover_images/,
Use the default prompt (prompts/image_discovery_prompt.txt),
Send them to your LLM provider,
Save the results to outputs/discovered_image_features.json.

Example saved JSON:

{
  "proposed_features": [
    {
      "feature": "has visible handle",
      "description": "Some objects include handles, others do not.",
      "possible_values": ["present", "absent"]
    },
    {
      "feature": "color tone",
      "description": "Images vary between metallic and earthy color palettes.",
      "possible_values": ["metallic", "matte", "bright", "dark"]
    }
  ]
}
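
Once saved, the file can be consumed with the standard `json` module. The sketch below embeds the same JSON as a string so it runs on its own; in practice you would read outputs/discovered_image_features.json from disk instead. The `allowed` lookup is just one possible way to organize the result.

```python
import json

# Same shape as the example above, inlined so the sketch is self-contained.
saved = """
{
  "proposed_features": [
    {"feature": "has visible handle",
     "description": "Some objects include handles, others do not.",
     "possible_values": ["present", "absent"]},
    {"feature": "color tone",
     "description": "Images vary between metallic and earthy color palettes.",
     "possible_values": ["metallic", "matte", "bright", "dark"]}
  ]
}
"""

data = json.loads(saved)
# Map each feature name to its allowed values for quick lookup.
allowed = {f["feature"]: f["possible_values"] for f in data["proposed_features"]}

for name, values in allowed.items():
    print(f"{name}: {', '.join(values)}")
```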

Example: Discover Features from Texts

from LLM_feature_gen.discover import discover_features_from_texts

# Folder with text documents (txt, pdf, docx, md, html)
text_folder = "discover_texts"

# Run feature discovery
result = discover_features_from_texts(
    texts_or_file=text_folder,
    as_set=True,  # analyze all texts jointly
)

print(result)

This will:

Load all supported text files from discover_texts/,
Extract raw text automatically,
Use the default text discovery prompt,
Send them to your LLM provider,
Save the results to outputs/discovered_text_features.json.
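
The first step, collecting supported files, might look roughly like this. The helper `list_text_files` is hypothetical, and whether the library searches subfolders is not stated here, so the sketch only scans the top level of the folder.

```python
import tempfile
from pathlib import Path

# Extensions listed under "Supported Data Types" for text documents.
TEXT_EXTENSIONS = {".txt", ".pdf", ".docx", ".md", ".html"}


def list_text_files(folder: Path) -> list:
    """Return supported files in the folder, sorted by path (non-recursive)."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in TEXT_EXTENSIONS
    )


# Demonstration with throwaway files; notes.csv is skipped as unsupported.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("b.md", "a.TXT", "notes.csv", "c.html"):
        (folder / name).write_text("sample", encoding="utf-8")
    found = [p.name for p in list_text_files(folder)]
```

Lower-casing the suffix means files such as `a.TXT` are still picked up.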

Example saved JSON:

{
  "proposed_features": [
    {
      "feature": "presence_of_personal_experience",
      "description": "Some texts describe personal experiences or reflections, while others are more impersonal or instructional.",
      "possible_values": ["present", "absent"]
    },
    {
      "feature": "level_of_subjectivity",
      "description": "Texts vary in how subjective or opinion-based they are compared to neutral or factual descriptions.",
      "possible_values": ["highly subjective", "moderately subjective", "objective"]
    },
    {
      "feature": "use_of_first_person_perspective",
      "description": "Some texts use first-person pronouns indicating a personal perspective, while others do not.",
      "possible_values": ["first person", "third person or impersonal"]
    },
    {
      "feature": "presence_of_explicit_goal_or_intent",
      "description": "Texts may explicitly state an intended goal, motivation, or purpose behind actions or descriptions.",
      "possible_values": ["goal stated", "goal not stated"]
    }
  ]
}
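
One possible downstream use of this schema is checking that values assigned to a document stay within each feature's `possible_values`. The `validate_row` helper below is a hypothetical sketch, not part of the library, and uses two of the features from the example above.

```python
def validate_row(row: dict, features: list) -> list:
    """Return a list of problems: missing features or values outside possible_values."""
    problems = []
    for feat in features:
        name = feat["feature"]
        if name not in row:
            problems.append(f"missing: {name}")
        elif row[name] not in feat["possible_values"]:
            problems.append(f"unexpected value for {name}: {row[name]!r}")
    return problems


features = [
    {"feature": "level_of_subjectivity",
     "possible_values": ["highly subjective", "moderately subjective", "objective"]},
    {"feature": "use_of_first_person_perspective",
     "possible_values": ["first person", "third person or impersonal"]},
]

good = {"level_of_subjectivity": "objective",
        "use_of_first_person_perspective": "first person"}
bad = {"level_of_subjectivity": "neutral"}

good_problems = validate_row(good, features)
bad_problems = validate_row(bad, features)
```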

Example: Discover Features from Tabular Data

from LLM_feature_gen.discover import discover_features_from_tabular

# Folder with tabular files (.csv, .xlsx, .parquet, .json)
tabular_folder = "discover_tabular"

# Run feature discovery
result = discover_features_from_tabular(
    texts_or_file=tabular_folder,
    as_set=True,  # analyze all texts jointly
    text_column="text",   # required: column containing raw text
)

print(result)

This will:

Load all supported tabular files from the folder discover_tabular/,
Extract the specified text_column,
Apply the standard text discovery prompt,
Save the output to outputs/discovered_tabular_features.json.
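
For a CSV input, the column-extraction step can be approximated with the standard `csv` module. This is only a sketch under assumptions: `extract_text_column` is a hypothetical helper, and how the library handles empty cells or the other tabular formats is not described here.

```python
import csv
import io


def extract_text_column(csv_text: str, text_column: str) -> list:
    """Return the non-empty values of text_column from CSV content."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if text_column not in (reader.fieldnames or []):
        raise KeyError(f"column {text_column!r} not found")
    return [row[text_column] for row in reader if row[text_column]]


# Row 2 has an empty text cell and is skipped.
sample = 'id,text\n1,"Great acting, weak plot."\n2,\n3,Loved the soundtrack.\n'
texts = extract_text_column(sample, "text")
```

Failing early on a missing column gives a clearer error than discovering an empty input after the LLM call.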

Example saved JSON:

{
    "proposed_features": [
      {
        "feature": "overall sentiment",
        "description": "The texts differ in expressing positive or negative feelings about the subject, which can separate favorable from unfavorable opinions.",
        "possible_values": [
          "positive",
          "negative"
        ]
      },
      {
        "feature": "focus on emotional impact",
        "description": "Some texts emphasize emotional responses or feelings evoked, distinguishing those that highlight emotional engagement from those that do not.",
        "possible_values": [
          "emotional emphasis",
          "neutral or critical tone"
        ]
      },
      {
        "feature": "mention of specific artistic elements",
        "description": "Certain texts reference particular components like acting, soundtrack, or visuals, which can differentiate detailed critiques from more general statements.",
        "possible_values": [
          "acting",
          "story/plot",
          "soundtrack",
          "visuals",
          "dialogue",
          "character development",
          "none"
        ]
      }
    ]
}

Release history

0.1.12 (Apr 19, 2026)
0.1.11 (Apr 9, 2026)
0.1.10 (Apr 9, 2026)
0.1.9 (Apr 9, 2026)
0.1.8 (Mar 10, 2026)
0.1.7 (Mar 10, 2026)
0.1.6 (Mar 1, 2026)
0.1.4 (Mar 1, 2026)
0.1.2 (Mar 1, 2026), this version
0.1.1 (Feb 5, 2026)
0.1.0 (Jan 8, 2026)

Download files

Source Distribution

llm_feature_gen-0.1.2.tar.gz (26.2 kB)

Uploaded Mar 1, 2026 Source

Built Distribution

llm_feature_gen-0.1.2-py3-none-any.whl (25.4 kB)

Uploaded Mar 1, 2026 Python 3

File details

Details for the file llm_feature_gen-0.1.2.tar.gz.

File metadata

Download URL: llm_feature_gen-0.1.2.tar.gz
Upload date: Mar 1, 2026
Size: 26.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for llm_feature_gen-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`0c651e1c9ac66ba9c16b9874ff8abb35bf05239bfc1b30f4ec85b8366abeb98f`
MD5	`febfff7549e7aacd420e6367a0c5f187`
BLAKE2b-256	`316d5f5a35515e96782bdc63d7d4e46fb4975516a5dfaac2b295079843a18981`

File details

Details for the file llm_feature_gen-0.1.2-py3-none-any.whl.

File metadata

Download URL: llm_feature_gen-0.1.2-py3-none-any.whl
Upload date: Mar 1, 2026
Size: 25.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for llm_feature_gen-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5e812e99f16b70b53778ce893f6b3cf267e00faefa337d9cb6e1e22d3f340d94`
MD5	`fd7a568c6b7b3263cea9d75b2d39cf4a`
BLAKE2b-256	`08564ce9a724049e18ea329e808a3f3b6a1194791dec9d66e30f1710ccb3fcd2`
