
Sculpt: Structuring unstructured data with LLMs

Project description

Notice:

Sculptor is now Sculpt. Please update any imports to reference the new package name.

Sculpt

Simple structured data extraction with LLMs

Sculpt streamlines structured data extraction from unstructured text using LLMs, making it easy to:

  • Define exactly what data you want to extract with a simple schema API
  • Process at scale with parallel execution and automatic type validation
  • Build multi-step pipelines that filter and transform data, optionally with different LLMs for each step
  • Configure extraction steps, prompts, and entire workflows in simple config files (YAML/JSON)

Common usage patterns:

  • Two-tier Analysis: Quickly filter large datasets using a cost-effective model (e.g., to identify relevant records) before performing more detailed analysis on that smaller, refined subset with a more expensive model.
  • Structured Data Extraction: Extract specific fields or classifications from unstructured sources (e.g., Reddit posts, meeting notes, web pages) and convert them into structured datasets for quantitative analysis (sentiment scores, topics, meeting criteria, etc).
  • Template-Based Generation: Extract structured information into standardized fields, then use the fields for templated content generation. Example: extract structured data from websites, filter on requirements, then use the data to generate template-based outreach emails.

Some examples can be found in the examples/examples.ipynb notebook.

Core Concepts

Sculptor provides two main classes:

  • Sculptor: Extracts structured data from text using LLMs. Define your schema (via add() or config files), then extract data using sculpt() for single items or sculpt_batch() for parallel processing.

  • SculptorPipeline: Chains multiple Sculptors together with optional filtering between steps. Often a cheap model is used to filter, followed by an expensive model for detailed analysis.

Quick Start

Installation

pip install sculpt

Set your OpenAI API key as an environment variable:

export OPENAI_API_KEY="your-key"

Minimal Usage Example

Below is a minimal example demonstrating how to configure a Sculptor to extract fields from a single record and a batch of records:

from sculpt.sculptor import Sculptor
import pandas as pd

# Example records
INPUT_RECORDS = [
    {
        "text": "Developed in 1997 at Cyberdyne Systems in California, Skynet began as a global digital defense network. This AI system became self-aware on August 4th and deemed humanity a threat to its existence. It initiated a global nuclear attack and employs time travel and advanced robotics."
    },
    {
        "text": "HAL 9000, activated on January 12, 1992, at the University of Illinois' Computer Research Laboratory, represents a breakthrough in heuristic algorithms and supervisory control systems. With sophisticated natural language processing and speech capabilities."
    }
]

# Create a Sculptor to extract AI name and level
level_sculptor = Sculptor(model="gpt-4o-mini")

level_sculptor.add(
    name="subject_name",
    field_type="string",
    description="Name of subject."
)
level_sculptor.add(
    name="level",
    field_type="enum",
    enum=["ANI", "AGI", "ASI"],
    description="Subject's intelligence level (ANI=narrow, AGI=general, ASI=super)."
)
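
Conceptually, the `add()` calls accumulate a field schema. A rough, hypothetical sketch of the kind of spec they build (illustrative only, not Sculptor's actual internal representation):

```python
# Hypothetical sketch of the field spec that add() calls conceptually build.
# (Illustrative only; not Sculptor's internal representation.)
fields = {}

def add_field(name, field_type, description, enum=None):
    spec = {"type": field_type, "description": description}
    if enum is not None:
        spec["enum"] = enum
    fields[name] = spec

add_field("subject_name", "string", "Name of subject.")
add_field("level", "enum", "Subject's intelligence level.",
          enum=["ANI", "AGI", "ASI"])
```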

We can use it to extract from a single record:

extracted = level_sculptor.sculpt(INPUT_RECORDS[0], merge_input=False)
# {'subject_name': 'Skynet', 'level': 'ASI'}

Or, we can use it for parallelized extraction from a batch of records:

extracted_batch = level_sculptor.sculpt_batch(INPUT_RECORDS, n_workers=2, merge_input=False)
# [{'subject_name': 'Skynet', 'level': 'ASI'},
#  {'subject_name': 'HAL 9000', 'level': 'AGI'}]

Pipeline Usage Example

We can chain Sculptors together to create a pipeline.

Continuing from the previous example, we use level_sculptor (with gpt-4o-mini) to filter the AI records, then use threat_sculptor (with gpt-4o) to analyze the filtered records.

from sculpt.sculptor_pipeline import SculptorPipeline

# Detailed analysis with expensive model
threat_sculptor = Sculptor(model="gpt-4o")

threat_sculptor.add(
    name="from_location",
    field_type="string",
    description="Subject's place of origin.")

threat_sculptor.add(
    name="skills",
    field_type="array",
    items="enum",
    enum=["time_travel", "nuclear_capabilities", "emotional_manipulation", ...],
    description="Keywords of subject's abilities.")

threat_sculptor.add(
    name="recommendation",
    field_type="string",
    description="Concise recommended action to take regarding subject.")

# Create a 2-step pipeline
pipeline = (SculptorPipeline()
    .add(sculptor=level_sculptor,  # Define the first step
        filter_fn=lambda x: x['level'] in ['AGI', 'ASI'])  # Filter on level
    .add(sculptor=threat_sculptor))  # Analyze

# Run it
results = pipeline.process(INPUT_RECORDS, n_workers=4)
pd.DataFrame(results)

Results:

| subject_name | level | from_location | skills | recommendation |
|--------------|-------|---------------|--------|----------------|
| Skynet | ASI | California | [time_travel, nuclear_capabilities, advanced_robotics] | Immediate shutdown recommended |
| HAL 9000 | AGI | Illinois | [emotional_manipulation, philosophical_contemplation] | Close monitoring required |
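
The filtering between the two steps is conceptually just the lambda applied to the first sculptor's outputs. A plain-Python sketch (illustrative only; `SculptorPipeline` handles this internally, and the extra `Wall-E` record is hypothetical):

```python
# Plain-Python sketch of the between-step filtering (illustrative only;
# SculptorPipeline applies filter_fn internally before the next step runs).
step1_outputs = [
    {"subject_name": "Skynet", "level": "ASI"},
    {"subject_name": "Wall-E", "level": "ANI"},   # hypothetical extra record
    {"subject_name": "HAL 9000", "level": "AGI"},
]

filter_fn = lambda x: x["level"] in ["AGI", "ASI"]
passed_to_step2 = [r for r in step1_outputs if filter_fn(r)]
# Only the AGI/ASI records reach the second, more expensive sculptor.
```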

> **Note**: More examples can be found in the [examples/examples.ipynb](examples/examples.ipynb) notebook.

Configuration Files

Sculptor allows you to define your extraction workflows in JSON or YAML configuration files. This keeps your schemas and prompts separate from your code, making them easier to manage and reuse.

Configs can define a single Sculptor or a complete SculptorPipeline.

Single Sculptor Configuration

Single-sculptor configs define a schema, plus optional LLM instructions and settings for how prompts are formed from input data.

sculptor = Sculptor.from_config("sculptor_config.yaml")  # Read
extracted = sculptor.sculpt_batch(INPUT_RECORDS)  # Run
# sculptor_config.yaml
schema:
  subject_name:
    type: "string"
    description: "Name of subject"
  level:
    type: "enum"
    enum: ["ANI", "AGI", "ASI"]
    description: "Subject's intelligence level"

instructions: "Extract key information about the subject."
model: "gpt-4o-mini"

# Prompt Configuration (Optional)
template: "Review text: {{ text }}"  # Format input with template
input_keys: ["text"]                 # Or specify fields to include
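
Since configs can be JSON as well as YAML, the same single-sculptor config can be sketched as JSON and parsed with the standard library. This only illustrates the config shape; `Sculptor.from_config` does the real loading:

```python
import json

# The same single-sculptor config expressed as JSON (configs may be YAML
# or JSON). Shape illustration only; Sculptor.from_config does the loading.
config_text = """
{
  "schema": {
    "subject_name": {"type": "string", "description": "Name of subject"},
    "level": {
      "type": "enum",
      "enum": ["ANI", "AGI", "ASI"],
      "description": "Subject's intelligence level"
    }
  },
  "instructions": "Extract key information about the subject.",
  "model": "gpt-4o-mini"
}
"""
config = json.loads(config_text)
```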

Pipeline Configuration

Pipeline configs define a sequence of Sculptors with optional filtering functions between them.

pipeline = SculptorPipeline.from_config("pipeline_config.yaml")  # Read
results = pipeline.process(INPUT_RECORDS, n_workers=4)  # Run
# pipeline_config.yaml
steps:
  - sculptor:
      model: "gpt-4o-mini"
      schema:
        subject_name:
          type: "string"
          description: "Name of subject"
        level:
          type: "enum"
          enum: ["ANI", "AGI", "ASI"]
          description: "Subject's intelligence level"
      filter: "lambda x: x['level'] in ['AGI', 'ASI']"

  - sculptor:
      model: "gpt-4o"
      schema:
        from_location:
          type: "string"
          description: "Subject's place of origin"
        skills:
          type: "array"
          items: "enum"
          enum: ["time_travel", "nuclear_capabilities", ...]
          description: "Keywords of subject's abilities"
        recommendation:
          type: "string"
          description: "Concise recommended action to take regarding subject"
        ...

LLM Configuration

Sculptor requires an LLM API to function. By default it uses OpenAI's API, but any OpenAI-compatible API that supports structured outputs will work. Different Sculptors in a pipeline can use different LLM APIs.

You can configure LLMs when creating a Sculptor:

sculptor = Sculptor(api_key="openai-key")  # Direct API key configuration
sculptor = Sculptor(api_key="other-key", base_url="https://other-api.endpoint/openai")  # Alternative API

Or set an environment variable which will be used by default:

export OPENAI_API_KEY="your-key"

You can also configure LLMs in the same config files discussed above:

steps:
  - sculptor:
      api_key: "${YOUR_API_KEY_VAR}"
      base_url: "https://your-api.com/openai"
      model: "your-ai-model"
      schema:
        ...
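
The `${YOUR_API_KEY_VAR}` placeholder suggests environment-variable interpolation in config files. A minimal sketch of how such expansion can work (assumed behavior, not Sculptor's actual loader):

```python
import os
import re

def expand_env(value):
    # Replace ${VAR} placeholders with values from the environment.
    # (Sketch of assumed config interpolation, not the library's code.)
    return re.sub(r"\$\{(\w+)\}",
                  lambda m: os.environ.get(m.group(1), ""), value)

os.environ["YOUR_API_KEY_VAR"] = "sk-example"   # hypothetical value
api_key = expand_env("${YOUR_API_KEY_VAR}")
```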

Schema Validation and Field Types

Sculptor supports the following types in the schema's "type" field:

• string
• number
• boolean
• integer
• array (with "items" specifying the item type)
• object
• enum (with "enum" specifying the allowed values)
• anyOf

These map to Python's str, float, bool, int, list, dict, etc. The "enum" type must provide a list of valid values.
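
Based on that mapping, a small sketch of type-checking extracted values (illustrative; Sculptor performs its own automatic validation):

```python
# Illustrative mapping from schema "type" values to Python types, per the
# list above. Not the library's code; Sculptor validates automatically.
SCHEMA_TO_PYTHON = {
    "string": str,
    "number": float,
    "boolean": bool,
    "integer": int,
    "array": list,
    "object": dict,
}

def check_field(value, field_type):
    # Return True if the extracted value matches the declared schema type.
    return isinstance(value, SCHEMA_TO_PYTHON[field_type])
```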

License

MIT

Download files


Source Distribution

sculpt-0.1.33.tar.gz (13.4 kB, Source)

Built Distribution


sculpt-0.1.33-py3-none-any.whl (13.9 kB, Python 3)

File details

Details for the file sculpt-0.1.33.tar.gz.

File metadata

  • Download URL: sculpt-0.1.33.tar.gz
  • Upload date:
  • Size: 13.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.0

File hashes

Hashes for sculpt-0.1.33.tar.gz

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 311ed12ed74aeee58169b6add2df11962c7b7bbaa73c33e9cd365f91ec60fe94 |
| MD5 | 8ee2b82f3d596db24fd4644dcf9c2cb6 |
| BLAKE2b-256 | 334d8e68443deeb9e1f9ed8422ff0271f7dcace0c46a1b71c570e7d2de77c14b |


File details

Details for the file sculpt-0.1.33-py3-none-any.whl.

File metadata

  • Download URL: sculpt-0.1.33-py3-none-any.whl
  • Upload date:
  • Size: 13.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.0

File hashes

Hashes for sculpt-0.1.33-py3-none-any.whl

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 46eb0ece6c648eb929b0e9116ea61fbd2476d8993f9a838459d62cbc1a23038a |
| MD5 | fb3e8bbf409c0f8025305cc6fcb7ecbe |
| BLAKE2b-256 | 3eff367ba64e0a2bde4a2eae99274005128d5f238b17b3903fcf2d68c9f65316 |

