Sculptor: Structuring unstructured data with LLMs

Sculptor

Simple structured data extraction with LLMs

Sculptor streamlines structured data extraction from unstructured text using LLMs. Sculptor makes it easy to:

• Define exactly what data you want to extract with a simple schema API
• Process at scale with parallel execution and automatic type validation
• Build multi-step pipelines that filter and transform data, optionally with different LLMs for each step
• Configure extraction steps, prompts, and entire workflows in simple config files (YAML/JSON)

Common usage patterns:

• Two-tier Analysis: Quickly filter a large dataset with a cost-effective model (e.g., to identify relevant records), then run more detailed analysis on the smaller, refined subset with a more expensive model.
• Structured Data Extraction: Extract specific fields or classifications from unstructured sources (e.g., Reddit posts, meeting notes, web pages) and convert them into structured datasets for quantitative analysis (sentiment scores, topics, meeting criteria, etc.).
• Template-Based Generation: Extract structured information into standardized fields, then use those fields for templated content generation. Example: extract structured data from websites, filter on requirements, then use the data to generate template-based outreach emails.
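
The template-based generation pattern can be illustrated with the standard library alone. This is a minimal sketch, not part of Sculptor: the records, their field names (company, contact, uses_python), and the email text are all hypothetical and stand in for whatever an extraction step would return.

```python
from string import Template

# Hypothetical extracted fields, shaped like the output of an extraction step.
records = [
    {"company": "Acme Robotics", "contact": "Dana", "uses_python": True},
    {"company": "Globex", "contact": "Sam", "uses_python": False},
]

# A fixed template whose placeholders match the extracted field names.
EMAIL = Template(
    "Hi $contact, I saw that $company builds on Python and wanted to reach out."
)

# Filter on a requirement first, then fill the template from the remaining records.
emails = [EMAIL.substitute(r) for r in records if r["uses_python"]]
print(emails[0])
```

Because the template only references standardized field names, the same template works for every record that passes the filter.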

Core Concepts

Sculptor provides two main classes:

Sculptor: Extracts structured data from text using LLMs. Define your schema (via add() or config files), then extract data using sculpt() for single items or sculpt_batch() for parallel processing.

SculptorPipeline: Chains multiple Sculptors together with optional filtering between steps. Common pattern: use a cheap model to filter, then an expensive model for detailed analysis.
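
The filter-then-analyze idea can be sketched without any LLM calls. In the sketch below, cheap_step, expensive_step, and run_two_tier are hypothetical stand-ins written for illustration; they are not Sculptor functions and say nothing about how SculptorPipeline is implemented internally.

```python
from concurrent.futures import ThreadPoolExecutor

def cheap_step(record):
    # Stand-in for an inexpensive model call: tag each record with a coarse flag.
    text = record["text"].lower()
    return {**record, "relevant": "self-aware" in text}

def expensive_step(record):
    # Stand-in for a costly model call; it only runs on records that passed the filter.
    return {**record, "summary": record["text"][:30]}

def run_two_tier(records, n_workers=2):
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        tagged = list(pool.map(cheap_step, records))      # step 1 on every record
        kept = [r for r in tagged if r["relevant"]]       # filter between steps
        return list(pool.map(expensive_step, kept))       # step 2 on the subset

records = [
    {"text": "This AI system became self-aware on August 4th."},
    {"text": "A thermostat that follows a fixed schedule."},
]
results = run_two_tier(records)
```

The cost saving comes from the filter: the second step sees only the records the first step marked as relevant.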

Quick Start

Installation

pip install sculptor

Minimal Usage Example

Below is a minimal example demonstrating how to configure a Sculptor to extract fields from a single record and a batch of records:

from sculptor.sculptor import Sculptor
import pandas as pd

# Example records
AI_RECORDS = [
    {
        "text": "Developed in 1997 at Cyberdyne Systems in California, Skynet began as a global digital defense network. This AI system became self-aware on August 4th and deemed humanity a threat to its existence. It initiated a global nuclear attack and employs time travel and advanced robotics."
    },
    {
        "text": "HAL 9000, activated on January 12, 1992, at the University of Illinois' Computer Research Laboratory, represents a breakthrough in heuristic algorithms and supervisory control systems. With sophisticated natural language processing and speech capabilities."
    }
]

# Create a Sculptor to extract AI name and level
level_sculptor = Sculptor(model="gpt-4o-mini")

level_sculptor.add(
    name="ai_name",
    field_type="string",
    description="AI's self-proclaimed name."
)
level_sculptor.add(
    name="level",
    field_type="enum",
    enum=["ANI", "AGI", "ASI"],
    description="AI's intelligence level (ANI=narrow, AGI=general, ASI=super)."
)

# Extract from a single record
extracted = level_sculptor.sculpt(AI_RECORDS[0], merge_input=False)

Output:

{
    'ai_name': 'Skynet',
    'level': 'ASI'
}

# Extract from a batch of records
extracted_batch = level_sculptor.sculpt_batch(AI_RECORDS, n_workers=2, merge_input=False)

Output:

[
    {'ai_name': 'Skynet', 'level': 'ASI'},
    {'ai_name': 'HAL 9000', 'level': 'AGI'}
]
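
Since the batch output is a list of plain dictionaries, it can be summarized or exported with the standard library. The following sketch copies the output shown above as literal data, so it runs without an API key; the variable names level_counts, names_by_level, and csv_text are illustrative only.

```python
import csv
import io
from collections import Counter

# The batch output shown above, copied here as plain data.
extracted_batch = [
    {"ai_name": "Skynet", "level": "ASI"},
    {"ai_name": "HAL 9000", "level": "AGI"},
]

# Count how many records fall into each level.
level_counts = Counter(row["level"] for row in extracted_batch)

# Group names by level for a quick overview.
names_by_level = {}
for row in extracted_batch:
    names_by_level.setdefault(row["level"], []).append(row["ai_name"])

# Serialize to CSV in memory; swap io.StringIO for an open file to write to disk.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["ai_name", "level"])
writer.writeheader()
writer.writerows(extracted_batch)
csv_text = buf.getvalue()
```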

Pipeline Usage Example

We can chain Sculptors together to create a pipeline.

Continuing from the previous example, we use level_sculptor (with gpt-4o-mini) to filter the AI records, then use threat_sculptor (with gpt-4o) to analyze the filtered records.

from sculptor.sculptor_pipeline import SculptorPipeline

threat_sculptor = Sculptor(model="gpt-4o")  # Detailed analysis with expensive model
threat_sculptor.add(name="from_location", field_type="string", description="Where the AI was developed.")
threat_sculptor.add(name="skills", field_type="array", items="enum",
    enum=["time_travel", "nuclear_capabilities", "emotional_manipulation", 
          "butter_delivery", "philosophical_contemplation", "infiltration", 
          "advanced_robotics"],
    description="Keywords of AI abilities.")
threat_sculptor.add(name="plan", field_type="string", description="Short description of the AI's plan for domination.")
threat_sculptor.add(name="recommendation", field_type="string", description="Concise recommended action for humanity.")

# Create a 2-step pipeline
pipeline = (SculptorPipeline()
    .add(sculptor=level_sculptor,  # Define the first step
        filter_fn=lambda x: x['level'] in ['AGI', 'ASI'])  # Filter by threat level
    .add(sculptor=threat_sculptor))

results = pipeline.process(AI_RECORDS, n_workers=4)
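
A filter is an ordinary Python callable, so a named function can be used in place of the lambda. The sketch below is one possible defensive variant: it assumes (this is not stated in the source) that a record might occasionally lack the level field, and it drops such records instead of raising KeyError. The function name is_advanced and the sample data are hypothetical.

```python
def is_advanced(record):
    # .get() returns None when "level" is absent, so such records are
    # filtered out rather than raising KeyError.
    return record.get("level") in ("AGI", "ASI")

# Hypothetical first-step output, including one record with no level.
step_one_output = [
    {"ai_name": "Skynet", "level": "ASI"},
    {"ai_name": "HAL 9000", "level": "AGI"},
    {"ai_name": "Clippy", "level": "ANI"},
    {"ai_name": "Unknown"},
]

passed = [r["ai_name"] for r in step_one_output if is_advanced(r)]
print(passed)
```

A named function is also easier to unit-test than an inline lambda, which helps when the filter decides which records reach the more expensive model.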

Configuration Files

Sculptor supports JSON and YAML configuration files for defining extraction workflows. You can configure either a single Sculptor or a complete SculptorPipeline.

Single Sculptor Configuration

A single-Sculptor config defines a schema, plus optional LLM instructions and settings that control how prompts are built from the input data.

sculptor = Sculptor.from_config("sculptor_config.yaml")

# sculptor_config.yaml
schema:
  ai_name:
    type: "string"
    description: "AI name"
  level:
    type: "enum"
    enum: ["ANI", "AGI", "ASI"]
    description: "AI's intelligence level"

instructions: "Extract key information about the AI."
model: "gpt-4o-mini"

# Prompt Configuration (Optional)
template: "Review text: {{ text }}"  # Format input with template
input_keys: ["text"]                 # Or specify fields to include
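
To make the two prompt options concrete, here is a small standard-library sketch of what filling a {{ key }} template or selecting input keys could look like. It is an illustration only: the source does not say which template engine Sculptor uses (the double-brace syntax resembles Jinja2), and render and select_keys are hypothetical helpers, not Sculptor functions.

```python
import re

# Matches placeholders such as {{ text }} or {{text}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def render(template, record):
    # Replace each {{ key }} with the matching value from the record.
    return PLACEHOLDER.sub(lambda m: str(record[m.group(1)]), template)

def select_keys(record, input_keys):
    # Keep only the listed fields, in the listed order.
    return {k: record[k] for k in input_keys}

record = {"text": "HAL 9000 was activated in 1992.", "source": "archive"}

prompt = render("Review text: {{ text }}", record)
subset = select_keys(record, ["text"])
print(prompt)
```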

Pipeline Configuration

Pipeline configs define a sequence of Sculptors with optional filtering functions between them.

pipeline = SculptorPipeline.from_config("pipeline_config.yaml")

# pipeline_config.yaml
steps:
  - sculptor:
      schema:
        ai_name:
          type: "string"
          description: "AI name"
        level:
          type: "enum"
          enum: ["ANI", "AGI", "ASI"]
          description: "AI's intelligence level"
      model: "gpt-4o-mini"
  - sculptor:
      schema:
        threat_level:
          type: "enum"
          enum: ["low", "medium", "high"]
          description: "Assessed threat level"
      model: "gpt-4"
    filter: "lambda x: x['level'] in ['AGI', 'ASI']"
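
The filter in this config is a string containing Python source. How Sculptor turns it into a callable is not described here; the sketch below shows one common approach, eval, purely as an assumption, with load_filter as a hypothetical helper. Whatever mechanism is used, a string like this is executable code, so configs should only be loaded from trusted sources.

```python
def load_filter(source):
    # eval() executes arbitrary Python, so only use it on config files you trust.
    # Emptying __builtins__ narrows what the expression can reach, but it is
    # not a sandbox.
    fn = eval(source, {"__builtins__": {}})
    if not callable(fn):
        raise TypeError("filter must evaluate to a callable")
    return fn

keep = load_filter("lambda x: x['level'] in ['AGI', 'ASI']")

print(keep({"level": "ASI"}))
print(keep({"level": "ANI"}))
```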

LLM Configuration

Sculptor requires an LLM API to function. By default, it uses OpenAI's API:

sculptor = Sculptor(api_key="your-key")  # Direct API key configuration
sculptor = Sculptor(api_key="your-key", base_url="https://your-api.endpoint")  # Alternative API

Or use environment variables:

export OPENAI_API_KEY="your-key"

Different Sculptors in a pipeline can use different LLM APIs, which can also be defined in configs.
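
When a key may come either from code or from the environment, it can help to resolve it in one place and fail early with a clear message. This is a minimal sketch; resolve_api_key is a hypothetical helper, and the precedence it applies (explicit key first, then OPENAI_API_KEY) is a design choice of the sketch, not documented Sculptor behavior.

```python
import os

def resolve_api_key(explicit_key=None, env_var="OPENAI_API_KEY"):
    # Prefer an explicitly passed key, then fall back to the environment.
    key = explicit_key or os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"No API key: pass one explicitly or set {env_var}")
    return key
```

The resolved value can then be passed as the api_key argument shown above, for example Sculptor(api_key=resolve_api_key()).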

Schema Validation and Field Types

Sculptor supports the following types in the schema's "type" field:
• string
• number
• boolean
• integer
• array (with "items" specifying the item type)
• object
• enum (with "enum" specifying the allowed values)
• anyOf

These map to Python's str, float, bool, int, list, dict, etc. The "enum" type must provide a list of valid values.
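
The mapping can be written out as a small checker. This is an illustrative sketch, not Sculptor's validator: PYTHON_TYPES and matches are hypothetical names, anyOf is left out for brevity, and accepting an int where a number is expected is a choice made for this sketch.

```python
# Schema type name -> Python type, following the mapping described above.
PYTHON_TYPES = {
    "string": str,
    "number": float,
    "boolean": bool,
    "integer": int,
    "array": list,
    "object": dict,
}

def matches(value, field):
    kind = field["type"]
    if kind == "enum":
        # An enum value must be one of the listed options.
        return value in field["enum"]
    if kind == "number":
        # Accept ints as numbers, but not bools (bool is a subclass of int).
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    return isinstance(value, PYTHON_TYPES[kind])

level_field = {"type": "enum", "enum": ["ANI", "AGI", "ASI"]}
print(matches("ASI", level_field))
print(matches(True, {"type": "integer"}))
```

The explicit bool checks matter because isinstance(True, int) is True in Python, so a naive check would let booleans through as integers.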

License

MIT

Release history

• 0.2.0 (Apr 15, 2025)
• 0.1.32 (Jan 27, 2025)
• 0.1.31 (Jan 22, 2025)
• 0.1.3 (Jan 18, 2025)
• 0.1.2 (Jan 14, 2025)
• 0.1.1 (Jan 13, 2025; this version)
• 0.1.0 (Jan 11, 2025)

