A workspace library for managing Polars dataframes with parent-child relationships and lazy evaluation

DocWorkspace

A Python library for managing Polars DataFrames and LazyFrames with parent-child relationship tracking, lazy evaluation, and FastAPI integration. Part of the LDaCA (Language Data Commons of Australia) ecosystem.

Overview

DocWorkspace provides a workspace-based approach to data analysis, where data transformations are tracked as nodes in a directed graph. This enables:

Relationship Tracking: Understand data lineage and transformation history
Lazy Evaluation: Optimize performance with Polars LazyFrames
Multiple Data Types: Support for Polars DataFrames and LazyFrames
FastAPI Integration: Ready-to-use models and utilities for web APIs
Serialization: Save and restore entire workspaces with their relationships

Installation

pip install "docworkspace>=0.2.0"

docworkspace is published on PyPI as a pure-Python package.

Install From Source

git clone https://github.com/Australian-Text-Analytics-Platform/docworkspace.git
cd docworkspace
uv sync --group dev

Dependencies

Python ≥ 3.14
polars
polars-text >= 0.1.0

For FastAPI integration:

pip install pydantic

Quick Start

import polars as pl
from docworkspace import Node, Workspace

# Create a workspace
workspace = Workspace("my_analysis")

# Load data
df = pl.DataFrame({
    "text": ["Hello world", "Data science", "Python rocks"],
    "category": ["greeting", "tech", "programming"],
    "score": [0.8, 0.9, 0.95]
})

# Add data to workspace
data_node = workspace.add_node(Node(df, name="raw_data"))

# Apply transformations (creates new nodes automatically)
filtered = data_node.filter(pl.col("score") > 0.85)
grouped = filtered.group_by("category").agg(pl.col("score").mean())

# Check relationships
print(f"Total nodes: {len(workspace.nodes)}")
print(f"Root nodes: {len(workspace.get_root_nodes())}")
print(f"Leaf nodes: {len(workspace.get_leaf_nodes())}")

# Visualize the computation graph
print(workspace.visualize_graph())

Core Concepts

Node

A Node wraps your data (DataFrames, LazyFrames) and tracks relationships with other nodes. Nodes support:

Transparent Data Access: All DataFrame methods work directly on nodes
Automatic Relationship Tracking: Operations create child nodes
Lazy Evaluation: Maintains laziness for performance
Metadata: Store operation descriptions and custom metadata

# Node automatically creates workspace if none provided
node = Node(df, name="my_data")

# All DataFrame operations work directly
filtered_node = node.filter(pl.col("value") > 10)
sorted_node = filtered_node.sort("value", descending=True)

# Check relationships
print(f"Children of original node: {len(node.children)}")
print(f"Parents of sorted node: {len(sorted_node.parents)}")
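
The idea of "transparent data access" with automatic child creation can be illustrated without Polars. The sketch below is a minimal, hypothetical stand-in (`TrackedNode` is not DocWorkspace's `Node`, and the real implementation may work differently): unknown attributes are forwarded to the wrapped object, and results of the same type are wrapped in a child node.

```python
class TrackedNode:
    """Hypothetical stand-in for a node; not DocWorkspace's Node class."""

    def __init__(self, data, name, parents=None, operation=None):
        self.data = data
        self.name = name
        self.operation = operation
        self.parents = list(parents or [])
        self.children = []
        # Register this node as a child of each parent.
        for parent in self.parents:
            parent.children.append(self)

    def __getattr__(self, attr):
        # Called only when normal attribute lookup fails, so the node's own
        # attributes (name, parents, ...) never reach this method.
        if attr == "data" or attr.startswith("__"):
            raise AttributeError(attr)
        target = getattr(self.data, attr)
        if not callable(target):
            return target

        def tracked_call(*args, **kwargs):
            result = target(*args, **kwargs)
            # Wrap results of the same type as the wrapped data in a child
            # node; pass anything else (counts, booleans, ...) through.
            if isinstance(result, type(self.data)):
                return TrackedNode(result, f"{attr}_{self.name}", [self], attr)
            return result

        return tracked_call


# A plain string stands in for a DataFrame here.
raw = TrackedNode("  Hello World  ", "raw")
stripped = raw.strip()          # str result, so a child node is created
lowered = stripped.lower()      # chained call creates a grandchild
letter_count = raw.count("l")   # int result is returned as-is
```

Here `lowered.parents[0]` is `stripped` and `raw.children[0]` is `stripped`, which mirrors the parent/child checks shown above.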

Workspace

A Workspace manages collections of nodes and provides graph operations:

Node Management: Add, remove, and retrieve nodes
Graph Operations: Find roots, leaves, descendants, ancestors
Serialization: Save/load entire workspaces
Visualization: Generate text-based and programmatic graph representations

workspace = Workspace("analysis")

# Add nodes
node1 = workspace.add_node(Node(df1, "dataset1"))
node2 = workspace.add_node(Node(df2, "dataset2"))

# Join creates a new node with both parents
joined = node1.join(node2, on="id")

# Explore the graph
roots = workspace.get_root_nodes()
leaves = workspace.get_leaf_nodes()
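
To make "roots" and "leaves" concrete, here is a small sketch over a hypothetical parent map (plain dictionaries, not the Workspace API). It assumes the join above is followed by one further filter step.

```python
# Hypothetical graph: each key is a node name, each value lists its parents.
parents = {
    "dataset1": [],
    "dataset2": [],
    "joined": ["dataset1", "dataset2"],  # a join has two parents
    "filtered": ["joined"],
}

# A root has no parents.
roots = [name for name, ps in parents.items() if not ps]

# A leaf is never listed as another node's parent.
all_parents = {p for ps in parents.values() for p in ps}
leaves = [name for name in parents if name not in all_parents]
```

With this map, both source datasets are roots and only the last step is a leaf.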

Supported Data Types

DocWorkspace supports multiple data types from the Polars ecosystem:

Polars Types

pl.DataFrame: Materialized, in-memory data
pl.LazyFrame: Lazy evaluation for performance optimization

Example with Different Types

import polars as pl

# Polars DataFrame (eager)
df = pl.DataFrame({"text": ["hello", "world"], "id": [1, 2]})
node1 = Node(df, "eager_data")

# Polars LazyFrame (lazy)
lazy_df = pl.LazyFrame({"text": ["foo", "bar"], "id": [3, 4]})
node2 = Node(lazy_df, "lazy_data")

# All work seamlessly in the same workspace
workspace = Workspace("mixed_types")
for node in [node1, node2]:
    workspace.add_node(node)

Key Features

1. Lazy Evaluation

DocWorkspace preserves Polars' lazy evaluation capabilities:

# Start with lazy data
lazy_df = pl.scan_csv("large_file.csv")
node = Node(lazy_df, "raw_data")

# Chain operations (all remain lazy)
filtered = node.filter(pl.col("value") > 100)
grouped = filtered.group_by("category").agg(pl.col("value").sum())
sorted_result = grouped.sort("value", descending=True)

# Only materialize when needed
final_result = sorted_result.collect()  # This creates a new materialized node

2. Relationship Tracking

Understand your data lineage:

# Create a processing pipeline
raw_data = Node(df, "raw")
cleaned = raw_data.filter(pl.col("value").is_not_null())
normalized = cleaned.with_columns(
    (pl.col("value") / pl.col("value").max()).alias("normalized_value")
)
final = normalized.select(["id", "normalized_value"])

# Explore relationships
print("Processing chain:")
current = final
while current.parents:
    parent = current.parents[0]
    print(f"{parent.name} -> {current.name} ({current.operation})")
    current = parent
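
The loop above follows only the first parent of each node. When a node has several parents (for example after a join), a recursive walk visits all of them. The sketch below uses a hypothetical `Step` record rather than DocWorkspace's `Node`, so the field names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """Hypothetical lineage record; not DocWorkspace's Node class."""
    name: str
    operation: str = "load"
    parents: list["Step"] = field(default_factory=list)


def lineage(step: Step, depth: int = 0) -> list[str]:
    """Return one indented line per ancestor, visiting every parent."""
    lines = [f"{'  ' * depth}{step.name} ({step.operation})"]
    for parent in step.parents:
        lines.extend(lineage(parent, depth + 1))
    return lines


sales = Step("sales")
customers = Step("customers")
joined = Step("joined", "join", [sales, customers])
summary = Step("summary", "group_by", [joined])

report = lineage(summary)
print("\n".join(report))
```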

3. FastAPI Integration

Ready-to-use models for web APIs:

from docworkspace import FastAPIUtils, WorkspaceGraph, NodeSummary

# Convert workspace to a React Flow compatible graph payload
graph_data = FastAPIUtils.workspace_to_ui_graph_payload(workspace)

# Get node summaries
summaries = [FastAPIUtils.node_to_api_summary(node) for node in workspace.nodes.values()]

4. Serialization

Save and restore complete workspaces:

# Save workspace with all nodes and relationships
workspace.serialize("my_workspace.json")

# Load workspace later
restored_workspace = Workspace.deserialize("my_workspace.json")

# All nodes and relationships are preserved
assert len(restored_workspace.nodes) == len(workspace.nodes)
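
What "relationships are preserved" means can be shown with a JSON round trip of a toy graph. This is a sketch under assumptions: the record layout below is hypothetical and says nothing about the file format DocWorkspace writes or how it stores the frames themselves.

```python
import json

# Hypothetical in-memory graph: node id -> name and parent ids.
graph = {
    "n1": {"name": "raw", "parents": []},
    "n2": {"name": "filtered", "parents": ["n1"]},
    "n3": {"name": "grouped", "parents": ["n2"]},
}


def dump_graph(nodes: dict) -> str:
    """Serialize nodes as a JSON list so each record carries its own id."""
    records = [{"id": node_id, **attrs} for node_id, attrs in nodes.items()]
    return json.dumps({"nodes": records}, indent=2)


def load_graph(text: str) -> dict:
    """Rebuild the id -> attributes mapping from the JSON text."""
    records = json.loads(text)["nodes"]
    return {r["id"]: {"name": r["name"], "parents": r["parents"]} for r in records}


restored = load_graph(dump_graph(graph))
```

Storing parent ids rather than nested objects keeps each record flat and lets the links be re-established after all nodes are loaded.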

Advanced Usage

Custom Operations

Create custom operations that maintain relationships:

def custom_transform(node: Node, operation_name: str) -> Node:
    """Apply custom transformation and track the operation."""
    # Your custom logic here
    result_data = node.data.with_columns(pl.col("value") * 2)

    # Create new node with relationship tracking
    return Node(
        data=result_data,
        name=f"{operation_name}_{node.name}",
        workspace=node.workspace,
        parents=[node],
        operation=operation_name
    )

# Use custom operation
transformed = custom_transform(original_node, "double_values")

Graph Analysis

Analyze your computation graph:

# Find all descendants of a node
descendants = workspace.get_descendants(node.id)

# Find all ancestors
ancestors = workspace.get_ancestors(node.id)

# Get topological ordering
ordered_nodes = workspace.get_topological_order()

# Check for cycles (shouldn't happen in normal usage)
has_cycles = workspace.has_cycles()
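
The same graph queries can be sketched with the standard library's `graphlib` on a hypothetical parent map. This illustrates the concepts only; it is not how DocWorkspace implements them, and the helper names are made up for the example.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical graph: node name -> names of its parents (predecessors).
parents = {
    "raw": [],
    "cleaned": ["raw"],
    "normalized": ["cleaned"],
    "stats": ["cleaned"],
}


def ancestors(name: str) -> set[str]:
    """Collect every node reachable by following parent links."""
    found = set()
    stack = list(parents[name])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found


def descendants(name: str) -> set[str]:
    """Descendants are the nodes that count `name` among their ancestors."""
    return {other for other in parents if name in ancestors(other)}


def has_cycles(graph: dict) -> bool:
    """prepare() raises CycleError when the graph is not acyclic."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return True
    return False


# Parents always appear before their children in this ordering.
order = list(TopologicalSorter(parents).static_order())
```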

Working with Document Columns

DocWorkspace tracks the text/document column via node metadata:

# Create a DataFrame with a text column
df = pl.DataFrame({
    "doc_id": ["d1", "d2", "d3"],
    "text": ["Hello world", "Data science", "Python rocks"],
    "metadata": ["type1", "type2", "type1"]
})

node = Node(df, "corpus")
node.document = "text"

# Document metadata is preserved across operations
filtered = node.filter(pl.col("metadata") == "type1")
print(f"Document column preserved: {filtered.document}")
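
One way metadata can survive an operation is for each derived object to receive a copy of its parent's metadata. The sketch below shows that pattern with a hypothetical `Corpus` class over plain dictionaries; it is an assumption about the general approach, not a description of DocWorkspace internals.

```python
from dataclasses import dataclass, field


@dataclass
class Corpus:
    """Hypothetical stand-in for a node holding rows and metadata."""
    rows: list[dict]
    metadata: dict = field(default_factory=dict)

    def filter(self, predicate) -> "Corpus":
        kept = [row for row in self.rows if predicate(row)]
        # Copy the metadata so the child starts with the parent's settings
        # without sharing one mutable dict.
        return Corpus(kept, dict(self.metadata))


corpus = Corpus(
    rows=[
        {"doc_id": "d1", "text": "Hello world", "metadata": "type1"},
        {"doc_id": "d2", "text": "Data science", "metadata": "type2"},
    ],
    metadata={"document": "text"},
)
subset = corpus.filter(lambda row: row["metadata"] == "type1")
```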

API Reference

Node Class

Constructor

Node(data, name=None, workspace=None, parents=None, operation=None)

Properties

document: Optional[str] - Document column tracked in node metadata
data: DataFrame | LazyFrame - Underlying frame-like object

Methods

collect() -> Node - Materialize lazy data (creates new node)
materialize() -> Node - Alias for collect()
info(json=False) -> Dict - Get node information
json_schema() -> Dict[str, str] - Get JSON-compatible schema

DataFrame Operations

All Polars DataFrame/LazyFrame operations are available directly:

filter(condition) -> Node
select(columns) -> Node
with_columns(*exprs) -> Node
group_by(*columns) -> Node
sort(by, descending=False) -> Node
join(other, on, how="inner") -> Node
And many more...

Workspace Class

Constructor (Workspace)

Workspace(name=None, data=None, data_name=None, csv_lazy=True, **csv_kwargs)

Properties (Workspace)

id: str - Unique workspace identifier
name: str - Human-readable name
nodes: Dict[str, Node] - All nodes in the workspace

Methods (Workspace)

Node Management

add_node(node) -> Node - Add a node to the workspace
remove_node(node_id, materialize_children=False) -> bool - Remove a node
get_node(node_id) -> Optional[Node] - Get node by ID
get_node_by_name(name) -> Optional[Node] - Get node by name
list_nodes() -> List[Node] - Get all nodes

Graph Operations

get_root_nodes() -> List[Node] - Nodes with no parents
get_leaf_nodes() -> List[Node] - Nodes with no children
get_descendants(node_id) -> List[Node] - All descendant nodes
get_ancestors(node_id) -> List[Node] - All ancestor nodes
get_topological_order() -> List[Node] - Topologically sorted nodes

Visualization

visualize_graph() -> str - Text-based graph visualization
graph() -> Dict - Generic graph structure
to_react_flow_json() -> Dict - React Flow compatible format

Serialization

serialize(file_path) - Save workspace to JSON
deserialize(file_path) -> Workspace - Load workspace from JSON
from_dict(workspace_dict) -> Workspace - Create from dictionary

Metadata

get_metadata(key) -> Any - Get workspace metadata
set_metadata(key, value) - Set workspace metadata
summary() -> Dict - Get workspace summary
info() -> Dict - Alias for summary()

FastAPI Integration

Models

NodeSummary - API-friendly node representation
WorkspaceGraph - React Flow compatible graph
PaginatedData - Paginated data response

Utilities

FastAPIUtils.node_to_api_summary(node) -> NodeSummary
FastAPIUtils.workspace_to_ui_graph_payload(workspace) -> WorkspaceGraph
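
As a rough illustration of what a paginated data response involves, the sketch below slices a list of rows into pages. The `Page` fields and the `paginate` helper are hypothetical; the actual `PaginatedData` model may use different field names and semantics.

```python
from dataclasses import asdict, dataclass


@dataclass
class Page:
    """Hypothetical response shape; PaginatedData's real fields may differ."""
    data: list
    page: int
    page_size: int
    total: int
    has_next: bool


def paginate(rows: list, page: int, page_size: int) -> Page:
    """Return the 1-based `page` of `rows`."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start = (page - 1) * page_size
    chunk = rows[start:start + page_size]
    return Page(chunk, page, page_size, len(rows), start + page_size < len(rows))


rows = [{"doc_id": f"doc_{i}"} for i in range(5)]
second = asdict(paginate(rows, page=2, page_size=2))
```

Returning the total alongside the slice lets a client render page controls without a second request.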

Examples

Example 1: Text Analysis Pipeline

import polars as pl
from docworkspace import Node, Workspace

# Sample text data
df = pl.DataFrame({
    "doc_id": [f"doc_{i}" for i in range(100)],
    "text": [f"Sample text content {i}" for i in range(100)],
    # Repeat the three categories and trim to 100 rows so all columns match
    "category": (["news", "blog", "academic"] * 34)[:100],
    "year": [2020, 2021, 2022, 2023] * 25
})

# Create workspace
workspace = Workspace("text_analysis")

# Track the document column for text analysis
corpus = workspace.add_node(Node(df, "full_corpus"))
corpus.document = "text"

# Filter by category
news_docs = corpus.filter(pl.col("category") == "news")
blog_docs = corpus.filter(pl.col("category") == "blog")

# Filter by recent years
recent_news = news_docs.filter(pl.col("year") >= 2022)

# Group analysis
year_stats = corpus.group_by(["category", "year"]).agg(
    pl.len().alias("doc_count")  # pl.count() is deprecated in favor of pl.len()
)

# Materialize results
final_stats = year_stats.collect()

# Analyze the computation graph
print(workspace.visualize_graph())
print(f"Total transformations: {len(workspace.nodes)}")

Example 2: Lazy Data Processing

import polars as pl
from docworkspace import Workspace

# Create workspace with lazy CSV loading
workspace = Workspace(
    "large_data_analysis",
    data="large_dataset.csv",  # Path to CSV
    data_name="raw_data",
    csv_lazy=True  # Load as LazyFrame for performance
)

# Get the loaded node
raw_data = workspace.get_node_by_name("raw_data")
print(f"Is lazy: {isinstance(raw_data.data, pl.LazyFrame)}")  # True

# Chain transformations (all remain lazy)
cleaned = raw_data.filter(pl.col("value").is_not_null())
normalized = cleaned.with_columns(
    (pl.col("value") / pl.col("value").max()).alias("normalized")
)
aggregated = normalized.group_by("category").agg([
    pl.col("normalized").mean().alias("avg_normalized"),
    pl.len().alias("count")  # pl.count() is deprecated in favor of pl.len()
])

# Still lazy until we collect (check underlying data type)
print(f"Aggregated is lazy: {isinstance(aggregated.data, pl.LazyFrame)}")  # True

# Materialize only the final result
result = aggregated.collect()
print(f"Result is lazy: {isinstance(result.data, pl.LazyFrame)}")  # False

# Save the entire workspace with lazy evaluation preserved
workspace.serialize("lazy_analysis.json")

Example 3: Multi-Source Data Integration

import polars as pl
from docworkspace import Node, Workspace

workspace = Workspace("data_integration")

# Load data from multiple sources
sales_df = pl.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "sales": [100, 200, 150, 300],
    "region": ["North", "South", "East", "West"]
})

customer_df = pl.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "segment": ["Premium", "Regular", "Premium", "Regular", "Premium"]
})

# Add to workspace
sales_node = workspace.add_node(Node(sales_df, "sales_data"))
customer_node = workspace.add_node(Node(customer_df, "customer_data"))

# Join the datasets
combined = sales_node.join(customer_node, on="customer_id", how="inner")

# Analyze by segment
segment_analysis = combined.group_by("segment").agg([
    pl.col("sales").sum().alias("total_sales"),
    pl.col("sales").mean().alias("avg_sales"),
    pl.len().alias("customer_count")  # pl.count() is deprecated in favor of pl.len()
])

# Filter high-value segments
high_value = segment_analysis.filter(pl.col("total_sales") > 200)

print(f"Nodes in workspace: {len(workspace.nodes)}")
print("Data lineage:")
for node in workspace.get_leaf_nodes():
    print(f"Leaf node: {node.name}")

Development

Running Tests

# Install development dependencies
uv sync --group dev

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=docworkspace

# Run specific test file
uv run pytest tests/test_workspace.py -v

Building Distributions

uv build

This produces a pure-Python wheel (py3-none-any) and a source distribution suitable for PyPI.

Contributing

Fork the repository
Create a feature branch: git checkout -b feature-name
Make your changes and add tests
Run the test suite: uv run pytest
Submit a pull request

Project Structure

docworkspace/
├── .github/
│   └── workflows/         # CI and release automation
├── src/
│   └── docworkspace/
│       ├── __init__.py    # Public package exports
│       ├── node/
│       │   ├── __init__.py
│       │   ├── core.py    # Node implementation
│       │   └── io.py      # Node serialization helpers
│       └── workspace/
│           ├── __init__.py
│           ├── core.py    # Workspace implementation
│           ├── io.py      # Workspace serialization helpers
│           └── analysis.py
├── tests/                 # Test suite
│   ├── conftest.py
│   ├── test_fastapi_integration.py
│   ├── test_node.py
│   ├── test_node_io.py
│   ├── test_simple_operations.py
│   ├── test_workspace.py
│   ├── test_workspace_serialization_types.py
│   └── test_workspace_shim.py
├── PUBLISH.md             # Release runbook
├── README.md              # This file
└── pyproject.toml         # Project configuration

License

Part of the LDaCA (Language Data Commons of Australia) ecosystem.

Changelog

Version 0.2.0

Published on PyPI as docworkspace
PyPI consumers can install the package directly instead of relying on a local workspace checkout
Added release automation and publishing runbook for future releases
Continued support for Polars data types, lazy evaluation, FastAPI integration, and serialization

Related Projects

LDaCA Web App: Full-stack web application using DocWorkspace
Polars: Fast DataFrame library with lazy evaluation

Release history

0.2.2: Apr 7, 2026
0.2.1: Mar 25, 2026 (this version)
0.2.0: Mar 18, 2026

Download files

Source Distribution

docworkspace-0.2.1.tar.gz (24.8 kB), uploaded Mar 25, 2026

Built Distribution

docworkspace-0.2.1-py3-none-any.whl (16.2 kB), uploaded Mar 25, 2026, Python 3

File details

Details for the file docworkspace-0.2.1.tar.gz.

File metadata

Download URL: docworkspace-0.2.1.tar.gz
Upload date: Mar 25, 2026
Size: 24.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docworkspace-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`f3cc39d73bfcc8f58a10ae7caf09e8a2289be067e118f7583df8d97275529c55`
MD5	`b279096a2e867875dde7def448e221fa`
BLAKE2b-256	`680179745dbb1f9229fcddb68833d9ddbc046f5b2ea5a1d00c14ded6450afb09`
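
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: the helper names are made up for the example, and the file path is whatever location the archive was saved to.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for docworkspace-0.2.1.tar.gz.
EXPECTED_SHA256 = "f3cc39d73bfcc8f58a10ae7caf09e8a2289be067e118f7583df8d97275529c55"


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash the file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Compare case-insensitively, since hex digests may be upper or lower case."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_expected(Path("docworkspace-0.2.1.tar.gz"))` should return `True` for an unmodified download of that file.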

Provenance

The following attestation bundles were made for docworkspace-0.2.1.tar.gz:

Publisher: release.yml on Australian-Text-Analytics-Platform/docworkspace

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docworkspace-0.2.1.tar.gz
- Subject digest: f3cc39d73bfcc8f58a10ae7caf09e8a2289be067e118f7583df8d97275529c55
- Sigstore transparency entry: 1182507974
- Sigstore integration time: Mar 25, 2026
Source repository:
- Permalink: Australian-Text-Analytics-Platform/docworkspace@bf1e67694a3d193c6ed4fe855e52dd8f36915dd0
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/Australian-Text-Analytics-Platform
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@bf1e67694a3d193c6ed4fe855e52dd8f36915dd0
- Trigger Event: push

File details

Details for the file docworkspace-0.2.1-py3-none-any.whl.

File metadata

Download URL: docworkspace-0.2.1-py3-none-any.whl
Upload date: Mar 25, 2026
Size: 16.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docworkspace-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`99425cbb18dab473d8b0b602dc95ef03cfd3b49fdf335930a1c9c85a29b54937`
MD5	`7861984902afddbf128733716b8d5868`
BLAKE2b-256	`82ac13edbcb9b3ea74fc998ffe2f3a5d0cb19c172ed8e3dee5b638d033b66329`

Provenance

The following attestation bundles were made for docworkspace-0.2.1-py3-none-any.whl:

Publisher: release.yml on Australian-Text-Analytics-Platform/docworkspace

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docworkspace-0.2.1-py3-none-any.whl
- Subject digest: 99425cbb18dab473d8b0b602dc95ef03cfd3b49fdf335930a1c9c85a29b54937
- Sigstore transparency entry: 1182508146
- Sigstore integration time: Mar 25, 2026
Source repository:
- Permalink: Australian-Text-Analytics-Platform/docworkspace@bf1e67694a3d193c6ed4fe855e52dd8f36915dd0
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/Australian-Text-Analytics-Platform
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@bf1e67694a3d193c6ed4fe855e52dd8f36915dd0
- Trigger Event: push
