Sema4.ai Data

⚡️ sema4ai-data

Python library to develop data packages for Sema4.ai. Build powerful data-driven actions that can query databases and work with various data sources. This library is designed to work with Sema4.ai Data Server, which is included in the Sema4.ai Data Access VSCode extension.

Installation

pip install sema4ai-data

Quick Start

from typing import Annotated
from sema4ai.data import query, DataSource, DataSourceSpec
from sema4ai.actions import Response, Table

# Define a data source
PostgresDataSource = Annotated[DataSource, DataSourceSpec(
    name="my_postgres_db",
    engine="postgres",
    description="Main PostgreSQL database"
)]

# Create a data query
@query
def get_users(datasource: PostgresDataSource, limit: int = 10) -> Response[Table]:
    """Get users from the database."""
    # Bind the caller-supplied limit instead of hard-coding it in the SQL.
    result = datasource.query("SELECT * FROM `my_postgres_db`.users LIMIT ?", [limit])
    return Response(result=result.to_table())

Core Concepts

DataSource

The DataSource class is the main interface for executing queries against configured data sources. It's automatically injected by the framework when you use the @query decorator.

Key Methods:

  • query(sql, params=None) - Execute SQL queries with optional parameters
  • native_query(sql, params=None) - Execute engine-specific queries
  • connection() - Get the underlying data server connection

DataSourceSpec

Used to specify the configuration of a data source through type annotations:

from typing import Annotated
from sema4ai.data import DataSource, DataSourceSpec

# Database data source
DatabaseSource = Annotated[DataSource, DataSourceSpec(
    name="my_database",
    engine="postgres",  # or "mysql", "sqlite", etc.
    description="Production database"
)]

# File-based data source
FileSource = Annotated[DataSource, DataSourceSpec(
    engine="files",
    file="data/customers.csv",
    created_table="customers",
    description="Customer data from CSV"
)]

# Knowledge base for semantic search
KnowledgeBaseSource = Annotated[DataSource, DataSourceSpec(
    name="company_kb",
    engine="sema4_knowledge_base",
    description="Company knowledge base for semantic search"
)]

Parameters:

  • engine (required) - The data source engine type
  • name - Name of the data source
  • description - Human-readable description
  • file - File path for file-based sources
  • created_table - Table name created from files
  • setup_sql - SQL commands to run on setup
  • setup_sql_files - SQL files to execute on setup
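
To make the "configuration through type annotations" idea concrete, here is a minimal sketch of how metadata attached with Annotated can be read back from a function signature. The DataSource and DataSourceSpec classes below are simplified stand-ins written for this illustration, and find_specs is a hypothetical helper; none of this is the library's actual discovery code.

```python
from dataclasses import dataclass
from typing import Annotated, Optional, get_type_hints


# Stand-ins for illustration only; the real classes live in sema4ai.data.
class DataSource:
    pass


@dataclass(frozen=True)
class DataSourceSpec:
    engine: str
    name: Optional[str] = None
    description: Optional[str] = None


DatabaseSource = Annotated[DataSource, DataSourceSpec(
    engine="postgres",
    name="my_database",
    description="Production database",
)]


def get_users(datasource: DatabaseSource, limit: int = 10) -> None:
    """Placeholder function whose annotations carry the spec."""


def find_specs(func) -> dict[str, DataSourceSpec]:
    """Hypothetical helper: collect the spec attached to each parameter."""
    hints = get_type_hints(func, include_extras=True)
    specs = {}
    for param, hint in hints.items():
        # Annotated[...] types expose their extra arguments as __metadata__.
        for meta in getattr(hint, "__metadata__", ()):
            if isinstance(meta, DataSourceSpec):
                specs[param] = meta
    return specs


specs = find_specs(get_users)
print(specs["datasource"].engine)
```

Only the parameter annotated with a spec is picked up; plain parameters such as limit carry no metadata and are skipped.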

Decorators

@query

The main decorator for creating data queries that can be executed by sema4ai actions:

from sema4ai.data import query

# PostgresCustomersDataSource is assumed to be an Annotated[DataSource, DataSourceSpec(...)]
# alias defined elsewhere in the package.
@query
def get_countries(datasource: PostgresCustomersDataSource) -> str:
    sql = """
        SELECT DISTINCT country
        FROM public_demo.demo_customers
        LIMIT 100;
    """

    result = datasource.query(sql)
    return result.to_markdown()

Parameters:

  • is_consequential - Whether the action has side effects or updates a resource (default: False)
  • display_name - Custom display name for the action
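
The two parameters are keyword-only, so the decorator can be applied bare (@query) or with arguments (@query(...)). The sketch below shows that pattern with a stand-in decorator that mirrors the documented signature query(func=None, *, is_consequential=None, display_name=None). The stand-in and its options attribute are invented for this illustration; the real decorator registers the function with the actions framework.

```python
import functools


# Stand-in mirroring the documented signature; not the library implementation.
def query(func=None, *, is_consequential=None, display_name=None):
    def decorate(f):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            return f(*args, **kwargs)

        # "options" is a made-up attribute used only to show what was passed.
        wrapper.options = {
            "is_consequential": is_consequential,
            "display_name": display_name,
        }
        return wrapper

    # Bare usage (@query) receives the function directly.
    if func is not None:
        return decorate(func)
    # Parameterized usage (@query(...)) returns the real decorator.
    return decorate


@query
def list_countries():
    return ["FI", "US"]


@query(is_consequential=True, display_name="Archive customer")
def archive_customer(customer_id: int):
    return f"archived {customer_id}"
```

With the real library, the same two call forms apply: leave the defaults for read-only queries and set is_consequential=True for queries that modify data.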

@predict ⚠️ DEPRECATED

Note: The @predict decorator is deprecated as of version 1.0.3. Use @query instead for all operations including predictions.

# OLD (deprecated):
@predict
def predict_something(datasource: SomeDataSource):
    pass

# NEW (recommended):
@query
def predict_something(datasource: SomeDataSource):
    pass

ResultSet

The ResultSet class represents query results and provides various methods to work with the data:

# Convert to different formats
result = datasource.query("SELECT * FROM `my_database`.users")

# As a table for actions
table = result.to_table()

# As a list of dictionaries
dicts = result.to_dict_list()

# As structured objects
from pydantic import BaseModel

class User(BaseModel):
    id: int
    name: str
    email: str

users = result.build_list(User)

# Iterate over results
for row_dict in result.iter_as_dicts():
    print(row_dict)

for row_tuple in result.iter_as_tuples():
    print(row_tuple)
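
As a rough mental model of these conversions, the sketch below implements a tiny stand-in result container over column names and row tuples. MiniResultSet is hypothetical and only mimics the method names listed above. It assumes build_list passes each row's columns as keyword arguments, and it uses a dataclass in place of a pydantic model to stay dependency-free.

```python
from dataclasses import dataclass
from typing import Iterator, TypeVar

T = TypeVar("T")


class MiniResultSet:
    """Hypothetical stand-in; not the sema4ai.data implementation."""

    def __init__(self, columns: list[str], rows: list[tuple]):
        self.columns = columns
        self.rows = rows

    def iter_as_tuples(self) -> Iterator[tuple]:
        yield from self.rows

    def iter_as_dicts(self) -> Iterator[dict]:
        # Pair each value with its column name.
        for row in self.rows:
            yield dict(zip(self.columns, row))

    def to_dict_list(self) -> list[dict]:
        return list(self.iter_as_dicts())

    def build_list(self, item_class: type[T]) -> list[T]:
        # Assumption: item_class accepts the column names as keyword arguments.
        return [item_class(**row) for row in self.iter_as_dicts()]

    def __len__(self) -> int:
        return len(self.rows)


@dataclass
class User:
    id: int
    name: str
    email: str


result = MiniResultSet(
    ["id", "name", "email"],
    [(1, "Ada", "ada@example.com"), (2, "Linus", "linus@example.com")],
)
users = result.build_list(User)
print(users[0].name, len(result))
```

The column names returned by the query must match the field names of the target class, which is why the SQL in typed examples selects explicit columns rather than *.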

Basic Database Query

from typing import Annotated
from pydantic import BaseModel
from sema4ai.data import query, DataSource, DataSourceSpec
from sema4ai.actions import Response

class Product(BaseModel):
    id: int
    name: str
    price: float
    category: str

ProductDB = Annotated[DataSource, DataSourceSpec(
    name="products",
    engine="postgres",
    description="Product catalog database"
)]

@query
def search_products(
    category: str,
    max_price: float,
    datasource: ProductDB
) -> Response[list[Product]]:
    """Search products by category and price."""
    result = datasource.query(
        """
        SELECT id, name, price, category
        FROM products.products
        WHERE category = ? AND price <= ?
        ORDER BY price ASC
        """,
        [category, max_price]
    )
    return Response(result=result.build_list(Product))
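
The ? placeholders in the query above are positional: values are taken from the list in order. To try the same SQL shape without a Data Server, the sketch below runs it against Python's built-in sqlite3 module, which also uses ?-style placeholders. This is only a stand-in: the Data Server is not SQLite and may process parameters differently, and the table contents are made up.

```python
import sqlite3

# In-memory database with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER, name TEXT, price REAL, category TEXT)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        (1, "Kettle", 25.0, "kitchen"),
        (2, "Blender", 80.0, "kitchen"),
        (3, "Lamp", 15.0, "lighting"),
    ],
)


def search_products(category: str, max_price: float) -> list[tuple]:
    """Run the same filter as the example, binding values positionally."""
    cursor = conn.execute(
        """
        SELECT id, name, price, category
        FROM products
        WHERE category = ? AND price <= ?
        ORDER BY price ASC
        """,
        [category, max_price],  # first ? gets category, second ? gets max_price
    )
    return cursor.fetchall()


rows = search_products("kitchen", 50.0)
print(rows)
```

Passing values as parameters rather than formatting them into the SQL string keeps user input from being interpreted as SQL.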

File-based Data Source

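from typing import Annotated
from sema4ai.data import query, DataSource, DataSourceSpec
from sema4ai.actions import Response, Table

SalesData = Annotated[DataSource, DataSourceSpec(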
    engine="files",
    file="data/sales_2024.csv",
    created_table="sales",
    description="Sales data for 2024"
)]

@query
def monthly_sales_report(
    month: int,
    datasource: SalesData
) -> Response[Table]:
    """Generate monthly sales report."""
    result = datasource.query(
        """
        SELECT
            product_category,
            SUM(amount) as total_sales,
            COUNT(*) as transaction_count
        FROM files.sales
        WHERE MONTH(sale_date) = ?
        GROUP BY product_category
        ORDER BY total_sales DESC
        """,
        [month]
    )
    return Response(result=result.to_table())

Knowledge Base Search

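from typing import Annotated
from sema4ai.data import query, DataSource, DataSourceSpec
from sema4ai.actions import Response, Table

KnowledgeBase = Annotated[DataSource, DataSourceSpec(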
    name="company_kb",
    engine="sema4_knowledge_base",
    description="Company knowledge base for semantic search"
)]

@query
def search_knowledge(
    query_text: str,
    datasource: KnowledgeBase,
    relevance_threshold: float = 0.7
) -> Response[Table]:
    """Search company knowledge base."""
    result = datasource.query(
        """
        SELECT chunk_content, relevance_score, document_name
        FROM company_kb
        WHERE content = ? AND relevance_threshold = ?
        ORDER BY relevance_score DESC
        LIMIT 5
        """,
        [query_text, relevance_threshold]
    )
    return Response(result=result.to_table())

Using native_query for Engine-Specific Syntax

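# MyDataSource is assumed to be an Annotated[DataSource, DataSourceSpec(...)]
# alias defined elsewhere in the package.
@query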
def get_user_by_id(
    user_id: int,
    datasource: MyDataSource
) -> Response[Table]:
    """Get user using native SQL syntax."""
    # Uses engine-specific syntax, automatically wrapped
    result = datasource.native_query(
        "SELECT * FROM user_info WHERE id = $id",
        {"id": user_id}
    )
    return Response(result=result.to_table())
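
According to the 0.1.0 changelog entry, native_query() wraps the statement in a SELECT * FROM <datasource_name> (<query>) clause so the inner SQL is passed to the engine in its own dialect. The helper below is a hypothetical sketch of that wrapping step only. It is not the library's code and does not perform parameter substitution.

```python
def wrap_native_query(datasource_name: str, native_sql: str) -> str:
    """Hypothetical helper: wrap engine-native SQL for the data server."""
    # The inner statement is kept verbatim; only surrounding whitespace is trimmed.
    return f"SELECT * FROM {datasource_name} ({native_sql.strip()})"


wrapped = wrap_native_query(
    "my_datasource",
    "SELECT * FROM user_info WHERE id = 42",
)
print(wrapped)
```

Inside the parentheses the table is referenced without the data source prefix, because the inner statement runs in the source engine rather than in the data server.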

API Reference

Functions

query(func=None, *, is_consequential=None, display_name=None)

Decorator for creating query actions.

predict(func=None, *, is_consequential=None, display_name=None) ⚠️ DEPRECATED

Deprecated: Use @query instead. This decorator is deprecated as of version 1.0.3.

get_connection() -> DataServerConnection

Get a connection to the data server.

metadata(package_root: Path) -> dict

Get metadata about data sources in a package.

get_snowflake_connection_details()

Get Snowflake-specific connection configuration.

Classes

DataSource

Main interface for executing queries against data sources.

Methods:

  • query(sql: str, params: list = None) -> ResultSet
  • native_query(sql: str, params: dict = None) -> ResultSet
  • connection() -> DataServerConnection

Properties:

  • datasource_name: str - Name of the data source

DataSourceSpec

Configuration specification for data sources.

ResultSet

Container for query results with conversion methods.

Methods:

  • to_table() -> Table - Convert to sema4ai Table
  • to_dict_list() -> list[dict] - Convert to list of dictionaries
  • build_list(item_class: type[T]) -> list[T] - Build typed object list
  • iter_as_dicts() -> Iterator[dict] - Iterate as dictionaries
  • iter_as_tuples() -> Iterator[tuple] - Iterate as tuples
  • to_dataframe() -> pd.DataFrame - Convert to pandas DataFrame (alias for as_dataframe())
  • to_markdown() -> str - Convert to markdown table

Data Models

SourceInfo

Information about a data source configuration.

TableInfo

Metadata about database tables.

ColumnInfo

Information about table columns.

KnowledgeBaseInfo

Metadata about knowledge base configurations.

Changelog

Unreleased

1.2.1 - 2025-10-30

  • Fix performance hit when creating a Snowflake connection.

1.2.0 - 2025-10-30

  • Add Snowflake connection and query-execution functions to the library to reduce the amount of code customers need to maintain:
    • get_snowflake_connection
    • execute_snowflake_query
    • get_snowflake_connection_details
    • get_snowflake_rest_api_headers
  • Support for Snowflake OAuth linking via Sema4.ai Studio

1.1.0 - 2025-10-21

  • Add support for Snowflake SNOWFLAKE_OAUTH_PARTNER and SNOWFLAKE_OAUTH_CUSTOM auth types.

1.0.10 - 2025-09-08

  • Fix KnowledgeBaseInfo params optionality

1.0.9 - 2025-09-08

  • Implement _get_datasource_info private method on DataServerConnection class

1.0.8 - 2025-08-21

  • CVE updates
  • Expose the underlying SQL error when running a query

1.0.7 - 2025-07-28

  • Improve README and add changelog when publishing to PyPI

1.0.6 - 2025-06-18

  • Simplify error message on run_sql function call.

1.0.5 - 2025-05-20

  • Allow extra fields in sf-auth.json without changing behaviour of get_snowflake_connection_details.

1.0.4 - 2025-05-13

  • Add sema4_knowledge_base engine to support knowledge base as a data source

1.0.3 - 2025-04-24

  • Add deprecation warning for @predict decorator and DataServerConnection.predict method as Lightwood is being phased out for data server predictions. Use @query or connection.query() instead.
  • Update to latest sema4ai-actions version

1.0.2 - 2025-03-06

  • Fix Snowflake local auth file path for Windows

1.0.1 - 2025-02-28

  • Fix private key passphrase handling

1.0.0 - 2025-02-25

  • Add private_key_file_pwd to snowflake connection details when it exists in auth config file
  • SnowflakeAuthenticationError now inherits from ActionError.

0.1.0 - 2025-02-18

  • Added native_query() method which will automatically wrap the query in a SELECT * FROM <datasource_name> (<query>) clause so that the query can be executed in the native SQL syntax of the data source instead of the syntax required by the data server.
  • If no parameters are provided, the query is returned as is (even if parameters are detected in the query -- added so that the user can do the escaping themselves if needed if the SQL syntax accepts the parameters in a different way).

0.0.9 - 2025-02-14

  • Correct the local authentication JSON file path for Snowflake in get_snowflake_connection_details

0.0.8 - 2025-02-14

  • Add get_snowflake_connection_details helper function to get the connection details for Snowflake.

0.0.7 - 2025-02-06

  • Corrected typo in ColumInfo.
  • Updated list_knowledge_bases method to return KnowledgeBaseInfo.

0.0.6 - 2025-01-31

  • Add data utility methods to DataServerConnection

0.0.5 - 2024-12-20

  • Added execute_sql() to the DataSource class.

0.0.4 - 2024-12-19

  • New utility methods for the ResultSet class:
    • to_dataframe() (alias for as_dataframe)
    • to_table() (creates a Table object that can be used to build a structured response)
    • to_dict_list() (returns a list of dictionaries)
    • __iter__() (same as iter_as_dicts)
    • __len__()
  • Retry login if the server returns a 401 error.
  • Retry SQL requests (once) if the server returns an unexpected error (as it may be a transient error).
  • Added sema4ai.data.get_connection() to get the configured connection to the data server.
  • Backward incompatible change: The queries/predictions must always use the full data source name to access a table and not just the table name regardless of the data source name configured in the DataSourceSpec. i.e.: SQL like SELECT * FROM my_datasource.my_table is required instead of SELECT * FROM my_table.

0.0.3 - 2024-11-27

  • Using REST API instead of PyMySQL.
  • ResultSet APIs (provisional):
    • iter_as_dicts() (new in 0.0.3)
    • iter_as_tuples() (new in 0.0.3)
    • as_dataframe() (new in 0.0.1)
    • build_list(item_class) (new in 0.0.1)
    • to_markdown() (new in 0.0.1)

0.0.2 - 2024-11-25

  • Changed metadata format to have _ instead of - in names.
  • Made defined_at/file in metadata relative.
  • Added support for setup_sql_files in DataSourceSpec.
  • Default datasource named models is used for custom and prediction engines.

0.0.1 - 2024-11-18

  • Initial release
  • Added API:
    • from sema4ai.data import query to mark function as @query
    • from sema4ai.data import predict to mark function as @predict
    • from sema4ai.data import DataSource to define a data source
    • from sema4ai.data import DataSourceSpec to define a data source specification using an Annotated type

License

See LICENSE - Sema4.ai End User License Agreement
