Boil your tabular data down to serializations.

Maintainers

daniel_gomm

Table Serialization Kitchen

Use the Table Serialization Kitchen to boil your tabular data down to serializations. Provide a recipe and some ingredients and serialize your tables in no time. You can easily spice things up by extending Table Serialization Kitchen with your own serialization ideas!

This blog post gives an example of how to use table serialization kitchen for rapid experimentation with table serializations.

Disclaimer: This project is still in the baking! Some things will still change.

What is it useful for?

Table Serialization Kitchen converts tabular data into textual formats that Large Language Models (LLMs) can process. This process, known as table serialization, is a key step in tasks such as question answering and text-to-SQL generation. Table serializations are also useful as a basis for text embeddings for dense retrieval over tabular data. Because the choice of serialization strategy can affect how well LLMs and embedding models perform on tabular data, experimenting with different strategies helps you find one that works well for your task.

Table serialization kitchen provides a foundation for exploring table serialization approaches. It lets you adjust how tables are serialized, set up replicable experiments, and identify the serialization method that works best for your use case. Whether you're a data scientist, NLP practitioner, or machine learning engineer, Table Serialization Kitchen helps you apply LLMs to tasks involving structured data.

Installation

Install table serialization kitchen from PyPI:

pip install tableserializer

Install from source

To install Table Serialization Kitchen from source, clone the repository and run the following command from the root directory of the repository:

pip install -e .
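After either installation route, it can be handy to confirm which version ended up in the environment. The following is a minimal sketch using only the Python standard library; the helper name installed_version is made up for this illustration and is not part of tableserializer.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; it only queries package metadata.
    """
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("tableserializer"))
```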

Usage

This description provides a high-level overview of the table serialization kitchen. For more details, consult the documentation. If you want to see an example of the package in action, have a look at this blog post.

The central components for creating serializers with table serialization kitchen are recipes, component serializers for metadata, schema, and raw tables, row samplers, and table preprocessors. These components are combined into a central Serializer object.

Serializer

The Serializer class is the central instance in table serialization kitchen. It integrates the different components into a single instance that handles the serialization.

from tableserializer.serializer import Serializer
from tableserializer.serializer.table import MarkdownRawTableSerializer
from tableserializer.serializer.metadata import PairwiseMetadataSerializer
from tableserializer.recipe import SerializationRecipe
from tableserializer.table.row_sampler import RandomRowSampler
from tableserializer.table.preprocessor import StringTruncationPreprocessor

# Define recipe
recipe = SerializationRecipe("Metadata:\n{META}\n\nTable:\n{TABLE}")

# Create metadata serializer
metadata_serializer = PairwiseMetadataSerializer()

# Create raw table serializer
table_serializer = MarkdownRawTableSerializer()

# Create row sampler
row_sampler = RandomRowSampler(rows_to_sample=2, deterministic=False)

# Create table preprocessor
table_preprocessor = StringTruncationPreprocessor(max_len=100, apply_before_row_sampling=False)

# Put everything together into a Serializer
serializer = Serializer(recipe=recipe,
                        metadata_serializer=metadata_serializer,
                        table_serializer=table_serializer,
                        row_sampler=row_sampler,
                        table_preprocessors=[table_preprocessor])

Call the serializer's serialize method with a Table and, optionally, some metadata. It returns a serialization that follows the recipe and components the serializer was configured with.

import pandas as pd
from tableserializer.table import Table

example_df = pd.DataFrame([[2012, "From the Rough", "Edward"],
                           [1997, "The Borrowers", "Peagreen Clock"],
                           [2013, "In Secret", "Camille Raquin"]], columns=["Year", "Title", "Role"])

example_table = Table(example_df)

example_metadata = {"table_page_title": "Tom Felton", "table_section_title": "Films"}

# Serialize the example table
serialization = serializer.serialize(table=example_table, metadata=example_metadata)

print(serialization)

Example output (the RandomRowSampler picks two rows at random, so the rows shown may differ between runs):

Metadata:
table_page_title: Tom Felton
table_section_title: Films

Table:
| Year | Title| Role |
|---|---|---|
| 2013 | In Secret| Camille Raquin |
| 2012 | From the Rough | Edward |

Recipe

The recipe provides the overall outline of the serialization. It is defined as a SerializationRecipe instance and contains placeholders that are filled in dynamically during serialization.

from tableserializer import SerializationRecipe

# Recipe with all placeholders
recipe = SerializationRecipe(
"""Metadata: 
{META}

Schema:
{SCHEMA}

Table:
{TABLE}
""")

# Recipe with a placeholder for raw table contents only
simple_recipe = SerializationRecipe(
"""Table:
{TABLE}
""")

You can place the placeholders {META}, {SCHEMA}, and {TABLE} in the recipe and design the text around them to your taste. A recipe does not need to use all of the placeholders.

The placeholders are filled in at serialization time. The {META} placeholder reserves a space for metadata related to the table, the {SCHEMA} placeholder for the serialized schema of the table, and the {TABLE} placeholder for the serialized raw table contents. The value filled into each placeholder is generated by a component-specific serializer.
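To make the placeholder mechanism concrete, here is a minimal stand-alone sketch of how a recipe string could be filled. It is not the library's implementation: the helper fill_recipe and the choice to leave unmatched placeholders untouched are assumptions made for this illustration.

```python
import re
from typing import Dict

# Hypothetical helper, not part of tableserializer: fills {META}, {SCHEMA},
# and {TABLE} in a recipe string with already-serialized component texts.
PLACEHOLDER_PATTERN = re.compile(r"\{(META|SCHEMA|TABLE)\}")


def fill_recipe(template: str, parts: Dict[str, str]) -> str:
    # A single regex pass means text inserted for one placeholder is never
    # re-scanned, even if it happens to contain "{TABLE}" itself.
    # Placeholders without an entry in `parts` are left as they are
    # (an assumption of this sketch).
    return PLACEHOLDER_PATTERN.sub(
        lambda match: parts.get(match.group(1), match.group(0)), template
    )


recipe_text = "Metadata:\n{META}\n\nTable:\n{TABLE}"
filled = fill_recipe(recipe_text, {"META": "title: Films", "TABLE": "Year,Title"})
print(filled)
```

Using one substitution pass rather than chained replace calls avoids accidentally expanding placeholder-like text that appears inside the table contents or metadata.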

Component Serializers

Component serializers are tasked with serializing a specific component within the full serialization recipe. You can use one of the pre-built component serializers, or implement your own serializer.

Metadata Serializers

A metadata serializer serializes table-related metadata. Metadata serializers extend the MetadataSerializer base class. Metadata is expected to have the format of a dictionary Dict[str, Any]. It is serialized by the serialize_metadata function that a MetadataSerializer implementation must override.

from typing import Dict, Any
from tableserializer.serializer.metadata import MetadataSerializer

class ExampleMetadataSerializer(MetadataSerializer):

    def serialize_metadata(self, metadata: Dict[str, Any]) -> str:
        # This metadata serializer serializes the metadata as a newline separated concatenation of the values in the 
        # metadata dictionary
        return "\n".join(str(value) for value in metadata.values())

Table serialization kitchen provides two default implementations of the MetadataSerializer base class:

PairwiseMetadataSerializer: Serializes the metadata as "key: value" pairs.
JSONMetadataSerializer: Serializes the metadata dictionary as a JSON string.
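The two formats described above can be illustrated with plain functions. This is only a sketch of the output shapes, assuming one "key: value" pair per line (as in the example output earlier) and default JSON formatting; the function names are hypothetical and the library classes may format details differently.

```python
import json
from typing import Any, Dict


def pairwise(metadata: Dict[str, Any]) -> str:
    # One "key: value" pair per line, in dictionary order.
    return "\n".join(f"{key}: {value}" for key, value in metadata.items())


def as_json(metadata: Dict[str, Any]) -> str:
    # Compact single-line JSON using the json module's default separators.
    return json.dumps(metadata)


example_metadata = {"table_page_title": "Tom Felton", "table_section_title": "Films"}
pairwise_text = pairwise(example_metadata)
json_text = as_json(example_metadata)
print(pairwise_text)
print(json_text)
```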

Schema Serializers

A schema serializer creates a serialization of the schema of a table. Schema serializers extend the SchemaSerializer base class. Schemas are serialized through the serialize_schema function that a SchemaSerializer implementation must override.

from typing import Dict, Any, Optional
from tableserializer.table import Table
from tableserializer.serializer.schema import SchemaSerializer


class ExampleSchemaSerializer(SchemaSerializer):
    
    def serialize_schema(self, table: Table, metadata: Optional[Dict[str, Any]] = None) -> str:
        # This schema serializer serializes the schema as a comma separated list of column names
        # DataFrame.columns is an attribute (an Index), not a method
        column_names = table.as_dataframe().columns
        return ", ".join(str(column_name) for column_name in column_names)

Table serialization kitchen provides two default implementations of the SchemaSerializer base class:

ColumnNameSchemaSerializer: Serializes the schema as a concatenation of the column names in the table, delimited by a specified delimiter.
SQLSchemaSerializer: Serializes the schema as a SQL CREATE TABLE statement.
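As a rough illustration of what a SQL-style schema serialization can look like, the sketch below builds a CREATE TABLE statement from column names and types. The function names, the table name "films", and the chosen SQL types are assumptions for this example; the statement produced by SQLSchemaSerializer may differ in naming, typing, and layout.

```python
from typing import List, Tuple


def quote_identifier(name: str) -> str:
    # Standard SQL quoting: wrap in double quotes and double any embedded quote.
    return '"' + name.replace('"', '""') + '"'


def create_table_statement(table_name: str, columns: List[Tuple[str, str]]) -> str:
    # `columns` is a list of (column name, SQL type) pairs.
    column_lines = ",\n".join(
        f"    {quote_identifier(name)} {sql_type}" for name, sql_type in columns
    )
    return f"CREATE TABLE {quote_identifier(table_name)} (\n{column_lines}\n);"


schema_sql = create_table_statement(
    "films", [("Year", "INTEGER"), ("Title", "TEXT"), ("Role", "TEXT")]
)
print(schema_sql)
```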

Raw Table Serializers

A raw table serializer generates a serialized representation of a raw table and its contents. Raw table serializers extend the RawTableSerializer base class. Table contents are serialized through the serialize_raw_table function that a RawTableSerializer implementation must override.

from tableserializer.table import Table
from tableserializer.serializer.table import RawTableSerializer


class ExampleRawTableSerializer(RawTableSerializer):
    
    def serialize_raw_table(self, table: Table) -> str:
        # This raw table serializer serializes the table as csv
        return table.as_dataframe().to_csv(index=False)

Table serialization kitchen provides a collection of default implementations of the RawTableSerializer base class:

MarkdownRawTableSerializer: Serializes the table contents in Markdown table format.
JSONRawTableSerializer: Serializes raw tables to row-wise JSON representations.
CSVRawTableSerializer: Serializes the table contents in CSV format.
LatexRawTableSerializer: Serializes the table contents as a LaTeX table.
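To show what a raw table serialization involves, here is a small dependency-free sketch that renders a header and rows as a Markdown table. The function to_markdown is hypothetical and works on plain lists rather than the library's Table class; spacing and escaping in MarkdownRawTableSerializer may differ.

```python
from typing import Sequence


def to_markdown(header: Sequence[object], rows: Sequence[Sequence[object]]) -> str:
    def line(cells: Sequence[object]) -> str:
        # Escape pipes so cell contents cannot break the column layout.
        return "| " + " | ".join(str(cell).replace("|", "\\|") for cell in cells) + " |"

    # One "---" segment per column, e.g. "|---|---|---|" for three columns.
    separator = "|" + "---|" * len(header)
    return "\n".join([line(header), separator, *(line(row) for row in rows)])


markdown_table = to_markdown(
    ["Year", "Title", "Role"],
    [[2013, "In Secret", "Camille Raquin"], [2012, "From the Rough", "Edward"]],
)
print(markdown_table)
```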

Row Sampler

A row sampler samples a set number of rows from a table to limit the size of the table in the serialization. Row samplers extend the RowSampler base class. Rows are sampled through the sample method that a RowSampler implementation must override.

from tableserializer.table.row_sampler import RowSampler
from tableserializer.table import Table

class ExampleRowSampler(RowSampler):
    
    def sample(self, table: Table) -> Table:
        # This row sampler samples the last rows from the dataframe
        # (tail also handles rows_to_sample == 0, where a [-0:] slice would return every row)
        last_rows_only_df = table.as_dataframe().tail(self.rows_to_sample).reset_index(drop=True)
        return Table(last_rows_only_df)

Table serialization kitchen provides a collection of default implementations of the RowSampler base class:

RandomRowSampler: Samples rows at random.
FirstRowSampler: Samples the first rows of the table.
KMeansRowSampler: Samples a diverse set of rows by employing k-means clustering.
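The idea behind random row sampling, and why a fixed seed matters for replicable experiments, can be sketched without any dependencies. The function sample_rows and its seed parameter are assumptions for this illustration, and keeping the sampled rows in their original order is a choice of this sketch; it does not describe how RandomRowSampler is implemented.

```python
import random
from typing import List, Optional, Sequence


def sample_rows(
    rows: Sequence[Sequence[object]], rows_to_sample: int, seed: Optional[int] = None
) -> List[Sequence[object]]:
    # If the table is already small enough, keep every row.
    if rows_to_sample >= len(rows):
        return list(rows)
    # A dedicated Random instance with a fixed seed gives repeatable samples
    # without touching the global random state.
    rng = random.Random(seed)
    chosen_indices = sorted(rng.sample(range(len(rows)), rows_to_sample))
    return [rows[index] for index in chosen_indices]


example_rows = [
    [2012, "From the Rough", "Edward"],
    [1997, "The Borrowers", "Peagreen Clock"],
    [2013, "In Secret", "Camille Raquin"],
]
sample_a = sample_rows(example_rows, 2, seed=0)
print(sample_a)
```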

Table Preprocessors

Table preprocessors transform the raw table before serialization, for example to compress its contents: a table with very long string values can produce serializations that are too long for embedding models. Table preprocessors extend the TablePreprocessor base class, and an implementation must override the process method. The apply_before_row_sampling field specifies whether the preprocessor runs before or after rows are sampled. Column filtering can sensibly run before row sampling, whereas preprocessors that compress strings (e.g., by generating summaries) may best be applied after row sampling, when fewer rows remain to be processed.

from tableserializer.table.preprocessor import TablePreprocessor
from tableserializer.table import Table

class ExampleTablePreprocessor(TablePreprocessor):
    
    def __init__(self):
        super().__init__(apply_before_row_sampling=True)
    
    def process(self, table: Table) -> Table:
        # Preprocessor that removes the first column of the table
        table_df = table.as_dataframe()
        transformed_df = table_df.drop(columns=table_df.columns[0])
        return Table(transformed_df)

Table serialization kitchen provides two default implementations of the TablePreprocessor base class:

ColumnDroppingPreprocessor: Transforms a table by dropping specified columns.
StringTruncationPreprocessor: Truncates all strings in the table to a set maximum length before serialization.
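The interplay between preprocessing and row sampling can be sketched on plain lists. Everything here (truncate_strings, run_pipeline, and the boolean flag mirroring apply_before_row_sampling) is a hypothetical illustration of the ordering described above, not the library's internal pipeline.

```python
from typing import Callable, List, Tuple

Rows = List[List[object]]
Step = Callable[[Rows], Rows]


def truncate_strings(max_len: int) -> Step:
    # Returns a step that cuts every string cell down to at most max_len characters.
    def step(rows: Rows) -> Rows:
        return [
            [cell[:max_len] if isinstance(cell, str) else cell for cell in row]
            for row in rows
        ]

    return step


def run_pipeline(rows: Rows, preprocessors: List[Tuple[Step, bool]], sampler: Step) -> Rows:
    # Each preprocessor is paired with a flag: True means "run before sampling".
    for step, before_sampling in preprocessors:
        if before_sampling:
            rows = step(rows)
    rows = sampler(rows)
    for step, before_sampling in preprocessors:
        if not before_sampling:
            rows = step(rows)
    return rows


def first_two(rows: Rows) -> Rows:
    return rows[:2]


result = run_pipeline(
    [[1997, "The Borrowers"], [2012, "From the Rough"], [2013, "In Secret"]],
    [(truncate_strings(6), False)],
    first_two,
)
print(result)
```

Running the truncation after sampling means only the two sampled rows are processed, which matters more when the preprocessing step is expensive.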

Table Serialization Kitchen

WIP: Section still in the baking!

Hungry for more? Have a look at the table serialization kitchen API documentation.

Release history

0.1.0 (this version): Mar 4, 2025
0.0.3: Feb 28, 2025
0.0.2: Feb 21, 2025

Download files

Source Distribution

tableserializer-0.1.0.tar.gz (21.2 kB), uploaded Mar 4, 2025, Source

Built Distribution

tableserializer-0.1.0-py3-none-any.whl (22.2 kB), uploaded Mar 4, 2025, Python 3

File details

Details for the file tableserializer-0.1.0.tar.gz.

File metadata

Download URL: tableserializer-0.1.0.tar.gz
Upload date: Mar 4, 2025
Size: 21.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for tableserializer-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`0f339ac170612a4112940085cfcaa61f26d292a4be8c88741271778cb8afc1e4`
MD5	`7fa22fd541826cb7232015f38a777a7c`
BLAKE2b-256	`a252de180f53874c34d7f51e9f3d4484e8809f27110789328f8eb94e97ca4807`

See more details on using hashes here.
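A downloaded file can be checked against the SHA256 digest listed above with the standard library alone. This is a minimal sketch; the helper names and the local file path are assumptions, and the expected digest is the one published for tableserializer-0.1.0.tar.gz.

```python
import hashlib
from pathlib import Path
from typing import Union

# SHA256 digest published above for tableserializer-0.1.0.tar.gz
EXPECTED_SHA256 = "0f339ac170612a4112940085cfcaa61f26d292a4be8c88741271778cb8afc1e4"


def sha256_of_file(path: Union[str, Path], chunk_size: int = 65536) -> str:
    # Read the file in chunks so large downloads do not need to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Union[str, Path], expected_hex: str) -> bool:
    return sha256_of_file(path) == expected_hex.lower()


# Example (assumes the archive was downloaded to the current directory):
# print(matches_expected("tableserializer-0.1.0.tar.gz", EXPECTED_SHA256))
```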

Provenance

The following attestation bundles were made for tableserializer-0.1.0.tar.gz:

Publisher: publish.yml on daniel-gomm/table-serialization-kitchen

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tableserializer-0.1.0.tar.gz
- Subject digest: 0f339ac170612a4112940085cfcaa61f26d292a4be8c88741271778cb8afc1e4
- Sigstore transparency entry: 177020307
- Sigstore integration time: Mar 4, 2025
Source repository:
- Permalink: daniel-gomm/table-serialization-kitchen@c9c48a9b4e7e3f34049464ecc02d7c902e0e79d3
- Branch / Tag: refs/tags/0.1.0
- Owner: https://github.com/daniel-gomm
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@c9c48a9b4e7e3f34049464ecc02d7c902e0e79d3
- Trigger Event: release

File details

Details for the file tableserializer-0.1.0-py3-none-any.whl.

File metadata

Download URL: tableserializer-0.1.0-py3-none-any.whl
Upload date: Mar 4, 2025
Size: 22.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for tableserializer-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ffb612c79ed11c88ab11dc7c65e85eea0ddb60e8ca34ec75c4d6aa06436f042b`
MD5	`503effe41cf02bd509f78c509afa64e6`
BLAKE2b-256	`6a954fc90f0f29529d933d3e575a89c24e2297321de1275083c302f5913c3e13`

See more details on using hashes here.

Provenance

The following attestation bundles were made for tableserializer-0.1.0-py3-none-any.whl:

Publisher: publish.yml on daniel-gomm/table-serialization-kitchen

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tableserializer-0.1.0-py3-none-any.whl
- Subject digest: ffb612c79ed11c88ab11dc7c65e85eea0ddb60e8ca34ec75c4d6aa06436f042b
- Sigstore transparency entry: 177020310
- Sigstore integration time: Mar 4, 2025
Source repository:
- Permalink: daniel-gomm/table-serialization-kitchen@c9c48a9b4e7e3f34049464ecc02d7c902e0e79d3
- Branch / Tag: refs/tags/0.1.0
- Owner: https://github.com/daniel-gomm
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@c9c48a9b4e7e3f34049464ecc02d7c902e0e79d3
- Trigger Event: release
