
Graph-oriented Synthetic data generation Pipeline library


SyGra: Graph-oriented Synthetic data generation Pipeline



SyGra is a framework for building complex synthetic data pipelines by configuring and visualizing each pipeline as a computational graph. It uses LangGraph as the underlying graph configuration and execution library. Refer to the LangGraph examples to get a sense of the kinds of computational graphs that can be configured.

Introduction

The SyGra framework is built to generate synthetic data. Because defining the generation flow is a complex process, SyGra's graph-based design simplifies it. The platform supports the following:

  • Defining the seed data configuration
  • Defining a task, which involves graph node configuration, the flow between nodes, and the conditions between nodes
  • Defining the output location where the generated data is written

Seed data can be pulled from either Hugging Face or the file system. Once the seed data is loaded, SyGra lets users write custom data processing with the data transformation module. When the data is ready, users define the data flow using various types of nodes. A node can also be a subgraph defined in another YAML file.

Each node can be defined with preprocessing, postprocessing, and an LLM prompt with model parameters. Prompts can reference seed data fields as Python template keys.
Edges define the flow between nodes and can be conditional or non-conditional, with support for parallel and one-to-many flows.
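
To make the idea of seed data as template keys concrete, here is a minimal sketch in plain Python. It assumes the placeholders behave like `str.format` fields, which the description above suggests but does not confirm. The record fields, the prompt text, and the `render_prompt` helper are hypothetical and are not part of SyGra's API.

```python
# Hypothetical seed record and prompt; the field names are illustrative only.
seed_record = {"language": "Python", "topic": "binary search"}
prompt_template = "Write a {language} function that implements {topic}."


def render_prompt(template: str, record: dict) -> str:
    """Fill the template's {key} placeholders from one seed record.

    Assumption: placeholders follow str.format syntax.
    """
    return template.format(**record)


prompt = render_prompt(prompt_template, seed_record)
print(prompt)
```

A key that is missing from the record raises `KeyError` in this sketch, which makes a mismatch between the prompt and the seed data visible early.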

At the end, the generated data for each record is collected in the graph state and processed further into the final dictionary, which is written to the configured data sink.

[Figure: SygraFramework]


Installation

Pick how you want to use SyGra:


Which one should I choose?

  • Framework → Run end-to-end pipelines from YAML graphs + CLI tooling and project scaffolding. (Start here: Installation)

  • Library → Import SyGra in your own Python app/notebook; call APIs directly. (Start here: SyGra Library)

[!NOTE]
Before running the commands below, make sure to add your model configuration in config/models.yaml and set environment variables for credentials and chat templates as described in the Model Configuration docs.

TL;DR – Framework Setup

See full steps in Installation.

git clone git@github.com:ServiceNow/SyGra.git

cd SyGra
poetry run python main.py --task examples.glaive_code_assistant --num_records=1

TL;DR – Library Setup

See full steps in SyGra Library.

pip install sygra

import sygra

workflow = sygra.Workflow("tasks/examples/glaive_code_assistant")
workflow.run(num_records=1)

Quick Start

[!NOTE] To get started with SyGra, refer to the Example Tasks or the SyGra Documentation.


Components

The SyGra architecture is composed of multiple components. The following diagrams illustrate the four primary components and their associated modules.

Data Handler

The data handler reads and writes data. It currently supports a file handler for various file types and a Hugging Face handler. When reading from Hugging Face, it can either load and process the whole dataset or stream the data in chunks.
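
The difference between loading a whole dataset and streaming it in chunks can be sketched with the standard library alone. This is an illustration of the two reading modes, not SyGra's data handler. The function names `read_all` and `stream_chunks`, the JSON Lines input, and the chunk size are all assumptions made for the example.

```python
import io
import json
from itertools import islice
from typing import Iterable, Iterator


def read_all(lines: Iterable[str]) -> list[dict]:
    """Load every record into memory at once."""
    return [json.loads(line) for line in lines if line.strip()]


def stream_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[list[dict]]:
    """Yield records in fixed-size chunks so only one chunk is held at a time."""
    records = (json.loads(line) for line in lines if line.strip())
    while True:
        chunk = list(islice(records, chunk_size))
        if not chunk:
            return
        yield chunk


# An in-memory JSON Lines source stands in for a file or remote dataset.
jsonl = '{"id": 1}\n{"id": 2}\n{"id": 3}\n'
whole = read_all(io.StringIO(jsonl))
chunks = list(stream_chunks(io.StringIO(jsonl), chunk_size=2))
```

Streaming trades a little bookkeeping for bounded memory use, which matters when the seed dataset is larger than what fits comfortably in memory.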

[Figure: DataHandler]

Graph Node Module

This module builds the different kinds of nodes, such as the LLM node, Multi-LLM node, Lambda node, and Agent node. Each node type serves a specific purpose; for example, the Multi-LLM node load-balances data across multiple inference endpoints running the same model.
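
As an illustration of load balancing across endpoints that serve the same model, the sketch below uses a simple round-robin policy. The text does not say which balancing strategy SyGra uses, so round-robin is an assumption here, and the `RoundRobinBalancer` class and the endpoint URLs are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out endpoints in rotation (an assumed policy, not SyGra's)."""

    def __init__(self, endpoints):
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints = cycle(endpoints)

    def next_endpoint(self):
        return next(self._endpoints)


# Hypothetical inference endpoints that all serve the same model.
endpoints = ["http://host-a:8000", "http://host-b:8000"]
balancer = RoundRobinBalancer(endpoints)

# Assign four records to endpoints in turn.
assignments = [(record_id, balancer.next_endpoint()) for record_id in range(4)]
```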

[Figure: Nodes]

Graph Edge Connection

Once nodes are built, they can be connected with simple edges or conditional edges. A conditional edge uses Python code to decide the path, which makes it possible to implement if-else flows as well as loops in the graph.
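
The sketch below shows how a conditional edge can express both a branch and a loop. It is a toy executor written in plain Python to illustrate the concept; it is neither SyGra's nor LangGraph's API. The node `generate`, the routing function `route_after_generate`, the `END` marker, and the state keys are all hypothetical.

```python
END = "END"  # hypothetical marker for the end of the graph


def generate(state: dict) -> dict:
    """Stand-in for an LLM node: each call records one more attempt."""
    state["attempts"] += 1
    state["answer"] = f"draft {state['attempts']}"
    return state


def route_after_generate(state: dict) -> str:
    """Conditional edge: loop back to the node until the attempt limit is hit."""
    if state["attempts"] >= state["max_attempts"]:
        return END
    return "generate"


nodes = {"generate": generate}
conditional_edges = {"generate": route_after_generate}


def run(state: dict, start: str = "generate") -> dict:
    """Execute nodes, following each node's conditional edge to the next one."""
    current = start
    while current != END:
        state = nodes[current](state)
        current = conditional_edges[current](state)
    return state


final_state = run({"attempts": 0, "max_attempts": 3})
```

Returning the node's own name from the routing function creates the loop, and returning `END` breaks out of it, so a single function covers both the if-else and the loop case.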

[Figure: Edges]

Model clients

SyGra does not run inference within the framework; instead, it provides several clients for connecting to different kinds of model servers. For example, the OpenAI client works with Hugging Face TGI, vLLM servers, and Azure services. The client cannot be changed through the model configuration, but it can be configured in the model code.

[Figure: ModelClient]

Task Components

SyGra is designed for extensibility and ease of implementation: most tasks are defined as graph configuration YAML files. Each task consists of two major components: a graph configuration and Python code that defines conditions and processors. The YAML file contains the following parts:

  • Data configuration: Configure a file or a Hugging Face dataset as the source and sink for the task.
  • Data transformation: Configure how the data is transformed into a format that can be used in the graph.
  • Node configuration: Configure the nodes and their properties, preprocessors, and postprocessors.
  • Edge configuration: Connect the nodes configured above, with or without conditions.
  • Output configuration: Configure the data transformation applied before the data is written to the sink.

A node is defined by the node module, supporting types like LLM call, multiple LLM call, lambda node, and sampler node.

LLM-based nodes require a model configured in models.yaml and runtime parameters. Sampler nodes pick random samples from static YAML lists. For custom node types, you can implement new nodes in the platform.
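
The sampler behaviour described above, picking random values from static lists, can be sketched as follows. This is an illustration only: the `sample_attributes` helper, the attribute names, and the values are hypothetical, and in SyGra the lists would come from the task YAML rather than from a Python dictionary.

```python
import random


def sample_attributes(choices: dict, rng: random.Random) -> dict:
    """Pick one random value per attribute from its static list."""
    return {name: rng.choice(values) for name, values in choices.items()}


# Hypothetical static lists; in SyGra these would be defined in YAML.
choices = {
    "tone": ["formal", "casual"],
    "difficulty": ["easy", "medium", "hard"],
}

# A seeded generator keeps the sketch reproducible between runs.
rng = random.Random(0)
sampled = sample_attributes(choices, rng)
```

The sampled values can then feed prompt placeholders, which is one way to add variety to generated records without changing the seed data.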

As of now, LLM inference is supported for TGI, vLLM, Azure, Azure OpenAI, Ollama and Triton compatible servers. Model deployment is external and configured in models.yaml.

Contact

To contact us, please send us an email!

License

The package is licensed by ServiceNow, Inc. under the Apache 2.0 license. See LICENSE for more details.


Questions?
Open an issue or start a discussion! Contributions are welcome.
