
sayou-connector

Connector components for the Sayou Data Platform


sayou-connector is the universal data ingestion engine for the Sayou Data Platform. It provides a unified interface to fetch data from diverse sources—Local Files, Web URLs, APIs, and Databases.

Unlike simple HTTP clients or file readers, sayou-connector is designed as a Recursive Crawling Engine. It separates the logic of "What to fetch" (Navigation) from "How to fetch" (Transport), allowing for complex, stateful data collection strategies like web crawling or database pagination.

Philosophy

"Navigate First, Fetch Later." Data collection is not a one-off task; it's a discovery process. We define two distinct roles:

  1. Generator (Navigator): Determines the next target (e.g., finds the next page URL, calculates DB offset).
  2. Fetcher (Driver): Executes the retrieval (e.g., sends HTTP GET, executes SQL).

This separation makes the pipeline highly extensible, from a simple file walker to an AI-powered web crawler.
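
To make the split concrete, here is a minimal sketch of how the two roles cooperate. Everything in it (the Task and Result shapes, Generator.next_targets, Fetcher.fetch, and the loop itself) is an illustrative assumption for this sketch, not the library's actual API:

from dataclasses import dataclass

@dataclass
class Task:
    uri: str
    depth: int = 0

@dataclass
class Result:
    task: Task
    data: bytes

class Generator:
    """Navigator: decides WHAT to fetch next."""
    def next_targets(self, completed: Result) -> list[Task]:
        raise NotImplementedError

class Fetcher:
    """Driver: decides HOW to fetch a single target."""
    def fetch(self, task: Task) -> Result:
        raise NotImplementedError

# The pipeline loop alternates between the two roles:
# fetch a task, hand the result to the generator, enqueue new tasks.
def run(seed: Task, generator: Generator, fetcher: Fetcher, max_depth: int):
    queue = [seed]
    while queue:
        task = queue.pop(0)            # FIFO queue gives BFS ordering
        result = fetcher.fetch(task)
        yield result
        if task.depth < max_depth:
            queue.extend(generator.next_targets(result))

Because neither role knows about the other, swapping an HTTP fetcher for a SQL fetcher (or a link-following generator for an offset-paginating one) leaves the loop untouched.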

🚀 Key Features

  • Strategy-Based Execution: Switch between local_scan, web_crawl, or sql_scan with a single parameter (see the sketch after this list).
  • Recursive & Stateful: Supports BFS/DFS crawling for websites and directories with depth control.
  • Smart Filtering: Built-in support for regex-based URL filtering and file extension filtering.
  • AI-Ready: Designed to integrate with LLMs (Tier 3 Plugin) to intelligently identify CSS selectors or generate SQL queries dynamically.
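
For example, pointing the same pipeline at a local directory should only require changing the strategy argument. This sketch reuses the ConnectorPipeline API from the Quickstart below; the extensions option name is a hypothetical illustration of the extension filtering listed above, so check the docs for the real one:

from sayou.connector.pipeline import ConnectorPipeline

pipeline = ConnectorPipeline()
pipeline.initialize()

# Same pipeline, different strategy: walk a local directory tree.
results = pipeline.run(
    source="/path/to/data",        # root directory instead of a URL
    strategy="local_scan",         # was "web_crawl" in the Quickstart
    extensions=[".csv", ".json"],  # hypothetical option name for extension filtering
    max_depth=2,                   # recurse two directory levels deep
)

for res in results:
    print(f"[Fetched] {res.task.uri}")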

📦 Installation

pip install sayou-connector

⚡ Quickstart

The ConnectorPipeline manages the loop between Generators and Fetchers.

from sayou.connector.pipeline import ConnectorPipeline

def run_demo():
    # 1. Initialize Pipeline
    pipeline = ConnectorPipeline()
    pipeline.initialize()

    # 2. Run (Example: Web Crawling)
    print("Starting Web Crawl...")
    results = pipeline.run(
        source="BASE_URL",            # placeholder: the start URL to crawl
        strategy="web_crawl",
        # Generator Options
        link_pattern="BASE_PATTERN",  # placeholder: regex used to filter links
        max_depth=1
    )

    # 3. Process Results (Stream)
    for res in results:
        print(f"[Fetched] {res.task.uri}")
        # res.data contains extracted content or raw HTML
        # res.task contains metadata

if __name__ == "__main__":
    run_demo()

🤝 Contributing

We welcome contributions for new Fetchers (e.g., S3Fetcher, KafkaFetcher) or Generators (e.g., SitemapGenerator).
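
If you want a starting point, a contributed Fetcher only has to implement the retrieval side. The shape below, including the fetch() signature and the returned dict, is an assumption made for this sketch rather than the project's actual extension contract:

import boto3  # AWS SDK; an S3Fetcher would likely depend on it

class S3Fetcher:
    """Driver for s3://bucket/key URIs. The fetch() signature and return
    shape here are assumptions for this sketch, not the real contract."""

    def __init__(self):
        self.client = boto3.client("s3")

    def fetch(self, task):
        # Split "s3://bucket/key/parts" into bucket and key.
        bucket, _, key = task.uri.removeprefix("s3://").partition("/")
        body = self.client.get_object(Bucket=bucket, Key=key)["Body"].read()
        return {"task": task, "data": body}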

📜 License

Apache 2.0 License © 2025 Sayouzone
