Connector components for the Sayou Data Platform
sayou-connector
sayou-connector is the universal data ingestion engine for the Sayou Data Platform. It provides a unified interface to fetch data from diverse sources—Local Files, Web URLs, APIs, and Databases.
Unlike simple HTTP clients or file readers, sayou-connector is designed as a Recursive Crawling Engine. It separates the logic of "What to fetch" (Navigation) from "How to fetch" (Transport), allowing for complex, stateful data collection strategies like web crawling or database pagination.
Philosophy
"Navigate First, Fetch Later." Data collection is not a one-off task; it's a discovery process. We define two distinct roles:
- Generator (Navigator): Determines the next target (e.g., finds the next page URL, calculates DB offset).
- Fetcher (Driver): Executes the retrieval (e.g., sends HTTP GET, executes SQL).
This separation allows the pipeline to be infinitely extensible—from a simple file walker to an AI-powered web crawler.
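To make the separation concrete, here is a minimal, self-contained sketch of the Navigator/Driver split described above. All names (`Task`, `ToyGenerator`, `ToyFetcher`, `crawl`) are illustrative and are not part of the sayou-connector API; the real library wires these roles together inside `ConnectorPipeline`.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    uri: str
    depth: int = 0


class ToyGenerator:
    """Navigator: decides what to fetch next (here: follow links up to max_depth)."""

    def __init__(self, max_depth=1):
        self.max_depth = max_depth

    def expand(self, task, data):
        if task.depth >= self.max_depth:
            return []
        return [Task(uri, task.depth + 1) for uri in data]


class ToyFetcher:
    """Driver: executes the retrieval (here: a static link graph instead of real I/O)."""

    def __init__(self, graph):
        self.graph = graph

    def fetch(self, task):
        return self.graph.get(task.uri, [])


def crawl(seed, fetcher, generator):
    """BFS loop: fetch a task, then ask the generator for the next targets."""
    queue, seen, visited = deque([Task(seed)]), {seed}, []
    while queue:
        task = queue.popleft()
        data = fetcher.fetch(task)
        visited.append(task.uri)
        for nxt in generator.expand(task, data):
            if nxt.uri not in seen:
                seen.add(nxt.uri)
                queue.append(nxt)
    return visited


graph = {"/": ["/a", "/b"], "/a": ["/c"]}
print(crawl("/", ToyFetcher(graph), ToyGenerator(max_depth=2)))
# BFS order: ['/', '/a', '/b', '/c']
```

Because the crawl loop only talks to the two interfaces, swapping `ToyFetcher` for an HTTP client or a SQL cursor changes the transport without touching the navigation logic.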
🚀 Key Features
- Strategy-Based Execution: Switch between `local_scan`, `web_crawl`, or `sql_scan` with a single parameter.
- Recursive & Stateful: Supports BFS/DFS crawling for websites and directories with depth control.
- Smart Filtering: Built-in support for regex-based URL filtering and file extension filtering.
- AI-Ready: Designed to integrate with LLMs (Tier 3 Plugin) to intelligently identify CSS selectors or generate SQL queries dynamically.
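The filtering features above combine a URL regex with an extension allow-list. A minimal sketch of that logic, with an assumed helper name (`matches` is illustrative, not the library's function):

```python
import re
from pathlib import PurePosixPath


def matches(uri, link_pattern=None, extensions=None):
    """Keep a URI only if it matches the regex (when given) and its
    file extension is in the allow-list (when given)."""
    if link_pattern and not re.search(link_pattern, uri):
        return False
    if extensions:
        # Strip any query string before inspecting the extension.
        suffix = PurePosixPath(uri.split("?", 1)[0]).suffix.lstrip(".")
        if suffix not in extensions:
            return False
    return True


print(matches("https://docs.example.com/guide.html", r"^https://docs\.", {"html"}))  # True
print(matches("https://blog.example.com/post.pdf", r"^https://docs\.", {"html"}))    # False
```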
📦 Installation
```
pip install sayou-connector
```
⚡ Quickstart
The ConnectorPipeline manages the loop between Generators and Fetchers.
```python
from sayou.connector.pipeline import ConnectorPipeline


def run_demo():
    # 1. Initialize Pipeline
    pipeline = ConnectorPipeline()
    pipeline.initialize()

    # 2. Run (Example: Web Crawling)
    print("Starting Web Crawl...")
    results = pipeline.run(
        source="BASE_URL",
        strategy="web_crawl",
        # Generator Options
        link_pattern="BASE_PATTERN",
        max_depth=1,
    )

    # 3. Process Results (Stream)
    for res in results:
        print(f"[Fetched] {res.task.uri}")
        # res.data contains extracted content or raw HTML
        # res.task contains metadata


if __name__ == "__main__":
    run_demo()
```
🤝 Contributing
We welcome contributions for new Fetchers (e.g., S3Fetcher, KafkaFetcher) or Generators (e.g., SitemapGenerator).
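As a starting point, a contributed Generator is mostly a matter of producing target URIs. Below is a self-contained sketch in the spirit of the SitemapGenerator mentioned above; the class name and `expand` method are assumptions for illustration, not the project's actual plugin interface.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


class SitemapGenerator:
    """Hypothetical Generator: yields one target URI per <loc> entry
    in a sitemap.xml document, instead of recursively crawling links."""

    def expand(self, sitemap_xml):
        root = ET.fromstring(sitemap_xml)
        return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]


xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(SitemapGenerator().expand(xml_doc))
# ['https://example.com/', 'https://example.com/about']
```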
📜 License
Apache 2.0 License © 2025 Sayouzone
File details
Details for the file sayou_connector-0.1.1.tar.gz.
File metadata
- Download URL: sayou_connector-0.1.1.tar.gz
- Upload date:
- Size: 14.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `dd932455b0b9eadc55ad5a631cf83836167a5530f2650cc6ec5e3b86bd91549f` |
| MD5 | `f4411fb98a1fa21125ecdce926052e56` |
| BLAKE2b-256 | `7544b664ee68b4afda2ac1f43d9676bff50a9cb941d2c9a7d6cc4ffc7a3c5948` |
File details
Details for the file sayou_connector-0.1.1-py3-none-any.whl.
File metadata
- Download URL: sayou_connector-0.1.1-py3-none-any.whl
- Upload date:
- Size: 17.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `39592da7f519de899ecc4e69ec9678f78682d32e3872fd9de3de16b30e4c17aa` |
| MD5 | `94b648f4fbef7665401789b56570267d` |
| BLAKE2b-256 | `1225cb3c4bf594c21eb73cb119ac31152479cf490269edb263cf55087b03f850` |