# sayou-connector

Connector components for the Sayou Data Platform.

**The Universal Data Ingestion Engine for Sayou Fabric.**
`sayou-connector` provides a unified interface for fetching data from diverse sources (local files, web URLs, and databases) and normalizes everything into a standard format called `SayouPacket`.

It separates the logic of navigation (Generator) from retrieval (Fetcher), enabling complex recursive crawling and pagination strategies out of the box.
## 💡 Core Philosophy

> "Navigate First, Fetch Later."

Data collection is not just about downloading; it is about discovery. We split the responsibility into two roles:
- **Generator (Navigator)**: the "brain". It decides what to fetch next (e.g., computes DB offsets or finds next-page links) and yields a `Task`.
- **Fetcher (Driver)**: the "muscle". It executes the actual retrieval (e.g., an HTTP GET or a SQL query) and returns a `Packet`.

This separation enables the feedback loop: the result of a fetch (e.g., discovered links) feeds back into the Generator to uncover more targets.
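The loop described above can be sketched in a few lines. This is a hypothetical illustration, not the library's actual API: `Task`, `Packet`, `crawl`, and the `fetch`/`extract_links` callables are all made-up stand-ins for the Generator and Fetcher roles.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List

# Hypothetical stand-ins for the library's Task / SayouPacket types.
@dataclass
class Task:
    uri: str
    depth: int = 0

@dataclass
class Packet:
    task: Task
    data: object = None
    links: List[str] = field(default_factory=list)
    success: bool = True

def crawl(
    seed: Task,
    fetch: Callable[[Task], Packet],
    extract_links: Callable[[Packet], List[str]],
    max_depth: int,
) -> Iterator[Packet]:
    """Feedback loop: each fetch result feeds new tasks back into the frontier."""
    frontier, seen = [seed], {seed.uri}
    while frontier:
        task = frontier.pop(0)          # BFS order
        packet = fetch(task)            # Fetcher: the "muscle"
        yield packet
        if packet.success and task.depth < max_depth:
            for uri in extract_links(packet):   # Generator: the "brain"
                if uri not in seen:
                    seen.add(uri)
                    frontier.append(Task(uri, task.depth + 1))
```

Because `crawl` is a generator, results stream out as they arrive, which is the same shape the pipeline's Quick Start below relies on.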
## 📦 Installation

```bash
pip install sayou-connector
```
## ⚡ Quick Start

The `ConnectorPipeline` manages the feedback loop between Generators and Fetchers.

```python
from sayou.connector.pipeline import ConnectorPipeline

def run_demo():
    # 1. Initialize the pipeline
    pipeline = ConnectorPipeline()
    pipeline.initialize()

    # 2. Run (example: web crawling)
    print("Starting Web Crawl...")
    # Returns an iterator of SayouPacket objects
    packets = pipeline.run(
        source="https://news.daum.net/tech",
        strategy="requests",
        link_pattern=r"https://v\.daum\.net/v/\d+",
        max_depth=1,
    )

    # 3. Process results as a stream
    for packet in packets:
        if packet.success:
            print(f"[Fetched] {packet.task.uri}")
            # packet.data contains the extracted content (dict, bytes, etc.)
            print(f"  Data: {str(packet.data)[:50]}...")
        else:
            print(f"[Error] {packet.error}")

if __name__ == "__main__":
    run_demo()
```
## 🔑 Key Concepts

### Generators

- **`FileGenerator`**: recursively scans directories to find files matching extensions or patterns.
- **`SqlGenerator`**: generates paginated SQL queries (`LIMIT`/`OFFSET`) to fetch large tables in batches.
- **`WebCrawlGenerator`**: manages a URL frontier queue for BFS/DFS web crawling with depth control.
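The `LIMIT`/`OFFSET` arithmetic behind this style of pagination can be sketched as below. The function name and the idea of yielding raw SQL strings are assumptions for illustration only; the real `SqlGenerator` presumably yields `Task` objects rather than query text.

```python
from typing import Iterator

def paginated_queries(table: str, total_rows: int, batch_size: int) -> Iterator[str]:
    """Yield LIMIT/OFFSET queries that cover the table in fixed-size batches.

    Hypothetical sketch: the last batch may be smaller than batch_size,
    which LIMIT handles naturally by returning fewer rows.
    """
    for offset in range(0, total_rows, batch_size):
        yield f"SELECT * FROM {table} LIMIT {batch_size} OFFSET {offset}"
```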
### Fetchers

- **`FileFetcher`**: reads binary or text content from the local file system.
- **`SqliteFetcher`**: executes SQL queries against SQLite databases securely.
- **`SimpleWebFetcher`**: fetches HTML pages and extracts data/links using BeautifulSoup.
## 🤝 Contributing

We welcome contributions of new Fetchers (e.g., `S3Fetcher`, `KafkaFetcher`) and Generators (e.g., `SitemapGenerator`)!

## 📜 License

Apache 2.0 License © 2025 Sayouzone
## Project details
### File details

Details for the file `sayou_connector-0.3.6.tar.gz`.

#### File metadata

- Download URL: sayou_connector-0.3.6.tar.gz
- Upload date:
- Size: 24.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7

#### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `7b04962ccbd48712f57e01c5ff604729932a8f992d2d7b6fa0d9e8a3bc06530f` |
| MD5 | `8051a52b8c204741a3a08551d28db003` |
| BLAKE2b-256 | `504508c8de257a2254df91f3c7b7674968ebc0e72e5362a62b22f093d3203eb6` |
### File details

Details for the file `sayou_connector-0.3.6-py3-none-any.whl`.

#### File metadata

- Download URL: sayou_connector-0.3.6-py3-none-any.whl
- Upload date:
- Size: 24.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7

#### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `783517fa99e98952155ab6dff980c65dcd8127d10ecb058d528b7ef27dd249ba` |
| MD5 | `7262f9e5d131a12989bd9f8aeb36b11f` |
| BLAKE2b-256 | `644eac71dc454bf452feac7a6624ee41b0f1e6986872cdbf2cfeef12d19f55fc` |