A high-performance, unifying library for data ingestion pipelines from multiple sources.
FeedUnify
A high-performance, asynchronous Python library designed to unify and simplify data ingestion from multiple sources like RSS feeds and APIs into a single, clean format.
About The Project
Developers often need to pull data from various inconsistent sources like RSS feeds, Atom feeds, JSON APIs, and more. Each source has its own data structure and quirks, leading to brittle, custom code for each one.
feedunify addresses this with a single interface that fetches, parses, and standardizes content from each supported source into a predictable FeedItem object.
Key Features
- Unified Schema: All data is parsed into a standard FeedItem object with consistent fields like .title, .url, and .published_at.
- Asynchronous-First: Built from the ground up with asyncio and httpx to handle hundreds of sources concurrently without blocking.
- Extensible Architecture: Designed around a BaseConnector class, allowing new connectors for different source types to be easily added.
- Type-Safe & Robust: Leverages pydantic for powerful data validation and parsing, preventing errors from malformed data.
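The connector and concurrency ideas above can be sketched with the standard library alone. This is an illustration, not feedunify's actual implementation: `ConnectorSketch`, `FakeRssConnector`, `Item`, and `fetch_all_sketch` are hypothetical names, and the network call is simulated with a short sleep.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in for a standardized item (not the real FeedItem)."""
    title: str
    source_url: str


class ConnectorSketch(ABC):
    """Hypothetical base class: one subclass per source type."""

    @abstractmethod
    async def fetch(self, source: str) -> list[Item]:
        ...


class FakeRssConnector(ConnectorSketch):
    """Simulates a network fetch with a short sleep instead of real I/O."""

    async def fetch(self, source: str) -> list[Item]:
        await asyncio.sleep(0.01)
        return [Item(title=f"Post from {source}", source_url=source)]


async def fetch_all_sketch(connector, sources, limit=10):
    """Fetch every source concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def fetch_one(source):
        async with semaphore:
            return await connector.fetch(source)

    # gather() returns results in the same order as the input sources.
    batches = await asyncio.gather(*(fetch_one(s) for s in sources))
    # Flatten the per-source lists into a single list.
    return [item for batch in batches for item in batch]


demo_sources = ["https://a.example/rss", "https://b.example/rss"]
demo_items = asyncio.run(fetch_all_sketch(FakeRssConnector(), demo_sources))
print([item.title for item in demo_items])
```

Supporting a new source type in this sketch means adding another subclass with its own `fetch` method; the concurrent fetching code stays unchanged.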
Installation
You can install the library with:
pip install feedunify
Changelog
See the CHANGELOG.md file for a detailed history of changes to the project.
Quickstart
Here's how easy it is to fetch articles from multiple RSS feeds at the same time.
import asyncio
from feedunify import Forge

# A list of RSS feeds to fetch from.
SOURCES = [
    "https://www.theverge.com/rss/index.xml",
    "https://www.wired.com/feed/rss",
    "https://hnrss.org/frontpage",
]

async def main():
    """Main function to run the fetching process."""
    # 1. Create an instance of the main Forge class.
    forge = Forge()

    # 2. Fetch all items concurrently.
    print(f"Fetching from {len(SOURCES)} sources")
    all_items = await forge.fetch_all(sources=SOURCES)
    print(f"Found {len(all_items)} total items.")

    # 3. Work with the clean, standardized data.
    print("\nLatest from The Verge:")
    for item in all_items:
        if "theverge.com" in str(item.source_url):
            print(f"- {item.title}")

if __name__ == "__main__":
    asyncio.run(main())
Usage
First, fetch a list of items from your desired sources.
import asyncio
from feedunify import Forge

SOURCES = ["https://www.theverge.com/rss/index.xml", "https://hnrss.org/frontpage"]

async def get_items():
    forge = Forge()
    all_items = await forge.fetch_all(sources=SOURCES)
    return all_items

items = asyncio.run(get_items())
Once you have the items list, you can easily work with the standardized data.
Example 1: Find All Articles About "AI"
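import re

# Match "AI" as a whole word so titles containing "said" or "email" are not picked up.
ai_pattern = re.compile(r"\bai\b", re.IGNORECASE)
ai_articles = [
    item for item in items
    if ai_pattern.search(item.title)
]

print("AI Articles Found:")
for article in ai_articles:
    print(f"- {article.title}")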
Example 2: Get the 5 Most Recent Articles
# Filter out items that might not have a publication date.
dated_items = [item for item in items if item.published_at]
# Sort the items by date.
dated_items.sort(key=lambda item: item.published_at, reverse=True)
print("\nMost Recent Articles:")
for article in dated_items[:5]:
    print(f"- {article.title} (Published: {article.published_at.strftime('%Y-%m-%d')})")
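A third pattern in the same spirit is grouping items by the feed they came from. The sketch below uses `SimpleNamespace` stand-ins with only `title` and `source_url` so it runs without a network call; `group_by_host` is a hypothetical helper, not part of feedunify. With real data you would pass the `items` list from the Usage section instead.

```python
from collections import defaultdict
from types import SimpleNamespace
from urllib.parse import urlparse

# Stand-in objects that mimic the two attributes used below.
sample_items = [
    SimpleNamespace(title="A", source_url="https://www.theverge.com/rss/index.xml"),
    SimpleNamespace(title="B", source_url="https://hnrss.org/frontpage"),
    SimpleNamespace(title="C", source_url="https://www.theverge.com/rss/index.xml"),
]


def group_by_host(items):
    """Group items by the host name of the feed they came from."""
    grouped = defaultdict(list)
    for item in items:
        # str() also handles URL objects that are not plain strings.
        host = urlparse(str(item.source_url)).netloc
        grouped[host].append(item)
    return dict(grouped)


by_host = group_by_host(sample_items)
for host, host_items in by_host.items():
    print(f"{host}: {len(host_items)} item(s)")
```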
The FeedItem Object
The primary output of feedunify is a list of FeedItem objects. This object provides a standardized interface to the data, regardless of the original source.
Key Attributes
- item.id (str): A unique identifier for the item.
- item.title (str): The headline or title.
- item.url (HttpUrl): A validated Pydantic URL object for the original content.
- item.source_url (HttpUrl): The URL of the feed this item came from.
- item.summary (str | None): A short summary or description.
- item.published_at (datetime | None): A timezone-aware datetime object of when the item was published.
- item.authors (List[Author]): A list of Author objects, each with .name and .url attributes.
- item.tags (List[str]): A list of tags or categories.
- item.raw (dict | None): The original, unprocessed data from the source, useful for debugging.
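Because published_at is timezone-aware and may be None, date filters need a timezone-aware cutoff and a None check. Ordering a timezone-aware datetime against a naive one raises TypeError in Python. The sketch below uses stand-in objects and a hypothetical helper, `published_since`, with a fixed reference time so the result is reproducible.

```python
from datetime import datetime, timedelta, timezone
from types import SimpleNamespace


def published_since(items, cutoff):
    """Return items with a publication date at or after a timezone-aware cutoff."""
    return [
        item for item in items
        if item.published_at is not None and item.published_at >= cutoff
    ]


# Fixed reference time for the example; real code would typically use
# datetime.now(timezone.utc) instead.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
samples = [
    SimpleNamespace(title="Fresh", published_at=now - timedelta(hours=3)),
    SimpleNamespace(title="Stale", published_at=now - timedelta(days=3)),
    SimpleNamespace(title="Undated", published_at=None),
]

recent = published_since(samples, now - timedelta(hours=24))
print([item.title for item in recent])
```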
Future Plans
feedunify is actively being developed. Future goals include:
- Adding a connector for common JSON APIs.
- Implementing intelligent HTTP caching (ETags, Last-Modified).
- Improving source detection logic.
- Exploring support for more complex sources like newsletters.
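For context on the caching goal: HTTP conditional requests let a client send back the ETag or Last-Modified value from a previous response, and the server answers 304 Not Modified with no body when the feed is unchanged. The sketch below only illustrates that mechanism with standard header names. The helper names are hypothetical and say nothing about how feedunify will implement caching.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build request headers that ask the server to skip unchanged content."""
    headers = {}
    if etag:
        # Echo back the ETag value from the previous response.
        headers["If-None-Match"] = etag
    if last_modified:
        # Echo back the Last-Modified value from the previous response.
        headers["If-Modified-Since"] = last_modified
    return headers


def is_not_modified(status_code):
    """A 304 response means the cached copy is still current."""
    return status_code == 304


example_headers = conditional_headers(
    etag='"abc123"',
    last_modified="Wed, 21 Oct 2015 07:28:00 GMT",
)
print(example_headers)
```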
Contributing
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".
- Fork the Project
- Create your Feature Branch
- Commit your Changes
- Push to the Branch
- Open a Pull Request
License
Distributed under the MIT License. See LICENSE for more information.
Download files
File details
Details for the file feedunify-0.3.3.tar.gz.
File metadata
- Download URL: feedunify-0.3.3.tar.gz
- Upload date:
- Size: 11.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 41bed1a8ebbf3e9cd781ef9f82e0c5fcdacd850b9dcff87cca41e0ba0967c9f0 |
| MD5 | 93507ca74aecb191b2564c577bde5210 |
| BLAKE2b-256 | 93087ca7bf0cbd224bdd45630cfe8fa019c02790c95466b15cd14a285e8b05f4 |
File details
Details for the file feedunify-0.3.3-py3-none-any.whl.
File metadata
- Download URL: feedunify-0.3.3-py3-none-any.whl
- Upload date:
- Size: 9.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 73be216a989d6eaa5ab9cfa4926e127fe1ce007a3ef3c8fbc711588e4f0c8d45 |
| MD5 | 28010c16fb459ed8c96261a5c2fcf2e2 |
| BLAKE2b-256 | 80430dfe43fd12818ad532de0590d78ef9bda91666914e33b27b4104c66373b2 |