Streaming ingestion pipeline for AMFI NAV and scheme master data
amfi-stream
Streaming-first ingestion pipeline for AMFI mutual fund data, built on Apache Arrow.
It transforms raw AMFI datasets into schema-safe, analytics-ready tables using a lightweight, parallel streaming engine.
What is amfi-stream
amfi-stream is a data ingestion layer that sits between AMFI data sources and analytics tools.
It is designed for:
- Streaming ingestion of NAV and scheme master data
- Automatic normalization of AMFI formats
- Schema enforcement using Apache Arrow
- Parallel data fetching and processing
- Clean outputs for downstream analytics systems
Ecosystem overview
AMFI Data Sources (NAV, Scheme Files)
↓
amfi-stream
(Streaming Ingestion Engine)
↓
Sanitization + Normalization
(Arrow Schema Enforcement)
↓
Apache Arrow Tables
↓
Downstream Analytics Tools
(Polars / DuckDB / Pandas / Spark)
amfi-stream is a streaming ingestion and normalization layer, not a data API wrapper or analytics engine.
Ecosystem comparison
| Solution | Type | Access Model | Structure | Multi-fund Support | Streaming | Cost | Key Limitation |
|---|---|---|---|---|---|---|---|
| amfi-stream | Ingestion pipeline | Bulk streaming ingestion | Arrow schema enforced | Native dataset-level | Yes | Free | Focused on ingestion, not APIs |
| mfapi.in | API service | REST endpoints | JSON structured | Client-side aggregation | Limited | Free | Request-per-fund model |
| navpipe | SDK | Fund-code queries | Polars output | Requires fund list | Yes | Free | Not dataset ingestion |
| mftool | Library | Scraping-based | Partial | Manual aggregation | No | Free | Fragile parsing logic |
| AMFI India Portal | Raw source | File downloads | None | Post-processing required | No | Free | Unstructured format |
Core design principle
- Most tools assume: Data is already structured and ready to consume.
- amfi-stream assumes: Data is streamed, raw, and must be normalized deterministically before analysis.
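To make "normalized deterministically" concrete, here is a minimal sketch in plain Python. It is not amfi-stream's sanitizer: the function name `sanitize_nav_lines` and the sample rows are illustrative, and it assumes the semicolon-delimited layout of AMFI's NAV text file (six fields per data row, with fund-house and category headings interleaved between rows).

```python
from typing import Iterable, Iterator

EXPECTED_FIELDS = 6  # assumed column count of the AMFI NAV text layout


def sanitize_nav_lines(lines: Iterable[str]) -> Iterator[list[str]]:
    """Yield only well-formed data rows, with surrounding whitespace stripped."""
    for line in lines:
        fields = [field.strip() for field in line.strip().split(";")]
        # Blank lines, headings, and the column header row are dropped:
        # a data row has the expected field count and a numeric scheme code.
        if len(fields) != EXPECTED_FIELDS or not fields[0].isdigit():
            continue
        yield fields


# Made-up sample input in the assumed layout.
raw = [
    "Scheme Code;ISIN Div Payout/ ISIN Growth;ISIN Div Reinvestment;Scheme Name;Net Asset Value;Date",
    "",
    "Open Ended Schemes(Debt Scheme - Banking and PSU Fund)",
    "Example Mutual Fund",
    "100001;INF000000001;-;Example Fund - Growth ;12.3456;01-May-2025",
    "100002;INF000000002;-;Example Fund - IDCW;N.A.;01-May-2025",
]
rows = list(sanitize_nav_lines(raw))
```

Because the filter depends only on the content of each line, the same input always produces the same rows, regardless of how the lines are chunked while streaming.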
Features
- Streaming ingestion via HTTP (fsspec)
- Automatic AMFI data sanitization
- Schema enforcement using Apache Arrow
- Parallel execution engine
- Composable job-based architecture
- Arrow-native outputs (no Pandas required)
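The parallel, job-based design listed above can be sketched with the standard library alone. This is not amfi-stream's engine: `Job`, `run_jobs`, and the stand-in fetch functions are hypothetical names, and the jobs return placeholder strings instead of making HTTP requests.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Job:
    """A named unit of work (hypothetical; not amfi-stream's job type)."""
    name: str
    fetch: Callable[[], str]


def run_jobs(jobs: list[Job], max_workers: int = 4) -> dict[str, str]:
    """Run every job on a thread pool and collect the results by job name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {job.name: pool.submit(job.fetch) for job in jobs}
        # result() blocks until the job finishes and re-raises its exception, if any.
        return {name: future.result() for name, future in futures.items()}


jobs = [
    Job("scheme_master", lambda: "scheme master payload"),
    Job("latest_nav", lambda: "latest NAV payload"),
]
results = run_jobs(jobs)
```

Threads are a reasonable fit here because fetching files over HTTP is I/O-bound; keying results by job name keeps the output independent of completion order.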
Quick start
from amfi_stream import AMFIPipeline, stream_latest_nav, stream_scheme_master, stream_historical_nav

jobs = [
    stream_scheme_master(),
    stream_latest_nav(),
    stream_historical_nav("1-May-2025", "1-May-2026"),
]

with AMFIPipeline(max_workers=4) as pipeline:
    result = pipeline.run(jobs)

print(result.latest_nav)
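The quick start passes date bounds as strings such as "1-May-2025". If you build those bounds programmatically, it can help to validate them first. The sketch below assumes a day-month-year format inferred from the quick-start strings; `DATE_FORMAT` and `parse_range` are hypothetical helpers, not part of amfi-stream.

```python
from datetime import date, datetime

DATE_FORMAT = "%d-%b-%Y"  # assumed from the quick-start strings, e.g. "1-May-2025"


def parse_range(start: str, end: str) -> tuple[date, date]:
    """Parse both bounds and reject a range whose start is after its end."""
    start_date = datetime.strptime(start, DATE_FORMAT).date()
    end_date = datetime.strptime(end, DATE_FORMAT).date()
    if start_date > end_date:
        raise ValueError(f"start {start!r} is after end {end!r}")
    return start_date, end_date


start_date, end_date = parse_range("1-May-2025", "1-May-2026")
span_days = (end_date - start_date).days
```

Note that `%b` matches month abbreviations in the current locale, so this sketch expects English month names.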
Output Format
All outputs are returned as PyArrow tables:
AMFIResult(
    scheme_master=pa.Table | None,
    latest_nav=pa.Table | None,
    historical_nav=pa.Table | None,
)
Architecture
URL Sources → Streaming Engine → Sanitizer → CSV Parser → Arrow Tables → Normalisers → Pipeline Output
Coming Soon
We are introducing an enhanced output schema that extends raw AMFI NAV data with additional derived, analytics-ready columns.
The enhanced schema is intended to sit on top of the standard AMFI format, reducing post-processing in downstream tools and improving compatibility with Arrow-native analytical workflows.
Design Philosophy
- Streaming over batch processing
- Schema-first ingestion
- Apache Arrow as canonical format
- Minimal dependencies
- Deterministic, reproducible pipelines
Contributing
This project is released under the Apache 2.0 License, and contributions are welcome.
Areas where contributions are especially useful:
- Historical NAV ingestion implementation
- Performance improvements in ingestion engine
- Additional normalization rules for AMFI formats
- Test coverage expansion
License
Apache License 2.0