FERPA-compliant document filter for Haystack RAG pipelines — enforces identity-scoped access control before documents reach the LLM
ferpa-haystack
FERPA-compliant document filtering for Haystack RAG pipelines.
Enforces identity-scoped access control under 34 CFR Part 99 at the retrieval layer, before any document reaches the LLM context window.
The Problem
Standard Haystack pipelines retrieve documents and pass them directly to the LLM with no enforcement of who is allowed to see what. In higher-education deployments, this creates a structural FERPA compliance gap: a student advising chatbot may return another student's academic record, financial aid details, or disciplinary history in response to a query.
This component closes that gap by adding a two-layer compliance filter between your retriever and your LLM.
Architecture
Haystack Pipeline
      │
      ▼
InMemoryEmbeddingRetriever (or any retriever)
      │  documents (all retrieved)
      ▼
FERPAMetadataFilter
      │  Layer 1: Identity pre-filter (student_id + institution_id)
      │  Layer 2: Category authorization (academic_record, financial_aid, ...)
      │
      ├── documents ──────────────► LLM (only authorized records)
      └── disclosure_record ──────► Audit log (34 CFR § 99.32)
Documents without identity metadata (course catalogues, policy handbooks) pass through both layers unchanged — shared knowledge-base content is never blocked.
Installation
pip install ferpa-haystack
Quick Start
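from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.filters.ferpa_filter import FERPAMetadataFilter

doc_store = InMemoryDocumentStore()

ferpa_filter = FERPAMetadataFilter(
    student_id="stu_001",
    institution_id="univ_abc",
    authorized_categories=["academic_record", "financial_aid"],
    requesting_user_id="advisor_007",
)

# OpenAIGenerator takes a prompt string rather than documents, so a PromptBuilder
# renders the filtered documents into the prompt (this template is illustrative).
template = "{% for doc in documents %}{{ doc.content }}\n{% endfor %}Question: {{ question }}"

pipeline = Pipeline()
pipeline.add_component("retriever", InMemoryEmbeddingRetriever(doc_store))
pipeline.add_component("ferpa_filter", ferpa_filter)
pipeline.add_component("prompt_builder", PromptBuilder(template=template))
pipeline.add_component("llm", OpenAIGenerator(model="gpt-4o"))
pipeline.connect("retriever.documents", "ferpa_filter.documents")
pipeline.connect("ferpa_filter.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "llm.prompt")

# query_emb (query embedding) and question (user query text) are supplied by you
result = pipeline.run(
    {"retriever": {"query_embedding": query_emb}, "prompt_builder": {"question": question}},
    include_outputs_from={"ferpa_filter"},  # also return outputs consumed downstream
)

# Only stu_001's authorized records reached the LLM
authorized_docs = result["ferpa_filter"]["documents"]

# 34 CFR § 99.32 audit entry: log this to your compliance system
audit_record = result["ferpa_filter"]["disclosure_record"]
print(audit_record.to_log_entry())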
Filtering Layers
Layer 1 — Identity Pre-Filter
Documents are matched against student_id and institution_id metadata fields.
| Document metadata | Outcome |
|---|---|
| No student_id or institution_id | Pass (treated as shared content) |
| student_id matches | Continue to Layer 2 |
| student_id does not match | Blocked |
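The table above can be read as a small decision function. The following is a minimal sketch in plain Python, not the package's implementation: `identity_outcome` and the dict-shaped metadata are hypothetical, and blocking on an institution_id mismatch is an assumption, since the table only spells out the student_id cases.

```python
from typing import Any, Mapping


def identity_outcome(
    meta: Mapping[str, Any],
    student_id: str,
    institution_id: str,
    student_id_field: str = "student_id",
    institution_id_field: str = "institution_id",
) -> str:
    """Return 'pass', 'continue', or 'blocked' for one document's metadata."""
    doc_student = meta.get(student_id_field)
    doc_institution = meta.get(institution_id_field)

    # No identity metadata at all: treated as shared knowledge-base content.
    if doc_student is None and doc_institution is None:
        return "pass"

    # Record belongs to a different student.
    if doc_student is not None and doc_student != student_id:
        return "blocked"

    # Assumption: a record from a different institution is also blocked.
    if doc_institution is not None and doc_institution != institution_id:
        return "blocked"

    # Identity matches: category authorization (Layer 2) decides next.
    return "continue"
```

Returning a label rather than a boolean keeps the "shared content" case distinct from the "identity matched" case, which matters because only the latter is subject to category checks.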
Layer 2 — Category Authorization
When authorized_categories is non-empty, the document's category field must be in the authorized set.
# Only academic records and financial aid — disciplinary records are blocked
FERPAMetadataFilter(
    student_id="stu_001",
    institution_id="univ_abc",
    authorized_categories=["academic_record", "financial_aid"],
    # "disciplinary" is blocked even if identity matches
)
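The category rule for an identity-matched document can be sketched as a single predicate. This is an illustration under stated assumptions rather than the component's code: `category_allowed` is a hypothetical helper, and treating a document with no category value as blocked when the authorized list is non-empty is an assumption.

```python
from typing import Any, Mapping, Sequence


def category_allowed(
    meta: Mapping[str, Any],
    authorized_categories: Sequence[str],
    category_field: str = "category",
) -> bool:
    """Decide whether an identity-matched document passes the category check."""
    # An empty list means no category restriction is applied.
    if not authorized_categories:
        return True
    # Assumption: a missing category is treated as not authorized.
    return meta.get(category_field) in set(authorized_categories)
```

With `["academic_record", "financial_aid"]` as the authorized list, a document whose category is "disciplinary" is rejected by this predicate even though its student_id matched.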
Audit Record (34 CFR § 99.32)
Every call to run() produces a FERPADisclosureRecord regardless of how many documents are authorized:
from dataclasses import dataclass
from datetime import datetime

@dataclass
class FERPADisclosureRecord:
    student_id: str
    institution_id: str
    requesting_user_id: str
    disclosed_at: datetime          # UTC timestamp
    total_retrieved: int            # documents from retriever
    total_disclosed: int            # documents that passed filtering
    categories_disclosed: list[str] # record categories in result
    pipeline_context: str           # pipeline/workflow label
Log it to your compliance database:
import logging
compliance_logger = logging.getLogger("ferpa.audit")
compliance_logger.info(result["ferpa_filter"]["disclosure_record"].to_log_entry())
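In practice the "ferpa.audit" logger usually needs its own handler so that audit entries are kept apart from general application logs. The sketch below uses only the standard library; `build_audit_logger` is a hypothetical helper, the in-memory buffer stands in for a file or compliance sink, and the logged string is a placeholder, since the exact format returned by to_log_entry() is not shown here.

```python
import io
import logging
from typing import TextIO


def build_audit_logger(stream: TextIO) -> logging.Logger:
    """Configure the 'ferpa.audit' logger with a dedicated handler."""
    logger = logging.getLogger("ferpa.audit")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep audit entries out of the root/application log
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(message)s"))
    logger.handlers = [handler]  # replace any handlers from earlier configuration
    return logger


buffer = io.StringIO()  # stand-in for a log file or compliance sink
audit_logger = build_audit_logger(buffer)

# Placeholder message; in a pipeline this would be the value returned by
# result["ferpa_filter"]["disclosure_record"].to_log_entry()
audit_logger.info("student_id=stu_001 total_disclosed=2")
```

Swapping `logging.StreamHandler(stream)` for `logging.FileHandler(path)` would write the same entries to a file.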
Configuration
FERPAMetadataFilter(
    student_id="stu_001",
    institution_id="univ_abc",
    authorized_categories=["academic_record"],  # empty = all categories allowed
    requesting_user_id="advisor_007",           # recorded in audit log
    student_id_field="student_id",              # custom meta key
    institution_id_field="institution_id",      # custom meta key
    category_field="category",                  # custom meta key
    pipeline_context="advising_pipeline",       # audit label
    raise_on_violation=False,                   # True = raise PermissionError
)
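The raise_on_violation flag switches between two failure modes: silently dropping unauthorized documents, or failing the request. The sketch below illustrates that trade-off in plain Python. `apply_policy` is a hypothetical helper, and raising as soon as any retrieved document is blocked is an assumption; the source states only that True raises PermissionError.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]


def apply_policy(
    docs: List[Doc],
    is_authorized: Callable[[Doc], bool],
    raise_on_violation: bool = False,
) -> List[Doc]:
    """Drop unauthorized documents, or raise if configured to fail loudly."""
    allowed = [doc for doc in docs if is_authorized(doc)]
    blocked = len(docs) - len(allowed)
    if blocked and raise_on_violation:
        # Assumption: any blocked document counts as a violation.
        raise PermissionError(
            f"{blocked} retrieved document(s) are not authorized for this request"
        )
    return allowed
```

Filtering silently suits an interactive chatbot, where the user should simply never see the blocked content; raising suits batch jobs or tests, where an unauthorized retrieval indicates a misconfigured index and should stop the run.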
Custom Field Names
If your document store uses different metadata keys:
FERPAMetadataFilter(
    student_id="stu_001",
    institution_id="univ_abc",
    student_id_field="learner_id",        # your custom key
    institution_id_field="campus_code",   # your custom key
    category_field="record_type",         # your custom key
)
Pipeline Serialization
The component is fully serializable for YAML/JSON pipeline storage:
with open("advising_pipeline.yaml", "w") as f:
    pipeline.dump(f)  # Haystack 2.x writes YAML by default
with open("advising_pipeline.yaml") as f:
    pipeline_restored = Pipeline.load(f)
Regulatory Basis
| Regulation | Section | What this component enforces |
|---|---|---|
| FERPA | 34 CFR § 99.31(a)(1) | Legitimate educational interest — only authorized roles access records |
| FERPA | 34 CFR § 99.32 | Record of disclosures — structured audit entry on every access |
Related Projects
- enterprise-rag-patterns — FERPA, HIPAA, GDPR compliance patterns for RAG across 50+ regulated sectors
- regulated-ai-governance — Policy enforcement for AI agents across 25 jurisdictions
License
Apache License 2.0 — see LICENSE