FERPA-compliant document filter for Haystack RAG pipelines — enforces identity-scoped access control before documents reach the LLM

ferpa-haystack

FERPA-compliant document filtering for Haystack RAG pipelines.

Enforces identity-scoped access control under 34 CFR Part 99 at the retrieval layer, before any document reaches the LLM context window.


The Problem

Standard Haystack pipelines retrieve documents and pass them directly to the LLM with no enforcement of who is allowed to see what. In higher-education deployments, this creates a structural FERPA compliance gap: a student advising chatbot may return another student's academic record, financial aid details, or disciplinary history in response to a query.

This component closes that gap by adding a two-layer compliance filter between your retriever and your LLM.


Architecture

Haystack Pipeline
     │
     ▼
InMemoryEmbeddingRetriever (or any retriever)
     │  documents (all retrieved)
     ▼
FERPAMetadataFilter
     │  Layer 1: Identity pre-filter (student_id + institution_id)
     │  Layer 2: Category authorization (academic_record, financial_aid, ...)
     │
     ├── documents ──────────────► LLM (only authorized records)
     └── disclosure_record ──────► Audit log (34 CFR § 99.32)

Documents without identity metadata (course catalogues, policy handbooks) pass through both layers unchanged — shared knowledge-base content is never blocked.


Installation

pip install ferpa-haystack

Quick Start

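from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.filters.ferpa_filter import FERPAMetadataFilter

doc_store = InMemoryDocumentStore()

ferpa_filter = FERPAMetadataFilter(
    student_id="stu_001",
    institution_id="univ_abc",
    authorized_categories=["academic_record", "financial_aid"],
    requesting_user_id="advisor_007",
)

# OpenAIGenerator takes a prompt string, not documents, so a PromptBuilder
# renders the filtered documents into a prompt. This template is only an example.
template = """Answer using only these records:
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ question }}"""

pipeline = Pipeline()
pipeline.add_component("retriever", InMemoryEmbeddingRetriever(doc_store))
pipeline.add_component("ferpa_filter", ferpa_filter)
pipeline.add_component("prompt_builder", PromptBuilder(template=template))
pipeline.add_component("llm", OpenAIGenerator(model="gpt-4o"))

pipeline.connect("retriever.documents", "ferpa_filter.documents")
pipeline.connect("ferpa_filter.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "llm.prompt")

# question is the user's query text; query_emb is its embedding, produced by
# the same embedding model that was used to index the documents.
result = pipeline.run(
    {
        "retriever": {"query_embedding": query_emb},
        "prompt_builder": {"question": question},
    },
    # Outputs consumed by a downstream component are omitted from the result
    # unless the component is listed here.
    include_outputs_from={"ferpa_filter"},
)

# Only stu_001's authorized records reached the LLM
authorized_docs = result["ferpa_filter"]["documents"]

# 34 CFR § 99.32 audit entry: log this to your compliance system
audit_record = result["ferpa_filter"]["disclosure_record"]
print(audit_record.to_log_entry())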

Filtering Layers

Layer 1 — Identity Pre-Filter

Documents are matched against student_id and institution_id metadata fields.

| Document metadata | Outcome |
| --- | --- |
| No student_id or institution_id | Pass (treated as shared content) |
| student_id matches | Continue to Layer 2 |
| student_id does not match | Blocked |
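
The outcomes above can be read as a small decision function. The following is a minimal sketch over plain metadata dictionaries, not the package's actual implementation. The name `identity_outcome` is hypothetical, and treating a mismatched `institution_id` as blocked is an assumption, since the table only spells out the `student_id` cases.

```python
from typing import Any, Mapping


def identity_outcome(
    meta: Mapping[str, Any],
    student_id: str,
    institution_id: str,
    student_id_field: str = "student_id",
    institution_id_field: str = "institution_id",
) -> str:
    """Return "pass", "continue", or "blocked" for one document's metadata."""
    doc_student = meta.get(student_id_field)
    doc_institution = meta.get(institution_id_field)

    # No identity metadata at all: shared content such as a course catalogue.
    if doc_student is None and doc_institution is None:
        return "pass"

    # Assumption: every identity field that is present must match the request.
    if doc_student is not None and doc_student != student_id:
        return "blocked"
    if doc_institution is not None and doc_institution != institution_id:
        return "blocked"

    # Identity matches, so the category check in Layer 2 decides.
    return "continue"
```

The field-name parameters mirror the `student_id_field` and `institution_id_field` options, so the same logic applies when a document store uses different metadata keys.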

Layer 2 — Category Authorization

When authorized_categories is non-empty, the document's category field must be in the authorized set.

# Only academic records and financial aid — disciplinary records are blocked
FERPAMetadataFilter(
    student_id="stu_001",
    institution_id="univ_abc",
    authorized_categories=["academic_record", "financial_aid"],
    # "disciplinary" is blocked even if identity matches
)
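
The category rule itself fits in a few lines. This is a sketch of the behaviour described here, with a hypothetical helper name `category_allowed`. It applies only to documents that already matched on identity in Layer 1, and it assumes that a document with no category value fails the check when a non-empty list is configured.

```python
from typing import Any, Mapping, Sequence


def category_allowed(
    meta: Mapping[str, Any],
    authorized_categories: Sequence[str],
    category_field: str = "category",
) -> bool:
    """Layer 2 check for a document that already matched on identity."""
    # An empty list means no category restriction is configured.
    if not authorized_categories:
        return True
    # Assumption: a missing category is not in the authorized set, so it fails.
    return meta.get(category_field) in set(authorized_categories)
```

With `authorized_categories=["academic_record", "financial_aid"]`, a document whose category is `"disciplinary"` returns `False` here even though its identity fields match.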

Audit Record (34 CFR § 99.32)

Every call to run() produces a FERPADisclosureRecord regardless of how many documents are authorized:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class FERPADisclosureRecord:
    student_id: str
    institution_id: str
    requesting_user_id: str
    disclosed_at: datetime          # UTC timestamp
    total_retrieved: int            # documents from retriever
    total_disclosed: int            # documents that passed filtering
    categories_disclosed: list[str] # record categories in result
    pipeline_context: str           # pipeline/workflow label
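
To make the relationship between these fields concrete, here is a rough sketch of how such a record could be derived from one filter run. `DisclosureSketch` and `build_record` are hypothetical stand-ins written for illustration and are not part of the package; the sorted, de-duplicated category list is likewise an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Mapping, Sequence


@dataclass
class DisclosureSketch:
    """Hypothetical stand-in that mirrors the fields listed above."""

    student_id: str
    institution_id: str
    requesting_user_id: str
    total_retrieved: int
    total_disclosed: int
    categories_disclosed: list[str]
    pipeline_context: str = ""
    disclosed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def build_record(
    retrieved: Sequence[Mapping[str, Any]],
    disclosed: Sequence[Mapping[str, Any]],
    student_id: str,
    institution_id: str,
    requesting_user_id: str,
    pipeline_context: str = "",
) -> DisclosureSketch:
    """Summarise one filter run from the metadata of its inputs and outputs."""
    # Assumption: categories are reported once each, in sorted order.
    categories = sorted({m["category"] for m in disclosed if "category" in m})
    return DisclosureSketch(
        student_id=student_id,
        institution_id=institution_id,
        requesting_user_id=requesting_user_id,
        total_retrieved=len(retrieved),
        total_disclosed=len(disclosed),
        categories_disclosed=categories,
        pipeline_context=pipeline_context,
    )
```

A record is still built when `disclosed` is empty; in that case `total_disclosed` is 0 and `categories_disclosed` is an empty list, which matches the statement that every call produces an audit entry.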

Log it to your compliance system:

import logging
compliance_logger = logging.getLogger("ferpa.audit")
compliance_logger.info(result["ferpa_filter"]["disclosure_record"].to_log_entry())
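
Note that a logger with no configuration inherits the root logger's default WARNING level, so `info()` entries are silently dropped. One way to avoid that, sketched with the standard library only, is to give the `ferpa.audit` logger its own level and file handler. The helper name `configure_audit_logger` and the log format are illustrative choices, not something the package prescribes.

```python
import logging


def configure_audit_logger(path: str) -> logging.Logger:
    """Attach a dedicated file handler to the "ferpa.audit" logger."""
    logger = logging.getLogger("ferpa.audit")
    # Without this, INFO entries fall below the inherited WARNING threshold.
    logger.setLevel(logging.INFO)
    # Keep audit entries out of the application's general log output.
    logger.propagate = False

    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Call it once at startup, for example `compliance_logger = configure_audit_logger("ferpa_audit.log")`; calling it repeatedly would attach duplicate handlers and write each entry more than once.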

Configuration

FERPAMetadataFilter(
    student_id="stu_001",
    institution_id="univ_abc",
    authorized_categories=["academic_record"],   # empty = all categories allowed
    requesting_user_id="advisor_007",            # recorded in audit log
    student_id_field="student_id",               # custom meta key
    institution_id_field="institution_id",       # custom meta key
    category_field="category",                   # custom meta key
    pipeline_context="advising_pipeline",        # audit label
    raise_on_violation=False,                    # True = raise PermissionError
)

Custom Field Names

If your document store uses different metadata keys:

FERPAMetadataFilter(
    student_id="stu_001",
    institution_id="univ_abc",
    student_id_field="learner_id",        # your custom key
    institution_id_field="campus_code",   # your custom key
    category_field="record_type",         # your custom key
)

Pipeline Serialization

The component supports Haystack pipeline serialization, so a pipeline that includes it can be saved to YAML and restored:

with open("advising_pipeline.yaml", "w") as f:
    pipeline.dump(f)
with open("advising_pipeline.yaml") as f:
    pipeline_restored = Pipeline.load(f)

Regulatory Basis

| Regulation | Section | What this component enforces |
| --- | --- | --- |
| FERPA | 34 CFR § 99.31(a)(1) | Legitimate educational interest: only authorized roles access records |
| FERPA | 34 CFR § 99.32 | Record of disclosures: structured audit entry on every access |


License

Apache License 2.0 — see LICENSE
