High-performance, native parallel Rust PDF parser engine for VTU provisional results

⚡ acatrack-pdf-parser-rs

A native Rust PDF parsing engine built to extract structured student records and provisional exam marks. It is exposed to Python through PyO3 and Maturin, and uses Rayon to parse batches of PDFs in parallel across CPU cores, which cuts batch ingestion time.

Developed as the core ingestion engine of AcaTrack, the parser repairs visual layout misalignments by checking candidate values arithmetically. In the benchmark below it ran 38.4x faster than a sequential Python parser.

🚀 Key Features

🏎️ Rayon Parallelization: GIL-free multi-threaded PDF table extraction using CPU core saturation.
🛡️ Spacing-Robust Digit Concatenation: Reconstructs fragmented, narrow visual columns (e.g., visual layout splits like "4" and "5" for a score of 45) automatically.
📐 Virtual Row Splitting: Automatically splits stacked cell values separated by newlines (\n) into neat, index-aligned rows.
📐 Mathematical Verification: Checks every parsed row against the checksum $\text{IA} + \text{SEE} = \text{Total}$ and flags any row whose marks do not add up.
⚡ PyO3 FFI Bridge: Compiled into a native .so / .pyd module that can be imported directly in Python with zero performance loss.
📊 FFI Telemetry: Streams granular Rust execution logs back into Python for instant diagnostic debugging.
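
The row splitting, digit concatenation, and checksum ideas above can be sketched in a few lines of Python. This is a minimal illustration and not the library's Rust implementation: the function names `split_stacked_cells` and `reconstruct_marks`, the `max_mark` cap, and the strategy of trying every split of the digit string are all assumptions made for this example.

```python
from typing import List, Optional, Tuple


def split_stacked_cells(cells: List[str]) -> List[List[str]]:
    """Turn one table row whose cells hold newline-stacked values
    into several index-aligned virtual rows."""
    if not cells:
        return []
    columns = [cell.split("\n") for cell in cells]
    height = max(len(col) for col in columns)
    # Pad short columns so every virtual row has the same number of fields.
    padded = [col + [""] * (height - len(col)) for col in columns]
    return [[col[i].strip() for col in padded] for i in range(height)]


def reconstruct_marks(
    fragments: List[str], max_mark: int = 200
) -> Optional[Tuple[int, int, int]]:
    """Join fragmented digit tokens, then look for the one way to cut the
    digit string into (IA, SEE, Total) that satisfies IA + SEE == Total.

    Returns None when the input is not purely numeric or when the split
    is missing or ambiguous. `max_mark` is a hypothetical sanity cap.
    """
    digits = "".join(fragment.strip() for fragment in fragments)
    if not digits.isdigit():
        return None
    n = len(digits)
    candidates = []
    for i in range(1, n - 1):
        for j in range(i + 1, n):
            ia, see, total = int(digits[:i]), int(digits[i:j]), int(digits[j:])
            if ia + see == total and total <= max_mark:
                candidates.append((ia, see, total))
    # Accept the result only when the checksum pins down a single split.
    return candidates[0] if len(candidates) == 1 else None


rows = split_stacked_cells(["BMATS101\nBCHES102", "45\n38", "30\n41"])
print(rows)  # [['BMATS101', '45', '30'], ['BCHES102', '38', '41']]
print(reconstruct_marks(["4", "5", "30", "7", "5"]))  # (45, 30, 75)
```

Using the checksum to choose the split means a fragmented row is accepted only when the arithmetic leaves a single consistent reading. Ambiguous rows return `None` rather than a guess.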

📊 Performance Benchmarks (1,308 PDFs)

Tested over 1,308 PDFs (across 4 ZIP upload requests) containing freshman provisional university results:

Metric	Sequential Python Core	Parallel Rust Engine (This Library) 🚀	Net Improvement
Total Parsing Duration	`21.78` minutes	`34.06` seconds (~0.57 min)	`38.4x` Faster 🚀
Speed per PDF	`0.9992` seconds	`0.0260` seconds	`38.4x` Faster 🚀
Memory Net Impact	`+268.48` MB	`+76.43` MB	`71.5%` Lower RAM 📉
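
The ratios in the table follow directly from the raw timings, and the short calculation below reproduces them. The function name `summarize_benchmark` is only illustrative. Because the inputs here are the already-rounded published figures, the last digit of a derived value may differ slightly from the table.

```python
def summarize_benchmark(python_minutes, rust_seconds, pdf_count, python_mb, rust_mb):
    """Derive speedup, per-PDF time, and RAM reduction from raw totals."""
    python_seconds = python_minutes * 60
    return {
        "speedup": python_seconds / rust_seconds,
        "python_sec_per_pdf": python_seconds / pdf_count,
        "rust_sec_per_pdf": rust_seconds / pdf_count,
        "ram_reduction_pct": (1 - rust_mb / python_mb) * 100,
    }


stats = summarize_benchmark(21.78, 34.06, 1308, 268.48, 76.43)
print(f"{stats['speedup']:.1f}x faster")               # 38.4x faster
print(f"{stats['ram_reduction_pct']:.1f}% lower RAM")  # 71.5% lower RAM
```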

🛠️ Architecture

The parser integrates a dual-tier parsing fallback mechanism to remain robust across layout changes:

graph TD
    A[Raw PDF Page] --> B{Tier 1: Clean Column Scan}
    B -- Found Code & Split Cells --> C[Unified Token Concatenation & Math Verification]
    B -- Layout Grid Failure --> D{Tier 2: Fallback Flat Text Scan}
    D --> E[Tokenize flat whitespace stream]
    E --> F[Match codes & parse trailing numeric pairs]
    C --> G[StudentRecord PyDict Object]
    F --> G
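
The fallback flow in the diagram can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the engine's Rust code: the row layout `[code, subject name, IA, SEE, Total]`, the function names, and the choice to treat "no verified rows" as a layout grid failure are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Marks = Tuple[int, int, int]  # (IA, SEE, Total)


def tier1_column_scan(table_rows: List[List[str]], codes: Set[str]) -> Dict[str, Marks]:
    """Tier 1: read marks from clean table rows (assumed layout:
    code first, IA / SEE / Total in the last three cells)."""
    found: Dict[str, Marks] = {}
    for row in table_rows:
        if len(row) < 4 or row[0].strip() not in codes:
            continue
        try:
            ia, see, total = (int(cell) for cell in row[-3:])
        except ValueError:
            continue  # a non-numeric cell means the grid was not read cleanly
        if ia + see == total:
            found[row[0].strip()] = (ia, see, total)
    return found


def tier2_flat_scan(text: str, codes: Set[str]) -> Dict[str, Marks]:
    """Tier 2: tokenize the flat text, find each subject code, and take
    the next three numeric tokens that appear before the following code."""
    tokens = text.split()
    found: Dict[str, Marks] = {}
    for idx, token in enumerate(tokens):
        if token not in codes:
            continue
        numbers: List[int] = []
        for nxt in tokens[idx + 1:]:
            if nxt in codes:
                break
            if nxt.isascii() and nxt.isdigit():
                numbers.append(int(nxt))
            if len(numbers) == 3:
                break
        if len(numbers) == 3 and numbers[0] + numbers[1] == numbers[2]:
            found[token] = (numbers[0], numbers[1], numbers[2])
    return found


def parse_page(table_rows: List[List[str]], text: str, codes: Set[str]) -> Dict[str, Marks]:
    """Try the column scan first; fall back to the flat text scan."""
    marks = tier1_column_scan(table_rows, codes)
    return marks if marks else tier2_flat_scan(text, codes)


codes = {"BMATS101", "BCHES102"}
flat = "BMATS101 MATHEMATICS FOR CSE STREAM-1 45 30 75 P BCHES102 CHEMISTRY 38 41 79 P"
print(parse_page([], flat, codes))
# {'BMATS101': (45, 30, 75), 'BCHES102': (38, 41, 79)}
```

Both tiers apply the same IA + SEE = Total check, so a record reaches the output only when its marks are arithmetically consistent, regardless of which tier produced it.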

📦 Getting Started

Prerequisites

Rust Toolchain: rustup, rustc, cargo (latest stable release)
Python: 3.8+
Maturin: pip install maturin

Local Development & Setup

Clone the repository:

git clone https://github.com/chetanuchiha16/acatrack-pdf-parser-rs.git
cd acatrack-pdf-parser-rs

Compile and install locally into your active Python environment:

# Builds in release mode and sets up an editable package link
maturin develop --release

Verify compilation:

python -c "import acatrack_rust; print(acatrack_rust.__doc__)"

🐍 Python Usage Example

import acatrack_rust

# Target subjects to scan for
target_subjects = ["BMATS101", "BCHES102", "BCEDK103", "BENGK106"]

# Parse a single PDF file
record = acatrack_rust.parse_single_pdf(
    pdf_path="path/to/student_result.pdf",
    subject_codes=target_subjects
)

if record:
    print(f"USN: {record['usn']}")
    print(f"Name: {record['name']}")
    print(f"Marks Extracted: {record['marks']}")
    print("\n--- Telemetry Logs ---")
    for log in record['logs']:
        print(log)
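
To process several files, the same call can be wrapped in a small loop. The helper below is a hypothetical convenience wrapper, not part of the package: `parse_many` and its `parse_fn` parameter are names invented for this sketch. It assumes, as the `if record:` check above suggests, that a failed parse yields a falsy value, and it does not handle exceptions.

```python
from typing import Callable, Dict, List, Optional, Tuple


def parse_many(
    pdf_paths: List[str],
    subject_codes: List[str],
    parse_fn: Callable[..., Optional[Dict]],
) -> Tuple[List[Dict], List[str]]:
    """Call `parse_fn` once per path and separate successes from failures.

    `parse_fn` is expected to accept the `pdf_path` and `subject_codes`
    keyword arguments shown in the usage example above.
    """
    records: List[Dict] = []
    failed: List[str] = []
    for path in pdf_paths:
        record = parse_fn(pdf_path=path, subject_codes=subject_codes)
        if record:
            records.append(record)
        else:
            failed.append(path)  # keep the path for later inspection
    return records, failed
```

With the package installed, this could be called as `parse_many(paths, target_subjects, acatrack_rust.parse_single_pdf)`. Passing the parser in as an argument also makes it easy to substitute a stub in tests.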

📄 License

Licensed under the MIT License.

Release history

0.1.1

May 27, 2026

This version

0.1.0

May 27, 2026

Download files

Source Distribution

acatrack_pdf_parser_rs-0.1.0.tar.gz (24.6 kB)

Uploaded May 27, 2026 Source

Built Distribution

acatrack_pdf_parser_rs-0.1.0-cp314-cp314-manylinux_2_34_x86_64.whl (1.7 MB)

Uploaded May 27, 2026 CPython 3.14, manylinux: glibc 2.34+ x86-64

File details

Details for the file acatrack_pdf_parser_rs-0.1.0.tar.gz.

File metadata

Download URL: acatrack_pdf_parser_rs-0.1.0.tar.gz
Upload date: May 27, 2026
Size: 24.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.12.6

File hashes

Hashes for acatrack_pdf_parser_rs-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`d38d299dd76389595f0bd82aa5c166996afe447d17e70b53b03263e62d06d376`
MD5	`02b3c1e7f1483a14a8b8fbb0ffd1a7db`
BLAKE2b-256	`30435a6a05ee39b39c3408fa2929ea865418db8bc1285e34e27b06cda56870d2`

File details

Details for the file acatrack_pdf_parser_rs-0.1.0-cp314-cp314-manylinux_2_34_x86_64.whl.

File metadata

Download URL: acatrack_pdf_parser_rs-0.1.0-cp314-cp314-manylinux_2_34_x86_64.whl
Upload date: May 27, 2026
Size: 1.7 MB
Tags: CPython 3.14, manylinux: glibc 2.34+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.12.6

File hashes

Hashes for acatrack_pdf_parser_rs-0.1.0-cp314-cp314-manylinux_2_34_x86_64.whl
Algorithm	Hash digest
SHA256	`c1d68bcb15e67f57499a37c30923f6e5cd8dfc5377c587f7cb241205a33e8387`
MD5	`8308f9511387b9dc8a8188c254905979`
BLAKE2b-256	`a2aa25227e81befd5b483136a473293663c296bfc96a4b13721c0992ac72c1c7`
