High-performance, native parallel Rust PDF parser engine for VTU provisional results

⚡ acatrack-pdf-parser-rs

A high-performance, native Rust PDF parsing engine designed specifically to extract structured academic student records and provisional exam marks. Bridged to Python via PyO3 and Maturin, it uses Rayon to parse PDFs in parallel across CPU cores, which cuts batch ingestion time.

Developed as the core ingestion engine of AcaTrack, this parser resolves visual layout alignment problems with arithmetic checks and ran 38.4x faster than a sequential Python parser in the 1,308-PDF benchmark below.


🚀 Key Features

  • 🏎️ Rayon Parallelization: GIL-free multi-threaded PDF table extraction using CPU core saturation.
  • 🛡️ Spacing-Robust Digit Concatenation: Reconstructs fragmented, narrow visual columns (e.g., visual layout splits like "4" and "5" for a score of 45) automatically.
  • 📐 Virtual Row Splitting: Automatically splits stacked cell values separated by newlines (\n) into neat, index-aligned rows.
  • 📐 Mathematical Verification: Automatically runs an algebraic checksum ($\text{IA} + \text{SEE} = \text{Total}$) on every parsed row to guarantee $100\%$ parsing accuracy.
  • ⚡ PyO3 FFI Bridge: Compiled into a native .so / .pyd module that can be imported directly in Python with zero performance loss.
  • 📊 FFI Telemetry: Streams granular Rust execution logs back into Python for instant diagnostic debugging.
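
The row splitting, digit concatenation and checksum features above can be illustrated with a short Python sketch. This is not the engine's Rust code. The function names (`split_stacked_cells`, `reconstruct_marks`) and the search strategy are assumptions made for illustration: digit fragments are rejoined by trying every split of the digit string and keeping the first one that satisfies IA + SEE = Total.

```python
from itertools import zip_longest


def split_stacked_cells(cells):
    """Split newline-stacked cell values into index-aligned virtual rows."""
    columns = [cell.split("\n") for cell in cells]
    # zip_longest pads short columns so every virtual row has the same width.
    return [list(row) for row in zip_longest(*columns, fillvalue="")]


def _is_canonical(number_text):
    """Accept digit strings without a leading zero (a lone "0" is allowed)."""
    return number_text.isdigit() and (number_text == "0" or number_text[0] != "0")


def reconstruct_marks(fragments):
    """Rejoin fragmented digit tokens into an (IA, SEE, Total) triple.

    Hypothetical strategy: concatenate all fragments, then try every way of
    cutting the digit string into three numbers and return the first cut
    that passes the checksum IA + SEE == Total. Returns None if none does.
    """
    digits = "".join(fragment.strip() for fragment in fragments)
    if not digits.isdigit():
        return None
    n = len(digits)
    for i in range(1, n - 1):
        for j in range(i + 1, n):
            parts = (digits[:i], digits[i:j], digits[j:])
            if not all(_is_canonical(part) for part in parts):
                continue
            ia, see, total = map(int, parts)
            if ia + see == total:
                return ia, see, total
    return None


# One table row whose cells each stack two subjects.
stacked = ["BMATS101\nBCHES102", "45\n38", "40\n35", "85\n73"]
for row in split_stacked_cells(stacked):
    print(row)

# A score of 45 that the PDF layout split into "4" and "5".
print(reconstruct_marks(["4", "5", "40", "85"]))
```

Short digit strings can have more than one split that passes the checksum, so a production parser would presumably also use column positions to choose between candidates. The sketch simply returns the first match.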

📊 Performance Benchmarks (1,308 PDFs)

Tested over 1,308 PDFs (across 4 ZIP upload requests) containing freshman provisional university results:

| Metric | Sequential Python Core | Parallel Rust Engine (This Library) 🚀 | Net Improvement |
|---|---|---|---|
| Total Parsing Duration | 21.78 minutes | 34.06 seconds (~0.57 min) | 38.4x Faster 🚀 |
| Speed per PDF | 0.9992 seconds | 0.0260 seconds | 38.4x Faster 🚀 |
| Memory Net Impact | +268.48 MB | +76.43 MB | 71.5% Lower RAM 📉 |
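
As a quick sanity check, the derived figures can be recomputed from the raw totals in the benchmark table. The sketch below uses only the published numbers. Because those inputs are already rounded, the per-PDF Python time comes out at roughly 0.9991 s, which matches the published 0.9992 s to about three decimal places.

```python
# Figures copied from the benchmark table above.
PDF_COUNT = 1308
python_seconds = 21.78 * 60  # 21.78 minutes expressed in seconds
rust_seconds = 34.06
python_mem_mb = 268.48
rust_mem_mb = 76.43

# Derived metrics.
speedup = python_seconds / rust_seconds
per_pdf_python = python_seconds / PDF_COUNT
per_pdf_rust = rust_seconds / PDF_COUNT
memory_reduction_pct = (1 - rust_mem_mb / python_mem_mb) * 100

print(f"Speedup:          {speedup:.1f}x")
print(f"Python per PDF:   {per_pdf_python:.4f} s")
print(f"Rust per PDF:     {per_pdf_rust:.4f} s")
print(f"Memory reduction: {memory_reduction_pct:.1f}%")
```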

🛠️ Architecture

The parser integrates a dual-tier parsing fallback mechanism to remain robust across layout changes:

graph TD
    A[Raw PDF Page] --> B{Tier 1: Clean Column Scan}
    B -- Found Code & Split Cells --> C[Unified Token Concatenation & Math Verification]
    B -- Layout Grid Failure --> D{Tier 2: Fallback Flat Text Scan}
    D --> E[Tokenize flat whitespace stream]
    E --> F[Match codes & parse trailing numeric pairs]
    C --> G[StudentRecord PyDict Object]
    F --> G
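
The control flow in the diagram can be sketched in Python as follows. This is an illustrative model, not the engine's Rust implementation. The input shapes (a `rows` grid of `[code, ia, see, total]` string cells and a flat `text` string), the function names and the returned dictionary layout are all assumptions. The sketch also assumes that the trailing numbers after each subject code are IA, SEE and Total, in that order.

```python
def scan_clean_columns(rows, subject_codes):
    """Tier 1: read marks from an already-aligned table grid.

    Returns None when the grid is missing or malformed, which signals
    that the caller should fall back to tier 2.
    """
    wanted = set(subject_codes)
    marks = {}
    for row in rows or []:
        if len(row) != 4 or row[0] not in wanted:
            continue
        if not all(cell.isdigit() for cell in row[1:]):
            return None  # layout grid failure
        ia, see, total = map(int, row[1:])
        if ia + see != total:
            return None  # checksum failure, do not trust the grid
        marks[row[0]] = {"ia": ia, "see": see, "total": total}
    return marks or None


def scan_flat_text(text, subject_codes):
    """Tier 2: tokenize the flat whitespace stream and match subject codes."""
    wanted = set(subject_codes)
    tokens = text.split()
    marks = {}
    for i, token in enumerate(tokens):
        if token not in wanted:
            continue
        numbers = []
        for following in tokens[i + 1:]:
            if following in wanted:
                break  # reached the next subject without enough numbers
            if following.isdigit():
                numbers.append(int(following))
                if len(numbers) == 3:
                    break
        # Keep the subject only if the three numbers pass the checksum.
        if len(numbers) == 3 and numbers[0] + numbers[1] == numbers[2]:
            ia, see, total = numbers
            marks[token] = {"ia": ia, "see": see, "total": total}
    return marks


def parse_page(rows, text, subject_codes):
    """Try the clean column scan first, then the flat text fallback."""
    tier1 = scan_clean_columns(rows, subject_codes)
    if tier1 is not None:
        return tier1
    return scan_flat_text(text, subject_codes)


codes = ["BMATS101", "BCHES102"]
flat = "BMATS101 MATHEMATICS 45 40 85 P BCHES102 CHEMISTRY 38 35 73 P"
print(parse_page(None, flat, codes))
```

Returning `None` from tier 1 instead of a partial result keeps the decision simple: a page is parsed entirely by one tier, so the two strategies never mix results.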

📦 Getting Started

Prerequisites

  • Rust Toolchain: rustup, rustc, cargo (latest stable release)
  • Python: 3.8+
  • Maturin: pip install maturin

Local Development & Setup

  1. Clone the repository:

    git clone https://github.com/chetanuchiha16/acatrack-pdf-parser-rs.git
    cd acatrack-pdf-parser-rs
    
  2. Compile and install locally into your active Python environment:

    # Builds in release mode and installs the module into the active virtualenv
    maturin develop --release
    
  3. Verify compilation:

    python -c "import acatrack_rust; print(acatrack_rust.__doc__)"
    

🐍 Python Usage Example

import acatrack_rust

# Target subjects to scan for
target_subjects = ["BMATS101", "BCHES102", "BCEDK103", "BENGK106"]

# Parse a single PDF file
record = acatrack_rust.parse_single_pdf(
    pdf_path="path/to/student_result.pdf",
    subject_codes=target_subjects
)

if record:
    print(f"USN: {record['usn']}")
    print(f"Name: {record['name']}")
    print(f"Marks Extracted: {record['marks']}")
    print("\n--- Telemetry Logs ---")
    for log in record['logs']:
        print(log)
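
To process a folder of PDFs from Python, the single-file call above can be wrapped in a small loop. The helper below is a hypothetical convenience function, not part of `acatrack_rust`. It takes the parser as a `parse_fn` argument so that it can be exercised without the compiled module, and it assumes that a falsy return value means the file could not be parsed, as the `if record:` check in the example suggests.

```python
from pathlib import Path


def parse_directory(pdf_dir, subject_codes, parse_fn):
    """Run parse_fn over every *.pdf in pdf_dir.

    parse_fn is called with the same keyword arguments as
    parse_single_pdf in the usage example (pdf_path, subject_codes).
    Returns (records, failed_paths).
    """
    records, failed = [], []
    # Sort for a deterministic processing order.
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        record = parse_fn(pdf_path=str(pdf_path), subject_codes=subject_codes)
        if record:
            records.append(record)
        else:
            failed.append(str(pdf_path))
    return records, failed
```

With the module installed, a call would look like `records, failed = parse_directory("results/", target_subjects, acatrack_rust.parse_single_pdf)`, where `"results/"` is a placeholder directory. This loop is sequential on the Python side. It makes no claim about how the Rust engine schedules its own work.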


📄 License

Licensed under the MIT License.
