High-performance, native parallel Rust PDF parser engine for VTU provisional results

⚡ acatrack-pdf-parser-rs

A native Rust PDF parsing engine built to extract structured student records and provisional exam marks. It is exposed to Python through PyO3 and Maturin, and uses Rayon to parse batches of PDFs in parallel across CPU cores, cutting batch ingestion time.

Developed as the core ingestion engine of AcaTrack, the parser resolves misaligned visual table layouts with arithmetic checks and, in the benchmark below, runs 38.4x faster than the sequential Python parser core.


🚀 Key Features

  • 🏎️ Rayon Parallelization: Multi-threaded PDF table extraction that runs outside the Python GIL and spreads work across all available CPU cores.
  • 🛡️ Spacing-Robust Digit Concatenation: Reconstructs fragmented, narrow visual columns (e.g., visual layout splits like "4" and "5" for a score of 45) automatically.
  • 📐 Virtual Row Splitting: Automatically splits stacked cell values separated by newlines (\n) into neat, index-aligned rows.
  • 🧮 Mathematical Verification: Automatically runs an arithmetic checksum ($\text{IA} + \text{SEE} = \text{Total}$) on every extracted row, targeting 100% parsing accuracy.
  • ⚡ PyO3 FFI Bridge: Compiled into a native .so / .pyd module that can be imported directly in Python with zero performance loss.
  • 📊 FFI Telemetry: Streams granular Rust execution logs back into Python for instant diagnostic debugging.
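
The row splitting, digit concatenation, and checksum features listed above can be illustrated with a minimal Python sketch. It is not the library's Rust implementation: the function names (`split_stacked_cells`, `join_digit_fragments`, `checksum_ok`) and the four-cell row layout (code, IA, SEE, Total) are assumptions made for illustration only.

```python
from typing import List, Optional


def split_stacked_cells(row: List[str]) -> List[List[str]]:
    """Split cells holding newline-stacked values into index-aligned rows."""
    if not row:
        return []
    columns = [cell.split("\n") for cell in row]
    height = max(len(col) for col in columns)
    # Pad short columns with "" so every virtual row has the same width.
    return [
        [col[i].strip() if i < len(col) else "" for col in columns]
        for i in range(height)
    ]


def join_digit_fragments(cell: str) -> Optional[int]:
    """Rebuild a number whose digits were split by layout spacing ("4 5" -> 45)."""
    joined = "".join(cell.split())
    if joined.isascii() and joined.isdigit():
        return int(joined)
    return None  # not a clean number; the caller decides how to handle it


def checksum_ok(ia: int, see: int, total: int) -> bool:
    """Return True when IA + SEE equals the printed Total."""
    return ia + see == total


# Hypothetical extracted row: two subjects stacked inside each cell.
stacked = ["BMATS101\nBCHES102", "4 5\n38", "40\n35", "85\n73"]
for code, ia_raw, see_raw, total_raw in split_stacked_cells(stacked):
    ia, see, total = (join_digit_fragments(c) for c in (ia_raw, see_raw, total_raw))
    ok = None not in (ia, see, total) and checksum_ok(ia, see, total)
    print(code, ia, see, total, "OK" if ok else "MISMATCH")
```

A row that fails the checksum is a signal that a digit was dropped or attached to the wrong column, which is why the check pairs naturally with fragment concatenation.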

📊 Performance Benchmarks (1,308 PDFs)

Measured on 1,308 PDFs (submitted across 4 ZIP upload requests) containing first-year provisional university results:

| Metric | Sequential Python Core | Parallel Rust Engine (This Library) 🚀 | Net Improvement |
| --- | --- | --- | --- |
| Total parsing duration | 21.78 minutes | 34.06 seconds (~0.57 min) | 38.4x faster 🚀 |
| Speed per PDF | 0.9992 seconds | 0.0260 seconds | 38.4x faster 🚀 |
| Net memory impact | +268.48 MB | +76.43 MB | 71.5% lower RAM 📉 |
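
The derived figures in the benchmark can be re-checked from the raw durations and memory numbers with a few lines of arithmetic. The variable names are illustrative; the inputs are the values reported above.

```python
# Raw figures reported in the benchmark.
python_seconds = 21.78 * 60   # sequential Python core, total duration
rust_seconds = 34.06          # parallel Rust engine, total duration
pdf_count = 1308
python_mb = 268.48
rust_mb = 76.43

speedup = python_seconds / rust_seconds
per_pdf_python = python_seconds / pdf_count
per_pdf_rust = rust_seconds / pdf_count
ram_saving = 1 - rust_mb / python_mb

print(f"speedup:        {speedup:.1f}x")
print(f"per PDF (py):   {per_pdf_python:.4f} s")
print(f"per PDF (rust): {per_pdf_rust:.4f} s")
print(f"RAM saving:     {ram_saving * 100:.1f}%")
```

The per-PDF time for the Python core computed this way differs from the reported 0.9992 seconds in the last digit, presumably because the total duration is itself rounded to two decimals.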

🛠️ Architecture

The parser uses a two-tier strategy with a fallback, so extraction keeps working when the page layout changes:

graph TD
    A[Raw PDF Page] --> B{Tier 1: Clean Column Scan}
    B -- Found Code & Split Cells --> C[Unified Token Concatenation & Math Verification]
    B -- Layout Grid Failure --> D{Tier 2: Fallback Flat Text Scan}
    D --> E[Tokenize flat whitespace stream]
    E --> F[Match codes & parse trailing numeric pairs]
    C --> G[StudentRecord PyDict Object]
    F --> G
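
The control flow in the diagram can be sketched in Python as follows. This is an illustration of the fallback pattern, not the engine's Rust code: the function names, the `[code, IA, SEE, Total]` row layout, and the choice to read the first three numeric tokens after a subject code in Tier 2 are all assumptions.

```python
from typing import Dict, Optional, Sequence, Tuple

Marks = Dict[str, Tuple[int, int, int]]  # code -> (IA, SEE, Total); hypothetical shape


def tier1_column_scan(table: Sequence[Sequence[str]], codes: Sequence[str]) -> Optional[Marks]:
    """Read marks from a clean grid; return None when the grid is unusable."""
    found: Marks = {}
    for row in table:
        if len(row) >= 4 and row[0] in codes:
            try:
                # Removing inner spaces also repairs split digits such as "4 5".
                ia, see, total = (int(cell.replace(" ", "")) for cell in row[1:4])
            except ValueError:
                return None
            if ia + see != total:
                return None
            found[row[0]] = (ia, see, total)
    # Treat a missing subject as a layout failure so Tier 2 gets a chance.
    return found if len(found) == len(codes) else None


def tier2_flat_scan(text: str, codes: Sequence[str]) -> Marks:
    """Tokenize the flat text stream and read the numbers that follow each code."""
    tokens = text.split()
    found: Marks = {}
    for i, token in enumerate(tokens):
        if token not in codes:
            continue
        numbers = []
        for nxt in tokens[i + 1:]:
            if nxt in codes or len(numbers) == 3:
                break
            if nxt.isascii() and nxt.isdigit():
                numbers.append(int(nxt))
        if len(numbers) == 3 and numbers[0] + numbers[1] == numbers[2]:
            found[token] = (numbers[0], numbers[1], numbers[2])
    return found


def parse_page(table, text, codes) -> Tuple[Marks, str]:
    """Try the column scan first, then fall back to the flat text scan."""
    marks = tier1_column_scan(table, codes)
    if marks is not None:
        return marks, "tier1"
    return tier2_flat_scan(text, codes), "tier2"
```

Returning the tier name alongside the marks makes it easy to log how often the fallback path is taken.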

📦 Getting Started

Prerequisites

  • Rust Toolchain: rustup, rustc, cargo (latest stable release)
  • Python: 3.8+
  • Maturin: pip install maturin

Local Development & Setup

  1. Clone the repository:

    git clone https://github.com/chetanuchiha16/acatrack-pdf-parser-rs.git
    cd acatrack-pdf-parser-rs
    
  2. Compile and install locally into your active Python environment:

    # Builds in release mode and sets up an editable package link
    maturin develop --release
    
  3. Verify that the compiled module imports:

    python -c "import acatrack_rust; print(acatrack_rust.__doc__)"
    

🐍 Python Usage Example

import acatrack_rust

# Target subjects to scan for
target_subjects = ["BMATS101", "BCHES102", "BCEDK103", "BENGK106"]

# Parse a single PDF file
record = acatrack_rust.parse_single_pdf(
    pdf_path="path/to/student_result.pdf",
    subject_codes=target_subjects
)

if record:
    print(f"USN: {record['usn']}")
    print(f"Name: {record['name']}")
    print(f"Marks Extracted: {record['marks']}")
    print("\n--- Telemetry Logs ---")
    for log in record['logs']:
        print(log)
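
For a folder of PDFs, the single-file call can be wrapped in a small helper that separates parsed records from failures. This is a sketch under stated assumptions: `parse_many` is a hypothetical helper rather than part of the library, it assumes the parse function returns a falsy value or raises when a file cannot be parsed, and it takes the parse function as a parameter so it can be tried without the extension installed.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def parse_many(
    pdf_paths: Iterable[Path],
    subject_codes: List[str],
    parse_fn: Callable[..., Optional[dict]],
) -> Tuple[Dict[str, dict], List[str]]:
    """Parse each PDF with parse_fn; return (records keyed by USN, failed paths)."""
    records: Dict[str, dict] = {}
    failed: List[str] = []
    for path in pdf_paths:
        try:
            record = parse_fn(pdf_path=str(path), subject_codes=subject_codes)
        except Exception:
            # Assumption: the extension may raise on unreadable files.
            record = None
        if record:
            records[record["usn"]] = record
        else:
            failed.append(str(path))
    return records, failed


# Intended use with the real module (not executed here):
#   import acatrack_rust
#   records, failed = parse_many(
#       Path("results").glob("*.pdf"), target_subjects, acatrack_rust.parse_single_pdf
#   )
```

This loop calls the single-file function one PDF at a time, so it does not by itself exercise the Rayon batch parallelism described earlier.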


📄 License

Licensed under the MIT License.
