High-performance, native parallel Rust PDF parser engine for VTU provisional results
⚡ acatrack-pdf-parser-rs
A high-performance, native Rust PDF parsing engine designed specifically to extract structured academic student records and provisional exam marks. It is bridged to Python via PyO3 and Maturin, and uses Rayon to parallelize work across CPU threads and cut batch ingestion times.
Developed as the core ingestion engine of AcaTrack, this parser resolves complex visual layout alignment issues arithmetically and ran 38.4x faster than a sequential Python parser in the 1,308-PDF benchmark reported in this README.
🚀 Key Features
- 🏎️ Rayon Parallelization: GIL-free multi-threaded PDF table extraction using CPU core saturation.
- 🛡️ Spacing-Robust Digit Concatenation: Reconstructs fragmented, narrow visual columns (e.g., visual layout splits like "4" and "5" for a score of 45) automatically.
- 📐 Virtual Row Splitting: Automatically splits stacked cell values separated by newlines (\n) into neat, index-aligned rows.
- 📐 Mathematical Verification: Automatically executes algebraic checksum checks ($\text{IA} + \text{SEE} = \text{Total}$) to guarantee 100% parsing accuracy.
- ⚡ PyO3 FFI Bridge: Compiled into a native .so/.pyd module that can be imported directly in Python with zero performance loss.
- 📊 FFI Telemetry: Streams granular Rust execution logs back into Python for instant diagnostic debugging.
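The digit concatenation, virtual row splitting, and checksum features listed above can be illustrated with a minimal Python sketch. This is not the engine's Rust implementation: the function names `join_digit_fragments`, `split_stacked_cells`, and `marks_are_consistent` are hypothetical, the cell layout (code, IA, SEE, Total) is an assumption, and the sample marks are made up for illustration.

```python
from typing import List, Optional


def join_digit_fragments(fragments: List[str]) -> Optional[int]:
    """Join digit fragments such as ["4", "5"] into 45.

    Returns None when the joined text is not a plain ASCII number.
    """
    joined = "".join(part.strip() for part in fragments)
    if joined.isascii() and joined.isdigit():
        return int(joined)
    return None


def split_stacked_cells(cells: List[str]) -> List[List[str]]:
    """Turn newline-stacked cells into index-aligned rows.

    Each input cell is one column; line i of every column becomes row i.
    Columns with fewer lines are padded with empty strings.
    """
    if not cells:
        return []
    columns = [cell.split("\n") for cell in cells]
    height = max(len(column) for column in columns)
    return [
        [column[i].strip() if i < len(column) else "" for column in columns]
        for i in range(height)
    ]


def marks_are_consistent(ia: int, see: int, total: int) -> bool:
    """Checksum from the feature list: IA + SEE must equal Total."""
    return ia + see == total


# Hypothetical stacked row: two subjects share one visual table row.
stacked_row = ["BMATS101\nBCHES102", "45\n38", "40\n35", "85\n73"]
for code, ia, see, total in split_stacked_cells(stacked_row):
    ok = marks_are_consistent(int(ia), int(see), int(total))
    print(code, ia, see, total, "OK" if ok else "MISMATCH")
```

A row that fails the checksum is a signal that a column was misread, which is why the check is useful as a post-parse guard rather than as a parsing step in itself.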
📊 Performance Benchmarks (1,308 PDFs)
Tested over 1,308 PDFs (across 4 ZIP upload requests) containing freshman provisional university results:
| Metric | Sequential Python Core | Parallel Rust Engine (This Library) 🚀 | Net Improvement |
|---|---|---|---|
| Total Parsing Duration | 21.78 minutes | 34.06 seconds (~0.57 min) | 38.4x Faster 🚀 |
| Speed per PDF | 0.9992 seconds | 0.0260 seconds | 38.4x Faster 🚀 |
| Memory Net Impact | +268.48 MB | +76.43 MB | 71.5% Lower RAM 📉 |
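The derived figures in the table follow from the raw totals. The short calculation below recomputes them from the reported durations and memory deltas; the variable names are illustrative, and the last digit of the per-PDF time can differ slightly from the table because the 21.78-minute total is itself rounded.

```python
# Raw figures taken from the benchmark table above.
PDF_COUNT = 1308
python_total_s = 21.78 * 60   # 21.78 minutes expressed in seconds
rust_total_s = 34.06
python_mem_mb = 268.48
rust_mem_mb = 76.43

# Derived metrics.
speedup = python_total_s / rust_total_s
python_per_pdf_s = python_total_s / PDF_COUNT
rust_per_pdf_s = rust_total_s / PDF_COUNT
memory_reduction_pct = (1 - rust_mem_mb / python_mem_mb) * 100

print(f"Speedup:          {speedup:.1f}x")
print(f"Python per PDF:   {python_per_pdf_s:.4f} s")
print(f"Rust per PDF:     {rust_per_pdf_s:.4f} s")
print(f"Memory reduction: {memory_reduction_pct:.1f}%")
```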
🛠️ Architecture
The parser integrates a dual-tier parsing fallback mechanism to remain robust across layout changes:
graph TD
A[Raw PDF Page] --> B{Tier 1: Clean Column Scan}
B -- Found Code & Split Cells --> C[Unified Token Concatenation & Math Verification]
B -- Layout Grid Failure --> D{Tier 2: Fallback Flat Text Scan}
D --> E[Tokenize flat whitespace stream]
E --> F[Match codes & parse trailing numeric pairs]
C --> G[StudentRecord PyDict Object]
F --> G
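The control flow in the diagram can be sketched in Python as follows. This is an illustration under stated assumptions, not the engine's Rust code: the function names are hypothetical, Tier 1 is assumed to receive already-extracted table rows of the form (code, IA, SEE, Total), a "layout grid failure" is modeled simply as Tier 1 finding no valid rows, and Tier 2 is assumed to take the first three numbers after a subject code as IA, SEE, and Total. The sample text is invented.

```python
from typing import Dict, List, Set, Tuple

Marks = Tuple[int, int, int]  # (IA, SEE, Total)


def is_number(token: str) -> bool:
    return token.isascii() and token.isdigit()


def tier1_grid_scan(rows: List[List[str]], codes: Set[str]) -> Dict[str, Marks]:
    """Tier 1: read marks from clean table rows [code, IA, SEE, Total]."""
    found: Dict[str, Marks] = {}
    for row in rows:
        cells = [cell.strip() for cell in row]
        if len(cells) < 4 or cells[0] not in codes:
            continue
        if not all(is_number(cell) for cell in cells[1:4]):
            continue
        ia, see, total = (int(cell) for cell in cells[1:4])
        if ia + see == total:  # checksum guard
            found[cells[0]] = (ia, see, total)
    return found


def tier2_flat_scan(text: str, codes: Set[str]) -> Dict[str, Marks]:
    """Tier 2: tokenize on whitespace and read the numbers trailing each code."""
    tokens = text.split()
    found: Dict[str, Marks] = {}
    for index, token in enumerate(tokens):
        if token not in codes:
            continue
        numbers: List[int] = []
        for follower in tokens[index + 1:]:
            if follower in codes:  # reached the next subject
                break
            if is_number(follower):
                numbers.append(int(follower))
            if len(numbers) == 3:
                break
        if len(numbers) == 3 and numbers[0] + numbers[1] == numbers[2]:
            found[token] = (numbers[0], numbers[1], numbers[2])
    return found


def parse_page(rows: List[List[str]], text: str, codes: Set[str]) -> Dict[str, Marks]:
    """Try the grid scan first; fall back to the flat text scan if it finds nothing."""
    result = tier1_grid_scan(rows, codes)
    return result if result else tier2_flat_scan(text, codes)


flat_text = "BMATS101 MATHEMATICS-I 45 40 85 P BCHES102 CHEMISTRY 38 35 73 P"
print(parse_page([], flat_text, {"BMATS101", "BCHES102"}))
```

Keeping both tiers behind one entry point means callers receive the same result shape regardless of which path succeeded.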
📦 Getting Started
Prerequisites
- Rust Toolchain: rustup, rustc, cargo (Latest stable edition)
- Python: 3.8+
- Maturin: pip install maturin
Local Development & Setup
- Clone the repository:

      git clone https://github.com/chetanuchiha16/acatrack-pdf-parser-rs.git
      cd acatrack-pdf-parser-rs

- Compile and install locally into your active Python environment:

      # Builds in release mode and sets up an editable package link
      maturin develop --release

- Verify compilation:

      python -c "import acatrack_rust; print(acatrack_rust.__doc__)"
🐍 Python Usage Example
    import acatrack_rust

    # Target subjects to scan for
    target_subjects = ["BMATS101", "BCHES102", "BCEDK103", "BENGK106"]

    # Parse a single PDF file
    record = acatrack_rust.parse_single_pdf(
        pdf_path="path/to/student_result.pdf",
        subject_codes=target_subjects
    )

    if record:
        print(f"USN: {record['usn']}")
        print(f"Name: {record['name']}")
        print(f"Marks Extracted: {record['marks']}")
        print("\n--- Telemetry Logs ---")
        for log in record['logs']:
            print(log)
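To apply the single-file call to a folder of results, a small wrapper along these lines may help. It is a sketch only: `parse_directory` is a hypothetical helper, not part of the library, and it takes the parse function as an argument so it stays runnable without the compiled module. It assumes only what the example above shows, namely that the parser accepts `pdf_path` and `subject_codes` keyword arguments and returns a falsy value when nothing could be parsed. The loop is sequential on the Python side and does not by itself exercise the engine's batch parallelism.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List, Sequence, Tuple


def parse_directory(
    directory: str,
    subject_codes: Sequence[str],
    parse_fn: Callable[..., Any],
) -> Tuple[Dict[str, Any], List[str]]:
    """Parse every *.pdf in a directory and separate successes from failures.

    parse_fn is called as parse_fn(pdf_path=..., subject_codes=...), matching
    the keyword arguments used in the usage example.
    """
    parsed: Dict[str, Any] = {}
    failed: List[str] = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        record = parse_fn(pdf_path=str(pdf), subject_codes=list(subject_codes))
        if record:
            parsed[pdf.name] = record
        else:
            failed.append(pdf.name)
    return parsed, failed


# Intended use with the real module (requires the compiled extension):
#   import acatrack_rust
#   parsed, failed = parse_directory(
#       "path/to/results", ["BMATS101"], acatrack_rust.parse_single_pdf
#   )
```

Collecting the failed file names separately makes it easy to re-check those PDFs by hand or inspect their telemetry logs.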
📄 License
Licensed under the MIT License.
File details
Details for the file acatrack_pdf_parser_rs-0.1.0.tar.gz.
File metadata
- Download URL: acatrack_pdf_parser_rs-0.1.0.tar.gz
- Upload date:
- Size: 24.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d38d299dd76389595f0bd82aa5c166996afe447d17e70b53b03263e62d06d376 |
| MD5 | 02b3c1e7f1483a14a8b8fbb0ffd1a7db |
| BLAKE2b-256 | 30435a6a05ee39b39c3408fa2929ea865418db8bc1285e34e27b06cda56870d2 |
File details
Details for the file acatrack_pdf_parser_rs-0.1.0-cp314-cp314-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: acatrack_pdf_parser_rs-0.1.0-cp314-cp314-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 1.7 MB
- Tags: CPython 3.14, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c1d68bcb15e67f57499a37c30923f6e5cd8dfc5377c587f7cb241205a33e8387 |
| MD5 | 8308f9511387b9dc8a8188c254905979 |
| BLAKE2b-256 | a2aa25227e81befd5b483136a473293663c296bfc96a4b13721c0992ac72c1c7 |