Scientific paper processing pipeline for metadata extraction from PDF/DOCX/LaTeX/etc.
Reason this release was yanked:
wrong contact info
Project description
bibr 🦫
Description
A scientific paper processing pipeline that extracts comprehensive metadata from PDF, DOCX, LaTeX, and other document formats. Designed as a preprocessing backend for Metacheck.
Overview
bibr is written in Python. Its main goal is higher extraction accuracy than existing tools, which it pursues by combining traditional parsing with more modern, LLM-assisted approaches, and it offers additional features on top of that.
Features
- easy-to-use Python library, install with uv
- modular, scalable pipeline architecture
- supports PDF, TXT, Markdown, DOCX, HTML, etc. as input
- outputs an Arrow stream with parsed metadata, references, full-text content, etc.
- extracts:
  - paper metadata (title, authors, DOI, etc.) with high accuracy
  - sections with semantic classification of canonical sections (abstract, methods, etc.)
  - sentences (using spaCy tokenization for better accuracy)
  - tables with representation in markdown and JSON (currently fragile)
  - references/bibs
  - (URL) links
- uses a mix of traditional methods (Regex, NLP) and structured LLM-based extraction for better accuracy
- response caching with Redis, with TTL and versioning options
- API key authentication, rate limiting, and other security features
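The caching feature above (TTL plus versioning) can be illustrated with a minimal sketch. This is not bibr's actual code: the key layout, the `CACHE_VERSION` constant, and the in-memory `TTLCache` class are hypothetical stand-ins for a Redis-backed cache, chosen so the example runs without a Redis server.

```python
import hashlib
import time

# Hypothetical version tag: bumping it makes every old cache entry unreachable.
CACHE_VERSION = "v1"


def cache_key(content: bytes, step: str, version: str = CACHE_VERSION) -> str:
    """Build a key from the version, the pipeline step, and a content hash."""
    digest = hashlib.sha256(content).hexdigest()
    return f"bibr:{version}:{step}:{digest}"


class TTLCache:
    """In-memory stand-in for a Redis cache with per-entry expiry."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock  # injectable so expiry can be tested without sleeping

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Entry has outlived its TTL: drop it and report a miss.
            del self._store[key]
            return None
        return value
```

Putting the version into the key means a change to the extraction logic invalidates stale responses without an explicit flush, while the TTL bounds how long any response is reused.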
Dependencies
Not every service is used for every document; the pipeline adapts based on input type and available metadata.
- LightOnOCR 2: Open source OCR model for PDFs, made in EU (https://huggingface.co/lightonai/LightOnOCR-2-1B)
- Pandoc: Converts markdown/docx/LaTeX/etc. documents into a standardized structured tree format (AST)
- LLM API: Helps extract paper metadata with structured output
- Crossref API: Optional - enhances reference metadata, checks for DOIs, etc.
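As a rough illustration of how a pipeline might pick services by input type, the sketch below routes PDFs to OCR and Pandoc-readable formats to Pandoc, as the list above describes. The function name, the backend labels, and the handling of `.html` and `.txt` are assumptions made for this example, not bibr's documented behavior.

```python
from pathlib import Path

# Formats the list above attributes to Pandoc; ".html" is an assumption here.
PANDOC_SUFFIXES = {".md", ".markdown", ".docx", ".tex", ".html"}


def choose_backend(path: str) -> str:
    """Return a hypothetical backend label for the given input file."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "ocr"  # PDFs go through the OCR model
    if suffix in PANDOC_SUFFIXES:
        return "pandoc"  # converted into a structured AST
    if suffix == ".txt":
        return "plain_text"  # assumption: plain text needs no conversion
    raise ValueError(f"unsupported input type: {suffix or '(none)'}")


def plan_services(path: str, use_crossref: bool = True) -> list:
    """List the services a document would touch, in order."""
    steps = [choose_backend(path), "llm_metadata"]
    if use_crossref:
        steps.append("crossref")  # optional enrichment step
    return steps
```

For example, `plan_services("thesis.tex", use_crossref=False)` yields `["pandoc", "llm_metadata"]`.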
LLM disclaimer
In some places, bibr selectively uses Large Language Models to extract metadata more accurately. This is done only where strictly needed, such as with references and citations, where traditional methods (Regex) are not always accurate or robust across varying contexts and citation styles.
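One way to picture this selective use is a regex-first extractor that calls an LLM only when the pattern fails. The sketch below does this for DOIs. The ordering, the function name, and the `llm_fallback` callable are assumptions for illustration; the pattern is a commonly used approximation of modern DOI syntax, not bibr's own.

```python
import re
from typing import Callable, Optional

# Approximate pattern for modern DOIs: "10.", a 4-9 digit prefix, "/", a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_doi(
    reference: str,
    llm_fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Try the cheap regex first; defer to the LLM only if it finds nothing."""
    match = DOI_PATTERN.search(reference)
    if match:
        # Trailing sentence punctuation is valid inside the pattern, so trim it.
        return match.group(0).rstrip(".,;")
    if llm_fallback is not None:
        return llm_fallback(reference)  # hypothetical structured-output call
    return None
```

With this arrangement, well-formed references never reach the LLM, which keeps cost and data exposure down.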
Supported LLM Providers
bibr supports multiple LLM providers. Set LLM_PROVIDER in your .env file:
- google (default): Google AI Studio (Gemini). Requires GOOGLE_API_KEY.
- openai: OpenAI API or any OpenAI-compatible endpoint. Requires LLM_API_KEY. Use LLM_BASE_URL for custom endpoints (vLLM, LM Studio, etc.).
- ollama: Local Ollama instance. Set OLLAMA_BASE_URL if not using the default http://localhost:11434.
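The following sketch shows how these variables could be resolved into a settings object. The environment variable names and the default Ollama URL come from the description above; the `LLMSettings` class and `load_llm_settings` function are hypothetical and do not reflect bibr's internal API.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMSettings:
    provider: str
    api_key: Optional[str]
    base_url: Optional[str]


def load_llm_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Resolve provider settings from environment-style key/value pairs."""
    provider = env.get("LLM_PROVIDER", "google").strip().lower()
    if provider == "google":
        key = env.get("GOOGLE_API_KEY")
        if not key:
            raise ValueError("GOOGLE_API_KEY is required for the google provider")
        return LLMSettings("google", key, None)
    if provider == "openai":
        key = env.get("LLM_API_KEY")
        if not key:
            raise ValueError("LLM_API_KEY is required for the openai provider")
        # LLM_BASE_URL stays None unless a custom endpoint is configured.
        return LLMSettings("openai", key, env.get("LLM_BASE_URL"))
    if provider == "ollama":
        url = env.get("OLLAMA_BASE_URL", "http://localhost:11434")
        return LLMSettings("ollama", None, url)
    raise ValueError(f"unknown LLM_PROVIDER: {provider!r}")
```

Failing early on a missing key gives a clearer error than a rejected request deep inside the pipeline.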
Privacy and security
We recommend choosing LLM providers with strict privacy policies, or using open source models. Handle the API keys in your .env file with care; when deploying to production, use a secrets manager instead.
How to run
Download files
File details
Details for the file bibr-0.0.1.tar.gz.
File metadata
- Download URL: bibr-0.0.1.tar.gz
- Upload date:
- Size: 15.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 24d63e38d15c9aab137d3d2b891d742c97cf4c5149823475f535e2896c168126 |
| MD5 | b6408ffcf48d8c7eaf8e2089613d1802 |
| BLAKE2b-256 | 616a82faa00fbaa56bb360e52791f778785636c703a54a241df47c377ed1821a |
File details
Details for the file bibr-0.0.1-py3-none-any.whl.
File metadata
- Download URL: bibr-0.0.1-py3-none-any.whl
- Upload date:
- Size: 15.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 07b3b7e724405b08a3cc726d7720c5fac0f0d18692490bfd29d32a28d4f613f0 |
| MD5 | ca3298eec12d9364332fca274cb3e88b |
| BLAKE2b-256 | ac7745832549ab5be4427f6e4c09a78493c5c476b052a63eff0532f816871358 |
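A downloaded file can be checked against the SHA256 digests listed above with the standard library alone. This is a minimal sketch: the helper names are hypothetical, and the digests are copied from the file details on this page.

```python
import hashlib
from pathlib import Path

# SHA256 digests as published in the file details above.
EXPECTED_SHA256 = {
    "bibr-0.0.1.tar.gz": "24d63e38d15c9aab137d3d2b891d742c97cf4c5149823475f535e2896c168126",
    "bibr-0.0.1-py3-none-any.whl": "07b3b7e724405b08a3cc726d7720c5fac0f0d18692490bfd29d32a28d4f613f0",
}


def sha256_of_stream(stream, chunk_size: int = 65536) -> str:
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: str) -> bool:
    """Compare a local file's digest with the published one for its file name."""
    expected = EXPECTED_SHA256[Path(path).name]  # KeyError for unknown files
    with open(path, "rb") as handle:
        return sha256_of_stream(handle) == expected
```

Calling `verify_file("bibr-0.0.1.tar.gz")` on a local copy returns `True` only if its contents match the published digest.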