Scientific paper processing pipeline for metadata extraction from PDF/DOCX/LaTeX/etc.

bibr 🦫

Description

Scientific paper processing pipeline that extracts comprehensive metadata from PDF, DOCX, LaTeX, and other formats. Designed as a preprocessing backend for Metacheck.

Overview

bibr is written in Python. Its core aim is better accuracy than other tools, achieved with more modern approaches, plus additional features on top.

Features

- easy-to-use Python library, install with uv
- modular, scalable pipeline architecture
- supports PDF, TXT, Markdown, DOCX, HTML, etc. as input
- outputs an Arrow stream with parsed metadata, references, full-text content, etc.
- extracts:
  - paper metadata (title, authors, DOI, etc.) with high accuracy
  - sections with semantic classification of canonical sections (abstract, methods, etc.)
  - sentences (using spaCy tokenization for better accuracy)
  - tables with representation in markdown and JSON (currently fragile)
  - references/bibs
  - (URL) links
- uses a mix of traditional methods (Regex, NLP) and structured LLM-based extraction for better accuracy
- response caching with Redis, with TTL and versioning options
- API key, rate limiting, security features
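
To make the caching feature (TTL plus versioning) more concrete, here is a minimal sketch. It is not bibr's actual code: `make_cache_key`, `TTLCache`, and the key layout are hypothetical names chosen for illustration, and an in-memory dict stands in for Redis so the example runs without a server. The assumed idea is that the key combines a hash of the input with a pipeline version, so bumping the version leaves old entries unused, while the TTL expires stale ones.

```python
import hashlib
import time


def make_cache_key(document_bytes: bytes, pipeline_version: str) -> str:
    """Build a cache key from document content and a version tag (hypothetical layout)."""
    digest = hashlib.sha256(document_bytes).hexdigest()
    return f"bibr:{pipeline_version}:{digest}"


class TTLCache:
    """In-memory stand-in for a Redis-style cache with expiry (assumption, not bibr's code)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        # Remember the value together with the moment it stops being valid.
        self._store[key] = (self._clock() + ttl_seconds, value)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry outlived its TTL: drop it and report a miss.
            del self._store[key]
            return None
        return value
```

With a real Redis server, the same key could be written with the `SET` command and its `EX` expiry option; that wiring is left out here on purpose.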

Dependencies

Not every service is used for every document; the pipeline adapts based on input type and available metadata.

LightOnOCR 2: Open source OCR model for PDFs, made in the EU (https://huggingface.co/lightonai/LightOnOCR-2-1B)
Pandoc: Converts markdown/docx/LaTeX/etc. documents into a standardized structured tree format (AST)
LLM API: Helps extract paper metadata with structured output
Crossref API: Optional - enhances reference metadata, checks for DOIs, etc.
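
As a rough illustration of how a pipeline might adapt to the input type, the sketch below picks processing steps from the file extension. Everything in it is an assumption made for the example: the function `plan_steps`, the step names, and the extension table are hypothetical, and bibr's real routing logic may differ (for instance, it may skip the LLM step for some documents).

```python
from pathlib import Path

# Hypothetical mapping from file extension to the conversion backend
# described in the dependency list.
BACKEND_BY_SUFFIX = {
    ".pdf": "ocr",      # PDFs go through the OCR model
    ".md": "pandoc",    # Pandoc handles markdown, DOCX, LaTeX
    ".docx": "pandoc",
    ".tex": "pandoc",
}


def plan_steps(path: str, use_crossref: bool = False) -> list[str]:
    """Return the ordered list of (hypothetical) step names for one input file."""
    suffix = Path(path).suffix.lower()
    backend = BACKEND_BY_SUFFIX.get(suffix)
    if backend is None:
        raise ValueError(f"no backend configured for {suffix!r}")
    steps = [backend, "llm_metadata"]
    if use_crossref:
        # Crossref enrichment is optional, so it is only added on request.
        steps.append("crossref")
    return steps
```

For example, `plan_steps("paper.pdf")` selects the OCR route, while `plan_steps("notes.md", use_crossref=True)` selects Pandoc and appends the optional Crossref step.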

LLM disclaimer

In some places, bibr selectively uses Large Language Models to extract metadata more accurately. This is done only where strictly needed, such as with references and citations, where traditional methods (Regex) are not always accurate and are sensitive to varying contexts and citation styles.
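
The limits of the Regex approach are easy to see with DOIs. The sketch below is not taken from bibr; `DOI_PATTERN` and `find_doi` are hypothetical, and the pattern is a commonly used approximation of the DOI shape rather than a full validator. The DOI in the usage note is made up.

```python
import re

# Common DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_doi(reference: str):
    """Return the first DOI-like string in a reference, or None if there is none."""
    match = DOI_PATTERN.search(reference)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    # This heuristic is itself fragile: a DOI that really ends in ")" gets cut.
    return match.group(0).rstrip(".,;)")
```

A reference such as `"Smith, J. (2020). A title. https://doi.org/10.1234/abcd.5678."` yields `10.1234/abcd.5678`, but a reference without a printed DOI yields nothing, and unusual suffixes can be truncated. Cases like these are where a structured LLM pass or a Crossref lookup can help.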

Supported LLM Providers

bibr supports multiple LLM providers. Set LLM_PROVIDER in your .env file:

google (default): Google AI Studio (Gemini). Requires GOOGLE_API_KEY.
openai: OpenAI API or any OpenAI-compatible endpoint. Requires LLM_API_KEY. Use LLM_BASE_URL for custom endpoints (vLLM, LM Studio, etc.).
ollama: Local Ollama instance. Set OLLAMA_BASE_URL if not using the default http://localhost:11434.
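
The provider settings above can be checked at startup with a few lines of Python. This is a minimal sketch, not bibr's own loader: `resolve_llm_config` and `_require` are hypothetical helpers, and only the variable names, the `google` default, and the default Ollama URL come from the list above.

```python
import os

DEFAULT_OLLAMA_URL = "http://localhost:11434"


def _require(env, name: str) -> str:
    """Fetch a mandatory variable or fail with a clear message."""
    value = env.get(name)
    if not value:
        raise ValueError(f"{name} must be set")
    return value


def resolve_llm_config(env=None) -> dict:
    """Turn environment variables into a provider config dict (hypothetical shape)."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "google")  # google is the documented default
    if provider == "google":
        return {"provider": provider, "api_key": _require(env, "GOOGLE_API_KEY")}
    if provider == "openai":
        config = {"provider": provider, "api_key": _require(env, "LLM_API_KEY")}
        if env.get("LLM_BASE_URL"):
            # Only needed for custom OpenAI-compatible endpoints.
            config["base_url"] = env["LLM_BASE_URL"]
        return config
    if provider == "ollama":
        base_url = env.get("OLLAMA_BASE_URL", DEFAULT_OLLAMA_URL)
        return {"provider": provider, "base_url": base_url}
    raise ValueError(f"unsupported LLM_PROVIDER: {provider!r}")
```

Passing a plain dict instead of `os.environ` makes the function easy to test, for example `resolve_llm_config({"LLM_PROVIDER": "ollama"})`.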

Privacy and security

We recommend LLM providers with strict privacy policies, or open source models. Be careful with the API keys in your .env file; when deploying to production, use a secrets manager instead.

How to run

Release history

0.0.2 (Feb 5, 2026)

0.0.1 (Feb 5, 2026), yanked. Reason this release was yanked: wrong contact info

Download files

Source Distribution

bibr-0.0.1.tar.gz (15.1 kB), uploaded Feb 5, 2026, Source

Built Distribution

bibr-0.0.1-py3-none-any.whl (15.2 kB), uploaded Feb 5, 2026, Python 3

File details

Details for the file bibr-0.0.1.tar.gz.

File metadata

Download URL: bibr-0.0.1.tar.gz
Upload date: Feb 5, 2026
Size: 15.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.8

File hashes

Hashes for bibr-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`24d63e38d15c9aab137d3d2b891d742c97cf4c5149823475f535e2896c168126`
MD5	`b6408ffcf48d8c7eaf8e2089613d1802`
BLAKE2b-256	`616a82faa00fbaa56bb360e52791f778785636c703a54a241df47c377ed1821a`

File details

Details for the file bibr-0.0.1-py3-none-any.whl.

File metadata

Download URL: bibr-0.0.1-py3-none-any.whl
Upload date: Feb 5, 2026
Size: 15.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.8

File hashes

Hashes for bibr-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`07b3b7e724405b08a3cc726d7720c5fac0f0d18692490bfd29d32a28d4f613f0`
MD5	`ca3298eec12d9364332fca274cb3e88b`
BLAKE2b-256	`ac7745832549ab5be4427f6e4c09a78493c5c476b052a63eff0532f816871358`
