
Scientific paper processing pipeline for metadata extraction from PDF/DOCX/LaTeX/etc.

Reason this release was yanked: wrong contact info

Project description

bibr 🦫

Made in Europe. Lifecycle: experimental.

Description

A scientific paper processing pipeline that extracts comprehensive metadata from PDF, DOCX, LaTeX, and other formats. It is designed as a preprocessing backend for Metacheck.

Overview

bibr aims for higher extraction accuracy than other tools by using more modern approaches, and it offers additional features on top. It is written in Python.

Features

  • easy-to-use Python library, install with uv
  • modular, scalable pipeline architecture
  • supports PDF, TXT, Markdown, DOCX, HTML, etc. as input
  • outputs an Arrow stream with parsed metadata, references, full-text content, etc.
  • extracts:
    • paper metadata (title, authors, DOI, etc.) with high accuracy
    • sections with semantic classification of canonical sections (abstract, methods, etc.)
    • sentences (using spaCy tokenization for better accuracy)
    • tables with representation in markdown and JSON (currently fragile)
    • references/bibs
    • (URL) links
  • uses a mix of traditional methods (Regex, NLP) and structured LLM-based extraction for better accuracy
  • response caching with Redis, with TTL and versioning options
  • API key authentication, rate limiting, and other security features
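
The response caching with TTL and versioning listed above could be sketched roughly as follows. This is not bibr's actual implementation: `CACHE_VERSION`, `cache_key`, `InMemoryCache`, and `cached_call` are hypothetical names. The in-memory class only mimics the `setex`/`get` calls of a redis-py client so that the example runs without a Redis server.

```python
import hashlib
import json
import time

# Hypothetical: bump this to invalidate entries written by older extraction logic.
CACHE_VERSION = "v1"


def cache_key(namespace: str, payload: dict, version: str = CACHE_VERSION) -> str:
    """Build a deterministic, versioned key from the request payload."""
    # Sorting keys makes the key independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{namespace}:{version}:{digest}"


class InMemoryCache:
    """Stand-in with the same setex/get shape as a redis-py client."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}

    def setex(self, name, ttl_seconds, value):
        # Store the value together with its absolute expiry time.
        self._store[name] = (self._clock() + ttl_seconds, value)

    def get(self, name):
        entry = self._store.get(name)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries behave like missing keys.
            del self._store[name]
            return None
        return value


def cached_call(cache, key, ttl_seconds, compute):
    """Return the cached value for key, or compute it and store it with a TTL."""
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = compute()
    cache.setex(key, ttl_seconds, json.dumps(result))
    return result
```

Putting the version into the key means a change in extraction logic never serves stale results; old entries simply expire through their TTL. With a real server, a `redis.Redis` client could be passed in place of `InMemoryCache`, since `json.loads` also accepts the bytes that redis-py returns by default.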

Dependencies

Not every service is used for every document; the pipeline adapts based on input type and available metadata.

  • LightOnOCR 2: Open-source OCR model for PDFs, made in the EU (https://huggingface.co/lightonai/LightOnOCR-2-1B)
  • Pandoc: Converts Markdown, DOCX, LaTeX, and other documents into a standardized structured tree format (AST)
  • LLM API: Helps extract paper metadata with structured output
  • Crossref API: Optional. Enhances reference metadata, checks for DOIs, etc.
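
As an illustration of the optional Crossref step, the sketch below finds a DOI in a reference string and builds the matching Crossref REST API URL. It is an assumption about how such a lookup might be prepared, not bibr's own code: `DOI_PATTERN`, `find_doi`, and `crossref_works_url` are hypothetical names, and the regular expression is a common heuristic rather than a complete DOI grammar. The HTTP request itself is left out.

```python
import re
from typing import Optional
from urllib.parse import quote

# Common DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)


def find_doi(text: str) -> Optional[str]:
    """Return the first DOI-looking string in text, lowercased, or None."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Heuristic: trailing punctuation usually belongs to the sentence, not the DOI.
    # DOIs are case-insensitive, so lowercasing gives a stable form for comparison.
    return match.group(0).rstrip(".,;)").lower()


def crossref_works_url(doi: str) -> str:
    """Build the Crossref REST API URL for a single work."""
    # Keep "/" unescaped; percent-encode anything else that is unsafe in a URL path.
    return "https://api.crossref.org/works/" + quote(doi, safe="/")


reference = "Doe, J. (2020). A study. https://doi.org/10.1000/XYZ123."
doi = find_doi(reference)
if doi is not None:
    print(crossref_works_url(doi))
```

Stripping trailing punctuation can occasionally cut a DOI that genuinely ends in a parenthesis, so a production pipeline would likely confirm the candidate against Crossref before trusting it.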

LLM disclaimer

bibr uses Large Language Models selectively to extract metadata more accurately. They are used only where strictly needed, such as for references and citations, where traditional methods (regex) are not always accurate or robust across varying contexts and citation styles.

Supported LLM Providers

bibr supports multiple LLM providers. Set LLM_PROVIDER in your .env file:

  • google (default): Google AI Studio (Gemini). Requires GOOGLE_API_KEY.
  • openai: OpenAI API or any OpenAI-compatible endpoint. Requires LLM_API_KEY. Use LLM_BASE_URL for custom endpoints (vLLM, LM Studio, etc.).
  • ollama: Local Ollama instance. Set OLLAMA_BASE_URL if not using the default http://localhost:11434.
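
The provider rules above can be summarized in a small sketch that resolves the settings from environment variables. The variable names and the default Ollama URL come from the list above. `LLMConfig` and `resolve_llm_config` are hypothetical names for illustration, and the sketch assumes the .env file has already been loaded into the environment; bibr's real configuration code may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_OLLAMA_URL = "http://localhost:11434"


@dataclass(frozen=True)
class LLMConfig:
    """Hypothetical container for the resolved provider settings."""
    provider: str
    api_key: Optional[str] = None
    base_url: Optional[str] = None


def resolve_llm_config(env: Mapping[str, str] = os.environ) -> LLMConfig:
    """Map LLM_PROVIDER and its companion variables onto an LLMConfig."""
    # "google" is the documented default when LLM_PROVIDER is unset.
    provider = env.get("LLM_PROVIDER", "google").strip().lower()

    if provider == "google":
        key = env.get("GOOGLE_API_KEY")
        if not key:
            raise ValueError("LLM_PROVIDER=google requires GOOGLE_API_KEY")
        return LLMConfig("google", api_key=key)

    if provider == "openai":
        key = env.get("LLM_API_KEY")
        if not key:
            raise ValueError("LLM_PROVIDER=openai requires LLM_API_KEY")
        # LLM_BASE_URL is optional and points at OpenAI-compatible servers.
        return LLMConfig("openai", api_key=key, base_url=env.get("LLM_BASE_URL"))

    if provider == "ollama":
        # No API key needed; fall back to the default local address.
        base_url = env.get("OLLAMA_BASE_URL", DEFAULT_OLLAMA_URL)
        return LLMConfig("ollama", base_url=base_url)

    raise ValueError(f"Unsupported LLM_PROVIDER: {provider!r}")
```

Passing the environment as a mapping keeps the function easy to exercise with a plain dict, for example `resolve_llm_config({"LLM_PROVIDER": "ollama"})`, without touching the real process environment.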

Privacy and security

We recommend choosing LLM providers with strict privacy policies, or using open-source models. Be careful with the API keys in your .env file; when deploying to production, use a secrets manager instead.

How to run
