PDF extraction pipeline for scientific papers

acatome-extract

PDF extraction and enrichment pipeline for scientific papers. Converts PDFs into structured, searchable bundles with block-level summaries and embeddings.

Features

  • Marker PDF extraction: structured block extraction with headings, tables, figures
  • Fitz fallback: recursive character chunking when Marker is unavailable
  • LLM enrichment: block and paper summaries via Ollama or litellm
  • Embeddings: sentence-transformer embeddings for semantic search
  • File watcher: acatome-extract watch monitors an inbox folder
  • Bundle format: .acatome companion files for sharing pre-built extractions
  • CLI: acatome-extract command for extract, enrich, and watch workflows
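
The "recursive character chunking" used by the Fitz fallback can be pictured roughly as follows. This is a minimal sketch of the general technique, not the package's implementation: the function name recursive_chunks, the separator order, and the max_chars default are all assumptions made for illustration.

```python
def recursive_chunks(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most max_chars, trying coarse separators first.

    Hypothetical helper: names and defaults are illustrative only.
    """
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    for sep in separators:
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= max_chars:
                # Keep packing parts into the current chunk while it fits.
                current = candidate
            else:
                if current:
                    # Recurse so an oversized part is split on a finer separator.
                    chunks.extend(recursive_chunks(current, max_chars, separators))
                current = part
        if current:
            chunks.extend(recursive_chunks(current, max_chars, separators))
        return chunks
    # No separator present at all: fall back to a hard split by length.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Paragraph breaks are tried before line breaks, sentences, and single spaces, so chunks tend to end at natural boundaries whenever the size limit allows it.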

Installation

uv pip install -e .

On macOS/Linux this includes Marker for structured PDF extraction. On Windows it installs with the lighter pymupdf (fitz) backend by default. To add Marker on Windows (requires C build tools):

uv pip install -e ".[marker]"

With GPU acceleration (embeddings + torch):

uv pip install -e ".[gpu]"

Everything at once:

uv pip install -e ".[full]"

Usage

from acatome_extract.pipeline import extract

bundle = extract("/path/to/paper.pdf")
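
To process a whole folder from Python, one option is a small wrapper around the call shown above. The sketch below is an assumption-laden illustration: extract_all is a hypothetical helper, and the extraction function is passed in as a parameter because the README documents only extract(path), not the structure of the returned bundle.

```python
from pathlib import Path


def extract_all(folder, extract_fn):
    """Run extract_fn on every PDF directly inside folder.

    Hypothetical helper, not part of acatome-extract. Returns two dicts keyed
    by file name: successful results and error descriptions.
    """
    results, failures = {}, {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            results[pdf.name] = extract_fn(str(pdf))
        except Exception as exc:  # keep going if a single paper fails
            failures[pdf.name] = repr(exc)
    return results, failures


# With the package installed (import path taken from the usage example above):
# from acatome_extract.pipeline import extract
# results, failures = extract_all("/path/to/papers", extract)
```

Collecting failures instead of raising lets a long batch finish even when one PDF cannot be parsed.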

CLI

# Extract (RAKE summaries included automatically, no LLM needed)
acatome-extract extract paper.pdf
acatome-extract extract --type datasheet TI_LM317.pdf   # non-article types

# Enrich — embeddings only by default; add --summarize for LLM summaries
acatome-extract enrich /path/to/bundle
acatome-extract enrich --summarize /path/to/bundle       # enable LLM summaries
acatome-extract enrich --summarize --skip-existing dir/   # incremental LLM pass

# Watch — extract + embed + ingest; LLM summaries off by default
acatome-extract watch ~/papers/inbox
acatome-extract watch ~/papers/inbox --summarize          # enable LLM summaries

# Migrate old bundles to new summaries dict format + add RAKE
acatome-extract migrate ~/.acatome/papers
acatome-extract migrate ~/.acatome/papers --dry-run       # preview changes

# Supplements
acatome-extract attach parent-slug supplement.pdf --name s1

Summaries

Extraction always generates RAKE (extractive keyword) summaries — instant, no LLM required. LLM-based summaries are opt-in via --summarize and require an Ollama or litellm-compatible model.

RAKE summaries are used as the default for search and display. To add LLM summaries later:

acatome-extract enrich --summarize --skip-existing ~/.acatome/papers
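
For readers unfamiliar with RAKE (Rapid Automatic Keyword Extraction), the idea behind the default summaries can be sketched in a few lines. This is a simplified illustration of the general algorithm, not the code acatome-extract runs: the stopword list is deliberately tiny, and the name rake_keywords and the top_n parameter are assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; real RAKE implementations use a much larger one.
STOPWORDS = {"a", "an", "and", "are", "by", "for", "in", "is", "of", "on",
             "the", "to", "with"}


def rake_keywords(text, top_n=3):
    """Return the top_n candidate phrases ranked by RAKE score."""
    phrases = []
    # Punctuation ends a phrase; stopwords split phrases inside a fragment.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))

    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # total length of the phrases the word occurs in
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    # Phrase score = sum of degree/frequency over its words.
    scores = {p: sum(degree[w] / freq[w] for w in p) for p in set(phrases)}
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [" ".join(p) for p, _ in ranked[:top_n]]
```

Because the scoring needs only word counts, it runs instantly and without any model, which matches the "no LLM required" behavior described above.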

Sidecar metadata

Place a <stem>.meta.json alongside any PDF to override metadata:

{"type": "datasheet", "title": "LM317 Regulator", "author": "Texas Instruments", "year": 2022}

Supported fields: type, title, author (string or list), year, doi, abstract, journal, s2_id, arxiv_id, verified.
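
Sidecar files can also be generated from a script. The following is a minimal sketch assuming only the naming rule and field list given above; write_sidecar is a hypothetical helper, and rejecting unknown field names is a design choice of this sketch rather than documented package behavior.

```python
import json
from pathlib import Path

# Field names copied from the supported-fields list above.
SUPPORTED_FIELDS = {"type", "title", "author", "year", "doi", "abstract",
                    "journal", "s2_id", "arxiv_id", "verified"}


def write_sidecar(pdf_path, **fields):
    """Write <stem>.meta.json next to pdf_path and return the sidecar path."""
    unknown = set(fields) - SUPPORTED_FIELDS
    if unknown:
        raise ValueError(f"unsupported sidecar fields: {sorted(unknown)}")
    pdf = Path(pdf_path)
    sidecar = pdf.with_name(pdf.stem + ".meta.json")
    sidecar.write_text(json.dumps(fields, indent=2), encoding="utf-8")
    return sidecar
```

For example, write_sidecar("TI_LM317.pdf", type="datasheet", year=2022) creates TI_LM317.meta.json in the same folder as the PDF.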

Use explicit null to clear a field that the upstream Crossref/S2 lookup got wrong — empty strings are ignored for backward compatibility:

{"doi": "10.1002/9781118519301.ch5", "s2_id": null, "verified": true}
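
The difference between null and an empty string can be made concrete with a small merge function. This sketch only mirrors the rule stated above; apply_sidecar is a hypothetical name, and representing a cleared field as None (rather than deleting the key) is an assumption.

```python
def apply_sidecar(looked_up, sidecar):
    """Merge sidecar overrides into metadata returned by an upstream lookup.

    Hypothetical helper illustrating the documented rule, not the package's code.
    """
    merged = dict(looked_up)
    for key, value in sidecar.items():
        if value is None:
            merged[key] = None   # explicit null clears the field
        elif value == "":
            continue             # empty string is ignored; looked-up value stays
        else:
            merged[key] = value  # any other value overrides the lookup
    return merged
```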

When "verified": true is set, the fuzzy-title verification gate is bypassed — useful for PDFs where the real title doesn't appear on page 1 (Elsevier header strips, ACS internal production PDFs, abstract collections).
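
How such a gate might behave can be sketched with the standard library. Everything here is an assumption for illustration: the function name passes_title_gate, the sliding-window comparison, and the 0.8 threshold are not taken from acatome-extract or acatome-meta.

```python
from difflib import SequenceMatcher


def passes_title_gate(title, page_one_text, verified=False, threshold=0.8):
    """Return True if title fuzzily appears in the first page's text.

    Hypothetical sketch; verified=True skips the check entirely.
    """
    if verified:
        return True
    title_words = title.lower().split()
    page_words = page_one_text.lower().split()
    n = len(title_words)
    if n == 0 or len(page_words) < n:
        return False
    target = " ".join(title_words)
    # Compare the title against every window of the same word count on page 1.
    best = max(
        SequenceMatcher(None, target, " ".join(page_words[i:i + n])).ratio()
        for i in range(len(page_words) - n + 1)
    )
    return best >= threshold
```

A PDF whose first page carries only a publisher header would fail this kind of check, which is the situation the verified flag is meant to handle.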

Dependencies

  • acatome-meta — metadata lookup and verification
  • marker-pdf — structured PDF extraction
  • litellm / Ollama — LLM-based enrichment

Testing

uv run python -m pytest tests/ -v

License

GPL-3.0-or-later — see LICENSE.
