Article-Q

Agent/LLM-enabled narrative reviews of academic manuscripts. Parses PDFs, extracts structured data using LLM agents guided by a questions spreadsheet, and validates results through a multi-agent consensus mechanism.

Installation

Requires Python 3.11+. To install from a source checkout in editable mode:

pip install -e .

Step-by-step workflow

Step 1: Initialize the project

articleq init

This creates articleq.toml with default settings. Open it and set:

  • project.papers_dir — directory containing your PDF manuscripts
  • project.questions_file — path to your questions CSV (see Step 2)
  • llm.api_keys.openai or llm.api_keys.google — your API key (or set the OPENAI_API_KEY / GEMINI_API_KEY environment variables)

Step 2: Create a questions file

Prepare a CSV (or Excel) file defining the data you want to extract. The id and question columns are required; all other columns are optional:

| Column | Description | Default |
| --- | --- | --- |
| id | Unique identifier for the question | (required) |
| question | The question text | (required) |
| category | Grouping label (e.g. "methods", "outcomes") | general |
| output_type | One of text, category, numeric, boolean, list | text |
| options | Comma-separated valid answers (for category type) | |
| description | Additional guidance for the extraction agent | |
| depends_on | Comma-separated IDs of questions this one depends on | |

Example:

id,question,category,output_type,options,description,depends_on
sample_size,What was the total sample size?,demographics,numeric,,Total number of participants enrolled,
primary_outcome,What was the primary outcome?,outcomes,text,,The main outcome measure,
study_design,What was the study design?,methods,category,"RCT,cohort,case-control,cross-sectional",,
study_design_other,If other please specify,methods,text,,,study_design
blinding,Was the study blinded?,methods,boolean,,Whether any form of blinding was used,
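
Before running the pipeline, it can help to sanity-check a questions file yourself. The following is a minimal sketch, not Article-Q's own loader: it reads the CSV with the standard library, applies the documented defaults, and rejects unknown output_type values. The function name load_questions and the final "notes" row in the sample are hypothetical, added only to show the defaults being filled in.

```python
import csv
import io

VALID_OUTPUT_TYPES = {"text", "category", "numeric", "boolean", "list"}
DEFAULTS = {"category": "general", "output_type": "text"}

# Two rows from the example above, plus a hypothetical row that omits optional columns
SAMPLE = """id,question,category,output_type,options,description,depends_on
sample_size,What was the total sample size?,demographics,numeric,,Total number of participants enrolled,
study_design,What was the study design?,methods,category,"RCT,cohort,case-control,cross-sectional",,
notes,Any other notes?
"""


def load_questions(handle):
    """Read question rows, apply the documented defaults, and run basic checks."""
    questions = []
    seen_ids = set()
    for row in csv.DictReader(handle):
        # Missing cells arrive as None; treat them like blank cells
        q = {k: (v or "").strip() for k, v in row.items() if k is not None}
        if not q.get("id") or not q.get("question"):
            raise ValueError(f"row is missing id or question: {row}")
        if q["id"] in seen_ids:
            raise ValueError(f"duplicate id: {q['id']}")
        seen_ids.add(q["id"])
        for column, default in DEFAULTS.items():
            if not q.get(column):
                q[column] = default
        if q["output_type"] not in VALID_OUTPUT_TYPES:
            raise ValueError(f"{q['id']}: unknown output_type {q['output_type']!r}")
        # Split the comma-separated options cell into a list
        q["options"] = [o.strip() for o in q.get("options", "").split(",") if o.strip()]
        questions.append(q)
    return questions


questions = load_questions(io.StringIO(SAMPLE))
```

To check a real file, pass an open file handle instead of the in-memory sample, e.g. `load_questions(open("questions.csv", newline="", encoding="utf-8"))`.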

Step 3: Parse PDFs

articleq parse -c articleq.toml

This converts each PDF to structured markdown and saves the output to output/parsed/. Each paper produces:

  • A .json file containing the parsed blocks (the source of truth)
  • A .md file for human-readable review
  • Extracted figures saved as PNGs in output/parsed/figures/

Two parsing backends are available (set parsing.backend in config):

  • pymupdf (default) — fast, uses pymupdf4llm layout detection
  • marker — uses marker-pdf with OCR; better for scanned documents

Step 4: Review and clean parsed content (optional)

Preview the parsed papers in a browser:

articleq visualize --parsed-dir output/parsed/

The JSON files in output/parsed/ are the source of truth. Each file contains a blocks array — the LLM agents read from the content field of each block, so edits there directly affect extraction. Do not edit the .md files or the raw_markdown field in the JSON; both are regenerated by articleq rebuild.

Each block looks like this:

{
  "block_type": "text",
  "content": "The study enrolled 150 partcipants between Jan and Dec 2020.",
  "page_number": 3,
  "section": "Methods"
}

Common edits:

  • Fix OCR errors — correct garbled text, broken words, or misrecognized characters (e.g. "partcipants" → "participants")
  • Remove noise — delete blocks containing headers, footers, page numbers, or watermarks that the parser picked up
  • Fix broken tables — repair malformed markdown tables in "table" blocks
  • Remove irrelevant blocks — delete entire blocks (e.g. reference lists, copyright notices) that add noise without useful content
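
For repetitive edits across many papers, the edits listed above can be scripted rather than made by hand. The sketch below is one possible approach, assuming each JSON file has a top-level blocks array as described; the typo table, the set of section names to drop, and the function names are hypothetical and should be adapted to your own papers.

```python
import json
from pathlib import Path

# Hypothetical cleaning rules, for illustration only
TYPO_FIXES = {"partcipants": "participants"}
NOISE_SECTIONS = {"References"}


def clean_blocks(blocks):
    """Return a cleaned copy of the blocks list without mutating the input."""
    cleaned = []
    for block in blocks:
        # Drop whole blocks that belong to sections treated as noise
        if block.get("section") in NOISE_SECTIONS:
            continue
        content = block.get("content", "")
        for wrong, right in TYPO_FIXES.items():
            content = content.replace(wrong, right)
        # Drop blocks that are empty after cleaning
        if not content.strip():
            continue
        cleaned.append({**block, "content": content})
    return cleaned


def clean_parsed_file(path):
    """Apply clean_blocks to one parsed JSON file in place."""
    path = Path(path)
    data = json.loads(path.read_text(encoding="utf-8"))
    data["blocks"] = clean_blocks(data["blocks"])
    path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
```

Only the content field and the list of blocks are touched here; raw_markdown and the .md files are left alone because articleq rebuild regenerates them.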

After editing blocks, rebuild the markdown:

articleq rebuild -c articleq.toml

This regenerates the .md files and updates raw_markdown in the JSON caches to match the block content.

Step 5: Run LLM extraction

articleq extract -c articleq.toml

This sends each question to the extraction agents for every paper, runs validation, and writes results to output/results.json. The extract command will refuse to run if the markdown is out of sync with the blocks — run articleq rebuild first if you've edited blocks.

Alternatively, run everything (parse + extract) in one shot:

articleq run -c articleq.toml

Step 6: Export and visualize results

Convert results to CSV or Excel:

articleq export output/results.json --format csv
articleq export output/results.json --format excel

Generate an interactive HTML evidence viewer:

articleq visualize -r output/results.json

The viewer shows each paper's content alongside extracted answers, with evidence passages highlighted in the text. Pass -q questions.csv to include question text in the results panel.

How it works

Each question for each paper goes through a three-agent workflow:

Paper + Question
      |
      v
  Extraction Agent  -->  Answer A
      |
      v
  Validation Agent  -->  Answer B  (blind, independent)
      |
      v
  Compare A and B
      |
      +-- AGREE + high confidence --> Accept A as final
      |
      +-- DISAGREE --> Consensus Agent reviews both --> Final answer

  • The extraction agent reads the paper and extracts an answer with evidence quotes, page numbers, and a confidence score.
  • The validation agent performs a blind, independent re-extraction (it does not see the first answer).
  • If the two answers agree and both have confidence above auto_accept_threshold, the answer is accepted directly.
  • If they disagree, the consensus agent reviews both answers against the source material and produces a final arbitrated answer.
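
The accept-or-arbitrate decision described above can be sketched as a small function. This is an illustration under stated assumptions, not the package's internal code: the Answer class and the decide function are hypothetical names, the threshold is treated as a minimum (>=) following the config comment, and the case the text does not cover (agreement with low confidence) is assumed to go to the consensus agent.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    value: str
    confidence: float  # assumed to be on a 0.0 to 1.0 scale


def decide(extraction, validation, answers_agree, auto_accept_threshold=0.9):
    """Return "accept" to keep the extraction answer, or "consensus" to arbitrate."""
    both_confident = (
        extraction.confidence >= auto_accept_threshold
        and validation.confidence >= auto_accept_threshold
    )
    if answers_agree and both_confident:
        return "accept"
    # Disagreement goes to the consensus agent, as described in the text.
    # Assumption: agreement with low confidence is escalated as well.
    return "consensus"
```

For example, `decide(Answer("RCT", 0.95), Answer("RCT", 0.92), answers_agree=True)` returns "accept", while the same pair with `answers_agree=False` returns "consensus".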

Question dependencies

Some questions depend on the answers to earlier questions. For example, a follow-up like "If other, please specify" only makes sense after the study type has been determined. Use the depends_on column to declare these relationships:

id,question,depends_on
study_type,What was the study design?,
study_type_other,"If other, please specify",study_type

When dependencies are present, questions are processed in waves — all questions with no unmet dependencies run concurrently, then questions whose dependencies are satisfied by the previous wave, and so on. Within each wave, concurrency is controlled by pipeline.concurrency as usual. Dependent questions receive a "Prior Answers" section in their prompt containing the question text and answer of each dependency.

The depends_on column is optional. CSVs without it continue to work as before (all questions run concurrently in a single wave). Circular dependencies and references to nonexistent question IDs are caught at load time.
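
The wave ordering described above amounts to a layered topological sort. The following sketch shows one way to compute the waves and to catch unknown IDs and cycles up front; plan_waves is a hypothetical helper, and the real scheduler may differ in its details.

```python
def plan_waves(depends_on):
    """Group question IDs into waves.

    depends_on maps each question ID to the list of IDs it depends on.
    Every question in a wave has all of its dependencies in earlier waves.
    """
    # Reject references to question IDs that do not exist
    for qid, deps in depends_on.items():
        for dep in deps:
            if dep not in depends_on:
                raise ValueError(f"{qid} depends on unknown question {dep}")

    waves = []
    done = set()
    remaining = dict(depends_on)
    while remaining:
        # A question is ready once all of its dependencies have been answered
        ready = sorted(q for q, deps in remaining.items() if set(deps) <= done)
        if not ready:
            # Nothing can make progress, so the leftovers form a cycle
            raise ValueError("circular dependency among: " + ", ".join(sorted(remaining)))
        waves.append(ready)
        done.update(ready)
        for q in ready:
            del remaining[q]
    return waves


waves = plan_waves({
    "study_type": [],
    "sample_size": [],
    "study_type_other": ["study_type"],
})
# waves == [["sample_size", "study_type"], ["study_type_other"]]
```

Questions inside one wave are independent of each other, which is why they can run concurrently up to the pipeline.concurrency limit.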

Agreement checking between the extraction and validation answers (the Compare step under How it works) is type-aware:

  • Categorical/boolean: exact match
  • Numeric: within 5% tolerance
  • Text: normalized string comparison
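
A rough sketch of such a type-aware comparison is shown below. Several details are assumptions rather than documented behavior: the 5% tolerance is taken relative to the larger of the two values (the semantics of math.isclose), text normalization is taken to mean lowercasing and stripping punctuation and extra whitespace, and the list type, which the rules above do not mention, falls through to the text comparison.

```python
import math
import re


def normalize_text(value):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", value.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def answers_agree(a, b, output_type, tolerance=0.05):
    """Compare two answer strings according to the question's output_type."""
    if output_type in ("category", "boolean"):
        # Exact match, ignoring only surrounding whitespace
        return a.strip() == b.strip()
    if output_type == "numeric":
        # Relative tolerance, measured against the larger magnitude
        return math.isclose(float(a), float(b), rel_tol=tolerance)
    # text (and, by assumption, list): normalized string comparison
    return normalize_text(a) == normalize_text(b)
```

With these assumptions, `answers_agree("150", "155", "numeric")` is true (a difference of 5 against a tolerance of 7.75), while `answers_agree("150", "160", "numeric")` is false.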

Configuration reference

[project]
name = "my-review"              # Project name used in output
papers_dir = "./papers"          # Directory containing PDF files
questions_file = "./questions.csv"  # Path to questions CSV/Excel
output_dir = "./output"          # Where results are written
# context_file = "./context.md" # Optional: additional instructions for the LLM

[parsing]
backend = "pymupdf"             # "pymupdf" or "marker"
reparse = false                 # Force re-parsing even if cached results exist

[llm]
extraction_model = "openai:gpt-4o"   # Model for primary extraction
validation_model = "openai:gpt-4o"   # Model for validation pass
consensus_model = "openai:gpt-4o"    # Model for arbitration
# temperature = 0.0                  # LLM sampling temperature (omit to use provider default)

[llm.api_keys]
openai = "${OPENAI_API_KEY}"    # Supports environment variable expansion
google = "${GEMINI_API_KEY}"

[pipeline]
concurrency = 5                 # Max concurrent agent calls
skip_validation = false         # Set true to skip the validation/consensus step
checkpoint = true               # Save per-paper checkpoints for resume
chunk_max_tokens = 8000         # Max tokens per chunk for large PDFs

[validation]
auto_accept_threshold = 0.9     # Min confidence to auto-accept agreement
always_validate_categories = ["primary_outcome"]  # Always run full 3-agent flow for these

Additional topics

Caching and re-parsing

During parse, each parsed PDF is saved as both a markdown file and a JSON cache file under {output_dir}/parsed/. On subsequent runs, cached JSON files are loaded automatically, skipping PDF re-parsing.

To force re-parsing (e.g. after replacing a PDF or upgrading the parser), use the --reparse flag:

articleq parse -c articleq.toml --reparse

To re-parse a single paper, delete its cached .json file and run parse again.

Output directory structure:

output/
├── parsed/
│   ├── study_smith_2020.pdf.md
│   ├── study_smith_2020.pdf.json   # cached ParsedPaper (used on re-runs)
│   ├── study_jones_2021.pdf.md
│   ├── study_jones_2021.pdf.json
│   └── figures/
│       ├── study_smith_2020_img1.png
│       ├── study_smith_2020_img2.png
│       └── study_jones_2021_img1.png
└── results.json
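
Given this layout, a short script can report which PDFs already have a cache file and which would be parsed on the next run. This is a sketch that assumes the cache name is simply the PDF file name with .json appended, as in the tree above; cache_path and split_cached are hypothetical helper names.

```python
from pathlib import Path


def cache_path(pdf_path, output_dir):
    """Location of the cached JSON for a PDF, following the tree shown above."""
    return Path(output_dir) / "parsed" / (Path(pdf_path).name + ".json")


def split_cached(pdf_paths, output_dir):
    """Split PDFs into (cached, pending) lists of file names."""
    cached, pending = [], []
    for pdf in pdf_paths:
        target = cached if cache_path(pdf, output_dir).exists() else pending
        target.append(Path(pdf).name)
    return cached, pending
```

For example, `split_cached(sorted(Path("papers").glob("*.pdf")), "output")` lists the papers that a plain articleq parse would skip and those it would parse.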

Context file

You can provide a markdown file with additional instructions and domain knowledge to guide the LLM agents. Set context_file in the [project] section of your config:

[project]
context_file = "./context.md"

The contents are passed as additional system instructions to all three agents (extraction, validation, consensus). Use this for:

  • Domain-specific definitions and terminology
  • Important distinctions the LLM should be aware of
  • Guidance on how to handle ambiguous cases
  • Any background knowledge relevant to the review

Example context.md:

# Extraction Context

This review focuses on dentin hypersensitivity (DH) clinical trials.

## Key Definitions

The Holland 1997 definition of DH: "short, sharp pain arising from exposed
dentine in response to stimuli typically thermal, evaporative, tactile, osmotic
or chemical and which cannot be ascribed to any other form of dental defect or
pathology."

## Important Distinctions

- Distinguish between stimuli used for DIAGNOSIS versus OUTCOME MEASURES
- "dh_threshold_teeth" refers to minimum teeth per PATIENT, not total in study

Large PDFs

Papers exceeding chunk_max_tokens are reduced to their most relevant chunks before being sent to an agent:

  1. The paper is split into chunks by content blocks.
  2. Chunks are scored for relevance to the current question using keyword overlap.
  3. Only the most relevant chunks (within the token budget) are sent to the agent.
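
The scoring and selection steps above can be illustrated with a small sketch. It is not the package's implementation: the tokenizer, the word-count token estimate, and the choice to skip chunks with zero overlap are all simplifying assumptions, and select_chunks is a hypothetical name.

```python
import re


def tokenize(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def estimate_tokens(text):
    """Crude token estimate: one token per whitespace-separated word (assumption)."""
    return len(text.split())


def select_chunks(chunks, question, max_tokens):
    """Pick the chunks with the highest keyword overlap that fit in the budget."""
    question_words = tokenize(question)
    scored = [(len(question_words & tokenize(c)), i) for i, c in enumerate(chunks)]
    # Highest overlap first; ties keep document order
    scored.sort(key=lambda item: (-item[0], item[1]))

    chosen, used = [], 0
    for score, index in scored:
        cost = estimate_tokens(chunks[index])
        if score == 0 or used + cost > max_tokens:
            continue
        chosen.append(index)
        used += cost
    # Return the selected chunks in their original order
    return [chunks[i] for i in sorted(chosen)]


chunks = [
    "The study enrolled 150 participants in total.",
    "Funding was provided by a national grant.",
    "Sample size was calculated for 80% power.",
]
selected = select_chunks(chunks, "What was the total sample size?", max_tokens=14)
# selected contains the first and third chunks; the funding chunk does not fit
```

Restoring document order after selection keeps the text the agent sees readable, at the cost of dropping lower-scoring chunks entirely.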

Multimodal support

Figures extracted from PDFs are sent to the LLM as binary images alongside the text content. This happens automatically — if a parsed paper contains image data, the images are included in the prompt sent to the extraction, validation, and consensus agents.

  • Both the pymupdf and marker backends extract images and store them as base64 in the parsed data.
  • Text placeholders like [Image from page N] remain in the text for positional context, and the actual image binaries are appended after the text.
  • No configuration is needed. If image data exists in the parsed paper, it is included. Models that do not support vision will receive only the text portion.

Evaluation

You can evaluate extraction results against manually-created ground truth using the eval command. This is useful for benchmarking accuracy across models, prompts, or configurations.

Benchmark layout:

benchmarks/
└── example/
    ├── papers/          # PDF manuscripts
    ├── config.toml      # Config used for the benchmark run
    ├── questions.csv    # Questions used for extraction
    └── expected.csv     # Ground truth answers

The expected.csv file uses the same column format as the output of articleq export. At minimum it needs paper, question_id, and final_value columns:

paper,question_id,final_value
study_smith_2020.pdf,sample_size,150
study_smith_2020.pdf,study_design,RCT
study_smith_2020.pdf,primary_outcome,overall survival

Running an evaluation:

articleq run -c benchmarks/example/config.toml
articleq export output/results.json --format csv
articleq eval output/results.csv benchmarks/example/expected.csv -q benchmarks/example/questions.csv

The -q flag is optional but recommended — it enables type-aware comparison (numeric tolerance, boolean normalization, etc.) by reading each question's output_type from the questions file.

The report shows:

  • Overall accuracy — percentage of answers matching ground truth
  • Per-question breakdown — accuracy for each question across all papers
  • Detailed mismatches — expected vs actual value for every disagreement
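
The three parts of the report can be reproduced with a few lines of Python, which is handy for spot-checking a run. This is a simplified sketch, not the eval command itself: score is a hypothetical function, and it compares values as case-insensitive strings only, without the type-aware rules that -q enables.

```python
from collections import defaultdict


def score(results, expected):
    """Compare two mappings of (paper, question_id) -> final_value."""
    per_question = defaultdict(lambda: [0, 0])  # question_id -> [correct, total]
    mismatches = []
    for (paper, qid), want in expected.items():
        got = results.get((paper, qid))
        per_question[qid][1] += 1
        # Simplification: case-insensitive string equality
        if got is not None and got.strip().lower() == want.strip().lower():
            per_question[qid][0] += 1
        else:
            mismatches.append((paper, qid, want, got))
    total = sum(n for _, n in per_question.values())
    correct = sum(c for c, _ in per_question.values())
    return {
        "accuracy": correct / total if total else 0.0,
        "per_question": {q: c / n for q, (c, n) in per_question.items()},
        "mismatches": mismatches,
    }


expected = {
    ("study_smith_2020.pdf", "sample_size"): "150",
    ("study_smith_2020.pdf", "study_design"): "RCT",
    ("study_smith_2020.pdf", "primary_outcome"): "overall survival",
}
# Hypothetical extraction results for the same paper
results = {
    ("study_smith_2020.pdf", "sample_size"): "150",
    ("study_smith_2020.pdf", "study_design"): "rct",
    ("study_smith_2020.pdf", "primary_outcome"): "progression-free survival",
}
report = score(results, expected)
```

Here two of the three answers match, and the single entry in report["mismatches"] records the expected and actual primary outcome.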

LLM-as-judge evaluation:

Strict string comparison can produce false negatives for free-text answers where the meaning matches but the wording differs (e.g. "RCT, parallel group" vs "Randomised controlled trial - Parallel group trial"). The --judge-model option enables an LLM judge that re-evaluates deterministic mismatches for semantic equivalence:

articleq eval output/results.csv benchmarks/example/expected.csv \
  -q benchmarks/example/questions.csv \
  --judge-model openai:gpt-4o-mini \
  -c benchmarks/example/config.toml

When enabled:

  • Answers that match deterministically are accepted as before (no LLM call).
  • Mismatches on text, category, and list type questions are sent to the judge model, which decides whether the answers are semantically equivalent.
  • boolean and numeric types keep their existing deterministic checks only.
  • The -q questions file is required when using --judge-model (the question text provides context to the judge).
  • The -c config file is optional — used to resolve API keys. Without it, keys are read from environment variables.

The report distinguishes strict matches from judge matches and includes the judge's reasoning for any answers it accepted:

  Strict matches:    12
  Judge matches:     3
  Mismatches:        5
  Matched:           15
  Accuracy:          75.0%
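
The figures in this sample report fit together as follows: matched is the sum of strict and judge matches, and accuracy is matched divided by all compared answers. A small sketch of that arithmetic, with summarize as a hypothetical helper name:

```python
def summarize(strict_matches, judge_matches, mismatches):
    """Return (matched, accuracy_percent) from the three report counts."""
    matched = strict_matches + judge_matches
    total = matched + mismatches
    accuracy = round(100 * matched / total, 1) if total else 0.0
    return matched, accuracy


matched, accuracy = summarize(strict_matches=12, judge_matches=3, mismatches=5)
# matched == 15 and accuracy == 75.0, as in the report above
```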

CLI reference

articleq init [-o PATH]              Generate a starter config file
articleq run -c CONFIG [--reparse]    Run the full pipeline (parse + extract)
articleq parse -c CONFIG [--reparse]  Parse PDFs and save as markdown (no LLM calls)
articleq rebuild -c CONFIG            Rebuild markdown and JSON from edited blocks
articleq extract -c CONFIG            Run LLM extraction on pre-parsed papers
articleq export RESULTS [--format csv|excel] [-o PATH]   Export to CSV/Excel
articleq eval RESULTS EXPECTED [-q QUESTIONS] [--judge-model MODEL] [-c CONFIG]   Evaluate against ground truth
articleq visualize -r RESULTS [-o PATH] [-q QUESTIONS]   Generate HTML evidence viewer
articleq visualize --parsed-dir DIR [-o PATH]            Preview parsed papers (no results)
