Agent/LLM-enabled narrative reviews of academic manuscripts
Article-Q
Article-Q parses PDFs, extracts structured data using LLM agents guided by a questions spreadsheet, and validates results through a multi-agent consensus mechanism.
Installation
Requires Python 3.11+.
pip install -e .
Step-by-step workflow
Step 1: Initialize the project
articleq init
This creates articleq.toml with default settings. Open it and set:
- project.papers_dir: directory containing your PDF manuscripts
- project.questions_file: path to your questions CSV (see Step 2)
- llm.api_keys.openai or llm.api_keys.google: your API key (or set the OPENAI_API_KEY / GEMINI_API_KEY environment variables)
Step 2: Create a questions file
Prepare a CSV (or Excel) file defining the data you want to extract. The id and question columns are required; all other columns are optional:
| Column | Description | Default |
|---|---|---|
| id | Unique identifier for the question | (required) |
| question | The question text | (required) |
| category | Grouping label (e.g. "methods", "outcomes") | general |
| output_type | One of text, category, numeric, boolean, list | text |
| options | Comma-separated valid answers (for category type) | |
| description | Additional guidance for the extraction agent | |
| depends_on | Comma-separated IDs of questions this one depends on | |
Example:
id,question,category,output_type,options,description,depends_on
sample_size,What was the total sample size?,demographics,numeric,,Total number of participants enrolled,
primary_outcome,What was the primary outcome?,outcomes,text,,The main outcome measure,
study_design,What was the study design?,methods,category,"RCT,cohort,case-control,cross-sectional",,
study_design_other,If other please specify,methods,text,,,study_design
blinding,Was the study blinded?,methods,boolean,,Whether any form of blinding was used,
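Before running extraction, it can help to sanity-check the questions file. The following is a minimal sketch, not Article-Q's own loader: the function name check_questions is hypothetical, and it only checks rules stated in the table above (required columns, non-empty id and question, a known output_type). Duplicate-id detection is an added assumption based on ids being described as unique.

```python
import csv
import io

VALID_TYPES = {"text", "category", "numeric", "boolean", "list"}


def check_questions(csv_text: str) -> list[str]:
    """Return a list of human-readable problems found in a questions CSV."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {"id", "question"} - set(reader.fieldnames or [])
    if missing:
        return [f"missing required column(s): {sorted(missing)}"]
    seen = set()
    for row in reader:
        qid = (row.get("id") or "").strip()
        if not qid:
            problems.append("row with empty id")
            continue
        if qid in seen:
            problems.append(f"{qid}: duplicate id")
        seen.add(qid)
        if not (row.get("question") or "").strip():
            problems.append(f"{qid}: empty question")
        # output_type defaults to "text" when the column is absent or blank
        otype = (row.get("output_type") or "text").strip()
        if otype not in VALID_TYPES:
            problems.append(f"{qid}: unknown output_type {otype!r}")
    return problems
```

An empty list means the file passed these basic checks; it does not guarantee that Article-Q itself will accept the file.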
Step 3: Parse PDFs
articleq parse -c articleq.toml
This converts each PDF to structured markdown and saves the output to output/parsed/. Each paper produces:
- A .json file containing the parsed blocks (the source of truth)
- A .md file for human-readable review
- Extracted figures saved as PNGs in output/parsed/figures/
Two parsing backends are available (set parsing.backend in config):
- pymupdf (default): fast, uses pymupdf4llm layout detection
- marker: uses marker-pdf with OCR; better for scanned documents
Step 4: Review and clean parsed content (optional)
Preview the parsed papers in a browser:
articleq visualize --parsed-dir output/parsed/
The JSON files in output/parsed/ are the source of truth. Each file contains a blocks array — the LLM agents read from the content field of each block, so edits there directly affect extraction. Do not edit the .md files or the raw_markdown field in the JSON; both are regenerated by articleq rebuild.
Each block looks like this:
{
  "block_type": "text",
  "content": "The study enrolled 150 partcipants between Jan and Dec 2020.",
  "page_number": 3,
  "section": "Methods"
}
Common edits:
- Fix OCR errors: correct garbled text, broken words, or misrecognized characters (e.g. "partcipants" → "participants")
- Remove noise: delete blocks containing headers, footers, page numbers, or watermarks that the parser picked up
- Fix broken tables: repair malformed markdown tables in "table" blocks
- Remove irrelevant blocks: delete entire blocks (e.g. reference lists, copyright notices) that add noise without useful content
After editing blocks, rebuild the markdown:
articleq rebuild -c articleq.toml
This regenerates the .md files and updates raw_markdown in the JSON caches to match the block content.
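For bulk cleanup, a short script may be quicker than editing each file by hand. This is a minimal sketch that assumes each parsed JSON file has a top-level blocks array with the fields shown earlier. The helper names clean_blocks and clean_file are hypothetical, and the noise rule (drop blocks that contain nothing but a page number) is only an illustrative heuristic.

```python
import json
import re
from pathlib import Path

# Matches content such as "12" or "Page 12" and nothing else.
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+\s*$", re.IGNORECASE)


def clean_blocks(blocks: list[dict], fixes: dict[str, str]) -> list[dict]:
    """Drop page-number-only blocks and apply literal text fixes to the rest."""
    cleaned = []
    for block in blocks:
        content = block.get("content", "")
        if PAGE_NUMBER.match(content):
            continue  # noise: the block is nothing but a page number
        for wrong, right in fixes.items():
            content = content.replace(wrong, right)
        cleaned.append({**block, "content": content})  # copy, do not mutate input
    return cleaned


def clean_file(path: Path, fixes: dict[str, str]) -> None:
    """Rewrite one parsed JSON file in place, touching only the blocks array."""
    data = json.loads(path.read_text(encoding="utf-8"))
    data["blocks"] = clean_blocks(data["blocks"], fixes)
    path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
```

For example, clean_file(Path("output/parsed/study_smith_2020.pdf.json"), {"partcipants": "participants"}) would fix that typo in every block of one paper. Run articleq rebuild afterwards so the markdown matches the edited blocks.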
Step 5: Run LLM extraction
articleq extract -c articleq.toml
This sends each question to the extraction agents for every paper, runs validation, and writes results to output/results.json. The extract command will refuse to run if the markdown is out of sync with the blocks — run articleq rebuild first if you've edited blocks.
Alternatively, run everything (parse + extract) in one shot:
articleq run -c articleq.toml
Step 6: Export and visualize results
Convert results to CSV or Excel:
articleq export output/results.json --format csv
articleq export output/results.json --format excel
Generate an interactive HTML evidence viewer:
articleq visualize -r output/results.json
The viewer shows each paper's content alongside extracted answers, with evidence passages highlighted in the text. Pass -q questions.csv to include question text in the results panel.
How it works
Each question for each paper goes through a three-agent workflow:
Paper + Question
       |
       v
Extraction Agent  -->  Answer A
       |
       v
Validation Agent  -->  Answer B (blind, independent)
       |
       v
 Compare A and B
       |
       +-- AGREE + high confidence  -->  Accept A as final
       |
       +-- DISAGREE  -->  Consensus Agent reviews both  -->  Final answer
- The extraction agent reads the paper and extracts an answer with evidence quotes, page numbers, and a confidence score.
- The validation agent performs a blind, independent re-extraction (it does not see the first answer).
- If the two answers agree and both have confidence above auto_accept_threshold, the answer is accepted directly.
- If they disagree, the consensus agent reviews both answers against the source material and produces a final arbitrated answer.
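The decision step described above can be sketched in a few lines. This is an illustration only, not Article-Q's code: Answer, resolve, and the arbitrate callable are hypothetical names, and agreement is reduced to plain equality. The text does not say what happens when the answers agree but confidence is low, so this sketch assumes that case also goes to the consensus agent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    value: str
    confidence: float  # 0.0 to 1.0


def resolve(
    extracted: Answer,
    validated: Answer,
    arbitrate: Callable[[Answer, Answer], Answer],
    auto_accept_threshold: float = 0.9,
) -> Answer:
    """Pick the final answer from two independent extractions."""
    # Plain equality stands in for the type-aware comparison.
    agree = extracted.value == validated.value
    # Both confidences must reach the threshold (treated here as a minimum).
    confident = min(extracted.confidence, validated.confidence) >= auto_accept_threshold
    if agree and confident:
        return extracted  # accept answer A directly
    # Assumption: anything else, including low-confidence agreement, is arbitrated.
    return arbitrate(extracted, validated)
```

Passing the arbitration step as a callable keeps the sketch independent of any particular LLM client.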
Question dependencies
Some questions depend on the answers to earlier questions. For example, a follow-up like "If other, please specify" only makes sense after the study type has been determined. Use the depends_on column to declare these relationships:
id,question,depends_on
study_type,What was the study design?,
study_type_other,"If other, please specify",study_type
When dependencies are present, questions are processed in waves — all questions with no unmet dependencies run concurrently, then questions whose dependencies are satisfied by the previous wave, and so on. Within each wave, concurrency is controlled by pipeline.concurrency as usual. Dependent questions receive a "Prior Answers" section in their prompt containing the question text and answer of each dependency.
The depends_on column is optional. CSVs without it continue to work as before (all questions run concurrently in a single wave). Circular dependencies and references to nonexistent question IDs are caught at load time.
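Wave scheduling of this kind amounts to a layered topological sort. The sketch below shows one way to compute the waves and to catch the two load-time errors mentioned above. The function name plan_waves and the input shape (a dict mapping each question ID to the list of IDs it depends on) are assumptions made for illustration.

```python
def plan_waves(depends_on: dict[str, list[str]]) -> list[list[str]]:
    """Group question IDs into waves; each wave depends only on earlier waves."""
    # Reject references to question IDs that do not exist.
    for qid, deps in depends_on.items():
        unknown = [d for d in deps if d not in depends_on]
        if unknown:
            raise ValueError(f"{qid} depends on unknown question(s): {unknown}")

    done: set[str] = set()
    waves: list[list[str]] = []
    remaining = dict(depends_on)
    while remaining:
        # A question is ready once every dependency finished in an earlier wave.
        ready = [q for q, deps in remaining.items() if all(d in done for d in deps)]
        if not ready:
            # Nothing can make progress, so the leftovers form a cycle.
            raise ValueError(f"circular dependency among: {sorted(remaining)}")
        waves.append(ready)
        done.update(ready)
        for q in ready:
            del remaining[q]
    return waves
```

With the example above, plan_waves({"study_type": [], "study_type_other": ["study_type"]}) yields two waves: study_type first, then study_type_other.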
Agreement checking is type-aware:
- Categorical/boolean: exact match
- Numeric: within 5% tolerance
- Text: normalized string comparison
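A rough sketch of these three rules follows. The details are assumptions rather than Article-Q's actual behavior: the 5% tolerance is taken relative to the larger of the two magnitudes, text normalization is lowercasing plus whitespace collapsing, and the list type (not covered above) falls through to the text rule. The function name answers_agree is hypothetical.

```python
def answers_agree(a: str, b: str, output_type: str) -> bool:
    """Type-aware comparison of two extracted answers."""
    if output_type in ("category", "boolean"):
        return a == b  # exact match
    if output_type == "numeric":
        x, y = float(a), float(b)
        if x == y:
            return True  # also covers 0 vs 0
        # Assumption: tolerance is relative to the larger magnitude.
        return abs(x - y) <= 0.05 * max(abs(x), abs(y))
    # Text (and, by assumption, anything else): normalized string comparison.
    return " ".join(a.lower().split()) == " ".join(b.lower().split())
```

Under these assumptions, 150 and 155 agree as numeric answers, while 100 and 110 do not.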
Configuration reference
[project]
name = "my-review" # Project name used in output
papers_dir = "./papers" # Directory containing PDF files
questions_file = "./questions.csv" # Path to questions CSV/Excel
output_dir = "./output" # Where results are written
# context_file = "./context.md" # Optional: additional instructions for the LLM
[parsing]
backend = "pymupdf" # "pymupdf" or "marker"
reparse = false # Force re-parsing even if cached results exist
[llm]
extraction_model = "openai:gpt-4o" # Model for primary extraction
validation_model = "openai:gpt-4o" # Model for validation pass
consensus_model = "openai:gpt-4o" # Model for arbitration
# temperature = 0.0 # LLM sampling temperature (omit to use provider default)
[llm.api_keys]
openai = "${OPENAI_API_KEY}" # Supports environment variable expansion
google = "${GEMINI_API_KEY}"
[pipeline]
concurrency = 5 # Max concurrent agent calls
skip_validation = false # Set true to skip the validation/consensus step
checkpoint = true # Save per-paper checkpoints for resume
chunk_max_tokens = 8000 # Max tokens per chunk for large PDFs
[validation]
auto_accept_threshold = 0.9 # Min confidence to auto-accept agreement
always_validate_categories = ["primary_outcome"] # Always run full 3-agent flow for these
Additional topics
Caching and re-parsing
During parse, each parsed PDF is saved as both a markdown file and a JSON cache file under {output_dir}/parsed/. On subsequent runs, cached JSON files are loaded automatically, skipping PDF re-parsing.
To force re-parsing (e.g. after replacing a PDF or upgrading the parser), use the --reparse flag:
articleq parse -c articleq.toml --reparse
To re-parse a single paper, delete its cached .json file and run parse again.
Output directory structure:
output/
├── parsed/
│   ├── study_smith_2020.pdf.md
│   ├── study_smith_2020.pdf.json   # cached ParsedPaper (used on re-runs)
│   ├── study_jones_2021.pdf.md
│   ├── study_jones_2021.pdf.json
│   └── figures/
│       ├── study_smith_2020_img1.png
│       ├── study_smith_2020_img2.png
│       └── study_jones_2021_img1.png
└── results.json
Context file
You can provide a markdown file with additional instructions and domain knowledge to guide the LLM agents. Set context_file in the [project] section of your config:
[project]
context_file = "./context.md"
The contents are passed as additional system instructions to all three agents (extraction, validation, consensus). Use this for:
- Domain-specific definitions and terminology
- Important distinctions the LLM should be aware of
- Guidance on how to handle ambiguous cases
- Any background knowledge relevant to the review
Example context.md:
# Extraction Context
This review focuses on dentin hypersensitivity (DH) clinical trials.
## Key Definitions
The Holland 1997 definition of DH: "short, sharp pain arising from exposed
dentine in response to stimuli typically thermal, evaporative, tactile, osmotic
or chemical and which cannot be ascribed to any other form of dental defect or
pathology."
## Important Distinctions
- Distinguish between stimuli used for DIAGNOSIS versus OUTCOME MEASURES
- "dh_threshold_teeth" refers to minimum teeth per PATIENT, not total in study
Large PDFs
Papers exceeding chunk_max_tokens are handled with a two-pass approach:
- The paper is split into chunks by content blocks.
- Chunks are scored for relevance to the current question using keyword overlap.
- Only the most relevant chunks (within the token budget) are sent to the agent.
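The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: tokens are estimated by counting whitespace-separated words, relevance is the number of distinct question words that appear in a chunk, and selection is greedy. The names tokenize and select_chunks are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_chunks(chunks: list[str], question: str, max_tokens: int) -> list[str]:
    """Keep the most question-relevant chunks that fit in the token budget."""
    keywords = tokenize(question)
    scored = []
    for index, chunk in enumerate(chunks):
        overlap = len(keywords & tokenize(chunk))
        cost = len(chunk.split())  # crude token estimate: word count
        scored.append((overlap, index, cost))

    chosen = []
    budget = max_tokens
    # Highest overlap first; ties are broken by original position.
    for overlap, index, cost in sorted(scored, key=lambda s: (-s[0], s[1])):
        if cost <= budget:
            chosen.append(index)
            budget -= cost
    return [chunks[i] for i in sorted(chosen)]  # restore document order
```

Returning the selected chunks in document order keeps the text readable for the agent even when a low-relevance chunk between them has been dropped.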
Multimodal support
Figures extracted from PDFs are sent to the LLM as binary images alongside the text content. This happens automatically — if a parsed paper contains image data, the images are included in the prompt sent to the extraction, validation, and consensus agents.
- Both the pymupdf and marker backends extract images and store them as base64 in the parsed data.
- Text placeholders like [Image from page N] remain in the text for positional context, and the actual image binaries are appended after the text.
- No configuration is needed. If image data exists in the parsed paper, it is included. Models that do not support vision will receive only the text portion.
Evaluation
You can evaluate extraction results against manually-created ground truth using the eval command. This is useful for benchmarking accuracy across models, prompts, or configurations.
Benchmark layout:
benchmarks/
└── example/
    ├── papers/          # PDF manuscripts
    ├── questions.csv    # Questions used for extraction
    └── expected.csv     # Ground truth answers
The expected.csv uses the same column format as articleq export output. At minimum it needs paper, question_id, and final_value columns:
paper,question_id,final_value
study_smith_2020.pdf,sample_size,150
study_smith_2020.pdf,study_design,RCT
study_smith_2020.pdf,primary_outcome,overall survival
Running an evaluation:
articleq run -c benchmarks/example/config.toml
articleq export output/results.json --format csv
articleq eval output/results.csv benchmarks/example/expected.csv -q benchmarks/example/questions.csv
The -q flag is optional but recommended — it enables type-aware comparison (numeric tolerance, boolean normalization, etc.) by reading each question's output_type from the questions file.
The report shows:
- Overall accuracy — percentage of answers matching ground truth
- Per-question breakdown — accuracy for each question across all papers
- Detailed mismatches — expected vs actual value for every disagreement
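To make the three report sections concrete, here is a minimal sketch of how they could be computed from two CSVs with the paper, question_id, and final_value columns. It is not the eval command itself: it uses exact string matching only (no type-aware comparison), and the names load_answers and score are hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_answers(csv_text: str) -> dict[tuple[str, str], str]:
    """Map (paper, question_id) to final_value."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {(r["paper"], r["question_id"]): r["final_value"].strip() for r in rows}


def score(results_csv: str, expected_csv: str) -> dict:
    """Score results against ground truth using exact string matching only."""
    actual = load_answers(results_csv)
    expected = load_answers(expected_csv)
    tallies = defaultdict(lambda: [0, 0])  # question_id -> [matched, total]
    mismatches = []
    for (paper, qid), want in expected.items():
        got = actual.get((paper, qid), "")  # a missing answer counts as a mismatch
        tallies[qid][1] += 1
        if got == want:
            tallies[qid][0] += 1
        else:
            mismatches.append((paper, qid, want, got))
    total = len(expected)
    matched = total - len(mismatches)
    return {
        "accuracy": matched / total if total else 0.0,
        "per_question": {q: m / t for q, (m, t) in tallies.items()},
        "mismatches": mismatches,
    }
```

Each entry in mismatches is a (paper, question_id, expected, actual) tuple, mirroring the expected-versus-actual listing in the report.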
LLM-as-judge evaluation:
Strict string comparison can produce false negatives for free-text answers where the meaning matches but the wording differs (e.g. "RCT, parallel group" vs "Randomised controlled trial - Parallel group trial"). The --judge-model option enables an LLM judge that re-evaluates deterministic mismatches for semantic equivalence:
articleq eval output/results.csv benchmarks/example/expected.csv \
-q benchmarks/example/questions.csv \
--judge-model openai:gpt-4o-mini \
-c benchmarks/example/config.toml
When enabled:
- Answers that match deterministically are accepted as before (no LLM call).
- Mismatches on text, category, and list type questions are sent to the judge model, which decides whether the answers are semantically equivalent.
- boolean and numeric types keep their existing deterministic checks only.
- The -q questions file is required when using --judge-model (the question text provides context to the judge).
- The -c config file is optional; it is used to resolve API keys. Without it, keys are read from environment variables.
The report distinguishes strict matches from judge matches and includes the judge's reasoning for any answers it accepted:
Strict matches: 12
Judge matches: 3
Mismatches: 5
Matched: 15
Accuracy: 75.0%
CLI reference
articleq init [-o PATH] Generate a starter config file
articleq run -c CONFIG [--reparse] Run the full pipeline (parse + extract)
articleq parse -c CONFIG [--reparse] Parse PDFs and save as markdown (no LLM calls)
articleq rebuild -c CONFIG Rebuild markdown and JSON from edited blocks
articleq extract -c CONFIG Run LLM extraction on pre-parsed papers
articleq export RESULTS [--format csv|excel] [-o PATH] Export to CSV/Excel
articleq eval RESULTS EXPECTED [-q QUESTIONS] [--judge-model MODEL] [-c CONFIG] Evaluate against ground truth
articleq visualize -r RESULTS [-o PATH] [-q QUESTIONS] Generate HTML evidence viewer
articleq visualize --parsed-dir DIR [-o PATH] Preview parsed papers (no results)