Skip to main content

LLM-powered category analysis for academic paper abstracts

Project description

cat-ademic Logo

cat-ademic

LLM-powered category analysis for academic paper abstracts via OpenAlex.

PyPI - Version PyPI - Python Version


The Problem

If you study a research field, you know the challenge: hundreds of papers need to be characterized before you can map what a journal publishes, how methods are evolving, or where the gaps are. Manual reading doesn't scale. Keyword search misses nuance.

The Solution

cat-ademic fetches paper abstracts directly from OpenAlex and uses LLMs to classify, extract, explore, and summarize them. It handles:

  • Category Assignment (classify): Classify papers into your predefined categories (multi-label supported)
  • Category Extraction (extract): Automatically discover and extract categories from abstracts when you don't have a predefined scheme
  • Category Exploration (explore): Analyze category stability and saturation through repeated raw extraction
  • Summarization (summarize): Summarize text or PDF documents using single or multiple models

cat-ademic also works with arbitrary text, images, and PDFs — not just OpenAlex papers. No manual downloading. Point it at a journal ISSN (or OpenAlex topic), set a date range, and get back a structured CSV.

Built on cat-stack, which provides the core LLM classification engine across 8 providers.


Table of Contents

Installation

pip install cat-ademic

For PDF support:

pip install "cat-ademic[pdf]"

Quick Start

import catademic as cat
from dotenv import load_dotenv
import os

load_dotenv()
api_key = os.environ["OPENAI_API_KEY"]

# Classify 250 recent papers from Social Science Computer Review
results = cat.classify(
    categories=[
        "Introduces new computational tool or method",
        "Applies LLM/AI to social science data",
        "Evaluates or benchmarks a method",
        "Improves survey or data collection",
        "Theory-driven / conceptual",
        "Other",
    ],
    journal_issn="0894-4393",
    paper_limit=250,
    date_from="2023-01-01",
    description="Academic papers from Social Science Computer Review",
    api_key=api_key,
    filename="out/sscr_classified.csv",
)

print(results.head(10))
# Discover emergent categories without a predefined scheme
results = cat.extract(
    journal_issn="0894-4393",
    paper_limit=250,
    date_from="2023-01-01",
    description="Academic papers from Social Science Computer Review",
    api_key=api_key,
)

print(results["top_categories"])
# Raw category extraction for saturation analysis
raw_categories = cat.explore(
    journal_issn="0894-4393",
    paper_limit=250,
    date_from="2023-01-01",
    description="Academic papers from Social Science Computer Review",
    api_key=api_key,
    filename="out/sscr_categories_raw.csv",
)

print(f"Total raw category strings extracted: {len(raw_categories)}")
# Summarize text
results = cat.summarize(
    input_data=df["abstracts"],
    description="Academic paper abstracts",
    instructions="Summarize the key findings in 2-3 sentences",
    api_key=api_key,
)

Best Practices for Classification

These recommendations are based on empirical testing across multiple datasets and models (7B to frontier-class).

What works

  • Detailed category descriptions: The single biggest lever for accuracy. Instead of short labels like "Methods paper", use verbose descriptions like "Introduces a new computational tool or method, including software packages, algorithms, or pipelines." This consistently improves accuracy across all models.
  • Include an "Other" category: Adding a catch-all category prevents the model from forcing ambiguous papers into ill-fitting categories.
  • Low temperature (creativity=0): For classification tasks, deterministic output is generally preferable.

What doesn't help (or hurts)

  • Chain of Thought (chain_of_thought): Does not reliably improve classification accuracy and adds cost.
  • Chain of Verification (chain_of_verification): Uses ~4x the API calls. Tends to retract correct classifications during the verification step. Not recommended.
  • Step-back prompting (step_back_prompt): Inconsistent results across datasets. Not recommended as a default.
  • Thinking/reasoning (thinking_budget): Negligible accuracy gains (<1 pp) with significantly increased latency and cost.
  • Few-shot examples (example1example6): Degrades accuracy ~1 pp on average. Encourages over-classification (more false positives). Use verbose category definitions instead.

Summary

The most effective approach is: write detailed category descriptions, include an "Other" category, and use a capable model at low temperature.


Configuration

Get Your API Key

Get an API key from your preferred provider:

OpenAlex is unauthenticated and requires no key. Providing a polite_email in your requests is recommended for higher rate limits.

Supported Models

  • OpenAI: GPT-4o, GPT-4o-mini, GPT-5, o-series (o1, o3, o4), etc.
  • Anthropic: Claude Sonnet 4, Claude 3.5 Sonnet, Claude Haiku, etc.
  • Google: Gemini 2.5 Flash, Gemini 2.5 Pro, etc.
  • Huggingface: Qwen, Llama 4, DeepSeek, and thousands of community models
  • xAI: Grok models
  • Mistral: Mistral Large, Mistral Small, etc.
  • Perplexity: Sonar Large, Sonar Small, etc.
  • Ollama: Any locally hosted model

The provider is auto-detected from the model name, or you can set it explicitly with model_source=.

Note: For best results, start with OpenAI or Anthropic.


API Reference

classify()

Classify text, image, or PDF inputs into predefined categories. When you provide a journal/topic source, abstracts are fetched automatically from OpenAlex.

Academic source parameters

Parameter Type Default Description
journal_issn str None Journal ISSN to pull abstracts from via OpenAlex (e.g. "0894-4393")
journal_name str None Journal name — auto-resolved to ISSN via find_journal()
journal_field str None Discipline name to fetch across all matching journals
topic_name str None Filter papers by research topic (content-based)
topic_id str None OpenAlex concept ID for exact topic match
paper_limit int 50 Number of papers to fetch
date_from str None Start date filter ("YYYY-MM-DD")
date_to str None End date filter ("YYYY-MM-DD")
polite_email str None Email for OpenAlex polite pool (higher rate limits)

Academic context parameters

These are injected into the classification prompt to give the model context:

Parameter Type Default Description
journal str None Journal name (e.g. "Nature Human Behaviour")
field str None Academic field (e.g. "computational social science")
research_focus str None Research focus string
paper_metadata dict None Additional key-value context injected into the prompt

Core parameters

Parameter Type Default Description
categories list required List of category names/descriptions for classification
input_data list/Series None Text, image paths, or PDF paths. Omit when using a journal source.
api_key str None API key for the LLM provider (single-model mode)
description str "" Description of the corpus context
user_model str "gpt-4o" Model to use
model_source str "auto" Provider: "auto", "openai", "anthropic", "google", "mistral", "perplexity", "huggingface", "xai", "ollama"
creativity float None Temperature (0.0–1.0). None = provider default. 0.0 is valid and recommended for classification.
multi_label bool True If True, multiple categories can be assigned per item. If False, forces single-label.
filename str None Output CSV filename
save_directory str None Directory to save results

Multi-model ensemble parameters

Parameter Type Default Description
models list None List of (model, provider, api_key) tuples. Overrides user_model/api_key/model_source.
consensus_threshold str/float "unanimous" Agreement threshold: "unanimous" (100%), "majority" (50%), "two-thirds" (67%), or a float 0–1

Batch mode parameters

Parameter Type Default Description
batch_mode bool False Use async batch API (50% cost savings). Text input only.
batch_poll_interval float 30.0 Seconds between batch job status checks
batch_timeout float 86400.0 Max seconds to wait for batch completion (default 24h)

Prompt strategy parameters

Parameter Type Default Description
chain_of_thought bool False Step-by-step reasoning (no measurable accuracy gain)
chain_of_verification bool False CoVe verification step (~4x cost, degrades accuracy ~2 pp)
step_back_prompt bool False Step-back prompting (~2x cost, inconsistent gains)
context_prompt bool False Add expert context prefix
thinking_budget int 0 Provider-specific reasoning budget (Google/OpenAI/Anthropic)
example1example6 str None Few-shot examples (degrades accuracy ~1 pp — use verbose categories instead)

Advanced parameters

Parameter Type Default Description
mode str "image" PDF processing mode: "image", "text", or "both"
pdf_dpi int 150 DPI for PDF page rendering
use_json_schema bool True Use JSON schema for structured output
safety bool False Save progress after each item
max_workers int None Max parallel workers (None = auto)
parallel bool None Force parallel (True) or sequential (False) processing
fail_strategy str "partial" "partial" (keep successful rows) or "strict" (fail entirely)
max_retries int 5 Max retries per API call
batch_retries int 2 Max retries for batch-level failures
retry_delay float 1.0 Delay between retries (seconds)
row_delay float 0.0 Delay between rows (seconds) — useful for rate-limited providers
auto_download bool False Auto-download Ollama models
embeddings bool False Add embedding similarity scores per category
embedding_tiebreaker bool False Use embedding centroids to break ensemble ties
json_formatter bool False Use local fine-tuned model fallback for malformed JSON
categories_per_call int None Split large category lists into chunks of this size
progress_callback callable None Callback for progress updates: callback(current, total, label)

Returns: pandas.DataFrame with one binary column per category (category_1, category_2, …), plus OpenAlex metadata columns (title, doi, publication_date, cited_by_count, authors, etc.) when using a journal source.

Example:

import catademic as cat

results = cat.classify(
    categories=[
        "Introduces new computational tool or method",
        "Applies LLM/AI to social science data",
        "Theory-driven / conceptual",
        "Other",
    ],
    journal_issn="0894-4393",
    paper_limit=100,
    date_from="2023-01-01",
    description="Social Science Computer Review papers",
    api_key=api_key,
    filename="sscr_classified.csv",
)

# Add readable column names
results["new_tool"] = results["category_1"]
results["applies_ai"] = results["category_2"]
results.to_csv("sscr_classified.csv", index=False)

extract()

Automatically discover and extract categories from paper abstracts when you don't have a predefined scheme. Returns a clean, deduplicated, semantically merged set of categories.

Academic source parameters

Same as classify(): journal_issn, journal_name, journal_field, topic_name, topic_id, paper_limit, date_from, date_to, polite_email.

Academic context parameters

Same as classify(): journal, field, research_focus, paper_metadata.

Core parameters

Parameter Type Default Description
input_data list/Series None Text data. Omit when using a journal source.
api_key str required API key for the LLM provider
description str "" Description of the corpus
input_type str "text" Input type: "text", "image", or "pdf"
max_categories int 12 Maximum number of final categories to return
categories_per_chunk int 10 Categories to extract per chunk
divisions int 12 Number of chunks to divide data into
iterations int 8 Number of extraction passes over the data
user_model str "gpt-4o" Model to use
model_source str "auto" Provider
creativity float None Temperature
specificity str "broad" "broad" or "specific" category granularity
research_question str None Research context to guide extraction
focus str None Focus instruction (e.g. "methodological contributions")
mode str "text" PDF mode: "text", "image", or "both"
filename str None Output CSV filename
random_state int None Random seed for reproducibility
chunk_delay float 0.0 Delay between API calls (seconds)
progress_callback callable None Callback for progress updates

Returns: dict with keys:

  • counts_df: DataFrame of categories with counts
  • top_categories: List of top category names
  • raw_top_text: Raw model output from final merge step

Example:

import catademic as cat

results = cat.extract(
    journal_issn="0894-4393",
    paper_limit=250,
    date_from="2023-01-01",
    description="Social Science Computer Review papers",
    api_key=api_key,
    max_categories=10,
    focus="methodological contributions",
)

print(results["top_categories"])
# ['Computational text analysis', 'Survey methodology', 'Network analysis', ...]

explore()

Raw category extraction for frequency and saturation analysis. Unlike extract(), which normalizes and merges categories into a clean final set, explore() returns every category string from every chunk across every iteration — with duplicates intact.

This is useful for analyzing which categories are robust (consistently discovered across runs) versus noise (appearing only once or twice). Increasing iterations lets you build saturation curves showing when category discovery converges.

Academic source parameters

Same as classify(): journal_issn, journal_name, journal_field, topic_name, topic_id, paper_limit, date_from, date_to, polite_email.

Core parameters

Parameter Type Default Description
input_data list/Series None Text data. Omit when using a journal source.
api_key str required API key for the LLM provider
description str "" Description of the corpus
max_categories int 12 Max categories per chunk
categories_per_chunk int 10 Categories to extract per chunk
divisions int 12 Number of chunks to divide data into
iterations int 8 Number of passes over the data
user_model str "gpt-4o" Model to use
model_source str "auto" Provider
creativity float None Temperature
specificity str "broad" "broad" or "specific" category granularity
research_question str None Research context
focus str None Focus instruction for extraction
random_state int None Random seed for reproducibility
filename str None Output CSV filename (one category per row)
chunk_delay float 0.0 Delay between API calls (seconds)
progress_callback callable None Callback for progress updates

Returns: list[str] — every category extracted from every chunk across every iteration. Length ≈ iterations × divisions × categories_per_chunk.

Example:

import catademic as cat
from collections import Counter

raw_categories = cat.explore(
    journal_issn="0894-4393",
    paper_limit=250,
    date_from="2023-01-01",
    description="Social Science Computer Review papers",
    api_key=api_key,
    iterations=20,
    filename="sscr_categories_raw.csv",
)

counts = Counter(raw_categories)
for category, freq in counts.most_common(15):
    print(f"{freq:3d}x  {category}")

summarize()

Summarize text or PDF documents using LLMs. Supports single-model and multi-model (ensemble) summarization. In multi-model mode, summaries from all models are synthesized into a consensus summary. Input type is auto-detected.

Parameter Type Default Description
input_data list/Series/str required Text list, pandas Series, single string, PDF directory, or list of PDF paths
api_key str None API key (single-model mode)
description str "" Description of the content (provides context)
instructions str "" Specific summarization instructions (e.g. "bullet points")
max_length int None Maximum summary length in words
focus str None What to focus on (e.g. "key findings", "methodology")
user_model str "gpt-4o" Model to use
model_source str "auto" Provider
mode str "image" PDF processing mode: "image", "text", or "both"
pdf_dpi int 150 DPI for PDF rendering
creativity float None Temperature
thinking_budget int 0 Provider-specific reasoning budget
chain_of_thought bool True Enable step-by-step reasoning
context_prompt bool False Add expert context prefix
step_back_prompt bool False Step-back prompting
filename str None Output CSV filename
save_directory str None Directory to save results
progress_callback callable None Callback for progress updates
models list None List of (model, provider, api_key) tuples for multi-model mode
max_workers int None Max parallel workers
safety bool False Save progress after each item
max_retries int 5 Max retries per API call
fail_strategy str "partial" "partial" or "strict"
row_delay float 0.0 Delay between rows (seconds)
batch_mode bool False Use async batch API (text input only)
batch_poll_interval float 30.0 Seconds between batch status checks
batch_timeout float 86400.0 Max seconds to wait for batch completion

Returns: pandas.DataFrame with columns:

  • survey_input: Original text or page label (PDFs)
  • summary: Generated summary (or consensus for multi-model)
  • summary_<model>: Per-model summaries (multi-model only)
  • processing_status: "success", "error", or "skipped"
  • failed_models: Comma-separated list (multi-model only)
  • pdf_path, page_index: PDF metadata (PDF mode only)

Example:

import catademic as cat

# Single model
results = cat.summarize(
    input_data=df["abstracts"],
    description="Academic paper abstracts",
    instructions="Summarize the key findings in 2-3 bullet points",
    max_length=100,
    api_key=api_key,
)

# Multi-model ensemble
results = cat.summarize(
    input_data=df["abstracts"],
    description="Academic paper abstracts",
    models=[
        ("gpt-4o", "openai", openai_key),
        ("claude-sonnet-4-5-20250929", "anthropic", anthropic_key),
    ],
)

# PDF summarization
results = cat.summarize(
    input_data="/path/to/papers/",
    description="Research papers",
    mode="text",
    api_key=api_key,
)

Discovery Functions

These functions help you find journals and topics in OpenAlex before running classification or extraction.

find_journal(name)

Search OpenAlex for journals matching a name string.

>>> cat.find_journal("socius")
                                      display_name       issn  works_count
0  Socius Sociological Research for a Dynamic World  2378-0231         1039
1                                     Jurnal Socius  2089-9661          315

find_journals_by_field(field)

Search for journals whose name or focus matches a field.

>>> cat.find_journals_by_field("demography")
# Returns DataFrame with display_name, issn, works_count, publisher

# Then fetch across all demography journals:
>>> cat.classify(categories=[...], journal_field="demography", api_key=api_key)

find_topic(name)

Search for academic topics/concepts in OpenAlex.

>>> cat.find_topic("machine learning")
        display_name  level  works_count
0  Machine learning      2      1234567
1    Deep learning       3       567890

# Then filter papers by topic:
>>> cat.classify(categories=[...], topic_name="machine learning", api_key=api_key)

fetch_academic_papers()

Fetch papers directly for custom analysis.

papers = cat.fetch_academic_papers(
    journal_issn="0894-4393",
    limit=100,
    date_from="2023-01-01",
    polite_email="you@example.com",
)
# Returns DataFrame with: text, title, doi, publication_year, authors,
# cited_by_count, fwci, keywords, concepts, topic, ...

Multi-Model Ensemble

Use multiple models to classify the same data and aggregate their decisions via consensus voting. This reduces false positives and improves reliability.

results = cat.classify(
    categories=["Empirical", "Theoretical", "Methodological", "Other"],
    journal_issn="0894-4393",
    paper_limit=50,
    models=[
        ("gpt-4o", "openai", openai_key),
        ("claude-sonnet-4-5-20250929", "anthropic", anthropic_key),
        ("gemini-2.5-flash", "google", google_key),
    ],
    consensus_threshold="unanimous",  # or "majority", "two-thirds", or 0.75
)

The result DataFrame includes:

  • Per-model columns (e.g. category_1_gpt_4o, category_1_claude_sonnet_...)
  • Consensus columns (category_1_consensus, category_2_consensus, …)
  • agreement_score: Proportion of models that agreed

Thresholds:

  • "unanimous" (default): All models must agree — highest precision, most conservative
  • "majority": >50% agreement
  • "two-thirds": >67% agreement
  • Any float between 0 and 1

Per-model temperature can be set with 4-tuples:

models=[
    ("gpt-4o", "openai", openai_key, {"creativity": 0.0}),
    ("gpt-4o", "openai", openai_key, {"creativity": 0.3}),
]

Batch Mode

For large classification jobs, batch mode uses the provider's async batch API — typically 50% cheaper with higher rate limits. Supported by OpenAI, Anthropic, Google, Mistral, and xAI.

results = cat.classify(
    categories=categories,
    journal_issn="0894-4393",
    paper_limit=500,
    api_key=api_key,
    batch_mode=True,
    batch_poll_interval=60,   # check every 60 seconds
    batch_timeout=86400,      # wait up to 24 hours
)

Limitations: text input only (no PDF/image), single model only (no ensemble), no progress_callback.


Related Projects

  • cat-stack: The shared classification engine that powers cat-ademic.
  • cat-llm: The full CatLLM ecosystem meta-package for text classification.
  • cat-vader: Social media text classification with platform-aware context injection.

Academic Research

If you use this package for research, please cite:

Soria, C. (2025). cat-ademic (0.1.0). GitHub. https://github.com/chrissoria/cat-ademic

Contributing & Support

License

cat-ademic is distributed under the terms of the GNU GPL-3.0 license.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cat_ademic-0.1.2.tar.gz (335.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

cat_ademic-0.1.2-py3-none-any.whl (337.9 kB view details)

Uploaded Python 3

File details

Details for the file cat_ademic-0.1.2.tar.gz.

File metadata

  • Download URL: cat_ademic-0.1.2.tar.gz
  • Upload date:
  • Size: 335.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.14

File hashes

Hashes for cat_ademic-0.1.2.tar.gz
Algorithm Hash digest
SHA256 d865dff65009bd730adf6b996248d3c8262321b60606fb4e2b41bae59c88eec4
MD5 ce8a7f10fc63687118de667d8bf4a2ff
BLAKE2b-256 a1ee3409979edc962e76faead132512856cc6292cbfea7cc1352345007dbd8da

See more details on using hashes here.

File details

Details for the file cat_ademic-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: cat_ademic-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 337.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.14

File hashes

Hashes for cat_ademic-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 c63acacbd4c20d4a4eb9828d3fa14ba124a9d67572d20176afb6299762c49f5d
MD5 80d7b12853f3b290614a41add04b7f67
BLAKE2b-256 55379217623a2dbbbc8f86a228e7d95382195f1a200987c785cc420ce0223e55

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page