LLM-powered category analysis for academic paper abstracts

These details have not been verified by PyPI

Project links

Project description

cat-ademic Logo

cat-ademic

LLM-powered category analysis for academic paper abstracts via OpenAlex.

The Problem

If you study a research field, you know the challenge: hundreds of papers need to be characterized before you can map what a journal publishes, how methods are evolving, or where the gaps are. Manual reading doesn't scale. Keyword search misses nuance.

The Solution

cat-ademic fetches paper abstracts directly from OpenAlex and uses LLMs to classify, extract, explore, and summarize them. It handles:

Category Assignment (classify): Classify papers into your predefined categories (multi-label supported)
Category Extraction (extract): Automatically discover and extract categories from abstracts when you don't have a predefined scheme
Category Exploration (explore): Analyze category stability and saturation through repeated raw extraction
Summarization (summarize): Summarize text or PDF documents using single or multiple models

cat-ademic also works with arbitrary text, images, and PDFs — not just OpenAlex papers. No manual downloading. Point it at a journal ISSN (or OpenAlex topic), set a date range, and get back a structured CSV.

Built on cat-stack, which provides the core LLM classification engine across 8 providers.

Installation
Quick Start
Best Practices for Classification
Configuration
Supported Models
API Reference
Multi-Model Ensemble
Batch Mode
Related Projects
Academic Research
Contributing & Support
License

Installation

pip install cat-ademic

For PDF support:

pip install "cat-ademic[pdf]"

Quick Start

import catademic as cat
from dotenv import load_dotenv
import os

load_dotenv()
api_key = os.environ["OPENAI_API_KEY"]

# Classify 250 recent papers from Social Science Computer Review
results = cat.classify(
    categories=[
        "Introduces new computational tool or method",
        "Applies LLM/AI to social science data",
        "Evaluates or benchmarks a method",
        "Improves survey or data collection",
        "Theory-driven / conceptual",
        "Other",
    ],
    journal_issn="0894-4393",
    paper_limit=250,
    date_from="2023-01-01",
    description="Academic papers from Social Science Computer Review",
    api_key=api_key,
    filename="out/sscr_classified.csv",
)

print(results.head(10))

# Discover emergent categories without a predefined scheme
results = cat.extract(
    journal_issn="0894-4393",
    paper_limit=250,
    date_from="2023-01-01",
    description="Academic papers from Social Science Computer Review",
    api_key=api_key,
)

print(results["top_categories"])

# Raw category extraction for saturation analysis
raw_categories = cat.explore(
    journal_issn="0894-4393",
    paper_limit=250,
    date_from="2023-01-01",
    description="Academic papers from Social Science Computer Review",
    api_key=api_key,
    filename="out/sscr_categories_raw.csv",
)

print(f"Total raw category strings extracted: {len(raw_categories)}")

# Summarize text
results = cat.summarize(
    input_data=df["abstracts"],
    description="Academic paper abstracts",
    instructions="Summarize the key findings in 2-3 sentences",
    api_key=api_key,
)

Best Practices for Classification

These recommendations are based on empirical testing across multiple datasets and models (7B to frontier-class).

What works

Detailed category descriptions: The single biggest lever for accuracy. Instead of short labels like "Methods paper", use verbose descriptions like "Introduces a new computational tool or method, including software packages, algorithms, or pipelines." This consistently improves accuracy across all models.
Include an "Other" category: Adding a catch-all category prevents the model from forcing ambiguous papers into ill-fitting categories.
Low temperature (creativity=0): For classification tasks, deterministic output is generally preferable.

What doesn't help (or hurts)

Chain of Thought (chain_of_thought): Does not reliably improve classification accuracy and adds cost.
Chain of Verification (chain_of_verification): Uses ~4x the API calls. Tends to retract correct classifications during the verification step. Not recommended.
Step-back prompting (step_back_prompt): Inconsistent results across datasets. Not recommended as a default.
Thinking/reasoning (thinking_budget): Negligible accuracy gains (<1 pp) with significantly increased latency and cost.
Few-shot examples (example1–example6): Degrades accuracy ~1 pp on average. Encourages over-classification (more false positives). Use verbose category definitions instead.

Summary

The most effective approach is: write detailed category descriptions, include an "Other" category, and use a capable model at low temperature.

Configuration

Get Your API Key

Get an API key from your preferred provider:

OpenAI: platform.openai.com
Anthropic: console.anthropic.com
Google: aistudio.google.com
Huggingface: huggingface.co/settings/tokens
xAI: console.x.ai
Mistral: console.mistral.ai
Perplexity: perplexity.ai/settings/api

OpenAlex is unauthenticated and requires no key. Providing a polite_email in your requests is recommended for higher rate limits.

Supported Models

OpenAI: GPT-4o, GPT-4o-mini, GPT-5, o-series (o1, o3, o4), etc.
Anthropic: Claude Sonnet 4, Claude 3.5 Sonnet, Claude Haiku, etc.
Google: Gemini 2.5 Flash, Gemini 2.5 Pro, etc.
Huggingface: Qwen, Llama 4, DeepSeek, and thousands of community models
xAI: Grok models
Mistral: Mistral Large, Mistral Small, etc.
Perplexity: Sonar Large, Sonar Small, etc.
Ollama: Any locally hosted model

The provider is auto-detected from the model name, or you can set it explicitly with model_source=.

Note: For best results, start with OpenAI or Anthropic.

API Reference

`classify()`

Classify text, image, or PDF inputs into predefined categories. When you provide a journal/topic source, abstracts are fetched automatically from OpenAlex.

Academic source parameters

Parameter	Type	Default	Description
`journal_issn`	str	None	Journal ISSN to pull abstracts from via OpenAlex (e.g. `"0894-4393"`)
`journal_name`	str	None	Journal name — auto-resolved to ISSN via `find_journal()`
`journal_field`	str	None	Discipline name to fetch across all matching journals
`topic_name`	str	None	Filter papers by research topic (content-based)
`topic_id`	str	None	OpenAlex concept ID for exact topic match
`paper_limit`	int	50	Number of papers to fetch
`date_from`	str	None	Start date filter (`"YYYY-MM-DD"`)
`date_to`	str	None	End date filter (`"YYYY-MM-DD"`)
`polite_email`	str	None	Email for OpenAlex polite pool (higher rate limits)

Academic context parameters

These are injected into the classification prompt to give the model context:

Parameter	Type	Default	Description
`journal`	str	None	Journal name (e.g. `"Nature Human Behaviour"`)
`field`	str	None	Academic field (e.g. `"computational social science"`)
`research_focus`	str	None	Research focus string
`paper_metadata`	dict	None	Additional key-value context injected into the prompt

Core parameters

Parameter	Type	Default	Description
`categories`	list	required	List of category names/descriptions for classification
`input_data`	list/Series	None	Text, image paths, or PDF paths. Omit when using a journal source.
`api_key`	str	None	API key for the LLM provider (single-model mode)
`description`	str	`""`	Description of the corpus context
`user_model`	str	`"gpt-4o"`	Model to use
`model_source`	str	`"auto"`	Provider: `"auto"`, `"openai"`, `"anthropic"`, `"google"`, `"mistral"`, `"perplexity"`, `"huggingface"`, `"xai"`, `"ollama"`
`creativity`	float	None	Temperature (0.0–1.0). `None` = provider default. `0.0` is valid and recommended for classification.
`multi_label`	bool	True	If True, multiple categories can be assigned per item. If False, forces single-label.
`filename`	str	None	Output CSV filename
`save_directory`	str	None	Directory to save results

Multi-model ensemble parameters

Parameter	Type	Default	Description
`models`	list	None	List of `(model, provider, api_key)` tuples. Overrides `user_model`/`api_key`/`model_source`.
`consensus_threshold`	str/float	`"unanimous"`	Agreement threshold: `"unanimous"` (100%), `"majority"` (50%), `"two-thirds"` (67%), or a float 0–1

Batch mode parameters

Parameter	Type	Default	Description
`batch_mode`	bool	False	Use async batch API (50% cost savings). Text input only.
`batch_poll_interval`	float	30.0	Seconds between batch job status checks
`batch_timeout`	float	86400.0	Max seconds to wait for batch completion (default 24h)

Prompt strategy parameters

Parameter	Type	Default	Description
`chain_of_thought`	bool	False	Step-by-step reasoning (no measurable accuracy gain)
`chain_of_verification`	bool	False	CoVe verification step (~4x cost, degrades accuracy ~2 pp)
`step_back_prompt`	bool	False	Step-back prompting (~2x cost, inconsistent gains)
`context_prompt`	bool	False	Add expert context prefix
`thinking_budget`	int	0	Provider-specific reasoning budget (Google/OpenAI/Anthropic)
`example1`–`example6`	str	None	Few-shot examples (degrades accuracy ~1 pp — use verbose categories instead)

Advanced parameters

Parameter	Type	Default	Description
`mode`	str	`"image"`	PDF processing mode: `"image"`, `"text"`, or `"both"`
`pdf_dpi`	int	150	DPI for PDF page rendering
`use_json_schema`	bool	True	Use JSON schema for structured output
`safety`	bool	False	Save progress after each item
`max_workers`	int	None	Max parallel workers (None = auto)
`parallel`	bool	None	Force parallel (True) or sequential (False) processing
`fail_strategy`	str	`"partial"`	`"partial"` (keep successful rows) or `"strict"` (fail entirely)
`max_retries`	int	5	Max retries per API call
`batch_retries`	int	2	Max retries for batch-level failures
`retry_delay`	float	1.0	Delay between retries (seconds)
`row_delay`	float	0.0	Delay between rows (seconds) — useful for rate-limited providers
`auto_download`	bool	False	Auto-download Ollama models
`embeddings`	bool	False	Add embedding similarity scores per category
`embedding_tiebreaker`	bool	False	Use embedding centroids to break ensemble ties
`json_formatter`	bool	False	Use local fine-tuned model fallback for malformed JSON
`categories_per_call`	int	None	Split large category lists into chunks of this size
`progress_callback`	callable	None	Callback for progress updates: `callback(current, total, label)`

Returns: pandas.DataFrame with one binary column per category (category_1, category_2, …), plus OpenAlex metadata columns (title, doi, publication_date, cited_by_count, authors, etc.) when using a journal source.

Example:

import catademic as cat

results = cat.classify(
    categories=[
        "Introduces new computational tool or method",
        "Applies LLM/AI to social science data",
        "Theory-driven / conceptual",
        "Other",
    ],
    journal_issn="0894-4393",
    paper_limit=100,
    date_from="2023-01-01",
    description="Social Science Computer Review papers",
    api_key=api_key,
    filename="sscr_classified.csv",
)

# Add readable column names
results["new_tool"] = results["category_1"]
results["applies_ai"] = results["category_2"]
results.to_csv("sscr_classified.csv", index=False)

`extract()`

Automatically discover and extract categories from paper abstracts when you don't have a predefined scheme. Returns a clean, deduplicated, semantically merged set of categories.

Academic source parameters

Same as classify(): journal_issn, journal_name, journal_field, topic_name, topic_id, paper_limit, date_from, date_to, polite_email.

Academic context parameters

Same as classify(): journal, field, research_focus, paper_metadata.

Core parameters

Parameter	Type	Default	Description
`input_data`	list/Series	None	Text data. Omit when using a journal source.
`api_key`	str	required	API key for the LLM provider
`description`	str	`""`	Description of the corpus
`input_type`	str	`"text"`	Input type: `"text"`, `"image"`, or `"pdf"`
`max_categories`	int	12	Maximum number of final categories to return
`categories_per_chunk`	int	10	Categories to extract per chunk
`divisions`	int	12	Number of chunks to divide data into
`iterations`	int	8	Number of extraction passes over the data
`user_model`	str	`"gpt-4o"`	Model to use
`model_source`	str	`"auto"`	Provider
`creativity`	float	None	Temperature
`specificity`	str	`"broad"`	`"broad"` or `"specific"` category granularity
`research_question`	str	None	Research context to guide extraction
`focus`	str	None	Focus instruction (e.g. `"methodological contributions"`)
`mode`	str	`"text"`	PDF mode: `"text"`, `"image"`, or `"both"`
`filename`	str	None	Output CSV filename
`random_state`	int	None	Random seed for reproducibility
`chunk_delay`	float	0.0	Delay between API calls (seconds)
`progress_callback`	callable	None	Callback for progress updates

Returns: dict with keys:

counts_df: DataFrame of categories with counts
top_categories: List of top category names
raw_top_text: Raw model output from final merge step

Example:

import catademic as cat

results = cat.extract(
    journal_issn="0894-4393",
    paper_limit=250,
    date_from="2023-01-01",
    description="Social Science Computer Review papers",
    api_key=api_key,
    max_categories=10,
    focus="methodological contributions",
)

print(results["top_categories"])
# ['Computational text analysis', 'Survey methodology', 'Network analysis', ...]

`explore()`

Raw category extraction for frequency and saturation analysis. Unlike extract(), which normalizes and merges categories into a clean final set, explore() returns every category string from every chunk across every iteration — with duplicates intact.

This is useful for analyzing which categories are robust (consistently discovered across runs) versus noise (appearing only once or twice). Increasing iterations lets you build saturation curves showing when category discovery converges.

Academic source parameters

Same as classify(): journal_issn, journal_name, journal_field, topic_name, topic_id, paper_limit, date_from, date_to, polite_email.

Core parameters

Parameter	Type	Default	Description
`input_data`	list/Series	None	Text data. Omit when using a journal source.
`api_key`	str	required	API key for the LLM provider
`description`	str	`""`	Description of the corpus
`max_categories`	int	12	Max categories per chunk
`categories_per_chunk`	int	10	Categories to extract per chunk
`divisions`	int	12	Number of chunks to divide data into
`iterations`	int	8	Number of passes over the data
`user_model`	str	`"gpt-4o"`	Model to use
`model_source`	str	`"auto"`	Provider
`creativity`	float	None	Temperature
`specificity`	str	`"broad"`	`"broad"` or `"specific"` category granularity
`research_question`	str	None	Research context
`focus`	str	None	Focus instruction for extraction
`random_state`	int	None	Random seed for reproducibility
`filename`	str	None	Output CSV filename (one category per row)
`chunk_delay`	float	0.0	Delay between API calls (seconds)
`progress_callback`	callable	None	Callback for progress updates

Returns: list[str] — every category extracted from every chunk across every iteration. Length ≈ iterations × divisions × categories_per_chunk.

Example:

import catademic as cat
from collections import Counter

raw_categories = cat.explore(
    journal_issn="0894-4393",
    paper_limit=250,
    date_from="2023-01-01",
    description="Social Science Computer Review papers",
    api_key=api_key,
    iterations=20,
    filename="sscr_categories_raw.csv",
)

counts = Counter(raw_categories)
for category, freq in counts.most_common(15):
    print(f"{freq:3d}x  {category}")

`summarize()`

Summarize text or PDF documents using LLMs. Supports single-model and multi-model (ensemble) summarization. In multi-model mode, summaries from all models are synthesized into a consensus summary. Input type is auto-detected.

Parameter	Type	Default	Description
`input_data`	list/Series/str	required	Text list, pandas Series, single string, PDF directory, or list of PDF paths
`api_key`	str	None	API key (single-model mode)
`description`	str	`""`	Description of the content (provides context)
`instructions`	str	`""`	Specific summarization instructions (e.g. `"bullet points"`)
`max_length`	int	None	Maximum summary length in words
`focus`	str	None	What to focus on (e.g. `"key findings"`, `"methodology"`)
`user_model`	str	`"gpt-4o"`	Model to use
`model_source`	str	`"auto"`	Provider
`mode`	str	`"image"`	PDF processing mode: `"image"`, `"text"`, or `"both"`
`pdf_dpi`	int	150	DPI for PDF rendering
`creativity`	float	None	Temperature
`thinking_budget`	int	0	Provider-specific reasoning budget
`chain_of_thought`	bool	True	Enable step-by-step reasoning
`context_prompt`	bool	False	Add expert context prefix
`step_back_prompt`	bool	False	Step-back prompting
`filename`	str	None	Output CSV filename
`save_directory`	str	None	Directory to save results
`progress_callback`	callable	None	Callback for progress updates
`models`	list	None	List of `(model, provider, api_key)` tuples for multi-model mode
`max_workers`	int	None	Max parallel workers
`safety`	bool	False	Save progress after each item
`max_retries`	int	5	Max retries per API call
`fail_strategy`	str	`"partial"`	`"partial"` or `"strict"`
`row_delay`	float	0.0	Delay between rows (seconds)
`batch_mode`	bool	False	Use async batch API (text input only)
`batch_poll_interval`	float	30.0	Seconds between batch status checks
`batch_timeout`	float	86400.0	Max seconds to wait for batch completion

Returns: pandas.DataFrame with columns:

survey_input: Original text or page label (PDFs)
summary: Generated summary (or consensus for multi-model)
summary_<model>: Per-model summaries (multi-model only)
processing_status: "success", "error", or "skipped"
failed_models: Comma-separated list (multi-model only)
pdf_path, page_index: PDF metadata (PDF mode only)

Example:

import catademic as cat

# Single model
results = cat.summarize(
    input_data=df["abstracts"],
    description="Academic paper abstracts",
    instructions="Summarize the key findings in 2-3 bullet points",
    max_length=100,
    api_key=api_key,
)

# Multi-model ensemble
results = cat.summarize(
    input_data=df["abstracts"],
    description="Academic paper abstracts",
    models=[
        ("gpt-4o", "openai", openai_key),
        ("claude-sonnet-4-5-20250929", "anthropic", anthropic_key),
    ],
)

# PDF summarization
results = cat.summarize(
    input_data="/path/to/papers/",
    description="Research papers",
    mode="text",
    api_key=api_key,
)

Discovery Functions

These functions help you find journals and topics in OpenAlex before running classification or extraction.

`find_journal(name)`

Search OpenAlex for journals matching a name string.

>>> cat.find_journal("socius")
                                      display_name       issn  works_count
0  Socius Sociological Research for a Dynamic World  2378-0231         1039
1                                     Jurnal Socius  2089-9661          315

`find_journals_by_field(field)`

Search for journals whose name or focus matches a field.

>>> cat.find_journals_by_field("demography")
# Returns DataFrame with display_name, issn, works_count, publisher

# Then fetch across all demography journals:
>>> cat.classify(categories=[...], journal_field="demography", api_key=api_key)

`find_topic(name)`

Search for academic topics/concepts in OpenAlex.

>>> cat.find_topic("machine learning")
        display_name  level  works_count
0  Machine learning      2      1234567
1    Deep learning       3       567890

# Then filter papers by topic:
>>> cat.classify(categories=[...], topic_name="machine learning", api_key=api_key)

`fetch_academic_papers()`

Fetch papers directly for custom analysis.

papers = cat.fetch_academic_papers(
    journal_issn="0894-4393",
    limit=100,
    date_from="2023-01-01",
    polite_email="you@example.com",
)
# Returns DataFrame with: text, title, doi, publication_year, authors,
# cited_by_count, fwci, keywords, concepts, topic, ...

Multi-Model Ensemble

Use multiple models to classify the same data and aggregate their decisions via consensus voting. This reduces false positives and improves reliability.

results = cat.classify(
    categories=["Empirical", "Theoretical", "Methodological", "Other"],
    journal_issn="0894-4393",
    paper_limit=50,
    models=[
        ("gpt-4o", "openai", openai_key),
        ("claude-sonnet-4-5-20250929", "anthropic", anthropic_key),
        ("gemini-2.5-flash", "google", google_key),
    ],
    consensus_threshold="unanimous",  # or "majority", "two-thirds", or 0.75
)

The result DataFrame includes:

Per-model columns (e.g. category_1_gpt_4o, category_1_claude_sonnet_...)
Consensus columns (category_1_consensus, category_2_consensus, …)
agreement_score: Proportion of models that agreed

Thresholds:

"unanimous" (default): All models must agree — highest precision, most conservative
"majority": >50% agreement
"two-thirds": >67% agreement
Any float between 0 and 1

Per-model temperature can be set with 4-tuples:

models=[
    ("gpt-4o", "openai", openai_key, {"creativity": 0.0}),
    ("gpt-4o", "openai", openai_key, {"creativity": 0.3}),
]

Batch Mode

For large classification jobs, batch mode uses the provider's async batch API — typically 50% cheaper with higher rate limits. Supported by OpenAI, Anthropic, Google, Mistral, and xAI.

results = cat.classify(
    categories=categories,
    journal_issn="0894-4393",
    paper_limit=500,
    api_key=api_key,
    batch_mode=True,
    batch_poll_interval=60,   # check every 60 seconds
    batch_timeout=86400,      # wait up to 24 hours
)

Limitations: text input only (no PDF/image), single model only (no ensemble), no progress_callback.

Related Projects

cat-stack: The shared classification engine that powers cat-ademic.
cat-llm: The full CatLLM ecosystem meta-package for text classification.
cat-vader: Social media text classification with platform-aware context injection.

Academic Research

If you use this package for research, please cite:

Soria, C. (2025). cat-ademic (0.1.0). GitHub. https://github.com/chrissoria/cat-ademic

Contributing & Support

Report bugs or request features: Open a GitHub Issue
Research collaboration: Email ChrisSoria@Berkeley.edu

License

cat-ademic is distributed under the terms of the GNU GPL-3.0 license.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.2.1

May 17, 2026

0.2.0

May 12, 2026

This version

0.1.2

Mar 19, 2026

0.1.1

Mar 5, 2026

0.1.0

Mar 5, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cat_ademic-0.1.2.tar.gz (335.3 kB view details)

Uploaded Mar 19, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

cat_ademic-0.1.2-py3-none-any.whl (337.9 kB view details)

Uploaded Mar 19, 2026 Python 3

File details

Details for the file cat_ademic-0.1.2.tar.gz.

File metadata

Download URL: cat_ademic-0.1.2.tar.gz
Upload date: Mar 19, 2026
Size: 335.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.14

File hashes

Hashes for cat_ademic-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`d865dff65009bd730adf6b996248d3c8262321b60606fb4e2b41bae59c88eec4`
MD5	`ce8a7f10fc63687118de667d8bf4a2ff`
BLAKE2b-256	`a1ee3409979edc962e76faead132512856cc6292cbfea7cc1352345007dbd8da`

See more details on using hashes here.

File details

Details for the file cat_ademic-0.1.2-py3-none-any.whl.

File metadata

Download URL: cat_ademic-0.1.2-py3-none-any.whl
Upload date: Mar 19, 2026
Size: 337.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.14

File hashes

Hashes for cat_ademic-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c63acacbd4c20d4a4eb9828d3fa14ba124a9d67572d20176afb6299762c49f5d`
MD5	`80d7b12853f3b290614a41add04b7f67`
BLAKE2b-256	`55379217623a2dbbbc8f86a228e7d95382195f1a200987c785cc420ce0223e55`

See more details on using hashes here.

cat-ademic 0.1.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

cat-ademic

The Problem

The Solution

Table of Contents

Installation

Quick Start

Best Practices for Classification

What works

What doesn't help (or hurts)

Summary

Configuration

Get Your API Key

Supported Models

API Reference

classify()

Academic source parameters

Academic context parameters

Core parameters

Multi-model ensemble parameters

Batch mode parameters

Prompt strategy parameters

Advanced parameters

extract()

Academic source parameters

Academic context parameters

Core parameters

explore()

Academic source parameters

Core parameters

summarize()

Discovery Functions

find_journal(name)

find_journals_by_field(field)

find_topic(name)

fetch_academic_papers()

Multi-Model Ensemble

Batch Mode

Related Projects

Academic Research

Contributing & Support

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`classify()`

`extract()`

`explore()`

`summarize()`

`find_journal(name)`

`find_journals_by_field(field)`

`find_topic(name)`

`fetch_academic_papers()`