Literature Mapper

An AI-powered Python library for systematic, scalable analysis of academic literature.

Literature Mapper turns a folder of PDF articles into a structured, queryable SQLite database. It combines local PDF processing with Gemini AI analysis and OpenAlex citation data to create a rich Knowledge Graph of your research field.


Features

  • Knowledge Graph Extraction: Automatically extracts concepts, authors, methods, and findings as connected nodes.
    • Nodes: Papers, Concepts, Findings, Methods, Authors, Institutions, Limitations.
    • Edges: PAPER -> HAS_CONCEPT, PAPER -> HAS_METHOD, AUTHOR -> COAUTHORED_WITH, CONCEPT -> RELATED_TO.
    • Storage: Normalized SQLite schema (kg_nodes, kg_edges), exportable to .gexf for graph tools.
  • OpenAlex Integration: Automatically fetches citation counts and references for papers in your corpus, enabling robust bibliometric analysis.
  • Ghost Hunting: Algorithms to identify missing pieces in your literature review:
    • Bibliographic Ghosts: Papers frequently cited by your corpus but missing from it.
    • Missing Authors: Influential authors cited by your corpus who aren't directly represented.
  • Thematic Agents: Synthesize answers and validate hypotheses using the Knowledge Graph.
    • Argument Agent: Aggregates evidence to answer research questions.
    • Validation Agent: Critiques hypotheses against the literature.
  • Semantic Search: Find relevant content by meaning using vector embeddings.
  • Gemini Models: Works with any available Gemini model (default: gemini-2.5-flash).
  • Clean Database Schema: SQLite with proper constraints and relational tables.
  • Simple CLI: Process, query, and export directly from the terminal.
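
Because the Knowledge Graph lives in ordinary SQLite tables, it can in principle be queried with plain SQL. The sketch below builds a tiny in-memory database to show the idea. Only the table names kg_nodes and kg_edges come from this README; the column names (id, type, label, source_id, target_id, relation) and the helper papers_with_concept are assumptions for illustration, so inspect your own corpus.db before adapting it.

```python
import sqlite3

# Hypothetical schema: only the table names kg_nodes and kg_edges come from
# the README; every column name below is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE kg_nodes (
    id INTEGER PRIMARY KEY,
    type TEXT NOT NULL,
    label TEXT NOT NULL
);
CREATE TABLE kg_edges (
    source_id INTEGER NOT NULL REFERENCES kg_nodes(id),
    target_id INTEGER NOT NULL REFERENCES kg_nodes(id),
    relation TEXT NOT NULL
);
""")

# Toy rows standing in for an extracted graph.
conn.executemany(
    "INSERT INTO kg_nodes (id, type, label) VALUES (?, ?, ?)",
    [
        (1, "paper", "Paper A"),
        (2, "paper", "Paper B"),
        (3, "concept", "hallucination"),
        (4, "method", "survey"),
    ],
)
conn.executemany(
    "INSERT INTO kg_edges (source_id, target_id, relation) VALUES (?, ?, ?)",
    [
        (1, 3, "HAS_CONCEPT"),
        (2, 3, "HAS_CONCEPT"),
        (1, 4, "HAS_METHOD"),
    ],
)


def papers_with_concept(connection, concept_label):
    """Return labels of papers linked to a concept by a HAS_CONCEPT edge."""
    rows = connection.execute(
        """
        SELECT p.label
        FROM kg_edges AS e
        JOIN kg_nodes AS p ON p.id = e.source_id
        JOIN kg_nodes AS c ON c.id = e.target_id
        WHERE e.relation = 'HAS_CONCEPT' AND c.label = ?
        ORDER BY p.label
        """,
        (concept_label,),
    ).fetchall()
    return [label for (label,) in rows]


print(papers_with_concept(conn, "hallucination"))  # ['Paper A', 'Paper B']
```

The same join pattern would extend to HAS_METHOD or RELATED_TO edges by changing the relation filter, assuming the real schema stores edge types in a comparable column.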

Installation

# Install from PyPI
pip install literature-mapper

# Or install the latest commit from GitHub
pip install git+https://github.com/jeremiahbohr/literature-mapper.git

# Configure your Google AI API key
export GEMINI_API_KEY="your_api_key_here"

Quick Start (Jupyter / Python)

from literature_mapper import LiteratureMapper

# 1: Initialize the mapper (creates corpus.db)
mapper = LiteratureMapper("./my_ai_research")

# 2: Process PDFs (Extracts Metadata + Knowledge Graph)
results = mapper.process_new_papers(recursive=True)
print(f"Processed: {results.processed}")

# 3: Fetch Citations (OpenAlex)
# Populates citation counts and references for processed papers
mapper.update_citations()

# 4: Synthesize Answers (Argument Agent)
answer = mapper.synthesize_answer("What are the limitations of current methods?")
print(answer)

# 5: Validate Hypotheses (Validation Agent)
critique = mapper.validate_hypothesis("Current methods have solved the problem of hallucination.")
print(critique['verdict'])  # e.g., "CONTRADICTED"
print(critique['explanation'])

# 6: Export Data
mapper.export_to_csv("corpus.csv")

Command-Line Interface

Literature Mapper includes a CLI for processing, querying, and exporting your research corpus from the terminal.

Core Workflow

  1. Process PDFs: Extract text and build the Knowledge Graph.

    literature-mapper process ./my_research --recursive
    
  2. Fetch Citations: Enrich your corpus with data from OpenAlex.

    literature-mapper citations ./my_research
    
  3. Analyze Status: View corpus statistics and health.

    literature-mapper status ./my_research
    

Visualization

Export your corpus as a .gexf file for visualization in tools like Gephi.

# Default: Semantic Knowledge Graph
literature-mapper viz ./my_research --output graph.gexf

| Mode | Description | Best For |
| --- | --- | --- |
| semantic (Default) | The full Knowledge Graph (Concepts, Findings, Methods). | Understanding the logical structure of arguments. |
| authors | Co-authorship network (weighted by shared papers). | Identifying "Invisible Colleges" and key researchers. |
| concepts | Topic co-occurrence network. | Mapping the "Topic Landscape" of the field. |
| river | Same as concepts, but adds a start year attribute. | Creating dynamic networks (similar to ThemeRiver visualizations) in Gephi. |
| similarity | Paper similarity map based on shared concepts (Jaccard Index). | Finding thematically similar papers without direct citations. |
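
To make the similarity mode concrete, here is a minimal sketch of a Jaccard Index computed over the concept sets of each pair of papers. It is an illustration of the measure, not the library's implementation: the function names, the toy data, and the 0.3 threshold are all assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Toy data: paper title -> concepts (all values are made up).
paper_concepts = {
    "Paper A": {"hallucination", "retrieval", "evaluation"},
    "Paper B": {"hallucination", "evaluation", "alignment"},
    "Paper C": {"graph mining"},
}


def similarity_edges(concepts_by_paper, threshold=0.3):
    """Return (paper, paper, weight) for pairs at or above the threshold."""
    edges = []
    for left, right in combinations(sorted(concepts_by_paper), 2):
        weight = jaccard(concepts_by_paper[left], concepts_by_paper[right])
        if weight >= threshold:
            edges.append((left, right, weight))
    return edges


print(similarity_edges(paper_concepts))  # [('Paper A', 'Paper B', 0.5)]
```

Papers A and B share two of four distinct concepts, giving a weight of 0.5, while Paper C shares none and gets no edge. This is why the map can link thematically close papers that never cite each other.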

Ghost Hunting

Identify missing links and gaps in your literature review.

literature-mapper ghosts ./my_research --mode <MODE>

| Mode | Description |
| --- | --- |
| bibliographic (Default) | Identifies papers frequently cited by your corpus but missing from it. Helps you find seminal works you missed. |
| authors | Identifies authors frequently cited by your corpus but not represented in it. Helps you find key voices in the field. |
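
The idea behind the bibliographic mode can be sketched in a few lines: count how often each work is cited across the corpus, then keep only the works that are not themselves in the corpus. The function name, the min_citations cutoff, and the "W…" identifiers below are assumptions for illustration and may differ from how the library actually ranks ghosts.

```python
from collections import Counter


def find_bibliographic_ghosts(references_by_paper, min_citations=2):
    """Return (work_id, count) for works cited by the corpus but absent from it."""
    in_corpus = set(references_by_paper)
    counts = Counter()
    for cited in references_by_paper.values():
        # Count each cited work at most once per citing paper.
        counts.update(set(cited) - in_corpus)
    ghosts = [(work, n) for work, n in counts.items() if n >= min_citations]
    # Most-cited first; ties broken alphabetically for stable output.
    return sorted(ghosts, key=lambda item: (-item[1], item[0]))


# Toy corpus: paper id -> ids of the works it cites (all made up).
corpus = {
    "W1": ["W2", "W9", "W7"],
    "W2": ["W9", "W7"],
    "W3": ["W9", "W1", "W8"],
}

print(find_bibliographic_ghosts(corpus))  # [('W9', 3), ('W7', 2)]
```

The authors mode would follow the same pattern with author identifiers in place of work identifiers.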

Analysis Tools

# Synthesize an answer to a research question
literature-mapper synthesize ./my_research "What is the impact of X on Y?"

# Validate a hypothesis against the corpus
literature-mapper validate ./my_research "X causes Y."

# Identify Hubs (Most Cited Papers)
literature-mapper hubs ./my_research

# View Comprehensive Corpus Statistics
literature-mapper stats ./my_research

Configuration via Environment Variables

| Variable | Purpose | Default |
| --- | --- | --- |
| GEMINI_API_KEY | Required. Google AI key | None |
| LITERATURE_MAPPER_MODEL | Default model for CLI | gemini-2.5-flash |
| LITERATURE_MAPPER_MAX_FILE_SIZE | Max PDF size (bytes) | 52428800 (50 MB) |
| LITERATURE_MAPPER_BATCH_SIZE | PDFs processed per batch | 10 |
| LITERATURE_MAPPER_LOG_LEVEL | Log level (DEBUG, INFO, …) | INFO |
| LITERATURE_MAPPER_VERBOSE | Set to true for debug logs | false |
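
If you want to inspect or validate these variables from your own scripts, something like the following sketch may help. It mirrors the variable names and defaults in the table above, but the Settings class and load_settings function are hypothetical helpers, not part of the literature_mapper API, and the library may parse these values differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    api_key: str
    model: str
    max_file_size: int
    batch_size: int
    log_level: str
    verbose: bool


def load_settings(env=None):
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is required but not set")
    return Settings(
        api_key=api_key,
        model=env.get("LITERATURE_MAPPER_MODEL", "gemini-2.5-flash"),
        max_file_size=int(env.get("LITERATURE_MAPPER_MAX_FILE_SIZE", 52428800)),
        batch_size=int(env.get("LITERATURE_MAPPER_BATCH_SIZE", 10)),
        log_level=env.get("LITERATURE_MAPPER_LOG_LEVEL", "INFO").upper(),
        # Assumption: only the literal string "true" enables verbose mode.
        verbose=env.get("LITERATURE_MAPPER_VERBOSE", "false").strip().lower() == "true",
    )


# Pass an explicit mapping here so the example runs without a real key.
settings = load_settings({"GEMINI_API_KEY": "your_api_key_here"})
print(settings.model, settings.max_file_size // (1024 * 1024), "MB")
```

Calling load_settings() with no argument reads from the real environment and fails early if the key is missing, which is usually cheaper than discovering it midway through a batch of PDFs.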

Advanced Usage

Embeddings

Literature Mapper uses Google's models/text-embedding-004 to generate vector embeddings for every concept, finding, and paper title in the Knowledge Graph. This enables the agents to find relevant information based on semantic meaning (e.g., matching "hallucination" with "context loss") rather than just keyword overlap.
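
As a rough illustration of how embedding-based matching works, the sketch below ranks labels by cosine similarity to a query vector. The three-dimensional vectors are made-up stand-ins for real embeddings (which have far more dimensions), and the function names are assumptions; this is not the library's search code and it makes no API calls.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors standing in for real embeddings (values are invented).
toy_embeddings = {
    "hallucination": [0.9, 0.1, 0.0],
    "context loss": [0.8, 0.3, 0.1],
    "citation network": [0.0, 0.2, 0.9],
}


def rank_by_similarity(query_vector, embeddings):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [
        (label, cosine_similarity(query_vector, vector))
        for label, vector in embeddings.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranking = rank_by_similarity(toy_embeddings["hallucination"], toy_embeddings)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

With these toy values, "context loss" scores close to "hallucination" even though the two strings share no words, while "citation network" scores near zero.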

OpenAlex Integration

The system uses OpenAlex to fetch high-quality citation data. It attempts to match papers by DOI first, then by title. This data is crucial for the bibliographic and authors ghost modes. No API key is required for OpenAlex, but the system is configured to be polite with rate limits.
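
The DOI-first, title-second strategy can be sketched as a simple fallback. The example below matches against a small in-memory list rather than calling OpenAlex, so the record shapes, the "W…" identifiers, and the normalization rules are assumptions for illustration and not the library's actual matching code.

```python
import re


def normalize_doi(doi):
    """Lowercase a DOI and strip a leading resolver URL or 'doi:' prefix."""
    doi = doi.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi)


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def match_record(paper, candidates):
    """Return (candidate, how): DOI match first, then title, else (None, None)."""
    doi = paper.get("doi")
    if doi:
        wanted = normalize_doi(doi)
        for candidate in candidates:
            if candidate.get("doi") and normalize_doi(candidate["doi"]) == wanted:
                return candidate, "doi"
    wanted_title = normalize_title(paper.get("title") or "")
    if wanted_title:
        for candidate in candidates:
            if normalize_title(candidate.get("title") or "") == wanted_title:
                return candidate, "title"
    return None, None


# Hypothetical candidate records (ids, DOIs, and titles are made up).
candidates = [
    {"id": "W1", "doi": "https://doi.org/10.1234/ABC", "title": "Graphs of Science"},
    {"id": "W2", "doi": None, "title": "Mapping the Literature: A Survey"},
]

by_doi, how_doi = match_record({"doi": "10.1234/abc", "title": "Other"}, candidates)
by_title, how_title = match_record(
    {"doi": None, "title": "mapping the literature - a survey"}, candidates
)
print(by_doi["id"], how_doi)      # W1 doi
print(by_title["id"], how_title)  # W2 title
```

Trying the DOI first avoids false positives from papers with similar titles; the title comparison is only a fallback for PDFs where no DOI could be extracted.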


Requirements

  • Python 3.8 or newer
  • Google AI API key (create one in Google AI Studio)
  • Internet connection (for Gemini API and OpenAlex)

License

Released under the MIT License. See the LICENSE file for full text.
