ZotGrep - Enhanced Zotero Library and Full-Text PDF Search
ZotGrep is a Python package that enables users to search their local Zotero library using the API and then search for full-text content within PDFs. It includes multiple output formats (CSV and Markdown) and Zotero URL integration for direct access to search results.
The general workflow has two stages: (a) a metadata search selects references in the Zotero library, and (b) a full-text search looks for a second set of keywords within the PDFs of those references. The output lists every hit across the matching references.
Credits: ZotGrep builds on the excellent pyzotero by Stephan Hugel.
Disclaimer: The project includes substantial parts that were vibe-coded using Claude and ChatGPT.
Features
Core Functionality
- Search Zotero library metadata (titles, authors, etc.)
- Full-text search within PDF attachments (both linked and imported files)
- Easy-to-use web interface and advanced command line interface for power users and AI agents.
- Support for both local linked files and Zotero-stored PDFs
- Context-aware text extraction with highlighted search terms
- Multiple Output Formats: Save search results to CSV or Markdown files
- CSV Export: Structured data format with comprehensive metadata for analysis
- Markdown Export: Research-friendly format with YAML frontmatter for note-taking apps
- Zotero URLs: Direct links to open items and specific PDF pages in Zotero
- Enhanced Metadata: Author names, publication years, and timestamps
- Command-line Options: Automation-friendly with argument parsing
- Interactive Output Choice: Choose between CSV, Markdown, or no file output
Installation
Install from PyPI
For most users, install ZotGrep from PyPI and run it as a CLI tool:
uv tool install zotgrep
Then launch it with:
zotgrep --help
If you want to use ZotGrep as a dependency inside another uv-managed project instead of as a global tool:
uv add zotgrep
Install from a source checkout
If you are developing ZotGrep or want to run the local repository in editable mode:
git clone https://github.com/franciscowilhelm/zotgrep.git
cd zotgrep
uv venv .venv
source .venv/bin/activate
uv pip install -e .
Usage
Web Interface
Launch the local web interface with:
zotgrep --web
By default the web UI runs on http://127.0.0.1:23120. You can override that with --port, for example:
zotgrep --web --port 23121
ZotGrep supports a user config file for persistent defaults. You can manage the most important settings via the web UI under General Settings. Use that page to save defaults such as linked-file paths, result limits, and context-window size. The main search page then uses those saved defaults and keeps only per-search inputs in the form. See Advanced for configuring settings via a JSON file.
Interactive Search via Command Line Interface
You can use ZotGrep via the zotgrep shell command or the module interface, which provides an interactive shell:
zotgrep
Equivalent module form:
python -m zotgrep
This will prompt you for:
- Metadata search terms (searches titles, authors, etc.)
- Full-text search terms (comma-separated list)
After displaying results, you'll be offered output format choices:
Output options:
1. CSV file (spreadsheet format)
2. Markdown file (research notes format)
3. No file output
Choose output format (1/2/3):
Direct Search via Command Line
You can specify search terms directly via command line arguments for non-interactive use:
zotgrep --zotero "career engagement"
zotgrep --zotero "career engagement" --fulltext "barriers"
zotgrep --zotero "career engagement" --metadata-only
zotgrep --zotero "career engagement" --no-abstract
zotgrep --zotero "AI ethics" --item-type "journalArticle, bookSection" --tags "privacy, fairness" --tag-match any
zotgrep --zotero "measurement invariance" --collection "Focused Review"
- `--zotero` specifies the metadata search string (e.g., title, author, etc.).
- `--fulltext` optionally specifies the full-text search terms (comma-separated).
- `--metadata-only` / `--no-fulltext` runs only the metadata search and skips PDF/full-text processing.
- Abstracts are included by default; `--no-abstract` omits them.
- `--publication` / `--publication-title` filters results by publication title (comma-separated for multiple).
- `--item-type` / `--itemtype` filters by Zotero item type (comma-separated for multiple).
- `--collection` filters by a Zotero collection key or exact collection name.
- `--tag` / `--tags` filters by Zotero tag (comma-separated for multiple).
- `--tag-match {all,any}` controls whether all supplied tags are required or any single tag is enough.
This allows for scripting and automation without interactive prompts. All other output and configuration options remain available.
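For scripted use from Python, one option is to assemble the argument list and hand it to `subprocess`. The sketch below only uses flags documented above; the helper name `build_zotgrep_command` and its parameters are hypothetical and not part of ZotGrep.

```python
import shlex


def build_zotgrep_command(zotero, fulltext=None, tags=None, tag_match=None, csv_path=None):
    """Build an argv list for the zotgrep CLI (hypothetical helper)."""
    cmd = ["zotgrep", "--zotero", zotero]
    if fulltext:
        # The CLI expects full-text terms as one comma-separated string.
        cmd += ["--fulltext", ", ".join(fulltext)]
    if tags:
        cmd += ["--tags", ", ".join(tags)]
    if tag_match:
        cmd += ["--tag-match", tag_match]
    if csv_path:
        cmd += ["--csv", csv_path]
    return cmd


cmd = build_zotgrep_command(
    "AI ethics",
    fulltext=["privacy", "fairness"],
    csv_path="results.csv",
)
# Print a shell-quoted version for logging or copy-paste.
print(shlex.join(cmd))

# To actually run the search (requires zotgrep on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell-quoting problems with search terms that contain spaces.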
Example with output:
zotgrep --zotero "AI ethics" --fulltext "privacy, fairness" --csv results.csv
Example with publication filter (list via comma-separated values):
zotgrep --zotero "AI ethics" --fulltext "privacy, fairness" --publication "Nature, Science"
Example with metadata filters:
zotgrep --zotero "AI ethics" --fulltext "privacy, fairness" --item-type "journalArticle" --collection "Focused Review" --tags "privacy, fairness" --tag-match all
Output Format Options
CSV Export
Save results to CSV format for data analysis and spreadsheet applications:
# Save to specific CSV file
zotgrep --csv results.csv
# Save to CSV only (no console output)
zotgrep --csv results.csv --csv-only
Markdown Export
Save results to Markdown format for research notes and documentation:
# Save to specific Markdown file
zotgrep --md results.md
zotgrep --markdown results.md
# Save to Markdown only (no console output)
zotgrep --md results.md --md-only
zotgrep --markdown results.md --markdown-only
JSON Export
Structured JSON output is saved by default unless --no-json is used. You can also specify a filename explicitly:
zotgrep --zotero "career engagement" --json results.json
zotgrep --zotero "career engagement" --no-json
JSON and Markdown frontmatter both record the applied metadata filters inside search_details.metadata_filters.
Interactive Output Choice
When no output format is specified, the script offers an interactive menu to choose between CSV, Markdown, or no file output.
Output Formats
CSV Output Format
The CSV file includes the following columns for structured data analysis:
| Column | Description |
|---|---|
| `reference_title` | Title of the reference |
| `authors` | Author names (`Last, First;` format) |
| `publication_year` | Publication year |
| `publication_title` | Journal or publication title |
| `doi` | DOI when available |
| `reference_key` | Zotero item key |
| `abstract` | Abstract text unless `--no-abstract` is requested |
| `pdf_filename` | PDF file name |
| `pdf_key` | PDF attachment key |
| `page_number` | Page where term was found |
| `search_term_found` | The specific search term that matched |
| `context` | Text context around the found term |
| `zotero_item_url` | URL to open item in Zotero |
| `zotero_pdf_url` | URL to open specific PDF page in Zotero |
| `search_timestamp` | When the search was performed |
Notes:
- Metadata-only runs include only reference-level columns.
- Full-text columns such as `pdf_filename`, `page_number`, `search_term_found`, `context`, and `zotero_pdf_url` are included only when full-text hits exist.
- The `abstract` column is omitted when `--no-abstract` is used.
- CSV uses plain `context`; the Markdown-only highlighted variant is not written to CSV.
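As a rough illustration of working with the CSV export, the sketch below counts hits per reference with the standard library. It relies only on column names from the table above; the sample rows are invented for the example, and `hits_per_reference` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Invented sample rows using a subset of the documented columns.
SAMPLE_CSV = """reference_title,reference_key,page_number,search_term_found,context
Paper A,AAAA1111,8,privacy,Some context about privacy.
Paper A,AAAA1111,12,fairness,Some context about fairness.
Paper B,BBBB2222,3,privacy,More context about privacy.
"""


def hits_per_reference(csv_file):
    """Count rows (one row per full-text hit) for each reference."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[(row["reference_key"], row["reference_title"])] += 1
    return counts


counts = hits_per_reference(io.StringIO(SAMPLE_CSV))
for (key, title), n in counts.most_common():
    print(f"{key}  {n:>3}  {title}")
```

For a real export, replace the `io.StringIO` object with `open("results.csv", newline="", encoding="utf-8")`.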
Markdown Output Format
The Markdown output is designed for research note-taking and literature review workflows, and is structured as follows:
- YAML Frontmatter Block: At the top, a compact YAML block stores the format version, `search_details`, and a `summary`.
- Search Summary and Reference List: A summary of the search and a numbered reference list for all papers.
- Abstracts Section: By default, a separate abstracts section appears after the reference list unless `--no-abstract` is supplied.
- Detailed Findings Section:
  - In full-text mode, each paper includes metadata, a per-term occurrence summary, and numbered annotation excerpts with Zotero PDF links.
  - In metadata-only mode, the file still includes the reference list and abstracts, but the detailed findings section states that no annotation-level findings were generated.
- Detailed Paper Sections in Full-Text Mode: For each paper:
  - The paper title as a heading.
  - Metadata bullets: authors, year, publication, DOI, citekey, and a direct Zotero link.
  - A `Term Summary` subheading with occurrence counts per search term.
  - An `Annotations` subheading with numbered occurrences and page-specific Zotero links.
  - A horizontal rule (`---`) separates each paper section.
This format is compatible with note-taking applications like Obsidian and supports direct navigation to Zotero items and PDF pages.
Example Markdown Structure:
---
zotgrep-results/v1:
search_details:
zotero_query: bifactor
full_text_query:
- psycho
search_mode: fulltext
search_timestamp: '2025-06-08 17:15:00'
context_window: 2
summary:
total_papers_found: 3
total_annotations_found: 17
---
# ZotGrep Results
## Search Summary
- **Search Date:** 2025-06-08 17:15:00
- **Zotero Library Query:** `bifactor`
- **Full-Text Query:** `psycho`
- **Results:** Found **17** annotations across **3** papers.
### Reference List
1. Neufeld, S., St Clair, M., Brodbeck, J., Wilkinson, P., Goodyer, I., & Jones, P. (2024). *Measurement Invariance in Longitudinal Bifactor Models: Review and Application Based on the p Factor*.
2. Bornovalova, M., Choate, A., Fatimah, H., Petersen, K., & Wiernik, B. (2020). *Appropriate Use of Bifactor Analysis in Psychopathology Research: Appreciating Benefits and Limitations*.
3. Watts, A., Poore, H., & Waldman, I. (2019). *Riskier Tests of the Validity of the Bifactor Model of Psychopathology*.
---
## Detailed Findings
### Measurement Invariance in Longitudinal Bifactor Models: Review and Application Based on the p Factor
- **Authors**: Neufeld, Sharon A. S.; St Clair, Michelle; Brodbeck, Jeannette; Wilkinson, Paul O.; Goodyer, Ian M.; Jones, Peter B.
- **Year**: 2024
- **Publication**: Psychological Assessment
- **DOI**: https://doi.org/10.1037/pas0000564
- **Citekey**: `73ZD2D7S`
- **Zotero Link**: [Open Item in Zotero](zotero://select/library/items/73ZD2D7S)
#### Term Summary
- `psycho`: 2 occurrences
#### Annotations
##### Occurrence #1, Page 8
> Thus far we have reviewed the importance of establishing longitudinal MI in bifactor models, provided guidance on MI cut-offs to employ when ordered-categorical indicators are utilized, and outlined estimator choices and missing data considerations. ...
> - Highlight on [Page 8](zotero://open-pdf/library/items/KGDL8AWR?page=8)
##### Occurrence #2, Page 18
> Psychological Assessment, 30(9), 1174–1185. https://doi.org/10.1037/pas0000564 ...
> - Highlight on [Page 18](zotero://open-pdf/library/items/KGDL8AWR?page=18)
---
# ... Additional paper sections follow the same structure ...
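If you want to post-process such a file, the frontmatter can be separated from the body with a few lines of standard-library Python. This is a minimal sketch: `split_frontmatter` is a hypothetical helper, and the indentation in the sample is an assumption about how the keys are nested.

```python
def split_frontmatter(markdown_text):
    """Return (frontmatter, body); frontmatter is '' if no block is found."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", markdown_text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            # Everything between the two '---' markers is the YAML block.
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", markdown_text


# Shortened sample; the nesting shown here is assumed, not confirmed.
sample = """---
zotgrep-results/v1:
  summary:
    total_papers_found: 3
    total_annotations_found: 17
---
# ZotGrep Results
"""

frontmatter, body = split_frontmatter(sample)
print(frontmatter)
print(body)
```

The extracted `frontmatter` string can then be parsed with PyYAML's `yaml.safe_load` if that package is installed.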
Quick Start Examples
Example 1: Basic Search with Interactive Output Choice
zotgrep
# Enter search terms when prompted
# Choose output format from the interactive menu
Example 2: Direct Search via Command Line (Non-Interactive)
zotgrep --zotero "deep learning" --fulltext "convolution, neural network"
# Runs search directly with specified terms, no prompts
Example 3: Direct CSV Export
zotgrep --zotero "AI ethics" --fulltext "privacy, fairness" --csv my_research_results.csv
# Results saved to CSV for data analysis
Example 4: Direct Markdown Export for Note-Taking
zotgrep --zotero "literature review" --fulltext "systematic, meta-analysis" --md literature_review.md
# Results saved to Markdown for research notes
Example 5: Silent Export (No Console Output)
zotgrep --zotero "machine learning" --fulltext "algorithm, bias" --markdown research_notes.md --markdown-only
# Only creates the Markdown file, no console output
Open Item in Zotero
zotero://select/library/items/ITEM_KEY
Open PDF at Specific Page
zotero://open-pdf/library/items/PDF_KEY?page=PAGE_NUMBER
These URLs can be clicked in spreadsheet applications or used programmatically to jump directly to relevant content in your Zotero library.
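For programmatic use, the two URL patterns above are easy to build from keys and page numbers. The function names in this sketch are hypothetical; only the URL shapes come from this document.

```python
from typing import Optional
from urllib.parse import quote


def zotero_item_url(item_key: str) -> str:
    """URL that selects an item in the Zotero library."""
    return f"zotero://select/library/items/{quote(item_key, safe='')}"


def zotero_pdf_url(pdf_key: str, page: Optional[int] = None) -> str:
    """URL that opens a PDF attachment, optionally at a given page."""
    url = f"zotero://open-pdf/library/items/{quote(pdf_key, safe='')}"
    if page is not None:
        url += f"?page={page}"
    return url


print(zotero_item_url("73ZD2D7S"))
print(zotero_pdf_url("KGDL8AWR", page=8))
```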
Example Output
Console Output
Reference: Machine Learning in Healthcare (Key: SMITH2023)
Authors: Smith, John; Doe, Jane
Year: 2023
PDF: smith_2023_ml_healthcare.pdf
Found 'algorithm' on Page: 15
Context: ...The machine learning ***algorithm*** demonstrated significant improvements...
Zotero PDF URL: zotero://open-pdf/library/items/PDF123?page=15
CSV Output
The same information is saved in structured CSV format for further analysis, reporting, or integration with other tools.
Markdown Output
Results are organized by paper with YAML frontmatter and annotations sections, perfect for research note-taking and literature review workflows.
Command Line Arguments
Output Options
- `--json FILENAME`: Save results to specified JSON file
- `--no-json`: Disable the default JSON export
- `--csv FILENAME`: Save results to specified CSV file
- `--csv-only`: Only save to CSV, suppress console output
- `--md FILENAME` or `--markdown FILENAME`: Save results to specified Markdown file
- `--md-only` or `--markdown-only`: Only save to Markdown, suppress console output
Search Term Options
- `--zotero "SEARCH TERMS"`: Specify Zotero metadata search terms directly (e.g., `"machine learning health"`)
- `--fulltext "TERM1, TERM2"`: Optionally specify full-text search terms as a comma-separated list (e.g., `"algorithm, bias"`)
- `--metadata-only` or `--no-fulltext`: Skip PDF/full-text processing and return metadata-only results
- `--no-abstract`: Omit abstracts from output
- `--publication "TITLE1, TITLE2"` or `--publication-title "TITLE1, TITLE2"`: Filter results by publication title (comma-separated list). Example: `"Nature, Science"`
- `--item-type "TYPE1, TYPE2"` or `--itemtype "TYPE1, TYPE2"`: Filter by Zotero item type. Example: `"journalArticle, book"`
- `--collection "COLLECTION"`: Filter by Zotero collection key or exact collection name. Example: `"ABCD1234"` or `"Focused Review"`
- `--tag "TAG1, TAG2"` or `--tags "TAG1, TAG2"`: Filter by Zotero tags. Example: `"privacy, fairness"`
- `--tag-match {all,any}`: Control how multiple tags are matched. `all` requires every supplied tag; `any` accepts at least one.
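The difference between the two `--tag-match` modes can be illustrated with plain set operations. This is only a sketch of the semantics described above, not ZotGrep's actual code; `parse_list` and `matches_tags` are hypothetical names, and exact, case-sensitive comparison is an assumption.

```python
def parse_list(value):
    """Split a comma-separated option value into trimmed, non-empty parts."""
    return [part.strip() for part in value.split(",") if part.strip()]


def matches_tags(item_tags, wanted_tags, mode="all"):
    """Return True if an item's tags satisfy the filter in the given mode."""
    if mode not in ("all", "any"):
        raise ValueError(f"unknown tag match mode: {mode!r}")
    item = set(item_tags)
    wanted = set(wanted_tags)
    if not wanted:
        return True  # no tag filter supplied
    if mode == "all":
        return wanted <= item  # every wanted tag must be present
    return bool(wanted & item)  # at least one wanted tag is present


wanted = parse_list("privacy, fairness")
print(matches_tags(["privacy", "ethics"], wanted, mode="any"))
print(matches_tags(["privacy", "ethics"], wanted, mode="all"))
```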
Other Options
- `--config CONFIG`: Path to configuration file (JSON format)
- `--base-path PATH`: Override base attachment path
- `--max-results N`: Maximum results for metadata search (default: 100)
- `--context-window N`: Context sentence window size (default: 2). With the default, the 2 sentences before and the 2 sentences after the one containing the keyword are returned; larger values return more sentences. Sentence splitting uses the Zotero item language when available, with a built-in fallback if no language-aware tokenizer is available at runtime.
- `--port PORT`: Port for the local web interface when using `--web` (default: 23120)
- `--version`: Show version information
- `--help`: Show help message
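To make the `--context-window` idea concrete, here is a small sketch of a sentence window around each match. It uses a naive regex splitter as a stand-in; ZotGrep itself relies on language-aware sentence splitting, so real output will differ on abbreviations and similar edge cases. All function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def context_windows(text, term, window=2):
    """Return one context string per sentence that contains the term."""
    sentences = split_sentences(text)
    results = []
    for i, sentence in enumerate(sentences):
        if term.lower() in sentence.lower():
            start = max(0, i - window)
            end = min(len(sentences), i + window + 1)
            results.append(" ".join(sentences[start:end]))
    return results


text = "One. Two. Three has the keyword. Four. Five. Six."
print(context_windows(text, "keyword", window=1))
print(context_windows(text, "keyword", window=2))
```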
Environment Variables
- `ZOTGREP_CONFIG_PATH`: Use a custom user config file path
- `ZOTERO_BASE_ATTACHMENT_PATH`: Base directory for linked-file attachments
- `ZOTERO_PUBLICATION_TITLE_FILTER`: Filter results by publication title (comma-separated list). Example: `Nature, Science`
- `ZOTERO_ITEM_TYPE_FILTER`: Filter by Zotero item type (comma-separated list). Example: `journalArticle, book`
- `ZOTERO_COLLECTION_FILTER`: Filter by Zotero collection key or exact collection name
- `ZOTERO_TAG_FILTER`: Filter by Zotero tags (comma-separated list). Example: `privacy, fairness`
- `ZOTERO_TAG_MATCH_MODE`: Control multi-tag matching with `all` or `any`
Use Cases
Research and Literature Review
- CSV Export: Create structured datasets for systematic literature reviews and meta-analyses
- Markdown Export: Generate research notes with proper citations and page references
- Direct Zotero Integration: Jump directly to source materials from search results
Academic Writing
- Evidence Collection: Quickly locate and cite relevant passages with page-specific references
- Collaborative Research: Share search results in both structured (CSV, JSON) and readable (Markdown) formats
- Literature Synthesis: Build comprehensive literature reviews with organized annotations
Knowledge Management
- Research Databases: Create searchable databases of research findings in CSV format
- Note-Taking Integration: Import Markdown results into Obsidian, Notion, or other note-taking apps
- Concept Tracking: Monitor mentions of specific concepts across your entire literature collection
- Citation Networks: Build interconnected knowledge bases with contextual references
Workflow Integration
- Data Analysis: Use CSV exports with R, Python, or Excel for quantitative literature analysis
- Documentation: Generate Markdown reports for research documentation and sharing
- Reference Management: Seamlessly integrate with existing Zotero workflows
- Agentic Workflows: Let AI agents drive the CLI and consume the JSON output, which is highly structured and machine-readable but less human-readable.
Troubleshooting
Common Issues
- PDF not found errors: For linked files, verify `BASE_ATTACHMENT_PATH` is correctly set; for Zotero-stored files, verify the attachment is available through Zotero
- API connection failures: Check Zotero credentials and internet connection
- Empty search results: Try broader search terms or check PDF text extraction
- File encoding issues: Both CSV and Markdown files use UTF-8 encoding for international characters
- Markdown formatting issues: Special characters are automatically escaped in Markdown output
- Output format conflicts: Cannot use both `--csv-only` and `--md-only` simultaneously
Debug Tips
- Check Zotero sync status
- For linked files, verify PDF files are accessible at the specified paths
- Test with simple search terms first
- Review console output for detailed error messages
- For Markdown issues, check that special characters are properly handled
- Use `--help` to see all available command-line options
Output Format Selection Guide
Choose CSV when:
- Performing quantitative analysis of search results
- Importing data into spreadsheet applications
- Conducting systematic literature reviews requiring structured data
- Integrating with data analysis tools (R, Python, etc.)
Choose Markdown when:
- Taking research notes and building knowledge bases
- Using note-taking applications like Obsidian or Notion
- Creating readable research documentation
- Building interconnected literature reviews
- Sharing results in a human-readable format
Advanced
Modifying settings via JSON file
In addition to configuring basic settings via the web UI, you can modify settings in these ways:
- manually, by creating or editing the JSON config file
- by pointing to another file with `--config PATH` or `ZOTGREP_CONFIG_PATH`
The recommended file path is:
~/.config/zotgrep/config.json
Typical persistent settings include:
{
"zotero_user_id": "0",
"zotero_api_key": "local",
"library_type": "user",
"base_attachment_path": "/path/to/your/linked/pdfs",
"max_results_stage1": 100,
"context_sentence_window": 2,
"item_type_filter": ["journalArticle"],
"collection_filter": "Focused Review",
"tag_filter": ["privacy", "fairness"],
"tag_match_mode": "all"
}
If you use only Zotero-stored files, leave base_attachment_path empty.
Environment variables override config-file values at runtime. The most relevant ones are:
export ZOTERO_BASE_ATTACHMENT_PATH='/path/to/your/zotero/attachments'
export ZOTGREP_CONFIG_PATH='/path/to/custom/config.json'
export ZOTERO_ITEM_TYPE_FILTER='journalArticle,book'
export ZOTERO_COLLECTION_FILTER='Focused Review'
export ZOTERO_TAG_FILTER='privacy,fairness'
export ZOTERO_TAG_MATCH_MODE='any'
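The precedence rule (environment variables over config-file values) can be pictured as a simple overlay. The sketch below is not ZotGrep's loader: the function name is hypothetical, and the pairing of each environment variable with a config key is inferred from the names shown above.

```python
import json
import os

# Assumed pairing of environment variables with config keys.
ENV_TO_KEY = {
    "ZOTERO_BASE_ATTACHMENT_PATH": "base_attachment_path",
    "ZOTERO_COLLECTION_FILTER": "collection_filter",
    "ZOTERO_TAG_MATCH_MODE": "tag_match_mode",
}
ENV_LIST_TO_KEY = {
    "ZOTERO_ITEM_TYPE_FILTER": "item_type_filter",
    "ZOTERO_TAG_FILTER": "tag_filter",
}


def effective_settings(config_text, env=None):
    """Overlay environment values on top of JSON config values."""
    env = os.environ if env is None else env
    settings = json.loads(config_text)
    for var, key in ENV_TO_KEY.items():
        if env.get(var):
            settings[key] = env[var]
    for var, key in ENV_LIST_TO_KEY.items():
        if env.get(var):
            # Comma-separated values become lists, as in the JSON file.
            settings[key] = [p.strip() for p in env[var].split(",") if p.strip()]
    return settings


config_text = '{"tag_filter": ["privacy", "fairness"], "tag_match_mode": "all"}'
env = {"ZOTERO_TAG_FILTER": "privacy,bias", "ZOTERO_TAG_MATCH_MODE": "any"}
settings = effective_settings(config_text, env)
print(settings)
```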
Full-text backend comparison: pypdfium2 vs. Zotero index (xpdf)
ZotGrep offers two Stage 2 full-text backends, selectable with `--fulltext-source`. The option is available only in the CLI; the web interface always uses `pdf`.
- `pdf` (default): ZotGrep downloads each PDF attachment and extracts text with pypdfium2. Provides page-level results and context windows.
- `zotero-index` (experimental): ZotGrep queries the full-text index that Zotero itself maintains (built internally using xpdf). No download is required, but page-level information is unavailable and not all attachments may be indexed.
The table below shows results across five searches against the same Zotero library (--zotero proactiv, metadata search mode: titleCreatorYear). Hits = total annotation matches returned (context windows, not unique papers).
| Full-text query | pypdfium2 hits | Zotero index hits | Hit Δ | pypdfium2 time | Zotero index time | Speed Δ |
|---|---|---|---|---|---|---|
| diary | 104 | 86 | +18 | 7.9 s | 11.1 s | −3.2 s |
| daily | 352 | 333 | +19 | 9.2 s | 19.1 s | −9.9 s |
| barriers | 81 | 60 | +21 | 7.6 s | 12.7 s | −5.2 s |
| cross-lagged | 27 | 27 | 0 | 7.1 s | 10.2 s | −3.1 s |
| time perspective | 42 | 27 | +15 | 7.0 s | 7.9 s | −0.8 s |
| Total | 606 | 533 | +73 | 38.8 s | 61.0 s | −22.2 s |
Takeaways:
- pypdfium2 found more hits in four out of five queries (roughly 6–56% more than the Zotero index, per the table above), likely because Zotero's index does not cover all pages or all attachments, or misses hits due to poorer PDF processing.
- pypdfium2 was faster in every query, and markedly so for high-hit queries (`daily`: about 2× faster). The overhead of downloading and parsing PDFs is more than offset by avoiding sequential index API calls.
Use zotero-index only if you cannot access the PDF files directly. For all other cases the default pdf backend gives better coverage and lower latency.
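For readers who want relative figures, the ratios can be recomputed directly from the table rows. The sketch below hard-codes the table values and expresses extra hits as a percentage of the Zotero index count; choosing that baseline is an assumption, and another baseline would give different percentages.

```python
# (pypdfium2 hits, Zotero index hits, pypdfium2 seconds, Zotero index seconds)
rows = {
    "diary": (104, 86, 7.9, 11.1),
    "daily": (352, 333, 9.2, 19.1),
    "barriers": (81, 60, 7.6, 12.7),
    "cross-lagged": (27, 27, 7.1, 10.2),
    "time perspective": (42, 27, 7.0, 7.9),
}

extra_pct = {}  # extra pypdfium2 hits relative to the Zotero index
speedup = {}    # Zotero index time divided by pypdfium2 time

for query, (pdf_hits, index_hits, pdf_time, index_time) in rows.items():
    extra_pct[query] = (pdf_hits - index_hits) / index_hits * 100
    speedup[query] = index_time / pdf_time
    print(f"{query:<18} {extra_pct[query]:5.1f}% more hits, {speedup[query]:.1f}x faster")
```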
Testing
Run the test suite to verify functionality:
uv run --group test python -m pytest
This will test:
- Zotero URL generation
- CSV export functionality
- Sample data processing
License
This project is open source and, like Zotero itself, is published under a GPL license. Please refer to the license file for details.
Acknowledgments
ZotGrep depends on several upstream open-source projects. In particular:
- pyzotero for Zotero API access
- Flask for the local web interface
- pypdfium2 and PDFium for PDF text extraction
- pySBD for sentence boundary detection
- PyYAML for YAML serialization in Markdown exports
See NOTICE for license attributions and upstream license links.
Contributing
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.