A Python toolkit to test queries across multiple bibliographic databases for Systematic Literature Reviews

These details have not been verified by PyPI

Project links

Homepage

Project description

SLR Query Tester

SLR Query Tester is a Python toolkit to assist researchers in conducting Systematic Literature Reviews (SLRs) by testing and managing queries across multiple bibliographic databases. It streamlines the process of translating common queries into database-specific syntax, fetching and caching results, comparing them against a golden solution, and generating comprehensive reports.

If you use this tool, you are required to cite it as:

Rakshit Mittal, "SLRQueryTester: A Toolkit to Test Queries for SLRs" (2025) arXiv

Features
Installation
Configuration
Usage
Git Integration
Report Generation
Caching
Disclaimer
Notes
License

Further resources:

Features

Multiple Database Support: Query various bibliographic databases such as Scopus, IEEE, OpenAlex, Springer, and more.
Query Translation: Automatically translates common query strings into database-specific syntax.
Caching Mechanism: Caches API responses to minimize redundant calls and manage data efficiently.
Enhanced Query Decomposition: Decomposes complex queries into executable subqueries and also merges the result as per the original query. (incomplete feature)
Golden Solution Comparison: Compares fetched results against a golden solution to identify matches using fuzzy logic while using the multiprocessing framework to reduce time taken.
Comprehensive Reporting: Generates detailed CSV reports, including a main report and secondary reports for each query.
Git Integration: Automatically commits and pushes cache updates to your Git repository, maintaining version control of cached data, and allowing multiple researchers to pool data from traditionally rate-limited APIs, enabled by the design of the cache management methods.

Installation

1. Using `pip` (Recommended)

Ensure you have Python installed (version 3.7 or higher is recommended).

pip install slrquerytester

2. Installing from Git

If you want the latest development version:

pip install git+https://gitlab.rakshitmittal.net/rmittal/slrquerytester.git

3. Setting Up a Virtual Environment (Optional but Recommended)

Creating a virtual environment helps manage dependencies and avoid conflicts.

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Then proceed with the installation steps above.

Configuration

SLR Query Tester uses a configuration file (config.json) to manage settings. The idea is that each SLR project has a unique configuration file. Ensure that you create and properly configure this file before running the tool. Example config.json

{
  "api_keys": {
    "CORE": "YOUR_CORE_API_KEY",
    "Dimensions": "YOUR_DIMENSIONS_API_KEY",
    "IEEE": "YOUR_IEEE_API_KEY",
    "LENS": "YOUR_LENS_API_KEY",
    "OpenAlex": "YOUR_EMAIL_FOR_OPENALEX_API",
    "Scopus-Free": "YOUR_FREE_SCOPUS_API_KEY",
    "Scopus-Premium": "YOUR_PREMIUM_SCOPUS_API_KEY",
    "Springer-Free": "YOUR_FREE_SPRINGER_API_KEY",
    "Springer-Premium": "YOUR_PREMIUM_SPRINGER_API_KEY",
    "WOS": "YOUR_WOS_API_KEY"
  },
  "repository": "/absolute/path/to/repository",
  "threshold_days": 30
}

{
  "api_keys": {
    "CORE": "YOUR_CORE_API_KEY",
    "Dimensions": "YOUR_DIMENSIONS_API_KEY",
  },
  "cache_directory": "/absolute/path/to/directory",
  "golden_solution_directory": "/absolute/path/to/directory",
  "report_directory": "/absolute/path/to/directory",
  "queries_file": "/absolute/path/to/file",
  "threshold_days": 30
}

Only specify the API keys that you have! Notice the difference, either specifying the 'repository' or all the other directory paths.

Configuration Fields

api_keys : dict[str,str]: (not required for --report, but required for --query)
- CORE : str: API key for CORE API.
- Dimensions : str: API key for Dimensions Analytics API. (created according to spec but not tested due to lack of API key)
- IEEE : str : API key for IEEE Metadata Search API. (created according to spec but not tested due to lack of API key)
- OpenAlex : str : Email address for OpenAlex API Polite pool.
- Scopus-Free : str : Free API key for Elsevier Scopus Search API.
- Scopus-Premium : str : Premium API key for Elsevier Scopus Search API.
- Springer-Free : str : Free API key for Springer Meta/v2 API.
- Springer-Premium : str : Premium API key for Springer Meta/v2 API.
- WOS : str : API key for Clarivate Web of Science API Expanded.
cache_directory : str (required) : Absolute path to directory where cached API responses and related data are stored.
golden_solution_directory : str (required) : Absolute path to directory containing the golden solution file/s (BibTeX only with .bib or .bibtex extension) against which results will be compared.
queries_file : str (required) : Absolute path to the queries.json file containing the list of queries and their decomposition levels.
report_directory : str (required) : Absolute path to directory where generated reports will be saved.
maual_articles_directory : str (required for --manual) : Absolute path to directory where manually obtained results are saved in a specific format.
repository_path : str (not required, recommended) : Absolute path to the top-level project directory. You cannot use this with other file path specifications, or the tool will return an error. If you do specify a repository path, the default file-paths will be as follows:
- cache_directory = repository_path + 'slrquerytester.cache/'
- golden_solution_directory = repository_path + 'golden_solution/'
- report_directory = repository_path + 'reports/'
- queries_file = repository_path + 'queries.json'
- manual_aricles_directory = repository_path + 'manual_articles/'
threshold_days : int (default = 100) : Number of days after which cached results are considered stale and need to be refreshed.
language : str (default = "en") : Language tag that can be interpreted by langcodes. You may use this to filter results to a specific language.

Example `queries.json`

[
    {
        "query": "machine learning AND (deep learning OR neural networks) AND publication_year>=2020",
        "decomposition_level": 2
    },
    {
        "query": "data mining OR big data NOT (social media)",
        "decomposition_level": 1
    }
]

query : The query string to be executed, in the common query language.
decomposition_level : The level of decomposition for complex queries. Higher values enable deeper decomposition.

Note: Ensure that all paths specified in the config.json file exist or can be created by the tool. Adjust file permissions as necessary to allow read/write operations in the designated directories.

Usage

After installation and configuration, you can run SLR Query Tester using the command-line interface (CLI).

Basic Command

slrquerytester --config path/to/config.json [options]

Available Options

--config : Path (relative to cwd) to the configuration JSON file. Required unless you enable the --docs flag. Default: config.json
--query : If set, query databases and cache results.
--manual : If set, cache manually obtained articles.
--report : If set, generate reports.
--output_merge : If set with --report, cache the merged/union results from all databases for each query. The union articles are stored in a special slrquerytester-union directory within the cache. Can only be used when --report is also set.
--git : If set, automatically commit and push cache updates to the specified Git repository. requires repository_path in configuration.
--debug : If set, output debug-level logs for more detailed information.
--docs : If set, generate documentation using pdoc and open the browser.

Examples

Query Databases and Generate Reports (without Saving the Union)

slrquerytester --config config.json --query --report

Query Databases, Generate Reports (without Saving the Union), and Push to Git

slrquerytester --config config.json --query --report --git

Generate Reports and Cache Union Results

slrquerytester --config config.json --report --output_merge

Regenerate Reports Without Querying Databases or Saving the Union

slrquerytester --config config.json --report

Generate and View Documentation

slrquerytester --docs

Cache Manually Obtained Articles

slrquerytester --manual

Cache Manual Articles

It may be the case, that you have to manually extract articles using the Web UI of the bibliographic databases (like ACM-DL), or from another tool, in bib format. In that case, you can specify the manual_articles_directory or simply populate the default in the repository according to a precise structure:

manual_articles_directory
├── <random_hash_or_query_number>
│   ├── <database_name>
│   │   ├── metadata.json
│   │   ├── results.bib
│   │   └── results2.bib (any random name and number of bib files)
│   ├── <database_name>
│   │   ├── metadata.json
│   │   └── results.bib
│   └── ...
├── <random_hash_or_query_number>
│   ├── <database_name>
│   │   ├── metadata.json
│   │   └── results.bib
│   └── ...
└── ...

It is very important to be consistent with the database_names in the sub-directories.

Each metadata.json should be a dictionary with a single key, value pair:

{
  "general_query_string": "query_in_general_query_syntax"
}

Git Integration

SLR Query Tester can automatically commit and push cache updates to a Git repository, ensuring version control of your cached data.

How It Works

Appending Commit Messages:

Whenever new data is added to the cache via API calls, a commit message is appended to git_commit_message.txt located in the repository_path.

Committing and Pushing:

If the --git flag is set, the tool stages all changes, commits them using the messages in git_commit_message.txt, and pushes the commit to the remote repository.
Upon a successful push, the git_commit_message.txt file is deleted to prevent duplicate commits.

Important Notes

Repository Path: Ensure that the repository_path in your config.json points to the root of your Git repository.
Git Authentication: Make sure your Git repository is properly authenticated to allow push operations. This may involve setting up SSH keys or configuring credential helpers.
Commit Message File Location: The git_commit_message.txt file is located in the repository_path and is not part of the cache, ensuring easy retrieval and management.

Report Generation

SLR Query Tester generates detailed reports summarizing the results of your queries and their comparison against the golden solution.

Main Report (`main.csv`)

Location: Saved in the report_directory.
Context: provides the data for each performed query, so also the decomposed sub-queries, and the queries specified in queries.json
Contents for each query:
- Serial Number: Unique identifier for each query.
- Query String: The original query string.
- For Each Database:
  - Translated Query: The database-specific version of the query.
  - Articles Retrieved: Number of articles fetched from the database.
  - Golden Matches: Number of articles matching the golden solution.
- Total Unique Articles: Combined unique articles from all databases.
- Total Golden Matches: Combined matches across all databases.

Secondary Reports (`1.csv`, `2.csv`, etc.)

Location: Saved in the report_directory, named after the serial number corresponding to each query in the main report.
Contents:
- First Column: Article titles from the golden solution.
- Subsequent Columns: Each database name.
- Cells: Indicates 'Yes' if the article is present in the database's results, 'No' otherwise.

Viewing the Reports

Main Report: Open main.csv in your preferred spreadsheet application to view an overview of all queries and their results across databases.
Secondary Reports: Open individual CSV files (e.g., 1.csv) to see detailed comparisons for each query, showing which golden solution articles were retrieved by each database.

Article Duplicate Detection

When merging articles from multiple databases (both for union generation and golden solution comparison), SLR Query Tester uses a sophisticated multi-step duplicate detection algorithm to identify and eliminate duplicate entries:

Detection Algorithm

Exact Key Matching: Articles with identical BibTeX keys are considered duplicates.
DOI-based Matching: Articles are compared using their Digital Object Identifiers (DOIs). DOIs are normalized (lowercased and trimmed) before comparison to handle formatting variations.
Fuzzy Text Matching: For articles without DOIs or with non-matching DOIs, the tool uses fuzzy string matching on normalized article metadata.

Fuzzy Matching Process

The fuzzy matching algorithm:

Normalization: Article fields are extracted in alphabetical order (author, title, year) and normalized by:
- Converting to lowercase
- Removing special characters and punctuation
- Expanding journal abbreviations using the JabRef abbreviation database
- Stripping extra whitespace
- Combining into format: "author: <author> | title: <title> | year: <year>"
Similarity Scoring: Uses the Levenshtein distance-based ratio from the fuzzywuzzy library to compute similarity scores (0-100).
Threshold: Articles with a similarity score above 75% are considered duplicates.

Note: For performance optimization, articles are grouped into blocks based on the first 10 characters of the normalized title before fuzzy comparison. This means that duplicate articles with significantly different title beginnings may not be detected as duplicates, trading some accuracy for faster processing.

Caching

Cache Directory Structure

The cache is organized in a hierarchical directory structure within the specified cache_directory:

cache_directory
├── <query_hash>
│   ├── <database_name>
│   │   ├── metadata.json
│   │   └── results.bib
│   ├── <database_name>
│   │   ├── metadata.json
│   │   └── results.bib
│   ├── slrquerytester-union/     # ← Union results (when using --output_merge)
│   │   ├── metadata.json
│   │   └── result0.bib
│   └── ...
├── <query_hash>
│   ├── <database_name>
│   │   ├── metadata.json
│   │   └── results.bib
│   └── ...
└── ...

<query_hash>/ : A unique identifier generated from the query string using an MD5 hash. This ensures that each query has a distinct directory, avoiding conflicts and facilitating quick lookups.
<database_name>/ : Subdirectories within each query hash directory, representing the specific databases (e.g., Scopus, PubMed) queried.
metadata.json : A JSON file containing metadata about the cached results for the specific query and database.
results.bib : BibTeX file containing the fetched articles from the database corresponding to the query.

Metadata (`metadata.json`)

Each metadata.json file stores essential information about the cached results to manage data integrity, freshness, and completeness. Below is an example structure and explanation of each field:

{
    "expected_num_articles": 100,
    "num_articles_retrieved": 80,
    "last_api_call_time": "2024-04-15T12:34:56Z",
    "translated_query_string": "TITLE-ABS-KEY(machine learning AND (deep learning OR neural networks) AND PUBYEAR > 2019)",
    "general_query_string": "machine learning AND (deep learning OR neural networks) AND publication_year>=2020"
}

expected_num_articles : The total number of articles expected from the query. This number is typically derived from the database's response metadata indicating the total available results. Helps determine if all expected articles have been retrieved or if further API calls are necessary to fetch additional results.
num_articles_retrieved : The number of articles currently retrieved and stored in the cache. Tracks progress in fetching articles, especially when handling paginated API responses or large result sets.
last_api_call_time : The timestamp of the most recent API call made for this query and database. ISO 8601 format with a 'Z' suffix indicating UTC time (e.g., "2024-04-15T12:34:56Z"). Determines the freshness of the cached data. If the data is older than the specified threshold_days, it is considered stale and may require refreshing.
translated_query_string : The database-specific version of the original query string. Stores how the general query has been translated to fit the syntax requirements of the specific database, facilitating accurate execution and future reference.
general_query_string : The original, user-defined query string as specified in queries.json. Maintains a reference to the initial query, allowing for easy identification and comparison across different databases and reports.

How the Cache Works

Query Execution:

When a user runs a query, the tool translates it into the specific syntax required by each targeted database.

Caching Results:

Results from each database are stored in the results.bib file within their respective <database_name>/ directories.
Metadata about each query and database combination is stored in metadata.json.

Cache Validation:

Before executing a query, the tool checks if results are already cached.
It verifies if the cached data is stale based on the last_api_call_time and threshold_days.
It also checks if the number of retrieved articles meets the expected_num_articles.

Updating the Cache:

If data is stale or incomplete, the tool fetches additional results and updates both results.bib and metadata.json.
Commit messages are appended to git_commit_message.txt to document these updates for version control.

Git Integration:

When the --git flag is used, the tool commits and pushes changes in the cache to the specified Git repository, ensuring that all team members have access to the latest cached data.

Benefits of the Cache Structure

Efficiency: Minimizes redundant API calls by reusing cached data, saving time and reducing the load on external services.
Scalability: Handles multiple queries and databases systematically, allowing for easy expansion as more queries or databases are added.
Data Integrity: Metadata ensures that cached data remains accurate and up-to-date, facilitating reliable comparisons and reporting.
Collaboration: Git integration allows multiple researchers to share and maintain a consistent cache, enhancing collaborative efforts in large-scale SLR projects.

Managing the Cache

Clearing the Cache for a Specific Query and Database: If you need to remove cached data for a particular query and database (e.g., to force a fresh fetch), navigate to the corresponding directory and delete the results.bib and metadata.json files.

rm /path/to/cache_directory/<query_hash>/<database_name>/results.bib
rm /path/to/cache_directory/<query_hash>/<database_name>/metadata.json

Clearing the Entire Cache: To remove all cached data, delete the entire cache_directory.

rm -rf /path/to/cache_directory/*

Use with caution, as this will require re-fetching all data.

Backing Up the Cache: Regularly back up the cache_directory to prevent data loss and ensure that cached results can be restored if needed.

Disclaimer

SLR Query Tester is provided "as-is" without any warranty. The author does not assume any responsibility for any misuse of this tool. Users are responsible for ensuring that their use of this tool complies with the terms and conditions of the APIs and the data they retrieve. Please respect all licensing agreements and usage policies associated with the bibliographic databases and their APIs.

Notes

API Rate Limits: Be mindful of the rate limits and usage policies of the databases you query to avoid exceeding allowed request quotas. slrquerytester has a built-in exponential backoff rate-limiter.
API Keys: Be careful to not share your API keys with anyone, or upload them with the configuration file to a shared Git repository. It is advisable to keep the configuration file outside the Git repo. The idea is to have a unique configuration file for each such SLR project that you may be working on (hats-off to you if you're doing that!).

License

MIT License

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

This version

0.2.1

Dec 22, 2025

0.2.0

Feb 10, 2025

0.1.0

Feb 10, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

slrquerytester-0.2.1.tar.gz (66.4 kB view details)

Uploaded Dec 22, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

slrquerytester-0.2.1-py3-none-any.whl (81.1 kB view details)

Uploaded Dec 22, 2025 Python 3

File details

Details for the file slrquerytester-0.2.1.tar.gz.

File metadata

Download URL: slrquerytester-0.2.1.tar.gz
Upload date: Dec 22, 2025
Size: 66.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for slrquerytester-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`c7c658671b99fa680e9f1b40c377a772e06cce09adc3157382d1f5d7bb12d63c`
MD5	`21f69995dbc3b1d627206a72ab118ccc`
BLAKE2b-256	`b9a30a437646ffb5a82e16889b60032424d0fcffc889985c90db081dd8302a8c`

See more details on using hashes here.

File details

Details for the file slrquerytester-0.2.1-py3-none-any.whl.

File metadata

Download URL: slrquerytester-0.2.1-py3-none-any.whl
Upload date: Dec 22, 2025
Size: 81.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for slrquerytester-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ad03b18914e5f87c1eac86f76a2886e26dc89e8719ad45a8655649a684513248`
MD5	`6aa668e3f2815fd18afe707198d5344d`
BLAKE2b-256	`ac8bfdb424d2d1a4ca79a785c9eaabc3cc6fa48c13428056321308f1cabecaa3`

See more details on using hashes here.

slrquerytester 0.2.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

SLR Query Tester

Table of Contents

Further resources:

Features

Installation

1. Using pip (Recommended)

2. Installing from Git

3. Setting Up a Virtual Environment (Optional but Recommended)

Configuration

Configuration Fields

Example queries.json

Usage

Basic Command

Available Options

Examples

Cache Manual Articles

Git Integration

How It Works

Important Notes

Report Generation

Main Report (main.csv)

Secondary Reports (1.csv, 2.csv, etc.)

Viewing the Reports

Article Duplicate Detection

Detection Algorithm

Fuzzy Matching Process

Caching

Cache Directory Structure

Metadata (metadata.json)

How the Cache Works

Benefits of the Cache Structure

Managing the Cache

Disclaimer

Notes

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

1. Using `pip` (Recommended)

Example `queries.json`

Main Report (`main.csv`)

Secondary Reports (`1.csv`, `2.csv`, etc.)

Metadata (`metadata.json`)