SeSG is a tool developed to help Systematic Literature Review researchers, specifically at the step of building a search string.

sesg

SeSG (Search String Generator) python package repository.

Installation

You can install SeSG with pip, poetry, or any other package manager:

pip install sesg
poetry add sesg

Usage

For a more extensive example, please refer to this repository.

Generating a search string

from dataclasses import dataclass
from random import sample

from sesg.search_string import (
    SimilarWordsFinder,
    create_enrichment_text,
    generate_search_string,
    set_pub_year_boundaries,
)
from sesg.topic_extraction import create_docs, extract_topics_with_bertopic
from transformers import BertForMaskedLM, BertTokenizer


@dataclass
class Study:
    title: str
    abstract: str
    keywords: str


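# Gold standard (GS): left empty here, fill it with your own studies.
GS: list[Study] = []
# Quasi-gold standard (QGS): a random third of the GS, used to build the string.
QGS: list[Study] = sample(GS, len(GS) // 3)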


def main():
    docs = create_docs(
        [
            {
                "title": s.title,
                "abstract": s.abstract,
                "keywords": s.keywords,
            }
            for s in QGS
        ]
    )

    enrichment_text = create_enrichment_text(
        [
            {
                "title": s.title,
                "abstract": s.abstract,
            }
            for s in QGS
        ]
    )

    similar_words_finder = SimilarWordsFinder(
        enrichment_text=enrichment_text,
        bert_model=BertForMaskedLM.from_pretrained("bert-base-uncased"),
        bert_tokenizer=BertTokenizer.from_pretrained("bert-base-uncased"),
    )

    topics = extract_topics_with_bertopic(
        docs,
        kmeans_n_clusters=2,
        umap_n_neighbors=5,
    )

    search_string = generate_search_string(
        topics,
        n_words_per_topic=5,
        n_similar_words_per_word=1,
        similar_words_finder=similar_words_finder,
    )

    search_string = f"TITLE-ABS-KEY({search_string})"
    search_string = set_pub_year_boundaries(search_string, min_year=2010, max_year=2020)

    print(search_string)
    # Example output; the actual terms depend on the studies in the QGS:
    # TITLE-ABS-KEY((("antipatterns") AND ("detection" OR "management") AND ("bdtex") AND ("approach" OR "algorithm") AND ("smurf")) OR (("code" OR "pattern") AND ("detection" OR "management") AND ("design" OR "software") AND ("software" OR "computer") AND ("learning" OR "translation"))) AND PUBYEAR > 2010 AND PUBYEAR < 2020  # noqa: E501


if __name__ == "__main__":
    main()
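To make the shape of the generated string easier to read, here is a minimal sketch of how a string like the example output above could be assembled. It is not SeSG's implementation: the helper names (`format_word_group`, `format_topic`, `format_search_string`) are hypothetical, and the nested-list input is an assumption made for illustration. Each word is OR-ed with its similar words, the word groups of a topic are AND-ed, and the topics are OR-ed.

```python
def format_word_group(words: list[str]) -> str:
    """Quote a word and its similar words, joined with OR."""
    quoted = " OR ".join(f'"{word}"' for word in words)
    return f"({quoted})"


def format_topic(groups: list[list[str]]) -> str:
    """Join the word groups of a single topic with AND."""
    return "(" + " AND ".join(format_word_group(g) for g in groups) + ")"


def format_search_string(topics: list[list[list[str]]]) -> str:
    """Join all topics with OR (hypothetical helper, not part of sesg)."""
    return " OR ".join(format_topic(t) for t in topics)


# Illustrative input: two topics, each with two word groups.
topics = [
    [["antipatterns"], ["detection", "management"]],
    [["code", "pattern"], ["design", "software"]],
]

print(format_search_string(topics))
# (("antipatterns") AND ("detection" OR "management")) OR (("code" OR "pattern") AND ("design" OR "software"))
```

Wrapping the result in `TITLE-ABS-KEY(...)`, as the example above does, restricts the Scopus search to titles, abstracts, and keywords.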

Assessing the quality of a search string

import trio
from sesg.evaluation import EvaluationFactory, Study
from sesg.scopus import InvalidStringError, Page, ScopusClient


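# Scopus API keys; left empty here, fill in your own.
API_KEYS: list[str] = []

# Studies to evaluate against; left empty here, fill in your own.
GS: list[Study] = []
QGS: list[Study] = []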


async def main():
    string = 'TITLE-ABS-KEY("machine learning" and "code smell") AND PUBYEAR > 2010 AND PUBYEAR < 2020'  # noqa: E501
    evaluation_factory = EvaluationFactory(gs=GS, qgs=QGS)

    client = ScopusClient(API_KEYS)

    entries: list[Page.Entry] = []
    try:
        async for page in client.search(string):
            entries.extend(page.entries)

    except InvalidStringError:
        print("Invalid string")

    evaluation = evaluation_factory.evaluate([e.title for e in entries])

    print(evaluation.start_set_recall)
    # 0.7


if __name__ == "__main__":
    trio.run(main)
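As a rough intuition for the metric printed above, the sketch below assumes that start set recall is the share of GS titles that appear among the titles returned by the search. This is an assumption, not SeSG's actual code: `normalize_title` and `start_set_recall` are hypothetical helpers, and the comparison is an exact match after normalization, whereas the library's own matching may be more tolerant.

```python
def normalize_title(title: str) -> str:
    """Lowercase a title and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def start_set_recall(gs_titles: list[str], found_titles: list[str]) -> float:
    """Fraction of GS titles present in the search results (assumed definition)."""
    if not gs_titles:
        return 0.0
    found = {normalize_title(t) for t in found_titles}
    hits = sum(1 for t in gs_titles if normalize_title(t) in found)
    return hits / len(gs_titles)


# Illustrative data only.
gs_titles = [
    "Detecting Code Smells",
    "Machine Learning for  Antipatterns",
    "A Survey of Refactoring",
    "Design Pattern Mining",
]
found_titles = [
    "detecting code smells",
    "Machine learning for antipatterns",
    "Design Pattern Mining",
    "Unrelated Paper",
]

print(start_set_recall(gs_titles, found_titles))
# 0.75
```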

Credits

This project is a continuation of Leo Fuchs' work. Most of my work on this project consisted of refactoring the codebase, adding tests, improving the documentation, and optimizing performance, along with adding some new features.

Highlights

Below you can find the major improvements over the original project:

  • Added BERTopic as a topic extraction strategy.
  • Improved snowballing performance by 100x~120x (thanks to rapidfuzz and multiprocessing).
  • Improved Scopus search performance by 30x~40x (thanks to httpx and Eduardo Mendes' help).
  • Improved search string generation performance by ~1.5x (thanks to a caching system).
  • Improved code quality by adopting linting and formatting tools, and added type hints to catch errors before runtime.
  • Added tests to prevent bugs when refactoring or adding new features.
  • Added docs to help users and contributors.

Contributing

You can contribute in many ways, such as creating issues and submitting pull requests. If you wish to contribute code, please read the contributor guide.

License

This project is licensed under the terms of the GPL-3.0-only license.
