Skip to main content

Topic modeling for scientific articles

Project description

BERTeley

PyPI License Build Status Documentation Status Test Coverage

Topic modeling for scientific articles

Installation

pip install berteley

Quick Start

# code snippet showing very basic usage

Topic Modeling Colab Notebook: TBD

Features

  • A text pre-processing suite tailored for scientific articles. Including a curated stopword list.
  • 3 readily available language models that are trained on scientific articles.
  • Real time metric calculation for topic model comparison.

Copyright Notice

BERTeley Copyright (c) 2023, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy) and University of California, Berkeley. All rights reserved.

If you have questions about your rights to use or distribute this software, please contact Berkeley Lab's Intellectual Property Office at IPO@lbl.gov.

NOTICE. This Software was developed under funding from the U.S. Department of Energy and the U.S. Government consequently retains certain rights. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, distribute copies to the public, prepare derivative works, and perform publicly and display publicly, and to permit others to do so.

License Agreement

BERTeley Copyright (c) 2023, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy) and University of California, Berkeley. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

(1) Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

(2) Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

(3) Neither the name of the University of California, Lawrence Berkeley National Laboratory, U.S. Dept. of Energy, University of California, Berkeley nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

You are under no obligation whatsoever to provide any bug fixes, patches, or upgrades to the features, functionality or performance of the source code ("Enhancements") to anyone; however, if you choose to make your Enhancements available either publicly, or directly to Lawrence Berkeley National Laboratory, without imposing a separate written license agreement for such Enhancements, then you hereby grant the following license: a non-exclusive, royalty-free perpetual license to install, use, modify, prepare derivative works, incorporate into other computer software, distribute, and sublicense such enhancements or derivative works thereof, in binary and source code form.

Credits

Please reference this work:

      @article{Berteley,
      title = {Benchmarking topic models on scientific articles using BERTeley},
      journal = {Natural Language Processing Journal},
      volume = {6},
      pages = {100044},
      year = {2024},
      issn = {2949-7191},
      doi = {https://doi.org/10.1016/j.nlp.2023.100044},
      url = {https://www.sciencedirect.com/science/article/pii/S2949719123000419},
      author = {Eric Chagnon and Ronald Pandolfi and Jeffrey Donatelli and Daniela Ushizima},
      keywords = {NLP, Topic modeling, Scientific articles, Transformers},
      abstract = {The introduction of BERTopic marked a crucial advancement in topic modeling and presented a topic model that outperformed both traditional and modern topic models in terms of topic modeling metrics on a variety of corpora. However, unique issues arise when topic modeling is performed on scientific articles. This paper introduces BERTeley, an innovative tool built upon BERTopic, designed to alleviate these shortcomings and improve the usability of BERTopic when conducting topic modeling on a corpus consisting of scientific articles. This is accomplished through BERTeley’s three main features: scientific article preprocessing, topic modeling using pre-trained scientific language models, and topic model metric calculation. Furthermore, an experiment was conducted comparing topic models using four different language models in three corpora consisting of scientific articles.}
      }
      

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

berteley-0.1.2.tar.gz (36.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

berteley-0.1.2-py2.py3-none-any.whl (13.7 kB view details)

Uploaded Python 2Python 3

File details

Details for the file berteley-0.1.2.tar.gz.

File metadata

  • Download URL: berteley-0.1.2.tar.gz
  • Upload date:
  • Size: 36.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for berteley-0.1.2.tar.gz
Algorithm Hash digest
SHA256 e5d37be5b1cbd86dbf4189785eb3bfbc002fb684eca5ae7f4cf2009bc754550a
MD5 9c4e0f88d73932dac1c681958bd88eb3
BLAKE2b-256 eee98812d759febf1746824b98efdf278ec57027e233016ac63ce3bbfd370790

See more details on using hashes here.

File details

Details for the file berteley-0.1.2-py2.py3-none-any.whl.

File metadata

  • Download URL: berteley-0.1.2-py2.py3-none-any.whl
  • Upload date:
  • Size: 13.7 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for berteley-0.1.2-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 60842443f712f484e07fb5cc54d6edd535e40d7eb50c2aeb385568211b60269e
MD5 ee8163369a073eb5627cb2ad8bd52a6d
BLAKE2b-256 48d8094daf0837773a7068616ee4dd5e599806f309bd86684627c345d63b564e

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page