Text and image analysis of NB's digital collection

Project description

DHLAB

dhlab is a Python library for accessing reduced representations of text and pictures at the National Library of Norway (NLN), known in Norwegian as Nasjonalbiblioteket (NB). It is developed and maintained by the Digital Humanities lab group.

The Python package includes wrapper functions for the DHLAB API (Application Programming Interface). These functions can be used to query the texts in NB Digital, the NLN's digital collection of books and newspapers.

The API supports qualitative and quantitative analyses of the digital texts, for example by generating word frequency lists, concordances, collocations and n-grams, and by extracting names and narrative graphs.
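To make these terms concrete, here is a plain-Python sketch of a word frequency list, bigram counts and a small keyword-in-context concordance on a toy sentence. It does not call the DHLAB API. The function names are made up for illustration, and the whitespace tokenizer is a simplifying assumption.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concordance(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append((" ".join(left), token, " ".join(right)))
    return lines


# Toy text, not taken from the NLN collection.
tokens = tokenize("the library lends the book and the reader returns the book")

freqs = Counter(tokens)                    # word frequency list
bigram_counts = Counter(ngrams(tokens, 2))  # counts of 2-grams

print(freqs.most_common(2))
print(bigram_counts[("the", "book")])
print(concordance(tokens, "book"))
```

The library's own functions return results computed on the server side from the digital collection; the sketch only shows what kind of data each term refers to.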

Analyses can be performed on a single document or on a larger corpus. It is also possible to build your own corpora based on bibliographic metadata.

The Jupyter Notebooks in the digital_tekstanalyse repo show examples of how to use the library, and can be run directly in your browser without prior programming experience.

Installation

Install dhlab in your terminal with pip:

pip install dhlab
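One way to confirm that the installation succeeded is to ask Python for the installed version of the distribution. The following is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if dhlab is installed, otherwise None.
print(installed_version("dhlab"))
```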

Example use

You could start by building your own corpus, e.g. of books published between 1814 and 1905:

from dhlab.text import Corpus

book_corpus = Corpus(doctype="digibok", from_year=1814, to_year=1905)

Contact

If you have any questions, or run into any problems with the code, please log them in our issue tracker in the DHLAB repo.

Changelog

v2.16.0 (2023-01-16)

Feat

  • heatmap: add heatmap wrapper to viz package

v2.15.1 (2023-01-16)

Fix

  • EmptyCorpus (#129)

v2.15.0 (2023-01-16)

Feat

  • corpus: (#128)

Fix

  • bump-version: disable triggering new workflow [skip-ci] (#127)

v2.14.0 (2023-01-13)

Feat

  • visualize, ngram, wordbank: (#126)

v2.13.3 (2023-01-09)

Fix

  • handle API errors

v2.13.2 (2023-01-06)

Fix

  • dhlab_api: construct df from response.json (#124)

v2.13.1 (2022-12-08)

Fix

  • Pos0 (#123)

v2.13.0 (2022-12-07)

Feat

  • api: added start page and to page for NER and POS (#122)

v2.12.0 (2022-12-07)

Feat

  • ngram: make ngram subclass of dhlabobj (#121)

v2.11.2 (2022-11-24)

Fix

  • ngram: lang param is not supported for avis
  • requirements: add scipy to requirements

v2.11.1 (2022-11-14)

Feat

  • totals: Add totals() to top level
  • DhlabObj: add DhlabObj
  • Corpus: allow empty Corpus and construct df

Fix

  • conc_coll: fix repr functions

Refactor

  • corpus: change add()

v2.10.0 (2022-10-26)

Feat

  • added image search in bokhylla books to api

v2.9.5 (2022-10-26)

Fix

  • nbtokenizer now handles numbers combined with words, e.g. '1800-tallet'

v2.9.4 (2022-10-04)

Fix

  • relative and absolute

v2.9.3 (2022-10-04)

Fix

  • ngram smooth param

v2.9.2 (2022-10-03)

Fix

  • lang added to ngram

v2.9.1 (2022-09-12)

Fix

  • revert ngram_conv()

v2.9.0 (2022-09-05)

Feat

  • added reference for words

v2.8.2 (2022-09-01)

Fix

  • corpus: extend_from_identifiers (#88)

    needs to be able to work with a dataframe of URNs

v2.8.1 (2022-08-30)

Fix

  • delete whitespace

v2.8.0 (2022-08-30)

Feat

  • added spacy pos parse (pos, lemma, dependency)

Refactor

  • corpus: ignore index in corpus.add (#84)

v2.7.0 (2022-08-24)

  • automatic bump with features from 2.6.x versions

v2.6.3 (2022-08-23)

Feat

  • add Empty Corpus to dhlab.init

Fix

  • dewey search in corpus builder

v2.6.2 (2022-08-17)

Feat

  • add corpus default limit
  • add docstring to concordance

Fix

  • revert gitignore
  • revert Collocation

v2.6.1 (2022-08-15)

Fix

  • error in pandas corpus (#75)

v2.6.0 (2022-08-08)

Feat

  • code for word evaluations

v2.5.0 (2022-08-05)

Feat

  • geolocation

v2.4.0 (2022-07-12)

Feat

  • ner with spaCy

v2.3.6 (2022-07-12)

Fix

  • nb_ngram to point to new endpoint

v2.3.0 (2022-06-02)

Feat

  • added access to Norsk Ordbank, wordbank

v2.2.0 (2022-05-13)

Feat

  • ngram, geodata

v2.1.0 (2022-05-13)

Feat

  • geodata

v2.0.25 (2022-04-27)

Fix

  • setup.cfg: make package dhlab importable

v2.0.24 (2022-04-19)

Refactor

  • expose dhlab v1 modules

v2.0.22 (2022-03-22)

Fix

  • import all legacy modules in __init__.py

Refactor

  • move dhlab_v1 code into its own subpackage
  • docs/package_summary.rst: add reference table for legacy code

v2.0.21 (2022-03-21)

Refactor

  • constants: add global variables for URLs in constants.py
  • Reformat code with pep8 tools
  • turn relative imports into absolute imports
  • simplify and reduce expressions
  • rename classes with CamelCase

Docs

  • README: add "Example use"
  • add docstrings in subpackages
  • add docs/CHANGELOG.md
  • docs: add *.rst documentation files
  • add autosummary of whole dhlab package
  • logo: update logo image
  • add jupyter integration and toggle feature
  • add copybutton to code blocks
  • add docstrings and make functions private

v2.0.20geo (2022-03-02)

Feat

  • dhlab.api.dhlab_api: add function get_places
  • text.geo_data: add class GeoData

Fix

  • text.dispersion: pass **kwargs to plot()

v2.0.18dispersion (2022-02-21)

Feat

  • text.dispersion: add class Dispersion
  • api.dhlab_api: add get_dispersion

Fix

  • requirements: remove wordcloud

v2.0.17params (2022-02-08)

Refactor

  • text.corpus: add parameter fulltext
  • api.dhlab_api.document_corpus: add parameter fulltext
  • text.conc_coll.Concordance: add parameters window and limit
  • text.conc_coll.Collocations: add parameter samplesize

Fix

  • text.corpus.urnlist: fix urnlist assignment

v2.0.12.chunk (2022-01-29)

Refactor

  • text.chunking: add attribute self.chunks

v2.0.10chunks (2022-01-29)

Feat

  • text.conc_coll: add class Counts
  • text.corpus: add class Corpus_from_identifiers
  • text.chunking: add class Chunks
  • text.chunking: add functions get_chunks, get_chunks_para

Fix

  • imports
  • dhlab_api.get_chunks: return dict not dataframe
  • apply autopep8

v2.0.5 (2022-01-19)

Refactor

  • nbtokenizer: edit tokens for mail and web addresses

Feat

  • add Tokens class

Fix

  • imports

v2.0.2a (2022-01-18)

Fix

  • typecheck of corpus objects

v2.0.1.alpha6 (2022-01-18)

  • changed wordcloud import
  • fixed corpus transfer in conc_coll

v2.0.0.beta (2022-01-18)

Feat

  • add get_file_from_github, download_from_github in utils

Refactor

  • New package structure

Docs

  • include installation instructions in README

v1.0.0 (2022-01-06)

  • Set up Github Actions to run automatic linting and testing
  • Set up documentation pages
  • Include documentation of the code in docstrings

Fix

  • address linting issues from flake8
  • reformat code

Feat

  • add documentation summaries for all modules
  • add documentation for the repo
  • add docstrings from README.md to nbtext.py
  • add pylint config file

Refactor

  • reduce code duplication
  • update workflow file reference
  • change str.format to f-strings
  • optimize imports
  • rename workflow that packages and publishes dhlab to pypi
  • use default publish workflow
  • reduce compatible python versions
  • update publishing workflow
  • type out scope for linting explicitly
  • move pylint.yml

v0.75 (2019-09-09)

  • Initial release to PyPI

Download files

Download the file for your platform.

Source Distribution

dhlab-2.16.0.tar.gz (200.1 kB)

Uploaded Source

Built Distribution

dhlab-2.16.0-py3-none-any.whl (180.3 kB)

Uploaded Python 3

File details

Details for the file dhlab-2.16.0.tar.gz.

File metadata

  • Download URL: dhlab-2.16.0.tar.gz
  • Upload date:
  • Size: 200.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for dhlab-2.16.0.tar.gz
Algorithm Hash digest
SHA256 8aea42abb7f99141320e35c10a970eedad1dfad216db2206d4ee0db62188a9b2
MD5 6fdb47671bb89900798b51b878a020c4
BLAKE2b-256 8dd74800672b8514f819003eed8c580a5498b95b9fe95d15e34633f1437b9dad

File details

Details for the file dhlab-2.16.0-py3-none-any.whl.

File metadata

  • Download URL: dhlab-2.16.0-py3-none-any.whl
  • Upload date:
  • Size: 180.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for dhlab-2.16.0-py3-none-any.whl
Algorithm Hash digest
SHA256 345d945d18dd9e01c996c712397702f6696a2f5c38d7207f4e03a6599c7a7099
MD5 d27a44a95636e21d41b77e22f9434c12
BLAKE2b-256 7b520730c9092e55b1acd748722b4e59b678faf643af7f5cf0e1430284284b86
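The SHA256 digests listed above can be checked against a downloaded file with Python's standard hashlib module. The following is a minimal sketch; the helper names `sha256_of` and `verify` are made up for illustration, and it assumes the wheel has been downloaded into the current directory.

```python
import hashlib
from pathlib import Path

# SHA256 digest of dhlab-2.16.0-py3-none-any.whl as listed above.
EXPECTED_SHA256 = "345d945d18dd9e01c996c712397702f6696a2f5c38d7207f4e03a6599c7a7099"


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's digest equals the expected hex string."""
    return sha256_of(path) == expected_hex.lower()


wheel = Path("dhlab-2.16.0-py3-none-any.whl")  # assumed download location
if wheel.exists():
    print("digest matches:", verify(wheel, EXPECTED_SHA256))
else:
    print("file not found:", wheel)
```

Reading in chunks keeps memory use constant regardless of file size. The same functions work for the source distribution if the expected digest is swapped for the one listed for the .tar.gz file.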
