Citation-finder

Citation-finder finds in-text citations in a text corpus using regular expressions. The expressions find citations that follow the author name-publication year format of the APA, Harvard and Chicago B styles.

How to use

Citation-finder takes a corpus created with the dhlab package from The National Library of Norway.

Install and import dhlab, and create a corpus.

pip install -U dhlab

import dhlab as dh

corpus = dh.Corpus(doctype='digibok')

Import citation-finder and call the function on the corpus.

import citation_finder as cf

cf.citation_finder(corpus, yearspan=(1900,1965), limit=500)

The optional arguments are yearspan and limit. The function searches for concordances of four-digit numbers that represent publication years. yearspan sets the first and last year of that range and defaults to 1000 through the current year. limit sets the concordance limit and defaults to 4000.
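As an illustration of how a yearspan could translate into four-digit search terms, here is a minimal sketch. The helper name year_terms is hypothetical and not part of citation-finder, and treating the end year as inclusive is an assumption.

```python
from datetime import date


def year_terms(yearspan=None):
    """Return the four-digit year strings covered by yearspan.

    Hypothetical helper: the end year is treated as inclusive, which is
    an assumption and not confirmed by the citation-finder documentation.
    """
    # Fall back to the documented default: 1000 through the current year.
    start, end = yearspan if yearspan is not None else (1000, date.today().year)
    if not (1000 <= start <= end <= 9999):
        raise ValueError("yearspan must be an ordered pair of four-digit years")
    return [str(year) for year in range(start, end + 1)]


terms = year_terms((1900, 1965))
print(len(terms), terms[0], terms[-1])  # 66 1900 1965
```

A narrow yearspan keeps the number of search terms small, which may help when the concordance limit is reached quickly.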

The function returns a Pandas DataFrame with the individual citation matches and their associated URN from the dhlab corpus.

What will match

Citation-finder matches two forms of citation: those where both the author name and the publication year are inside parentheses, e.g. (Smith, 1991), and those where the author name is outside the parentheses and only the publication year is inside, e.g. Smith (1991).

To distinguish citation-like strings from other text, the regular expressions require at least one word beginning with an uppercase letter (the author name), a four-digit number (the publication year) and parentheses (or semicolons, which can also delimit citations when several are listed in a row).
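The minimal requirements above can be sketched with two simplified patterns. These are illustrative only and are not the expressions citation-finder actually uses. The names NAME, YEAR, PARENTHETICAL and NARRATIVE are made up for this example, and semicolon-separated citations are left out.

```python
import re

# One capitalised word as the author name (Norwegian letters included).
NAME = r"[A-ZÆØÅ][a-zæøå]+"
# A four-digit number as the publication year.
YEAR = r"\d{4}"

# Author and year both inside parentheses, e.g. (Smith, 1991).
PARENTHETICAL = re.compile(rf"\({NAME}, {YEAR}\)")
# Author outside, year inside parentheses, e.g. Smith (1991).
NARRATIVE = re.compile(rf"{NAME} \({YEAR}\)")

text = "This was shown early (Smith, 1991) and later confirmed by Lee (2003)."
print(PARENTHETICAL.findall(text))  # ['(Smith, 1991)']
print(NARRATIVE.findall(text))      # ['Lee (2003)']
```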

Additionally, the patterns allow for several optional elements:

  • multiple authors can be listed
    • (Lee, Singh and Smith, 1991)
    • Lee, Singh and Smith (1991)
  • author names can include initials
    • (P. W. Smith, 1991)
    • P. W. Smith (1991)
  • author names can be followed by "et al." or its Norwegian equivalent "m.fl."
    • (Smith et al., 1991)
    • Smith et al. (1991)
  • publication year can be followed by a page reference
    • (Smith, 1991, p. 123-125)
    • Smith (1991, p. 123-125)
  • publication year can be followed by a single letter to differentiate multiple works by the same author in the same year
    • (Smith, 1991a)
    • Smith (1991a)
  • the author name inside parentheses can be preceded by other text
    • (see for instance Smith, 1991)
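The optional elements listed above can be layered onto a basic author-year pattern. The following is a rough sketch, not the package's own expressions. Every name in it is hypothetical, and the handling of "and"/"og", page references and preceding text is an assumption chosen to fit the listed examples.

```python
import re

INITIALS = r"(?:[A-ZÆØÅ]\. )*"            # optional initials: "P. W. "
SURNAME = r"[A-ZÆØÅ][a-zæøå]+"            # one capitalised word
AUTHOR = INITIALS + SURNAME
# Several authors, optionally closed by "et al." or "m.fl."
AUTHORS = (
    rf"{AUTHOR}(?:, {AUTHOR})*(?: (?:and|og|&) {AUTHOR})?"
    r"(?: (?:et al\.|m\.fl\.))?"
)
YEAR = r"\d{4}[a-z]?"                     # year plus optional letter: 1991a
PAGES = r"(?:, p+\. \d+(?:-\d+)?)?"       # optional page reference

# Authors and year inside parentheses, optionally preceded by other text.
PARENTHETICAL = re.compile(rf"\((?:[^()]*? )?{AUTHORS},? {YEAR}{PAGES}\)")
# Authors outside, year (and pages) inside parentheses.
NARRATIVE = re.compile(rf"{AUTHORS} \({YEAR}{PAGES}\)")

examples = [
    "(Lee, Singh and Smith, 1991)", "Lee, Singh and Smith (1991)",
    "(P. W. Smith, 1991)", "P. W. Smith (1991)",
    "(Smith et al., 1991)", "Smith et al. (1991)",
    "(Smith, 1991, p. 123-125)", "Smith (1991, p. 123-125)",
    "(Smith, 1991a)", "Smith (1991a)",
    "(see for instance Smith, 1991)",
]
for example in examples:
    pattern = PARENTHETICAL if example.startswith("(") else NARRATIVE
    print(example, "->", bool(pattern.fullmatch(example)))
```

Building the pattern from named fragments keeps each optional element readable and makes it easier to adjust one of them without touching the others.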

Because the regular expressions only search for patterns in raw text, citation-finder returns every matching string, whether or not it is a true citation, and misses citations that do not fit the patterns.
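A small example makes this limitation concrete. The pattern below is a simplified stand-in for illustration, not one of citation-finder's expressions, and the sample text is invented.

```python
import re

# Simplified stand-in pattern: (Capitalised word, four-digit number).
CITATION = re.compile(r"\([A-Z][a-z]+, \d{4}\)")

text = (
    "The method is well known (Smith, 1991). "
    "The letter was sent from the capital (Oslo, 1950). "
    "Numbered references such as [12] are not picked up."
)
print(CITATION.findall(text))  # ['(Smith, 1991)', '(Oslo, 1950)']
```

Here "(Oslo, 1950)" is a place and a date rather than a citation but still matches, while the numbered reference "[12]" is a citation in another style and is missed. Results therefore usually need manual review.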
