Citation-finder
Citation-finder finds in-text citations in a text corpus using regular expressions. The expressions match citations in the author-date format (author name followed by publication year) used by the APA, Harvard and Chicago B styles.
How to use
Citation-finder takes a corpus created with the dhlab package from The National Library of Norway.
Install and import dhlab, and create a corpus.
pip install -U dhlab
import dhlab as dh
corpus = dh.Corpus(doctype='digibok')
Import citation-finder and call the function on the corpus.
import citation_finder as cf
cf.citation_finder(corpus, yearspan=(1900,1965), limit=500)
Both yearspan and limit are optional. The function searches for concordances on four-digit numbers that represent publication years. yearspan sets the first and last year of that range and defaults to 1000 through the current year. limit sets the concordance limit and defaults to 4000.
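As a rough illustration of how a yearspan could translate into four-digit search terms, here is a minimal sketch. The helpers resolve_yearspan and year_terms are hypothetical and are not part of citation-finder. The sketch also assumes the range includes both end years, which the description above does not state.

```python
from datetime import date


def resolve_yearspan(yearspan=None):
    """Return (first_year, last_year), using the documented default when omitted.

    Hypothetical helper for illustration only, not citation-finder's own code.
    """
    if yearspan is None:
        # Documented default: from 1000 to the current year.
        return (1000, date.today().year)
    first, last = yearspan
    if first > last:
        raise ValueError("yearspan must be (first_year, last_year)")
    return (first, last)


def year_terms(yearspan=None):
    """List every year in the span as a string (assumes an inclusive range)."""
    first, last = resolve_yearspan(yearspan)
    return [str(year) for year in range(first, last + 1)]


print(year_terms((1900, 1903)))
```

A narrower yearspan yields fewer search terms, which is one reason to set it when the corpus covers a known period.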
The function returns a pandas DataFrame containing each citation match and its associated URN from the dhlab corpus.
What will match
Citation-finder matches citations where both the author name and the publication year are inside parentheses, e.g. (Smith, 1991), and citations where the author name is outside the parentheses and only the publication year is inside, e.g. Smith (1991).
To distinguish citation-like strings from other text, the regular expressions require at least one word beginning with an upper-case letter (the author name), a four-digit number (the publication year) and parentheses (or semicolons, which can also delimit citations when several are listed in a row).
Additionally, the patterns allow for several optional elements:
- multiple authors can be listed
- (Lee, Singh and Smith, 1991)
- Lee, Singh and Smith (1991)
- author names can include initials
- (P. W. Smith, 1991)
- P. W. Smith (1991)
- author names can be followed by "et al." or its Norwegian equivalent "m.fl."
- (Smith et al., 1991)
- Smith et al. (1991)
- publication year can be followed by a page reference
- (Smith, 1991, p. 123-125)
- Smith (1991, p. 123-125)
- publication year can be followed by a single letter to differentiate multiple works by the same author in the same year
- (Smith, 1991a)
- Smith (1991a)
- the author name inside parentheses can be preceded by other text
- (see for instance Smith, 1991)
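The optional elements listed above can be sketched with Python's re module. This is a simplified illustration, not the expressions citation-finder actually uses. All names here (NAME, INITIALS, AUTHORS, YEAR, PAGES, NARRATIVE, PARENTHETICAL) are hypothetical, and details such as the optional comma before the year and the accepted page abbreviations are assumptions.

```python
import re

# Building blocks (all hypothetical, chosen to cover the examples listed above).
NAME = r"[A-ZÆØÅ][\w'-]+"                 # one capitalised word
INITIALS = r"(?:[A-ZÆØÅ]\.\s?)*"          # e.g. "P. W. "
AUTHOR = rf"{INITIALS}{NAME}"
AUTHORS = (
    rf"{AUTHOR}(?:,\s{AUTHOR})*"          # "Lee, Singh"
    rf"(?:\s(?:and|og|&)\s{AUTHOR})?"     # "and Smith"
    r"(?:\s(?:et al\.|m\.fl\.))?"         # "et al." or "m.fl."
)
YEAR = r"\d{4}[a-z]?"                     # "1991" or "1991a"
PAGES = r"(?:,\s(?:pp?|s)\.\s?\d+(?:[-–]\d+)?)?"  # ", p. 123-125"

# Author outside the parentheses: Smith (1991)
NARRATIVE = re.compile(rf"{AUTHORS}\s\({YEAR}{PAGES}\)")
# Author inside the parentheses, optionally preceded by other text: (see Smith, 1991)
PARENTHETICAL = re.compile(rf"\((?:[^()]*?\s)?{AUTHORS},?\s{YEAR}{PAGES}\)")

examples_narrative = [
    "Lee, Singh and Smith (1991)",
    "P. W. Smith (1991)",
    "Smith et al. (1991)",
    "Smith (1991, p. 123-125)",
    "Smith (1991a)",
]
examples_parenthetical = [
    "(Lee, Singh and Smith, 1991)",
    "(P. W. Smith, 1991)",
    "(Smith et al., 1991)",
    "(Smith, 1991, p. 123-125)",
    "(Smith, 1991a)",
    "(see for instance Smith, 1991)",
]

for example in examples_narrative:
    print(example, bool(NARRATIVE.fullmatch(example)))
for example in examples_parenthetical:
    print(example, bool(PARENTHETICAL.fullmatch(example)))
```

Composing the pattern from small named pieces keeps each optional element visible and makes it easier to see which part of a citation a given fragment is meant to cover.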
Because the regular expressions only search for patterns in raw text, citation-finder returns every matching string, whether or not it is a true citation, and does not return citations that fall outside the patterns.
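To make this limitation concrete, here is a small sketch using a deliberately minimal stand-in pattern (one capitalised word followed by a four-digit year in parentheses). MINIMAL is a hypothetical pattern for illustration and is much simpler than anything citation-finder uses.

```python
import re

# Deliberately minimal stand-in: a capitalised word, then a four-digit year in parentheses.
MINIMAL = re.compile(r"\b[A-Z]\w+ \(\d{4}\)")

text = (
    "Smith (1991) argues otherwise. "
    "The meeting in Bergen (1905) was brief. "
    "As noted in Smith [1991], the result holds."
)

matches = MINIMAL.findall(text)
print(matches)  # ['Smith (1991)', 'Bergen (1905)']
```

"Bergen (1905)" is returned although it is a place and a date rather than a citation, while "Smith [1991]" is missed because it uses square brackets. Results from any pattern-based search therefore benefit from manual review.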