A tool for pulling word occurrence ('n-gram') data from the Gallica periodical archive.

Project description

gallicaGetter

This tool wraps a few endpoints of the Gallica API to support multi-threaded data retrieval, with the option of receiving results through a generator. Much more documentation is on the way; I just wanted to get this out there! Pull requests are welcome.
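To make "multi-threaded retrieval with support for generators" concrete, here is a minimal sketch of the general pattern using only the standard library. It is not gallicaGetter's actual implementation: `fetch_page` and `iter_records` are hypothetical names, and `fetch_page` returns fake data instead of calling the Gallica API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_page(page_number):
    """Hypothetical stand-in for one API request; returns fake records."""
    return [f"record-{page_number}-{i}" for i in range(3)]


def iter_records(page_numbers, num_workers=5):
    """Yield records one at a time as worker threads finish their pages."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(fetch_page, n) for n in page_numbers]
        # as_completed yields futures in completion order, not page order.
        for future in as_completed(futures):
            yield from future.result()


records = iter_records(range(4), num_workers=2)
first = next(records)   # pull a single record
rest = list(records)    # drain whatever remains
```

Because pages are yielded in completion order, the caller should not rely on records arriving in page order.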

Current endpoints are:

  • 'sru' -- word occurrences
  • 'content' -- occurrence context and page numbers
  • 'papers' -- paper metadata
  • 'issues' -- years published for a given paper

The tool's functionality has evolved around my application's needs, but it should be easy to extend.

Examples

Retrieve all issues that mention "Brazza" from 1890 to 1900.

import gallicaGetter

sruWrapper = gallicaGetter.connect('sru')

records = sruWrapper.get(
    terms="Brazza",
    startDate="1890",
    endDate="1900",
    grouping="all"
)

for record in records:
    print(record.getRow())

Retrieve all occurrences of "Brazza" within 10 words of "Congo" in the paper "Le Temps" from 1890 to 1900.

import gallicaGetter

sruWrapper = gallicaGetter.connect('sru')

records = sruWrapper.get(
    terms="Brazza",
    startDate="1890",
    endDate="1900",
    linkTerm="Congo",
    linkDistance=10,
    grouping="all",
    codes="cb34431794k"  # Gallica code for "Le Temps"
)

for record in records:
    print(record.getRow())

Retrieve the number of occurrences of "Victor Hugo", by year, across the Gallica archive from 1800 to 1900, running 30 requests in parallel.

import gallicaGetter

sruWrapper = gallicaGetter.connect('sru', numWorkers=30)

records = sruWrapper.get(
    terms="Victor Hugo",
    startDate="1800",
    endDate="1900",
    grouping="year"
)

for record in records:
    print(record.getRow())
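One way to picture a by-year query run in parallel is as one request per year, dispatched to a thread pool. The sketch below illustrates that idea only; it is not the library's code. `yearly_ranges` and `count_for_year` are hypothetical helpers, the counts are fake, and treating the end year as inclusive is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def yearly_ranges(start_year, end_year):
    """Split an inclusive span of years into one (start, end) date pair per year."""
    return [
        (f"{year}-01-01", f"{year}-12-31")
        for year in range(int(start_year), int(end_year) + 1)
    ]


def count_for_year(date_range):
    """Hypothetical stand-in for one request; returns a fake occurrence count."""
    start, _end = date_range
    return int(start[:4]) % 7


ranges = yearly_ranges("1800", "1900")

# pool.map returns results in the same order as the input ranges.
with ThreadPoolExecutor(max_workers=30) as pool:
    counts = list(pool.map(count_for_year, ranges))

by_year = dict(zip((start[:4] for start, _ in ranges), counts))
```

With 101 one-year ranges and 30 workers, the requests run in roughly four waves.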

Retrieve all issues mentioning "Paris" in the papers "Le Temps" and "Le Figaro" from 1890 to 1900, using a generator.

import gallicaGetter

sruWrapper = gallicaGetter.connect('sru')

recordGenerator = sruWrapper.get(
    terms="Paris",
    startDate="1890",
    endDate="1900",
    grouping="all",
    codes=["cb34431794k", "cb34355551z"],  # "Le Temps", "Le Figaro"
    generate=True
)

for i in range(10):
    print(next(recordGenerator).getRow())
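Calling `next()` a fixed number of times raises `StopIteration` if fewer records match than expected. A safer way to take the first few items from any generator is `itertools.islice`, sketched here with a hypothetical stand-in generator (`fake_records`) rather than a real Gallica query.

```python
from itertools import islice


def fake_records():
    """Hypothetical stand-in for the generator returned with generate=True."""
    for i in range(3):
        yield f"row-{i}"


# Ask for up to 10 items; islice stops quietly when the generator runs out.
first_ten = list(islice(fake_records(), 10))

for row in first_ten:
    print(row)
```

The same call should work on `recordGenerator` from the example above, assuming it is an ordinary Python generator.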

Download files

Source Distribution

gallicagetter-0.0.1.tar.gz (18.7 kB)

Uploaded Source

Built Distribution

gallicagetter-0.0.1-py3-none-any.whl (13.9 kB)

Uploaded Python 3

File details

Details for the file gallicagetter-0.0.1.tar.gz.

File metadata

  • Download URL: gallicagetter-0.0.1.tar.gz
  • Size: 18.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.6

File hashes

Hashes for gallicagetter-0.0.1.tar.gz
Algorithm     Hash digest
SHA256        b9006e5ff311e8eb8ab350af229be135dae79119d362824cd7f6a2e043143c4b
MD5           af58803172a6260dca148f949a6278c2
BLAKE2b-256   7501ee43408584efb5598dccc86e4f58d00212d856afb9c31887e122acf06323

File details

Details for the file gallicagetter-0.0.1-py3-none-any.whl.

File hashes

Hashes for gallicagetter-0.0.1-py3-none-any.whl
Algorithm     Hash digest
SHA256        df635e357c8368f85b4d901ac610187d2b4242a59e893a463463ead862624798
MD5           92a1329f04eb90520b077d83ccb72f24
BLAKE2b-256   2358b779f4054f9b29d42b8c98c9f555ce4a747c4ff0a7348d28a54a10e34259
