Toolkit for the Målfrid project

Maalfrid toolkit

maalfrid_toolkit is a Python package for crawling the web and extracting natural language data from the documents found there (HTML, PDF, DOC). It is primarily used in the Målfrid project, a collaboration between the National Library of Norway and the Language Council of Norway that measures the usage of the two official written forms of Norwegian, Bokmål and Nynorsk, on Norwegian public sector websites. Although the toolkit places particular emphasis on the Nordic countries, it supports text extraction and language detection for more than 60 languages. It is also used to produce the yearly Målfrid dataset (freely available documents from Norwegian state institutions).

It builds upon:

Install

Install with pip

pip install maalfrid_toolkit

With GlotLID / fastText (optional; see below for caveats):

pip install maalfrid_toolkit[glotlid]

Install with pdm

pdm install

Test run pipeline

On HTML

python -m maalfrid_toolkit.pipeline --url https://www.nb.no/utstilling/opplyst-glimt-fra-en-kulturhistorie/ --to_jsonl

On PDF

python -m maalfrid_toolkit.pipeline --url https://www.nb.no/sbfil/dok/nst_taledat_dk.pdf --to_jsonl

On DOC

python -m maalfrid_toolkit.pipeline --url https://www.nb.no/content/uploads/2018/11/Søknadsskjema-Bokhylla-2.doc --to_jsonl

On (W)ARC file (e.g. from self-crawled material)

python -m maalfrid_toolkit.pipeline --warc_file example_com-00000.warc.gz --calculate_simhash --to_jsonl > warc.jsonl
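
The --calculate_simhash flag suggests that a SimHash fingerprint is attached to each extracted document so that near-duplicates can be found later. The sketch below illustrates the general SimHash technique only; it is not the toolkit's implementation. The whitespace tokenization, the SHA-256 token hash, the 64-bit width, and the function names are all assumptions made for illustration.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Illustrative SimHash over whitespace-separated tokens.

    Assumption: `bits` is a multiple of 8 and at most 256, because the
    per-token hash is taken from the leading bytes of a SHA-256 digest.
    """
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1

    # A bit is set in the fingerprint when the positive votes win.
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Documents whose fingerprints differ in only a few bits are likely near-duplicates. The threshold to use depends on the material and has to be tuned.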

On sitemap

python -m maalfrid_toolkit.pipeline --url https://example.com/sitemap.xml --crawl_sitemap --to_jsonl > example.jsonl
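
With --to_jsonl redirected to a file, the output can be inspected with a few lines of Python. The record fields are not documented in this README, so the sketch below assumes only that each non-empty line is one JSON value (the JSON Lines convention) and reports which keys occur. The function name summarize_jsonl is hypothetical and not part of the toolkit.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path):
    """Return (record_count, Counter of top-level keys) for a JSONL file."""
    records = 0
    keys = Counter()
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            obj = json.loads(line)
            records += 1
            if isinstance(obj, dict):
                keys.update(obj.keys())
    return records, keys


# Example usage (assumes example.jsonl was produced by the command above):
# count, keys = summarize_jsonl("example.jsonl")
# print(count, dict(keys))
```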

Useful extraction options

  • mode: Choose between 'precision' (default) and 'recall'. 'recall' extracts more text, but likely at the cost of more noise.
  • use_lenient_html_parser: Use a lenient HTML parser that repairs broken HTML (more expensive).
  • extract_metadata: Extract metadata from the document and try to infer its publication date.

Database (Postgres)

If you want to store and process the data further in a database, set up a Postgres database and enter your credentials in an .env file in the package root directory (see env-example). Be sure to populate the database with the schema and indices found in db/ before running the commands in maalfrid_toolkit.db.
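
Before running the database commands, it can help to check that the .env file parses as expected. The sketch below is a minimal standard-library reader for simple KEY=VALUE lines. It is not how the toolkit loads its configuration, and the function name read_env_file is hypothetical. The actual variable names are defined in env-example and are not reproduced here.

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines into a dict.

    Blank lines, comment lines starting with '#', and lines without '='
    are skipped; surrounding single or double quotes are removed.
    """
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Example usage: list which keys are set without printing their values.
# print(sorted(read_env_file(".env")))
```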

OS-level dependencies (tested with Ubuntu 24.04) for optional functionality

For fasttext (optional)

sudo apt-get install build-essential python3-dev

For .doc text extraction (optional)

sudo apt-get install antiword

A note on using Browsertrix

To use Browsertrix to crawl JavaScript-heavy pages and extract text from their HTML, you currently need to clone a custom version of Browsertrix from:

https://github.com/Sprakbanken/browsertrix-crawler/tree/add-dom-resource

Then build with Docker:

docker build -t maalfrid-browsertrix .

License

GPL
