Toolkit for the Målfrid project

Maalfrid toolkit

maalfrid_toolkit is a Python package for crawling web documents (HTML, PDF, DOC) and extracting natural language data from them. It is primarily used in the Målfrid project, a collaboration between the National Library of Norway and the Language Council of Norway, which aims to measure the usage of the two official Norwegian language forms, Bokmål and Nynorsk, on Norwegian public sector websites. While the toolkit has a particular emphasis on the Nordic countries, it supports text extraction and language detection for more than 60 languages.
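To make the measurement goal concrete, here is a minimal sketch of how shares of Bokmål and Nynorsk could be computed from per-paragraph language labels. It is an illustration only: the function name `language_shares`, the input format, and weighting by token count are assumptions, not the project's documented method. The codes `nob` and `nno` are the ISO 639-3 codes for Bokmål and Nynorsk.

```python
from collections import Counter


def language_shares(paragraphs):
    """Return the Bokmål/Nynorsk split among Norwegian tokens.

    `paragraphs` is an iterable of (language_code, token_count) pairs.
    This input format is hypothetical, chosen only for illustration.
    """
    totals = Counter()
    for lang, tokens in paragraphs:
        totals[lang] += tokens

    # Only the two Norwegian forms count towards the denominator.
    norwegian = totals["nob"] + totals["nno"]
    if norwegian == 0:
        return {"nob": 0.0, "nno": 0.0}
    return {
        "nob": totals["nob"] / norwegian,
        "nno": totals["nno"] / norwegian,
    }


sample = [("nob", 60), ("nno", 20), ("eng", 100)]
print(language_shares(sample))
```

In this sketch, text in other languages (here `eng`) is ignored when computing the split, so the two shares always sum to 1 whenever any Norwegian text is present.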

The toolkit builds upon:

wget and (custom) browsertrix for crawling
JusText for HTML boilerplate removal
Notram PDF text extraction from NB AI-lab
DOC extraction using docx2txt and antiword
Gielladetect/pytextcat and GlotLID V3 for language detection
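As a rough illustration of how the extraction components listed above divide the work, the sketch below maps a document URL to the tool that would handle it, based on the file suffix. The function `pick_extractor` and the mapping table are hypothetical and not taken from the toolkit's code; only the tool names come from the list above. A real crawler would more likely rely on the HTTP Content-Type header than on the suffix.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Hypothetical routing table: file suffix -> extraction tool from the list above.
EXTRACTOR_BY_SUFFIX = {
    ".html": "jusText",
    ".htm": "jusText",
    ".pdf": "Notram PDF extraction",
    ".docx": "docx2txt",
    ".doc": "antiword",
}


def pick_extractor(url: str) -> Optional[str]:
    """Return the name of the extraction tool for a URL, or None if unknown."""
    # Look only at the path so query strings and fragments are ignored.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return EXTRACTOR_BY_SUFFIX.get(suffix)


print(pick_extractor("https://www.nb.no/sbfil/dok/nst_taledat_dk.pdf"))
```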

Install

Install with pip

pip install git+https://github.com/NationalLibraryOfNorway/maalfrid_toolkit
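After installing, a quick check such as the following can confirm that the package is importable. This is a minimal sketch using only the standard library; the helper name `describe_install` is hypothetical, and the import name `maalfrid_toolkit` is taken from the module paths used in the commands further down.

```python
import importlib.util
from importlib import metadata


def describe_install(package: str = "maalfrid_toolkit") -> str:
    """Report whether a top-level package is importable and, if known, its version."""
    if importlib.util.find_spec(package) is None:
        return f"{package} is not importable"
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata under this name.
        version = "unknown"
    return f"{package} {version} is installed"


print(describe_install())
```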

Install with pdm (from a clone of the repository)

pdm install

OS-level dependencies (tested with Ubuntu 24.04)

For fasttext

sudo apt-get install build-essential python3-dev

For .doc text extraction

sudo apt-get install antiword
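To verify that the OS-level tools are available before running the toolkit, a small check like the one below may help. It is a sketch, not part of the toolkit: the helper `missing_tools` is hypothetical, and checking for `gcc` is only an assumed proxy for build-essential being installed.

```python
import shutil


def missing_tools(tools=("antiword", "gcc")):
    """Return the names of command-line tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing OS-level dependencies:", ", ".join(missing))
else:
    print("All checked OS-level dependencies were found.")
```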

Test run crawl

pdm run python3 -m maalfrid_toolkit.crawl src/maalfrid_toolkit/crawljobs/example.com.yaml

Test run pipeline

On HTML

pdm run python -m maalfrid_toolkit.pipeline --url https://www.nb.no/utstilling/opplyst-glimt-fra-en-kulturhistorie/ --verbose

On PDF

pdm run python -m maalfrid_toolkit.pipeline --url https://www.nb.no/sbfil/dok/nst_taledat_dk.pdf --verbose

On DOC

pdm run python -m maalfrid_toolkit.pipeline --url https://www.nb.no/content/uploads/2018/11/Søknadsskjema-Bokhylla-2.doc

On WARC file (e.g. from self-crawled material)

pdm run python -m maalfrid_toolkit.pipeline --warc_file example_com-00000.warc.gz --verbose
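When several documents need to be processed, the commands above can be assembled from Python. The sketch below only builds and prints the command lines; the helper `pipeline_command` is hypothetical, and it assumes the package is installed in the current interpreter's environment (so `sys.executable` stands in for `pdm run python`). The flags `--url`, `--warc_file` and `--verbose` are the ones shown above.

```python
import shlex
import sys


def pipeline_command(target: str, verbose: bool = True) -> list:
    """Build the argument list for one run of maalfrid_toolkit.pipeline."""
    # WARC files are passed with --warc_file, everything else as a URL.
    is_warc = target.endswith((".warc", ".warc.gz"))
    flag = "--warc_file" if is_warc else "--url"
    cmd = [sys.executable, "-m", "maalfrid_toolkit.pipeline", flag, target]
    if verbose:
        cmd.append("--verbose")
    return cmd


targets = [
    "https://www.nb.no/sbfil/dok/nst_taledat_dk.pdf",
    "example_com-00000.warc.gz",
]
for target in targets:
    print(shlex.join(pipeline_command(target)))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to execute the run.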

Database (Postgres)

If you want to store and process the data further in a database, set up a Postgres database and enter your credentials in a .env file in the package root directory (see env-example). Be sure to populate the database with the schema and indices found in db/ before running the commands in maalfrid_toolkit.db.
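The following sketch shows one way credentials from a .env file could be turned into a Postgres connection URL. The variable names (`DB_HOST`, `DB_PORT`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`) and both helper functions are assumptions for illustration; the actual names are defined in env-example, which is not reproduced here.

```python
from urllib.parse import quote


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def postgres_dsn(env: dict) -> str:
    """Build a postgresql:// URL from hypothetical DB_* variables."""
    user = quote(env["DB_USER"], safe="")
    password = quote(env["DB_PASSWORD"], safe="")
    return f"postgresql://{user}:{password}@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"


sample = """
# Hypothetical example values
DB_HOST=localhost
DB_PORT=5432
DB_NAME=maalfrid
DB_USER=crawler
DB_PASSWORD='s3cret'
"""
print(postgres_dsn(parse_env(sample)))
```

The user name and password are percent-encoded so that special characters do not break the URL.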

A note on using Browsertrix

To crawl JavaScript-heavy pages with Browsertrix and extract text from the resulting HTML, you currently need to clone a custom version of Browsertrix from:

https://github.com/Sprakbanken/browsertrix-crawler/tree/add-dom-resource

Then build with Docker:

docker build -t maalfrid-browsertrix .

License

GPL

Release history

1.5.5: Mar 5, 2026
1.5.4: Mar 5, 2026
1.5.3: Mar 5, 2026
1.5.2: Jan 23, 2026
1.5.1: Jan 22, 2026
1.5.0: Jan 22, 2026
1.4.5: Oct 6, 2025
1.4.4: Oct 1, 2025
1.4.3: Sep 30, 2025
1.4.2: Sep 30, 2025
1.4.1: Sep 30, 2025
1.4.0: Sep 25, 2025
1.3.0: May 22, 2025
1.2.0: May 20, 2025
1.1.1: May 14, 2025
1.1.0: May 12, 2025
1.0.2: May 9, 2025
1.0.1: May 9, 2025
1.0.0: May 9, 2025 (this version)