Toolkit for the Målfrid project

Project description

Maalfrid toolkit

maalfrid_toolkit is a Python package for crawling web documents (HTML, PDF, DOC) and extracting natural language data from them. It is primarily used in the Målfrid project, a collaboration between the National Library of Norway and the Language Council of Norway, which aims to measure the usage of the two official Norwegian written forms, Bokmål and Nynorsk, on Norwegian public sector websites. While the toolkit has a particular emphasis on the Nordic countries, it supports text extraction and language detection for more than 60 languages.

It builds upon:

Install

Install with pip

pip install maalfrid_toolkit

Install with pdm

pdm install

OS-level dependencies (tested with Ubuntu 24.04)

For fasttext

sudo apt-get install build-essential python3-dev

For .doc text extraction

sudo apt-get install antiword
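
Before the first run it can be useful to confirm that the OS-level tools above are actually on the PATH. The following is a minimal sketch, not part of the toolkit: the helper name `check_os_dependencies` is hypothetical, and checking for `gcc` is only an assumed proxy for `build-essential` being installed.

```python
import shutil


def check_os_dependencies(tools=("antiword", "gcc")):
    """Return a mapping of tool name -> True if the tool is found on PATH.

    The default tool list is an assumption: antiword is needed for .doc
    extraction, and gcc stands in for the build-essential package.
    """
    return {tool: shutil.which(tool) is not None for tool in tools}


if __name__ == "__main__":
    for tool, found in check_os_dependencies().items():
        print(f"{tool}: {'found' if found else 'MISSING'}")
```

A missing entry only tells you that the executable was not found; it does not verify that `python3-dev` headers are present.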

Test run crawl

python3 -m maalfrid_toolkit.crawl src/maalfrid_toolkit/crawljobs/example.com.yaml

Test run pipeline

On HTML

python -m maalfrid_toolkit.pipeline --url https://www.nb.no/utstilling/opplyst-glimt-fra-en-kulturhistorie/ --verbose

On PDF

python -m maalfrid_toolkit.pipeline --url https://www.nb.no/sbfil/dok/nst_taledat_dk.pdf --verbose

On DOC

python -m maalfrid_toolkit.pipeline --url https://www.nb.no/content/uploads/2018/11/Søknadsskjema-Bokhylla-2.doc

On WARC file (e.g. from self-crawled material)

python -m maalfrid_toolkit.pipeline --warc_file example_com-00000.warc.gz --verbose
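
To run the pipeline over several inputs in one go, the commands above can be wrapped in a small script. This is a sketch under stated assumptions: it only uses the `--url`, `--warc_file` and `--verbose` flags shown above, the helper names `build_pipeline_command` and `run_pipeline` are hypothetical, and deciding between the two input flags by file suffix is a simplification made here, not toolkit behaviour.

```python
import subprocess
import sys


def build_pipeline_command(source, verbose=True):
    """Build the argument list for one pipeline run.

    Assumption: sources ending in .warc or .warc.gz are local WARC files
    and go to --warc_file; everything else is treated as a URL.
    """
    flag = "--warc_file" if source.endswith((".warc", ".warc.gz")) else "--url"
    cmd = [sys.executable, "-m", "maalfrid_toolkit.pipeline", flag, source]
    if verbose:
        cmd.append("--verbose")
    return cmd


def run_pipeline(sources):
    """Run the pipeline once per source and return {source: exit code}."""
    return {
        source: subprocess.run(build_pipeline_command(source)).returncode
        for source in sources
    }
```

Calling `run_pipeline(["https://www.nb.no/sbfil/dok/nst_taledat_dk.pdf", "example_com-00000.warc.gz"])` would start one subprocess per input with the current Python interpreter; a non-zero exit code in the returned mapping indicates a failed run.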

Database (Postgres)

If you want to store and process the data further in a database, set up a Postgres database and enter your credentials in a .env file in the package root directory (see env-example). Be sure to populate the database with the schema and indices found in db/ before running the commands in maalfrid_toolkit.db.
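
A quick way to catch an incomplete .env file before running any database command is to parse it and list the credentials that are still empty. The sketch below uses only the standard library. The variable names in `REQUIRED` are hypothetical placeholders; the names the toolkit actually expects are the ones listed in env-example.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values, required):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


# Hypothetical variable names; replace with those from env-example.
REQUIRED = ("DB_HOST", "DB_PORT", "DB_NAME", "DB_USER", "DB_PASSWORD")

sample = """
# database credentials (placeholder values)
DB_HOST=localhost
DB_PORT=5432
DB_NAME=maalfrid
DB_USER=maalfrid
"""

print(missing_keys(parse_env(sample), REQUIRED))  # ['DB_PASSWORD']
```

In practice the text would come from reading the .env file in the package root rather than from an inline sample.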

A note on using Browsertrix

To use Browsertrix for crawling JavaScript-heavy pages and extracting text from the resulting HTML, you must currently clone a custom fork of the Browsertrix crawler from:

https://github.com/Sprakbanken/browsertrix-crawler/tree/add-dom-resource

Then build with Docker:

docker build -t maalfrid-browsertrix .

License

GPL

Download files

Source Distribution

maalfrid_toolkit-1.0.1.tar.gz (1.1 MB)

Uploaded Source

Built Distribution

maalfrid_toolkit-1.0.1-py3-none-any.whl (705.3 kB)

Uploaded Python 3

File details

Details for the file maalfrid_toolkit-1.0.1.tar.gz.

File metadata

  • Download URL: maalfrid_toolkit-1.0.1.tar.gz
  • Upload date:
  • Size: 1.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: pdm/2.24.1 CPython/3.12.3 Linux/6.8.0-58-generic

File hashes

Hashes for maalfrid_toolkit-1.0.1.tar.gz
Algorithm Hash digest
SHA256 20eba63d9daa10764577c601d209dcc29db56a22e8e5df7bca902a79d67ac4a8
MD5 b621dc270c08fda53cdab9d455c6e475
BLAKE2b-256 0eaa08c2696ffca03412a9b95b212f0c75a0a168e867eed0509991bf84e9bdbc

File details

Details for the file maalfrid_toolkit-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: maalfrid_toolkit-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 705.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: pdm/2.24.1 CPython/3.12.3 Linux/6.8.0-58-generic

File hashes

Hashes for maalfrid_toolkit-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 5c5389b32d1c5024530b886d55361b327457e885b049b83c28b69efbe4878d02
MD5 0aa6f5d6ef9998a04ec1a371c6c4b22c
BLAKE2b-256 3e14e5994de050d8424c703d12d8f122a8b9d100db93f6939c4b5ff7f65ccabd
