
Scraper and PDF text processor for domsdatabasen.dk

Domsdatabasen

Scraping and processing of cases from Domsdatabasen.

Each ruling can be accessed at https://domsdatabasen.dk/#sag/<number>, where <number> is between 1 and 3821 (as of 11 October 2023).
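
The URL pattern above can be sketched in Python as follows. This is a minimal illustration and not code taken from the repository: the names `BASE_URL`, `MAX_CASE_NUMBER`, `case_url` and `all_case_urls` are hypothetical, and the upper bound of 3821 is simply the figure quoted above, which has probably grown since.

```python
from typing import Iterator

# Assumption: base URL and upper bound are copied from the description above.
BASE_URL = "https://domsdatabasen.dk/#sag/"
MAX_CASE_NUMBER = 3821  # value stated as of 11-10-2023; likely outdated


def case_url(number: int) -> str:
    """Return the URL of a single case, given its number."""
    if not 1 <= number <= MAX_CASE_NUMBER:
        raise ValueError(
            f"Case number must be between 1 and {MAX_CASE_NUMBER}, got {number}."
        )
    return f"{BASE_URL}{number}"


def all_case_urls() -> Iterator[str]:
    """Lazily yield the URL of every case in the known range."""
    for number in range(1, MAX_CASE_NUMBER + 1):
        yield case_url(number)


if __name__ == "__main__":
    print(case_url(1))
```

Note that the case number sits in the URL fragment (after `#`), which suggests the page is rendered client-side, so a plain HTTP request to these URLs may not return the case content directly. How the repository actually fetches cases is defined in src/scripts/scrape.py.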

A processed version of the dataset is available on Hugging Face.

Scraping cases

See src/scripts/scrape.py.

Processing scraped data

See src/scripts/process.py.

Building the dataset

See src/scripts/finalize.py.



Developers:

Setup

Installation

  1. Run pip install -r requirements.txt to install the required packages.

A Word on Modules and Scripts

In the src directory there are two subdirectories, domsdatabasen and scripts. This is a brief explanation of the differences between the two.

Modules

All Python files in the domsdatabasen directory are modules internal to the project package. Examples could be general data-loading code, a model definition, or a training function. Think of modules as the building blocks of the project.

When a module imports functions or classes from other modules, we use relative import notation. Here's an example:

from .other_module import some_function

Scripts

Python files in the scripts folder are scripts: short pieces of code that are external to the project package and are meant to actually run the code. As such, only scripts are called from the terminal. An analogy is that the internal numpy code consists of modules, whereas the Python code you write, in which you import some numpy functions and actually run them, is a script.

When you import module functions or classes from within a script, you do it as you would from any other package:

from domsdatabasen import some_function

Note that this is also how we import functions/classes in tests, since each test Python file is also a Python script, rather than a module.

Features

Automatic Test Coverage Calculation

Run make test to test your code. This also updates the "coverage badge" in the README, which shows how much of your code base is currently covered by tests.

Continuous Integration

GitHub CI pipelines are included in the repo. They run all the tests in the tests directory and build the online documentation if GitHub Pages has been enabled for the repository (this can be done in the repository settings on GitHub).

Codespaces

Codespaces is a GitHub feature that allows you to develop on a project entirely in the cloud, without any local setup. This repo includes a configuration file for running codespaces on GitHub. When the repo is hosted on alexandrainst/domsdatabasen, simply press the <> Code button and add a codespace to get started. This opens a VSCode window directly in your browser.

Project structure

.
├── .devcontainer
│   └── devcontainer.json
├── .github
│   └── workflows
│       ├── ci.yaml
│       └── docs.yaml
├── .gitignore
├── .name_and_email
├── .pre-commit-config.yaml
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── Dockerfile
├── LICENSE
├── README.md
├── config
│   ├── __init__.py
│   ├── config.yaml
│   └── hydra
│       └── job_logging
│           └── custom.yaml
├── data
├── gfx
│   └── alexandra_logo.png
├── makefile
├── models
├── notebooks
├── poetry.toml
├── pyproject.toml
├── src
│   ├── scripts
│   │   ├── fix_dot_env_file.py
│   │   └── your_script.py
│   └── domsdatabasen
│       ├── __init__.py
│       └── your_module.py
└── tests
    ├── __init__.py
    └── test_dummy.py
