Scraper and PDF text processor for domsdatabasen.dk
Domsdatabasen
Scraping and processing of cases from Domsdatabasen.
Each individual judgment can be accessed at https://domsdatabasen.dk/#sag/<number>, where <number> is between 1 and 3821 (as of 11-10-2023).
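The URL scheme described above can be sketched in Python. This is only an illustration and not code from the repository: the names `BASE_URL`, `MAX_CASE_NUMBER`, `case_url` and `all_case_urls` are hypothetical, and the upper bound of 3821 is simply the value stated as of 11-10-2023, so it will likely need adjusting.

```python
from typing import Iterator

# Base of the case URL, taken from the README text.
BASE_URL = "https://domsdatabasen.dk/#sag/"

# Assumption: highest case number as of 11-10-2023; newer cases may exist.
MAX_CASE_NUMBER = 3821


def case_url(number: int, max_number: int = MAX_CASE_NUMBER) -> str:
    """Return the URL of a single case, validating the case number."""
    if not 1 <= number <= max_number:
        raise ValueError(f"case number must be between 1 and {max_number}")
    return f"{BASE_URL}{number}"


def all_case_urls(max_number: int = MAX_CASE_NUMBER) -> Iterator[str]:
    """Yield the URL of every case from 1 up to and including max_number."""
    for number in range(1, max_number + 1):
        yield case_url(number, max_number)


if __name__ == "__main__":
    print(case_url(1))
```

Passing a larger `max_number` is one way to cover cases published after the date above, without changing the constant.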
A processed version of the dataset is available on Huggingface.
Scraping cases
See src/scripts/scrape.py.
Processing scraped data
See src/scripts/process.py.
Building the dataset
See src/scripts/finalize.py.
Developers:
- Oliver Kinch (oliver.kinch@alexandra.dk)
- Dan Saattrup Nielsen (dan.nielsen@alexandra.dk)
Setup
Installation
- Run pip install -r requirements.txt to install the required packages.
A Word on Modules and Scripts
In the src directory there are two subdirectories, domsdatabasen and scripts. This is a
brief explanation of the differences between the two.
Modules
All Python files in the domsdatabasen directory are modules internal to the project
package. Examples here could be a general data loading script, a definition of a model,
or a training function. Think of modules as all the building blocks of a project.
When a module imports functions or classes from other modules, we use relative import notation. Here's an example:
from .other_module import some_function
Scripts
Python files in the scripts folder are scripts: short code snippets that are external
to the project package and that are meant to actually run the code. As such, only
scripts will be called from the terminal. An analogy here is that the internal numpy
code consists entirely of modules, whereas the Python code you write where you import
some numpy functions and actually run them is a script.
When importing module functions or classes in a script, import them as you would from any other package:
from domsdatabasen import some_function
Note that this is also how we import functions/classes in tests, since each test Python file is also a Python script, rather than a module.
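The difference between the two import styles can be tried out with a throwaway project. The sketch below is not part of the repository: it writes a hypothetical package called demo_package (standing in for domsdatabasen) and a hypothetical your_script.py into a temporary directory, then runs the script in a subprocess. Setting PYTHONPATH here is an assumption that stands in for having the package installed.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def build_demo_project(root: Path) -> Path:
    """Write a tiny package and a script under root; return the script path."""
    package = root / "demo_package"  # hypothetical stand-in for domsdatabasen
    package.mkdir()
    (package / "other_module.py").write_text(
        "def some_function():\n    return 'hello from a module'\n"
    )
    # A module inside the package uses a relative import.
    (package / "your_module.py").write_text(
        "from .other_module import some_function\n\n"
        "def shout():\n    return some_function().upper()\n"
    )
    # Re-export the functions so scripts can import them from the package.
    (package / "__init__.py").write_text(
        "from .other_module import some_function\n"
        "from .your_module import shout\n"
    )
    # A script outside the package imports by package name.
    script = root / "your_script.py"
    script.write_text(
        "from demo_package import shout, some_function\n"
        "print(some_function())\n"
        "print(shout())\n"
    )
    return script


def run_demo() -> str:
    """Build the demo project, run the script, and return what it printed."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        script = build_demo_project(root)
        # Make the package importable, akin to installing it.
        env = {**os.environ, "PYTHONPATH": str(root)}
        result = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True, text=True, check=True, env=env,
        )
        return result.stdout


if __name__ == "__main__":
    print(run_demo(), end="")
```

Running a file such as your_module.py directly would fail on the relative import, since it would no longer be loaded as part of a package. That is one reason the runnable entry points live in scripts.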
Features
Automatic Test Coverage Calculation
Run make test to test your code, which also updates the "coverage badge" in the
README, showing you how much of your code base is currently being tested.
Continuous Integration
GitHub CI pipelines are included in the repo, running all the tests in the tests
directory, as well as building online documentation if GitHub Pages has been enabled
for the repository (this can be enabled on GitHub in the repository settings).
Codespaces
Codespaces is a GitHub feature that allows you to develop on a project completely in
the cloud, without having to do any local setup at all. This repo includes a
configuration file for running Codespaces on GitHub. When hosted on
alexandrainst/domsdatabasen, simply press the <> Code button and add a codespace to
get started, which will open a VSCode window directly in your browser.
Project structure
.
├── .devcontainer
│   └── devcontainer.json
├── .github
│   └── workflows
│       ├── ci.yaml
│       └── docs.yaml
├── .gitignore
├── .name_and_email
├── .pre-commit-config.yaml
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── Dockerfile
├── LICENSE
├── README.md
├── config
│   ├── __init__.py
│   ├── config.yaml
│   └── hydra
│       └── job_logging
│           └── custom.yaml
├── data
├── gfx
│   └── alexandra_logo.png
├── makefile
├── models
├── notebooks
├── poetry.toml
├── pyproject.toml
├── src
│   ├── scripts
│   │   ├── fix_dot_env_file.py
│   │   └── your_script.py
│   └── domsdatabasen
│       ├── __init__.py
│       └── your_module.py
└── tests
    ├── __init__.py
    └── test_dummy.py
File details
Details for the file domsdatabasen-0.1.2.tar.gz.
File metadata
- Download URL: domsdatabasen-0.1.2.tar.gz
- Upload date:
- Size: 31.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.12.3 Darwin/23.3.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 388fda45262f19ad575bcf6b25806fd175c13813e3a1f0723e27977154060888
MD5 | 79135781a5c3daa5d8a895c62327fea2
BLAKE2b-256 | dafe5d0ca00d7df6ca1696d755b124290090683d8316ca0df7682a2cb046c531
File details
Details for the file domsdatabasen-0.1.2-py3-none-any.whl.
File metadata
- Download URL: domsdatabasen-0.1.2-py3-none-any.whl
- Upload date:
- Size: 32.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.12.3 Darwin/23.3.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2de694766972bc5c1fb766ff80fa8357df5eb7b156366c15b844b5e30aa678e5
MD5 | 3bbf9292d391b6460dcda53b2f6b0a8b
BLAKE2b-256 | 36b5d1b24290574c618b9750b1c587209e8c04376d9576da6198691a0b5420f9