Text processing tool for detecting Danish CPR-numbers.
Project description
os2ds-rules: Next-generation, high-performance rule system for OS2datascanner
os2ds-rules is the next-generation rule system for OS2datascanner by Magenta ApS. It aims to deliver high performance in both processing speed and detection accuracy.
The project consists of several components:
- A shared-library backend written in modern C++20.
- A Python C/C++ extension that exposes functionality from the backend to Python.
- A Python library that provides a safe, easy-to-use interface to the aforementioned extension.
WARNING: This is currently in a very early stage of development and frequently undergoes substantial changes, so it is not ready for prime time, yet.
Getting Started
You can install the C++ library, the Python extension, or both.
C++ backend library: libos2dsrules
For the C++ library you need a few different things:
- A compiler that supports C++20. We recommend using either g++ (GCC) or clang (LLVM).
- cmake>=3.20: Primary (meta) build system.
- ninja: Cross-platform backend for cmake.
- gtest (Google Test): For building and running the test suite.
For development, you additionally want:
- cppcheck or clang-analyzer: For static code analysis.
- clang-format: For formatting C++ code. We adhere to the LLVM style guide.
- clang-tidy: For C++ code linting.
- gdb or lldb: A suitable debugger.
To make a debug build on Linux, run the following:
# Configure into a build directory, then build
cmake . --preset linux-debug
cmake --build --preset linux-build-debug
This will build the shared library libos2dsrules.so and the test suite testsuite.
To install the library, run the following from the build directory:
sudo cmake --install build_cmake/debug
By default, this will install headers into /usr/include and shared objects into /usr/lib.
To run the test suite:
ctest --preset linux-test-debug
Currently, this has only been tested on Linux. It remains to be tested on Windows and macOS.
Using CMake preset workflows
There are four preconfigured workflows that have been automated with cmake:
- linux-debug-workflow: Configures, builds and tests the debug version of the library for Linux.
- linux-release-workflow: Configures, builds and packages the release version of the library for Linux.
- windows-debug-workflow: Configures, builds and tests the debug version of the library for Windows.
- windows-release-workflow: Configures, builds and packages the release version of the library for Windows.
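The text does not show how a workflow preset is started. As a rough sketch, CMake runs workflow presets through cmake --workflow --preset <name>. That option was introduced in CMake 3.25, which is newer than the 3.20 minimum listed above. The helper names workflow_command and run_workflow are made up for this illustration and are not part of the project.

```python
import subprocess


def workflow_command(preset: str) -> list[str]:
    """Build the cmake command line that runs a workflow preset."""
    # `--workflow --preset <name>` requires CMake 3.25 or newer.
    return ["cmake", "--workflow", "--preset", preset]


def run_workflow(preset: str) -> None:
    """Run the workflow from the project root; raises if cmake fails."""
    subprocess.run(workflow_command(preset), check=True)


if __name__ == "__main__":
    # Only print the command here; call run_workflow(...) to execute it.
    print(" ".join(workflow_command("linux-debug-workflow")))
```

Running the printed command from the project root should perform the configure, build and test steps of the debug workflow in one go.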
The Python extension: os2ds-rules
You need the following:
- A compiler that supports C++20. We recommend using either g++ (GCC) or clang (LLVM).
- The CPython development headers and libraries.
- setuptools: For building the extension.
- pytest: For Python tests.
- pytest-benchmark: For Python benchmarks.
NOTE: Depending on what OS you use and how CPython was installed on your system, the development headers and libraries may or may not be installed.
The development headers and libraries can be installed using a package manager on the following systems:
- Ubuntu/Debian: sudo apt install python3-dev
- Fedora: sudo dnf install python3-devel
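If you are unsure whether the headers are already present, one quick check is to ask the interpreter where its include directory is and look for Python.h there. This is only a sketch using the standard sysconfig module. The function name has_python_headers is made up for this example, and finding the header does not guarantee that the rest of the build will succeed.

```python
import os
import sysconfig


def has_python_headers() -> bool:
    """Return True if Python.h exists in this interpreter's include directory."""
    include_dir = sysconfig.get_path("include")
    if include_dir is None:
        return False
    return os.path.isfile(os.path.join(include_dir, "Python.h"))


if __name__ == "__main__":
    if has_python_headers():
        print("Python.h found")
    else:
        print("Python.h missing: install the development package for your system")
```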
To build the extension:
# From the project root.
python3 -m setup build
To install the extension locally:
# From the project root.
# You may want to use the `-e` option during development.
python3 -m pip install .
Uninstalling is as easy as running:
pip uninstall os2ds-rules
Running the benchmark
After having installed the extension as described above, run the benchmarks with:
python3 -m pytest --benchmark-only test/benchmarks/
Currently, you need to build and install the extension before running the benchmark until this gets fixed.
Python Interpreter support
The Python 3 extension uses the CPython C-API, which CPython supports as standard. We aim to support the PyPy interpreter as well.
Usage Examples
In Python
Let us scan a Python str for occurrences of CPR numbers:
from os2ds_rules import CPRDetector
detector = CPRDetector()
matches = detector.find_matches('This is a fake, but valid CPR-Number: 1111111118')
for m in matches:
    print(m)
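The example number is called "fake, but valid". One way to read "valid" is the classic CPR modulus-11 check: each of the ten digits is multiplied by a fixed weight, and the weighted sum must be divisible by 11. The sketch below illustrates that check only. It is not taken from os2ds-rules, which may apply different or additional rules, and the function name passes_modulus_11 is made up for this example. Note also that not every real CPR number is guaranteed to satisfy this check.

```python
# Weights of the classic CPR modulus-11 check, one per digit.
WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)


def passes_modulus_11(cpr: str) -> bool:
    """Return True if the ten digits of `cpr` pass the modulus-11 check."""
    digits = cpr.replace("-", "")  # accept both DDMMYYXXXX and DDMMYY-XXXX
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return False
    total = sum(w * int(d) for w, d in zip(WEIGHTS, digits))
    return total % 11 == 0


if __name__ == "__main__":
    # The weighted sum for 1111111118 is 36 + 8 = 44, which is 4 * 11.
    print(passes_modulus_11("1111111118"))
    # Changing the last digit to 9 gives 45, which is not divisible by 11.
    print(passes_modulus_11("1111111119"))
```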
In C++
Consider this simple file, test.cpp:
#include <os2dsrules.hpp>
#include <name_rule.hpp>
#include <iostream>
#include <string>

using namespace OS2DSRules::NameRule;

int main(void) {
  NameRule rule;
  std::string s = "This is my friend, John.";
  auto results = rule.find_matches(s);

  std::cout << "Found matches: \n";
  for (auto m : results) {
    std::cout << m.match() << '\n';
  }

  return 0;
}
To compile using clang, simply run:
clang++ -std=c++20 test.cpp -o test -los2dsrules
This will produce an executable test in the current working directory.
Running it:
$ ./test
Found matches:
John