Swedish NLP utilities — stopwords, legal NER, text normalisation, chunking, and Pinecone vector pipeline helpers

swedish-nlp-utils

Swedish NLP utilities for text processing, entity extraction, number parsing, chunking, and Pinecone vector pipeline helpers — built for Trollfabriken AITrix AB projects including SocKartan, DocFlow, ScrapeAssistant, and RevisionsUpproret.

Modules

Module	Class	Purpose
`stopwords`	`SwedishStopwords`	Domain-specific Swedish stopword sets
`normalizer`	`SwedishNormalizer`	Municipality, authority, law-ref, OCR normalisation
`ner`	`SwedishNER`	Rule-based NER for Swedish legal/municipal texts
`formats`	`SwedishFormats`	Date, number, SEK, personnummer parsing & formatting
`chunker`	`SwedishChunker`	Swedish-aware text chunking for vector pipelines
`vectors`	`SwedishVectorPipeline`	Pinecone + OpenAI embedding pipeline

Installation

# Core (no external API dependencies)
pip install swedish-nlp-utils

# With vector pipeline support (Pinecone + OpenAI)
pip install "swedish-nlp-utils[vectors]"

# With spaCy NER upgrade
pip install "swedish-nlp-utils[spacy]"
python -m spacy download sv_core_news_sm

Quick start

Stopwords

from swedish_nlp import SwedishStopwords

# General Swedish
sw = SwedishStopwords()
sw.filter_tokens(["socialnämnden", "beslutade", "att", "bevilja", "insats"])
# → ["socialnämnden", "beslutade", "bevilja", "insats"]

# Domain-specific
sw = SwedishStopwords.for_social_services()  # general + legal + social
sw.filter_text("handläggaren beslutade att genomföra utredningen")

# All domains combined
sw = SwedishStopwords.all_domains()

# Custom additions (fluent)
sw = SwedishStopwords.for_legal().add("paragrafen", "stycket")
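
To make the filtering behaviour concrete, here is a minimal standalone sketch of set-based stopword filtering. It does not import the package. The word list and both function bodies are illustrative assumptions, not the library's actual data or implementation.

```python
# Standalone sketch: GENERAL_STOPWORDS is a tiny illustrative set,
# not the stopword list shipped with the package.
GENERAL_STOPWORDS = frozenset({"att", "och", "i", "på", "som", "en", "ett", "det"})


def filter_tokens(tokens, stopwords=GENERAL_STOPWORDS):
    """Return the tokens whose lowercase form is not a stopword, keeping order."""
    return [t for t in tokens if t.lower() not in stopwords]


def filter_text(text, stopwords=GENERAL_STOPWORDS):
    """Split on whitespace, drop stopwords, and re-join with single spaces."""
    return " ".join(filter_tokens(text.split(), stopwords))


tokens = ["socialnämnden", "beslutade", "att", "bevilja", "insats"]
print(filter_tokens(tokens))
print(filter_text("handläggaren beslutade att genomföra utredningen"))
```

A domain-specific set could then be modelled as the union of several such sets, which is presumably what the `for_social_services()` comment ("general + legal + social") describes.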

Normaliser

from swedish_nlp import SwedishNormalizer

n = SwedishNormalizer()

# Municipality names
n.normalize_municipality("Göteborgs stad")       # → "Göteborg"
n.normalize_municipality("Stockholms kommun")    # → "Stockholm"
n.normalize_municipality("GBG")                  # → "Göteborg"

# Authority names → canonical short form
n.normalize_authority("Justitieombudsmannen")              # → "JO"
n.normalize_authority("Inspektionen för vård och omsorg")  # → "IVO"
n.normalize_authority("Högsta förvaltningsdomstolen")      # → "HFD"

# Law references
n.normalize_law_reference("Socialtjänstlagen")  # → "SoL"
n.normalize_law_reference("sol")                # → "SoL"
n.normalize_law_reference("förvaltningslagen")  # → "FL"

# Replace law names in running text
n.normalize_law_references_in_text(
    "Enligt socialtjänstlagen och föräldrabalken..."
)
# → "Enligt SoL och FB..."

# OCR artefact correction
n.normalize_ocr("§  12 nämnd\xadbeslut")  # → "§ 12 nämndbeslut"

# Utilities
n.ascii_fold("åäö ÅÄÖ")       # → "aao AAO"
n.to_slug("Göteborgs Stad 2024")  # → "goteborgs-stad-2024"
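
The utility helpers can be approximated with the standard library alone. The sketch below is one plausible way to fold diacritics, build slugs, and clean soft hyphens. The function bodies are assumptions for illustration and may differ from what `SwedishNormalizer` actually does.

```python
import re
import unicodedata


def ascii_fold(text):
    """Strip diacritics: decompose with NFKD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_slug(text):
    """Fold to ASCII, lowercase, and join alphanumeric runs with hyphens."""
    folded = ascii_fold(text).lower()
    return "-".join(re.findall(r"[a-z0-9]+", folded))


def strip_soft_hyphens(text):
    """Remove U+00AD soft hyphens and collapse runs of spaces (simple OCR cleanup)."""
    return re.sub(r" {2,}", " ", text.replace("\xad", ""))


print(ascii_fold("åäö ÅÄÖ"))
print(to_slug("Göteborgs Stad 2024"))
print(strip_soft_hyphens("§  12 nämnd\xadbeslut"))
```

Note that folding å, ä, and ö to plain ASCII loses information, so it is best reserved for identifiers and slugs rather than display text.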

Named Entity Recognition

from swedish_nlp import SwedishNER

ner = SwedishNER()
entities = ner.extract("""
    Socialnämnden fattade beslut 2024-03-15 enligt SoL 4 kap. 1 §.
    Handläggare: Anna Lindqvist. Diarienummer: SOC-2024-0042.
    JO har i HFD 2015:5 klargjort rättsläget.
""")

entities.authorities   # ["Socialnämnden", "JO"]
entities.courts        # ["HFD"]
entities.law_refs      # ["SoL", "4 kap. 1 §", "HFD 2015:5"]
entities.persons       # ["Anna Lindqvist"]
entities.diarienummer  # ["SOC-2024-0042"]
entities.dates         # ["2024-03-15"]
entities.roles         # ["handläggare"]

# Optional spaCy upgrade (better person/org detection)
ner = SwedishNER(use_spacy=True)
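
For readers unfamiliar with rule-based NER, the following standalone sketch shows how a few of these entity types could be matched with regular expressions. The three patterns and the `Entities` dataclass here are hypothetical and deliberately simplified. They are not the package's rules, which this README does not show.

```python
import re
from dataclasses import dataclass, field

# Illustrative patterns only (assumptions, not the package's real rules).
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")                     # ISO dates
DIARIENUMMER_RE = re.compile(r"\b[A-ZÅÄÖ]{2,5}-\d{4}-\d{3,5}\b")   # e.g. SOC-2024-0042
CHAPTER_REF_RE = re.compile(r"\b\d+\s*kap\.\s*\d+\s*[a-z]?\s*§")   # e.g. 4 kap. 1 §


@dataclass
class Entities:
    dates: list = field(default_factory=list)
    diarienummer: list = field(default_factory=list)
    chapter_refs: list = field(default_factory=list)


def extract(text):
    """Collect every non-overlapping match of each pattern, in text order."""
    return Entities(
        dates=DATE_RE.findall(text),
        diarienummer=DIARIENUMMER_RE.findall(text),
        chapter_refs=CHAPTER_REF_RE.findall(text),
    )


sample = (
    "Socialnämnden fattade beslut 2024-03-15 enligt SoL 4 kap. 1 §. "
    "Diarienummer: SOC-2024-0042."
)
print(extract(sample))
```

Patterns like these are cheap and predictable, but they miss free-form names, which is presumably why the optional spaCy mode exists for person and organisation detection.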

Formats

from swedish_nlp import SwedishFormats
from datetime import date

# Dates — parse
SwedishFormats.parse_date("15 mars 2024")   # → date(2024, 3, 15)
SwedishFormats.parse_date("2024-03-15")     # → date(2024, 3, 15)

# Dates — format
SwedishFormats.format_date(date(2024, 3, 15))       # → "2024-03-15"
SwedishFormats.format_date_long(date(2024, 3, 15))  # → "15 mars 2024"

# Numbers (Swedish format: space-thousands, comma-decimal)
SwedishFormats.parse_number("1 234 567,89")   # → 1234567.89
SwedishFormats.format_number(1234567.89)      # → "1 234 567,89"

# SEK
SwedishFormats.parse_sek("1 234 567 kr")       # → 1234567.0
SwedishFormats.format_sek(1234567.0)           # → "1 234 567 kr"
SwedishFormats.format_sek(1234.5, decimals=2)  # → "1 234,50 kr"
SwedishFormats.format_sek(1000.0, unit="tkr")  # → "1 000 tkr"

# Personnummer
SwedishFormats.validate_personnummer("19850312-4564")         # → bool
SwedishFormats.pseudonymize_personnummer("19850312-4564")     # → "1985-XX-XXXX"

# Extract from text
SwedishFormats.extract_sek_amounts("Budget 5 000 SEK")    # → [5000.0]
SwedishFormats.extract_dates("Beslut 2024-03-15")         # → ["2024-03-15"]
SwedishFormats.parse_postal_code("Göteborg 413 01")       # → "413 01"
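
Two of the conventions above are easy to get wrong by hand: Swedish number formatting and the personnummer check digit. The standalone sketch below illustrates both. The function names are hypothetical, and the checksum function only verifies the Luhn digit (it does not check that the date part is a real date), so it should not be read as the package's validation logic.

```python
import re


def parse_number(text):
    """Parse '1 234 567,89': whitespace groups thousands, comma marks decimals."""
    return float(re.sub(r"\s", "", text).replace(",", "."))


def format_number(value, decimals=2):
    """Format a float with space-separated thousands and a decimal comma."""
    english = f"{value:,.{decimals}f}"  # e.g. "1,234,567.89"
    return english.replace(",", " ").replace(".", ",")


def personnummer_checksum_ok(pnr):
    """Luhn check over the 10-digit form (a leading century is dropped)."""
    digits = re.sub(r"\D", "", pnr)
    if len(digits) == 12:
        digits = digits[2:]
    if len(digits) != 10:
        return False
    total = 0
    for i, ch in enumerate(digits):
        d = int(ch) * (2 if i % 2 == 0 else 1)  # weights 2,1,2,1,...
        total += d - 9 if d > 9 else d          # sum the digits of two-digit products
    return total % 10 == 0


print(parse_number("1 234 567,89"))
print(format_number(1234567.89))
print(personnummer_checksum_ok("811218-9876"))
```

The checksum is what lets a validator reject most single-digit typos without any lookup against a population register.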

Chunker

from swedish_nlp import SwedishChunker
from swedish_nlp.chunker.chunker import ChunkConfig

# Default config (512-token chunks, 50-token overlap)
chunker = SwedishChunker()
chunks = chunker.chunk(long_text)

# Custom config
cfg = ChunkConfig(chunk_size=256, chunk_overlap=30, min_chunk_size=40)
chunker = SwedishChunker(cfg)

# With Pinecone metadata for every chunk
chunks = chunker.chunk_document(
    text,
    doc_id       = "arsredovisning-goteborg-2023",
    doc_type     = "årsredovisning",
    municipality = "Göteborg",
    year         = 2023,
    extra_metadata = {"source_url": "https://goteborg.se/doc.pdf"},
)

for c in chunks:
    print(c.index, c.token_estimate, c.text[:80])
    print(c.metadata)  # {"doc_id": ..., "doc_type": ..., "municipality": ...}

# Format a single chunk for Pinecone upsert (the embedding values are placeholders)
vector = chunks[0].to_pinecone_dict("vec-001", embedding=[0.1] * 1536)
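
The role of `chunk_size` and `chunk_overlap` can be illustrated with a simple sliding window. This standalone sketch approximates tokens with whitespace-separated words, which is an assumption. According to the package structure listing, the real chunker also splits on sections, paragraphs, and sentences, none of which is modelled here.

```python
def chunk_words(text, chunk_size=512, chunk_overlap=50):
    """Split text into word windows of chunk_size, each sharing chunk_overlap words
    with the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(10))
for i, chunk in enumerate(chunk_words(text, chunk_size=4, chunk_overlap=1)):
    print(i, chunk)
```

The overlap means a sentence that straddles a chunk boundary is still likely to appear whole in at least one chunk, at the cost of embedding some words twice.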

Vector pipeline (Pinecone + OpenAI)

from swedish_nlp.vectors import SwedishVectorPipeline

# Requires: pip install "swedish-nlp-utils[vectors]"
# Requires: PINECONE_API_KEY and OPENAI_API_KEY in environment

pipeline = SwedishVectorPipeline(
    index_name = "sockartan-documents",
    namespace  = "protokoll",
)

# Index a document (chunk → embed → upsert in one call)
n_vectors = pipeline.chunk_and_upsert(
    text         = full_protocol_text,
    doc_id       = "protokoll-goteborg-2024-03",
    doc_type     = "protokoll",
    municipality = "Göteborg",
    year         = 2024,
)

# Semantic search
results = pipeline.search("socialtjänstens insatser för barn", top_k=5)
for r in results:
    print(r.score, r.municipality, r.text[:100])

# Filtered search
results = pipeline.search_with_filter(
    "budget underskott",
    doc_type     = "årsredovisning",
    municipality = "Göteborg",
    year         = 2023,
)

# Management
pipeline.delete_by_doc_id("protokoll-goteborg-2024-03")
stats = pipeline.get_index_stats()
print(stats["total_vectors"], stats["namespaces"])
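
Filtered search in Pinecone relies on metadata filters written as dictionaries with MongoDB-style operators such as `$eq`. The sketch below shows how keyword arguments like those above could be turned into such a filter. The helper name `build_metadata_filter` is hypothetical, and whether `search_with_filter` builds its filter this way is an assumption.

```python
def build_metadata_filter(**fields):
    """Build a Pinecone-style metadata filter, skipping fields left as None.

    Multiple top-level keys are combined with an implicit AND.
    """
    return {key: {"$eq": value} for key, value in fields.items() if value is not None}


flt = build_metadata_filter(
    doc_type="årsredovisning",
    municipality="Göteborg",
    year=2023,
    author=None,  # unset fields are left out of the filter
)
print(flt)
```

A filter like this only works if the same keys were stored as chunk metadata at upsert time, which is why `chunk_and_upsert` takes `doc_type`, `municipality`, and `year`.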

CLI

# Named entity extraction
swe-nlp analyze "Socialnämnden fattade beslut enligt SoL 4 kap. 1 §"

# From file, JSON output
swe-nlp analyze --file document.txt --format json

# Chunk a document
swe-nlp chunk --file arsredovisning.txt --size 256 --show-tokens

# Normalize OCR output and law references
swe-nlp normalize --file ocr_output.txt
swe-nlp normalize "Socialtjänstlagen 4 kap §  1" --municipality

Environment variables

Variable	Used by	Requirement
`OPENAI_API_KEY`	`SwedishVectorPipeline`	Required for the vector pipeline
`PINECONE_API_KEY`	`SwedishVectorPipeline`	Required for the vector pipeline
`PINECONE_INDEX_NAME`	`SwedishVectorPipeline`	Optional; default: `swedish-docs`
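
A quick way to check the environment before constructing the pipeline is to read these variables up front and fail early. The helper below is a hypothetical sketch and not part of the package. It only mirrors the table above: two required keys and one optional index name with the documented default.

```python
import os


def load_vector_settings(env=None):
    """Read the vector pipeline settings, raising if a required key is missing."""
    env = os.environ if env is None else env
    required = ("OPENAI_API_KEY", "PINECONE_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "openai_api_key": env["OPENAI_API_KEY"],
        "pinecone_api_key": env["PINECONE_API_KEY"],
        "index_name": env.get("PINECONE_INDEX_NAME", "swedish-docs"),
    }


# Example with an explicit mapping instead of the real environment.
settings = load_vector_settings({"OPENAI_API_KEY": "sk-test", "PINECONE_API_KEY": "pc-test"})
print(settings["index_name"])
```

Passing a plain dictionary, as in the example, keeps the check testable without touching real credentials.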

Package structure

swedish_nlp/
├── __init__.py              ← Public API surface
├── cli.py                   ← swe-nlp CLI (analyze, chunk, normalize)
├── py.typed                 ← PEP 561 typed marker
├── stopwords/
│   └── stopwords.py         ← SwedishStopwords (5 domains)
├── normalizer/
│   └── normalizer.py        ← SwedishNormalizer (municipality/authority/law/OCR)
├── ner/
│   └── ner.py               ← SwedishNER (rule-based + optional spaCy)
├── formats/
│   └── formats.py           ← SwedishFormats (dates/numbers/SEK/personnummer)
├── chunker/
│   └── chunker.py           ← SwedishChunker (section/paragraph/sentence/token)
└── vectors/
    └── vectors.py           ← SwedishVectorPipeline (Pinecone + OpenAI)

VNV/tests/
├── test_stopwords.py        ← 19 tests
├── test_normalizer.py       ← 30 tests
├── test_ner.py              ← 25 tests
├── test_formats.py          ← 38 tests
└── test_chunker.py          ← 17 tests

Extending

Add terms to a stopword domain:

from swedish_nlp.stopwords.stopwords import _DOMAIN_MAP, Domain

# Extend the MEDICAL domain set with custom terms
_DOMAIN_MAP[Domain.MEDICAL].update({"ny_term", "annan_term"})

Add a new authority alias:

# In normalizer/normalizer.py
_AUTHORITY_CANONICAL["ny myndighet"] = "NM"

Add a new law abbreviation:

# In normalizer/normalizer.py
_LAW_ALIASES["ny lagtext"] = "NL"

Project details

Version 1.0.0, released May 15, 2026.

Download files

Source Distribution

swedish_nlp_utils-1.0.0.tar.gz (32.4 kB)

Uploaded May 15, 2026 Source

Built Distribution

swedish_nlp_utils-1.0.0-py3-none-any.whl (33.6 kB)

Uploaded May 15, 2026 Python 3

File details

Details for the file swedish_nlp_utils-1.0.0.tar.gz.

File metadata

Download URL: swedish_nlp_utils-1.0.0.tar.gz
Upload date: May 15, 2026
Size: 32.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for swedish_nlp_utils-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`b1f5a5761f199ec4049ba4ebf8f3a466a3c4cacbfe49248123437fc3d1f0bf66`
MD5	`c171039414615a8040e1f28a3da73631`
BLAKE2b-256	`104d9734e091a3fab6bdcd039c4c31adf8499f767dabd76f994fd37c8c3d64ef`

File details

Details for the file swedish_nlp_utils-1.0.0-py3-none-any.whl.

File metadata

Download URL: swedish_nlp_utils-1.0.0-py3-none-any.whl
Upload date: May 15, 2026
Size: 33.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for swedish_nlp_utils-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c6afc5f2408dadeeff9b8f027c632f2261074a40f5c94f94ec47725b398082a7`
MD5	`299be8fac2b3ff4397f80eb46ec260a6`
BLAKE2b-256	`38cf385707e7bdfb178fd2fac6d39aba77e4d34078626778ea3dadcd68ad7448`
