Swedish NLP utilities — stopwords, legal NER, text normalisation, chunking, and Pinecone vector pipeline helpers

swedish-nlp-utils

Swedish NLP utilities for text processing, entity extraction, number parsing, chunking, and Pinecone vector pipelines. Built for Trollfabriken AITrix AB projects, including SocKartan, DocFlow, ScrapeAssistant, and RevisionsUpproret.


Modules

Module      Class                  Purpose
stopwords   SwedishStopwords       Domain-specific Swedish stopword sets
normalizer  SwedishNormalizer      Municipality, authority, law-ref, OCR normalisation
ner         SwedishNER             Rule-based NER for Swedish legal/municipal texts
formats     SwedishFormats         Date, number, SEK, personnummer parsing & formatting
chunker     SwedishChunker         Swedish-aware text chunking for vector pipelines
vectors     SwedishVectorPipeline  Pinecone + OpenAI embedding pipeline

Installation

# Core (no external API dependencies)
pip install swedish-nlp-utils

# With vector pipeline support (Pinecone + OpenAI)
pip install "swedish-nlp-utils[vectors]"

# With spaCy NER upgrade
pip install "swedish-nlp-utils[spacy]"
python -m spacy download sv_core_news_sm

Quick start

Stopwords

from swedish_nlp import SwedishStopwords

# General Swedish
sw = SwedishStopwords()
sw.filter_tokens(["socialnämnden", "beslutade", "att", "bevilja", "insats"])
# → ["socialnämnden", "beslutade", "bevilja", "insats"]

# Domain-specific
sw = SwedishStopwords.for_social_services()  # general + legal + social
sw.filter_text("handläggaren beslutade att genomföra utredningen")

# All domains combined
sw = SwedishStopwords.all_domains()

# Custom additions (fluent)
sw = SwedishStopwords.for_legal().add("paragrafen", "stycket")
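
The filtering and the fluent add() call shown above can be pictured with a plain set. The following is a minimal sketch under stated assumptions, not the package's implementation: the class name TinyStopwords and its three-word default list are hypothetical and exist only for illustration.

```python
from typing import Iterable


class TinyStopwords:
    """Hypothetical stand-in for SwedishStopwords, for illustration only."""

    def __init__(self, words: Iterable[str] = ("att", "och", "som")) -> None:
        # Lowercased words in a set give constant-time membership checks.
        self._words = {w.lower() for w in words}

    def add(self, *words: str) -> "TinyStopwords":
        # Returning self is what makes the call chainable ("fluent").
        self._words.update(w.lower() for w in words)
        return self

    def filter_tokens(self, tokens: Iterable[str]) -> list[str]:
        # Keep every token whose lowercased form is not a stopword.
        return [t for t in tokens if t.lower() not in self._words]

    def filter_text(self, text: str) -> str:
        # Naive whitespace split; a real tokenizer would also handle punctuation.
        return " ".join(self.filter_tokens(text.split()))


sw = TinyStopwords().add("paragrafen", "stycket")
print(sw.filter_tokens(["socialnämnden", "beslutade", "att", "bevilja", "insats"]))
```

Matching on the lowercased form means "Att" and "att" are treated alike while the surviving tokens keep their original casing.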

Normaliser

from swedish_nlp import SwedishNormalizer

n = SwedishNormalizer()

# Municipality names
n.normalize_municipality("Göteborgs stad")       # → "Göteborg"
n.normalize_municipality("Stockholms kommun")    # → "Stockholm"
n.normalize_municipality("GBG")                  # → "Göteborg"

# Authority names → canonical short form
n.normalize_authority("Justitieombudsmannen")              # → "JO"
n.normalize_authority("Inspektionen för vård och omsorg")  # → "IVO"
n.normalize_authority("Högsta förvaltningsdomstolen")      # → "HFD"

# Law references
n.normalize_law_reference("Socialtjänstlagen")  # → "SoL"
n.normalize_law_reference("sol")                # → "SoL"
n.normalize_law_reference("förvaltningslagen")  # → "FL"

# Replace law names in running text
n.normalize_law_references_in_text(
    "Enligt socialtjänstlagen och föräldrabalken..."
)
# → "Enligt SoL och FB..."

# OCR artefact correction
n.normalize_ocr("§  12 nämnden\xadbeslut")  # → "§ 12 nämndbeslut"

# Utilities
n.ascii_fold("åäö ÅÄÖ")       # → "aao AAO"
n.to_slug("Göteborgs Stad 2024")  # → "goteborgs-stad-2024"
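
For readers curious how ASCII folding and slug generation can be done with the standard library alone, here is a small sketch. It assumes Unicode decomposition is an acceptable approach; the standalone functions below borrow the method names for readability but are not taken from the package source.

```python
import re
import unicodedata


def ascii_fold(text: str) -> str:
    # NFKD splits "å" into "a" plus a combining ring; dropping the
    # combining marks leaves the base letters.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_slug(text: str) -> str:
    folded = ascii_fold(text).lower()
    # Collapse every run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", folded).strip("-")


print(ascii_fold("åäö ÅÄÖ"))           # aao AAO
print(to_slug("Göteborgs Stad 2024"))  # goteborgs-stad-2024
```

Folding before lowercasing keeps the slug stable regardless of whether the input uses precomposed or decomposed characters.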

Named Entity Recognition

from swedish_nlp import SwedishNER

ner = SwedishNER()
entities = ner.extract("""
    Socialnämnden fattade beslut 2024-03-15 enligt SoL 4 kap. 1 §.
    Handläggare: Anna Lindqvist. Diarienummer: SOC-2024-0042.
    JO har i HFD 2015:5 klargjort rättsläget.
""")

entities.authorities   # ["Socialnämnden", "JO"]
entities.courts        # ["HFD"]
entities.law_refs      # ["SoL", "4 kap. 1 §", "HFD 2015:5"]
entities.persons       # ["Anna Lindqvist"]
entities.diarienummer  # ["SOC-2024-0042"]
entities.dates         # ["2024-03-15"]
entities.roles         # ["handläggare"]

# Optional spaCy upgrade (better person/org detection)
ner = SwedishNER(use_spacy=True)

Formats

from swedish_nlp import SwedishFormats
from datetime import date

# Dates — parse
SwedishFormats.parse_date("15 mars 2024")   # → date(2024, 3, 15)
SwedishFormats.parse_date("2024-03-15")     # → date(2024, 3, 15)

# Dates — format
SwedishFormats.format_date(date(2024, 3, 15))       # → "2024-03-15"
SwedishFormats.format_date_long(date(2024, 3, 15))  # → "15 mars 2024"

# Numbers (Swedish format: space-thousands, comma-decimal)
SwedishFormats.parse_number("1 234 567,89")   # → 1234567.89
SwedishFormats.format_number(1234567.89)      # → "1 234 567,89"

# SEK
SwedishFormats.parse_sek("1 234 567 kr")       # → 1234567.0
SwedishFormats.format_sek(1234567.0)           # → "1 234 567 kr"
SwedishFormats.format_sek(1234.5, decimals=2)  # → "1 234,50 kr"
SwedishFormats.format_sek(1000.0, unit="tkr")  # → "1 000 tkr"

# Personnummer
SwedishFormats.validate_personnummer("19850312-4564")         # → bool
SwedishFormats.pseudonymize_personnummer("19850312-4564")     # → "1985-XX-XXXX"

# Extract from text
SwedishFormats.extract_sek_amounts("Budget 5 000 SEK")    # → [5000.0]
SwedishFormats.extract_dates("Beslut 2024-03-15")         # → ["2024-03-15"]
SwedishFormats.parse_postal_code("Göteborg 413 01")       # → "413 01"
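
The Swedish number convention (space as thousands separator, comma as decimal mark) and the personnummer checksum are both easy to express in plain Python. The functions below are a sketch under assumptions, not the package's code: the names are hypothetical, and the checksum function applies only the Luhn check without validating the birth date.

```python
def parse_number(text: str) -> float:
    # Drop ordinary, non-breaking and narrow non-breaking spaces
    # used as thousands separators, then swap the decimal comma.
    cleaned = text.replace(" ", "").replace("\u00a0", "").replace("\u202f", "")
    return float(cleaned.replace(",", "."))


def format_number(value: float, decimals: int = 2) -> str:
    # Format with English separators first, then swap in the Swedish ones.
    english = f"{value:,.{decimals}f}"
    return english.replace(",", " ").replace(".", ",")


def luhn_valid(personnummer: str) -> bool:
    digits = [int(ch) for ch in personnummer if ch in "0123456789"]
    if len(digits) == 12:
        digits = digits[2:]  # the century digits are not part of the checksum
    if len(digits) != 10:
        return False
    total = 0
    for i, d in enumerate(digits):
        d = d * 2 if i % 2 == 0 else d   # weights 2, 1, 2, 1, ...
        total += d - 9 if d > 9 else d   # add the digit sum of two-digit products
    return total % 10 == 0


print(parse_number("1 234 567,89"))   # 1234567.89
print(format_number(1234567.89))      # 1 234 567,89
print(luhn_valid("811218-9876"))      # True
```

With this sketch, the illustrative value 19850312-4564 from the example above fails the checksum, while 811218-9876 is a commonly cited value that passes.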

Chunker

from swedish_nlp import SwedishChunker
from swedish_nlp.chunker.chunker import ChunkConfig

# Default config (512 tokens, 50 overlap)
chunker = SwedishChunker()
chunks = chunker.chunk(long_text)

# Custom config
cfg = ChunkConfig(chunk_size=256, chunk_overlap=30, min_chunk_size=40)
chunker = SwedishChunker(cfg)

# With Pinecone metadata for every chunk
chunks = chunker.chunk_document(
    text,
    doc_id       = "arsredovisning-goteborg-2023",
    doc_type     = "årsredovisning",
    municipality = "Göteborg",
    year         = 2023,
    extra_metadata = {"source_url": "https://goteborg.se/doc.pdf"},
)

for c in chunks:
    print(c.index, c.token_estimate, c.text[:80])
    print(c.metadata)  # {"doc_id": ..., "doc_type": ..., "municipality": ...}

# Format one chunk for Pinecone upsert (here the first chunk)
vector = chunks[0].to_pinecone_dict("vec-001", embedding=[0.1] * 1536)
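
To make the chunk_size and chunk_overlap parameters concrete, here is a sketch of a plain sliding window over a token list. It is a simplification under stated assumptions: the function name chunk_tokens is hypothetical, and the real chunker also splits on sections, paragraphs, and sentences, which this example ignores.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int = 512, chunk_overlap: int = 50
) -> list[list[str]]:
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[list[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
        start += step
    return chunks


tokens = [str(i) for i in range(10)]
for chunk in chunk_tokens(tokens, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window repeats the last chunk_overlap tokens of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.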

Vector pipeline (Pinecone + OpenAI)

from swedish_nlp.vectors import SwedishVectorPipeline

# Requires: pip install "swedish-nlp-utils[vectors]"
# Requires: PINECONE_API_KEY and OPENAI_API_KEY in environment

pipeline = SwedishVectorPipeline(
    index_name = "sockartan-documents",
    namespace  = "protokoll",
)

# Index a document (chunk → embed → upsert in one call)
n_vectors = pipeline.chunk_and_upsert(
    text         = full_protocol_text,
    doc_id       = "protokoll-goteborg-2024-03",
    doc_type     = "protokoll",
    municipality = "Göteborg",
    year         = 2024,
)

# Semantic search
results = pipeline.search("socialtjänstens insatser för barn", top_k=5)
for r in results:
    print(r.score, r.municipality, r.text[:100])

# Filtered search
results = pipeline.search_with_filter(
    "budget underskott",
    doc_type     = "årsredovisning",
    municipality = "Göteborg",
    year         = 2023,
)

# Management
pipeline.delete_by_doc_id("protokoll-goteborg-2024-03")
stats = pipeline.get_index_stats()
print(stats["total_vectors"], stats["namespaces"])
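
A filtered search against Pinecone ultimately needs a metadata filter expression. The sketch below shows one plausible way keyword arguments like those above could be turned into such a filter; the helper name build_filter is hypothetical, and whether search_with_filter works this way internally is an assumption. The $eq operator itself is part of Pinecone's metadata filter syntax.

```python
from typing import Any, Optional


def build_filter(
    doc_type: Optional[str] = None,
    municipality: Optional[str] = None,
    year: Optional[int] = None,
) -> dict[str, Any]:
    candidates = {"doc_type": doc_type, "municipality": municipality, "year": year}
    # Skip unset fields; multiple top-level keys are combined with a logical AND.
    return {key: {"$eq": value} for key, value in candidates.items() if value is not None}


print(build_filter(doc_type="årsredovisning", municipality="Göteborg", year=2023))
```

Omitting every argument yields an empty filter, which corresponds to an unfiltered search.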

CLI

# Named entity extraction
swe-nlp analyze "Socialnämnden fattade beslut enligt SoL 4 kap. 1 §"

# From file, JSON output
swe-nlp analyze --file document.txt --format json

# Chunk a document
swe-nlp chunk --file arsredovisning.txt --size 256 --show-tokens

# Normalize OCR output and law references
swe-nlp normalize --file ocr_output.txt
swe-nlp normalize "Socialtjänstlagen 4 kap §  1" --municipality

Environment variables

Variable             Module                 Required / default
OPENAI_API_KEY       SwedishVectorPipeline  For vector pipeline
PINECONE_API_KEY     SwedishVectorPipeline  For vector pipeline
PINECONE_INDEX_NAME  SwedishVectorPipeline  Default: swedish-docs
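
It can be useful to check these variables before constructing the pipeline, so that a missing key fails early with a clear message. The following is a sketch only: the helper name load_vector_settings and the returned dictionary keys are hypothetical and not part of the package.

```python
import os
from typing import Mapping


def load_vector_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    required = ("OPENAI_API_KEY", "PINECONE_API_KEY")
    # Treat unset and empty values alike as missing.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "openai_api_key": env["OPENAI_API_KEY"],
        "pinecone_api_key": env["PINECONE_API_KEY"],
        # Falls back to the default index name listed in the table above.
        "index_name": env.get("PINECONE_INDEX_NAME", "swedish-docs"),
    }
```

Passing a plain dictionary instead of os.environ makes the check easy to exercise in tests.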

Package structure

swedish_nlp/
├── __init__.py              ← Public API surface
├── cli.py                   ← swe-nlp CLI (analyze, chunk, normalize)
├── py.typed                 ← PEP 561 typed marker
├── stopwords/
│   └── stopwords.py         ← SwedishStopwords (5 domains)
├── normalizer/
│   └── normalizer.py        ← SwedishNormalizer (municipality/authority/law/OCR)
├── ner/
│   └── ner.py               ← SwedishNER (rule-based + optional spaCy)
├── formats/
│   └── formats.py           ← SwedishFormats (dates/numbers/SEK/personnummer)
├── chunker/
│   └── chunker.py           ← SwedishChunker (section/paragraph/sentence/token)
└── vectors/
    └── vectors.py           ← SwedishVectorPipeline (Pinecone + OpenAI)

VNV/tests/
├── test_stopwords.py        ← 19 tests
├── test_normalizer.py       ← 30 tests
├── test_ner.py              ← 25 tests
├── test_formats.py          ← 38 tests
└── test_chunker.py          ← 17 tests

Extending

Extend a stopword domain with custom terms:

from swedish_nlp.stopwords.stopwords import _DOMAIN_MAP, Domain

# Add custom terms to the existing MEDICAL domain set
_DOMAIN_MAP[Domain.MEDICAL].update({"ny_term", "annan_term"})

Add a new authority alias:

# In normalizer/normalizer.py
_AUTHORITY_CANONICAL["ny myndighet"] = "NM"

Add a new law abbreviation:

# In normalizer/normalizer.py
_LAW_ALIASES["ny lagtext"] = "NL"
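
To show why adding an entry to an alias table is enough to change how running text is normalised, here is a sketch of alias-driven replacement. It is an assumption about the general technique, not the package's code: LAW_ALIASES and replace_law_names are hypothetical names, and the table holds only the three abbreviations that appear in the examples above.

```python
import re

# Hypothetical alias table; keys must be lowercase.
LAW_ALIASES = {
    "socialtjänstlagen": "SoL",
    "föräldrabalken": "FB",
    "förvaltningslagen": "FL",
}


def replace_law_names(text: str, aliases: dict[str, str] = LAW_ALIASES) -> str:
    # Longest names first, so a longer name wins over a shorter one it starts with.
    names = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    # Look up the lowercased match so "Socialtjänstlagen" and "socialtjänstlagen" agree.
    return pattern.sub(lambda m: aliases[m.group(1).lower()], text)


print(replace_law_names("Enligt socialtjänstlagen och föräldrabalken..."))
# Enligt SoL och FB...
```

Because the pattern is built from the table on each call, a newly added alias takes effect without any other change.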

© 2025 Trollfabriken AITrix AB — MIT License


