Swedish NLP utilities — stopwords, legal NER, text normalisation, chunking, and Pinecone vector pipeline helpers
swedish-nlp-utils
Swedish NLP utilities for text processing, entity extraction, number parsing, chunking, and Pinecone vector pipeline helpers — built for Trollfabriken AITrix AB projects including SocKartan, DocFlow, ScrapeAssistant, and RevisionsUpproret.
Modules
| Module | Class | Purpose |
|---|---|---|
| stopwords | SwedishStopwords | Domain-specific Swedish stopword sets |
| normalizer | SwedishNormalizer | Municipality, authority, law-ref, OCR normalisation |
| ner | SwedishNER | Rule-based NER for Swedish legal/municipal texts |
| formats | SwedishFormats | Date, number, SEK, personnummer parsing & formatting |
| chunker | SwedishChunker | Swedish-aware text chunking for vector pipelines |
| vectors | SwedishVectorPipeline | Pinecone + OpenAI embedding pipeline |
Installation
# Core (no external API dependencies)
pip install swedish-nlp-utils
# With vector pipeline support (Pinecone + OpenAI)
pip install "swedish-nlp-utils[vectors]"
# With spaCy NER upgrade
pip install "swedish-nlp-utils[spacy]"
python -m spacy download sv_core_news_sm
Quick start
Stopwords
from swedish_nlp import SwedishStopwords
# General Swedish
sw = SwedishStopwords()
sw.filter_tokens(["socialnämnden", "beslutade", "att", "bevilja", "insats"])
# → ["socialnämnden", "beslutade", "bevilja", "insats"]
# Domain-specific
sw = SwedishStopwords.for_social_services() # general + legal + social
sw.filter_text("handläggaren beslutade att genomföra utredningen")
# All domains combined
sw = SwedishStopwords.all_domains()
# Custom additions (fluent)
sw = SwedishStopwords.for_legal().add("paragrafen", "stycket")
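To make the filtering and fluent `add` behaviour above concrete, here is a minimal stand-alone sketch. It is not the library's implementation: the class name `StopwordFilter` and the tiny word list are assumptions made for illustration, and the real domain sets are much larger.

```python
class StopwordFilter:
    """Toy stand-in for a stopword filter (hypothetical, illustration only)."""

    def __init__(self, words):
        self._words = {w.lower() for w in words}

    def add(self, *words):
        # Returning self is what allows chained, fluent calls.
        self._words.update(w.lower() for w in words)
        return self

    def filter_tokens(self, tokens):
        # Preserve token order; compare case-insensitively.
        return [t for t in tokens if t.lower() not in self._words]

    def filter_text(self, text):
        # Naive whitespace tokenisation, then re-join the kept tokens.
        return " ".join(self.filter_tokens(text.split()))


# Hypothetical mini word list; the real sets are assumed to be far larger.
sw = StopwordFilter({"att", "och", "en", "det"}).add("paragrafen", "stycket")
print(sw.filter_tokens(["socialnämnden", "beslutade", "att", "bevilja", "insats"]))
```

Keeping the set lower-cased on the way in means lookups stay a single set membership test per token.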
Normaliser
from swedish_nlp import SwedishNormalizer
n = SwedishNormalizer()
# Municipality names
n.normalize_municipality("Göteborgs stad") # → "Göteborg"
n.normalize_municipality("Stockholms kommun") # → "Stockholm"
n.normalize_municipality("GBG") # → "Göteborg"
# Authority names → canonical short form
n.normalize_authority("Justitieombudsmannen") # → "JO"
n.normalize_authority("Inspektionen för vård och omsorg") # → "IVO"
n.normalize_authority("Högsta förvaltningsdomstolen") # → "HFD"
# Law references
n.normalize_law_reference("Socialtjänstlagen") # → "SoL"
n.normalize_law_reference("sol") # → "SoL"
n.normalize_law_reference("förvaltningslagen") # → "FL"
# Replace law names in running text
n.normalize_law_references_in_text(
    "Enligt socialtjänstlagen och föräldrabalken..."
)
# → "Enligt SoL och FB..."
# OCR artefact correction
n.normalize_ocr("§ 12 nämnd\xadbeslut") # → "§ 12 nämndbeslut"
# Utilities
n.ascii_fold("åäö ÅÄÖ") # → "aao AAO"
n.to_slug("Göteborgs Stad 2024") # → "goteborgs-stad-2024"
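For readers curious how diacritic folding and slug generation of this kind can work, the sketch below uses only the standard library. It is an assumption about the general technique (Unicode decomposition followed by dropping combining marks), not a copy of `SwedishNormalizer`; the function names simply mirror the examples above.

```python
import re
import unicodedata


def ascii_fold(text):
    # NFKD splits "å" into "a" plus a combining ring, and so on.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_slug(text):
    folded = ascii_fold(text).lower()
    # Collapse every run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", folded).strip("-")


print(ascii_fold("åäö ÅÄÖ"))           # aao AAO
print(to_slug("Göteborgs Stad 2024"))  # goteborgs-stad-2024
```

Letters without a Unicode decomposition (for example "ø") pass through `ascii_fold` unchanged in this sketch, so a production version would likely need an explicit mapping for them.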
Named Entity Recognition
from swedish_nlp import SwedishNER
ner = SwedishNER()
entities = ner.extract("""
Socialnämnden fattade beslut 2024-03-15 enligt SoL 4 kap. 1 §.
Handläggare: Anna Lindqvist. Diarienummer: SOC-2024-0042.
JO har i HFD 2015:5 klargjort rättsläget.
""")
entities.authorities # ["Socialnämnden", "JO"]
entities.courts # ["HFD"]
entities.law_refs # ["SoL", "4 kap. 1 §", "HFD 2015:5"]
entities.persons # ["Anna Lindqvist"]
entities.diarienummer # ["SOC-2024-0042"]
entities.dates # ["2024-03-15"]
entities.roles # ["handläggare"]
# Optional spaCy upgrade (better person/org detection)
ner = SwedishNER(use_spacy=True)
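Rule-based extraction of this sort usually comes down to a handful of regular expressions. The following sketch covers three of the entity types shown above (ISO dates, diarienummer, and chapter/section references). The patterns and the `extract_entities` helper are assumptions written for illustration and are likely simpler than whatever `SwedishNER` uses internally.

```python
import re

# Hypothetical patterns, for illustration only.
PATTERNS = {
    "dates": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "diarienummer": re.compile(r"\b[A-ZÅÄÖ]{2,5}-\d{4}-\d{3,5}\b"),
    "chapter_refs": re.compile(r"\b\d+\s+kap\.\s*\d+\s*§"),
}


def extract_entities(text):
    # Each pattern has no capture groups, so findall returns whole matches.
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}


sample = (
    "Socialnämnden fattade beslut 2024-03-15 enligt SoL 4 kap. 1 §. "
    "Handläggare: Anna Lindqvist. Diarienummer: SOC-2024-0042."
)
print(extract_entities(sample))
```

Person and organisation names are much harder to capture with patterns alone, which is presumably why the optional spaCy upgrade exists.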
Formats
from swedish_nlp import SwedishFormats
from datetime import date
# Dates — parse
SwedishFormats.parse_date("15 mars 2024") # → date(2024, 3, 15)
SwedishFormats.parse_date("2024-03-15") # → date(2024, 3, 15)
# Dates — format
SwedishFormats.format_date(date(2024, 3, 15)) # → "2024-03-15"
SwedishFormats.format_date_long(date(2024, 3, 15)) # → "15 mars 2024"
# Numbers (Swedish format: space-thousands, comma-decimal)
SwedishFormats.parse_number("1 234 567,89") # → 1234567.89
SwedishFormats.format_number(1234567.89) # → "1 234 567,89"
# SEK
SwedishFormats.parse_sek("1 234 567 kr") # → 1234567.0
SwedishFormats.format_sek(1234567.0) # → "1 234 567 kr"
SwedishFormats.format_sek(1234.5, decimals=2) # → "1 234,50 kr"
SwedishFormats.format_sek(1000.0, unit="tkr") # → "1 000 tkr"
# Personnummer
SwedishFormats.validate_personnummer("19850312-4564") # → bool
SwedishFormats.pseudonymize_personnummer("19850312-4564") # → "1985-XX-XXXX"
# Extract from text
SwedishFormats.extract_sek_amounts("Budget 5 000 SEK") # → [5000.0]
SwedishFormats.extract_dates("Beslut 2024-03-15") # → ["2024-03-15"]
SwedishFormats.parse_postal_code("Göteborg 413 01") # → "413 01"
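Two of the conventions above are easy to illustrate without the library: Swedish number formatting (space as thousands separator, comma as decimal mark) and the Luhn check digit used by personnummer. The helper names below are assumptions, and the checksum function only verifies the check digit, not that the date part is a real date.

```python
def parse_number(text):
    # Remove regular and non-breaking spaces, then swap the decimal comma.
    cleaned = text.replace("\u00a0", "").replace(" ", "").replace(",", ".")
    return float(cleaned)


def format_number(value, decimals=2):
    # Format US-style first ("1,234,567.89"), then swap the separators.
    us_style = f"{value:,.{decimals}f}"
    return us_style.replace(",", " ").replace(".", ",")


def personnummer_checksum_ok(personnummer):
    digits = [int(ch) for ch in personnummer if ch.isdigit()]
    if len(digits) == 12:
        digits = digits[2:]  # The checksum covers the 10-digit form only.
    if len(digits) != 10:
        return False
    total = 0
    for i, d in enumerate(digits):
        v = d * (2 if i % 2 == 0 else 1)  # Luhn weights 2, 1, 2, 1, ...
        total += v - 9 if v > 9 else v    # Sum the digits of two-digit products.
    return total % 10 == 0


print(parse_number("1 234 567,89"))
print(format_number(1234567.89))
print(personnummer_checksum_ok("811218-9876"))
```

The number `811218-9876` is a commonly cited example with a valid check digit; it is used here only as test input.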
Chunker
from swedish_nlp import SwedishChunker
from swedish_nlp.chunker.chunker import ChunkConfig
# Default config (512 tokens, 50 overlap)
chunker = SwedishChunker()
chunks = chunker.chunk(long_text)
# Custom config
cfg = ChunkConfig(chunk_size=256, chunk_overlap=30, min_chunk_size=40)
chunker = SwedishChunker(cfg)
# With Pinecone metadata for every chunk
chunks = chunker.chunk_document(
    text,
    doc_id = "arsredovisning-goteborg-2023",
    doc_type = "årsredovisning",
    municipality = "Göteborg",
    year = 2023,
    extra_metadata = {"source_url": "https://goteborg.se/doc.pdf"},
)
for c in chunks:
    print(c.index, c.token_estimate, c.text[:80])
    print(c.metadata) # {"doc_id": ..., "doc_type": ..., "municipality": ...}
# Format for Pinecone upsert
vector = c.to_pinecone_dict("vec-001", embedding=[0.1] * 1536)
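The interplay between `chunk_size` and `chunk_overlap` can be seen in a small sliding-window sketch. This version is an assumption for illustration: it treats whitespace-separated words as a stand-in for tokens and ignores the section, paragraph, and sentence boundaries that `SwedishChunker` is described as respecting.

```python
def chunk_words(text, chunk_size=512, overlap=50):
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # The last window already reached the end of the text.
        start += chunk_size - overlap  # Step forward, keeping `overlap` words.
    return chunks


print(chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1))
```

Each chunk after the first repeats the last `overlap` words of its predecessor, which helps retrieval when a relevant passage straddles a chunk boundary.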
Vector pipeline (Pinecone + OpenAI)
from swedish_nlp.vectors import SwedishVectorPipeline
# Requires: pip install "swedish-nlp-utils[vectors]"
# Requires: PINECONE_API_KEY and OPENAI_API_KEY in environment
pipeline = SwedishVectorPipeline(
    index_name = "sockartan-documents",
    namespace = "protokoll",
)
# Index a document (chunk → embed → upsert in one call)
n_vectors = pipeline.chunk_and_upsert(
    text = full_protocol_text,
    doc_id = "protokoll-goteborg-2024-03",
    doc_type = "protokoll",
    municipality = "Göteborg",
    year = 2024,
)
# Semantic search
results = pipeline.search("socialtjänstens insatser för barn", top_k=5)
for r in results:
    print(r.score, r.municipality, r.text[:100])
# Filtered search
results = pipeline.search_with_filter(
    "budget underskott",
    doc_type = "årsredovisning",
    municipality = "Göteborg",
    year = 2023,
)
# Management
pipeline.delete_by_doc_id("protokoll-goteborg-2024-03")
stats = pipeline.get_index_stats()
print(stats["total_vectors"], stats["namespaces"])
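Filtered search against Pinecone relies on a metadata filter expressed as a dictionary of field conditions such as `{"year": {"$eq": 2023}}`. How `search_with_filter` builds that dictionary is not documented here, so the helper below is only a guess at the general shape; `build_metadata_filter` is a hypothetical name.

```python
def build_metadata_filter(**fields):
    # Skip fields left as None so they do not constrain the query.
    return {
        key: {"$eq": value}
        for key, value in fields.items()
        if value is not None
    }


print(build_metadata_filter(doc_type="årsredovisning", municipality=None, year=2023))
```

Multiple top-level keys in a Pinecone filter are combined with a logical AND, so every supplied field must match.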
CLI
# Named entity extraction
swe-nlp analyze "Socialnämnden fattade beslut enligt SoL 4 kap. 1 §"
# From file, JSON output
swe-nlp analyze --file document.txt --format json
# Chunk a document
swe-nlp chunk --file arsredovisning.txt --size 256 --show-tokens
# Normalize OCR output and law references
swe-nlp normalize --file ocr_output.txt
swe-nlp normalize "Socialtjänstlagen 4 kap § 1" --municipality
Environment variables
| Variable | Module | Required |
|---|---|---|
| OPENAI_API_KEY | SwedishVectorPipeline | Yes, when using the vector pipeline |
| PINECONE_API_KEY | SwedishVectorPipeline | Yes, when using the vector pipeline |
| PINECONE_INDEX_NAME | SwedishVectorPipeline | No (default: swedish-docs) |
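A start-up check along these lines can surface missing keys before any API call is made. The function name and return shape are assumptions; only the variable names and the `swedish-docs` default come from the table above.

```python
import os

REQUIRED = ("OPENAI_API_KEY", "PINECONE_API_KEY")


def load_vector_settings(env=None):
    # Accept any mapping so the function is easy to test; default to os.environ.
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "openai_api_key": env["OPENAI_API_KEY"],
        "pinecone_api_key": env["PINECONE_API_KEY"],
        # Fall back to the documented default index name.
        "index_name": env.get("PINECONE_INDEX_NAME", "swedish-docs"),
    }
```

Calling `load_vector_settings()` with no argument reads the real process environment and raises `RuntimeError` if either API key is absent.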
Package structure
swedish_nlp/
├── __init__.py ← Public API surface
├── cli.py ← swe-nlp CLI (analyze, chunk, normalize)
├── py.typed ← PEP 561 typed marker
├── stopwords/
│   └── stopwords.py ← SwedishStopwords (5 domains)
├── normalizer/
│   └── normalizer.py ← SwedishNormalizer (municipality/authority/law/OCR)
├── ner/
│   └── ner.py ← SwedishNER (rule-based + optional spaCy)
├── formats/
│   └── formats.py ← SwedishFormats (dates/numbers/SEK/personnummer)
├── chunker/
│   └── chunker.py ← SwedishChunker (section/paragraph/sentence/token)
└── vectors/
    └── vectors.py ← SwedishVectorPipeline (Pinecone + OpenAI)
VNV/tests/
├── test_stopwords.py ← 19 tests
├── test_normalizer.py ← 30 tests
├── test_ner.py ← 25 tests
├── test_formats.py ← 38 tests
└── test_chunker.py ← 17 tests
Extending
Add terms to a stopword domain:
from swedish_nlp.stopwords.stopwords import _DOMAIN_MAP, Domain
# Extend the term set of a domain that already exists in _DOMAIN_MAP
_DOMAIN_MAP[Domain.MEDICAL].update({"ny_term", "annan_term"})
Add a new authority alias:
# In normalizer/normalizer.py
_AUTHORITY_CANONICAL["ny myndighet"] = "NM"
Add a new law abbreviation:
# In normalizer/normalizer.py
_LAW_ALIASES["ny lagtext"] = "NL"
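To show why adding an entry to an alias table is enough to change normalisation in running text, here is a self-contained sketch of a table-driven replacer. The dictionary contents and the `build_replacer` helper are assumptions for illustration; the library's private `_LAW_ALIASES` may be structured differently.

```python
import re

# Hypothetical alias table: lower-cased law name -> canonical abbreviation.
LAW_ALIASES = {
    "socialtjänstlagen": "SoL",
    "föräldrabalken": "FB",
    "förvaltningslagen": "FL",
}


def build_replacer(aliases):
    # Longest names first, so a longer alias wins over a shorter prefix.
    names = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, names)) + r")\b", re.IGNORECASE
    )

    def replace(text):
        return pattern.sub(lambda m: aliases[m.group(1).lower()], text)

    return replace


LAW_ALIASES["ny lagtext"] = "NL"  # New entry, picked up when the replacer is built.
replace = build_replacer(LAW_ALIASES)
print(replace("Enligt socialtjänstlagen och föräldrabalken..."))
```

In this sketch the pattern is compiled once from the table, so entries must be added before `build_replacer` is called.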
© 2025 Trollfabriken AITrix AB — MIT License