SoftMatcha
A soft and fast pattern matcher for billion-scale corpora.
Installation
You can install from PyPI:
pip install softmatcha
For development, you can install from source with uv:
git clone https://github.com/softmatcha/softmatcha.git
cd softmatcha/
uv sync
or pip:
git clone https://github.com/softmatcha/softmatcha.git
cd softmatcha/
pip install -e ./
macOS
Before running pip install, you need to set up the required libraries and environment variables:
brew install pkg-config icu4c
export CFLAGS="-std=c++11"
export PATH="$(brew --prefix)/opt/icu4c/bin:$(brew --prefix)/opt/icu4c/sbin:$PATH"
export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:$(brew --prefix)/opt/icu4c/lib/pkgconfig"
pip install softmatcha
Quick start
SoftMatcha implements two search types: scan and index.
- Scan: searches texts without indexing or any other preprocessing, like grep; useful for small corpora.
- Index: searches texts with a prebuilt index; works effectively on billion-scale corpora.
Scan: softmatcha-grep
softmatcha-grep searches corpora without indexing:
$ softmatcha-grep "the jazz musician" corpus.txt
The first argument is the pattern string, and the remaining arguments are the file or files to search.
Run softmatcha-grep -h to list the other options.
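To make the idea of a soft scan concrete, here is a minimal Python sketch. It is not SoftMatcha's implementation: the toy 3-dimensional vectors, the names similarity and soft_scan, and the threshold values are all invented for illustration. It slides the pattern over the tokens and accepts a window when every aligned word pair has cosine similarity at or above the threshold.

```python
import math

# Toy word vectors, invented for this example (real embeddings come from
# a backend such as gensim, fastText, or transformers).
TOY_VECTORS = {
    "the": (1.0, 0.0, 0.0),
    "a": (0.9, 0.1, 0.0),
    "jazz": (0.0, 1.0, 0.0),
    "blues": (0.0, 0.9, 0.2),
    "musician": (0.0, 0.0, 1.0),
    "singer": (0.1, 0.0, 0.9),
    "plays": (0.5, 0.5, 0.5),
    "piano": (0.0, 0.5, 0.5),
}


def similarity(u, v):
    """Cosine similarity of two words; unknown words only match themselves."""
    if u == v:
        return 1.0
    if u not in TOY_VECTORS or v not in TOY_VECTORS:
        return 0.0
    a, b = TOY_VECTORS[u], TOY_VECTORS[v]
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def soft_scan(pattern, text, threshold):
    """Return start positions where every pattern word softly matches."""
    pat = pattern.lower().split()
    tokens = text.lower().split()
    hits = []
    for start in range(len(tokens) - len(pat) + 1):
        window = tokens[start:start + len(pat)]
        if all(similarity(p, t) >= threshold for p, t in zip(pat, window)):
            hits.append(start)
    return hits


text = "a blues singer plays piano with the jazz musician"
print(soft_scan("the jazz musician", text, threshold=0.9))  # [0, 6]
print(soft_scan("the jazz musician", text, threshold=1.0))  # [6]
```

With the lower threshold, "a blues singer" is accepted as a soft match for "the jazz musician"; raising the threshold to 1.0 reduces the search to exact matching.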
Index: softmatcha-index and softmatcha-search
softmatcha-index builds a search index from corpora:
$ softmatcha-index --index corpus.idx corpus.txt
softmatcha-search quickly searches patterns with a search index:
$ softmatcha-search --index corpus.idx "the jazz musician"
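The following Python sketch illustrates how an inverted index can support soft matching in principle. It is a toy under stated assumptions, not SoftMatcha's index format or algorithm: the vectors and the names build_index, soft_lookup, and soft_search are hypothetical. Each pattern word is expanded to all indexed words whose similarity reaches the threshold, and the resulting position sets are shifted and intersected to find phrase matches.

```python
import math
from collections import defaultdict

# Toy word vectors, invented for this example.
TOY_VECTORS = {
    "the": (1.0, 0.0, 0.0),
    "a": (0.9, 0.1, 0.0),
    "jazz": (0.0, 1.0, 0.0),
    "blues": (0.0, 0.9, 0.2),
    "musician": (0.0, 0.0, 1.0),
    "singer": (0.1, 0.0, 0.9),
    "plays": (0.5, 0.5, 0.5),
    "piano": (0.0, 0.5, 0.5),
}


def similarity(u, v):
    """Cosine similarity of two words; unknown words only match themselves."""
    if u == v:
        return 1.0
    if u not in TOY_VECTORS or v not in TOY_VECTORS:
        return 0.0
    a, b = TOY_VECTORS[u], TOY_VECTORS[v]
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def build_index(tokens):
    """Map each distinct token to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return dict(index)


def soft_lookup(index, word, threshold):
    """Collect positions of every indexed word similar enough to `word`."""
    positions = set()
    for vocab_word, posting in index.items():
        if similarity(word, vocab_word) >= threshold:
            positions.update(posting)
    return positions


def soft_search(index, pattern, threshold):
    """Return start positions of soft phrase matches using only the index."""
    candidates = None
    for offset, word in enumerate(pattern.lower().split()):
        # Shift each position back so it names the start of a candidate match.
        starts = {p - offset for p in soft_lookup(index, word, threshold)}
        candidates = starts if candidates is None else candidates & starts
    return sorted(candidates or [])


tokens = "a blues singer plays piano with the jazz musician".split()
index = build_index(tokens)
print(soft_search(index, "the jazz musician", threshold=0.9))  # [0, 6]
```

Unlike a scan, the search cost here depends on the vocabulary size and the posting lists rather than on a pass over the whole text, which is the general reason indexing pays off on large corpora.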
Options
For development purposes,
- --profile=true measures the execution time.
- --log outputs verbose information.
For searchers,
- --backend {gensim,fasttext,transformers}: Backend framework for embeddings.
- --model <NAME>: Name of word embeddings.
- --threshold specifies the threshold for soft matching.
For controlling outputs,
- -n, --line_number prints line numbers with output lines.
- -o, --only_matching outputs only the matched patterns.
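When calling the tool from a script, the options above can be assembled into a command line. The sketch below only builds and prints the command; it does not run it. The helper name grep_command is hypothetical, the threshold value 0.5 is an arbitrary example, and passing each option value as a separate token before the pattern is an assumption; check softmatcha-grep -h for the exact syntax.

```python
import shlex


def grep_command(pattern, files, *, backend=None, model=None,
                 threshold=None, line_number=False, only_matching=False):
    """Assemble an argv list for softmatcha-grep (assumed option syntax)."""
    argv = ["softmatcha-grep"]
    if backend is not None:
        argv += ["--backend", backend]
    if model is not None:
        argv += ["--model", model]
    if threshold is not None:
        argv += ["--threshold", str(threshold)]
    if line_number:
        argv.append("-n")
    if only_matching:
        argv.append("-o")
    argv.append(pattern)   # first positional argument: the pattern string
    argv.extend(files)     # remaining arguments: files to search
    return argv


cmd = grep_command("the jazz musician", ["corpus.txt"],
                   backend="fasttext", threshold=0.5, line_number=True)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd) once the option syntax has been confirmed against the tool's help output.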
List of implementations
Embeddings
- gensim
- fastText
- transformers (embedding layers)
Searchers
Scan: softmatcha-grep
- Naive search: --search naive
- Quick search (default): --search quick
Index: softmatcha-index and softmatcha-search
- Inverted index search
Citation
If you use this software, please cite:
@inproceedings{
deguchi-iclr-2025-softmatcha,
title={SoftMatcha: A Soft and Fast Pattern Matcher for Billion-Scale Corpus Searches},
author={Deguchi, Hiroyuki and Kamoda, Go and Matsushita, Yusuke and Taguchi, Chihiro and Waga, Masaki and Suenaga, Kohei and Yokoi, Sho},
booktitle={The Thirteenth International Conference on Learning Representations (ICLR 2025)},
year={2025},
url={https://openreview.net/forum?id=Q6PAnqYVpo}
}
License
This software is mainly developed by Hiroyuki Deguchi and published under the MIT License.