SoftMatcha

A soft and fast pattern matcher for billion-scale corpora.

Installation

You can install via PyPI:

pip install softmatcha

For development purposes, you can install from source via uv:

git clone https://github.com/softmatcha/softmatcha.git
cd softmatcha/
uv sync

or pip:

git clone https://github.com/softmatcha/softmatcha.git
cd softmatcha/
pip install -e ./

macOS

Before running pip install, you need to set up the required libraries and environment variables:

brew install pkg-config icu4c
export CFLAGS="-std=c++11"
export PATH="$(brew --prefix)/opt/icu4c/bin:$(brew --prefix)/opt/icu4c/sbin:$PATH"
export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:$(brew --prefix)/opt/icu4c/lib/pkgconfig"
pip install softmatcha

Quick start

SoftMatcha implements two search types: scan and index.

  • Scan: searches texts without indexing or any other preprocessing, like grep. This is useful for small corpora.
  • Index: searches texts using a prebuilt index. This works efficiently on billion-scale corpora.

Scan: softmatcha-grep

softmatcha-grep searches corpora without indexing:

$ softmatcha-grep "the jazz musician" corpus.txt

The first argument is the pattern string and the remaining arguments are the file or files to search. Run softmatcha-grep -h to see the other options.

Index: softmatcha-index and softmatcha-search

softmatcha-index builds a search index from corpora:

$ softmatcha-index --index corpus.idx corpus.txt

softmatcha-search uses the search index to find patterns quickly:

$ softmatcha-search --index corpus.idx "the jazz musician"

Options

For development purposes,

  • --profile=true measures the execution time.
  • --log outputs verbose information.

For searchers,

  • --backend {gensim,fasttext,transformers}: Backend framework for embeddings.
  • --model <NAME>: Name of word embeddings.
  • --threshold specifies the threshold for soft matching.
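
To illustrate what a threshold for soft matching can mean, here is a minimal sketch. It is not SoftMatcha's implementation: the 3-D word vectors are invented toy values, soft_find is a hypothetical helper, and the use of cosine similarity as the score is an assumption made for the example. A pattern matches at a position when every pattern word is at least as similar as the threshold to the aligned text word.

```python
import math

# Toy word vectors, invented purely for illustration (not a real embedding model).
TOY_VECTORS = {
    "the": (1.0, 0.0, 0.0),
    "a": (0.9, 0.1, 0.0),
    "jazz": (0.0, 1.0, 0.0),
    "blues": (0.0, 0.9, 0.3),
    "musician": (0.0, 0.0, 1.0),
    "pianist": (0.0, 0.2, 0.9),
}


def similarity(u, v):
    """Cosine similarity of two words; unknown words only match themselves."""
    if u == v:
        return 1.0
    if u not in TOY_VECTORS or v not in TOY_VECTORS:
        return 0.0
    x, y = TOY_VECTORS[u], TOY_VECTORS[v]
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm


def soft_find(pattern, tokens, threshold):
    """Return start positions where every pattern word softly matches the text."""
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(similarity(p, t) >= threshold for p, t in zip(pattern, window)):
            hits.append(start)
    return hits


pattern = "the jazz musician".split()
tokens = "she heard a blues pianist and the jazz musician".split()

print(soft_find(pattern, tokens, 0.9))   # [2, 6]: "a blues pianist" also matches
print(soft_find(pattern, tokens, 0.96))  # [6]: only the exact phrase remains
```

In this sketch, raising the threshold narrows the results toward exact matches, and lowering it admits looser paraphrases.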

For controlling outputs,

  • -n, --line_number prints the line number with each output line.
  • -o, --only_matching outputs only matched patterns.
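
If you want to call the scanner from a script, the options above can be assembled into a command line. The following is a hedged sketch only: build_grep_command is a hypothetical helper, and it assumes that options are passed as space-separated values before the pattern and that --threshold takes a number. Check softmatcha-grep -h for the actual syntax.

```python
import shutil
import subprocess


def build_grep_command(pattern, files, backend=None, model=None,
                       threshold=None, line_number=False, only_matching=False):
    """Assemble an argument list for softmatcha-grep (hypothetical helper)."""
    cmd = ["softmatcha-grep"]
    if backend is not None:
        cmd += ["--backend", backend]           # gensim, fasttext, or transformers
    if model is not None:
        cmd += ["--model", model]               # name of the word embeddings
    if threshold is not None:
        cmd += ["--threshold", str(threshold)]  # assumed to accept a number
    if line_number:
        cmd.append("-n")
    if only_matching:
        cmd.append("-o")
    cmd.append(pattern)
    cmd.extend(files)
    return cmd


def run_grep(cmd):
    """Run the command only if softmatcha-grep is installed; return its stdout."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("softmatcha-grep is not on PATH")
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


cmd = build_grep_command("the jazz musician", ["corpus.txt"],
                         backend="gensim", line_number=True)
print(cmd)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with multi-word patterns.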

List of implementations

Embeddings

Searchers

Scan: softmatcha-grep

  • Naive search: --search naive
  • Quick search (default): --search quick

Index: softmatcha-index and softmatcha-search

  • Inverted index search
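
For readers unfamiliar with inverted indexes, here is a minimal sketch of the general idea, restricted to exact matching. It is not SoftMatcha's index format or algorithm; build_index and search are hypothetical helpers. The index maps each word to the positions where it occurs, so a phrase query intersects position lists instead of scanning the whole text.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each token to the sorted list of positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return index


def search(index, pattern):
    """Return start positions of exact occurrences of the pattern phrase."""
    if not pattern:
        return []
    # Candidate starts are the positions of the first pattern word.
    starts = set(index.get(pattern[0], []))
    for offset, tok in enumerate(pattern[1:], start=1):
        # Shift each later word's positions back to the phrase start and intersect.
        starts &= {p - offset for p in index.get(tok, [])}
    return sorted(starts)


tokens = "the jazz musician met the jazz singer".split()
index = build_index(tokens)

print(search(index, ["the", "jazz"]))              # [0, 4]
print(search(index, ["the", "jazz", "musician"]))  # [0]
print(search(index, ["jazz", "band"]))             # []
```

The index is built once and reused across queries, which is why indexing pays off on large corpora while scanning is simpler for small ones.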

Citation

If you use this software, please cite:

@inproceedings{
  deguchi-iclr-2025-softmatcha,
  title={SoftMatcha: A Soft and Fast Pattern Matcher for Billion-Scale Corpus Searches},
  author={Deguchi, Hiroyuki and Kamoda, Go and Matsushita, Yusuke and Taguchi, Chihiro and Waga, Masaki and Suenaga, Kohei and Yokoi, Sho},
  booktitle={The Thirteenth International Conference on Learning Representations (ICLR 2025)},
  year={2025},
  url={https://openreview.net/forum?id=Q6PAnqYVpo}
}

License

This software is mainly developed by Hiroyuki Deguchi and published under the MIT License.