24 projects
poormanray
A minimal alternative to Ray for distributed data processing on EC2 instances
bonepick
CLI tool for training efficient CPU-based text quality classifiers and annotating data for distillation of classifiers.
dolma-rust-components
Rust components for Dolma - Toolkit for pre-processing LLM training data.
dolma
Toolkit for pre-processing LLM training data.
rusty-dawg
A Rust library for building and querying Directed Acyclic Word Graphs (DAWGs) and Compacted DAWGs (CDAWGs) for efficient string indexing and searching.
cloudflared-tunnel
Start a TryCloudflare Tunnel with a context manager.
openwebmath-text-extract
Text Extractor from OpenWebMath
papermage
Papermage. Casting magic over scientific PDFs.
mmdata
MMData is a toolkit for curating multimodal datasets.
tartare
Data filters
tokreate
Unified APIs for making calls to different LLMs.
quickumls
QuickUMLS is a tool for fast, unsupervised biomedical concept extraction from medical text
smashed
SMASHED is a toolkit designed to apply transformations to samples in datasets, such as field extraction, tokenization, prompting, batching, and more. Supports datasets from Hugging Face, torchdata iterables, or simple lists of dictionaries.
decontext
Pipeline for decontextualization of scientific snippets.
springs
A set of utilities to create and manage typed configuration files effectively, built on top of OmegaConf.
necessary
Python package to enforce optional dependencies
shadow-scholar
🎓🕶️ A collection of utilities and demos from the Semantic Scholar Research Team 🕶️🎓
mmda
MMDA - multimodal document analysis
trouting
Trouting (short for Type Routing) is a simple class decorator that lets you define multiple interfaces for a method, each behaving differently depending on input types.
pyterrier-sentence-transformers
Create a PyTerrier index using any sentence-transformers model
scipdf
multimodal document analysis
espresso-config
A struct config parser that you can set up in the
Minimal-Server
Serve a Python object through a simple socket; supports multiple connections.
quickumls-simstring
Clone of simstring designed to work with QuickUMLS. Original version here: http://chokkan.org/software/simstring/