Creates sequence data for pretraining and benchmarking sequence models
Seqthetic
This tool generates synthetic sequence data to help test ideas about pretraining sequence models on synthetic data. It is used in the meta-language repo (coming soon).
Features:
- Diversity: supports generating data following various patterns, including fractional Brownian motion (fBm), LIME (TODO), TILT (TODO), and synthetic pretraining tasks.
- Spec-driven: everything about the dataset is described by a spec, which helps with documenting each ablation and with high-level manipulation.
- Reproducibility: processes involving randomness have their seeds recorded in the spec file. This means you can transfer the dataset by sharing only the spec file and regenerating the data from it.
Installation
```shell
pip install -e .
```
Usage
Generation
To generate a synthetic dataset, write a spec and use a Synthesizer to make the dataset. For details on specs, see Concepts below:
```python
# assuming top-level imports; adjust to the actual module layout
from seqthetic import SynthesisSpec, Synthesizer

# write the spec
spec = SynthesisSpec(...)
# pass it to the synthesizer
szer = Synthesizer(spec)
# call make_dataset
dataset = szer.make_dataset()
# save the dataset (or call dataset.save())
szer.save_dataset()
```
You will get a JSON file and a CSV file. The JSON file stores the spec and ends with .sqf.json; the CSV stores the dataset. Their names come from the name field in the spec, or a unique id if no name is given.
Save & load
Please make sure the spec JSON file and the CSV file are in the same directory.
```python
# assuming top-level imports; adjust to the actual module layout
from seqthetic import SynthesisSpec, Dataset

# pass the name of the spec; this only loads the spec
spec = SynthesisSpec.load('ABC')
# pass the name of the dataset; this loads both the spec and the dataset
# (use the seqthetic.Dataset class)
dataset = Dataset.load('ABC')
spec_in_dataset = dataset.spec
```
Creating New Dependency
Creating a new dependency has several requirements:
- Add a `generator` field: `generator: str = 'xxx'`, where `xxx` is the name of the generation method you will use. This field discriminates between different dependencies when parsing spec files.
- Add a `custom_seed_schema` wrapped with `SchemaList`, e.g. `custom_seed_schema = SchemaList(['hurst', 'dependency'])`, and record every seed used for random sampling. `custom_seed_schema` is used for storing seeds and loading them into the dependency.
- Add a `metadata_schema` to specify what will be stored in the metadata field of the `Dataset`. This is not enforced but helps with documentation.
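As an illustration, a minimal class satisfying the three requirements might look like the sketch below. The plain dataclass and the stand-in `SchemaList` are assumptions for illustration, not seqthetic's actual base classes:

```python
from dataclasses import dataclass, field

class SchemaList(list):
    """Stand-in for seqthetic's SchemaList wrapper (illustrative only)."""

@dataclass
class MyDependency:
    # discriminates this dependency type when parsing spec files
    generator: str = "my_generator"
    # names of every seed used for random sampling in this dependency
    custom_seed_schema: SchemaList = field(
        default_factory=lambda: SchemaList(["hurst", "dependency"])
    )
    # documents what goes into the Dataset's metadata field (not enforced)
    metadata_schema: list = field(default_factory=lambda: ["num_dependency"])
```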
Register Dependency
If you want to use a custom dependency in the spec, register it with SynthesisSpec.register_dependency:
```python
SynthesisSpec.register_dependency(MyDependency)
```
Concepts
The synthesis spec employs several concepts to enable flexible generation of datasets:
- Vocabulary: all sequences are simply series of vocabulary items, which are integers. The frequency of each vocabulary item can be specified; for details see the Vocabulary section.
- Domain: a dataset can be composed of a number of domains with different characteristics, such as the length distribution and the dependency pattern (see below). This is similar to a natural-language pretraining corpus containing various kinds of data: news, code, arXiv papers, etc. Each domain has a `mixture_ratio` option which determines how many tokens it accounts for in the whole dataset.
- Dependency: a domain is mostly defined by the dependency of its sequences, i.e. the occurrence pattern of tokens. For example, the sequence "abcdabcd" is defined by repeating its first half; it doesn't matter which sequence is repeated, only the structure. We hypothesize that learning the dependency, by properly storing and retrieving tokens, is central to the various abilities of language models, such as in-context learning.
- Mapping: though the dependency defines a domain, it needs to be realized as a series of tokens from the vocabulary, which is specified by the `mapping` option. Dependencies can be mapped according to their frequency in the sequence, and one can split or duplicate them to create multiple sequences from one series of dependencies.
The process is:
```python
for domain in domains:
    dependencies = domain.dependency.make_dependency()
    # each dependency series is then mapped to vocabulary tokens per the mapping option
```
Classes
SynthesisSpec
Dataset
Vocabulary
We support the following vocabulary distributions:
- Zipf Vocabulary: Zipf's law says the frequency of any word is inversely proportional to its rank in the frequency table; here we use the Zipf-Mandelbrot law for generality: $\text{frequency} \propto \frac{1}{(\mathrm{rank}+\beta)^{\alpha}}$.
- Uniform Vocabulary: every vocabulary item has the same frequency.
- Loglinear Vocabulary (TODO): as applied in the paper.
- Corpus Vocabulary (TODO): a vocabulary with each frequency specified explicitly, often computed from a real corpus.
To create more realistic distributions, an optional DistributionNoise can be added to them. Noise can be additive or multiplicative.
For example:
```python
zipf_vocab = ZipfVocabulary(size=1000, alpha=1, beta=2.7)
uniform_vocab_with_noise = UniformVocabulary(
    size=2000,
    noise=DistributionNoise(
        type='additive',
        level=0.01,
    ),
)
```
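The Zipf-Mandelbrot weights can be sketched numerically as follows. The `alpha`/`beta` names mirror the `ZipfVocabulary` example above, but the normalization into probabilities is an assumption about how the frequencies are used, not seqthetic's actual code:

```python
import numpy as np

def zipf_mandelbrot_freqs(size: int, alpha: float, beta: float) -> np.ndarray:
    """Return normalized frequencies: freq(rank) ∝ 1 / (rank + beta) ** alpha."""
    ranks = np.arange(1, size + 1)
    weights = 1.0 / (ranks + beta) ** alpha
    return weights / weights.sum()

freqs = zipf_mandelbrot_freqs(1000, alpha=1.0, beta=2.7)
```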
Dependency
We support the following dependency generators:
- FBMDependency: the dependency is a discretized sample of fractional Brownian motion (fBm). This is inspired by the hypothesis that language possesses fractal structure; fBm is an easy way to construct fractal sequences with a given fractal metric, the Hurst exponent.
- RandomDependency: the dependency is randomly sampled from a normal distribution. Mainly used as a baseline.
- FunctionDependency: the dependency is a discretized function specified by the user. For example, one can use $\sin(x)$ to create a periodic dependency.
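As a rough illustration of a function dependency, the sketch below discretizes sin(x) into integer dependency values. The helper name, bin count, and input range are illustrative assumptions, not seqthetic's actual API:

```python
import numpy as np

def discretize_function(fn, length: int, num_values: int) -> np.ndarray:
    """Sample fn on a grid and bin its values into num_values integer levels."""
    x = np.linspace(0, 4 * np.pi, length)
    y = fn(x)
    # map continuous values into integer bins 0 .. num_values - 1
    bins = np.linspace(y.min(), y.max(), num_values + 1)
    return np.clip(np.digitize(y, bins) - 1, 0, num_values - 1)

dep = discretize_function(np.sin, length=200, num_values=16)
```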
Mapping
Mapping contains the following options:
- `sample_by`: how to sample vocabulary items. The choices are `frequency` and `random`: `frequency` samples based on vocabulary frequency, while `random` samples with no regard to frequency.
- `map_by`: strategy for mapping dependencies to vocabulary. The choices are `frequency` and `random`: `frequency` maps higher-frequency dependencies to sampled vocabulary items with higher probability, while `random` maps dependencies to vocabulary randomly.

For example, the dependency sequence 333221 has three dependency values: 1, 2, 3. For this sequence we sample three vocabulary items, a: 0.3, b: 0.2, c: 0.1, where the numbers are probabilities. Under the frequency mapping strategy we map 3 to a, 2 to b, and 1 to c.
Note: we don't consider mapping multiple dependencies to one vocabulary item or vice versa, as that would break the dependency structure; such variation can be specified more cleanly with additional domains or a Range on fields such as the Hurst exponent.
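The frequency mapping strategy from the example above can be sketched in a few lines. `map_by_frequency` is a hypothetical helper written for illustration, not part of the library:

```python
from collections import Counter

def map_by_frequency(dependency, vocab_by_prob):
    """Map each dependency value to a vocabulary item, most frequent first.

    vocab_by_prob: vocabulary items sorted by descending sampling probability.
    """
    ranked = [value for value, _ in Counter(dependency).most_common()]
    table = dict(zip(ranked, vocab_by_prob))
    return [table[d] for d in dependency]

# 3 is most frequent so it maps to "a", then 2 to "b", 1 to "c"
tokens = map_by_frequency([3, 3, 3, 2, 2, 1], ["a", "b", "c"])
```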
Seed
Creating synthetic data involves a lot of random sampling, so to ensure reproducibility, we record seeds for random generators used by vocabulary sampling and dependency generation for each domain. We use np.random.SeedSequence.entropy to generate seeds.
The main method of Seed class is get_rng, which instantiates a numpy random generator for sampling:
```python
# get a random generator for a given seed name
rng = seed.get_rng('dependency')
# get a list of random generators spawned from the given seed
rngs = seed.get_rng('dependency', 3)
assert len(rngs) == 3
# return_list=True forces a list result even when the count is a variable
rngs = seed.get_rng('dependency', num_sequence, return_list=True)
assert type(rngs) == list
```
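The spawning behavior that get_rng presumably builds on can be shown with NumPy directly; the entropy value and spawn count here are arbitrary examples:

```python
import numpy as np

# a SeedSequence can deterministically spawn independent child sequences
root = np.random.SeedSequence(entropy=12345)
rngs = [np.random.default_rng(s) for s in root.spawn(3)]

# the same entropy yields the same spawned streams,
# which is what makes a recorded spec reproducible
root2 = np.random.SeedSequence(entropy=12345)
rngs2 = [np.random.default_rng(s) for s in root2.spawn(3)]
assert rngs[0].random() == rngs2[0].random()
```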
Range
When specifying dependencies, Range can be used in fields to specify a distribution over a range of values to improve diversity. A similar class, FlexibleRange, is used where both a single number and a Range are allowed; a single-number input is converted to a Range.
```python
# input with Range
dep = RandomDependency(
    num_dependency=Range(min=10, max=16),
    sequence_length=Range(min=200, max=1000),
)
# single-number input is converted to Range
dep_num = RandomDependency(
    num_dependency=16,
    sequence_length=1000,
)
assert isinstance(dep_num.num_dependency, Range)
```
Vary
The space of possible specs is immense, which makes it necessary to explore different combinations of parameters. The vary function creates, from a base SynthesisSpec, different specs with some parameters changed according to a Variation; these specs are saved to a SynthesisSpecGroup. You can save the group file and the specs separately.
Variation
- To vary total_token, you can use compute_ops like `Mul`, `Div`, `Add`, `Sub`. You can also specify a number directly:
```python
assert spec.total_token == 2000
group = vary(spec, Variation(total_token=[Mul(2), Add(2000), Div(2), Sub(1000), 5000]))
# base total_token multiplied by 2
assert group.specs[0].total_token == 4000
# base total_token plus 2000
assert group.specs[1].total_token == 4000
# base total_token divided by 2
assert group.specs[2].total_token == 1000
# base total_token minus 1000
assert group.specs[3].total_token == 1000
# total_token set directly to 5000
assert group.specs[4].total_token == 5000
```
- To vary the mixture_ratio of the domains, a list of lists of mixture ratios must be used. Each inner list must match the number of domains in the base spec:
```python
vary(spec, Variation(mixture=[[0.1, 0.3, 0.6], [0.2, 0.4, 0.4]]))
```
- Operations on domains are more diverse and are deferred to Domain Operation below.
Domain Operation
There are several basic domain operations:
- Vary: vary the domain's dependency or mapping parameters.
- Insert: add a new domain at the specified position.
- Remove: remove the domain at the specified position.
- Replace: replace the domain at the specified position.
- Shuffle: shuffle the order of the domains.
- ChangeSeed: change the seed of a domain.
One can choose two combination patterns:
- Zip: like the zip function in Python, e.g. `zip([1, 2], [3, 4])` yields `(1, 3), (2, 4)`. Useful for applying multiple actions to one domain at the same time.
- Product: like the `itertools.product` function, e.g. `product([1, 2], [3, 4])` yields `(1, 3), (1, 4), (2, 3), (2, 4)`. Useful for applying multiple actions to different domains at the same time.
For example:
```python
Zip(
    ops=[
        Vary(domain=0, dependency={
            'hurst': [0.5, 0.6],
        }),
        Vary(domain=1, dependency={
            'num_dependency': [Range(min=10, max=20)],
        }),
    ]
)
```
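The pairing behavior of the two patterns can be illustrated with plain Python; the domain-op classes above are seqthetic's, and only the combination logic is shown here:

```python
from itertools import product

# hypothetical parameter lists for two domain operations
hurst_values = [0.5, 0.6]
num_deps = [10, 20]

# Zip pairs values position by position: one combined action per index
zipped = list(zip(hurst_values, num_deps))
# Product crosses every value with every other: all combinations
crossed = list(product(hurst_values, num_deps))
```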
Roadmap
- [ ] tests
- [ ] vary stress test
- [ ] spec reproducibility
- [ ] dependency combination
- [ ] function dependency
- [ ] file related
- dependencies
- [-] dynamically register dependency (spec metadata)
- add seq_op dependencies from synthetic_pretraining
- bracket, dyck
- LIME
- DFS automata/transducer deduction/induction
- arithmetic
- math derivations
- cellular automata
- dynamical system
- discretized IFS
- sine function and variants
- multifractional brownian motion
- fractional brownian field
- merge
- [-] spec_group
- [-] generate
- [-] save
- notebooks
- fractal, fbm, mbm, discretize, bincount
- dependency, frequency
- vocab
- mapping
- [ ]vocab
- loglinear
- corpus vocab
- domain vocab
- evolution
- synonyms, antonyms, supernyms
- mapping
- multiple
- clip
- dataloader related?
- fix Range validation?