The Nordic Pile
The Nordic Pile replication code
The Nordic Pile is a repository that provides tools and code to download and replicate a large Nordic-language dataset. The dataset combines many smaller datasets, with the objective of covering a broad set of language modalities.
Workflow
To propose that a new dataset be added to the Nordic Pile, open an issue. Your issue should include a description of the dataset, its size, what language(s) it is in, a link to the data, and any other relevant information. If a project manager approves your proposal, they will update the issue's label to mark it as accepted. Datasets that we elect not to include in the current version of the Pile will be labeled accordingly. We currently focus on datasets in the Nordic languages: Swedish, Danish, Norwegian and Finnish.
To claim responsibility for implementing an unclaimed dataset,
leave a comment on one of our unassigned issues. Once a dataset
has been assigned to you, make the necessary changes to datasets.py
and pile.py
in a fork and submit a pull request. If you require, you can also
submit a script for processing the data as shown here.
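As a rough illustration of what a dataset implementation might look like, here is a minimal sketch. The actual interfaces in datasets.py and pile.py are not shown in this README, so every name below (Dataset, InMemoryDataset, DATASETS, and the method names) is hypothetical; check the repository for the real ones before writing your pull request.

```python
from abc import ABC, abstractmethod


class Dataset(ABC):
    """Hypothetical base class; the real interface in datasets.py may differ."""

    @abstractmethod
    def name(self):
        """Return a human-readable dataset name."""

    @abstractmethod
    def documents(self):
        """Yield (text, meta) pairs, one per document."""


class InMemoryDataset(Dataset):
    """Toy dataset backed by a list of strings, used only for illustration."""

    def __init__(self, name, language, texts):
        self._name = name
        self._language = language
        self._texts = list(texts)

    def name(self):
        return self._name

    def documents(self):
        for text in self._texts:
            # Attach minimal metadata so the source of each document is traceable.
            yield text, {"dataset": self._name, "language": self._language}


# Hypothetical registry, standing in for whatever list pile.py combines.
DATASETS = [
    InMemoryDataset("Example-Swedish", "sv", ["Hej världen."]),
]
```

A real implementation would download and parse the source data inside documents() instead of holding it in memory.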
To raise an issue that is not proposing a new dataset, open an issue with the appropriate tag.
Data ready for final implementation should meet the following criteria:
- The data must be in lm_dataformat format.
- The data must be shuffled.
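The two criteria above can be sketched as follows. This is a minimal standard-library stand-in, not a call into lm_dataformat itself: it assumes each document is a (text, meta) pair, shuffles with a fixed seed so the order is reproducible, and writes one JSON record per line. The function names write_shuffled_jsonl and read_jsonl are hypothetical, and the final output should still be produced with the lm_dataformat library.

```python
import json
import random
from pathlib import Path


def write_shuffled_jsonl(documents, out_path, seed=0):
    """Shuffle (text, meta) pairs reproducibly and write one JSON record per line."""
    docs = list(documents)  # copy so the caller's sequence is left untouched
    random.Random(seed).shuffle(docs)  # fixed seed makes the order reproducible
    with Path(out_path).open("w", encoding="utf-8") as f:
        for text, meta in docs:
            # ensure_ascii=False keeps characters such as å, ä, ö, æ, ø readable.
            f.write(json.dumps({"text": text, "meta": meta}, ensure_ascii=False) + "\n")
    return len(docs)


def read_jsonl(path):
    """Yield (text, meta) pairs back from a file written by write_shuffled_jsonl."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["text"], record["meta"]
```

Shuffling a copy with a dedicated random.Random instance avoids touching global random state, so running the script twice with the same seed yields the same document order.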
Attribution
This initiative is heavily inspired by EleutherAI's The Pile project.
https://www.eleuther.ai/
https://pile.eleuther.ai/
Datasets
Dataset | Status |
---|---|
Wikipedia-Swedish | 🙋♀️ Waiting for contributor |
Wikipedia-Danish | 🙋♀️ Waiting for contributor |
Wikipedia-Norwegian | 🙋♀️ Waiting for contributor |
Wikipedia-Finnish | 🙋♀️ Waiting for contributor |
Swedish Parliament | 🙋♀️ Waiting for contributor |