RedPajama-Data: An Open Source Recipe to Reproduce LLaMA training dataset

This repo contains a reproducible data recipe for the RedPajama data, with the following token counts:

Dataset         Token Count
Commoncrawl     878 Billion
C4              175 Billion
GitHub          59 Billion
Books           26 Billion
ArXiv           28 Billion
Wikipedia       24 Billion
StackExchange   20 Billion
Total           1.2 Trillion
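
As a quick sanity check on the table, the per-dataset counts can be summed and turned into shares of the total. This is a minimal sketch: the numbers are copied from the table, while the names `TOKEN_COUNTS_B`, `total_tokens_b`, and `shares` are illustrative and not part of the repo.

```python
# Token counts in billions, copied from the table above.
TOKEN_COUNTS_B = {
    "Commoncrawl": 878,
    "C4": 175,
    "GitHub": 59,
    "Books": 26,
    "ArXiv": 28,
    "Wikipedia": 24,
    "StackExchange": 20,
}


def total_tokens_b(counts):
    """Return the total token count in billions."""
    return sum(counts.values())


def shares(counts):
    """Return each dataset's fraction of the total token count."""
    total = total_tokens_b(counts)
    return {name: n / total for name, n in counts.items()}


total = total_tokens_b(TOKEN_COUNTS_B)
print(f"Total: {total} billion (about {total / 1000:.1f} trillion)")
# List datasets from largest to smallest share.
for name, frac in sorted(shares(TOKEN_COUNTS_B).items(), key=lambda kv: -kv[1]):
    print(f"{name:<15}{frac:6.1%}")
```

The rows add up to 1,210 billion tokens, which matches the rounded total of 1.2 trillion.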

Data Preparation

In data_prep, we provide all pre-processing scripts and guidelines.

Tokenization

In tokenization, we provide an example of how to tokenize the dataset using the GPT-NeoX tokenizer.
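
For orientation, here is a minimal sketch of counting tokens with the GPT-NeoX tokenizer. It assumes the Hugging Face `transformers` package and the `EleutherAI/gpt-neox-20b` checkpoint name, neither of which is stated in this README; the script in the tokenization directory may work differently. The helper names `load_neox_tokenizer` and `count_tokens` are hypothetical.

```python
def load_neox_tokenizer(name="EleutherAI/gpt-neox-20b"):
    """Load a GPT-NeoX tokenizer from the Hugging Face Hub.

    Assumption: `transformers` is installed and the Hub is reachable.
    The import is done lazily so the rest of the module works without it.
    """
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(name)


def count_tokens(texts, tokenizer):
    """Return the total number of token ids produced for all texts.

    `tokenizer` is any callable that maps a string to a mapping with an
    "input_ids" entry, which is what Hugging Face tokenizers return.
    """
    return sum(len(tokenizer(text)["input_ids"]) for text in texts)


# Example (downloads the tokenizer files on first use):
#   tokenizer = load_neox_tokenizer()
#   print(count_tokens(["RedPajama is an open dataset."], tokenizer))
```

Passing the tokenizer in as an argument keeps the counting logic independent of how the tokenizer was loaded, so it can be exercised offline with a stand-in.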

Visualization

In viz, we provide a dashboard for exploring a subset of the data using Meerkat.

License

The code in this repo is licensed under the Apache 2.0 license. Unless otherwise noted:

Copyright 2023 Together Computer, ETH Zürich, Stanford University

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

The file data_prep/book/dedup.py was co-developed with Ontocord.ai.

Copyright 2023 Ontocord.ai, Together Computer, ETH Zürich, Stanford University

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

For the dataset itself, please refer to the licenses of the data subsets you use.

For full terms, see the LICENSE file. If you have any questions, comments, or concerns about licensing, please contact us.

Acknowledgement

We are grateful for the work done by the growing open-source AI community that made this project possible.

Download files

Source Distribution

redpajama-0.0.1.tar.gz (85.4 kB)

Uploaded Source

Built Distribution

redpajama-0.0.1-py3-none-any.whl (108.2 kB)

Uploaded Python 3

File details

Details for the file redpajama-0.0.1.tar.gz.

File metadata

  • Download URL: redpajama-0.0.1.tar.gz
  • Upload date:
  • Size: 85.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for redpajama-0.0.1.tar.gz
Algorithm     Hash digest
SHA256        a0130e17eef5e38bed23407aad2f1a0958bbc3350c6b2989234a0aef8fd0bd18
MD5           0d10e73644a6f0ca3193b1e0dbf40d5a
BLAKE2b-256   a7e5f9f4996ea4ddf697e006e0a10f4398f0685e87ae1ff35d685491afb46f76

File details

Details for the file redpajama-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: redpajama-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 108.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for redpajama-0.0.1-py3-none-any.whl
Algorithm     Hash digest
SHA256        db96f6a78501d6fde9df763920b8ddda95f92ecb3d3764482c69dc1d743d57df
MD5           639a32b5395d618c5ac9399ef0332957
BLAKE2b-256   5bb532b8ef65ce168fa30c5fa07b2d37e3cf6141dd33e2d1aa63ce4d39a00030
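
A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. The following is a minimal sketch: the digests are copied from the hash tables, while the helper names `sha256_of` and `verify` and the assumption that the files sit in the current directory are illustrative.

```python
import hashlib

# SHA256 digests copied from the "File hashes" tables above.
EXPECTED_SHA256 = {
    "redpajama-0.0.1.tar.gz":
        "a0130e17eef5e38bed23407aad2f1a0958bbc3350c6b2989234a0aef8fd0bd18",
    "redpajama-0.0.1-py3-none-any.whl":
        "db96f6a78501d6fde9df763920b8ddda95f92ecb3d3764482c69dc1d743d57df",
}


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read until f.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's SHA256 digest matches the expected value."""
    return sha256_of(path) == expected_hex.lower()


# Example, assuming the file was downloaded to the current directory:
#   name = "redpajama-0.0.1.tar.gz"
#   print(verify(name, EXPECTED_SHA256[name]))
```

Reading in chunks keeps memory use constant regardless of file size.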
