Data Lake for Multi-Modal AI Search

Deep Lake: Database for AI

Docs • Get Started • API Reference • LangChain & VectorDBs Course • Blog • Whitepaper • Slack • Twitter

What is Deep Lake?

Deep Lake is a Database for AI powered by a storage format optimized for deep-learning applications. Deep Lake can be used for:

  1. Storing and searching data plus vectors while building LLM applications
  2. Managing datasets while training deep learning models

Deep Lake simplifies the deployment of enterprise-grade LLM-based products by offering storage for all data types (embeddings, audio, text, videos, images, DICOM, PDFs, annotations, and more), querying and vector search, data streaming while training models at scale, data versioning and lineage, and integrations with popular tools such as LangChain, LlamaIndex, Weights & Biases, and many more. Deep Lake works with data of any size, is serverless, and lets you store all of your data in one place in your own cloud. Deep Lake is used by Intel, Bayer Radiology, Matterport, ZERO Systems, Red Cross, Yale, and Oxford.

Deep Lake includes the following features:

  • Multi-Cloud Support (S3, GCP, Azure): Use one API to upload, download, and stream datasets to/from S3, Azure, GCP, Activeloop cloud, local storage, or in-memory storage. Compatible with any S3-compatible storage such as MinIO.
  • Native Compression with Lazy NumPy-like Indexing: Store images, audio, and videos in their native compression. Slice, index, iterate, and interact with your data like a collection of NumPy arrays in your system's memory. Deep Lake lazily loads data only when needed, e.g., when training a model or running queries.
  • Dataloaders for Popular Deep Learning Frameworks: Deep Lake comes with built-in dataloaders for PyTorch and TensorFlow. Train your model with a few lines of code; we even take care of dataset shuffling. :)
  • Integrations with Powerful Tools: Deep Lake has integrations with LangChain and LlamaIndex as a vector store for LLM apps, Weights & Biases for data lineage during model training, MMDetection for training object detection models, and MMSegmentation for training semantic segmentation models.
  • 100+ most-popular image, video, and audio datasets available in seconds: The Deep Lake community has uploaded 100+ image, video, and audio datasets such as MNIST, COCO, ImageNet, CIFAR, GTZAN, and others.
  • Instant Visualization Support in the Deep Lake App: Deep Lake datasets are instantly visualized with bounding boxes, masks, annotations, etc. in Deep Lake Visualizer (see below).
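
The lazy, NumPy-like access described in the feature list can be pictured with a toy column that keeps every sample compressed and only decodes it when it is indexed. This is a minimal sketch using Python's standard library, not Deep Lake's implementation: `LazyColumn` and its methods are hypothetical names, and zlib stands in for native formats such as JPEG.

```python
import zlib


class LazyColumn:
    """Toy column: samples stay compressed until they are indexed."""

    def __init__(self):
        self._blobs = []        # compressed bytes, one entry per sample
        self.decode_count = 0   # number of samples this object has decoded

    def append(self, raw: bytes) -> None:
        self._blobs.append(zlib.compress(raw))

    def __len__(self) -> int:
        return len(self._blobs)

    def __getitem__(self, index):
        # A slice returns another lazy view; nothing is decoded yet.
        if isinstance(index, slice):
            view = LazyColumn()
            view._blobs = self._blobs[index]
            return view
        # An integer index decodes exactly one sample.
        self.decode_count += 1
        return zlib.decompress(self._blobs[index])


column = LazyColumn()
for i in range(1000):
    column.append(bytes([i % 256]) * 4096)

subset = column[10:20]   # lazy: no decompression happens here
sample = subset[0]       # only this one sample is decoded
```

Slicing is cheap because it only copies references to the compressed blobs; the cost of decoding is paid per sample, at the moment the data is actually needed.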

Visualizer

🚀 How to install Deep Lake

Deep Lake can be installed using pip:

pip install deeplake
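
After installing, it can be useful to confirm which version ended up in the active environment. The sketch below relies only on Python's standard `importlib.metadata` module; the helper name `installed_version` is made up for this illustration.

```python
from importlib import metadata


def installed_version(package: str = "deeplake"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(f"deeplake {version}" if version else "deeplake is not installed")
```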

To access all of Deep Lake's features, please register in the Deep Lake App.

🧠 Deep Lake Code Examples by Application

Vector Store Applications

Using Deep Lake as a Vector Store for building LLM applications:

- Vector Store Quickstart

- Vector Store Tutorials

- LangChain Integration

- LlamaIndex Integration

- Image Similarity Search with Deep Lake

Deep Learning Applications

Using Deep Lake for managing data while training Deep Learning models:

- Deep Learning Quickstart

- Tutorials for Training Models

⚙️ Integrations

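Deep Lake offers integrations with other tools to streamline your deep learning workflows. Current integrations include LangChain and LlamaIndex (vector store for LLM apps), Weights & Biases (data lineage during model training), MMDetection (object detection), and MMSegmentation (semantic segmentation).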

📚 Documentation

Getting started guides, examples, tutorials, API reference, and other useful information can be found on our documentation page.

🎓 For Students and Educators

Deep Lake users can access and visualize a variety of popular datasets through a free integration with Deep Lake's App. Universities can get up to 1TB of data storage and 100,000 queries per month on the Tensor Database for free. Chat with us on our website to claim access!

👩‍💻 Comparisons to Familiar Tools

Deep Lake vs Chroma

Both Deep Lake and ChromaDB enable users to store and search vectors (embeddings) and offer integrations with LangChain and LlamaIndex. However, they are architecturally very different. ChromaDB is a Vector Database that can be deployed locally or on a server using Docker and will offer a hosted solution shortly. Deep Lake is a serverless Vector Store deployed on the user’s own cloud, locally, or in-memory. All computations run client-side, which enables users to support lightweight production apps in seconds. Unlike ChromaDB, Deep Lake’s data format can store raw data such as images, videos, and text, in addition to embeddings. ChromaDB is limited to light metadata on top of the embeddings and has no visualization. Deep Lake datasets can be visualized and version controlled. Deep Lake also has a performant dataloader for fine-tuning your Large Language Models.
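
To make "all computations run client-side" concrete, here is a minimal sketch of a brute-force similarity search performed entirely in the calling process, with raw text stored next to each embedding. It uses only the standard library and is not Deep Lake's search implementation; the `store` layout and the `search` helper are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(store, query, k=2):
    """Return the k records most similar to `query`, best match first."""
    scored = [(cosine_similarity(rec["embedding"], query), rec) for rec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in scored[:k]]


# Each record keeps the raw data alongside its embedding.
store = [
    {"text": "cat photo", "embedding": [1.0, 0.0, 0.0]},
    {"text": "dog photo", "embedding": [0.6, 0.8, 0.0]},
    {"text": "tax form", "embedding": [0.0, 0.0, 1.0]},
]

results = search(store, query=[1.0, 0.05, 0.0], k=2)
print([rec["text"] for rec in results])
```

No server is involved: the client reads the stored vectors and ranks them itself, which is why getting started takes no deployment step. A linear scan like this is only practical for small collections.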

Deep Lake vs Pinecone

Both Deep Lake and Pinecone enable users to store and search vectors (embeddings) and offer integrations with LangChain and LlamaIndex. However, they are architecturally very different. Pinecone is a fully-managed Vector Database that is optimized for highly demanding applications requiring a search for billions of vectors. Deep Lake is serverless. All computations run client-side, which enables users to get started in seconds. Unlike Pinecone, Deep Lake’s data format can store raw data such as images, videos, and text, in addition to embeddings. Deep Lake datasets can be visualized and version controlled. Pinecone is limited to light metadata on top of the embeddings and has no visualization. Deep Lake also has a performant dataloader for fine-tuning your Large Language Models.

Deep Lake vs Weaviate

Both Deep Lake and Weaviate enable users to store and search vectors (embeddings) and offer integrations with LangChain and LlamaIndex. However, they are architecturally very different. Weaviate is a Vector Database that can be deployed in a managed service or by the user via Kubernetes or Docker. Deep Lake is serverless. All computations run client-side, which enables users to support lightweight production apps in seconds. Unlike Weaviate, Deep Lake’s data format can store raw data such as images, videos, and text, in addition to embeddings. Deep Lake datasets can be visualized and version controlled. Weaviate is limited to light metadata on top of the embeddings and has no visualization. Deep Lake also has a performant dataloader for fine-tuning your Large Language Models.

Deep Lake vs DVC

Deep Lake and DVC both offer dataset version control similar to Git, but their methods for storing data differ significantly. Deep Lake converts and stores data as chunked compressed arrays, which enables rapid streaming to ML models, whereas DVC operates on top of data stored in less efficient traditional file structures. When datasets are composed of many files (e.g., many images), the Deep Lake format makes dataset versioning significantly easier than the traditional file structures used by DVC. An additional distinction is that DVC primarily uses a command-line interface, whereas Deep Lake is a Python package. Lastly, Deep Lake offers an API to easily connect datasets to ML frameworks and other common ML tools, and enables instant dataset visualization through Activeloop's visualization tool.
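
The idea of chunked compressed arrays can be sketched in a few lines: many small samples are packed into a handful of compressed chunks, and a small index records where each sample lives. This is an illustrative toy built on the standard library, not the actual Deep Lake format; `build_chunks`, `read_sample`, and the index layout are hypothetical.

```python
import zlib


def build_chunks(samples, chunk_size=4):
    """Pack samples into zlib-compressed chunks plus a per-sample index."""
    chunks, index = [], []
    for start in range(0, len(samples), chunk_size):
        group = samples[start:start + chunk_size]
        offset = 0
        for sample in group:
            # (chunk id, byte offset inside the chunk, length in bytes)
            index.append((len(chunks), offset, len(sample)))
            offset += len(sample)
        chunks.append(zlib.compress(b"".join(group)))
    return chunks, index


def read_sample(chunks, index, i):
    """Fetch sample i by decompressing only the chunk that holds it."""
    chunk_id, offset, length = index[i]
    payload = zlib.decompress(chunks[chunk_id])
    return payload[offset:offset + length]


samples = [f"image-{n}".encode() * (n + 1) for n in range(10)]
chunks, index = build_chunks(samples, chunk_size=4)
```

Ten samples become three stored objects instead of ten files, so a reader streaming over a network issues far fewer requests, while the index still allows any single sample to be located directly.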

Deep Lake vs MosaicML MDS format

  • Data Storage Format: Deep Lake operates on a columnar storage format, whereas MDS utilizes a row-wise storage approach. This fundamentally impacts how data is read, written, and organized in each system.
  • Compression: Deep Lake offers a more flexible compression scheme, allowing control over both chunk-level and sample-level compression for each column or tensor. This eliminates the need for additional compression such as zstd, which would otherwise demand more CPU cycles for decompression on top of formats like JPEG.
  • Shuffling: MDS currently offers more advanced shuffling strategies.
  • Version Control & Visualization Support: Deep Lake provides native version control and in-browser data visualization, which the MosaicML data format lacks. This can provide significant advantages in managing, understanding, and tracking different versions of the data.
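
The difference between row-wise and columnar layouts in the first bullet can be shown with plain Python structures. This is a conceptual sketch only; it does not reproduce either the Deep Lake or the MDS on-disk format, and the field names are invented for the example.

```python
# Row-wise layout: each record keeps all of its fields together.
rows = [
    {"image": b"img0", "label": 3},
    {"image": b"img1", "label": 7},
    {"image": b"img2", "label": 1},
]


def to_columns(rows):
    """Convert row-wise records into one list per field (columnar layout)."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Row-wise: reading every label means visiting each full record.
labels_from_rows = [row["label"] for row in rows]

# Columnar: the labels already sit together and the images are never touched.
labels_from_columns = columns["label"]
```

With a columnar layout, a query that filters on labels can skip the (much larger) image data entirely, whereas a row-wise layout favors reading whole samples sequentially.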

Deep Lake vs TensorFlow Datasets (TFDS)

Deep Lake and TFDS seamlessly connect popular datasets to ML frameworks. Deep Lake datasets are compatible with both PyTorch and TensorFlow, whereas TFDS datasets are compatible only with TensorFlow. A key difference between Deep Lake and TFDS is that Deep Lake datasets are designed for streaming from the cloud, whereas TFDS datasets must be downloaded locally prior to use. As a result, with Deep Lake, one can import datasets directly from TensorFlow Datasets and stream them to either PyTorch or TensorFlow. In addition to providing access to popular publicly available datasets, Deep Lake also offers powerful tools for creating custom datasets, storing them on a variety of cloud storage providers, and collaborating with others via a simple API. TFDS is primarily focused on giving the public easy access to commonly available datasets, and management of custom datasets is not its primary focus. A full comparison article can be found here.

Deep Lake vs HuggingFace

Deep Lake and HuggingFace offer access to popular datasets, but Deep Lake primarily focuses on computer vision, whereas HuggingFace focuses on natural language processing. HuggingFace Transformers and other computational tools for NLP are not analogous to features offered by Deep Lake.

Deep Lake vs WebDatasets

Deep Lake and WebDatasets both offer rapid data streaming across networks. They have nearly identical streaming speeds because the underlying network requests and data structures are very similar. However, Deep Lake offers superior random access and shuffling, its simple API is in Python instead of command-line, and Deep Lake enables simple indexing and modification of the dataset without having to recreate it.

Deep Lake vs Zarr

Deep Lake and Zarr both offer storage of data as chunked arrays. However, Deep Lake is primarily designed for returning data as arrays using a simple API, rather than actually storing raw arrays (even though that's also possible). Deep Lake stores data in use-case-optimized formats, such as JPEG or PNG for images, or MP4 for video, and the user treats the data as if it's an array, because Deep Lake handles all the data processing in between. Deep Lake offers more flexibility for storing arrays with dynamic shape (ragged tensors), and it provides several features that are not natively available in Zarr such as version control, data streaming, and connecting data to ML frameworks.
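
Arrays with dynamic shape (ragged tensors) are commonly stored as one flat buffer plus a list of offsets, so that samples of different lengths can live in the same column. The sketch below shows that general technique in plain Python; `RaggedColumn` is a hypothetical class and is not how Deep Lake or Zarr necessarily implement it.

```python
class RaggedColumn:
    """Toy store for variable-length samples: one flat buffer plus offsets."""

    def __init__(self):
        self._values = []     # all samples laid end to end
        self._offsets = [0]   # sample i spans offsets[i]:offsets[i + 1]

    def append(self, sample):
        self._values.extend(sample)
        self._offsets.append(len(self._values))

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self._values[self._offsets[i]:self._offsets[i + 1]]


boxes = RaggedColumn()
boxes.append([10, 20, 30, 40])           # one bounding box
boxes.append([])                         # an image with no boxes
boxes.append([1, 2, 3, 4, 5, 6, 7, 8])   # two bounding boxes
```

Because each sample is addressed through its offsets, no padding to a common shape is required, and an empty sample costs nothing beyond one extra offset entry.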

Community

Join our Slack community to learn more about unstructured dataset management using Deep Lake and to get help from the Activeloop team and other users.

We'd love your feedback. Please complete our 3-minute survey.

As always, thanks to our amazing contributors!

Made with contributors-img.

Please read CONTRIBUTING.md to get started with making contributions to Deep Lake.

README Badge

Using Deep Lake? Add a README badge to let everyone know:

[![deeplake](https://img.shields.io/badge/powered%20by-Deep%20Lake%20-ff5a1f.svg)](https://github.com/activeloopai/deeplake)

Disclaimers

Dataset Licenses

Deep Lake users may have access to a variety of publicly available datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have a license to use the datasets. It is your responsibility to determine whether you have permission to use the datasets under their license.

If you're a dataset owner and do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thank you for your contribution to the ML community!

Citation

If you use Deep Lake in your research, please cite Activeloop using:

@article{deeplake,
  title = {Deep Lake: a Lakehouse for Deep Learning},
  author = {Hambardzumyan, Sasun and Tuli, Abhinav and Ghukasyan, Levon and Rahman, Fariz and Topchyan, Hrant and Isayan, David and Harutyunyan, Mikayel and Hakobyan, Tatevik and Stranic, Ivo and Buniatyan, Davit},
  url = {https://www.cidrdb.org/cidr2023/papers/p69-buniatyan.pdf},
  booktitle = {Proceedings of CIDR},
  year = {2023},
}

Acknowledgment

This technology was inspired by our research work at Princeton University. We would like to thank William Silversmith @SeungLab for his awesome cloud-volume tool.


