
Data Lake for Multi-Modal AI Search

Project description


Deep Lake: Database for AI

PyPI version

Docs | Get Started | API Reference | LangChain & VectorDBs Course | Blog | Whitepaper | Slack | Twitter

What is Deep Lake?

Deep Lake is a Database for AI powered by a storage format optimized for deep-learning applications. Deep Lake can be used for:

  1. Storing and searching data plus vectors while building LLM applications
  2. Managing datasets while training deep learning models

Deep Lake simplifies the deployment of enterprise-grade LLM-based products by offering storage for all data types (embeddings, audio, text, videos, images, DICOM, PDFs, annotations, and more), querying and vector search, data streaming while training models at scale, data versioning and lineage, and integrations with popular tools such as LangChain, LlamaIndex, Weights & Biases, and many more. Deep Lake works with data of any size, is serverless, and lets you store all of your data in your own cloud, in one place. Deep Lake is used by Intel, Bayer Radiology, Matterport, ZERO Systems, Red Cross, Yale, & Oxford.

Deep Lake includes the following features:

Multi-Cloud Support (S3, GCP, Azure) Use one API to upload, download, and stream datasets to/from S3, Azure, GCP, Activeloop cloud, local storage, or in-memory storage. Compatible with any S3-compatible storage such as MinIO.
Native Compression with Lazy NumPy-like Indexing Store images, audio, and videos in their native compression. Slice, index, iterate, and interact with your data like a collection of NumPy arrays in your system's memory. Deep Lake lazily loads data only when needed, e.g., when training a model or running queries. A short usage sketch follows below.
Dataloaders for Popular Deep Learning Frameworks Deep Lake comes with built-in dataloaders for PyTorch and TensorFlow. Train your model with a few lines of code - we even take care of dataset shuffling. :)
Integrations with Powerful Tools Deep Lake integrates with LangChain and LlamaIndex as a vector store for LLM apps, Weights & Biases for data lineage during model training, MMDetection for training object detection models, and MMSegmentation for training semantic segmentation models.
100+ of the Most Popular Image, Video, and Audio Datasets Available in Seconds The Deep Lake community has uploaded 100+ image, video, and audio datasets such as MNIST, COCO, ImageNet, CIFAR, GTZAN, and others.
Instant Visualization Support in the Deep Lake App Deep Lake datasets are instantly visualized with bounding boxes, masks, annotations, etc. in the Deep Lake Visualizer (see below).

Visualizer
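
To make the multi-cloud paths and lazy, NumPy-like indexing above concrete, here is a minimal sketch. It assumes the 3.x-style Python API (deeplake.load plus tensor indexing with .numpy()); call names may differ in 4.x releases, and the S3 path and tensor names (images, labels) are placeholders.

import deeplake

# One API for many storage backends: s3://, gcs://, azure://, hub:// (Activeloop cloud),
# a local directory, or mem:// for in-memory datasets.
ds = deeplake.load("s3://my-bucket/my-dataset")  # hypothetical path

# Data is loaded lazily: only the samples you index are fetched and decoded.
first_image = ds.images[0].numpy()       # single image, decoded on demand
label_batch = ds.labels[0:32].numpy()    # a slice of labels

print(first_image.shape, label_batch.shape)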

🚀 How to install Deep Lake

Deep Lake can be installed using pip:

pip install deeplake

To access all of Deep Lake's features, please register in the Deep Lake App.
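
Optionally, you can verify the installation from Python; this assumes the package exposes the conventional __version__ attribute:

import deeplake
print(deeplake.__version__)  # should print the installed release, e.g. 4.4.4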

🧠 Deep Lake Code Examples by Application

Vector Store Applications

Using Deep Lake as a Vector Store for building LLM applications:

- Vector Store Quickstart

- Vector Store Tutorials

- LangChain Integration

- LlamaIndex Integration

- Image Similarity Search with Deep Lake
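
As a quick illustration of the vector-store workflow linked above, here is a minimal sketch using the LangChain integration. The class and parameter names (DeepLake, dataset_path, embedding) follow the langchain-community documentation as we understand it, and the embedding model and local dataset path are stand-ins; adapt them to your installed versions.

from langchain_community.vectorstores import DeepLake
from langchain_openai import OpenAIEmbeddings  # any LangChain embeddings class works

# Create (or open) a Deep Lake vector store; s3:// or hub:// paths also work.
embeddings = OpenAIEmbeddings()  # requires OPENAI_API_KEY in the environment
db = DeepLake(dataset_path="./my_deeplake_vectorstore", embedding=embeddings)

# Add documents; embeddings are computed and stored alongside the raw text.
db.add_texts([
    "Deep Lake stores embeddings together with the raw data.",
    "All vector-search computations run client-side.",
])

# Retrieve the most similar documents for a query.
results = db.similarity_search("Where do computations run?", k=1)
print(results[0].page_content)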

Deep Learning Applications

Using Deep Lake for managing data while training Deep Learning models:

- Deep Learning Quickstart

- Tutorials for Training Models
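
As a minimal sketch of the training workflow linked above, the snippet below streams a public dataset and wraps it in the built-in PyTorch dataloader. It assumes the 3.x-style API (deeplake.load and ds.pytorch()); the dataset path is a public Activeloop dataset, and the dataloader arguments may differ in newer releases.

import deeplake
from torchvision import transforms

# Stream a public dataset straight from Activeloop storage; nothing is pre-downloaded.
ds = deeplake.load("hub://activeloop/mnist-train")

tform = transforms.Compose([transforms.ToTensor()])

# Built-in PyTorch dataloader; shuffling is handled for you.
dataloader = ds.pytorch(batch_size=32, shuffle=True,
                        transform={"images": tform, "labels": None})

for batch in dataloader:
    images, labels = batch["images"], batch["labels"]
    # ... training step goes here ...
    break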

⚙️ Integrations

Deep Lake offers integrations with other tools in order to streamline your deep learning workflows. Current integrations include LangChain and LlamaIndex (vector store for LLM apps), Weights & Biases (data lineage during model training), MMDetection, MMSegmentation, and more.

📚 Documentation

Getting started guides, examples, tutorials, API reference, and other useful information can be found on our documentation page.

🎓 For Students and Educators

Deep Lake users can access and visualize a variety of popular datasets through a free integration with the Deep Lake App. Universities can get up to 1TB of data storage and 100,000 queries on the Tensor Database free of charge each month. Chat with us on our website to claim access!

👩‍💻 Comparisons to Familiar Tools

Deep Lake vs Chroma

Both Deep Lake & ChromaDB enable users to store and search vectors (embeddings) and offer integrations with LangChain and LlamaIndex. However, they are architecturally very different. ChromaDB is a Vector Database that can be deployed locally or on a server using Docker and will offer a hosted solution shortly. Deep Lake is a serverless Vector Store deployed on the user’s own cloud, locally, or in-memory. All computations run client-side, which enables users to support lightweight production apps in seconds. Unlike ChromaDB, Deep Lake’s data format can store raw data such as images, videos, and text, in addition to embeddings. ChromaDB is limited to light metadata on top of the embeddings and has no visualization. Deep Lake datasets can be visualized and version controlled. Deep Lake also has a performant dataloader for fine-tuning your Large Language Models.

Deep Lake vs Pinecone

Both Deep Lake and Pinecone enable users to store and search vectors (embeddings) and offer integrations with LangChain and LlamaIndex. However, they are architecturally very different. Pinecone is a fully-managed Vector Database that is optimized for highly demanding applications requiring a search for billions of vectors. Deep Lake is serverless. All computations run client-side, which enables users to get started in seconds. Unlike Pinecone, Deep Lake’s data format can store raw data such as images, videos, and text, in addition to embeddings. Deep Lake datasets can be visualized and version controlled. Pinecone is limited to light metadata on top of the embeddings and has no visualization. Deep Lake also has a performant dataloader for fine-tuning your Large Language Models.

Deep Lake vs Weaviate

Both Deep Lake and Weaviate enable users to store and search vectors (embeddings) and offer integrations with LangChain and LlamaIndex. However, they are architecturally very different. Weaviate is a Vector Database that can be deployed in a managed service or by the user via Kubernetes or Docker. Deep Lake is serverless. All computations run client-side, which enables users to support lightweight production apps in seconds. Unlike Weaviate, Deep Lake’s data format can store raw data such as images, videos, and text, in addition to embeddings. Deep Lake datasets can be visualized and version controlled. Weaviate is limited to light metadata on top of the embeddings and has no visualization. Deep Lake also has a performant dataloader for fine-tuning your Large Language Models.

Deep Lake vs DVC

Deep Lake and DVC both offer git-like dataset version control, but their methods for storing data differ significantly. Deep Lake converts and stores data as chunked compressed arrays, which enables rapid streaming to ML models, whereas DVC operates on top of data stored in less efficient traditional file structures. The Deep Lake format makes dataset versioning significantly easier than versioning traditional file structures with DVC when datasets are composed of many files (e.g., many images). Another distinction is that DVC primarily uses a command-line interface, whereas Deep Lake is a Python package. Lastly, Deep Lake offers an API to easily connect datasets to ML frameworks and other common ML tools, and it enables instant dataset visualization through Activeloop's visualization tool.

Deep Lake vs MosaicML MDS format
  • Data Storage Format: Deep Lake operates on a columnar storage format, whereas MDS utilizes a row-wise storage approach. This fundamentally impacts how data is read, written, and organized in each system.
  • Compression: Deep Lake offers a more flexible compression scheme, allowing control over both chunk-level and sample-level compression for each column or tensor. This eliminates the need for additional compression such as zstd, which would otherwise demand extra CPU cycles to decompress on top of formats like JPEG. (A short sketch follows this list.)
  • Shuffling: MDS currently offers more advanced shuffling strategies.
  • Version Control & Visualization Support: A notable feature of Deep Lake is its native version control and in-browser data visualization, neither of which is present in the MosaicML data format. This can provide significant advantages in managing, understanding, and tracking different versions of the data.
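
To make the chunk- and sample-level compression controls concrete, here is a minimal sketch using the 3.x-style create_tensor API (sample_compression and chunk_compression arguments); argument names may differ in your installed version, and the local path is a placeholder.

import numpy as np
import deeplake

ds = deeplake.empty("./compression_demo")  # hypothetical local path

# Per-tensor compression: images stay JPEG-encoded sample by sample,
# while labels are compressed at the chunk level with LZ4.
ds.create_tensor("images", htype="image", sample_compression="jpeg")
ds.create_tensor("labels", htype="class_label", chunk_compression="lz4")

ds.images.append(np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8))
ds.labels.append(3)
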
Deep Lake vs TensorFlow Datasets (TFDS)

Deep Lake and TFDS seamlessly connect popular datasets to ML frameworks. Deep Lake datasets are compatible with both PyTorch and TensorFlow, whereas TFDS datasets are only compatible with TensorFlow. A key difference between Deep Lake and TFDS is that Deep Lake datasets are designed for streaming from the cloud, whereas TFDS datasets must be downloaded locally prior to use. As a result, with Deep Lake one can import datasets directly from TensorFlow Datasets and stream them to either PyTorch or TensorFlow. In addition to providing access to popular publicly available datasets, Deep Lake also offers powerful tools for creating custom datasets, storing them on a variety of cloud storage providers, and collaborating with others via a simple API. TFDS is primarily focused on giving the public easy access to commonly available datasets, and management of custom datasets is not its primary focus. A full comparison article can be found here.

Deep Lake vs HuggingFace

Deep Lake and HuggingFace both offer access to popular datasets, but Deep Lake primarily focuses on computer vision, whereas HuggingFace focuses on natural language processing. HuggingFace Transformers and other computational tools for NLP are not analogous to features offered by Deep Lake.

Deep Lake vs WebDatasets

Deep Lake and WebDatasets both offer rapid data streaming across networks. They have nearly identical streaming speeds because the underlying network requests and data structures are very similar. However, Deep Lake offers superior random access and shuffling, its simple API is in Python rather than the command line, and Deep Lake enables simple indexing and modification of the dataset without having to recreate it.

Deep Lake vs Zarr

Deep Lake and Zarr both offer storage of data as chunked arrays. However, Deep Lake is primarily designed for returning data as arrays using a simple API, rather than actually storing raw arrays (even though that is also possible). Deep Lake stores data in use-case-optimized formats, such as JPEG or PNG for images or MP4 for video, and the user treats the data as if it were an array, because Deep Lake handles all the data processing in between. Deep Lake offers more flexibility for storing arrays with dynamic shape (ragged tensors), and it provides several features that are not natively available in Zarr, such as version control, data streaming, and connecting data to ML frameworks.

Community

Join our Slack community to learn more about unstructured dataset management using Deep Lake and to get help from the Activeloop team and other users.

We'd love your feedback; please take a moment to complete our 3-minute survey.

As always, thanks to our amazing contributors!

Made with contributors-img.

Please read CONTRIBUTING.md to get started with making contributions to Deep Lake.

README Badge

Using Deep Lake? Add a README badge to let everyone know:

deeplake

[![deeplake](https://img.shields.io/badge/powered%20by-Deep%20Lake%20-ff5a1f.svg)](https://github.com/activeloopai/deeplake)

Disclaimers

Dataset Licenses

Deep Lake users may have access to a variety of publicly available datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have a license to use the datasets. It is your responsibility to determine whether you have permission to use the datasets under their license.

If you're a dataset owner and do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thank you for your contribution to the ML community!

Citation

If you use Deep Lake in your research, please cite Activeloop using:

@article{deeplake,
  title = {Deep Lake: a Lakehouse for Deep Learning},
  author = {Hambardzumyan, Sasun and Tuli, Abhinav and Ghukasyan, Levon and Rahman, Fariz and Topchyan, Hrant and Isayan, David and Harutyunyan, Mikayel and Hakobyan, Tatevik and Stranic, Ivo and Buniatyan, Davit},
  url = {https://www.cidrdb.org/cidr2023/papers/p69-buniatyan.pdf},
  booktitle = {Proceedings of CIDR},
  year = {2023},
}

Acknowledgment

This technology was inspired by our research work at Princeton University. We would like to thank William Silversmith @SeungLab for his awesome cloud-volume tool.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

deeplake-4.4.4-cp313-cp313-manylinux2014_x86_64.whl (38.6 MB): CPython 3.13, manylinux2014 x86_64
deeplake-4.4.4-cp313-cp313-manylinux2014_aarch64.whl (36.6 MB): CPython 3.13, manylinux2014 aarch64
deeplake-4.4.4-cp313-cp313-macosx_11_0_arm64.whl (32.8 MB): CPython 3.13, macOS 11.0+ ARM64
deeplake-4.4.4-cp312-cp312-manylinux2014_x86_64.whl (38.6 MB): CPython 3.12, manylinux2014 x86_64
deeplake-4.4.4-cp312-cp312-manylinux2014_aarch64.whl (36.6 MB): CPython 3.12, manylinux2014 aarch64
deeplake-4.4.4-cp312-cp312-macosx_11_0_arm64.whl (32.8 MB): CPython 3.12, macOS 11.0+ ARM64
deeplake-4.4.4-cp311-cp311-manylinux2014_x86_64.whl (38.6 MB): CPython 3.11, manylinux2014 x86_64
deeplake-4.4.4-cp311-cp311-manylinux2014_aarch64.whl (36.7 MB): CPython 3.11, manylinux2014 aarch64
deeplake-4.4.4-cp311-cp311-macosx_11_0_arm64.whl (32.8 MB): CPython 3.11, macOS 11.0+ ARM64
deeplake-4.4.4-cp310-cp310-manylinux2014_x86_64.whl (38.6 MB): CPython 3.10, manylinux2014 x86_64
deeplake-4.4.4-cp310-cp310-manylinux2014_aarch64.whl (36.7 MB): CPython 3.10, manylinux2014 aarch64
deeplake-4.4.4-cp310-cp310-macosx_11_0_arm64.whl (32.8 MB): CPython 3.10, macOS 11.0+ ARM64

File details

File hashes for each built distribution (see more details on using hashes here):

deeplake-4.4.4-cp313-cp313-manylinux2014_x86_64.whl
  SHA256: a420e185538d5fd1ec404d30d276444a0e86bfed8ddf8698997c6a0d283255c3
  MD5: abd37e515d8f54db801e2dc70fdd8877
  BLAKE2b-256: 3a462965699728f8ba71835f57a0163de8fc20b72ef1e446cf04f69414de86bf

deeplake-4.4.4-cp313-cp313-manylinux2014_aarch64.whl
  SHA256: e1cb1ec60e74da7c58a33b28181c668196be9d587de3694b82b6ca542f03c2c3
  MD5: 7e584ed1440f70e5f1b89246bfd5255a
  BLAKE2b-256: 8ff2761d4be7fbc4376b46baafd5dc49a00a5bfb533f60dea4915d22007d055f

deeplake-4.4.4-cp313-cp313-macosx_11_0_arm64.whl
  SHA256: 8268087467c7d8808cfd913ff059da8d387fc8163310c8839d65a9ac99834ed0
  MD5: e468c959a5e42606d98be56945f8ebf3
  BLAKE2b-256: 693b6001ee1b4aeb0f37f1320ef1ba322f8f2828f98d9cabc399171adecc7e95

deeplake-4.4.4-cp312-cp312-manylinux2014_x86_64.whl
  SHA256: 692317ff98d701b82d772a96c7340ced718e4f213dcfbf504ee60babfde48de5
  MD5: bba503778478fd1813f1fb98fb6b3ac0
  BLAKE2b-256: c207da171025add841ae7f1bcaa9d051631a653fda52fcad8c839b04e57e1c14

deeplake-4.4.4-cp312-cp312-manylinux2014_aarch64.whl
  SHA256: 86fdb031f8da3217b9bda60e489d159d1b8ca67c815e08a2ae4da14d9bc2108a
  MD5: f32b441af04546914e83cbe269e4a078
  BLAKE2b-256: d228da2b2a8b9ed76cd8703dcf0f8d584c7b0b3ff32d12a5598918b10d171152

deeplake-4.4.4-cp312-cp312-macosx_11_0_arm64.whl
  SHA256: ea68a4db7c6a9ad6be9e108cff444085c2ab1a9ff15a9c83b2d855ea257f3aa0
  MD5: 505692748fef2c976bb6506de5bd8ece
  BLAKE2b-256: 34f0526c41a284727ae349051f3cc3fa836a7b792605e0c01c049cac6b17d06c

deeplake-4.4.4-cp311-cp311-manylinux2014_x86_64.whl
  SHA256: 567f8ae30d21287a959b83e6267d9d48a416c02e72e947daaf3c9ef0fdc2da76
  MD5: 07d1544d6bb0d20817a55504d9cb9d85
  BLAKE2b-256: 4d49e429b85e412f62a51148c98c330b4ae12faa2112b675b73e48be9fa8ce62

deeplake-4.4.4-cp311-cp311-manylinux2014_aarch64.whl
  SHA256: 49c11e4ec46e9ed34edafddf1bf45e4eb244ba49f48004421f7bf9af1f02ee79
  MD5: d516026205cca6615ef11fbd6f4fe0ac
  BLAKE2b-256: 2ec0dc2ff8b6bc1d67ef27d14a55a49570e66f280262d777b5d2d0ff9d7d64ca

deeplake-4.4.4-cp311-cp311-macosx_11_0_arm64.whl
  SHA256: b5b933e61b637771a347efe24f3f3a882ec57757f26a766b74d33df372096e2b
  MD5: 1cac3b4d029efa729035c480e75e6c98
  BLAKE2b-256: 1693a5238b0103eaef46d4582b3c5ad6dc2667adf06000ed172f044407672ba0

deeplake-4.4.4-cp310-cp310-manylinux2014_x86_64.whl
  SHA256: 9fca49b798b8879b7b790d349cb88d11e92b4925da51ef695641033a0e34e7b6
  MD5: 0ad73421c79b06e3dcf5f79a7513d961
  BLAKE2b-256: bccede455d87222cc594621c818237a54c0d748e6e9232645a48877eef6f5c5a

deeplake-4.4.4-cp310-cp310-manylinux2014_aarch64.whl
  SHA256: 5f65f98c5cff7bd64760a22af833abae387cedc07ac35957720c955811c8587d
  MD5: 6a2985ede344cfbdbdb2de01a09bd7ce
  BLAKE2b-256: 8042630737b09817b5de7d47610934a0a6dc9517f45d8d2b36d2614f13e8e6e8

deeplake-4.4.4-cp310-cp310-macosx_11_0_arm64.whl
  SHA256: 700792a858871425b8fac624d9af4812683871d6f73bd07cc479c8851871f1e2
  MD5: ac43fc771eb5c715b78cc2899d027710
  BLAKE2b-256: 48d3ecdb2ead071f2633caf07c185fb88e79260bae17d102041125a0d0f22975
