Data Lake for Multi-Modal AI Search

Deep Lake: Database for AI

Docs | Get Started | API Reference | LangChain & VectorDBs Course | Blog | Whitepaper | Slack | Twitter

What is Deep Lake?

Deep Lake is a Database for AI powered by a storage format optimized for deep-learning applications. Deep Lake can be used for:

  1. Storing and searching data plus vectors while building LLM applications
  2. Managing datasets while training deep learning models

Deep Lake simplifies the deployment of enterprise-grade LLM-based products by offering storage for all data types (embeddings, audio, text, videos, images, DICOM, PDFs, annotations, and more), querying and vector search, data streaming while training models at scale, data versioning and lineage, and integrations with popular tools such as LangChain, LlamaIndex, Weights & Biases, and many more. Deep Lake works with data of any size, is serverless, and lets you store all of your data in one place in your own cloud. Deep Lake is used by Intel, Bayer Radiology, Matterport, ZERO Systems, Red Cross, Yale, and Oxford.

Deep Lake includes the following features:

  • Multi-Cloud Support (S3, GCP, Azure): Use one API to upload, download, and stream datasets to/from S3, Azure, GCP, Activeloop cloud, local storage, or in-memory storage. Compatible with any S3-compatible storage such as MinIO.
  • Native Compression with Lazy NumPy-like Indexing: Store images, audio, and videos in their native compression. Slice, index, iterate, and interact with your data like a collection of NumPy arrays in your system's memory. Deep Lake lazily loads data only when needed, e.g., when training a model or running queries.
  • Dataloaders for Popular Deep Learning Frameworks: Deep Lake comes with built-in dataloaders for PyTorch and TensorFlow. Train your model with a few lines of code - we even take care of dataset shuffling. :)
  • Integrations with Powerful Tools: Deep Lake has integrations with LangChain and LlamaIndex as a vector store for LLM apps, Weights & Biases for data lineage during model training, MMDetection for training object detection models, and MMSegmentation for training semantic segmentation models.
  • 100+ Popular Image, Video, and Audio Datasets Available in Seconds: The Deep Lake community has uploaded 100+ image, video, and audio datasets such as MNIST, COCO, ImageNet, CIFAR, GTZAN, and others.
  • Instant Visualization Support in the Deep Lake App: Deep Lake datasets are instantly visualized with bounding boxes, masks, annotations, etc. in Deep Lake Visualizer (see below).
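
The idea behind lazy, NumPy-like indexing can be illustrated with a small sketch in plain Python. This is not Deep Lake's API: the `LazyColumn` class and `make_loader` helper are hypothetical names invented for illustration, and the "samples" are toy lists rather than decoded images. The point is only that slicing builds a view without reading anything, and a sample is read when it is actually indexed.

```python
from typing import Callable, List, Sequence, Union


class LazyColumn:
    """Hypothetical stand-in for a lazily loaded column (not Deep Lake's API)."""

    def __init__(self, loaders: Sequence[Callable[[], List[int]]]):
        # Each loader reads and decodes one sample only when it is called.
        self._loaders = list(loaders)

    def __len__(self) -> int:
        return len(self._loaders)

    def __getitem__(self, index: Union[int, slice]):
        if isinstance(index, slice):
            # Slicing returns another lazy view; no sample is loaded here.
            return LazyColumn(self._loaders[index])
        # Integer indexing is the moment a sample is actually read.
        return self._loaders[index]()


def make_loader(i: int, log: List[int]) -> Callable[[], List[int]]:
    def load() -> List[int]:
        log.append(i)    # record that sample i was really read
        return [i] * 3   # pretend this is a decoded sample
    return load


log: List[int] = []
column = LazyColumn([make_loader(i, log) for i in range(1000)])
subset = column[10:20]   # builds a view of 10 samples, loads nothing
first = subset[0]        # loads only sample 10
```

After running this, `log` contains a single entry even though the column has 1000 samples, which is the behavior the feature description refers to when it says data is loaded only when needed.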

Visualizer

🚀 How to install Deep Lake

Deep Lake can be installed using pip:

pip install deeplake

To access all of Deep Lake's features, please register in the Deep Lake App.
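
After installing, one way to confirm which version ended up in the active environment is to query the package metadata. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_version` is ours and is not part of Deep Lake.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("deeplake")
print(f"deeplake version: {version}" if version else "deeplake is not installed")
```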

🧠 Deep Lake Code Examples by Application

Vector Store Applications

Using Deep Lake as a Vector Store for building LLM applications:

- Vector Store Quickstart

- Vector Store Tutorials

- LangChain Integration

- LlamaIndex Integration

- Image Similarity Search with Deep Lake
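
To make "vector search" concrete, here is a minimal brute-force similarity search in plain Python. It deliberately does not use Deep Lake's API (see the quickstart above for that); the function names, the toy 3-dimensional embeddings, and the dictionary-based store are all assumptions made for illustration. Real embeddings would come from an embedding model and have hundreds or thousands of dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    store: Dict[str, Sequence[float]], query: Sequence[float], k: int = 2
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the top k."""
    scored = [(text, cosine_similarity(vec, query)) for text, vec in store.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy store mapping raw text to its (made-up) embedding.
store = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.8, 0.3, 0.0],
    "quarterly revenue report": [0.0, 0.1, 0.9],
}
results = search(store, query=[1.0, 0.0, 0.0], k=2)
```

Here the two photo captions rank above the revenue report because their vectors point in nearly the same direction as the query. A production vector store replaces the linear scan with an index, but the ranking principle is the same.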

Deep Learning Applications

Using Deep Lake for managing data while training Deep Learning models:

- Deep Learning Quickstart

- Tutorials for Training Models
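
As a rough picture of what a dataloader with shuffling does during training, the following sketch yields shuffled mini-batches from an in-memory list. It is a generic illustration, not Deep Lake's dataloader: `shuffled_batches` and its parameters are hypothetical names, and a real loader would also handle streaming, decoding, and parallel workers.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def shuffled_batches(
    samples: Sequence[T], batch_size: int, seed: int = 0
) -> Iterator[List[T]]:
    """Yield mini-batches covering every sample exactly once, in shuffled order."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)  # seeded so the epoch order is reproducible
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]


samples = [f"sample_{i}" for i in range(10)]
batches = list(shuffled_batches(samples, batch_size=4))
```

With 10 samples and a batch size of 4, this produces two full batches and one final batch of 2.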

⚙️ Integrations

Deep Lake offers integrations with other tools in order to streamline your deep learning workflows. Current integrations include:

📚 Documentation

Getting started guides, examples, tutorials, API reference, and other useful information can be found on our documentation page.

🎓 For Students and Educators

Deep Lake users can access and visualize a variety of popular datasets through a free integration with Deep Lake's App. Universities can get up to 1TB of data storage and 100,000 queries per month on the Tensor Database for free. Chat with us on our website to claim access!

👩‍💻 Comparisons to Familiar Tools

Deep Lake vs Chroma

Both Deep Lake & ChromaDB enable users to store and search vectors (embeddings) and offer integrations with LangChain and LlamaIndex. However, they are architecturally very different. ChromaDB is a Vector Database that can be deployed locally or on a server using Docker and will offer a hosted solution shortly. Deep Lake is a serverless Vector Store deployed on the user’s own cloud, locally, or in-memory. All computations run client-side, which enables users to support lightweight production apps in seconds. Unlike ChromaDB, Deep Lake’s data format can store raw data such as images, videos, and text, in addition to embeddings. ChromaDB is limited to light metadata on top of the embeddings and has no visualization. Deep Lake datasets can be visualized and version controlled. Deep Lake also has a performant dataloader for fine-tuning your Large Language Models.

Deep Lake vs Pinecone

Both Deep Lake and Pinecone enable users to store and search vectors (embeddings) and offer integrations with LangChain and LlamaIndex. However, they are architecturally very different. Pinecone is a fully-managed Vector Database that is optimized for highly demanding applications requiring a search for billions of vectors. Deep Lake is serverless. All computations run client-side, which enables users to get started in seconds. Unlike Pinecone, Deep Lake’s data format can store raw data such as images, videos, and text, in addition to embeddings. Deep Lake datasets can be visualized and version controlled. Pinecone is limited to light metadata on top of the embeddings and has no visualization. Deep Lake also has a performant dataloader for fine-tuning your Large Language Models.

Deep Lake vs Weaviate

Both Deep Lake and Weaviate enable users to store and search vectors (embeddings) and offer integrations with LangChain and LlamaIndex. However, they are architecturally very different. Weaviate is a Vector Database that can be deployed in a managed service or by the user via Kubernetes or Docker. Deep Lake is serverless. All computations run client-side, which enables users to support lightweight production apps in seconds. Unlike Weaviate, Deep Lake’s data format can store raw data such as images, videos, and text, in addition to embeddings. Deep Lake datasets can be visualized and version controlled. Weaviate is limited to light metadata on top of the embeddings and has no visualization. Deep Lake also has a performant dataloader for fine-tuning your Large Language Models.

Deep Lake vs DVC

Deep Lake and DVC both offer dataset version control similar to Git for data, but their methods for storing data differ significantly. Deep Lake converts and stores data as chunked compressed arrays, which enables rapid streaming to ML models, whereas DVC operates on top of data stored in less efficient traditional file structures. When datasets are composed of many files (e.g., many images), the Deep Lake format makes dataset versioning significantly easier than DVC's approach on top of traditional file structures. An additional distinction is that DVC primarily uses a command-line interface, whereas Deep Lake is a Python package. Lastly, Deep Lake offers an API to easily connect datasets to ML frameworks and other common ML tools and enables instant dataset visualization through Activeloop's visualization tool.
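
The phrase "chunked compressed arrays" can be illustrated with a toy example using only the standard library. This is a sketch of the general technique, not Deep Lake's on-disk format: the chunk size, the use of `zlib`, and the function names are assumptions chosen for brevity. The key property is that reading one value requires decompressing only the chunk that contains it.

```python
import struct
import zlib
from typing import List

CHUNK_SIZE = 4  # values per chunk; real systems use much larger chunks


def write_chunks(values: List[float]) -> List[bytes]:
    """Split values into fixed-size chunks and compress each chunk on its own."""
    chunks = []
    for start in range(0, len(values), CHUNK_SIZE):
        part = values[start:start + CHUNK_SIZE]
        raw = struct.pack(f"<{len(part)}d", *part)  # little-endian float64
        chunks.append(zlib.compress(raw))
    return chunks


def read_value(chunks: List[bytes], index: int) -> float:
    """Read one value by decompressing only the chunk that contains it."""
    chunk_id, offset = divmod(index, CHUNK_SIZE)
    raw = zlib.decompress(chunks[chunk_id])
    return struct.unpack_from("<d", raw, offset * 8)[0]  # 8 bytes per float64


data = [float(i) for i in range(10)]
chunks = write_chunks(data)
last = read_value(chunks, 9)  # touches only the third chunk
```

Because each chunk is an independent blob, chunks can be fetched from object storage one at a time, which is what makes streaming to a training loop practical.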

Deep Lake vs MosaicML MDS format

  • Data Storage Format: Deep Lake operates on a columnar storage format, whereas MDS utilizes a row-wise storage approach. This fundamentally impacts how data is read, written, and organized in each system.
  • Compression: Deep Lake offers a more flexible compression scheme, allowing control over both chunk-level and sample-level compression for each column or tensor. This feature eliminates the need for additional compressions like zstd, which would otherwise demand more CPU cycles for decompressing on top of formats like jpeg.
  • Shuffling: MDS currently offers more advanced shuffling strategies.
  • Version Control & Visualization Support: Deep Lake offers native version control and in-browser data visualization, neither of which is available for the MosaicML data format. These features can provide significant advantages in managing, understanding, and tracking different versions of the data.
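
The difference between row-wise and columnar layouts mentioned in the first bullet can be shown with a small, generic sketch. It does not reproduce either the Deep Lake or the MDS format; the field names, placeholder byte strings, and the `to_columns` helper are made up for illustration.

```python
from typing import Any, Dict, List

# Row-wise layout: each sample keeps all of its fields together.
rows: List[Dict[str, Any]] = [
    {"image": b"<jpeg bytes 0>", "label": 3},
    {"image": b"<jpeg bytes 1>", "label": 7},
    {"image": b"<jpeg bytes 2>", "label": 1},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Regroup row-wise records so that each field is stored contiguously."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Reading only the labels: the columnar layout touches one list,
# while the row-wise layout has to visit every full record.
labels_from_columns = columns["label"]
labels_from_rows = [row["label"] for row in rows]
```

Both reads return the same labels; what differs is how much unrelated data (here, the image bytes) sits in the path of the read, and a columnar layout also lets each column use its own compression.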

Deep Lake vs TensorFlow Datasets (TFDS)

Deep Lake and TFDS seamlessly connect popular datasets to ML frameworks. Deep Lake datasets are compatible with both PyTorch and TensorFlow, whereas TFDS is only compatible with TensorFlow. A key difference between Deep Lake and TFDS is that Deep Lake datasets are designed for streaming from the cloud, whereas TFDS datasets must be downloaded locally prior to use. As a result, with Deep Lake, one can import datasets directly from TensorFlow Datasets and stream them to either PyTorch or TensorFlow. In addition to providing access to popular publicly available datasets, Deep Lake also offers powerful tools for creating custom datasets, storing them on a variety of cloud storage providers, and collaborating with others via a simple API. TFDS is primarily focused on giving the public easy access to commonly available datasets, and management of custom datasets is not its primary focus. A full comparison article can be found here.

Deep Lake vs HuggingFace

Deep Lake and HuggingFace offer access to popular datasets, but Deep Lake primarily focuses on computer vision, whereas HuggingFace focuses on natural language processing. HuggingFace Transformers and other computational tools for NLP are not analogous to features offered by Deep Lake.

Deep Lake vs WebDatasets

Deep Lake and WebDatasets both offer rapid data streaming across networks. They have nearly identical streaming speeds because the underlying network requests and data structures are very similar. However, Deep Lake offers superior random access and shuffling, its simple API is in Python rather than on the command line, and Deep Lake enables simple indexing and modification of the dataset without having to recreate it.

Deep Lake vs Zarr

Deep Lake and Zarr both offer storage of data as chunked arrays. However, Deep Lake is primarily designed for returning data as arrays using a simple API, rather than actually storing raw arrays (even though that's also possible). Deep Lake stores data in use-case-optimized formats, such as jpeg or png for images, or mp4 for video, and the user treats the data as if it's an array, because Deep Lake handles all the data processing in between. Deep Lake offers more flexibility for storing arrays with dynamic shape (ragged tensors), and it provides several features that are not natively available in Zarr such as version control, data streaming, and connecting data to ML frameworks.

Community

Join our Slack community to learn more about unstructured dataset management using Deep Lake and to get help from the Activeloop team and other users.

We'd love your feedback: please complete our 3-minute survey.

As always, thanks to our amazing contributors!

Made with contributors-img.

Please read CONTRIBUTING.md to get started with making contributions to Deep Lake.

README Badge

Using Deep Lake? Add a README badge to let everyone know:

[![deeplake](https://img.shields.io/badge/powered%20by-Deep%20Lake%20-ff5a1f.svg)](https://github.com/activeloopai/deeplake)

Disclaimers

Dataset Licenses

Deep Lake users may have access to a variety of publicly available datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have a license to use the datasets. It is your responsibility to determine whether you have permission to use the datasets under their license.

If you're a dataset owner and do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thank you for your contribution to the ML community!

Citation

If you use Deep Lake in your research, please cite Activeloop using:

@article{deeplake,
  title = {Deep Lake: a Lakehouse for Deep Learning},
  author = {Hambardzumyan, Sasun and Tuli, Abhinav and Ghukasyan, Levon and Rahman, Fariz and Topchyan, Hrant and Isayan, David and Harutyunyan, Mikayel and Hakobyan, Tatevik and Stranic, Ivo and Buniatyan, Davit},
  url = {https://www.cidrdb.org/cidr2023/papers/p69-buniatyan.pdf},
  booktitle={Proceedings of CIDR},
  year = {2023},
}

Acknowledgment

This technology was inspired by our research work at Princeton University. We would like to thank William Silversmith @SeungLab for his awesome cloud-volume tool.
