Jina is a cloud-native neural search solution powered by state-of-the-art AI and deep learning.
An easier way to build neural search on the cloud
Jina is a deep learning-powered search framework for building cross-/multi-modal search systems (e.g. text, images, video, audio) on the cloud.
⏱️ Time Saver - The design pattern of neural search systems, from zero to a production-ready system in minutes.
🍱 Full-Stack Ownership - Keep an end-to-end stack ownership of your solution, avoid the integration pitfalls with fragmented, multi-vendor, generic legacy tools.
🌌 Universal Search - Large-scale indexing and querying of unstructured data: video, image, long/short text, music, source code, etc.
🧠 First-Class AI Models - First-class support for state-of-the-art AI models, easily usable and extendable with a Pythonic interface.
🌩️ Fast & Cloud Ready - Decentralized architecture from day one. Scalable & cloud-native by design: enjoy containerizing, distributing, sharding, async, REST/gRPC/WebSocket.
❤️ Made with Love - Never compromise on quality, actively maintained by a passionate full-time, venture-backed team.
Docs • Hello World • Quick Start • Learn • Examples • Contribute • Jobs • Website • Slack
Installation
📦 x86/64, arm/v6, v7, v8 (Apple M1)

| | On Linux/macOS & Python 3.7/3.8/3.9 | Docker Users |
|---|---|---|
| Standard | `pip install -U jina` | `docker run jinaai/jina:latest` |
| Daemon | `pip install -U "jina[daemon]"` | `docker run --network=host jinaai/jina:latest-daemon` |
| With Extras | `pip install -U "jina[devel]"` | `docker run jinaai/jina:latest-devel` |
| Dev/Pre-Release | `pip install --pre jina` | `docker run jinaai/jina:master` |
Version identifiers are explained here. To install Jina with extra dependencies please refer to the docs. Jina can run on Windows Subsystem for Linux. We welcome the community to help us with native Windows support.
Jina "Hello, World!" 👋🌍
Just starting out? Try Jina's "Hello, World" - jina hello --help
👗 Fashion Image Search
A simple image neural search demo for Fashion-MNIST. No extra dependencies needed, simply run:
jina hello fashion # more options in --help
...or even easier for Docker users, no install required:
docker run -v "$(pwd)/j:/j" jinaai/jina hello fashion --workdir /j && open j/hello-world.html
# replace "open" with "xdg-open" on Linux
🤖 Covid-19 Chatbot
For NLP engineers, we provide a simple chatbot demo for answering Covid-19 questions. To run that:
pip install "jina[chatbot]"
jina hello chatbot
This downloads the CovidQA dataset and tells Jina to index 418 question-answer pairs with DistilBERT. The indexing takes about 1 minute on CPU. Jina then opens a web page where you can type in questions and ask Jina.
🪆 Multimodal Document Search
A multimodal-document contains multiple data types, e.g. a PDF document often contains figures and text. Jina lets you build a multimodal search solution in just minutes. To run our minimum multimodal document search demo:
pip install "jina[multimodal]"
jina hello multimodal
This downloads a people-image dataset and tells Jina to index 2,000 image-caption pairs with MobileNet and DistilBERT. The indexing takes about 3 minutes on CPU. Jina then opens a web page where you can query multimodal documents. We have prepared a YouTube tutorial to walk you through this demo.
Get Started
🥚 Fundamentals
CRUD Functions
First we look at basic CRUD operations. In Jina, CRUD corresponds to four functions: index (create), search (read), update, and delete. Take the Documents below as an example:
import numpy as np
from jina import Document
docs = [Document(id='🐲', embedding=np.array([0, 0]), tags={'guardian': 'Azure Dragon', 'position': 'East'}),
        Document(id='🐦', embedding=np.array([1, 0]), tags={'guardian': 'Vermilion Bird', 'position': 'South'}),
        Document(id='🐢', embedding=np.array([0, 1]), tags={'guardian': 'Black Tortoise', 'position': 'North'}),
        Document(id='🐯', embedding=np.array([1, 1]), tags={'guardian': 'White Tiger', 'position': 'West'})]
Let's build a Flow with a simple indexer:
from jina import Flow
f = Flow().add(uses='_index')
Document and Flow are basic concepts in Jina, which will be explained later. _index is a built-in embedding + structured storage that you can use out of the box.
Index

# save four docs (both embedding and structured info) into storage
with f:
    f.index(docs, on_done=print)

Search

# retrieve the top-3 neighbours of 🐲; this prints 🐲🐦🐢 with scores 0, 1, 1 respectively
with f:
    f.search(docs[0], top_k=3, on_done=lambda x: print(x.docs[0].matches))

{"id": "🐲", "tags": {"guardian": "Azure Dragon", "position": "East"}, "embedding": {"dense": {"buffer": "AAAAAAAAAAAAAAAAAAAAAA==", "shape": [2], "dtype": "<i8"}}, "score": {"opName": "NumpyIndexer", "refId": "🐲"}, "adjacency": 1}
{"id": "🐦", "tags": {"position": "South", "guardian": "Vermilion Bird"}, "embedding": {"dense": {"buffer": "AQAAAAAAAAAAAAAAAAAAAA==", "shape": [2], "dtype": "<i8"}}, "score": {"value": 1.0, "opName": "NumpyIndexer", "refId": "🐲"}, "adjacency": 1}
{"id": "🐢", "tags": {"guardian": "Black Tortoise", "position": "North"}, "embedding": {"dense": {"buffer": "AAAAAAAAAAABAAAAAAAAAA==", "shape": [2], "dtype": "<i8"}}, "score": {"value": 1.0, "opName": "NumpyIndexer", "refId": "🐲"}, "adjacency": 1}

Update

# update the 🐲 embedding in storage
docs[0].embedding = np.array([1, 1])
with f:
    f.update(docs[0])

Delete

# remove the 🐦 and 🐲 Documents from storage
with f:
    f.delete(['🐦', '🐲'])
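The top-3 result above (🐲🐦🐢 with scores 0, 1, 1) is plain nearest-neighbour retrieval over the stored embeddings. As a framework-free sketch of what the indexer computes (NumPy only, not Jina's actual `_index` implementation):

```python
import numpy as np

# the four stored embeddings from the example above
ids = ['🐲', '🐦', '🐢', '🐯']
embeddings = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)

def top_k(query, k=3):
    """Return the k closest ids with their Euclidean distances."""
    dists = np.linalg.norm(embeddings - query, axis=1)
    order = np.argsort(dists, kind='stable')[:k]
    return [(ids[i], float(dists[i])) for i in order]

print(top_k(np.array([0, 0])))  # [('🐲', 0.0), ('🐦', 1.0), ('🐢', 1.0)]
```

🐲 matches itself at distance 0, while 🐦 and 🐢 are each one unit away, reproducing the scores shown in the search output.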
Document
Document is Jina's primitive data type. It can contain text, an image, an ndarray, an embedding, or a URI, and can be accompanied by rich meta information. To construct a Document, you can use:
import numpy
from jina import Document

text_from_file = 'print("hello, world")'  # placeholder: any text read from a file
doc1 = Document(content=text_from_file, mime_type='text/x-python')  # a text Document containing Python code
doc2 = Document(content=numpy.random.random([10, 10]))  # an ndarray Document
A Document can be nested both vertically and horizontally, holding chunk Documents and match Documents. To inspect a Document's recursive structure, use the .plot() function. If you are using JupyterLab/Notebook, all Document objects are auto-rendered.
import numpy as np
from jina import Document
d0 = Document(id='🐲', embedding=np.array([0, 0]))
d1 = Document(id='🐦', embedding=np.array([1, 0]))
d2 = Document(id='🐢', embedding=np.array([0, 1]))
d3 = Document(id='🐯', embedding=np.array([1, 1]))
d0.chunks.append(d1)
d0.chunks[0].chunks.append(d2)
d0.matches.append(d3)
d0.plot() # simply `d0` on JupyterLab
MultimodalDocument
A MultimodalDocument is a Document composed of multiple Documents from different modalities (e.g. text, image, audio).
Jina provides multiple ways to build a multimodal Document. For example, you can provide the modality names and the content in a dict:
import PIL.Image
from jina import MultimodalDocument

document = MultimodalDocument(modality_content_map={
    'title': 'my holiday picture',
    'description': 'the family having fun on the beach',
    'image': PIL.Image.open('path/to/image.jpg')
})
One can also compose a MultimodalDocument from multiple Documents directly:
import PIL.Image
from jina.types import Document, MultimodalDocument

doc_title = Document(content='my holiday picture', modality='title')
doc_desc = Document(content='the family having fun on the beach', modality='description')
doc_img = Document(content=PIL.Image.open('path/to/image.jpg'), modality='image')
doc_img.tags['date'] = '10/08/2019'
document = MultimodalDocument(chunks=[doc_title, doc_desc, doc_img])
Fusion Embeddings from Different Modalities
To extract fusion embeddings from different modalities, Jina provides the BaseMultiModalEncoder abstract class, which has a unique encode interface:
def encode(self, *data: 'numpy.ndarray', **kwargs) -> 'numpy.ndarray':
...
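For illustration only, a concrete encoder could fuse modalities by concatenating the per-modality embeddings along the feature axis. The class below and its fusion strategy are assumptions, not Jina built-ins; in Jina it would subclass BaseMultiModalEncoder, but this sketch is framework-free:

```python
import numpy as np

class ConcatFusionEncoder:
    """Hypothetical fusion encoder: concatenates one embedding matrix
    per modality, received in the order given by positional_modality."""

    def encode(self, *data: np.ndarray, **kwargs) -> np.ndarray:
        # each positional argument is a (batch, dim_i) matrix for one modality
        return np.concatenate(data, axis=1)

encoder = ConcatFusionEncoder()
image_emb = np.random.random([4, 128])  # batch of 4 image embeddings
text_emb = np.random.random([4, 768])   # batch of 4 text embeddings
fused = encoder.encode(image_emb, text_emb)
print(fused.shape)  # (4, 896)
```

Real fusion encoders can be more elaborate (e.g. learned cross-modal attention), but the calling convention is the same: one positional array per modality, one fused array out.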
MultimodalDriver provides data to the MultimodalDocument in the expected order. In the example below, the image embedding is passed to the encoder as the first argument, and the text embedding as the second.
!MyMultimodalEncoder
with:
  positional_modality: ['image', 'text']
requests:
  on:
    [IndexRequest, SearchRequest]:
      - !MultiModalDriver {}
Interested readers can refer to jina-ai/example (how to build a multimodal search engine for image retrieval using TIRG, Composing Text and Image for Image Retrieval) for the usage of MultimodalDriver and BaseMultiModalEncoder in practice.
Flow
Jina provides a high-level Flow API to simplify building CRUD workflows. To create a new Flow:
from jina import Flow
f = Flow().add()
This creates a simple Flow with one Pod. You can chain multiple .add()s into a single Flow.
To visualize the Flow, chain it with .plot('my-flow.svg'). If you are using a Jupyter notebook, the Flow object is displayed inline without plot.
Gateway is the entrypoint of the Flow.
Get the vibe? Now we're talking! Let's learn more about the basic concepts and features of Jina:
🐣 Basic
Feed Data
To use a Flow, open it via the with context manager, like you would open a file in Python. Now let's create some empty Documents and index them:
from jina import Document, Flow

with Flow().add() as f:
    f.index((Document() for _ in range(10)))
Flow supports the CRUD operations index, search, update and delete. In addition, it provides sugary syntax for ndarray, csv, ndjson and arbitrary files.
numpy.ndarray

with f:
    f.index_ndarray(numpy.random.random([4, 2]))

Inputs four Documents, one per row of the ndarray.

CSV

with f, open('index.csv') as fp:
    f.index_csv(fp, field_resolver={'pic_url': 'uri'})

Each line in the CSV is constructed as a Document; field_resolver maps the pic_url column to the Document's uri attribute.

JSON Lines/ndjson/LDJSON

with f, open('index.ndjson') as fp:
    f.index_ndjson(fp, field_resolver={'question_id': 'id'})

Each line in the file is constructed as a Document; field_resolver maps the question_id field to the Document's id attribute.

Files with wildcards

with f:
    f.index_files(['/tmp/*.mp4', '/tmp/*.pdf'])

Each file captured is constructed as a Document.
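The field_resolver argument in the CSV/ndjson helpers above is, at heart, a column-to-attribute mapping. A framework-free sketch of the idea (the sample data and dict-based stand-in are illustrative, not Jina internals):

```python
import csv
import io

# a tiny in-memory CSV standing in for index.csv
raw = "question_id,pic_url\nq1,https://example.com/a.jpg\nq2,https://example.com/b.jpg\n"
field_resolver = {'pic_url': 'uri'}  # CSV column -> Document attribute

docs = []
for row in csv.DictReader(io.StringIO(raw)):
    # rename columns according to the resolver; unmapped keys pass through
    docs.append({field_resolver.get(k, k): v for k, v in row.items()})

print(docs[0])  # {'question_id': 'q1', 'uri': 'https://example.com/a.jpg'}
```

In Jina the renamed fields land on real Document attributes (uri, id, etc.) instead of dict keys, but the mapping step works the same way.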
Fetch Result
Once a request is done, callback functions are fired. Jina Flow implements a Promise-like interface: you can add the callback functions on_done, on_error and on_always to hook different events. In the example below, our Flow passes the message then prints the result when successful. If something goes wrong, it beeps. Finally, the result is written to output.txt.
import os
import numpy
from jina import Flow

def beep(*args):
    # make a beep sound
    os.system('echo -n "\a";')

with Flow().add() as f, open('output.txt', 'w') as fp:
    f.index(numpy.random.random([4, 5, 2]),
            on_done=print, on_error=beep, on_always=lambda x: fp.write(x.json()))
Add Logic
To add logic to the Flow, use the uses parameter to attach a Pod with an Executor. uses accepts multiple value types, including a class name, a Docker image, (inline) YAML, or a built-in shortcut.
f = (Flow().add(uses='MyBertEncoder') # class name of a Jina Executor
.add(uses='docker://jinahub/pod.encoder.dummy_mwu_encoder:0.0.6-0.9.3') # the image name
.add(uses='myencoder.yml') # YAML serialization of a Jina Executor
.add(uses='!WaveletTransformer | {freq: 20}') # inline YAML config
.add(uses='_pass') # built-in shortcut executor
.add(uses={'__cls': 'MyBertEncoder', 'with': {'param': 1.23}})) # dict config object with __cls keyword
The power of Jina lies in its decentralized architecture: each add creates a new Pod, and these Pods can run as a local thread/process, a remote process, inside a Docker container, or even inside a remote Docker container.
Inter & Intra Parallelism
Chaining .add()s creates a sequential Flow. For parallelism, use the needs parameter:
f = (Flow().add(name='p1', needs='gateway')
     .add(name='p2', needs='gateway')
     .add(name='p3', needs='gateway')
     .needs(['p1', 'p2', 'p3'], name='r1').plot())
p1, p2 and p3 now subscribe to Gateway and conduct their work in parallel. The last .needs() blocks all Pods until they finish their work. Note: parallelism can also be performed inside a Pod using parallel:
f = (Flow().add(name='p1', needs='gateway')
     .add(name='p2', needs='gateway')
     .add(name='p3', parallel=3)
     .needs(['p1', 'p3'], name='r1').plot())
Decentralized Flow
A Flow does not have to be local-only: you can put any Pod on a remote machine. In the example below, with the host keyword, gpu_pod is put on a remote machine for parallelization, whereas the other Pods stay local. Extra file dependencies that need to be uploaded are specified via the upload_files keyword.
On 123.456.78.9 (remote):

# have docker installed
docker run --name=jinad --network=host -v /var/run/docker.sock:/var/run/docker.sock jinaai/jina:latest-daemon --port-expose 8000
# to stop it
docker rm -f jinad

Local:

import numpy as np
from jina import Flow

f = (Flow()
     .add()
     .add(name='gpu_pod',
          uses='mwu_encoder.yml',
          host='123.456.78.9:8000',
          parallel=2,
          upload_files=['mwu_encoder.py'])
     .add())

with f:
    f.index_ndarray(np.random.random([10, 100]), output=print)
We provide a demo server at cloud.jina.ai:8000; give the following snippet a try!
from jina import Flow
with Flow().add().add(host='cloud.jina.ai:8000') as f:
f.index(['hello', 'world'])
Asynchronous Flow
While synchronous from the outside, Jina runs asynchronously under the hood: it manages the event loop(s) for scheduling jobs. If you want more control over the event loop, use AsyncFlow.
Unlike Flow, the CRUD methods of AsyncFlow accept input and output functions as async generators. This is useful when your data sources involve other asynchronous libraries (e.g. motor for MongoDB):
import asyncio
from jina import AsyncFlow, Document

async def input_fn():
    for _ in range(10):
        yield Document()
        await asyncio.sleep(0.1)

with AsyncFlow().add() as f:
    async for resp in f.index(input_fn):
        print(resp)
AsyncFlow is particularly useful when Jina is used as part of an integration, where another heavy-lifting job runs concurrently:
async def run_async_flow_5s():  # WaitDriver pauses 5s, making the total roundtrip ~5s
    with AsyncFlow().add(uses='- !WaitDriver {}') as f:
        async for resp in f.index_ndarray(numpy.random.random([5, 4])):
            print(resp)

async def heavylifting():  # total roundtrip takes ~5s
    print('heavylifting other IO-bound jobs ...')
    await asyncio.sleep(5)
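The concurrency pattern behind this is plain asyncio.gather, independent of Jina: both coroutines share one event loop, so the total wall time is the slower of the two rather than their sum. A self-contained sketch with shortened sleeps (function names here are illustrative stand-ins):

```python
import asyncio
import time

async def index_job():       # stands in for the AsyncFlow roundtrip
    await asyncio.sleep(0.2)
    return 'indexed'

async def heavy_lifting():   # stands in for a concurrent IO-bound job
    await asyncio.sleep(0.3)
    return 'done'

async def main():
    # run both concurrently; results come back in argument order
    return await asyncio.gather(index_job(), heavy_lifting())

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results)  # ['indexed', 'done'], after ~0.3s rather than 0.5s
```

Swap the short sleeps for the two ~5s jobs above and the same reasoning gives a ~5s total roundtrip instead of ~10s.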