
Jina is the cloud-native neural search solution powered by state-of-the-art AI and deep learning


Jina banner

An easier way to build neural search in the cloud


Jina is a deep learning-powered search framework for building cross-/multi-modal search systems (e.g. text, images, video, audio) in the cloud.

⏱️ Time Saver - The design pattern of neural search systems, from zero to a production-ready system in minutes.

🍱 Full-Stack Ownership - Keep an end-to-end stack ownership of your solution, avoid the integration pitfalls with fragmented, multi-vendor, generic legacy tools.

🌌 Universal Search - Large-scale indexing and querying of unstructured data: video, image, long/short text, music, source code, etc.

🧠 First-Class AI Models - First-class support for state-of-the-art AI models, easily usable and extendable with a Pythonic interface.

🌩️ Fast & Cloud Ready - Decentralized architecture from day one. Scalabe & cloud-native by design: enjoy containerizing, distributing, sharding, async, REST/gRPC/WebSocket.

❤️ Made with Love - Never compromise on quality, actively maintained by a passionate full-time, venture-backed team.


Docs | Hello World | Quick Start | Learn | Examples | Contribute | Jobs | Website | Slack

Installation

📦 x86/64, arm/v6, v7, v8 (Apple M1)

                 On Linux/macOS with Python 3.7/3.8/3.9    Docker users
Standard         pip install -U jina                       docker run jinaai/jina:latest
Daemon           pip install -U "jina[daemon]"             docker run --network=host jinaai/jina:latest-daemon
With Extras      pip install -U "jina[devel]"              docker run jinaai/jina:latest-devel
Dev/Pre-Release  pip install --pre jina                    docker run jinaai/jina:master

Version identifiers are explained here. To install Jina with extra dependencies please refer to the docs. Jina can run on Windows Subsystem for Linux. We welcome the community to help us with native Windows support.

Jina "Hello, World!" 👋🌍

Just starting out? Try Jina's "Hello, World" - a simple image neural search demo for Fashion-MNIST. No extra dependencies needed, simply run:

jina hello-world  # more options in --help

...or even easier for Docker users, no install required:

docker run -v "$(pwd)/j:/j" jinaai/jina hello-world --workdir /j && open j/hello-world.html  
# replace "open" with "xdg-open" on Linux

This downloads the Fashion-MNIST training and test datasets and tells Jina to index 60,000 images from the training set. It then randomly samples images from the test set as queries and asks Jina to retrieve relevant results. The whole process takes about one minute, and once it finishes a webpage opens to show the results.

Covid-19 Chatbot

For NLP engineers, we provide a simple chatbot demo for answering Covid-19 questions. You will need PyTorch and Transformers, which can be installed along with Jina:

pip install "jina[torch,transformers]"
jina hello-world-chatbot

This downloads the CovidQA dataset and tells Jina to index 418 question-answer pairs with DistilBERT. Indexing takes about 1 minute on CPU. Then it opens a webpage where you can type in questions and ask Jina.

Get Started

🥚 CRUD Functions | Document | Flow
🐣 Feed Data | Fetch Result | Add Logic | Inter & Intra Parallelism | Decentralize | Asynchronous
🐥 Customize Encoder | Test Encoder | Parallelism & Batching | Add Data Indexer | Compose Flow from YAML | Search | Evaluation | REST Interface

🥚 Fundamental

CRUD Functions

First we look at basic CRUD operations. In Jina, CRUD corresponds to four functions: index (create), search (read), update, and delete. Take the Documents below as an example:

import numpy as np
from jina import Document
docs = [Document(id='🐲', embedding=np.array([0, 0]), tags={'guardian': 'Azure Dragon', 'position': 'East'}),
        Document(id='🐦', embedding=np.array([1, 0]), tags={'guardian': 'Vermilion Bird', 'position': 'South'}),
        Document(id='🐢', embedding=np.array([0, 1]), tags={'guardian': 'Black Tortoise', 'position': 'North'}),
        Document(id='🐯', embedding=np.array([1, 1]), tags={'guardian': 'White Tiger', 'position': 'West'})]

Let's build a Flow with a simple indexer:

from jina import Flow
f = Flow().add(uses='_index')

Document and Flow are basic concepts in Jina, which will be explained later. _index is a built-in embedding + structured storage that one can use out of the box.

Index
# save four docs (both embedding and structured info) into storage
with f:
    f.index(docs, on_done=print)
Search
# retrieve the top-3 neighbours of 🐲; this prints 🐲🐦🐢 with scores 0, 1, 1 respectively
with f:
    f.search(docs[0], top_k=3, on_done=lambda x: print(x.docs[0].matches))
{"id": "🐲", "tags": {"guardian": "Azure Dragon", "position": "East"}, "embedding": {"dense": {"buffer": "AAAAAAAAAAAAAAAAAAAAAA==", "shape": [2], "dtype": "<i8"}}, "score": {"opName": "NumpyIndexer", "refId": "🐲"}, "adjacency": 1}
{"id": "🐦", "tags": {"position": "South", "guardian": "Vermilion Bird"}, "embedding": {"dense": {"buffer": "AQAAAAAAAAAAAAAAAAAAAA==", "shape": [2], "dtype": "<i8"}}, "score": {"value": 1.0, "opName": "NumpyIndexer", "refId": "🐲"}, "adjacency": 1}
{"id": "🐢", "tags": {"guardian": "Black Tortoise", "position": "North"}, "embedding": {"dense": {"buffer": "AAAAAAAAAAABAAAAAAAAAA==", "shape": [2], "dtype": "<i8"}}, "score": {"value": 1.0, "opName": "NumpyIndexer", "refId": "🐲"}, "adjacency": 1}
Update
# update 🐲 embedding in the storage
docs[0].embedding = np.array([1, 1])
with f:
    f.update(docs[0])
Delete
# remove 🐦🐲 Documents from the storage
with f:
    f.delete(['🐦', '🐲'])

Document

Document is Jina's primitive data type. It can contain text, an image, an ndarray, an embedding or a URI, accompanied by rich meta information. To construct a Document, one can use:

import numpy
from jina import Document

text_from_file = 'print("hello, world")'  # any string will do; here, a snippet of Python source code
doc1 = Document(content=text_from_file, mime_type='text/x-python')  # a text document containing Python code
doc2 = Document(content=numpy.random.random([10, 10]))  # an ndarray document

A Document can be recursed both vertically and horizontally to hold nested documents (chunks) and matched documents (matches). To better see the recursive structure of a document, use the .plot() function. If you are using JupyterLab/Notebook, all Document objects are auto-rendered.

import numpy as np
from jina import Document

d0 = Document(id='🐲', embedding=np.array([0, 0]))
d1 = Document(id='🐦', embedding=np.array([1, 0]))
d2 = Document(id='🐢', embedding=np.array([0, 1]))
d3 = Document(id='🐯', embedding=np.array([1, 1]))

d0.chunks.append(d1)
d0.chunks[0].chunks.append(d2)
d0.matches.append(d3)

d0.plot()  # simply `d0` on JupyterLab 

MultimodalDocument

A MultimodalDocument is a document composed of multiple Documents from different modalities (e.g. text, image, audio).

Jina provides multiple ways to build a multimodal Document. For example, one can provide the modality names and the content in a dict:

import PIL.Image
from jina import MultimodalDocument

document = MultimodalDocument(modality_content_map={
    'title': 'my holiday picture',
    'description': 'the family having fun on the beach',
    'image': PIL.Image.open('path/to/image.jpg')
})

One can also compose a MultimodalDocument from multiple Document directly:

import PIL.Image
from jina.types import Document, MultimodalDocument

doc_title = Document(content='my holiday picture', modality='title')
doc_desc = Document(content='the family having fun on the beach', modality='description')
doc_img = Document(content=PIL.Image.open('path/to/image.jpg'), modality='image')
doc_img.tags['date'] = '10/08/2019'

document = MultimodalDocument(chunks=[doc_title, doc_desc, doc_img])
Fusion Embeddings from Different Modalities

To extract fusion embeddings from different modalities, Jina provides the BaseMultiModalEncoder abstract class, which has a unique encode interface:

def encode(self, *data: 'numpy.ndarray', **kwargs) -> 'numpy.ndarray':
    ...

MultimodalDriver feeds the data of a MultimodalDocument to the encoder in the expected order. In the example below, the image content is passed to the encoder as the first argument and the text as the second.

!MyMultimodalEncoder
with:
  positional_modality: ['image', 'text']
requests:
  on:
    [IndexRequest, SearchRequest]:
      - !MultiModalDriver {}
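
A concrete encoder matching this config might look roughly like the following sketch. The import path and the concatenation-based fusion are assumptions for illustration only, not the method used in the linked example:

import numpy as np

from jina.executors.encoders.multimodal import BaseMultiModalEncoder  # import path assumed


class MyMultimodalEncoder(BaseMultiModalEncoder):
    def encode(self, *data: 'np.ndarray', **kwargs) -> 'np.ndarray':
        # `data` arrives in the order declared by `positional_modality`
        # in the YAML above: image embeddings first, then text
        image_feats, text_feats = data
        # illustrative fusion only: concatenate the two modalities feature-wise
        return np.concatenate([image_feats, text_feats], axis=-1)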

Interested readers can refer to jina-ai/example: how to build a multimodal search engine for image retrieval using TIRG (Composing Text and Image for Image Retrieval) for the usage of MultimodalDriver and BaseMultiModalEncoder in practice.

Flow

Jina provides a high-level Flow API to simplify building CRUD workflows. To create a new Flow:

from jina import Flow
f = Flow().add()

This creates a simple Flow with one Pod. You can chain multiple .add()s in a single Flow.

To visualize the Flow, simply chain it with .plot('my-flow.svg'). If you are using a Jupyter notebook, the Flow object is displayed inline without calling .plot().
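
For example, a quick sketch (the Pod names and the file name are arbitrary):

from jina import Flow

f = Flow().add(name='step1').add(name='step2')  # a Flow with two chained Pods
f.plot('my-flow.svg')  # writes an SVG visualization of the Flow's topology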

Gateway is the entrypoint of the Flow.

Get the vibe? Now we are talking! Let's learn more about the basic concepts and features in Jina.


🥚 CRUD Functions | Document | Flow
🐣 Feed Data | Fetch Result | Add Logic | Inter & Intra Parallelism | Decentralize | Asynchronous
🐥 Customize Encoder | Test Encoder | Parallelism & Batching | Add Data Indexer | Compose Flow from YAML | Search | Evaluation | REST Interface

🐣 Basic

Feed Data

To use a Flow, open it via the with context manager, like you would open a file in Python. Now let's create some empty Documents and index them:

from jina import Document, Flow

with Flow().add() as f:
    f.index((Document() for _ in range(10)))

Flow supports the CRUD operations index, search, update and delete. It also provides sugar syntax for ndarray, CSV, ndjson and arbitrary files.

Examples of feeding different input types on index/search:

numpy.ndarray

with f:
  f.index_ndarray(numpy.random.random([4,2]))

Inputs four Documents; each document.blob is an ndarray([2]).

CSV

with f, open('index.csv') as fp:
  f.index_csv(fp, field_resolver={'pic_url': 'uri'})

Each line in index.csv is constructed as a Document; the CSV field pic_url is mapped to document.uri.

JSON Lines/ndjson/LDJSON

with f, open('index.ndjson') as fp:
  f.index_ndjson(fp, field_resolver={'question_id': 'id'})

Each line in index.ndjson is constructed as a Document; the JSON field question_id is mapped to document.id.

Files with wildcard

with f:
  f.index_files(['/tmp/*.mp4', '/tmp/*.pdf'])

Each file captured is constructed as a Document, whose content (text, blob, buffer) is auto-guessed & filled.

Fetch Result

Once a request is done, callback functions are fired. Jina Flow implements a Promise-like interface: you can add the callback functions on_done, on_error, on_always to hook different events. In the example below, our Flow passes the message and then prints the result when successful. If something goes wrong, it beeps. Finally, the result is written to output.txt.

import numpy
from jina import Flow

def beep(*args):
    # make a beep sound
    import os
    os.system('echo -n "\a";')

with Flow().add() as f, open('output.txt', 'w') as fp:
    f.index(numpy.random.random([4, 5, 2]),
            on_done=print, on_error=beep, on_always=lambda x: fp.write(x.json()))

Add Logic

To add logic to the Flow, use the uses parameter to attach a Pod with an Executor. uses accepts multiple value types including class name, Docker image, (inline) YAML or built-in shortcut.

f = (Flow().add(uses='MyBertEncoder')  # class name of a Jina Executor
           .add(uses='docker://jinahub/pod.encoder.dummy_mwu_encoder:0.0.6-0.9.3')  # the image name
           .add(uses='myencoder.yml')  # YAML serialization of a Jina Executor
           .add(uses='!WaveletTransformer | {freq: 20}')  # inline YAML config
           .add(uses='_pass')  # built-in shortcut executor
           .add(uses={'__cls': 'MyBertEncoder', 'with': {'param': 1.23}}))  # dict config object with __cls keyword

The power of Jina lies in its decentralized architecture: each add creates a new Pod, and these Pods can be run as a local thread/process, a remote process, inside a Docker container, or even inside a remote Docker container.
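
As a rough sketch (reusing parameters shown elsewhere in this README; MyEncoder, the YAML file and the host address are placeholders), the same .add() call can target different runtimes:

from jina import Flow

f = (Flow()
     .add(uses='MyEncoder')                                  # runs as a local thread/process
     .add(uses='docker://jinahub/pod.encoder.dummy_mwu_encoder:0.0.6-0.9.3')  # runs inside a Docker container
     .add(uses='myencoder.yml',
          host='123.456.78.9:8000'))                         # runs as a remote process (see Decentralized Flow below)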

Inter & Intra Parallelism

Chaining .add()s creates a sequential Flow. For parallelism, use the needs parameter:

f = (Flow().add(name='p1', needs='gateway')
           .add(name='p2', needs='gateway')
           .add(name='p3', needs='gateway')
           .needs(['p1','p2', 'p3'], name='r1').plot())

p1, p2, p3 now subscribe to Gateway and conduct their work in parallel. The last .needs() blocks all Pods until they finish their work. Note: parallelism can also be performed inside a Pod using parallel:

f = (Flow().add(name='p1', needs='gateway')
           .add(name='p2', needs='gateway')
           .add(name='p3', parallel=3)
           .needs(['p1','p3'], name='r1').plot())

Decentralized Flow

A Flow does not have to be local-only: one can run any Pod on a remote machine. In the example below, the host keyword runs gpu_pod on a remote machine for parallelization, while the other Pods stay local. Extra file dependencies that need to be uploaded are specified via the upload_files keyword.

On the remote machine 123.456.78.9, with Docker installed:
docker run --name=jinad --network=host -v /var/run/docker.sock:/var/run/docker.sock jinaai/jina:latest-daemon --port-expose 8000
# to stop it
docker rm -f jinad
On the local machine:
import numpy as np
from jina import Flow

f = (Flow()
     .add()
     .add(name='gpu_pod',
          uses='mwu_encoder.yml',
          host='123.456.78.9:8000',
          parallel=2,
          upload_files=['mwu_encoder.py'])
     .add())

with f:
    f.index_ndarray(np.random.random([10, 100]), on_done=print)

We provide a demo server on cloud.jina.ai:8000; give the following snippet a try!

from jina import Flow

with Flow().add().add(host='cloud.jina.ai:8000') as f:
    f.index(['hello', 'world'])

Asynchronous Flow

Synchronous from the outside, Jina runs asynchronously underneath: it manages the event loop(s) for scheduling the jobs. If you want more control over the event loop, AsyncFlow comes in handy.

Unlike Flow, the CRUD operations of AsyncFlow accept input and output functions as async generators. This is useful when your data sources involve other asynchronous libraries (e.g. motor for MongoDB):

import asyncio

from jina import AsyncFlow, Document

async def input_fn():
    for _ in range(10):
        yield Document()
        await asyncio.sleep(0.1)

with AsyncFlow().add() as f:
    async for resp in f.index(input_fn):
        print(resp)

AsyncFlow is particularly useful when Jina is used as part of a larger integration, where another heavy-lifting job runs concurrently:

import asyncio

import numpy
from jina import AsyncFlow

async def run_async_flow_5s():  # WaitDriver pauses 5s, making the total roundtrip ~5s
    with AsyncFlow().add(uses='- !WaitDriver {}') as f:
        async for resp in f.index_ndarray(numpy.random.random([5, 4])):
            print(resp)

async def heavylifting():  # total roundtrip takes ~5s
    print('heavylifting other io-bound jobs, e.g. download, upload, file io')
    await asyncio.sleep(5)
    print('heavylifting done after 5s')

async def concurrent_main():  # about 5s overall; with some dispatch cost, usually <7s
    await asyncio.gather(run_async_flow_5s(), heavylifting())

if __name__ == '__main__':
    asyncio.run(concurrent_main())

AsyncFlow is very useful when using Jina inside a Jupyter Notebook. Since Jupyter/IPython already manages an event loop, and thanks to autoawait, AsyncFlow runs out of the box in Jupyter.
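
For example, a sketch of a Jupyter cell that works without an explicit asyncio.run(), relying on IPython's top-level autoawait:

# inside a Jupyter cell — IPython's autoawait lets us use `async for` at the top level
import numpy
from jina import AsyncFlow

with AsyncFlow().add() as f:
    async for resp in f.index_ndarray(numpy.random.random([5, 4])):
        print(resp)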

That's all you need to know for understanding the magic behind hello-world. Now let's dive into it!


