Jina is the cloud-native neural search solution powered by state-of-the-art AI and deep learning
An easier way to build neural search in the cloud
Jina is a deep learning-powered search framework for building cross-/multi-modal search systems (e.g. text, images, video, audio) in the cloud.
⏱️ Time Saver - Bootstrap an AI-powered system in just a few minutes.
🧠 First-Class AI Models - The design pattern for neural search systems, with first-class support for state-of-the-art AI models.
🌌 Universal Search - Large-scale indexing and querying of any kind of data on multiple platforms: video, image, long/short text, music, source code, etc.
☁️ Cloud Ready - Decentralized architecture with cloud-native features out-of-the-box: containerization, microservice, scaling, sharding, async IO, REST, gRPC.
🧩 Plug & Play - Easily extendable with Pythonic interface.
❤️ Made with Love - Quality first, never compromises, maintained by a full-time, venture-backed team.
Installation
On Linux/macOS with Python 3.7/3.8:
pip install -U jina
To install Jina with extra dependencies, or to install it on a Raspberry Pi, please refer to the documentation. Windows users can use Jina via the Windows Subsystem for Linux. We welcome community help with native Windows support.
In a Docker Container
Our universal Docker image supports multiple architectures (including x64, x86, arm-64/v7/v6). It is ready to use:
docker run jinaai/jina --help
Jina "Hello, World!" 👋🌍
Just starting out? Try Jina's "Hello, World" - a simple image neural search demo for Fashion-MNIST. No extra dependencies needed, simply run:
jina hello-world
...or even easier for Docker users, no install required:
docker run -v "$(pwd)/j:/j" jinaai/jina hello-world --workdir /j && open j/hello-world.html
# replace "open" with "xdg-open" on Linux
Intrigued? Play with different options:
jina hello-world --help
Get Started
Create
Jina provides a high-level Flow API to simplify building search/index workflows. To create a new Flow:
from jina.flow import Flow
f = Flow().add()
This creates a simple Flow with one Pod. You can chain multiple .add() calls in a single Flow.
Visualize
To visualize the Flow, simply chain it with .plot(). If you are using a Jupyter notebook, it will render a flowchart inline:
f.plot()
Gateway is the entrypoint of the Flow.
Feed Data
Let's create some random data and index it:
import numpy
from jina.proto import jina_pb2

with f:
    f.index_ndarray(numpy.random.random([4, 2]), output_fn=print)  # index ndarray data, document sliced on first dimension
    f.index_lines(['hello world!', 'goodbye world!'])  # index textual data, each element is a document
    f.index_files(['/tmp/*.mp4', '/tmp/*.pdf'])  # index files and wildcard globs, each file is a document
    f.index((jina_pb2.Document() for _ in range(10)))  # index raw Jina Documents
To use a Flow, open it using the with context manager, like you would a file in Python. Once a batch is indexed, the callback function output_fn is invoked. In the example above, our Flow simply passes the message through and then prints the result. The whole data stream is asynchronous and efficient.
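To make "document sliced on first dimension" and "callback per batch" concrete, here is a minimal pure-Python sketch. It does not use Jina: slice_into_documents and index_in_batches are hypothetical helpers written for this illustration, and the real Flow does this work asynchronously across Pods.

```python
from typing import Callable, Iterable, Iterator, List, Sequence


def slice_into_documents(array: Sequence[Sequence[float]]) -> Iterator[List[float]]:
    """Yield one 'document' per row, i.e. slice the input on its first dimension."""
    for row in array:
        yield list(row)


def index_in_batches(docs: Iterable[list], batch_size: int,
                     output_fn: Callable[[list], None]) -> int:
    """Group documents into batches and call output_fn once per finished batch."""
    batch, n_batches = [], 0
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            output_fn(batch)
            batch, n_batches = [], n_batches + 1
    if batch:  # flush the final, possibly smaller, batch
        output_fn(batch)
        n_batches += 1
    return n_batches


# A (4, 2) input becomes four documents of two values each.
data = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
seen = []  # stands in for a callback such as print
n = index_in_batches(slice_into_documents(data), batch_size=3, output_fn=seen.append)
```

With batch_size=3 the callback fires twice: once for the first three documents and once for the remaining one.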
Add Logic
To add logic to the Flow, use the uses parameter to attach a Pod with an Executor. uses accepts multiple value types, including a class name, a Docker image, (inline) YAML, or a built-in shortcut.
f = (Flow().add(uses='MyBertEncoder')  # class name of a Jina Executor
           .add(uses='jinahub/pretrained-cnn:latest')  # Dockerized Jina Pod
           .add(uses='myencoder.yaml')  # YAML serialization of a Jina Executor
           .add(uses='!WaveletTransformer | {freq: 20}')  # inline YAML config
           .add(uses='_pass'))  # built-in shortcut executor
The power of Jina lies in its decentralized architecture: each add creates a new Pod, and these Pods can be run as a local thread/process, a remote process, inside a Docker container, or even inside a remote Docker container.
Inter & Intra Parallelism
Chaining .add() calls creates a sequential Flow. For parallelism, use the needs parameter:
f = (Flow().add(name='p1', needs='gateway')
           .add(name='p2', needs='gateway')
           .add(name='p3', needs='gateway')
           .needs(['p1', 'p2', 'p3'], name='r1').plot())
p1, p2, p3 now subscribe to Gateway and conduct their work in parallel. The last .needs() waits until all three Pods have finished their work before the Flow continues. Note: parallelism can also be performed inside a Pod using parallel:
f = (Flow().add(name='p1', needs='gateway')
           .add(name='p2', needs='gateway')
           .add(name='p3', parallel=3)
           .needs(['p1', 'p3'], name='r1').plot())
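The fan-out and join in the first needs example (p1, p2, p3 all reading from gateway, then r1 waiting for all three) can be pictured as a dependency graph. The sketch below is plain Python, not Jina code: execution_levels is a hypothetical helper that groups Pods into levels, where every Pod in a level depends only on Pods in earlier levels and could therefore run in parallel with its level-mates.

```python
from typing import Dict, List


def execution_levels(needs: Dict[str, List[str]]) -> List[List[str]]:
    """Group Pods into levels; Pods in the same level do not depend on each other."""
    remaining = {pod: set(deps) for pod, deps in needs.items()}
    done, levels = set(), []
    while remaining:
        # A Pod is ready once everything it needs has already finished.
        ready = sorted(p for p, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError('cyclic dependency between Pods')
        levels.append(ready)
        done.update(ready)
        for p in ready:
            del remaining[p]
    return levels


# Mirrors the first `needs` example: three Pods fan out from the gateway, r1 joins them.
parallel_flow = {
    'gateway': [],
    'p1': ['gateway'],
    'p2': ['gateway'],
    'p3': ['gateway'],
    'r1': ['p1', 'p2', 'p3'],
}
levels = execution_levels(parallel_flow)

# A purely chained Flow, by contrast, has exactly one Pod per level.
chained_levels = execution_levels({'gateway': [], 'a': ['gateway'], 'b': ['a']})
```

Here levels has three entries (gateway, then p1/p2/p3 together, then r1), while the chained Flow yields one Pod per level, which is what "sequential" means in practice.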
That's all you need to know for understanding the magic behind hello-world. Now let's dive into it!
Breakdown of hello-world
Customize Encoder
Let's first build a naive image encoder that embeds images into vectors using an orthogonal projection. To do this, we simply inherit from BaseImageEncoder: a base class from the jina.executors.encoders module. We then override its __init__() and encode() methods.
import numpy as np
from jina.executors.encoders import BaseImageEncoder


class MyEncoder(BaseImageEncoder):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        np.random.seed(1337)  # fixed seed, so the projection is reproducible
        H = np.random.rand(784, 64)
        u, s, vh = np.linalg.svd(H, full_matrices=False)
        self.oth_mat = u @ vh  # 784x64 matrix with orthonormal columns

    def encode(self, data: 'np.ndarray', *args, **kwargs):
        # flatten each 28x28 image to 784 values, scale to [0, 1], project to 64 dimensions
        return (data.reshape([-1, 784]) / 255) @ self.oth_mat
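As a quick sanity check on the "orthogonal projection" claim, the NumPy-only sketch below rebuilds the same kind of matrix outside of Jina and verifies that its columns are orthonormal. build_projection is a hypothetical helper name, and the random test images are synthetic.

```python
import numpy as np


def build_projection(in_dim: int = 784, out_dim: int = 64, seed: int = 1337) -> np.ndarray:
    """Return an (in_dim, out_dim) matrix with orthonormal columns, built via SVD."""
    rng = np.random.RandomState(seed)
    H = rng.rand(in_dim, out_dim)
    # With full_matrices=False, u is (in_dim, out_dim) and vh is (out_dim, out_dim).
    u, _, vh = np.linalg.svd(H, full_matrices=False)
    return u @ vh


W = build_projection()
gram = W.T @ W  # equals the identity matrix when the columns are orthonormal

# Encode 100 synthetic 28x28 "images" with pixel values in [0, 255].
images = np.random.RandomState(0).randint(0, 256, size=(100, 28, 28))
embeddings = (images.reshape([-1, 784]) / 255) @ W
```

Since u has orthonormal columns and vh is an orthogonal matrix, W.T @ W reduces to vh.T @ vh, which is the 64x64 identity. Each image ends up as one 64-dimensional row of embeddings.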
Jina provides a family of Executor classes, which summarize frequently-used algorithmic components in neural search. This family consists of encoders, indexers, crafters, evaluators, and classifiers, each with a well-designed interface. You can find the list of all 107 built-in executors here. If they don't meet your needs, inheriting from one of them is the easiest way to bootstrap your own Executor. Simply use our Jina Hub CLI:
pip install jina[hub] && jina hub new
Test Encoder in Flow
Let's test our encoder in the Flow with some synthetic data:
def validate(docs):
    assert len(docs) == 100
    assert NdArray(docs[0].embedding).value.shape == (64,)

f = Flow().add(uses='MyEncoder')

with f:
    f.index_ndarray(np.random.random([100, 28, 28]), output_fn=validate, callback_on='docs')
All good! Our validate function confirms that all one hundred 28x28 synthetic images have been embedded, each into a 64-dimensional vector.
Parallelism & Batching
By setting a larger input, you can play with batch_size and parallel:
f = Flow().add(uses='MyEncoder', parallel=10)

with f:
    f.index_ndarray(np.random.random([60000, 28, 28]), batch_size=1024)
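To get a feel for these numbers, the short calculation below works out how many batches the 60000-image input produces at batch_size=1024. It is plain arithmetic, not Jina code: batch_sizes is a hypothetical helper, and the per-replica figure assumes batches are spread evenly over the 10 parallel replicas, which the text above does not state.

```python
import math


def batch_sizes(n_docs: int, batch_size: int) -> list:
    """Return the size of every batch needed to cover n_docs documents."""
    n_full, rest = divmod(n_docs, batch_size)
    return [batch_size] * n_full + ([rest] if rest else [])


sizes = batch_sizes(60000, 1024)
n_batches = len(sizes)    # 58 full batches plus one partial batch
last_batch = sizes[-1]    # the partial batch holds the leftover documents

# Assumption: an even spread of batches over parallel=10 replicas.
per_replica = math.ceil(n_batches / 10)
```

The input splits into 59 batches, the last one holding 608 images, so under the even-spread assumption each replica handles at most 6 batches.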
Add Data Indexer
Now we need to add an indexer to store all the embeddings and the images for later retrieval. Jina provides a simple numpy-powered vector indexer NumpyIndexer, and a key-value indexer BinaryPbIndexer. We can combine them in a single YAML file:
!CompoundIndexer
components:
  - !NumpyIndexer
    with:
      index_filename: vec.gz
  - !BinaryPbIndexer
    with:
      index_filename: chunk.gz
metas:
  workspace: ./
- ! tags a structure with a class name
- with defines arguments for initializing this class object.
Essentially, the above YAML config is equivalent to the following Python code:
from jina.executors.indexers import CompoundIndexer
from jina.executors.indexers.vector import NumpyIndexer
from jina.executors.indexers.keyvalue import BinaryPbIndexer

a = NumpyIndexer(index_filename='vec.gz')
b = BinaryPbIndexer(index_filename='chunk.gz')  # same filename as in the YAML config
c = CompoundIndexer()
c.components = lambda: [a, b]
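Why pair a vector indexer with a key-value indexer? The toy sketch below shows the division of labor: one component finds the keys of the nearest embeddings, the other maps those keys back to the stored documents. All three Toy* classes are hypothetical stand-ins written for this illustration. They are not Jina's NumpyIndexer, BinaryPbIndexer, or CompoundIndexer, and they keep everything in memory.

```python
from typing import Any, Dict, List


class ToyVectorIndexer:
    """Stores (key, vector) pairs and returns the keys of the nearest vectors."""

    def __init__(self) -> None:
        self.keys: List[int] = []
        self.vectors: List[List[float]] = []

    def add(self, key: int, vector: List[float]) -> None:
        self.keys.append(key)
        self.vectors.append(vector)

    def query(self, vector: List[float], top_k: int) -> List[int]:
        # Brute-force squared Euclidean distance to every stored vector.
        scored = [(sum((a - b) ** 2 for a, b in zip(vector, v)), k)
                  for k, v in zip(self.keys, self.vectors)]
        return [k for _, k in sorted(scored)[:top_k]]


class ToyKeyValueIndexer:
    """Stores the full document under its key."""

    def __init__(self) -> None:
        self.store: Dict[int, Any] = {}

    def add(self, key: int, doc: Any) -> None:
        self.store[key] = doc

    def query(self, key: int) -> Any:
        return self.store[key]


class ToyCompoundIndexer:
    """Combines both: vector search yields keys, key-value lookup yields documents."""

    def __init__(self) -> None:
        self.vec = ToyVectorIndexer()
        self.kv = ToyKeyValueIndexer()

    def index(self, key: int, vector: List[float], doc: Any) -> None:
        self.vec.add(key, vector)
        self.kv.add(key, doc)

    def search(self, vector: List[float], top_k: int) -> List[Any]:
        return [self.kv.query(k) for k in self.vec.query(vector, top_k)]


idx = ToyCompoundIndexer()
idx.index(1, [0.0, 0.0], 'shoe')
idx.index(2, [1.0, 1.0], 'shirt')
idx.index(3, [0.1, 0.0], 'boot')
hits = idx.search([0.0, 0.1], top_k=2)
```

The query vector is closest to the entries stored under keys 1 and 3, so hits contains the corresponding documents in order of increasing distance.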
Compose Flow in Python/YAML
Now let's add our indexer YAML file to the Flow with .add(uses=). Let's also add two shards to the indexer to improve its scalability:
f = Flow().add(uses='MyEncoder', parallel=2).add(uses='myindexer.yml', shards=2, separated_workspace=True).plot()
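Conceptually, once an index is split into shards, a query has to collect partial results from each shard and combine them. The sketch below illustrates that merge step in plain Python under the assumption that every shard returns its own matches sorted by distance. merge_top_k is a hypothetical helper, and this is not a description of how Jina implements sharding internally.

```python
import heapq
import itertools
from typing import List, Tuple

Match = Tuple[float, str]  # (distance, document id)


def merge_top_k(shard_results: List[List[Match]], top_k: int) -> List[Match]:
    """Merge per-shard match lists (each sorted by ascending distance) into a global top-k."""
    merged = heapq.merge(*shard_results)  # lazily merges already-sorted inputs
    return list(itertools.islice(merged, top_k))


shard_0 = [(0.10, 'a'), (0.40, 'c')]
shard_1 = [(0.20, 'b'), (0.30, 'd')]
best = merge_top_k([shard_0, shard_1], top_k=3)
```

Because heapq.merge is lazy, only as many matches as requested are pulled from the shards' lists.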
When you have many arguments, constructing a Flow in Python can get cumbersome. In that case, you can simply move all arguments into one flow.yml:
!Flow
pods:
  encode:
    uses: MyEncoder
    parallel: 2
  index:
    uses: myindexer.yml
    shards: 2
    separated_workspace: true
And then load it in Python:
f = Flow.load_config('flow.yml')
Search via Query Flow
Querying a Flow is similar to what we did with indexing. Simply load the query Flow and switch from f.index to f.search. Say you want to retrieve the top 50 documents that are similar to your query and then plot them in HTML:
f = Flow.load_config('flows/query.yml')

with f:
    f.search_ndarray(shuffle=True, size=128, output_fn=plot_in_html, top_k=50)
REST Interface of Query Flow
In practice, the query Flow and the client (i.e. data sender) are often physically separated. Moreover, the client may prefer to use a REST API rather than gRPC when querying. You can set port_expose to a public port and turn on REST support with rest_api=True:
f = Flow(port_expose=45678, rest_api=True)

with f:
    f.block()
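On the client side, a REST query is then just an HTTP POST with a JSON body. The standard-library sketch below only builds such a request, so it runs without a server. Note the assumptions: the endpoint path /api/search and the payload fields top_k, mode and data are not given in this document and should be checked against the REST documentation of your Jina version; build_search_request is a hypothetical helper.

```python
import json
import urllib.request
from typing import List


def build_search_request(host: str, port: int, queries: List[str], top_k: int = 10,
                         path: str = '/api/search') -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a REST-enabled query Flow.

    The default path and the payload layout are assumptions, not confirmed by this README.
    """
    payload = {'top_k': top_k, 'mode': 'search', 'data': queries}
    return urllib.request.Request(
        url=f'http://{host}:{port}{path}',
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


req = build_search_request('localhost', 45678, ['hello world'], top_k=50)

# To actually send it while the Flow above is blocking:
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```

Keeping request construction separate from sending makes it easy to inspect the URL and body before pointing the client at a live Flow.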
That is the essence behind jina hello-world. It is merely a taste of what Jina can do. We're really excited to see what you do with Jina! You can easily create a Jina project from templates with one terminal command:
pip install jina[hub] && jina hub new --type app
This creates a Python entrypoint, YAML configs and a Dockerfile. You can start from there.
Tutorials
Jina 101: First Things to Learn About Jina
English • 日本語 • Français • Português • Deutsch • Русский язык • 中文 • عربية

Level | Tutorials |
---|---|
🐣 | Build an NLP Semantic Search System: search South Park scripts and practice with Flows and Pods |
🐣 | My First Jina App: use cookiecutter to bootstrap a Jina app |
🐣 | Fashion Search with Query Language: spice up the Hello-World with Query Language |
🕊 | Use Chunk to search Lyrics: split documents in order to search on a fine-grained level |
🕊 | Mix and Match images and captions: search cross-modal to get images from captions and vice versa |
🚀 | Scale Up Video Semantic Search: improve performance using prefetching and sharding |
Documentation
Documentation is built on every push, merge, and release of Jina's master branch.
The Basics
- Use Flow API to Compose Your Search Workflow
- Input and Output Functions in Jina
- Use Dashboard to Get Insight of Jina Workflow
- Distribute Your Workflow Remotely
- Run Jina Pods via Docker Container
Reference
- Command line interface arguments
- Python API interface
- YAML syntax for Executor, Driver and Flow
- Protobuf schema
- Environment variables
- ... and more
Are you a "Doc"-star? Join us! We welcome all kinds of improvements on the documentation.
Documentation for older versions is archived here.
Contributing
We welcome all kinds of contributions from the open-source community, individuals and partners. We owe our success to your active involvement.
Contributors ✨
Community
- Code of conduct - play nicely with the Jina community
- Slack workspace - join #general on our Slack to meet the team and ask questions
- YouTube channel - subscribe to the latest video tutorials, release demos, webinars and presentations.
- LinkedIn - get to know Jina AI as a company and find job opportunities
- Follow and interact with us using the hashtag #JinaSearch
- Company - know more about our company and how we are fully committed to open-source.
Open Governance
GitHub milestones lay out the path to Jina's future improvements.
As part of our open governance model, we host Jina's Engineering All Hands in public. This Zoom meeting takes place on the second Tuesday of each month, 14:00-15:30 (CET). Everyone can join in via the following calendar invite.
The meeting will also be live-streamed and later published to our YouTube channel.
Join Us
Jina is an open-source project. We are hiring full-stack developers, evangelists, and PMs to build the next neural search ecosystem in open source.
License
Copyright (c) 2020 Jina AI Limited. All rights reserved.
Jina is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.