Finetuner helps you create experiments to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance tuning for neural search applications.

Task-oriented finetuning for better embeddings on neural search

Fine-tuning is an effective way to improve performance on neural search tasks. However, it is non-trivial for many deep learning engineers.

Finetuner makes fine-tuning easier and faster by streamlining the workflow and handling all the complexity and infrastructure in the cloud. With Finetuner, you can easily uplift pre-trained models to be more performant and production-ready.

📈 Performance promise: uplift pretrained models and deliver SOTA performance on domain-specific neural search applications.

🔱 Simple yet powerful: easy access to 40+ mainstream losses, 10+ optimizers, layer pruning, weight freezing, dimensionality reduction, hard-negative mining, cross-modal models, and distributed training.

All-in-cloud: instant training with our free GPU; manage runs, experiments and artifacts on Jina AI Cloud without worrying about resource provisioning, integration complexity, or infrastructure.

Documentation

Benchmark

Model   Task                               Metric  Pretrained  Finetuned  Delta  Run it!
BERT    Quora Question Answering           mRR     0.835       0.967      15.8%  Open In Colab
                                           Recall  0.915       0.963      5.3%
ResNet  Visual similarity search on TLL    mAP     0.110       0.196      78.2%  Open In Colab
                                           Recall  0.249       0.460      84.7%
CLIP    Deep Fashion text-to-image search  mRR     0.575       0.676      17.4%  Open In Colab
                                           Recall  0.473       0.564      19.2%

All metrics are evaluated at k@20 after training for 5 epochs using the Adam optimizer with learning rates of 1e-4 for ResNet, 1e-7 for CLIP, and 1e-5 for the BERT models.
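
These settings map onto fit arguments roughly as follows (a minimal sketch using the same API shown in Get Started below; the epochs, learning_rate, and optimizer parameter names are assumptions based on the Finetuner docs, so check the API reference for your installed version):

import finetuner

finetuner.login()

# Hyperparameters matching the ResNet benchmark row above.
# epochs, learning_rate and optimizer are assumed finetuner.fit
# parameters; consult the API reference for your version.
run = finetuner.fit(
    model='resnet50',
    train_data='tll-train-data',
    epochs=5,
    learning_rate=1e-4,
    optimizer='Adam',
)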

Install

Make sure you have Python 3.7+ installed. Finetuner can be installed via pip by executing:

pip install -U finetuner

If you want to encode docarray.DocumentArray objects with the finetuner.encode function, you need to install "finetuner[full]". This installs some extra dependencies that are necessary for inference, e.g., torch, torchvision, and open-clip:

pip install "finetuner[full]"

As of 0.5.0, Finetuner computing is hosted on Jina AI Cloud. The last local version is 0.4.1; you can install it via pip or check out git tags/releases here.
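
If you need that last local version, pin it explicitly:

pip install finetuner==0.4.1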

Get Started

The following code snippet shows how to fine-tune ResNet50 on the Totally Looks Like (TLL) dataset. It can be run as-is (if a run called resnet50-tll-run already exists, choose a different name):

import finetuner
from finetuner.callback import EvaluationCallback

finetuner.login()

run = finetuner.fit(
    model='resnet50',
    run_name='resnet50-tll-run',
    train_data='tll-train-data',
    callbacks=[
        EvaluationCallback(
            query_data='tll-test-query-data',
            index_data='tll-test-index-data',
        )
    ],
)

Here, the training data is gathered from the Jina AI Cloud; however, data can also be passed as a CSV file or a DocumentArray, as described here.
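
For example, a labeled DocumentArray could be built in memory like this (a minimal sketch; the finetuner_label tag key follows the labeling convention described in the Finetuner documentation, and the file paths are placeholders):

import finetuner
from docarray import Document, DocumentArray

# Each document stores its class label under tags['finetuner_label'],
# the key the Finetuner docs use for labeled training data.
train_data = DocumentArray([
    Document(uri='photos/apple1.png', tags={'finetuner_label': 'apple'}),
    Document(uri='photos/apple2.png', tags={'finetuner_label': 'apple'}),
    Document(uri='photos/pear1.png', tags={'finetuner_label': 'pear'}),
])

run = finetuner.fit(
    model='resnet50',
    train_data=train_data,
)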
Fine-tuning might take 5 minutes to finish. You can later reconnect to your run with:

import finetuner

finetuner.login()

run = finetuner.get_run('resnet50-tll-run')

for log_entry in run.stream_logs():
    print(log_entry)

run.save_artifact('resnet-tll')

Specifically, the code snippet describes the following steps:

  • Log in to Jina AI Cloud.
  • Select the backbone model, plus training and evaluation data for your evaluation callback.
  • Start the cloud run.
  • Monitor the run: check its status and logs (see the sketch below the list).
  • Save the model for further use and integration.
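
Besides streaming logs, you can check on a run by querying its status (a minimal sketch; run.status() is part of the Finetuner client's Run interface, though its exact return value may differ between versions):

import finetuner

finetuner.login()

run = finetuner.get_run('resnet50-tll-run')

# Returns the current state of the cloud run,
# e.g. whether it is queued, running, finished or failed.
print(run.status())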

Finally, you can use the model to encode images:

import finetuner
from docarray import Document, DocumentArray

da = DocumentArray([Document(uri='~/Pictures/your_img.png')])

model = finetuner.get_model('resnet-tll')
finetuner.encode(model=model, data=da)

da.summary()

When encoding, you can provide data either as a DocumentArray or as a list. Since the modality of your input data can be inferred from the model being used, there is no need to provide any additional information besides the content you want to encode. When data is provided as a list, the finetuner.encode method returns a np.ndarray of embeddings instead of a docarray.DocumentArray:

import finetuner
from docarray import Document, DocumentArray

images = ['~/Pictures/your_img.png']

model = finetuner.get_model('resnet-tll')
embeddings = finetuner.encode(model=model, data=images)
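
The returned array can then be used directly, for example to compare two encoded items by cosine similarity (a generic NumPy sketch, not part of the Finetuner API; it assumes at least two inputs were encoded):

import numpy as np

# embeddings has shape (n_items, dim); compare the first two items.
a, b = embeddings[0], embeddings[1]
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)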

Training on your own data

If you want to train a model using your own dataset instead of one on the Jina AI Cloud, you can provide labeled data in a CSV file in the following way:

This is an apple    apple_label
This is a pear      pear_label
...
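
Such a file can be written with Python's csv module (a minimal sketch; the tab delimiter mirrors the two-column layout above, but check the Prepare Training Data docs for the exact formats Finetuner accepts):

import csv

# Two-column rows: the text to embed and its label.
rows = [
    ('This is an apple', 'apple_label'),
    ('This is a pear', 'pear_label'),
]

with open('data.csv', 'w', newline='') as f:
    csv.writer(f, delimiter='\t').writerows(rows)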

You can then provide the path to your CSV file as your training data:

run = finetuner.fit(
    model='bert-base-cased',
    run_name='bert-my-own-run',
    train_data='path/to/some/data.csv',
)

More information on providing your own training data can be found in the Prepare Training Data section of the walkthrough.

Next steps

Intrigued? That's only scratching the surface of what Finetuner is capable of. Read our docs to learn more.

Support

Join Us

Finetuner is backed by Jina AI and licensed under Apache-2.0. We are actively hiring AI engineers and solution engineers to build the next neural search ecosystem in open source.
