How to build an Image-to-Image search tool using CLIP & VectorDBs

November 23, 2023

In this article we walk you through the process of building an Image-to-Image search tool from scratch!

By the end of this post you will have learned, hands-on, why Image-to-Image search is a powerful technique that can help you find similar images in a vector database.

Table of Contents

  1. Image-to-Image Search
  2. CLIP & VectorDBs: in brief
  3. Building an image-to-image search tool
  4. Time for testing: The Lord of The Rings
  5. Wait, what can I do if I have 1M or even 100M images?

1. Image-to-Image Search

What do we mean by Image-to-Image Search?

In traditional image search engines, you typically use text queries to find images, and the search engine returns results based on keywords associated with those images. On the other hand, in Image-to-Image search, you start with an image as a query and the system retrieves images that visually resemble the query image.

Imagine you have a painting, like a beautiful picture of a sunset. Now, you want to find other paintings that look just like it, but you can’t use words to describe it. Instead, you show the computer your painting, and it goes through all the paintings it knows and finds ones that are very similar, even if they have different names or descriptions. Image-to-Image Search, ELI5.

What can I do with this search tool?

An image-to-image search engine opens up exciting possibilities:

  • Finding specific data — Search for images that contain specific objects you want to train a model to recognize.
  • Error analysis — When a model misclassifies an object, search for visually similar images it also fails on.
  • Model debugging — Surface other images that contain attributes or defects that cause unwanted model behavior.

2. CLIP and VectorDBs: in brief

Figure 1. Indexing stage in Image-to-Image search

Figure 1 shows the steps to index a dataset of images in a vector database.

  • Step 1: Gathering a dataset of images (can be raw/unlabelled images).
  • Step 2: CLIP [1], an embedding model, is used to extract a high-dimensional vector representation of an image that captures its semantic and perceptual features.
  • Step 3: These images are encoded into an embedding space, where embeddings (of the images) are indexed in a vector database like Redis or Milvus.

Figure 2. Query stage: the most similar images to the given query are retrieved

At query time, Figure 2, a sample image is passed through the same CLIP encoder to obtain its embedding. A vector similarity search is performed to efficiently find the top k nearest database image vectors. Images with the highest similarity score to the given query are returned as the most visually similar search results.

Cosine similarity is the most used similarity metric in VectorDB applications:

Cosine Similarity is a measure of similarity between two non-zero vectors defined in an inner product space.

Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle. [3]
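The definition above can be made concrete with a short NumPy sketch of the retrieval step in Figure 2. Random vectors stand in for the CLIP embeddings, and every name below is made up for the example:

```python
import numpy as np

def cosine_similarity(query: np.ndarray, database: np.ndarray) -> np.ndarray:
    """Cosine similarity between a query vector and every row of `database`."""
    query_norm = query / np.linalg.norm(query)
    database_norm = database / np.linalg.norm(database, axis=1, keepdims=True)
    return database_norm @ query_norm  # dot product of unit vectors == cos(angle)

def top_k_search(query: np.ndarray, database: np.ndarray, k: int = 3):
    """Return the indices and scores of the k most similar database vectors."""
    scores = cosine_similarity(query, database)
    top_k = np.argsort(scores)[::-1][:k]  # highest similarity first
    return top_k, scores[top_k]

# Toy "index": 1,000 vectors of 512 dimensions (the size of CLIP ViT-B/32 embeddings)
rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 512))
query = database[42]  # query with a vector that is already in the index

ids, scores = top_k_search(query, database, k=3)
# ids[0] is 42 and scores[0] is ~1.0: the best match is the query itself
```

A real vector database performs the same computation, but over an index structure that avoids comparing the query against every stored vector.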


3. Building an image-to-image search tool

3.1 Dataset — The Lord of The Rings

We use Google Search to query images related to the keyword: “the lord of the rings film scenes”. Building on top of this code, we create a function to retrieve 100 URLs for the given query.

import requests, lxml, re, json
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36"
}

params = {
    "q": "the lord of the rings film scenes", # search query
    "tbm": "isch",                # image results
    "hl": "en",                   # language of the search
    "gl": "us",                   # country where search comes from
    "ijn": "0"                    # page number
}

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

def get_images():
    """
    If you try to json.loads() without json.dumps() it will throw an error:
    "Expecting property name enclosed in double quotes"
    """
    all_script_tags = soup.select("script")

    # Grab the inline JSON blob that holds the image data
    matched_images_data = "".join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)

    matched_google_image_data = re.findall(r'\"b-GRID_STATE0\"(.*)sideChannel:\s?{}}', matched_images_data_json)

    matched_google_images_thumbnails = ", ".join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(", ")

    thumbnails = [
        bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape")
        for thumbnail in matched_google_images_thumbnails
    ]

    # Removing previously matched thumbnails for easier full-resolution image matches
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', "", str(matched_google_image_data))

    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)

    full_res_images = [
        bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape")
        for img in matched_google_full_resolution_images
    ]

    return full_res_images

3.2 Obtaining embedding vectors with CLIP

🗒 Note: Find all the libraries and helper functions to run the code in this Colab notebook.
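For reference, here is a minimal sketch of what a helper like `get_single_image_embedding` can look like, assuming the `openai/clip-vit-base-patch32` checkpoint from Hugging Face transformers (the notebook's actual implementation may differ):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def get_single_image_embedding(image, processor, model, device):
    """Return a (1, 512) CLIP embedding for a PIL image."""
    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        embedding = model.get_image_features(**inputs)
    return embedding.cpu().numpy()
```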

Extracting all the embeddings for our set of images.

from io import BytesIO
from PIL import Image

def get_all_image_embeddings_from_urls(dataset, processor, model, device, num_images=100):
    embeddings = []

    # Limit the number of images to process
    dataset = dataset[:num_images]
    working_urls = []

    for image_url in dataset:
        if check_valid_URL(image_url):
            try:
                # Download the image
                response = requests.get(image_url)
                image = Image.open(BytesIO(response.content)).convert("RGB")
                # Get the embedding for the image
                embedding = get_single_image_embedding(image, processor, model, device)
                embeddings.append(embedding)
                working_urls.append(image_url)
            except Exception as e:
                print(f"Error processing image from {image_url}: {e}")
        else:
            print(f"Invalid or inaccessible image URL: {image_url}")

    return embeddings, working_urls

LOR_embeddings, valid_urls = get_all_image_embeddings_from_urls(list_image_urls, processor, model, device, num_images=100)
Invalid or inaccessible image URL:
Invalid or inaccessible image URL:
Invalid or inaccessible image URL:

97 out of 100 URLs contain a valid image.

3.3. Storing our embeddings in Pinecone

For this article, we’ll use Pinecone as an example of a VectorDB, but you may use a variety of other VectorDB providers, such as Qdrant, Milvus, MongoDB, or Redis.

🔍 You can find a nice comparison of these vector database services in our article on VectorDBs.

To store our embeddings in Pinecone [2], you first need to create a Pinecone account. After that, create an index with the name “image-to-image”.

import pinecone

pinecone.init(
   api_key = "YOUR-API-KEY",
   environment="gcp-starter"  # find next to API key in console
)

my_index_name = "image-to-image"
vector_dim = LOR_embeddings[0].shape[1]

if my_index_name not in pinecone.list_indexes():
  print("Index not present")

# Connect to the index
my_index = pinecone.Index(index_name = my_index_name)

Create a function to store your data in your Pinecone index.

def create_data_to_upsert_from_urls(dataset, embeddings, num_images):
  metadata = []
  image_IDs = []
  for index in range(len(dataset)):
    metadata.append({
        'ID': index,
        'image': dataset[index]
    })
    image_IDs.append(str(index))
  image_embeddings = [arr.tolist() for arr in embeddings]
  data_to_upsert = list(zip(image_IDs, image_embeddings, metadata))
  return data_to_upsert

Run the above function to obtain:

LOR_data_to_upsert = create_data_to_upsert_from_urls(valid_urls, 
                                LOR_embeddings, len(valid_urls))

my_index.upsert(vectors = LOR_data_to_upsert)
# {'upserted_count': 97}

my_index.describe_index_stats()
# {'dimension': 512,
#  'index_fullness': 0.00097,
#  'namespaces': {'': {'vector_count': 97}},
#  'total_vector_count': 97}

4. Time for testing: The Lord of The Rings

import random

# For a random image
n = random.randint(0, len(valid_urls)-1)
print(f"Sample image with index {n} in {valid_urls[n]}")

Sample image with index 47 in

Figure 3. Sample image to query (can be found in the above URL)

# 1. Get the image from url
LOR_image_query = get_image(valid_urls[n])
# 2. Obtain embeddings (via CLIP) for the given image
LOR_query_embedding = get_single_image_embedding(LOR_image_query, processor, model, device).tolist()
# 3. Search on Vector DB index for similar images to "LOR_query_embedding"
LOR_results = my_index.query(LOR_query_embedding, top_k=3, include_metadata=True)
# 4. See the results
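`my_index.query` returns the matches as a list of dictionaries with `id`, `score`, and `metadata` fields. A small helper (hypothetical; the names and URLs below are ours) to inspect the results as text:

```python
def format_matches(results: dict) -> list:
    """Format each match as 'ID ... score ... url ...' from the Pinecone result layout."""
    lines = []
    for match in results["matches"]:
        url = match.get("metadata", {}).get("image", "<no metadata>")
        lines.append(f"ID: {match['id']}  score: {match['score']:.2f}  url: {url}")
    return lines

# Example with the layout returned by my_index.query(..., include_metadata=True)
sample_results = {
    "matches": [
        {"id": "47", "score": 1.00, "metadata": {"image": "https://example.com/47.jpg"}},
        {"id": "63", "score": 0.77, "metadata": {"image": "https://example.com/63.jpg"}},
        {"id": "30", "score": 0.77, "metadata": {"image": "https://example.com/30.jpg"}},
    ]
}
for line in format_matches(sample_results):
    print(line)
```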

Figure 5. Results showing the similarity score for every match

Figure 4 shows the results obtained by our Image-to-Image search tool. All of them depict at least two characters walking against an open, landscape-like background. Specifically, the sample with ID 47 attains the highest similarity score, 1.0. This is no surprise, as our dataset includes the original image used in the query (Figure 3). The next most similar samples are a tie: IDs 63 and 30 each score 0.77.

5. Wait, what can I do if I have 1M or even 100M images?

As you might have realized, building a tool for image-to-image search by querying some images from Google Search is fun. But what if you actually have a dataset of 100M+ images? 🤔

In this case, you are likely to build a system rather than a tool. Setting up a scalable system is no easy feat, and there are a number of costs involved (e.g., storage, maintenance, writing the actual code).
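The main bottleneck at that scale is that comparing the query against every single vector (exact search, as in our 97-image demo) becomes too slow, which is why production systems rely on approximate nearest-neighbor (ANN) indexes. Here is a toy NumPy sketch of one classic ANN idea, the inverted file (IVF): partition the vectors into buckets around centroids and, at query time, scan only the few buckets closest to the query. This is an illustration only, not how any particular vector database is implemented:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 64)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit norm: dot == cosine

# "Training": pick random centroids and assign every vector to its nearest one
n_buckets = 100
centroids = vectors[rng.choice(len(vectors), size=n_buckets, replace=False)]
assignments = np.argmax(vectors @ centroids.T, axis=1)
buckets = {b: np.where(assignments == b)[0] for b in range(n_buckets)}

def ann_search(query: np.ndarray, k: int = 3, n_probe: int = 5):
    """Scan only the n_probe buckets whose centroids are most similar to the query."""
    query = query / np.linalg.norm(query)
    nearest_buckets = np.argsort(query @ centroids.T)[::-1][:n_probe]
    candidates = np.concatenate([buckets[b] for b in nearest_buckets])
    scores = vectors[candidates] @ query
    order = np.argsort(scores)[::-1][:k]
    return candidates[order], scores[order]

ids, scores = ann_search(vectors[123], k=3)
# vectors[123] is found as its own nearest neighbor while scanning ~5% of the data
```

Libraries such as FAISS, and the ANN indexes inside managed vector databases, implement far more refined versions of this recall-versus-speed trade-off.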

For these scenarios, at Tenyks we have built a best-in-class Image-to-Image search engine that can help you perform multi-modal queries in seconds, even if you have 1 million images or more!

Our system also supports text and object-level search! Try our free sandbox account here.


[1] CLIP: Learning Transferable Visual Models From Natural Language Supervision

[2] Text-to-image and image-to-image search, Pinecone

[3] Cosine similarity, Wikipedia

Authors: Jose Gabriel Islas Montero, Dmitry Kazhdan

If you would like to know more about Tenyks, sign up for a sandbox account.
