Fashion Image Search Engine

Introduction

Computers are able to see, hear and learn. Welcome to the future.

Dave Waters

In this post, I want to talk about a computer vision use case called Content-Based Image Retrieval, or CBIR for short. In simple words, CBIR means retrieving images relevant to the user’s needs from an image database on the basis of low-level visual features.

Image Search Engines are a great example of CBIR: the user simply gives an image as input, and the search engine returns images similar to it. In this blog post, we are going to cover some theoretical concepts of computer vision and then use these concepts to build a Fashion Image Search Engine. So without further ado, let’s get started.

Computer Vision

Computer Vision is an application of Artificial Intelligence where we try to extract information and insights from visual inputs such as digital images and videos. Some of the major use cases of computer vision are:

  1. Self Driving Cars
  2. Object Detection
  3. Image Retrieval
  4. X-Ray Analysis
  5. CT and MRI

In this blog post, we will look into one such application of Computer Vision: Content-Based Image Retrieval or, in simple words, an Image Search Engine.

Before moving to the next section, I would like to clarify that this post assumes you have some basic knowledge of Convolutional Neural Networks, or CNNs for short. If not, please feel free to refer to another post of mine here, which talks about CNNs in depth. In the next section, I will take the discussion on CNNs further, talk about the concept of transfer learning, and then discuss some pre-trained CNN architectures that will be used in our project.

Transfer Learning

It is generally not a good idea to train a neural network from scratch; instead, we should try to find an existing neural network that accomplishes a similar task and reuse its weights and lower layers to solve our problem. This is the concept of Transfer Learning. It eases our job because we no longer have to train an entire neural network on our data. In computer vision, transfer learning is typically implemented using pre-trained CNN architectures.
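
A minimal sketch of this idea in Keras follows; the 10-class head is purely illustrative and not part of this project:

import tensorflow as tf
from tensorflow.keras.applications import ResNet50

# Load a network pre-trained on ImageNet, dropping its task-specific top layers
base_model = ResNet50(weights='imagenet', include_top=False, pooling='avg')

# Freeze the pre-trained layers so their weights are reused, not retrained
base_model.trainable = False

# Add a new head for the task at hand (10 classes here, purely illustrative)
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy')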

In the next section, we will discuss some famous CNN architectures which will be used later in our project with the help of transfer learning.

CNN Architectures

A CNN architecture can be made by stacking a few convolutional layers followed by pooling layers, and finally fully connected layers, as sketched below. There are some famous CNN architectures which have secured top positions in the ImageNet challenge and are often reused for other tasks through transfer learning. We will discuss three such architectures here and use them in our project.
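
As a toy illustration of this stacking pattern (the layer sizes are arbitrary):

from tensorflow import keras
from tensorflow.keras import layers

# A toy CNN following the pattern above: conv + pool blocks, then dense layers
model = keras.Sequential([
    layers.Conv2D(32, 3, activation='relu', input_shape=(224, 224, 3)),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax')  # e.g., 10 output classes
])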

1. ResNet

ResNet stands for Residual Network. Popular variants of this architecture have 34, 50, and 101 layers, making it a very deep CNN. The key to training such deep networks is the use of skip connections, also called shortcut connections: the signal feeding into a layer is also added to the output of a layer located higher up the stack. The following diagram explains skip connections in CNNs.

Regular Neural Network and Neural Network with Skip Connections
Source

From the above diagram it can be observed that a regular network models a target function h(x), while the network with a skip connection models f(x) = h(x) − x and outputs f(x) + x. When we initialise a regular neural network, its weights are close to zero, so the network just outputs values close to zero. If we add a skip connection, the resulting network instead outputs a copy of its inputs, which is what speeds up the training of deep networks. We can also stack a set of residual units to train even deeper networks; a residual unit is a small neural network with a skip connection. The following image shows a neural network with residual units, and a minimal code sketch of a single unit appears after it.

Neural Network with Residual Units
Source
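
To make the idea concrete, here is a minimal sketch of a residual unit in Keras; it omits the batch normalisation and the 1×1 projection shortcut that real ResNets use when the input and output shapes differ:

from tensorflow.keras import layers

def residual_unit(x, filters):
    # Assumes x already has `filters` channels, so the addition below is valid
    shortcut = x
    out = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    out = layers.Conv2D(filters, 3, padding='same')(out)
    out = layers.Add()([out, shortcut])  # the skip connection: f(x) + x
    return layers.Activation('relu')(out)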

2. VGGNet

VGGNet is a very simple and classical architecture: either 2 or 3 convolutional layers followed by a pooling layer, then again 2 or 3 convolutional layers and a pooling layer, and so on, reaching a total of 16 or 19 weight layers depending on the variant (VGG-16 or VGG-19). The architecture ends with a dense network of 2 hidden layers and the output layer.
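
If you want to inspect this structure yourself, Keras ships VGG-16 with ImageNet weights; its summary shows the repeated conv/pool blocks followed by the dense layers fc1 and fc2 and the output layer:

from tensorflow.keras.applications import VGG16

# Load VGG-16 with ImageNet weights and inspect its layer stack
model = VGG16(weights='imagenet')
model.summary()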

3. Xception

Xception stands for Extreme Inception. The architecture uses a special type of layer called a Depth-wise Separable Convolution Layer. A regular convolutional layer uses filters that try to simultaneously capture spatial patterns (e.g., shapes like ovals and circles) and cross-channel patterns (e.g., mouth + nose + eyes = face). A depth-wise separable layer models these 2 kinds of patterns (spatial and cross-channel) separately. It is composed of 2 parts: the first part applies a single spatial filter to each input feature map, and the second part deals only with cross-channel patterns (it is a regular convolution layer with 1×1 filters). The following image shows the depth-wise separable convolution layer in comparison with a regular convolution layer. The first part is known as the depth-wise layer, as it has spatial-only filters, one per input channel (there are generally 3 input channels: R, G, and B). The second part is known as the point-wise layer; it is a 1×1 convolutional layer, hence it looks at one point at a time across all channels.

a): Regular Convolution layer, b): Depth-wise separable convolution layer
Source
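
In Keras, the two kinds of layer sit side by side; a small sketch (the filter counts are arbitrary):

import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(299, 299, 3))

# Regular convolution: each filter captures spatial and cross-channel
# patterns simultaneously
regular = layers.Conv2D(64, 3, padding='same')(inputs)

# Depth-wise separable convolution: one spatial filter per input channel
# (depth-wise part), then a 1x1 convolution across channels (point-wise part)
separable = layers.SeparableConv2D(64, 3, padding='same')(inputs)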

So far, we have discussed the concept of transfer learning and various CNN architectures. In the next section, we discuss the most crucial part of an Image Search Engine: the Image Retrieval System, which is a subtype of an Information Retrieval System.

Image Retrieval System

The simple concept behind an Information Retrieval System is to compute the similarity between a query and the information/features available in the database and return the most similar items. In the case of an Image Retrieval System, we extract the features of all the images in the database and store them. Then, when a query image arrives, we extract its features and compute the similarity between the features of the query image and the features of the images in our database. Finally, the top n most similar images are returned to the user. The following image demonstrates the idea of an Image Retrieval System.

An Image Retrieval System
Source
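
In code, the whole idea fits in a few lines. Here is a rough sketch, where extract_features() stands in for the feature extractor we build later in this post:

import numpy as np

def retrieve(query_img, database_features, n=10):
    # database_features maps an image index to its stored feature vector
    query_features = extract_features(query_img)

    # Distance between the query and every image in the database
    distances = {
        idx: np.linalg.norm(query_features - feats)
        for idx, feats in database_features.items()
    }

    # Smaller distance = more similar; return the indices of the top n images
    return sorted(distances, key=distances.get)[:n]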

Fashion Image Search Engine


A Fashion Image Search Engine returns images of apparel similar to the input image, along with a short description. It is an application of Content-Based Image Retrieval. We will be using the Keras and TensorFlow libraries to build this project. You can access the data used for this project from here; the link can also be found in the “References” section. The dataset consists of the following:

  1. A CSV file that contains the listing of products, their short descriptions, and their prices.
  2. The images of the products referenced in the CSV file (the Farfetch listing).

Let’s load the listing data and look at the variables that we have. The following code cell demonstrates this:

# Read the data files
import pandas as pd

listing_data = pd.read_csv("current_farfetch_listings.csv")
listing_data.head()

Next, we create a function to load the images from the directories. The code cell below demonstrates this.

# Extracting the images
import os
from random import sample

import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing import image

def load_images():
    
    # Store the directory paths in variables
    cutout_img_dir = "/content/cutout-img/cutout"
    model_img_dir = "/content/model-img/model"
    
    # List the images in these directories
    cutout_images = os.listdir(cutout_img_dir)
    model_images = os.listdir(model_img_dir)
    
    # load 10 Random Cutout Images: Sample out 10 images randomly from the above list
    sample_cutout_images = sample(cutout_images,10)
    fig = plt.figure(figsize=(10, 5))
    
    print("==============Cutout Images==============")
    for i, img_name in enumerate(sample_cutout_images):
        plt.subplot(2, 5, i+1)
        img_path = os.path.join(cutout_img_dir, img_name)
        loaded_img = image.load_img(img_path)
        img_array = image.img_to_array(loaded_img, dtype='int')
        plt.imshow(img_array)
        plt.axis('off')
        
    plt.show()
    print()
    # load 10 Random Model Images: Sample out 10 images randomly from the above list
    sample_model_images = sample(model_images,10)
    fig = plt.figure(figsize=(10,5))
    
    print("==============Model Images==============")
    for i, img_name in enumerate(sample_model_images):
        plt.subplot(2, 5, i+1)
        img_path = os.path.join(model_img_dir, img_name)
        loaded_img = image.load_img(img_path)
        img_array = image.img_to_array(loaded_img, dtype='int')
        plt.imshow(img_array)
        plt.axis('off')
        
    plt.show()

Next, we build a FeatureExtractor class using the pre-trained CNN architectures discussed above. ImageNet weights are used for all three pre-trained models.

# Creating a class for feature extraction and finding the most similar images

'''
Comparing 3 different models

1. VGG 16
2. ResNet 50
3. Xception
'''

import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input as vgg_preprocess
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input as resnet_preprocess
from tensorflow.keras.applications.xception import Xception, preprocess_input as xception_preprocess

class FeatureExtractor:
    
    # Constructor
    def __init__(self, arch='VGG'):
        
        self.arch = arch
        
        # Using VGG -16 as the architecture with ImageNet weights
        if self.arch == 'VGG' :
            base_model = VGG16(weights = 'imagenet')
            self.model = Model(inputs = base_model.input, outputs = base_model.get_layer('fc1').output)
        
        # Using the ResNet 50 as the architecture with ImageNet weights
        elif self.arch == 'ResNet':
            base_model = ResNet50(weights = 'imagenet')
            self.model = Model(inputs = base_model.input, outputs = base_model.get_layer('avg_pool').output)
        
        # Using the Xception as the architecture with ImageNet weights
        elif self.arch == 'Xception':
            base_model = Xception(weights = 'imagenet')
            self.model = Model(inputs = base_model.input, outputs = base_model.get_layer('avg_pool').output)
            
    
    # Method to extract image features
    def extract_features(self, img):
        
        # VGG 16 & ResNet 50 expect 224x224 inputs, while Xception expects 299x299
        if self.arch == 'VGG' or self.arch == 'ResNet':
            img = img.resize((224, 224))
        elif self.arch == 'Xception':
            img = img.resize((299, 299))
        
        # Convert the image to RGB
        img = img.convert('RGB')
        
        # Convert into array
        x = image.img_to_array(img)
        x = np.expand_dims(x, axis=0)
        
        if self.arch == 'VGG':
            # Preprocess the input as per VGG 16
            x = vgg_preprocess(x)
            
        elif self.arch == 'ResNet':
            # Preprocess the input as per ResNet 50
            x = resnet_preprocess(x)
            
        elif self.arch == 'Xception':
            # Preprocess the input as per Xception
            x = xception_preprocess(x)
        
        
        # Extract the features
        features = self.model.predict(x) 
        
        # L2-normalise the features so distances between them are comparable
        features = features / np.linalg.norm(features)
        
        return features      

The above class extracts features from images. We can use and compare the 3 CNN architectures, namely ResNet 50, VGG 16, and Xception. The performance metric used to compare the 3 models is Mean Average Precision @ K, computed here as the average of Precision @ K over all queries. Precision @ K is defined as the number of relevant items in the top K results divided by K. To learn more about Precision @ K and MAP @ K, refer to this blog post by Ren Jie Tan.
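
To make the metric concrete, here is a small sketch of Precision @ K and its average over queries; the full MAP @ K helper used in this project lives in the POC notebook:

import numpy as np

def precision_at_k(relevant_flags, k):
    # relevant_flags: 0/1 relevance flags for the retrieved items, in rank order
    return sum(relevant_flags[:k]) / k

def mean_precision_at_k(all_queries_flags, k):
    # Average of Precision @ K over all queries, as described above
    return np.mean([precision_at_k(flags, k) for flags in all_queries_flags])

# Example: two queries with relevance flags for their top-5 results
queries = [[1, 1, 0, 1, 0], [0, 1, 1, 0, 0]]
print(mean_precision_at_k(queries, k=5))  # (3/5 + 2/5) / 2 = 0.5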

Next, we test this model. In this blog post I will share the results of the ResNet 50 model (the top performer). To check the detailed comparison of all 3 models, you can refer to the POC notebook here.

The code cell below extracts the features of 10,000 randomly selected images using the ResNet architecture.

# Create the model object and extract the features of 10000 images (ResNet)
resnet_feature_extractor = FeatureExtractor(arch='ResNet')

# index_values and modelImages hold the sampled row indices and image paths
# taken from the listing data (prepared earlier in the notebook)

# Dictionary mapping each image's index to its extracted features
image_features_resnet = {}
for idx, img_path in zip(index_values, modelImages):
    
    # Extract features and store in a dictionary
    img = image.load_img(img_path)
    feature = resnet_feature_extractor.extract_features(img)
    image_features_resnet[idx] = feature

Next, we randomly select an image from the dataset as a query and test the model’s performance.

# Create a query
queryImage_idx = np.random.choice(index_values)
queryImage_path = listing_data.iloc[queryImage_idx]['modelImages_path']
queryImage = image.load_img(queryImage_path)

# Extract Features from queryImage

# ResNet
queryFeatures_Resnet = resnet_feature_extractor.extract_features(queryImage)

Next, we use Euclidean Distance to compute the distance between the query features and the features of the images extracted above. Since the feature vectors are L2-normalised, Euclidean Distance serves as our similarity measure: the smaller the distance, the more similar the images.

# Compute Similarity between queryFeatures and every other image in image_features_resnet

# ResNet
similarity_images_resnet = {}
for idx, feat in image_features_resnet.items():
    
    # Compute the similarity using Euclidean Distance
    similarity_images_resnet[idx] = np.sum((queryFeatures_Resnet - feat)**2) ** 0.5
    
similarity_resnet_sorted = sorted(similarity_images_resnet.items(), key = lambda x : x[1], reverse=False)
top_10_similarity_scores_resnet = [score for _, score in similarity_resnet_sorted][ : 10]
top_10_indexes_resnet = [idx for idx, _ in similarity_resnet_sorted][ : 10]

Next, let’s just plot the top 10 most similar images along with the query image.

# Plot the query image and the top 10 results from the ResNet model

print("========================================== QUERY IMAGE ===============================================")
plt.figure(figsize=(10,10))
plt.imshow(image.img_to_array(queryImage, dtype='int'))
plt.axis('off')
plt.show()
print("======================================================================================================")
print()

# 1. ResNet
top_10_similar_imgs_Resnet = listing_data.iloc[top_10_indexes_resnet]['modelImages_path']
brand_Resnet = listing_data.iloc[top_10_indexes_resnet]['brand.name']

# MAP is the Mean Average Precision @ K helper (defined in the POC notebook)
map_resnet = MAP(top_10_similarity_scores_resnet, threshold=0.55, k=10)

print("========================================== ResNet Results =============================================")
fig = plt.figure(figsize=(15,5))
for i, (img_path, brand) in enumerate(zip(top_10_similar_imgs_Resnet, brand_Resnet)):
    plt.subplot(2, 5, i+1)
    img = image.load_img(img_path)
    img_arr = image.img_to_array(img, dtype='int')
    plt.imshow(img_arr)
    plt.title(brand)
    plt.axis('off')
plt.show()
print("======================================================================================================")

Conclusion

So, this was a Computer Vision use case: a Content-Based Image Retrieval system in the form of a Fashion Image Search Engine. You can access the Kaggle notebook for this project here and the entire project on my GitHub profile.

I hope you found this blog post informative. To keep up with data science, please subscribe to my blog, Keeping Up With Data Science. I would highly appreciate it if you could share this post within your community. As always, feedback is welcome, and I wish you all happy learning 🙂.

References
