Building an Automatic Image Captioner with Deep Learning

Dev Parikh
Mar 17, 2022

Images with their corresponding captions.

When you look at the image above, what is the first thought that comes to your mind? Well, I am sure you see an image made up of 3 different images with their corresponding captions. Even though most of us don't think about it consciously, it's absolutely amazing how our brains can look at an image in a single glance and summarize the main points in just a couple of sentences! Curiously, I wondered if I could get a computer to caption images the way we do. This problem seemed interesting to me because it combines Natural Language Processing and Computer Vision, 2 areas that I am super interested in. In this article, I am going to talk about what Automatic Image Captioning is, how it works, the code for this project, and the lessons that I learned along the way!

What are the implications of Automatic Image Captioning?

Before we get into how Automatic Image Captioning works, let's take a step back and look at what its implications are and how it is useful. Automatic Image Captioning can simplify the process of extracting important information from images or videos, since that information gets summarized into text, which is much easier to parse.

That might seem like a very minor improvement in efficiency compared to getting a human to look at the image and summarize it, and you'd be right. The problem is that as the amount of data grows, manually captioning images becomes extremely laborious, and a system that can caption images automatically becomes far more effective.

According to The Atlantic, 657 billion new images are uploaded to the internet every year. At that rate of growth in image and video data, it would take multiple lifetimes for humans to parse through it all and derive meaningful information, while a Deep Learning-based system could get through the same amount of data in a tiny fraction of that time.

How does Automatic Image Captioning actually work?

By now, you're probably thinking, "Dev, enough talk about why Automatic Image Captioning is useful. How does a Machine Learning model actually perform Image Captioning?" Well, the answer is a Sequence-to-Sequence Encoder-Decoder architecture, and this architecture has been the backbone of the state-of-the-art models used today.

I am sure that most people reading this article don't know what a Sequence-to-Sequence Encoder-Decoder architecture is. So, let's break it down and first look at what an Encoder-Decoder architecture is!

Encoder-Decoder Architecture:

A simple diagram of an Encoder-Decoder Architecture.

The diagram above explains at a high level what an Encoder-Decoder network is. In essence, the encoder part of the architecture is responsible for taking in the input (images, text, etc.) and passing it through a series of layers that extract the important information. The encoder is also responsible for representing this information in a format that the decoder can understand.

The decoder is responsible for taking the output of the encoder and decoding the information it contains. Once the information is decoded, it is used to produce the desired output of the model, whatever that may be for the task at hand.

I know, I know, you are probably still confused about what exactly an Encoder-Decoder model is, as I gave a very general explanation of what it does. To understand this concept better, let's assume that we want to create a model that can translate an input sentence (in English) into an output sentence (in French).

As you can see in the image above, the model takes in the English sentence "I dream of old McDonald every night" and passes it through the encoder (in this case an LSTM/RNN). The encoder takes the important words in this sentence, like old, McDonald, night, and dream, and gives them a larger weight (importance) than words like I, of, and every. It then encodes this information into an intermediate representation, also known as the hidden states, which is passed as input to the decoder.

The decoder then decodes the information in this intermediate representation. It picks up the important words in the sentence and their positions, which together capture the meaning. This information helps the decoder produce an accurate French translation that has a very similar or identical meaning to the English version.
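
To make this concrete, here is a minimal Keras sketch of the translation example. This isn't code from my project, and the vocabulary sizes and layer widths are just illustrative, but it shows how the encoder's hidden states become the decoder's starting point:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Illustrative vocabulary sizes and layer widths (not from the project itself)
num_english_words, num_french_words = 10000, 12000
embedding_dim, hidden_units = 256, 512

# Encoder: reads the English sentence and summarizes it into its final hidden states
encoder_inputs = layers.Input(shape=(None,))
encoder_embed = layers.Embedding(num_english_words, embedding_dim)(encoder_inputs)
_, state_h, state_c = layers.LSTM(hidden_units, return_state=True)(encoder_embed)

# Decoder: starts from the encoder's states and generates the French sentence word by word
decoder_inputs = layers.Input(shape=(None,))
decoder_embed = layers.Embedding(num_french_words, embedding_dim)(decoder_inputs)
decoder_outputs = layers.LSTM(hidden_units, return_sequences=True)(
    decoder_embed, initial_state=[state_h, state_c])
predictions = layers.Dense(num_french_words, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], predictions)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

During training, the decoder also receives the French sentence shifted by one position (teacher forcing), which is why it has its own input here.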

Sequence-to-Sequence Encoder-Decoder Architecture:

In the last section, we got a pretty good understanding of what Encoder-Decoder models are and how they work. Now, let's look at how a Sequence-to-Sequence Encoder-Decoder model works and why I used it for Image Captioning!

But… before we get into how a Sequence-to-Sequence Encoder-Decoder model works, let's answer the question of why there are different types of Encoder-Decoder models and why we need a Sequence-to-Sequence one in the first place.

Encoder-Decoder models are used throughout Machine Learning, mainly in Computer Vision and Natural Language Processing. For example, model architectures that perform Semantic Segmentation use an Encoder-Decoder approach; check out my article on using UNET for Semantic Segmentation if you want to learn more about that! In the same way, we need an Encoder-Decoder model that is specifically built to perform Image Captioning.

Image Captioning is considered a Sequence-to-Sequence problem, meaning that the input is a sequence of words, characters, numbers, etc., and the output is another such sequence. In the context of Image Captioning, the input is an image (i.e., a sequence of numbers) and the output is a sentence (i.e., a sequence of words). This is why we need a Sequence-to-Sequence Encoder-Decoder model!

An example of what a Sequence-to-Sequence Encoder-Decoder model would look like for Automatic Image Captioning.

The diagram above shows, at a high level, what a model architecture used to perform Image Captioning looks like. In the first part of the model, images are passed as input to a CNN, which in this case is pre-trained on ImageNet. The CNN is the encoder in this model: it extracts the most important features from the image and outputs a feature vector/map containing this information.

This feature map is used by the LSTM (the decoder), as shown below, to understand the objects in the image, the interactions happening between them, and other important information that helps the LSTM create a textual description of the image (the output).
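
As a rough schematic of that handoff (not the exact code I walk through later in this article, and with illustrative layer sizes), the CNN produces one feature vector per image, and that vector conditions the LSTM that generates the caption:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size, embedding_dim, units = 8000, 256, 512   # illustrative sizes

# Encoder: a CNN pre-trained on ImageNet (VGG-16 here), used purely as a feature extractor
cnn = tf.keras.applications.VGG16(include_top=False, pooling="avg", weights="imagenet")
cnn.trainable = False   # keep the ImageNet weights frozen

# Decoder: the image's feature vector seeds the LSTM, which emits the caption word by word
image_features = layers.Input(shape=(512,))           # size of VGG-16's pooled features
init_state = layers.Dense(units, activation="relu")(image_features)

caption_so_far = layers.Input(shape=(None,))
x = layers.Embedding(vocab_size, embedding_dim)(caption_so_far)
x = layers.LSTM(units, return_sequences=True)(x, initial_state=[init_state, init_state])
next_word_probs = layers.Dense(vocab_size, activation="softmax")(x)

decoder = Model([image_features, caption_so_far], next_word_probs)
```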

What are some variations/additions to Sequence-to-Sequence Encoder-Decoders that increase accuracy?

Sequence-to-Sequence Encoder-Decoders were the state-of-the-art for Image Captioning until the Attention Mechanism was found to significantly increase the accuracy of predictions. Over the years, researchers have also found that using an Object Detection-based encoder instead of a conventional image classifier improves the accuracy of predictions.

There are hundreds of papers on how the Attention Mechanism and Object Detection can be applied to Image Captioning, some creating new networks that incorporate these techniques and lead to better results. Since I didn't actually implement any of these techniques, I won't go super deep into them. But I thought it would be worth mentioning some papers, plus a TLDR of what they are about!

Here are some papers that I read while researching more into this area:

https://proceedings.neurips.cc/paper/2019/file/680390c55bbd9ce416d1d69a9ab4760d-Paper.pdf — An interesting paper on a new Image Captioning model that uses Object Detection as an encoder; I also ended up getting to talk to one of the authors of this paper!

https://arxiv.org/ftp/arxiv/papers/1706/1706.02430.pdf — Image Captioning with Object Detection and Localization

https://openaccess.thecvf.com/content_ICCV_2017/papers/Pedersoli_Areas_of_Attention_ICCV_2017_paper.pdf — How the Attention Mechanism can be applied to Image Captioning

From some of the papers and articles I read on the Attention Mechanism and Object Detection, I found that both do a much better job of localizing different objects and their interactions than previous approaches. The image below describes a network proposed by the first paper linked above, which combines a Transformer (a state-of-the-art model used for Natural Language Processing tasks) with an Object Detector.

This is a proposed model architecture called Object Relation Transformer to perform Image Captioning.

Object Detectors create bounding boxes (boxes around the objects they detect in an image), which means they can not only classify what an object is but also figure out where it is. When captioning images that contain many objects and complicated interactions between them, localizing the objects makes the captions more descriptive.

The Attention Mechanism also replicates the human behavior of focusing on one area of an image at a time to make more accurate predictions about what an object is and what its characteristics are. In the context of Image Captioning, focusing on certain parts of the image, and classifying and localizing the objects in those regions, also leads to significant increases in the accuracy of captions.

Understanding the code for Automatic Image Captioning 💻

So far in this article, we have talked about what Automatic Image Captioning is and how it works at a theoretical level, but we haven't actually looked at any of the code that I wrote for it.

I truly believe that building is one of the best ways to learn, so it would be valuable for me to explain what exactly I built, and its purpose. It’s a way for me to test my knowledge and a good way for you to better understand how this project works.

So, without any further ado, let’s get into the code!

Step 1: Importing Dependencies:

The first step of the project is quite simple: we import all of the libraries that we need, like TensorFlow, TensorFlow Addons, TensorFlow Hub, Keras, OpenCV, Pandas, NumPy, and more. After importing the libraries, we import the images and initialize image_size, batch_size, and epochs. The last 2 variables are num_distinct_words, which is the size of the vocabulary, and embedding_dim, which is a parameter that will be passed to the model.
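
Roughly, that cell looks like this (the specific values below are illustrative rather than the exact ones I used):

```python
import os
import cv2
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_addons as tfa
import tensorflow_hub as hub
from tensorflow import keras
from tensorflow.keras import layers

# Hyperparameters and dataset settings (illustrative values)
image_size = (224, 224)       # input size expected by the pre-trained CNN
batch_size = 64
epochs = 20
num_distinct_words = 5000     # vocabulary size used for tokenization
embedding_dim = 256           # size of the learned word vectors
```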

Step 2: Loading in the images:

The second step of the project uses the os library to iterate over every image in the directory containing the image dataset. After reading each image, we perform data preprocessing on it: resizing the image, adding a blur, normalizing the pixel values, etc. Lastly, after all of the preprocessing, we save the images to the image_dataset list.
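
A sketch of that loop, assuming the images live in a folder such as flickr8k/Images (the path and the exact preprocessing parameters here are illustrative):

```python
image_dir = "flickr8k/Images"   # illustrative path to the image folder
image_dataset = []

for file_name in sorted(os.listdir(image_dir)):
    image = cv2.imread(os.path.join(image_dir, file_name))
    if image is None:
        continue                                 # skip anything that isn't a readable image
    image = cv2.resize(image, image_size)        # resize to the CNN's expected input size
    image = cv2.GaussianBlur(image, (3, 3), 0)   # light blur, as described above
    image = image.astype("float32") / 255.0      # normalize pixel values to [0, 1]
    image_dataset.append(image)
```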

Step 3: Loading in the Captions:

The third step in this project is loading in the captions, which is similar to what we did in the previous step. The for loop above iterates over all of the captions in captions.txt and stores them in caption_set. Then I apply text preprocessing techniques like removing stop words and stray characters, performing lemmatization, expanding contractions, and tokenizing the text.
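
A rough sketch of this step, assuming each line of captions.txt looks like "image_name,caption" and using NLTK plus the Keras Tokenizer (the helper libraries and exact cleaning steps here are illustrative):

```python
import re
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import WordNetLemmatizer    # requires nltk.download("wordnet")
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

caption_set = []
with open("captions.txt") as f:
    for line in f:
        # assumes each line looks like "image_name,caption text"
        caption = line.strip().split(",", 1)[-1].lower()
        caption = re.sub(r"[^a-z ]", "", caption)            # drop punctuation and stray characters
        words = [lemmatizer.lemmatize(w) for w in caption.split()
                 if w not in stop_words]                      # remove stop words, lemmatize
        caption_set.append(" ".join(words))

# Tokenize and pad so every caption becomes a fixed-length sequence of integer ids
tokenizer = Tokenizer(num_words=num_distinct_words, oov_token="<unk>")
tokenizer.fit_on_texts(caption_set)
caption_sequences = pad_sequences(tokenizer.texts_to_sequences(caption_set), padding="post")
```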

Step 4: Train, Test, Validation Splitting:

The fourth step in this project is the train, test, and validation split. Since this dataset doesn't come with training, testing, and validation sets, I had to split the dataset up myself. First I zipped the caption dataset and the image dataset together so that I could split the whole dataset at once while keeping each image paired with its caption. I then split the original dataset into an 80% training set and a 20% validation set, and from the validation set I took out 10 images for a testing set, which would be used to test the model after training.
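
A minimal sketch of this split, using scikit-learn's train_test_split as an illustrative stand-in for however you prefer to shuffle and slice:

```python
from sklearn.model_selection import train_test_split

# Pair each image with its caption so the split keeps them aligned
paired_dataset = list(zip(image_dataset, caption_sequences))

# 80% training, 20% validation
train_set, validation_set = train_test_split(paired_dataset, test_size=0.2, random_state=42)

# Hold out a handful of examples from the validation set for a final manual test
test_set = validation_set[:10]
validation_set = validation_set[10:]
```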

Step 5: Splitting Training, Testing, and Validation datasets into Caption datasets and Image datasets:

The fifth step in this project is taking each of these datasets and separating it into its caption set and image set. The function image_caption_separation performs this: it takes as input a zipped set of training, testing, or validation images and captions, plus 2 empty lists that the images and captions go into. Lastly, I converted the lists containing the images and captions to NumPy arrays so they are in the right format to be used by the model.
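
A sketch of that helper and the conversions (the names of the resulting sets are illustrative):

```python
def image_caption_separation(zipped_set, images_out, captions_out):
    """Split a list of (image, caption) pairs into separate image and caption lists."""
    for image, caption in zipped_set:
        images_out.append(image)
        captions_out.append(caption)
    return images_out, captions_out

train_images, train_captions = image_caption_separation(train_set, [], [])
val_images, val_captions = image_caption_separation(validation_set, [], [])
test_images, test_captions = image_caption_separation(test_set, [], [])

# Convert to NumPy arrays so they can be fed straight into the model
train_images, train_captions = np.array(train_images), np.array(train_captions)
val_images, val_captions = np.array(val_images), np.array(val_captions)
test_images, test_captions = np.array(test_images), np.array(test_captions)
```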

Step 6: Getting predictions from the Encoder for the training images

The sixth step of this project is passing the training images through the encoder, which is the pre-trained VGG-16 CNN (Convolutional Neural Network). Its job is to extract the most important information from each image and create a feature map containing it. The model is imported from tensorflow_hub and then wrapped in a model called image_feature_model.

Once the model is ready, the next step is to batch the training dataset, as it has too many images to run through the predict function in one go. After passing all of the training batches through the predict function, I combined the outputs using np.concatenate, which concatenates all of the output batches into a single NumPy array.
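
Putting those two paragraphs together, a sketch of this step looks something like the following. Note that my notebook loads VGG-16 from tensorflow_hub; the keras.applications version is used here as a stand-in so the sketch doesn't depend on a specific hub URL:

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Pre-trained VGG-16 used as a frozen feature extractor
vgg16 = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                    input_shape=(224, 224, 3))
image_feature_model = keras.Model(vgg16.input, vgg16.output)
image_feature_model.trainable = False

# Predict in batches so the whole training set never has to go through the model at once
feature_batches = []
for start in range(0, len(train_images), batch_size):
    batch = train_images[start:start + batch_size]
    feature_batches.append(image_feature_model.predict(batch, verbose=0))

# Stitch the per-batch outputs back into one NumPy array of feature maps
feature_maps_training = np.concatenate(feature_batches, axis=0)
```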

Step 7: Getting predictions from the Encoder for the testing and validation images

The seventh step of the project is quite similar to the last one. In this step, I pass the testing and validation sets through the model using the predict function. Since the validation and testing sets are not as large as the training set, I can pass each of them to the predict function in its entirety. After storing the output from the model in feature_maps_validation or feature_maps_testing, a for loop iterates over every feature map in each set, converts it to an array, and concatenates it with the other feature maps in the set. Lastly, we reshape the training and validation arrays into a format that the model accepts.
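
A condensed sketch of this step, continuing from the feature extractor above (my actual loop converts and concatenates the feature maps one by one, which the reshape calls below collapse into single operations):

```python
# Validation and testing sets are small enough to go through predict in one call each
feature_maps_validation = image_feature_model.predict(val_images, verbose=0)
feature_maps_testing = image_feature_model.predict(test_images, verbose=0)

# Flatten each set of feature maps so the decoder gets one flat feature vector per image
feature_maps_training = feature_maps_training.reshape(len(feature_maps_training), -1)
feature_maps_validation = feature_maps_validation.reshape(len(feature_maps_validation), -1)
feature_maps_testing = feature_maps_testing.reshape(len(feature_maps_testing), -1)
```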

The final step: Training the Decoder (LSTM)

The final step of this project is creating and training the decoder of the Sequence-to-Sequence model, which is an LSTM (Long Short-Term Memory) network, a type of RNN (Recurrent Neural Network).

I won't go super deep into what an LSTM is and how it works, as it's out of scope for understanding the code. In essence, an LSTM takes a sequence as input (text, images, etc.) and uses it to output another sequence. If you want to learn more about LSTMs and how they work, take a look at the resources that I post at the end of the article.

From the code snippet above, you probably noticed that there is a layer above the stacked LSTM layers, known as the Embedding layer. The Embedding layer is responsible for converting the input sequences of tokens (words that have been tokenized) into dense vectors, one per token, that represent each word. It's an alternative to sparse representations like bag-of-words or one-hot encodings, which contain lots of zeros and are therefore more memory intensive; unlike pre-trained embeddings such as word2vec, the Embedding layer learns its vectors during training.

After the LSTM layers, there is a Dense layer (the output layer) with one neuron per word in the vocabulary. The activation for this layer is Softmax, which creates a probability distribution over all possible words, and the word with the highest probability is used as the next word in the output sequence.

The LSTM is then compiled with the Adam optimizer, a Categorical Cross-Entropy loss, and accuracy as the metric. Lastly, we train the model using the fit function with the training images, training captions, validation images, and validation captions.
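
Putting the whole decoder together, here is a hedged reconstruction rather than my exact notebook code, continuing from the earlier sketches: the layer widths are illustrative, the image features are wired in as the initial state of the first LSTM (one common way of conditioning the decoder on the encoder), and the sparse form of categorical cross-entropy is used because the captions are integer token ids:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Teacher forcing: the decoder sees the caption up to position t and predicts position t + 1
decoder_in, decoder_target = train_captions[:, :-1], train_captions[:, 1:]
val_in, val_target = val_captions[:, :-1], val_captions[:, 1:]

# Image branch: project the flattened CNN features into the LSTM's hidden size
image_input = layers.Input(shape=(feature_maps_training.shape[1],))
image_state = layers.Dense(256, activation="relu")(image_input)

# Text branch: the Embedding layer turns token ids into dense, learned word vectors
caption_input = layers.Input(shape=(None,))
x = layers.Embedding(num_distinct_words, embedding_dim, mask_zero=True)(caption_input)

# Stacked LSTMs, with the image features used as the initial state of the first one
x = layers.LSTM(256, return_sequences=True)(x, initial_state=[image_state, image_state])
x = layers.LSTM(256, return_sequences=True)(x)

# Softmax over the vocabulary at every position; the most probable word is the next word
outputs = layers.Dense(num_distinct_words, activation="softmax")(x)

decoder = keras.Model([image_input, caption_input], outputs)

# Sparse categorical cross-entropy is the integer-label form of categorical cross-entropy
decoder.compile(optimizer="adam",
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])

decoder.fit([feature_maps_training, decoder_in], decoder_target,
            validation_data=([feature_maps_validation, val_in], val_target),
            batch_size=batch_size, epochs=epochs)
```

At inference time, you would feed the image's features plus the words generated so far, and repeatedly pick the softmax's most probable next word until the caption is complete.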

Failures & Lessons learned from the project 🧠:

So far I have only talked about what Automatic Image Captioning is, why it's important, how it works, and the code behind it, but I haven't talked about the lessons that I learned from building this project and my failures along the way.

Lesson #1: Having a First-Principles Mindset is 🔑

The first thing that I failed at in this project was understanding what type of models/architectures would perform Automatic Image Captioning. I had read articles on what a basic model architecture for image captioning would look like, but I still didn't truly understand the purpose behind it.

When I was reading papers on object detectors for Automatic Image Captioning, some used Faster R-CNN, while others used VGG-16 or InceptionV3, which are image classifiers. I ended up spending over a week figuring out which object detector or image classifier to use. The problem was that I didn't completely understand what an object detection system does in the context of image captioning versus what an image classifier does.

Eventually, I realized that the purpose of either an object detector or an image classifier here is to extract the important information from images and pass that to the decoder (LSTM). An image classifier identifies 1 object in an image, whereas an object detector detects multiple objects and localizes them as well, which is better for captioning more complex images.

Having a First-Principles mindset is very important when coding, and it helps you understand the project at hand better, because it forces you to think about the purpose of each component and how it works. If I had had a good understanding of how object detectors and image classifiers work within Image Captioning, I could have saved a week of time and spent less time figuring out how these systems integrate into the Sequence-to-Sequence model architecture.

My approach whenever I build projects is to look at what other implementations have used and try to replicate that. After this project, though, I realized that really understanding the different parts of a project is key if you want to implement it well.

Lesson #2: Research Papers are 🔑 to getting a deep understanding of a project

Throughout the process of building this project, I only read 2 research papers. I didn't completely read through them, and mostly just skimmed them to get a better understanding of the different model architectures used for Automatic Image Captioning.

So, why did I not properly read through lots of research papers?

I gave up reading those research papers because of how complicated they were, or at least how complicated I thought they were. I would read research papers to get a deeper understanding of certain concepts, but the moment a complicated explanation or some math came up, I would just quit. Instead, I resorted to YouTube videos and Medium articles, which were much easier to understand. The problem was that those explanations were super high-level and could vary depending on which sources you went to.

Research papers generally solve the problems that come with relying on secondary sources like YouTube or Medium to understand concepts. Admittedly, they are much harder to understand, but they provide a very in-depth explanation, so you build an extremely strong understanding of the concept and get all of the details you need from one source.

Lesson #3: Knowing how to use Stack Overflow correctly is a superpower 🚀

Ever since I started coding, I have been using Stack Overflow to help me solve errors in my code. But after building this project, I realized how to properly use Stack Overflow to solve problems in code effectively and efficiently.

You are probably wondering, "How do you properly use Stack Overflow, and what were you doing wrong in previous projects?"

Whenever I encountered an error that I wasn't able to solve, I would just search Stack Overflow and look at the different solutions for the problem. I would try all of the solutions and see which ones worked and which didn't. But as I started to work on more complex machine learning projects like YOLO, Semantic Segmentation, Text Summarization, and now Automatic Image Captioning, this trial-and-error strategy stopped working and became very inefficient.

This strategy has been the cause of hours of frustration while building some of the projects above. Instead of relying on trial and error, the first step to solving a complicated error should be to try to understand it as well as possible.

An example of a Python Script having multiple errors.

If we look at the example above, the error seems to be that one of the lines of code has a variable whose value is negative and isn’t supported by the function it’s being used in.

The second step is to use Stack Overflow to understand the error better and find its root cause. The last step is to fix that root cause, using Stack Overflow to look at the solutions other people have come up with. With this technique, 90% of the problems that I encounter get solved.

Lesson #4: Being Intentional about any project is 🔑

For most ML and coding projects that I have worked on, my intention has always been primarily to learn about new concepts in Machine Learning or Computer Science. The problem with that intention is that it's very vague; you can learn in different ways, learn different things, and learn them to varying degrees.

Since I didn't have a clear and specific intention for what I wanted to get out of building this project, I didn't know whether I wanted to spend more time understanding concepts like the Bahdanau and Luong Attention Mechanisms or just finish the project and move on. I should have spent more time understanding these concepts to go deeper into NLP, as that is an area I want to focus my work on. My intention should have been to get a deeper understanding of NLP concepts through this project, which would have made it much more valuable.

Conclusion:

In conclusion, I learned some very valuable lessons from this project and also got to learn about new concepts, model architectures, and ML models in Computer Vision and Natural Language Processing. If you want to learn more about the process of building this project, from doing the research to reading articles on certain concepts to coding out the project, you can check out a video that I made about it.

Here are some resources that you can use to learn more about some concepts talked about in this article:

https://towardsdatascience.com/lstm-networks-a-detailed-explanation-8fae6aefc7f9 — A Detailed Explanation of LSTM Networks

https://www.analyticsvidhya.com/blog/2021/01/understanding-architecture-of-lstm — Understanding the architecture of LSTMs

https://towardsdatascience.com/understanding-fast-r-cnn-and-faster-r-cnn-for-object-detection-adbb55653d97 — Understanding Fast R-CNNs and Faster-RCNNs

https://www.kaggle.com/blurredmachine/vggnet-16-architecture-a-complete-guide — Explanation of the VGG-16 Architecture

https://medium.com/@AnasBrital98/inception-v3-cnn-architecture-explained-691cfb7bba08 — Explanation of the InceptionV3 Architecture

https://www.analyticsvidhya.com/blog/2020/08/a-simple-introduction-to-sequence-to-sequence-models/ — Introduction to Sequence-to-Sequence Models

That's it for this article! If you liked it and got to learn something new, it would be great if you could give it a few claps. Otherwise, I hope to see you in my future articles about the new ML projects I make!
