This project implements a Keras/TensorFlow-based image captioning application that uses a Convolutional Neural Network (CNN) for feature extraction and a Transformer encoder-decoder network for language modeling. In particular, the architecture consists of three key models:
- CNN (EfficientNetB0): The CNN is responsible for extracting features from images. We use EfficientNetB0, pre-trained on ImageNet, for robust and efficient image feature extraction (a minimal extraction sketch follows this list).
- TransformerEncoder: The extracted image features are passed to a Transformer encoder. This encoder processes the image features and generates a contextual representation that captures important information from the image.
- TransformerDecoder: The decoder takes both the encoder's output (the image features) and the textual data (captions) as inputs. It predicts the caption for the image token by token, learning to generate grammatically correct and semantically accurate descriptions.
You can run this code on any Python platform, such as PyCharm, or on any online platform that supports TensorFlow/Keras.
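As a concrete reference, here is a minimal sketch of the EfficientNetB0 feature-extraction step described above. It assumes the standard `tf.keras.applications` ImageNet weights and a placeholder input resolution; neither is taken from the project's actual configuration.

```python
from tensorflow import keras

# Placeholder input resolution; the project's actual preprocessing may differ.
IMAGE_SIZE = (299, 299)

def build_cnn_feature_extractor():
    """EfficientNetB0, pre-trained on ImageNet, used as a frozen feature extractor."""
    base = keras.applications.efficientnet.EfficientNetB0(
        input_shape=(*IMAGE_SIZE, 3),
        include_top=False,   # drop the ImageNet classification head
        weights="imagenet",
    )
    base.trainable = False   # keep the pre-trained weights fixed
    # Flatten the spatial grid into a sequence of region features so the
    # Transformer encoder can attend over image regions.
    features = keras.layers.Reshape((-1, base.output.shape[-1]))(base.output)
    return keras.Model(base.input, features)

cnn_model = build_cnn_feature_extractor()
# cnn_model(images) -> (batch, num_regions, 1280) feature sequences
```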
The following Python version and libraries are required to run the code:
- Python: >= 3.6
- Libraries:
  - numpy
  - seaborn
  - keras
  - tensorflow
  - tqdm
  - nltk
The Flickr30k dataset is a widely used benchmark for image captioning in computer vision and natural language processing (NLP). It consists of 31,783 images, each paired with five English descriptions. The dataset covers a diverse range of scenes and categories and is commonly used to evaluate image captioning models, whose goal is to generate descriptive captions for the images.
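As an illustration of how the image-caption pairs might be loaded, the sketch below assumes the annotations live in a tab-separated `captions.txt` file next to a `flickr30k_images` directory; both names are assumptions about the local copy of the dataset, not a fixed convention.

```python
import os
from collections import defaultdict

IMAGES_DIR = "flickr30k_images"   # assumed local image folder
CAPTIONS_FILE = "captions.txt"    # assumed "image<TAB>caption" annotation file

def load_captions(captions_file=CAPTIONS_FILE, images_dir=IMAGES_DIR):
    """Map each image path to its list of (roughly five) reference captions."""
    caption_map = defaultdict(list)
    with open(captions_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_name, caption = line.split("\t", 1)
            # Start/end tokens tell the decoder where a caption begins and ends.
            image_path = os.path.join(images_dir, image_name)
            caption_map[image_path].append("<start> " + caption.lower() + " <end>")
    return dict(caption_map)
```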
Our proposed architecture follows the Transformer-based approach to image captioning, combining a CNN for feature extraction with a Transformer network for language modeling. The key components are listed below, followed by a minimal Keras sketch.
- Encoder: Processes the image features and generates a contextual representation.
- Decoder: Takes the encoder output along with the textual sequence and generates the caption.
- Positional Encoding: Since Transformers do not have any inherent sense of order, positional encoding is added to the input sequences to preserve the order information.
- Embeddings: Both image features and text tokens are embedded to be input to the Transformer.
- Multi-Headed Attention: This mechanism lets the model attend to different parts of the input sequence at each step, with each head learning a different relationship between tokens (or image regions).
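The sketch below shows how these pieces might fit together in Keras. It is a minimal illustration, not the project's exact implementation: the class names (`TransformerEncoderBlock`, `PositionalEmbedding`, `TransformerDecoderBlock`) and all hyperparameters (`EMBED_DIM`, `FF_DIM`, `NUM_HEADS`, `SEQ_LENGTH`, `VOCAB_SIZE`) are placeholder assumptions.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder hyperparameters -- assumptions for illustration only.
EMBED_DIM, FF_DIM, NUM_HEADS = 512, 512, 2
SEQ_LENGTH, VOCAB_SIZE = 25, 10000


class TransformerEncoderBlock(layers.Layer):
    """Self-attention over the CNN's image-region features."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.project = layers.Dense(EMBED_DIM, activation="relu")
        self.attention = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=EMBED_DIM)
        self.norm = layers.LayerNormalization()

    def call(self, image_features):
        x = self.project(image_features)   # map CNN features to EMBED_DIM
        attn = self.attention(query=x, value=x, key=x)
        return self.norm(x + attn)         # contextual image representation


class PositionalEmbedding(layers.Layer):
    """Token embedding plus a learned position embedding to preserve word order."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.token_emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.pos_emb = layers.Embedding(SEQ_LENGTH, EMBED_DIM)

    def call(self, token_ids):
        positions = tf.range(tf.shape(token_ids)[-1])
        return self.token_emb(token_ids) + self.pos_emb(positions)


class TransformerDecoderBlock(layers.Layer):
    """Masked self-attention over the caption plus cross-attention to the image."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.embedding = PositionalEmbedding()
        self.self_attention = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=EMBED_DIM)
        self.cross_attention = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=EMBED_DIM)
        self.ffn = keras.Sequential(
            [layers.Dense(FF_DIM, activation="relu"), layers.Dense(EMBED_DIM)]
        )
        self.norm1 = layers.LayerNormalization()
        self.norm2 = layers.LayerNormalization()
        self.norm3 = layers.LayerNormalization()
        self.out = layers.Dense(VOCAB_SIZE, activation="softmax")

    def call(self, token_ids, encoder_outputs):
        x = self.embedding(token_ids)
        # Causal mask: each position may only attend to earlier caption tokens
        # (use_causal_mask requires TF >= 2.10).
        self_attn = self.self_attention(query=x, value=x, key=x, use_causal_mask=True)
        x = self.norm1(x + self_attn)
        # Cross-attention: caption tokens attend to the encoded image features.
        cross_attn = self.cross_attention(query=x, value=encoder_outputs, key=encoder_outputs)
        x = self.norm2(x + cross_attn)
        x = self.norm3(x + self.ffn(x))
        return self.out(x)                 # per-position vocabulary distribution
```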
The model's performance is evaluated using the BLEU score (Bilingual Evaluation Understudy), a standard metric for assessing the quality of machine-generated text, particularly in machine translation and caption generation. For this project we used greedy decoding; better results could likely be obtained with a more advanced decoding strategy such as beam search.
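For reference, here is a hedged sketch of greedy decoding and corpus-level BLEU scoring with nltk. The names `decoder`, `encoder_out`, `word_to_index`, and `index_to_word` are hypothetical placeholders for the trained decoder and its vocabulary lookup tables.

```python
import numpy as np
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction


def greedy_decode(decoder, encoder_out, word_to_index, index_to_word, max_len=25):
    """Greedy decoding: pick the single most probable token at every step."""
    tokens = [word_to_index["<start>"]]
    for _ in range(max_len):
        preds = np.asarray(decoder(np.array([tokens]), encoder_out))  # (1, len, vocab)
        next_id = int(np.argmax(preds[0, -1]))                        # greedy choice
        if index_to_word[next_id] == "<end>":
            break
        tokens.append(next_id)
    return [index_to_word[t] for t in tokens[1:]]                     # drop <start>


def bleu(references, hypotheses):
    """Corpus-level BLEU; `references` is a list of lists of reference token lists."""
    return corpus_bleu(references, hypotheses,
                       smoothing_function=SmoothingFunction().method1)
```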
The qualitative outcomes of the model's inference show that the Transformer-based model generates grammatically valid captions. Even for images that were not present in the training dataset, the model produces appropriate and sensible captions based on its learned features. Although the BLEU scores are not perfect, they provide a reasonable reference for evaluating the model's ability to describe unseen images.
*Figure 3: Example captions generated by the Transformer-based model for images that were not present in the training dataset, demonstrating the model's ability to generalize.*

The validation accuracy of the model increases with each epoch, indicating that the model is learning effectively during training.
*Figure 4: Training and validation accuracy over epochs. As the model trains, we see an increase in the accuracy on both the training and validation sets, which suggests that the model is learning and generalizing well.*