Image captioning is the process of generating a textual description for a given image. It has been a very important and fundamental task in the deep learning domain, and it has a huge number of applications: NVIDIA, for instance, is using image captioning technology to create an application to help people who have low or no eyesight. Several approaches have been proposed to solve the task; one of the most notable was put forward by Andrej Karpathy, Director of AI at Tesla, in his Ph.D. thesis. In this article, we will talk about the most used and well-known approaches proposed as solutions to this problem. We will also look at a Python demo example on the Flickr dataset.

Image captioning can be regarded as an end-to-end sequence-to-sequence problem, as it converts an image, which can be regarded as a sequence of pixels, into a sequence of words. For this purpose, we need to process both the language (the statements) and the images. For the language part we use recurrent neural networks (RNNs), and for the image part we use convolutional neural networks (CNNs) to obtain the respective feature vectors.

Say we as humans are seeing a scene as given below.

Fig 1: Photo by Isabela Kronemberger on Unsplash

If we are told to describe it, maybe we will describe it as: "A puppy on a blue towel" or "A brown dog playing with a green ball". While forming the description, we are seeing the image, but at the same time we are looking to create a meaningful sequence of words. So, how are we doing this? The first part is handled by CNNs and the second by RNNs.

If we can obtain a suitable dataset with images and their corresponding human descriptions, we can train networks to automatically caption images. Flickr8k, Flickr30k, and MS-COCO are some of the most used datasets for this purpose.

Now, there is one issue we might have overlooked here: we have seen that we can describe the above image in several ways. So, how do we evaluate our model? For sequence-to-sequence problems like summarization, language translation, or captioning, we use a metric called the BLEU score. BLEU stands for Bilingual Evaluation Understudy. It is a metric for evaluating a generated sentence against a reference sentence: a perfect match scores 1.0 and a perfect mismatch scores 0.0. You can study more about the BLEU score from this awesome blog post.

We have seen that we need to create a multimodal neural network that uses feature vectors obtained from both an RNN and a CNN, so consequently we will have two inputs. One is the image we need to describe, which is fed to the CNN, and the second is the sequence of words in the text produced so far, which is the input to the RNN. We are dealing with two types of information, a language one and an image one. So the question arises: how, or in what order, should we introduce these pieces of information into our model? More elaborately, we need a language RNN model as we want to generate a word sequence, so when should we introduce the image feature vectors into the language model? A paper by Marc Tanti and Albert Gatt of the Institute of Linguistics and Language Technology, University of Malta, covers a comparative study of all the approaches. There are two basic types of architectures:

Init-Inject: Normally, in the case of an RNN, we use an initial state vector which is set to a zero vector of the given dimension.
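In the init-inject architecture, the image feature vector takes the place of that zero vector: the CNN features are projected into the RNN's state space and used as the initial hidden state, after which the RNN runs on the caption words alone. Below is a minimal NumPy sketch of this idea; all dimensions, weight matrices, and the `caption_logits` helper are hypothetical illustrations, not code from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
img_dim, embed_dim, hidden_dim, vocab = 2048, 128, 256, 5000

W_init = rng.normal(scale=0.01, size=(hidden_dim, img_dim))   # image -> initial state
W_h = rng.normal(scale=0.01, size=(hidden_dim, hidden_dim))   # recurrent weights
W_x = rng.normal(scale=0.01, size=(hidden_dim, embed_dim))    # input weights
W_out = rng.normal(scale=0.01, size=(vocab, hidden_dim))      # state -> word logits

def caption_logits(image_feat, word_embeds):
    """Init-inject: the CNN feature vector conditions the RNN through
    its initial hidden state instead of the usual zero vector."""
    h = np.tanh(W_init @ image_feat)       # image injected once, at t = 0
    logits = []
    for x in word_embeds:                  # one embedded caption word per step
        h = np.tanh(W_h @ h + W_x @ x)
        logits.append(W_out @ h)
    return np.stack(logits)

image_feat = rng.normal(size=img_dim)      # stand-in for a pretrained CNN's output
words = rng.normal(size=(7, embed_dim))    # a 7-word caption prefix, already embedded
print(caption_logits(image_feat, words).shape)  # (7, 5000)
```

Note the design choice this sketch makes concrete: the image is seen only once, at the start, and all later steps receive purely linguistic input.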
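As a concrete illustration of the BLEU score discussed above, here is a minimal pure-Python sentence-level BLEU: clipped n-gram precisions combined by a geometric mean, times a brevity penalty. Real evaluations typically use a library implementation such as NLTK's `sentence_bleu`, so treat this as a sketch of the idea rather than the reference implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, candidate, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty. 1.0 is a perfect match."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(clipped / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0  # any empty precision zeroes the geometric mean
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * geo_mean

ref = "a brown dog playing with a green ball".split()
print(bleu(ref, ref))                                # 1.0 (perfect match)
print(bleu(ref, "a puppy on a blue towel".split()))  # 0.0 (no 4-grams shared)
```

This also shows why a single reference is limiting: "A puppy on a blue towel" is a perfectly good caption for the same image yet scores 0.0 against the other reference, which is why BLEU is usually computed against multiple reference captions.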