Hi James,
Great question, and one that reveals a big mistake. You are absolutely right to be confused: if we train a network that maps inputs to one of 20 classes (training + zero-shot), then what is the point of removing the last layer and checking distances between vectors?
Although I pointed out several times that we are not going to use the zero-shot classes in the training phase, I violated that rule while in a rush. Thanks to you, the defects are now fixed:
- When one-hot encoding the labels, only the training labels should be considered. This way, we get a one-hot encoded vector for each training class with a size of 15, because we have 15 training classes.
- The last layer of the network should output 15 values, matching the size of the one-hot encoded vectors.
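The first adjustment can be sketched in a few lines. This is a minimal illustration, assuming hypothetical class names; the point is that the encoding dictionary is built from the training classes only, so zero-shot classes never receive a one-hot vector:

```python
import numpy as np

# Hypothetical setup: 15 training classes (names are placeholders).
train_classes = [f"class_{i}" for i in range(15)]

# Build the index from the TRAINING classes only -- the zero-shot
# classes are deliberately excluded from the encoding.
class_to_index = {c: i for i, c in enumerate(train_classes)}

def one_hot(label):
    """Return a 15-dim one-hot vector for a training label."""
    vec = np.zeros(len(train_classes))
    vec[class_to_index[label]] = 1.0
    return vec

print(one_hot("class_3"))  # 15 values, 1.0 at index 3
```

Attempting to encode a zero-shot label here would raise a `KeyError`, which is exactly the guarantee we want during training.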
After these adjustments, the network learns to map a given image feature to a word2vec (in vector space) by training on only the 15 training classes. Now, the only way to classify a test object (which belongs to the zero-shot classes) is to form a vector and compare its distance to the word2vecs of all 20 possible classes (both training and zero-shot).
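The classification step can be sketched as a nearest-neighbor search in the word2vec space. This is a hedged sketch under assumptions: the class names and random 300-dim vectors are placeholders (in practice the vectors would come from a pretrained word2vec model), and cosine similarity is used as the distance measure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical word2vec table for all 20 classes (15 training + 5
# zero-shot); real vectors would come from a pretrained word2vec model.
all_classes = [f"class_{i}" for i in range(20)]
word_vecs = {c: rng.normal(size=300) for c in all_classes}

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def classify(pred_vec):
    """Return the class whose word2vec is closest to the predicted vector."""
    return max(all_classes, key=lambda c: cosine_sim(pred_vec, word_vecs[c]))

# A predicted vector can land near a zero-shot class even though
# that class was never seen during training.
noisy = word_vecs["class_17"] + 0.01 * rng.normal(size=300)
print(classify(noisy))
```

Note that the search covers all 20 classes, not just the 15 training ones; that is what makes zero-shot classification possible at test time.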
I had found the relatively low accuracy results surprising, because I chose distinct, easy classes for demonstration purposes. After these adjustments, the new accuracy scores are pretty high, as expected.
Thank you for your attention. Best.