Samet Çetin
1 min read · Sep 12, 2018


Hi James,

Great question! It reveals a big mistake, and you are absolutely right to be confused. If we train a network that can map inputs to any one of the 20 classes (training + zero-shot), then what is the point of removing the last layer and checking distances between vectors?

Although I pointed out several times that we are not going to use the zero-shot classes in the training phase, I violated that rule while in a rush. Thanks to you, the defects are now fixed:

  1. While one-hot encoding the labels, only the training labels should be considered. This way, we get a one-hot encoded vector of size 15 for each training class, because we have 15 training classes.
  2. The last layer of the network should output 15 values, matching the size of the one-hot encoded vectors.
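The first adjustment can be sketched as follows; this is a minimal illustration with NumPy, and the class names here are hypothetical placeholders, not the ones used in the article:

```python
import numpy as np

# Hypothetical setup: only the 15 training classes participate in the
# label encoding -- zero-shot classes are deliberately excluded.
train_classes = ["class_%d" % i for i in range(15)]
class_to_index = {name: i for i, name in enumerate(train_classes)}

def one_hot(label):
    """One-hot encode a training label into a length-15 vector.

    A zero-shot class name would raise a KeyError here, which is
    exactly what we want: it must never appear during training.
    """
    vec = np.zeros(len(train_classes))
    vec[class_to_index[label]] = 1.0
    return vec

y = one_hot("class_3")
```

Because the encoding is built only from the training classes, the network's final softmax layer naturally has 15 outputs, and zero-shot labels can never leak into training.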

After these adjustments, the network learns to map a given image feature to a word2vec (in vector space) by training on only the 15 training classes. Now, the only way to classify a test object (belonging to a zero-shot class) is to form a vector and compare its distance to the word2vecs of all 20 possible classes (both training and zero-shot).
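The distance-comparison step above could look roughly like this. The embeddings here are random stand-ins for real word2vec vectors, and `classify` assumes the network's output lives in the same 300-dimensional space:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical word2vec embeddings for all 20 classes
# (15 training + 5 zero-shot), 300-dimensional like word2vec.
all_classes = ["class_%d" % i for i in range(20)]
class_vecs = {name: rng.normal(size=300) for name in all_classes}

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means identical direction."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def classify(predicted_vec):
    """Return the class whose word2vec is nearest to the network's output."""
    return min(all_classes, key=lambda c: cosine_distance(predicted_vec, class_vecs[c]))
```

Note that `all_classes` spans both training and zero-shot classes, which is what lets the model assign a label it never saw during training.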

I had found the relatively low accuracy results surprising, because I chose distinct and easy classes for demonstration purposes. After these adjustments, the new accuracy scores are high, as expected.

Thank you for your attention. Best.
