I decided to train the ILSVRC2012 classification task from scratch, using MobileNet V1 as the network architecture. However, I got a strange loss curve when the learning rate was reduced. The training configuration is simple: an SGD optimizer with an initial learning rate of 0.1 that decays every 0.5 epoch with a decay rate of 0.1.
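To make the schedule concrete, here is a minimal sketch of the decay described above. It assumes a staircase (step-wise) exponential decay and expresses progress in fractional epochs; the function name staircase_lr and its parameter names are illustrative, not taken from my training code.

```python
import math


def staircase_lr(epoch, initial_lr=0.1, decay_rate=0.1, decay_every=0.5):
    """Learning rate after `epoch` epochs (fractional values allowed).

    Assumes staircase decay: the rate is multiplied by `decay_rate`
    once every `decay_every` epochs and is constant in between.
    """
    num_decays = math.floor(epoch / decay_every)
    return initial_lr * decay_rate ** num_decays


# Print the learning rate at a few points in training.
for epoch in (0.0, 0.25, 0.5, 1.0, 1.5):
    print(f"epoch {epoch:4.2f}: lr = {staircase_lr(epoch):.0e}")
```

Under this assumption the learning rate shrinks by three orders of magnitude within the first 1.5 epochs, so each decay step coincides with a different part of the dataset.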
It is almost a maxim that reducing the learning rate leads to a smaller loss value, or at least a non-increasing one. My experiment contradicts this belief: after the learning rate is reduced, the loss value gets larger, as shown in the following loss curve figure.
At first glance this bug seemed to come from the training process, which is why I spent several days digging into my training code. In the end, however, it turned out to be a problem in the data input pipeline. Following the data input pipeline recommended by TensorFlow, I use tf.train.string_input_producer() to feed the training data.
Its job is to output strings to a queue for an input pipeline, with optional parameters that control its behaviour.
string_tensor can be either a single tfrecords file name or a list of tfrecords file names. Note that shuffle only randomly shuffles the file names, not the instances stored within each tfrecord file. In my case, I split the 1.28 million images into 100 shards, each storing about 12,000 images. The order of the images within a shard cannot be changed once the shard has been created, so it is quite likely that many consecutive training steps draw all their data from a single shard. If the instances within a shard are not shuffled well enough, the training process may suffer from a large label bias. For example, the labels in the current shard might all be 0, while those in the next shard are all 1.
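The effect can be reproduced without TensorFlow. The sketch below is a pure-Python simulation, not the real queue implementation: make_shards and read_epoch are hypothetical helpers, and the toy dataset (10 classes, 100 examples each) is made up for illustration. It shuffles only the shard order, as shuffle does with file names, and then counts how many distinct labels appear in the first 100 examples read.

```python
import random
from collections import Counter


def make_shards(labels, num_shards):
    """Split a label sequence into contiguous shards, the way a
    sequential tfrecord writer would split an input list."""
    shard_size = len(labels) // num_shards
    return [labels[i * shard_size:(i + 1) * shard_size]
            for i in range(num_shards)]


def read_epoch(shards, rng):
    """Mimic shuffling file names only: the shard order is random,
    but the order inside each shard stays fixed."""
    order = list(range(len(shards)))
    rng.shuffle(order)
    return [label for idx in order for label in shards[idx]]


rng = random.Random(0)
# Toy data: labels listed class by class, like an unshuffled train.txt.
sorted_labels = [c for c in range(10) for _ in range(100)]
# The same labels, shuffled before the shards are written.
shuffled_labels = sorted_labels[:]
rng.shuffle(shuffled_labels)

for name, labels in [("sorted", sorted_labels), ("shuffled", shuffled_labels)]:
    stream = read_epoch(make_shards(labels, num_shards=10), rng)
    distinct = len(Counter(stream[:100]))
    print(f"{name}: {distinct} distinct labels in the first 100 examples")
```

With the class-sorted input every shard holds a single class, so a long run of consecutive steps sees only one label no matter how the shard order is shuffled.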
This is where the bug lies: I did not shuffle train.txt when generating the tfrecord files. As a result, I was effectively training on one group of labels at one learning rate and on a different group of labels after the learning rate decayed, which inevitably made the loss curve rise. Another, partial, reason I suspect is that when training a neural network from scratch, a smaller learning rate can easily lead to gradient explosion.
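A minimal sketch of the missing step, shuffling the file list before it is split into shards, might look like the following. It assumes train.txt holds one "image_path label" entry per line; shuffle_and_shard is a hypothetical helper, the file names are invented, and actually writing each shard to a tfrecord file is left out.

```python
import random


def shuffle_and_shard(lines, num_shards, seed=0):
    """Shuffle the entries globally, then deal them into shards
    round-robin so every shard gets a mix of classes."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::num_shards] for i in range(num_shards)]


# Hypothetical train.txt content, sorted by class (the problematic case).
lines = [f"img_{c}_{i}.JPEG {c}" for c in range(5) for i in range(20)]

shards = shuffle_and_shard(lines, num_shards=4)
for k, shard in enumerate(shards):
    labels = sorted({line.split()[1] for line in shard})
    print(f"shard {k}: {len(shard)} examples, labels present: {labels}")
```

Each shard would then be written to its own tfrecord file in this shuffled order, so the fixed order inside a shard no longer groups examples by label.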
Two solutions are recommended here.
- Create as many shards as possible when producing tfrecords.