
Tackle the num_epochs issue in TensorFlow

TensorFlow "sucks"! There are so many pitfalls along the way as you try to learn it. To be fair, TensorFlow provides multiple thoughtful and fundamental wrappers to boost your development efficiency, and one representative is its data reading mechanism.

To better digest TensorFlow's data reading mechanism, I recommend reading this blog. Following its suggestion, the first test code you are eager to try looks like:

import glob
import tensorflow as tf

input_img_list = glob.glob('dir/*.jpg')
# num_epochs=10 makes the queue emit every filename 10 times, then close
file_name_queue = tf.train.string_input_producer(input_img_list, shuffle=False, num_epochs=10)
reader = tf.WholeFileReader()
key_tmp, val_tmp = reader.read(file_name_queue)  # note that read() reads just one instance per call

with tf.Session() as sess:
    init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
    sess.run(init_op)
    output_dir = 'my_output_dir'
    threads = tf.train.start_queue_runners(sess=sess)
    indx = 0
    while True:
        img_tmp = sess.run(val_tmp)  # raw JPEG bytes of one image
        with open(output_dir + '/%d.jpg' % indx, 'wb') as f:
            f.write(img_tmp)
        indx += 1

Believe it or not, you will definitely encounter an error report that looks like:

OutOfRangeError: FIFOQueue '_1_input_producer' is closed and has insufficient elements

What is strange is that if you set num_epochs=None (the default value), it succeeds. Various online debugging suggestions overwhelmingly recommend running TensorFlow's local or global variable initializers. None of them makes the error go away.
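As a side note on why the initializer advice keeps appearing: setting num_epochs makes string_input_producer register a local epoch-counter variable, which does need tf.local_variables_initializer() (the snippets here already run it). Below is a minimal sketch to see this for yourself, assuming the TF 1.x graph-mode API and made-up file names; the exact variable name may differ across versions, and this alone does not fix the FIFOQueue error:

import tensorflow as tf

# With num_epochs set, string_input_producer registers a local "epochs" counter
# behind the scenes (hypothetical file names used here just for illustration).
queue = tf.train.string_input_producer(['a.jpg', 'b.jpg'], shuffle=False, num_epochs=10)

# Expect something like: [<tf.Variable 'input_producer/limit_epochs/epochs:0' ...>]
print(tf.local_variables())

with tf.Session() as sess:
    # Skipping this initializer yields a FailedPreconditionError about an
    # uninitialized "epochs" value, which is why the initializer advice exists.
    sess.run(tf.local_variables_initializer())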

The true reason is that queue runners are meant to work together with a Coordinator. The Coordinator is a robust thread manager: it stops the worker threads cleanly and re-raises exceptions when the program needs to stop. With num_epochs set, the queue closes after the last epoch and the reader raises tf.errors.OutOfRangeError, which your loop has to catch. As a consequence, the correct code snippet looks like:

import glob
import tensorflow as tf

input_img_list = glob.glob('dir/*.jpg')
file_name_queue = tf.train.string_input_producer(input_img_list, shuffle=False, num_epochs=10)
reader = tf.WholeFileReader()
key_tmp, val_tmp = reader.read(file_name_queue)  # note that read() reads just one instance per call

with tf.Session() as sess:
    init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
    sess.run(init_op)
    output_dir = 'my_output_dir'
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    indx = 0
    try:
        while not coord.should_stop():
            img_tmp = sess.run(val_tmp)  # raw JPEG bytes of one image
            with open(output_dir + '/%d.jpg' % indx, 'wb') as f:
                f.write(img_tmp)
            indx += 1
    except tf.errors.OutOfRangeError:
        print('Epoch limit reached. Training Done')
    finally:
        coord.request_stop()
    coord.join(threads)
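With this version the producer feeds the queue for exactly ten passes over the file list; once the queue closes, reader.read() raises OutOfRangeError, the except branch prints the message, and coord.request_stop() plus coord.join(threads) shut the background threads down cleanly instead of leaving the session hanging. As a quick sanity check you could append after coord.join(threads) (reusing the names from the snippet above), since each filename is emitted num_epochs times:

# indx should equal len(input_img_list) * num_epochs,
# e.g. 5 images * 10 epochs = 50 output files.
assert indx == len(input_img_list) * 10, indx
print('wrote %d files' % indx)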