# Using TensorFlow’s Batch Normalization Correctly

**Update**: This guide applies to TF1. For TF2, use tf.keras.layers.BatchNormalization layer.

The TensorFlow library’s layers API contains a function for batch normalization: `tf.layers.batch_normalization`

. It is supposedly as easy to use as all the other tf.layers functions, however, it has some pitfalls. This post explains how to use `tf.layers.batch_normalization`

correctly. It does not delve into what batch normalization is, which can be looked up in the paper “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” by Ioeffe and Szegedy (2015).

Quick link: tf.layers.batch_normalization API docs

**Summary**

**Use the**`training`

parameter of the`batch_normalization`

function.**Update the moving averages by evaluating the ops manually or by adding them as a control dependency.****The final code can be found in this Jupyter notebook.**

We start off by defining a simple computational graph. The input is a placeholder which takes a batch of scalar values.

x = tf.placeholder(tf.float32, [None, 1], 'x')

The input $x$ is fed into a batch normalization layer, yielding the graph’s output $y$:

y = tf.layers.batch_normalization(x)

With this setup we have got some basic batch normalization set up. We can create a session and feed a sample vector, here $x=\begin{bmatrix}-10 & 0 & 10\end{bmatrix}$.

sess = tf.Session() init = tf.global_variables_initializer() sess.run(init) y_out = sess.run(y, feed_dict={x: [[-10], [0], [10]]}) sess.close()

Surprisingly, the value of `y_out`

is not normalized at all. In fact it is something like $y=\begin{bmatrix}-9.995004 & 0 & 9.995004\end{bmatrix}$.

**The batch normalization layer does not normalize based on the current batch if its training parameter is not set to true.** Heading back to the definition of $y$, we can alter the method call a bit:

y = tf.layers.batch_normalization(x, training=True)

After making this change, the output for $y$ looks much more normalized. It is something like $y=\begin{bmatrix}-1.2247357 & 0 & 1.2247357\end{bmatrix}$.

With this setup, the batch normalization layer looks at the current batch and normalized it depending on its value. That might not be desired, at any time: Consider a case where there is only one sample in the batch, e.g. $x=\begin{bmatrix}-10\end{bmatrix}$. In this setup, the output would be $y=\begin{bmatrix}0\end{bmatrix}$ because that way the *batch* is normalized to zero-mean. Therefore it is common to store so called *moving averages* of mean and variance, to be able to choose the normalization factors $\beta$ and $\gamma$ properly. That way smaller batches can be normalized with the same parameters as batches before.

**In order to update the two moving average variables (mean and variance), which the tf.layers.batch_normalization function call creates automatically, two operations must be evaluated while feeding a batch through the layer.** The operations can be found in the collection

`tf.GraphKeys.UPDATE_OPS`

. In the example above, `tf.get_collection(tf.GraphKeys.UPDATE_OPS)`

yields[<tf.Operation 'batch_normalization/AssignMovingAvg' type=AssignSub>, <tf.Operation 'batch_normalization/AssignMovingAvg_1' type=AssignSub>]

If we change the setup and evaluate the update operations alongside with the forward pass

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) y_1 = sess.run([y, update_ops], feed_dict={x: [[-10], [0], [10]]})[0] y_2 = sess.run(y, feed_dict={x: [[-10]]})

the values for $y_1$ and $y_2$ remain the same. That is because the moving averages are only being used, if the `training`

parameter is set to `False`

. We can control it with a placeholder (here a placeholder with a default value)

is_training = tf.placeholder_with_default(False, (), 'is_training') y = tf.layers.batch_normalization(x, training=is_training)

and set it to `True`

when feeding the larger batch (and `False`

for the smaller batch; strictly not necessary because it is the placeholder’s default value anyways):

y_1 = sess.run([y, update_ops], feed_dict={x: [[-10], [0], [10]], is_training: True})[0] y_2 = sess.run(y, feed_dict={x: [[-10]], is_training: False})

The output we get is

- $y_1=\begin{bmatrix}-1.2247357 & 0 & 1.2247357\end{bmatrix}$: Normalized as desired, moving averages were updated based on this normalization.
- $y_2=\begin{bmatrix}-7.766966\end{bmatrix}$: Kind of weird. It’s neither $0$, which it was without moving averages, nor $-1.2247357$, which it should be if it was normalized with the same factors as the $x_1$ batch.

The reason for the *wrong* normalization of the small batch is that the moving averages update slowly. If we were to feed the larger batch multiple times, the second batch would be properly normalized:

for _ in range(1000): y_1 = sess.run([y, update_ops], feed_dict={x: [[-10], [0], [10]], is_training: True})[0] y_2 = sess.run(y, feed_dict={x: [[-10]], is_training: False})

Here, we feed the larger batch 1000 times and run the update operations every time. The result is $y_2=\begin{bmatrix}-1.224762\end{bmatrix}$ — the normalization based on moving averages works. By altering the `momentum`

parameter of `tf.layers.batch_normalization`

the pace of the average update can be adjusted.

Right now we have to call `sess.run`

and pass the `update_ops`

manually. **It is more convenient to add them as a control dependency, such that TensorFlow always executes them if the Tensor y is being evaluated.** The new graph definition looks like that:

x = tf.placeholder(tf.float32, [None, 1], 'x') y = tf.layers.batch_normalization(x, training=is_training) update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) with tf.control_dependencies(update_ops): y = tf.identity(y)

The forward passes are now much cleaner:

x_1 = [[-10], [0], [10]] x_2 = [[-10]] for _ in range(1000): y_1 = sess.run(y, feed_dict={x: x_1, is_training: True}) y_2 = sess.run(y, feed_dict={x: x_2})

Typically, `is_training`

should be set to `True`

during training and `False`

when performing inference.

The values stored by the batch normalization layer can be examined. In order to so, we retrieve their names from `tf.all_variables()`

which outputs

[<tf.Variable 'batch_normalization/gamma:0' shape=(1,) dtype=float32_ref>, <tf.Variable 'batch_normalization/beta:0' shape=(1,) dtype=float32_ref>, <tf.Variable 'batch_normalization/moving_mean:0' shape=(1,) dtype=float32_ref>, <tf.Variable 'batch_normalization/moving_variance:0' shape=(1,) dtype=float32_ref>]

in our toy example. The last two variables contain the moving averages of mean and variance of the past batches (for which the update ops were evaluated and `training`

was set to `True`

). The actual values can be queried as follows:

with tf.variable_scope("", reuse=tf.AUTO_REUSE): out = sess.run([tf.get_variable('batch_normalization/moving_mean'), tf.get_variable('batch_normalization/moving_variance')]) moving_average, moving_variance = out

thank you for this post i want to ask if you do not mind

How to introduce batch normalization in each layer in case of MLps for regression ? is is with same logic ? thank you.

Yes, it is the same logic. BN has nothing to do with the output. It zero-centers and normalizes the inner inputs of a layer

Bravo~

well done

Do you know if in tf2 this also holds? In the tf2 documentation, it wasn’t clear to me whether “training=true” is default. Also it’s unclear to me whether model.predict() is using moving average or not.

As far as I see things this got a lot simpler in TF2, with the tf.keras.layers.BatchNormalization layer, the values are being updated automatically and the moving averages are being used if the layer is being called with

`training=True`

.I have added a note at the top of the blog.