## tf.data.Dataset `map()`, `cache()`, and `shuffle()` notes

TensorFlow recommends the following tips (from https://www.tensorflow.org/guide/data_performance#overview):

>Here is a summary of the best practices for designing performant TensorFlow input pipelines:
>
>- Use the `prefetch` transformation to overlap the work of a producer and consumer.
>- Parallelize the data reading transformation using the `interleave` transformation.
>- Parallelize the `map` transformation by setting the `num_parallel_calls` argument.
>- Use the `cache` transformation to cache data in memory during the first epoch
>- Vectorize user-defined functions passed in to the `map` transformation
>- Reduce memory usage when applying the `interleave`, `prefetch`, and `shuffle` transformations.

Note that TF suggests `dataset.map(func).cache()` when `func` is a time-consuming op.

However, if `func` contains random ops, they are applied **only once**, during the first epoch. Later epochs replay the cached results, so the data is no longer randomized.

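In conclusion:
  - if `map()` has random ops: `dataset.shuffle().batch().cache().map().prefetch()`
  - if `map()` has NO random ops: `dataset.shuffle().batch().map().cache().prefetch()`

Note that `cache()` freezes everything upstream of it, so in both pipelines the order produced by `shuffle()` is also replayed unchanged after the first epoch. To reshuffle on every epoch, `shuffle()` would have to come after `cache()`.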
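
The two orderings above can be sketched as follows. This is a minimal illustration, not a reference pipeline: `random_op`, `deterministic_op`, `build`, and `epoch` are hypothetical helpers introduced here, and `shuffle()` is left out so that the two epochs can be compared element by element.

```python
import numpy as np
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE
values = np.arange(8, dtype=np.float32)

def random_op(v):
    # Hypothetical augmentation: one scalar of noise per batch.
    return v + tf.random.truncated_normal(shape=())

def deterministic_op(v):
    # Hypothetical stand-in for an expensive, deterministic transformation.
    return v * 2.0

def build(values, func, cache_before_map):
    ds = tf.data.Dataset.from_tensor_slices(values).batch(2)
    if cache_before_map:
        # Random ops: cache the raw batches, re-run map() every epoch.
        ds = ds.cache().map(func, num_parallel_calls=AUTOTUNE)
    else:
        # No random ops: cache the mapped result, map() runs in epoch 1 only.
        ds = ds.map(func, num_parallel_calls=AUTOTUNE).cache()
    return ds.prefetch(AUTOTUNE)

def epoch(ds):
    # Iterate the dataset once and flatten the batches into one array.
    return np.concatenate(list(ds.as_numpy_iterator()))

random_ds = build(values, random_op, cache_before_map=True)
fixed_ds = build(values, deterministic_op, cache_before_map=False)

print(epoch(random_ds), epoch(random_ds))  # two epochs with different noise
print(epoch(fixed_ds), epoch(fixed_ds))    # two identical epochs
```

The sections below run the same comparison step by step on the notebook's own data.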

Let's verify this with TensorFlow.

In [1]:
import tensorflow as tf
import numpy as np

In [2]:
rawvalue = np.arange(8,dtype=np.float32)
rawindex = np.arange(8,dtype=np.int64)  # `np.int` was removed in NumPy 1.24

## 1. `map()` Augmentation with `tf.random`
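Every epoch of the dataset is **RANDOM**. Because `map()` runs after `batch(2)` and the noise has `shape=()`, both elements of a batch share the same offset.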

In [3]:
def map_tf_randn(v,i):
    return (tf.add(v,tf.random.truncated_normal(shape=())),i)

In [4]:
dataset = tf.data.Dataset.from_tensor_slices((rawvalue,rawindex)).batch(2)
# suggest using `num_parallel_calls=tf.data.experimental.AUTOTUNE` in practice
dataset = dataset.map(map_tf_randn, num_parallel_calls=tf.data.experimental.AUTOTUNE)

In [5]:
for data, index in dataset:
    for d,i in zip(data,index):
        print(f'{i} + randn = {d};')

0 + randn = 0.7018770575523376;
1 + randn = 1.7018771171569824;
2 + randn = 3.3502042293548584;
3 + randn = 4.3502044677734375;
4 + randn = 2.3089733123779297;
5 + randn = 3.3089733123779297;
6 + randn = 6.10546875;
7 + randn = 7.10546875;


In [6]:
for data, index in dataset:
    for d,i in zip(data,index):
        print(f'{i} + randn = {d};')

0 + randn = 0.2537071406841278;
1 + randn = 1.2537071704864502;
2 + randn = 1.8037147521972656;
3 + randn = 2.8037147521972656;
4 + randn = 3.758244514465332;
5 + randn = 4.758244514465332;
6 + randn = 4.913802623748779;
7 + randn = 5.913802623748779;


## 2. `map()` Augmentation with `np.random`
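Every epoch of the dataset is **NOT RANDOM**: `np.random.randn()` runs only once, when `map()` traces the Python function into a graph, so its result is baked in as a constant.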

In [7]:
def map_np_randn(v,i):
    return (tf.add(v,np.random.randn()),i)

dataset_np_randn = tf.data.Dataset.from_tensor_slices((rawvalue,rawindex)).batch(2)
dataset_np_randn = dataset_np_randn.map(map_np_randn)

In [8]:
for data, index in dataset_np_randn:
    for d,i in zip(data,index):
        print(f'{i} + randn = {d};')

0 + randn = 1.3981781005859375;
1 + randn = 2.3981781005859375;
2 + randn = 3.3981781005859375;
3 + randn = 4.3981781005859375;
4 + randn = 5.3981781005859375;
5 + randn = 6.3981781005859375;
6 + randn = 7.3981781005859375;
7 + randn = 8.398178100585938;


In [9]:
for data, index in dataset_np_randn:
    for d,i in zip(data,index):
        print(f'{i} + randn = {d};')

0 + randn = 1.3981781005859375;
1 + randn = 2.3981781005859375;
2 + randn = 3.3981781005859375;
3 + randn = 4.3981781005859375;
4 + randn = 5.3981781005859375;
5 + randn = 6.3981781005859375;
6 + randn = 7.3981781005859375;
7 + randn = 8.398178100585938;
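
The identical offsets above are consistent with the Python body of the mapped function running only while `map()` traces it. A small sketch that makes this visible, assuming a hypothetical `trace_calls` list used purely as a Python-side counter:

```python
import numpy as np
import tensorflow as tf

trace_calls = []  # records how often the Python body of the function runs

def map_np_randn(v):
    trace_calls.append(1)         # Python side effect: happens during tracing only
    return v + np.random.randn()  # evaluated in Python, so it becomes a constant

values = np.arange(4, dtype=np.float32)
ds = tf.data.Dataset.from_tensor_slices(values).map(map_np_randn)

epoch1 = np.array(list(ds.as_numpy_iterator()))
epoch2 = np.array(list(ds.as_numpy_iterator()))

print(len(trace_calls))                # far fewer than the 8 elements processed
print(epoch1 - values)                 # the same offset for every element
print(np.array_equal(epoch1, epoch2))  # both epochs match
```

For per-element randomness inside `map()`, use `tf.random` ops as in section 1, since those are executed as part of the graph on every call.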


## 3. Using `cache()` *after* `map()`
Every epoch of the dataset is **NOT RANDOM** (even though we use `tf.random`), because `cache()` stores the output of the first epoch and replays it afterwards.

In [10]:
dataset_cache = tf.data.Dataset.from_tensor_slices((rawvalue,rawindex)).batch(2)
dataset_cache = dataset_cache.map(map_tf_randn).cache()

In [11]:
for data, index in dataset_cache:
    for d,i in zip(data,index):
        print(f'{i} + randn = {d};')

0 + randn = 1.3161203861236572;
1 + randn = 2.3161203861236572;
2 + randn = 2.256491184234619;
3 + randn = 3.256491184234619;
4 + randn = 2.33833646774292;
5 + randn = 3.33833646774292;
6 + randn = 7.918804168701172;
7 + randn = 8.918804168701172;


In [12]:
for data, index in dataset_cache:
    for d,i in zip(data,index):
        print(f'{i} + randn = {d};')

0 + randn = 1.3161203861236572;
1 + randn = 2.3161203861236572;
2 + randn = 2.256491184234619;
3 + randn = 3.256491184234619;
4 + randn = 2.33833646774292;
5 + randn = 3.33833646774292;
6 + randn = 7.918804168701172;
7 + randn = 8.918804168701172;


## 4. Using `cache()` *before* `map()`
Every epoch of the dataset is now **RANDOM**, because `map()` runs again on the cached data in every epoch.

In [13]:
dataset_cache2 = tf.data.Dataset.from_tensor_slices((rawvalue,rawindex)).batch(2)
dataset_cache2 = dataset_cache2.cache().map(map_tf_randn)

In [14]:
for data, index in dataset_cache2:
    for d,i in zip(data,index):
        print(f'{i} + randn = {d};')

0 + randn = 1.6351758241653442;
1 + randn = 2.6351757049560547;
2 + randn = 1.7928622961044312;
3 + randn = 2.7928624153137207;
4 + randn = 4.905435562133789;
5 + randn = 5.905435562133789;
6 + randn = 6.691409111022949;
7 + randn = 7.691409111022949;


In [15]:
for data, index in dataset_cache2:
    for d,i in zip(data,index):
        print(f'{i} + randn = {d};')

0 + randn = -0.5546321868896484;
1 + randn = 0.44536781311035156;
2 + randn = 3.8618063926696777;
3 + randn = 4.861806392669678;
4 + randn = 3.8405842781066895;
5 + randn = 4.8405842781066895;
6 + randn = 7.156463146209717;
7 + randn = 8.156462669372559;
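
When a pipeline has both an expensive deterministic step and a random augmentation, one option that follows from sections 3 and 4 is to split them into two `map()` calls with `cache()` in between. The sketch below assumes two hypothetical functions, `expensive_preprocess` and `random_augment`, which are not part of the original notebook:

```python
import numpy as np
import tensorflow as tf

values = np.arange(8, dtype=np.float32)

def expensive_preprocess(v):
    # Hypothetical stand-in for a costly, deterministic step (e.g. decoding).
    return v * 2.0

def random_augment(v):
    # Random step kept after cache() so that it is re-run on every epoch.
    return v + tf.random.truncated_normal(shape=())

ds = (
    tf.data.Dataset.from_tensor_slices(values)
    .batch(2)
    .map(expensive_preprocess)  # computed during the first epoch only
    .cache()                    # stores the preprocessed batches
    .map(random_augment)        # fresh noise in every epoch
    .prefetch(tf.data.experimental.AUTOTUNE)
)

epoch1 = np.concatenate(list(ds.as_numpy_iterator()))
epoch2 = np.concatenate(list(ds.as_numpy_iterator()))
print(epoch1)
print(epoch2)  # same preprocessed base values, different noise
```

This keeps the caching benefit for the costly part while the augmentation stays random across epochs.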


## 5. `shuffle()` mechanism

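Sometimes samples that belong to the same class are stored next to each other when the dataset is constructed. Be careful in this case when the `buffer_size` of `shuffle()` is too small: each output element is drawn only from the elements currently in the buffer, so the data stays largely grouped by class and the shuffle works poorly. The following figure explains the `shuffle` mechanism and shows why.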

<img src="shuffle.png" width=40% height=40% align="left">
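
The effect of a small buffer can also be reproduced with a pure-Python model. `buffered_shuffle` below is a hypothetical, simplified simulation of a shuffle buffer, not TensorFlow's actual implementation: it fills a buffer, emits a random buffered element, and refills the freed slot with the next input.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Simplified model of a shuffle buffer (not TensorFlow's actual code)."""
    rng = random.Random(seed)
    buffer, out = [], []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)    # still filling the buffer
            continue
        slot = rng.randrange(buffer_size)
        out.append(buffer[slot])   # emit a random buffered element
        buffer[slot] = item        # refill the freed slot with the new item
    rng.shuffle(buffer)            # input exhausted: drain what is left
    out.extend(buffer)
    return out

labels = [0] * 50 + [1] * 50       # class-sorted data, as described above

small = buffered_shuffle(labels, buffer_size=10)
large = buffered_shuffle(labels, buffer_size=len(labels))

print(small[:40])  # only class 0: class 1 cannot reach the buffer this early
print(large[:40])  # a mix of both classes
```

With `buffer_size=10`, the element emitted at position `k` can only come from the first `k + 10` inputs, so the first 40 outputs are all class 0. A buffer at least as large as the dataset removes that limit.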