Imputing missing data in teras

Using state-of-the-art deep learning imputation models for tabular data can be quite a challenge, not just because of how complex the model architecture can get, but also because of the data preprocessing and transformation steps involved. teras makes it as easy as doing a classification or regression task.

As of teras v0.3, it offers two GAN-based architectures for data imputation, namely GAIN and PCGAIN.

For the sake of this tutorial, we’ll use the GAIN architecture.

So without further ado, let’s get to coding!

As always, the first step is to configure your backend. I’ll be using JAX because it’s almost always the fastest of the three.

To configure your backend for teras, you need to set the KERAS_BACKEND environment variable.

NOTE: You need to configure your backend before importing teras/keras.

import os
os.environ["KERAS_BACKEND"] = "jax"
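
Because the variable is only read when Keras is imported, it can help to guard against setting it too late. The following is a minimal sketch, not part of teras or Keras: set_keras_backend is a hypothetical helper name, and the tuple of names covers only the three backends mentioned above.

```python
import os
import sys
import warnings

# Backend names Keras 3 accepts for the three backends discussed in this tutorial.
SUPPORTED_BACKENDS = ("jax", "tensorflow", "torch")


def set_keras_backend(name):
    """Hypothetical helper: set KERAS_BACKEND and warn if it is already too late."""
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unknown backend {name!r}; expected one of {SUPPORTED_BACKENDS}")
    if "keras" in sys.modules:
        # Keras picks its backend at import time, so a later change has no effect.
        warnings.warn("keras is already imported; restart the interpreter to switch backends.")
    os.environ["KERAS_BACKEND"] = name


set_keras_backend("jax")
print(os.environ["KERAS_BACKEND"])
```

Call the helper at the very top of a script or notebook, before any import of keras or teras.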

For this tutorial, we’ll be using the Boston Housing dataset made available by keras.

from keras.datasets import boston_housing

(X_train, y_train), (X_test, y_test) = boston_housing.load_data()

Let’s combine all the data. Our task here is self-supervised, so we don’t need labels or test data to compute any metrics.

import numpy as np

dataset = np.concatenate([np.concatenate([X_train, y_train[:, np.newaxis]], axis=1),
                          np.concatenate([X_test, y_test[:, np.newaxis]], axis=1)],
                         axis=0)
dataset.shape

(506, 14)
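
If the nested concatenate calls are hard to read, the same stacking can be written with np.column_stack and np.vstack. This is only an illustrative sketch on small synthetic arrays (the names X_tr, y_tr, X_te and y_te are made up for the example); it checks that both forms give the same result.

```python
import numpy as np

rng = np.random.default_rng(0)
# Tiny stand-ins for the train/test features and targets.
X_tr, y_tr = rng.normal(size=(4, 3)), rng.normal(size=4)
X_te, y_te = rng.normal(size=(2, 3)), rng.normal(size=2)

# Append the target as an extra column, then stack train on top of test.
train = np.column_stack([X_tr, y_tr])
test = np.column_stack([X_te, y_te])
combined = np.vstack([train, test])

# Same result as the nested np.concatenate form used in the tutorial.
nested = np.concatenate([np.concatenate([X_tr, y_tr[:, np.newaxis]], axis=1),
                         np.concatenate([X_te, y_te[:, np.newaxis]], axis=1)],
                        axis=0)
print(combined.shape, np.array_equal(combined, nested))
```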

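It’s always a good idea to normalize the dataset. Note that scikit-learn’s Normalizer rescales each sample (row) to unit norm; it does not scale each feature (column) independently.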

from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
dataset = normalizer.fit_transform(dataset)
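
To make the effect of this step concrete, here is a small NumPy-only sketch on made-up numbers. It contrasts row-wise L2 normalization (what Normalizer does with its default settings) with column-wise min-max scaling, which is another common choice for tabular data. Which one suits your data is a judgment call; nothing here is prescribed by teras.

```python
import numpy as np

data = np.array([[1.0, 200.0],
                 [2.0, 100.0],
                 [3.0, 400.0]])

# Row-wise L2 normalization: every sample ends up with unit Euclidean norm.
row_norms = np.linalg.norm(data, axis=1, keepdims=True)
row_normalized = data / row_norms

# Column-wise min-max scaling: every feature ends up in the range [0, 1].
col_min = data.min(axis=0)
col_max = data.max(axis=0)
minmax_scaled = (data - col_min) / (col_max - col_min)

print(np.linalg.norm(row_normalized, axis=1))
print(minmax_scaled.min(axis=0), minmax_scaled.max(axis=0))
```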

Now, this dataset doesn’t contain any missing values itself, so we’ll inject missing values ourselves to simulate a real-world scenario.

For that, teras offers a handy utility that is quite helpful for quickly simulating such situations. It is conveniently named inject_missing_values.

from teras.utils import inject_missing_values

print("# of missing values: ")
print("Before injecting: ", np.isnan(dataset).sum())
dataset = inject_missing_values(dataset, 0.2)
print("After injecting: ", np.isnan(dataset).sum())
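
For intuition, here is a rough NumPy sketch of what a utility like this might do. It is not the teras implementation: inject_nans_mcar is a hypothetical name, and the sketch assumes values are dropped completely at random, each entry independently with the given probability.

```python
import numpy as np


def inject_nans_mcar(data, miss_rate, seed=None):
    """Hypothetical stand-in: replace each entry with NaN with probability miss_rate."""
    rng = np.random.default_rng(seed)
    out = np.array(data, dtype=float, copy=True)  # never modify the caller's array
    drop = rng.random(out.shape) < miss_rate      # True where a value is removed
    out[drop] = np.nan
    return out


demo = np.ones((506, 14))
with_nans = inject_nans_mcar(demo, 0.2, seed=0)
frac = np.isnan(with_nans).mean()
print(f"fraction missing: {frac:.3f}")  # close to 0.2, but not exactly 0.2
```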

The GAIN architecture that we’ll be using requires the dataset in the form (x_generator, x_discriminator).

There’s a handy data utility function in teras for this purpose named create_gain_dataset.

NOTE: As of teras v0.3.0, you need to have TensorFlow installed to use this function, since it uses tf.data to create a TensorFlow dataset, which Keras 3 can then consume with any backend. The same applies to the data sampling classes available in teras. You may not like TensorFlow but you cannot not like tf.data.

from teras.data_utils import create_gain_dataset

gain_dataset = create_gain_dataset(dataset)

# Remember to batch your tensorflow dataset
BATCH_SIZE = 64
gain_dataset = gain_dataset.batch(BATCH_SIZE)
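
As a quick sanity check on the batching, the number of batches per epoch can be worked out by hand. The sketch below assumes the default tf.data behaviour of keeping the final, smaller batch (drop_remainder=False).

```python
import math

n_samples = 506   # rows in the combined dataset
batch_size = 64   # same value as BATCH_SIZE above

# With the last partial batch kept, the batch count is a ceiling division.
n_batches = math.ceil(n_samples / batch_size)
last_batch = n_samples - (n_batches - 1) * batch_size

print(n_batches, last_batch)  # 8 batches, the last one holding 58 rows
```

This batch count is what Keras reports as the number of steps per epoch during training.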

Now let’s import GAIN.

Since GAIN is a generative adversarial network, it requires instances of a generator and a discriminator, which we’ll also import.

from teras.models import GAIN
from teras.models import GAINGenerator
from teras.models import GAINDiscriminator

If you look at the documentation, instantiating either GAINGenerator or GAINDiscriminator requires a positional argument, namely data_dim. It’s usually the same as the input dimensionality of the dataset, but it has its own name for cases where the input dataset’s dimensionality differs from that of the original dataset because of data transformations or other preprocessing.

Anyway, here data_dim refers to the dimensionality of the original dataset.

dataset.shape[1]

generator = GAINGenerator(data_dim=dataset.shape[1])

discriminator = GAINDiscriminator(data_dim=dataset.shape[1])

gain = GAIN(generator,
            discriminator)

NOTE: You can customize these models further by specifying various keyword arguments. Look up the docs! I’ll just stick with the defaults for the sake of this tutorial.

Now let’s compile our model. Note that we’re not passing any loss function to the compile method of the GAIN instance, because these specialized architectures compute their losses internally.

import keras

gain.compile(generator_optimizer=keras.optimizers.Adam(),
             discriminator_optimizer=keras.optimizers.Adam())
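
To give a feel for what such built-in losses look like, here is a NumPy sketch of the objectives as described for GAIN in the original paper and its reference implementation. It is an assumption that teras follows the same formulation; the function name gain_losses, the alpha value of 100 and the eps term are illustrative choices, not teras API.

```python
import numpy as np


def gain_losses(x, mask, generated, d_prob, alpha=100.0, eps=1e-8):
    """Sketch of GAIN-style losses (illustrative, not the teras internals).

    x:         data with missing entries filled with 0
    mask:      1 where a value is observed, 0 where it is missing
    generated: generator output, same shape as x
    d_prob:    discriminator's estimated probability that each entry is observed
    """
    # Discriminator: binary cross-entropy between d_prob and the true mask.
    d_loss = -np.mean(mask * np.log(d_prob + eps)
                      + (1 - mask) * np.log(1 - d_prob + eps))
    # Generator, adversarial part: make missing entries look observed.
    g_adv = -np.mean((1 - mask) * np.log(d_prob + eps))
    # Generator, reconstruction part: match the values that were observed.
    mse = np.sum(mask * (x - generated) ** 2) / np.sum(mask)
    return d_loss, g_adv + alpha * mse


rng = np.random.default_rng(0)
mask = (rng.random((8, 4)) > 0.2).astype(float)
x = mask * rng.random((8, 4))
d_loss, g_loss = gain_losses(x, mask, generated=x, d_prob=np.full(x.shape, 0.5))
print(d_loss, g_loss)
```

With an undecided discriminator (probability 0.5 everywhere) the discriminator loss in this sketch is log 2, and with a perfect reconstruction only the adversarial term of the generator loss remains.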

The rule of thumb for GAN-based models in teras is to ALWAYS build them yourself. The dataset that we pass to such architectures usually deviates from the normal (X, y) paired dataset, so Keras cannot infer the expected input shape and fails to build such models automatically.

So let’s build the model ourselves!

gain.build((BATCH_SIZE, dataset.shape[1]))

Now, if and only if you’re using the JAX backend, you’ll have to call the build_optimizers method when using any GAN-based model or any other model that uses more than one optimizer. It is not needed for other backends like TensorFlow or PyTorch, nor is it needed for architectures that use a single optimizer, which is how it is in the vast majority of cases.

Anyway, since we ARE using the JAX backend, we’ll call this method.

gain.build_optimizers()

WARNING: Calling the build_optimizers method on a backend other than JAX will result in an error!

history = gain.fit(gain_dataset, epochs=2)

Epoch 1/2
8/8 ━━━━━━━━━━━━━━━━━━━━ 5s 384ms/step - discriminator_loss: 0.7368 - generator_loss: 48.5096
Epoch 2/2
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - discriminator_loss: 0.7002 - generator_loss: 47.1353

Now the model is trained. Cool. But if we can’t put it to use, it’s useless. So let’s put it into use.

To impute data with missing values, you can either use the predict method of the trained GAIN instance or use the Imputer class available in the teras.tasks module. The Imputer class may not be that useful here, but it can be very useful in cases where you transform your data using a data transformer class.

So, assuming you already know how to use predict, we’ll use the Imputer class here. It offers an impute method that takes in a dataset with missing values and returns the imputed data. If a data transformer instance is passed in during instantiation, it will return the imputed data in its original format.

Since we’re not using any data transformer class, we’ll set the reverse_transform parameter of the impute method to False; otherwise it will result in an error.

from teras.tasks import Imputer

gain_imputer = Imputer(gain)

imputed_dataset = gain_imputer.impute(dataset, reverse_transform=False)

16/16 ━━━━━━━━━━━━━━━━━━━━ 1s 32ms/step

print("Missing values in the original dataset: ", np.isnan(dataset).sum())
print("Missing values in the imputed dataset: ", np.isnan(imputed_dataset).sum())

Missing values in the original dataset:  1426
Missing values in the imputed dataset:  0
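
Counting NaNs only shows that every gap was filled, not how well. Because the missing values here were injected into complete data, imputation quality can be measured if a copy of the complete data is kept beforehand. The sketch below shows the idea on synthetic data, with simple column-mean imputation standing in for the trained model; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
complete = rng.random((100, 5))                 # ground truth, kept aside
missing_mask = rng.random(complete.shape) < 0.2 # True where a value is hidden
corrupted = complete.copy()
corrupted[missing_mask] = np.nan

# Stand-in imputer: fill each gap with the mean of the observed values in its column.
col_means = np.nanmean(corrupted, axis=0)
imputed = np.where(np.isnan(corrupted), col_means, corrupted)

# Score only the entries that were actually hidden.
rmse = np.sqrt(np.mean((imputed[missing_mask] - complete[missing_mask]) ** 2))
# Observed entries should come back unchanged.
observed_unchanged = np.array_equal(imputed[~missing_mask], complete[~missing_mask])

print(f"RMSE on injected entries: {rmse:.3f}")
print("Observed entries unchanged:", observed_unchanged)
```

To apply the same check to this tutorial, you would keep a copy of dataset before calling inject_missing_values and compare it with imputed_dataset at the positions that became NaN.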

And that wraps it up! As you saw, it’s super easy and intuitive to use complex, state-of-the-art architectures for data imputation, thanks to teras!

If you have any questions or run into an issue, reach us on Twitter at @TerasML or file an issue in the teras GitHub repository.