Neural Style Transfer in a Most Simple Way

The Neural Style Transfer algorithm has become very popular lately. Looking at how fast "Deep Learning" algorithms have advanced in the last 5 years, it seems that Neural Style Transfer algorithms will keep this popularity for a long time, improving themselves with new approaches every day. On top of that, with the cyberpunk art trend on NFT platforms, we now have computers drawing more pictures than ever before. What an irony, right?

What is "Neural Style Transfer"?

According to Wikipedia: Neural Style Transfer (NST) refers to a class of software algorithms that manipulate digital images, or videos, in order to adopt the appearance or visual style of another image. NST algorithms are characterized by their use of deep neural networks for the sake of image transformation. Common uses for NST are the creation of artificial artwork from photographs, for example by transferring the appearance of famous paintings to user-supplied photographs. Several notable mobile apps use NST techniques for this purpose, including DeepArt and Prisma. This method has been used by artists and designers around the globe to develop new artwork based on existent style(s). Wikipedia Article about Neural Style Transfer

Alright, Wikipedia knows what that is, but what should "we" understand from this freaking paragraph? Okay okay, don't be angry, I will try to explain everything. Yes, Wikipedia knows what this is and has explained it. What "we" should understand from it can be put simply like this: the "Neural Style Transfer" algorithm takes 2 photos as input, one to focus on the shapes (the content) and the other to focus on the texture (the style) of the image. Then it creates a new photo by building a synthesis of these 2 photos.
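To put the same idea a bit more formally (this is the standard formulation, and it is exactly the weighted sum the code further down ends up computing): we start from a generated image and keep adjusting its pixels so that a combined loss gets smaller:

$$\mathcal{L}_{total}(c, s, \hat{y}) = w_{content} \cdot \mathcal{L}_{content}(c, \hat{y}) + w_{style} \cdot \mathcal{L}_{style}(s, \hat{y})$$

Here $c$ is the content photo, $s$ is the style photo and $\hat{y}$ is the photo being generated. The rest of the article is just about building the two loss terms and then minimizing this sum.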

OK, I see! Just let me see the code.

OK, I will let you see the code in a second, but I want to give some instructions before starting. I will walk through the "Neural Style Transfer" implementation that I have made (you can access the code from this link). We will go through the code and the mathematical background of the algorithm at the same time. So, don't be confused! Please stay on the right track, Sir!

Let's Start - Building the model!

First, as with almost every Computer Vision application, we will need a Feature Extractor model to extract detailed features from our images. With the Feature Extractor model, we will pull out the feature maps of the images we have and then run various calculations on these features. I created 3 different models at this stage. (All Models)

When you examine the models, you should notice that 2 of them are built on the pre-trained VGG19 model and one is written completely from scratch. At this stage, we need to know that pre-trained models have already been trained on various large datasets (here we prefer the imagenet dataset) and are therefore better at extracting features than models written from scratch. For these reasons, vgg_extractor_model should give us the best performance here. vgg_extractor_model looks like this:

@classmethod
def vgg_extractor_model(cls):
  # pre-trained VGG19 without the classification head
  vgg_model = tf.keras.applications.VGG19(include_top = False,
                                          weights = "imagenet")

  # first convolution of each block -> style features, block5_conv2 -> content features
  style_conv_blocks = [f"block{i}_conv1" for i in range(1, 6)]
  content_conv_block = ["block5_conv2"]
  all_activation_layers = style_conv_blocks + content_conv_block

  input_layer = vgg_model.inputs
  output_layers = [vgg_model.get_layer(i).output for i in all_activation_layers]

  # a model that maps an image to the six selected activation maps
  model = tf.keras.models.Model(inputs = input_layer,
                                outputs = output_layers,
                                name = "vgg19_extractor")

  return model
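Before moving on, it can be handy to sanity-check what this extractor actually returns. Here is a minimal sketch, assuming FeatureExtractor is importable from src/feature_extractor.py (the same import used in the full example further down) and using a random dummy image:

import numpy as np
import tensorflow as tf

from feature_extractor import FeatureExtractor

# build the extractor and push a dummy image through it
extractor = FeatureExtractor.vgg_extractor_model()
dummy_image = np.random.rand(1, 224, 224, 3).astype("float32")
feature_maps = extractor(dummy_image)

# six activation maps: five style layers (block1-5_conv1) + one content layer (block5_conv2)
for feature_map in feature_maps:
  print(feature_map.shape)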

At this stage, there is a part that I would like to draw your attention to. If you look carefully, the vgg_extractor_model we created, which wraps the pre-trained VGG19 model, returns only a few convolution layers, not the entire structure:

style_conv_blocks = [f"block{i}_conv1" for i in range(1, 6)]
content_conv_block = ["block5_conv2"]
all_activation_layers = style_conv_blocks + content_conv_block

The reason for this is that we will calculate the losses only with the feature maps produced by these convolution layers, not with the output of the whole model.
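If you are curious which layers are available to pick from, a quick way (just a sketch, not part of the repository) is to list the convolution layer names of the Keras VGG19 application directly:

import tensorflow as tf

# weights = None skips downloading the imagenet weights, since we only want the layer names
vgg = tf.keras.applications.VGG19(include_top = False, weights = None)
conv_layer_names = [layer.name for layer in vgg.layers if "conv" in layer.name]
print(conv_layer_names)
# ['block1_conv1', 'block1_conv2', ..., 'block5_conv3', 'block5_conv4']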

One Step Further - Calculating the loss values!

After creating our model, it's time to calculate the loss values using the feature maps that come from this model. There are 3 things we need to calculate at this stage:

  • Gramian Matrix (or Gram Matrix)
  • Style Loss
  • Content Loss

Gramian Matrix: The Gramian Matrix is a form of the feature matrices, returned from the convolution layers by our vgg_extractor_model, that is easier to do calculations with. We will use this form of the matrix when we calculate our style loss. Here is the code implementation of the Gramian Matrix transformation:

@classmethod
def gram_matrix(cls, arr):
  """Gramian matrix for calculating style loss"""
  # move the channel axis first: (H, W, C) -> (C, H, W)
  x = tf.transpose(arr, (2, 0, 1))
  # flatten each channel: (C, H * W)
  features = tf.reshape(x, (tf.shape(x)[0], -1))
  # channel-by-channel correlations: (C, C)
  gram = tf.matmul(features, tf.transpose(features))

  return gram
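In math terms, the code above reshapes the feature map into a matrix with one row per channel and then multiplies it by its own transpose:

$$F \in \mathbb{R}^{C \times HW}, \qquad G = F F^{\top}, \qquad G_{ij} = \sum_{k=1}^{HW} F_{ik} F_{jk}$$

Each entry $G_{ij}$ measures how strongly channels $i$ and $j$ activate together, regardless of where in the image they do so, which is why it works well as a summary of texture ("style").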

Style Loss: Style loss is simply a loss value calculated from the distance between the texture of the generated image and the texture of the input style image. It is computed between the Gram matrix of the generated image's features and the Gram matrix of the input style image's features.

@classmethod
def style_loss(cls, style, generated):
  style_gram = cls.gram_matrix(style)
  generated_gram = cls.gram_matrix(generated)

  style_loss = tf.reduce_mean(tf.square(generated_gram - style_gram))

  return style_loss
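As a quick, hypothetical smoke test (assuming these loss classmethods live on the Loss class that Train inherits from further down; the exact import path depends on the repository layout), you can feed two random feature maps shaped (height, width, channels) and get a scalar back:

import tensorflow as tf

# two random "feature maps" just to see the API in action
style_features = tf.random.uniform((32, 32, 64))
generated_features = tf.random.uniform((32, 32, 64))

# Loss is the class holding gram_matrix / style_loss (import it from wherever it lives in the repo)
loss_value = Loss.style_loss(style_features, generated_features)
print(float(loss_value))  # a single scalar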

Content Loss: Content loss works similarly to style loss, but this time we measure how far the shapes (the content) of the generated image are from those of the input content image, directly on the feature maps.

@classmethod
def content_loss(cls, content, generated):
  """1/2 * sum of (generated - original) ** 2"""
  content_loss = tf.reduce_sum(tf.square((generated - content)))

  return content_loss * 5e-1
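Written out, and matching the two implementations above:

$$\mathcal{L}_{style}(s, \hat{y}) = \operatorname{mean}\!\left[\big(G(\hat{y}) - G(s)\big)^{2}\right], \qquad \mathcal{L}_{content}(c, \hat{y}) = \frac{1}{2}\sum \big(\hat{y} - c\big)^{2}$$

Here $G(\cdot)$ is the Gram matrix from gram_matrix, $s$ and $c$ are the style and content feature maps, and $\hat{y}$ is the corresponding feature map of the generated image.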

Let's Put Everything Together - Final step!

After all these steps, we are finally ready to assemble everything. First of all, let me show you how the loss is calculated at each step:

class Constants(object):
  # weights of the two loss terms in the final sum
  CONTENT_WEIGHT = 2e-5
  STYLE_WEIGHT = 1e-4


class Train(Loss, FeatureExtractor):

  @classmethod
  def calculate_step_loss(cls, model, content, style, generated):
    # stack the images in a fixed order: 0 = content, 1 = style, -1 = generated
    tensor = tf.concat([content, style, generated], axis = 0)
    features = cls.extract(image_stack = tensor, model = model)
    content_act, style_act = cls.get_layers(features)

    # content loss: content image features vs. generated image features
    content_loss = cls.content_loss(content_act[0], content_act[-1])

    # style loss: accumulated over every selected style layer
    style_loss = 0.
    for layer in style_act:
      layer_loss = cls.style_loss(layer[1], layer[-1])
      style_loss += layer_loss

    # weighted sum of the two losses
    loss = (content_loss * Constants.CONTENT_WEIGHT) \
           + (style_loss * Constants.STYLE_WEIGHT)

    return loss

Here is what we do: first, we calculate the feature maps with our "feature extractor" model. Afterwards, we calculate the style loss and content loss values for each layer (the outputs of the convolution layers), multiply them by the "Constants" values we determined arbitrarily, and sum them into a single loss. The "Constants" values you see at the top of the code block are the weights of the two losses; we can give a higher value to whichever feature we want to be in the foreground. All that is left is to pick an epoch value and start the process.

def train(model, content, style, generated, epochs = 10):
  optimizer = tf.keras.optimizers.SGD(learning_rate = 1e-4)
  for epoch in range(epochs):
    with tf.GradientTape() as GT:
      loss = Train.calculate_step_loss(model, content, style, generated)

    print(f"EPOCH: {epoch + 1} \nLOSS: {loss}\n" + ("---" * 15))

    # gradient of the loss with respect to the generated image's pixels
    gradients = GT.gradient(loss, generated)
    # update the generated image in the direction that lowers the loss
    optimizer.apply_gradients([(gradients, generated)])

  return generated

Finally, above you see the main training function. It takes the images and the epoch value we have determined and starts processing the generated image.

Let's See This Bigboy in Action!

import os

from skimage import io
import tensorflow as tf

# train() function from src/train.py
from train import train

# feature extraction model from src/feature_extractor.py
from feature_extractor import FeatureExtractor

# all image paths
BASE = "./my/path/to/my/images"
CONTENT_IMAGE_PATH = os.path.join(BASE, "my_content_image.jpeg")
STYLE_IMAGE_PATH = os.path.join(BASE, "my_style_image.png")

# read your custom images
CONTENT = io.imread(CONTENT_IMAGE_PATH)
STYLE = io.imread(STYLE_IMAGE_PATH)

# the generated image we will optimize: start from the content image,
# cast to float so the optimizer can apply gradients to it
COMBINED = tf.Variable(tf.cast(CONTENT, tf.float32))

# define feature extractor model
model = FeatureExtractor.vgg_extractor_model()
# there are three options for models (check the src/feature_extractor.py)

# main training
STYLED_IMAGE = train(model = model,
                     content = CONTENT,
                     style = STYLE,
                     generated = COMBINED,
                     epochs = 50)

After generating and optimizing for 50 epochs, the results are here: Final Generated Image after 50 Epochs
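If you want to keep the result, remember that STYLED_IMAGE is a float tf.Variable after training. A minimal, hypothetical post-processing step (reusing BASE from above and a made-up output filename) could look like this:

import numpy as np

# clip to the valid pixel range and cast back to uint8 before saving
final_image = np.clip(STYLED_IMAGE.numpy(), 0, 255).astype(np.uint8)
io.imsave(os.path.join(BASE, "styled_output.png"), final_image)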

Conclusion

Alright then! We have seen a really basic implementation and a little bit of the mathematical background of "Neural Style Transfer". I have tried to keep it simple. If there is anything on your mind about the article, you can reach me anytime.

Check the repository: Neural Style Transfer by Egesabanci

Ask me anything at any time of the day:

You can follow me on GitHub, and we can collaborate on projects: github.com/Egesabanci