Compare The Performance Of The Models In Art Classification - PLOS

Training settings

In our experiments, we designed a unified framework that can separately learn representations of the artist, genre, and style of visual artworks from a large number of multiply labeled digital paintings. The features extracted by the framework can then be used to classify paintings and to retrieve similar paintings, as described in the Results and Discussion section. To extract features and apply them to classification tasks, we built a general architecture defining the entire system, as shown in Fig 3. The system contains two sections. In the training section, visual embeddings representing the appearance of the paintings are extracted, and the weights of the CNN are learned via appropriate learning methods. In the test section, the trained CNN is used to classify paintings based on the extracted features.

Fig 3. The overall general architecture of the model.

We modified the last fully connected layer of the CNN to match the number of painting categories in each task; here, we show the style categories. All of the paintings are in the public domain (courtesy of wikiart.org).

https://doi.org/10.1371/journal.pone.0248414.g003

The pseudocode for training a neural network with mini-batches and calculating the classification accuracy on test paintings is shown in Algorithm 1. In each iteration, we sample b images, compute the gradients, and update the network parameters. We evaluate the model after each epoch and take the maximum accuracy over the K training epochs as the final result. All networks were fine-tuned using stochastic gradient descent (SGD) [36] with L2 regularization (which skips the bias and batch normalization parameters), a momentum of 0.9, and a weight decay of 0.0001. In all experiments, the following data augmentation techniques were applied to expand the small amount of training data:

  1. RandomResizedCrop. A random crop was taken, with an area between 0.08 and 1.0 of the original image area and an aspect ratio between 3/4 and 4/3 of the original aspect ratio. The cropped image was then resized to 224*224 using a randomly chosen interpolation method.
  2. RandomHorizontalFlip. The given image was randomly flipped horizontally at a preset probability of 50%.
  3. The Python Imaging Library (PIL) image was converted to a tensor and normalized using the channel means and standard deviations.
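The RandomResizedCrop parameters above (area scale 0.08–1.0, aspect ratio 3/4–4/3) can be illustrated with a minimal, stdlib-only sketch of how such a crop box might be sampled; the function name and the retry count of 10 are illustrative assumptions, not taken from the paper.

```python
import math
import random

def sample_crop_box(width, height, scale=(0.08, 1.0), ratio=(3/4, 4/3), rng=random):
    """Sample a crop box in the style of RandomResizedCrop: pick a target
    area in [8%, 100%] of the image area and an aspect ratio in [3/4, 4/3],
    then place the box at a random offset inside the image."""
    area = width * height
    for _ in range(10):  # retry a few times if the sampled box does not fit
        target_area = rng.uniform(scale[0], scale[1]) * area
        # sample the aspect ratio uniformly in log space
        aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            x = rng.randint(0, width - w)
            y = rng.randint(0, height - h)
            return x, y, w, h
    # fallback: keep the whole image
    return 0, 0, width, height
```

The returned box would then be cropped out and resized to 224*224 before being fed to the network.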

Algorithm 1: Train a neural network and compute the accuracy for painting classification

Input: initialize(net)

Output: Test Acc.

1 for epoch = 1,…,K do
2  for batch = 1,2,…,#images/b do
3   images ← uniformly random sample b images;
4   x, y ← preprocess(images);
5   z ← forward(net, x);
6   ℓ ← loss(z, y);
7   grad ← backward(ℓ);
8   update(net, grad);
9  end
10  Acc(epoch) ← Eval(net, test);
11 end
12 Acc = Max(Acc(1),…,Acc(K))
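Algorithm 1 can be written as a framework-agnostic Python skeleton. Here `preprocess`, `forward`, `loss`, `backward`, `update`, and `eval_model` are placeholder callables standing in for the real network operations, not code from the paper.

```python
import random

def train(net, train_images, test_set, K, b,
          preprocess, forward, loss, backward, update, eval_model):
    """Skeleton of Algorithm 1: K epochs of mini-batch training,
    evaluating after every epoch and reporting the best accuracy."""
    accuracies = []
    n_batches = len(train_images) // b
    for epoch in range(K):
        for _ in range(n_batches):
            images = random.sample(train_images, b)   # uniformly sample b images
            x, y = preprocess(images)
            z = forward(net, x)
            l = loss(z, y)
            grad = backward(l)
            update(net, grad)
        accuracies.append(eval_model(net, test_set))  # per-epoch test accuracy
    return max(accuracies)                            # maximum accuracy over K epochs
```

In the actual experiments, the update step is SGD with momentum 0.9 and weight decay 0.0001, as stated above.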

For the test data, we first resized the images to 256 * 256 and cropped them at the center to 224 * 224 (except for EfficientNet-B3, for which we first resized to 331 * 331 and then center-cropped to 300 * 300). Then, we converted the PIL image to a tensor and normalized it using the mean and standard deviation.

For the learning rate, when training on Painting-91, we chose 0.1 as the initial learning rate for a batch size of 64. When the batch size was increased to a larger value b, we scaled the initial learning rate to 0.1 * b/64. For example, for the larger WikiArt and MultitaskPainting100K datasets, to improve the training speed without sacrificing model accuracy, we increased the batch size to 128, which means that the initial learning rate was set to 0.2.

To save time and memory, we used mixed-precision training in this experiment. We trained the models on a Titan RTX, which boasts 16.3 Tflops of single-precision and 130 Tflops of tensor performance (FP16 w/ FP32 accumulation). When using FP16 to calculate the weights, one major problem is that the parameter values may fall out of range, because the dynamic range of FP16 is narrower than that of FP32, which interferes with the training process. To solve this problem, we followed [37]: all weights, activations, and gradients are stored in FP16, together with an FP32 master copy of the weights; FP16 is used to calculate the gradients, and FP32 is used to update the weights.

To adjust the learning rate, we started with a warmup period that lasted for the first several epochs and then used cosine learning rate decay. The learning rate ηt at epoch t was computed as shown in (2):

ηt = (t / t0) · η,  if t ≤ t0
ηt = (1/2) · (1 + cos(π(t − t0)/(T − t0))) · η,  if t > t0  (2)

where η is the initial learning rate, t0 is the number of warmup epochs, and T is the total number of epochs; our experiments used 160 epochs. This type of scheduling is called "cosine decay" [38].
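The warmup-plus-cosine schedule just described can be sketched in a few lines of stdlib Python, under the assumption of linear warmup over the first t0 epochs (the function name is ours):

```python
import math

def cosine_lr(t, eta, t0, T):
    """Learning rate at epoch t: linear warmup from 0 to eta over the first
    t0 epochs, then cosine decay from eta down to 0 over the remaining
    T - t0 epochs."""
    if t <= t0:
        return eta * t / t0
    return 0.5 * (1 + math.cos(math.pi * (t - t0) / (T - t0))) * eta
```

For example, with the WikiArt setting (eta = 0.2), the rate reaches 0.2 at the end of warmup and decays smoothly toward 0 at epoch T = 160.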
After the cosine decay, we trained the models for several additional epochs (10 in our experiments) using the learning rate ηcool, for a total of 170 (K) epochs. The classifier is a fully connected layer, and its output is used to compute a classification loss using a cross-entropy loss function. This loss can be calculated as shown in (3):

loss(x, class) = −log( exp(x[class]) / Σj exp(x[j]) ) = −x[class] + log( Σj exp(x[j]) )  (3)

where x and class are the output of the classifier and the target label of the painting being learned, respectively. We ran our experiments using PyTorch 1.5.0 [39] on Ubuntu 18.04 with a Titan RTX and an Intel i9 10900K. All of the pretrained models we used are from the PyTorch Image Models (timm) library [40].
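The cross-entropy loss used here — the negative log-softmax of the raw classifier scores x at the target class — can be computed with a small stdlib-only sketch; the max shift (log-sum-exp trick) is for numerical stability:

```python
import math

def cross_entropy(x, target):
    """Cross-entropy of raw classifier scores x against an integer target:
    -log(exp(x[target]) / sum_j exp(x[j])), computed with a max shift so
    that the exponentials cannot overflow."""
    m = max(x)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in x))
    return log_sum_exp - x[target]
```

For uniform scores over C classes this reduces to log(C), and it grows as the score of the target class falls below the others.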
