
I am trying to create a binary classifier that will assess whether a photograph is aesthetically pleasing or not. However, despite various experiments, I am struggling to create something that even reaches 50% accuracy.

My dataset is downloaded from Flickr; I am using 240x180 color images. For classification, I use a score defined as follows:

score = log((stars + 1) / (views + 1), 2)

where stars is the number of favorites for a given photo and views is the number of its views. That gives me a more-or-less normal distribution of the score (bottom-right graph below):

[Dataset statistics; the bottom-right graph shows the distribution of the score.]

The idea behind this is to get a metric for what fraction of viewers liked the photo. I then classify an image as 'good' or 'bad' depending on whether its score is above or below the mean of this metric across the whole dataset.
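
Concretely, the scoring and labelling step looks roughly like this (a minimal sketch; the DataFrame and column names are illustrative, not my exact pipeline):

```python
import numpy as np
import pandas as pd

# Hypothetical metadata table with one row per photo; the column names
# 'stars' and 'views' are illustrative, not taken from my actual pipeline.
photos = pd.DataFrame({
    "stars": [3, 0, 57, 12],
    "views": [450, 120, 980, 15000],
})

# score = log2((stars + 1) / (views + 1))
photos["score"] = np.log2((photos["stars"] + 1) / (photos["views"] + 1))

# Label a photo 'good' (1) if its score is above the dataset mean, else 'bad' (0).
photos["label"] = (photos["score"] > photos["score"].mean()).astype(int)
```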

To train the classifier, I am using my own TensorFlow implementation of a DCNN. The network architecture is based on configuration A from this paper, which has been shown to work on a similar problem. I am using an MSE loss function and the Adam optimizer. So far I have downloaded over 1 million photos from Flickr. After selecting images with the proper resolution and more than 100 views, I am able to train my classifier on a training set containing ~60,000 images.

However, my classifier achieves very bad results. The model is not converging: accuracy on both the training and cross-validation sets oscillates around 55% or 45% (explanation below), and the error is not decreasing.

While debugging the code, I noticed that on the first iteration my model produces reasonable predictions for the given examples, like 0.9 (the output layer is a single sigmoid neuron which is supposed to give the probability that a photo is aesthetically pleasing). However, after only one iteration, my classifier constantly predicts ones or zeros for every example. Because my dataset is not split perfectly evenly between the two classes, this results in an accuracy like 55% or 45%, depending on the value that is always predicted at the output layer.

Things I tried to improve the classifier:

  • changing the GradientDescentOptimizer to AdamOptimizer,
  • training the model on a bigger training set (I started with 12k images, now I have 60k and am constantly downloading more),
  • implementing weight decay,
  • trying two different DCNN architectures,
  • changing the loss function to cross entropy or mean squared error (a simplified sketch of both variants is shown below).
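
For reference, here is a simplified sketch of the two loss variants I have been switching between (written against the TF 1.x API; the layer shapes and variable names are illustrative, this is not my exact code):

```python
import tensorflow as tf

# 'features' stands in for the flattened output of the convolutional stack;
# 'labels' holds the 0/1 aesthetic labels. Names and sizes are illustrative.
features = tf.placeholder(tf.float32, shape=[None, 128])
labels = tf.placeholder(tf.float32, shape=[None, 1])

# Final fully connected layer producing a single logit per example.
w = tf.Variable(tf.truncated_normal([128, 1], stddev=0.1))
b = tf.Variable(tf.zeros([1]))
logits = tf.matmul(features, w) + b

# Variant 1: MSE on the sigmoid output (what I started with).
predictions = tf.sigmoid(logits)
mse_loss = tf.reduce_mean(tf.square(predictions - labels))

# Variant 2: cross entropy computed directly from the logits, which is
# numerically more stable than applying the sigmoid first and taking the log.
xent_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))

train_op = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(xent_loss)
```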

Nothing improved the quality of the predictions. From the very beginning I expected quite poor performance, due to the nature of the problem, but it seems that something is completely wrong. I think I am missing something obvious, despite investigating the problem for the last two weeks. If anyone could provide some guidance on where to look for flaws, I would be extremely grateful. The code can be viewed here.

1 Answer


I looked very briefly at the code, really only to see that you were using convolution and pooling. I think part of your problem may lie there. You can run a convolution kernel and pool it down to activate features like the presence of the shape of a cat's eye, but the same doesn't hold true for beauty in art in general. I've made a lot of art with Blender and won many of the weekend contests (like 15), so I know a thing or two about what makes a (photoreal) image appealing, and no kernel will find it directly. I'd recommend engineering some features yourself based on the image's overall color and greyscale levels. An image with fully saturated colors of all hues will tend to turn viewers off. An image that is overly blown out or badly underexposed is bad. An even tonal range and a limited color scheme are good. I suppose my advice in a nutshell is: stop looking at the trees and examine the forest. Also, puppies and baby chickens are gold!
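
For example, a rough Pillow/NumPy sketch of the kind of global statistics meant here (the exact features and thresholds are purely illustrative):

```python
import numpy as np
from PIL import Image

def global_image_features(path):
    """Hand-engineered global features: exposure and saturation statistics."""
    img = Image.open(path).convert("RGB")
    rgb = np.asarray(img, dtype=np.float32) / 255.0

    # Greyscale/exposure statistics: blown-out or underexposed images
    # show up as extreme means and a narrow value range.
    grey = rgb.mean(axis=2)
    exposure_mean = grey.mean()
    exposure_std = grey.std()
    clipped_fraction = np.mean((grey < 0.02) | (grey > 0.98))

    # Rough saturation proxy: spread between max and min channel per pixel.
    saturation = (rgb.max(axis=2) - rgb.min(axis=2)).mean()

    return np.array([exposure_mean, exposure_std, clipped_fraction, saturation])
```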

photox
  • Thanks for your insight. I do agree that this is not a trivial challenge, but it seems that it is possible with DCNNs – see here: https://devblogs.nvidia.com/parallelforall/understanding-aesthetics-deep-learning/ I am just wondering whether I made a mistake in my implementation, whether this is an issue with data quality (obviously manually curated data is better than mine, from Flickr), or whether a different loss function makes a difference. – mc.suchecki Feb 10 '17 at 12:44
  • Maybe it is possible to extract beauty from filters alone. The write-up does not, however, include any results or examples. And of course the entire subject is subjective; a group of other 'experts' may well rate completely differently. I might argue that a large number of independent amateur ratings (like yours) is more meaningful. I would recommend this: hand-feed your network a tiny training set, like 4 images – two images smooth and red (perfect score), and the other two noisy and blue (score 0). Can your network get 100% on these? If not, you have issues with your input (a sketch of such a tiny set is included after these comments). – photox Feb 10 '17 at 14:00
  • After reading the other link to the paper, I see that their final layer uses softmax (section 2.1); you might try that, as a sigmoid would definitely behave as you describe. Also, why not just use images with 100+ views, calculate the average number of stars, and then use the global mean for your classification? – photox Feb 11 '17 at 10:34
  • Firstly, thanks for your interest! It is nice to exchange thoughts. Considering the subjective nature of the problem, I completely agree. That exact fact makes the problem interesting for me. My goal here is to find something like a common ground for aesthetic tastes in photography. Thanks for your experiment proposal with simple cases. I will try that later and come back with the results. Right now I am trying to train a pretrained Caffe model on my data, to check whether the problem lies in poor data quality. – mc.suchecki Feb 11 '17 at 13:54
  • Regarding the softmax activation – the difference comes from the fact that I am doing binary classification and they have 1000 classes. It seems like it is supposed to be equivalent in binary classification, see [here](http://stats.stackexchange.com/questions/207049/neural-network-for-binary-classification-use-1-or-2-output-neurons). However, I will try the softmax as well, just out of curiosity. Regarding your proposed alternative metric, I am going to try that too. However, as I need to normalize the star count, I am now in the process of downloading the missing upload dates for the images. – mc.suchecki Feb 11 '17 at 14:05
  • I'm really just spitballing, but you might also consider a linear activation on the final layer: forget the classification, train on the score directly, and then predict the score outright. As it is, you're forcing the model to make a binary classification based on what may have come down to a few subjective votes. And don't forget to do a sanity check using a tiny dataset where the answers are obvious. Best of luck; hopefully some more users will chime in! Another thought: polarize the data – use the very best (highest rated) and the very worst, and exclude middle-rated images. – photox Feb 11 '17 at 14:38
  • Thank you, I really like your ideas, I will give them a try next week! – mc.suchecki Feb 11 '17 at 18:35
  • Best of luck. I was thinking about it, and it seems like the bulk of your data are close to the mean, and imposing a class on them is a bit artificial. Regardless of your scoring methodology, some line must be drawn to separate good vs. bad. And with so many images close to this line, the model may be flopping around, having gotten very close but being told it's 100% wrong. Imagine, in the classic binary cats/dogs classification, if we took 30% of the data and switched labels – the model would start to thrash. I really like the project and hope it works! – photox Feb 11 '17 at 23:02
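
A minimal sketch of the tiny sanity-check set suggested in the comments above (pure NumPy; the 240x180 size matches the question's inputs, everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth_red(h=180, w=240):
    """Uniform red image -- should be trivially classified as 'good' (label 1)."""
    img = np.zeros((h, w, 3), dtype=np.float32)
    img[..., 0] = 1.0
    return img

def noisy_blue(h=180, w=240):
    """Blue-dominated image with heavy noise -- should be trivially 'bad' (label 0)."""
    img = rng.uniform(0.0, 1.0, size=(h, w, 3)).astype(np.float32)
    img[..., 2] = 1.0
    return img

# Four-image training set: if the network cannot reach 100% accuracy here,
# the problem is in the input pipeline or the training loop, not the data.
x_tiny = np.stack([smooth_red(), smooth_red(), noisy_blue(), noisy_blue()])
y_tiny = np.array([1, 1, 0, 0], dtype=np.float32)
```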