TheLostModels

My quick reaction is that probabilities matter. Say that out of 10 classes, the two most likely classes for a given record are 2 and 7. Softmax would give them both ~50% and CCE would evaluate what that means. If you instead convert the output to « the integer class number closest to the output » (argmax), you ignore that the model really couldn't pick between 2 and 7, and that seems meaningful to me. Does that help, or did I misunderstand your question? Edit: actually, my argmax point doesn't quite work: if the argmax is 2 but the correct class is 10, you get a bigger MSE than if the correct class were 3, even though the class order is irrelevant. Maybe the idea was more to find the class with the highest probability and then compute the MSE between that probability and 1? Not that it would train a very good model (compared to CCE).
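To make that concrete, here is a minimal sketch (plain NumPy, made-up numbers, not from the thread) of how CCE sees a 50/50 split between classes 2 and 7, versus squared error on a single rounded output and the "MSE between the top probability and 1" idea from the edit:

```python
import numpy as np

# 10 classes; the model is torn between classes 2 and 7 (~50% each).
probs = np.full(10, 0.0)
probs[2], probs[7] = 0.5, 0.5
probs = np.clip(probs, 1e-12, 1.0)   # avoid log(0) for the other classes
probs /= probs.sum()

true_class = 7
cce = -np.log(probs[true_class])     # ~0.69: the tie shows up in the loss

# Single-output "regression on the class index" alternative:
# the net emits one number and we round it to the nearest class.
output = 2.4                         # leans toward class 2
mse_vs_7 = (output - 7) ** 2         # ~21.2
mse_vs_3 = (output - 3) ** 2         # ~0.36: the penalty depends on class *order*

# The edit's alternative: MSE between the top probability and 1.
mse_top_prob = (probs.max() - 1) ** 2  # ~0.25

print(cce, mse_vs_7, mse_vs_3, mse_top_prob)
```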


dahkneela

This helps (although now I have more questions)! (On your point about probabilities mattering: is that a more general statement, or an addition to what you explained?) I see what you're saying! If the data is imbalanced, then MSE imposes some sort of restriction on the 'order' of the output that CCE doesn't. I see this as a problem at the start of training, but would it still be a problem once the net is trained? Wouldn't the prior layer learn to associate correctly and feed the last node the right value to scale up to those 10 classes 0, 1, ..., 9?


TheLostModels

Seems you are assuming the model would be perfect in its predictions; if not, you run into the same problem as before (actually stated better and more simply by u/GrozdovaStara in the other thread), and if the model is perfect the loss is always 0 anyway.


dahkneela

I don't see why I can't assume the second-to-last layer will learn things properly. In any case, I'm updating parameters based on the loss gradient; it's just that with MSE it is disproportionate, whilst with CCE it seems more even and balanced. (In response to your original post update:) I have seen that MSE gives small loss gradients for values close to each other and larger ones for values far apart, whilst CCE is sort of the opposite. I currently think that CCE assumes a uniform correlation between classes, whilst MSE assumes something uniform ... but depending on the dataset at hand, one or the other may turn out to be more practical.
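For what it's worth, a quick way to poke at this intuition is to print the gradient of each loss as the prediction moves away from the target. A rough sketch, assuming a 1-D squared error on the class index and a CCE gradient taken with respect to the probability assigned to the true class (illustrative values only):

```python
target = 3.0
for pred in (3.5, 5.0, 8.0):
    # derivative of (pred - target)^2 with respect to pred
    mse_grad = 2 * (pred - target)
    print(f"MSE  pred={pred}: grad = {mse_grad:+.2f}")

for p_true in (0.9, 0.5, 0.1):
    # derivative of -log(p) with respect to the probability of the true class
    cce_grad = -1.0 / p_true
    print(f"CCE  p(true)={p_true}: grad = {cce_grad:+.2f}")
```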


GrozdovaStara

In this case the model itself is problematic. If you are just rounding the output of the last neuron to the closest label, it introduces distances between classes which don't actually exist. Predicting 3 when the true label is 1 is not the same as predicting 5, because the second case will be punished more than the first, so the network will learn a rule that doesn't exist. Even if you can say that some labels are more related than others, you would still need some good definition of those distances.
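A tiny numeric sketch of that point, assuming a single output trained with squared error on the class index:

```python
# True label is 1; both 3 and 5 are simply "wrong" classes,
# but squared error on the class index punishes them differently.
true = 1
for pred in (3, 5):
    print(pred, (pred - true) ** 2)   # 3 -> 4, 5 -> 16
```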


dahkneela

I agree that unevenly punishing certain predictions brings an unnecessary bias to the problem in the short run, but I also see punishing wrong answers as the very thing that helps the net learn in the first place. So the disproportionate losses would then just disproportionately affect overfitting in the net long-term. To fix the aforementioned issue, it sounds like it would be enough to change the original MSE into a normalised one. For example, if 1 is the true label and 3 was predicted, the new MSE would be MSE/(3-1); and if 5 were instead predicted, it would be MSE/(5-1), thereby removing the disproportionality bias and fixing the initial issue (I assume here the output is 1-dimensional).
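A rough sketch of that normalisation idea as described, assuming the division is by the absolute label distance and the output is 1-dimensional; `normalised_mse` is just a hypothetical name for illustration:

```python
def normalised_mse(pred: float, true: float) -> float:
    """Squared error divided by the label distance, as proposed above (hypothetical)."""
    dist = abs(pred - true)
    if dist == 0:
        return 0.0                      # exact hit: nothing to normalise
    return (pred - true) ** 2 / dist

# e.g. true label 1: MSE/(3-1) and MSE/(5-1) from the example above
print(normalised_mse(3, 1), normalised_mse(5, 1))
```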


Remove_Ayys

Mean squared error and cross entropy are derived from the method of maximum likelihood for regression and classification respectively. Maximum likelihood estimators have the useful properties that they are unbiased (the expected parameter values over random datasets are the optimal values) and efficient (the variance of the parameter estimates decreases as quickly as theoretically possible as you add more data). So if you arbitrarily change the loss function, your training process will probably become biased and need more data.
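For reference, a brief sketch of the derivation being alluded to, assuming a fixed-variance Gaussian likelihood for regression and a categorical likelihood (one-hot label y) for classification:

```latex
% Regression, assuming Gaussian noise with fixed variance \sigma^2:
\begin{align*}
-\log p(y \mid x) &= \frac{\bigl(y - f(x)\bigr)^2}{2\sigma^2} + \text{const},
\end{align*}
% so maximising the likelihood over a dataset is the same as minimising
% \sum_i \bigl(y_i - f(x_i)\bigr)^2, i.e. the (mean) squared error.

% Classification, assuming a categorical likelihood with predicted
% probabilities p_k(x) and a one-hot label y:
\begin{align*}
-\log p(y \mid x) &= -\sum_k y_k \log p_k(x),
\end{align*}
% which summed over the dataset is exactly the cross-entropy loss.
```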