There are two parts to this answer:
1) As a direct answer to your question, what you are suggesting is OK if you have classes that you don't mind merging. For instance, if your classes are 'dog', 'cat', 'car' and 'red car', depending on the context it might be OK to treat 'red car' as simply 'car'. I very much doubt you will find literature on this, as it is very problem specific. So what you really need to answer is "do I care if I can't tell these categories apart?"
2) You can reduce the number of weights in your NN by relabeling your response variable. For instance, suppose you had 16 categories. In that case, instead of 16 output neurons, you could recode the classes as 4-digit binary numbers and have your NN try to predict those digits. For example, you would code class "9" as "1001". In general, K classes need ceil(log2(K)) output neurons, so for 39 classes you'd need 6.
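
To make the binary recoding in point 2 concrete, here is a minimal sketch in plain Python. The helper names (`bits_needed`, `encode_label`, `decode_outputs`) and the 0.5 threshold are my own assumptions for illustration, not part of any library, and the sketch assumes class labels are integers starting at 0.

```python
def bits_needed(num_classes: int) -> int:
    """Smallest number of binary digits that gives each class a distinct code."""
    # (num_classes - 1).bit_length() equals ceil(log2(num_classes)) for num_classes >= 2
    return max(1, (num_classes - 1).bit_length())


def encode_label(label: int, n_bits: int) -> list[int]:
    """Turn an integer class label into a list of bits, most significant first."""
    return [(label >> i) & 1 for i in reversed(range(n_bits))]


def decode_outputs(outputs: list[float], threshold: float = 0.5) -> int:
    """Threshold each output neuron (assumed to lie in [0, 1]) and read the bits as an integer."""
    label = 0
    for value in outputs:
        label = (label << 1) | (1 if value >= threshold else 0)
    return label


if __name__ == "__main__":
    n_bits = bits_needed(16)
    target = encode_label(9, n_bits)  # training target for class 9
    print(n_bits, target)
    # Hypothetical network outputs for one example, decoded back to a class label
    print(decode_outputs([0.9, 0.2, 0.1, 0.7]))
```

With 4 bits, class 9 becomes the target [1, 0, 0, 1], and each output neuron would typically use a sigmoid so it can be thresholded independently. One thing to watch: when the number of classes is not a power of two (39 classes in 6 bits, say), some bit patterns correspond to no class at all, so the decoded label needs a range check.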