A few thoughts:
First, even if you ultimately need to evaluate the accuracy of your model, training and testing it on accuracy is probably not the best way to proceed. This issue is discussed extensively on this site, with this page being a good place to start. That probably explains why your cross-entropy losses (log losses) agree much better between your test and training sets than your accuracy assessments do. Stick with cross-entropy.
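For concreteness, here is a minimal sketch of that train/test comparison; the synthetic data and logistic model are just stand-ins for your own 30-class problem and deep learning model:

```python
# Sketch: accuracy vs. cross-entropy (log loss) on train and test sets.
# The data and model here are placeholders, not your actual setup.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=40, n_informative=20,
                           n_classes=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

model = LogisticRegression(max_iter=2000).fit(X_train, y_train)

# Accuracy collapses the predicted probabilities into all-or-none class calls.
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))

# Log loss scores the predicted probabilities themselves (a proper scoring
# rule), and typically agrees better between train and test.
print("train log loss:", log_loss(y_train, model.predict_proba(X_train)))
print("test log loss: ", log_loss(y_test, model.predict_proba(X_test)))
```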
Second, as you seem to have further processing beyond this initial modeling, consider doing that processing in a way that carries the predicted class probabilities through to the end, rather than depending on an early all-or-none assignment of cases to one of your 30 classes. That could lead to more reliable final results.
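A hedged illustration of the difference, where `proba` stands in for your model's predicted probabilities and `class_cost` for some hypothetical per-class quantity used in the downstream processing:

```python
# Sketch: hard assignment vs. carrying probabilities through downstream.
import numpy as np

rng = np.random.default_rng(0)
proba = rng.dirichlet(np.ones(30), size=100)  # placeholder predicted probabilities
class_cost = rng.uniform(10, 100, size=30)    # placeholder per-class quantity

# All-or-none: commit each case to its argmax class, then process.
hard_total = class_cost[proba.argmax(axis=1)].sum()

# Probability-weighted: each case contributes to every class in
# proportion to its predicted probability.
soft_total = (proba @ class_cost).sum()

print(f"hard-assignment total:      {hard_total:.1f}")
print(f"probability-weighted total: {soft_total:.1f}")
```

The probability-weighted version lets uncertain cases contribute fractionally to several classes instead of forcing each one to a single winner, so early classification errors don't propagate as all-or-none mistakes.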
Third, you haven't said much about the nature of your "deep learning model." You might need to consider a different type of model, or adjust the learning characteristics of the current one (as with the $\ell_2$ penalization you seem to be considering).
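If you do pursue $\ell_2$ penalization, here is one minimal way it might look in Keras; the layer sizes and the penalty strength of `1e-4` are purely illustrative assumptions, not recommendations for your problem:

```python
# Sketch: L2 (weight decay) penalties on a small classification network.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dense(30, activation="softmax",  # 30 output classes
                 kernel_regularizer=regularizers.l2(1e-4)),
])
# Train on cross-entropy, per the first point above.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

The penalty strength is a hyperparameter worth tuning (e.g., by cross-validation on the cross-entropy loss) rather than fixing in advance.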
Fourth, it's possible that you just don't have enough data of a type that can discriminate among class memberships, particularly for the low-prevalence classes. Even the best attempts at such problems can hit unavoidable barriers.