Big-picture issues
You are confusing the probability-modeling aspect of your problem with its decision-threshold aspect.
Accuracy, sensitivity, and specificity depend on a particular choice of a probability cutoff along your receiver operating characteristic (ROC) curve. That choice is often hidden within the software, which typically defaults to a predicted-probability cutoff of 0.5. Values of accuracy, sensitivity, or specificity based on a single cutoff are thus poor measures of model quality. The AUC, the area under the ROC curve, isn't perfect, but as a measure of model quality it at least takes all potential probability cutoffs into account. See the link above for the importance of proper scoring rules like the log-loss or the Brier score, which evaluate the full probability model. My guess is that your model with the highest AUC will be the best overall model for predicting class probabilities, unless it's overfit.
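As a minimal sketch with scikit-learn (the arrays `y_test` and `p_hat` are hypothetical stand-ins for your held-out labels and one model's predicted probabilities), you can put the threshold-free measures next to accuracy at the hidden 0.5 cutoff:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss, brier_score_loss

# Hypothetical held-out labels and predicted probabilities of the positive class
# from one fitted model; substitute your own arrays.
y_test = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
p_hat  = np.array([0.1, 0.3, 0.8, 0.2, 0.4, 0.9, 0.6, 0.1, 0.7, 0.2])

# Threshold-free measures of the probability model itself
print("AUC:        ", roc_auc_score(y_test, p_hat))    # ranking quality over all cutoffs
print("log-loss:   ", log_loss(y_test, p_hat))         # proper scoring rule
print("Brier score:", brier_score_loss(y_test, p_hat)) # proper scoring rule (mean squared error of probabilities)

# Contrast: accuracy at the hidden default cutoff of 0.5
print("accuracy @0.5:", np.mean((p_hat >= 0.5) == y_test))
```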
For a decision threshold based on estimated probability, the default choice of 0.5 makes an implicit assumption that the costs of false-positive and false-negative class assignments are identical. Is that the case for your application? If not, then you shouldn't be using a cutoff of p = 0.5. You should be using a different probability cutoff appropriate to the relative misclassification costs. See this answer, this page, and their links for ways to incorporate cost estimates into your choice of a decision threshold.
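As a sketch of that calculation, with purely illustrative costs (here a false negative is assumed to be five times as costly as a false positive; substitute your own estimates), the standard decision-theoretic cutoff is cost_FP / (cost_FP + cost_FN):

```python
import numpy as np

# Hypothetical relative misclassification costs; replace with your own estimates.
cost_fp = 1.0   # cost of a false positive
cost_fn = 5.0   # cost of a false negative (assumed 5x as costly)

# Flag a case as positive when the expected cost of calling it negative
# (p * cost_fn) exceeds the expected cost of calling it positive ((1 - p) * cost_fp),
# i.e. when p >= cost_fp / (cost_fp + cost_fn).
threshold = cost_fp / (cost_fp + cost_fn)
print("cost-based probability cutoff:", threshold)    # ~0.167 instead of 0.5

p_hat = np.array([0.10, 0.25, 0.60, 0.05, 0.40])      # predicted probabilities
decisions = (p_hat >= threshold).astype(int)
print(decisions)
```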
I would recommend focusing on the models with high AUC values and seeing how well they perform when you take your misclassification costs into account. That said, you might well be close to "as good as it gets" based on the information that you have. For example, an AUC of 0.67 means that if you randomly choose one member of each class, your class-membership probability predictions will be in the wrong order for that pair about 1/3 of the time. Is that good enough for your application?
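If it helps to see that interpretation concretely, here is a small simulation on synthetic data (chosen only to give an AUC in roughly that range) comparing the AUC to the fraction of correctly ordered positive/negative pairs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels and predicted probabilities, only to illustrate the interpretation.
y = rng.binomial(1, 0.12, size=20000)
p = np.clip(0.12 + 0.15 * (y - 0.12) + rng.normal(0, 0.2, size=y.size), 0.001, 0.999)

auc = roc_auc_score(y, p)

# Draw random (positive, negative) pairs and check how often they are ordered correctly.
pos, neg = p[y == 1], p[y == 0]
pairs_pos = rng.choice(pos, 100000)
pairs_neg = rng.choice(neg, 100000)
concordant = np.mean(pairs_pos > pairs_neg) + 0.5 * np.mean(pairs_pos == pairs_neg)

print("AUC:", round(auc, 3), " pairwise concordance:", round(concordant, 3))
# The two numbers agree: an AUC of 0.67 would mean about 1/3 of such pairs
# are ranked in the wrong order.
```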
Other things to try
You've already tried many of the modeling approaches used for this type of problem, so you might be stuck with being "as good as it gets" for your data. Your data aren't that badly unbalanced at 12%/88%, and you seem to have over 6000 cases in the minority class, giving you a lot of room to include available predictors. See this thread among others on this site for further discussion of imbalance; changes in data sampling are unlikely to help.
One thing that you might be able to take more advantage of, given the size of your data set, is interactions among the predictors in your models. I don't know how many predictor interactions you have allowed for in your models to date. A common rule of thumb is about one estimated parameter per 15 members of the minority class, so with over 6000 such cases you might be able to include up to about 400 predictors and predictor combinations/interactions in your model without overfitting, leaving plenty of room for interaction terms.
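As one sketch of how that could look with a standard logistic regression, assuming a hypothetical predictor matrix `X` and outcome `y` (here 20 numeric predictors, so all pairwise interactions still stay well under a 400-parameter budget):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical data: replace X, y with your predictors and 12%/88% outcome.
rng = np.random.default_rng(1)
X = rng.normal(size=(50000, 20))
y = rng.binomial(1, 0.12, size=50000)

# PolynomialFeatures with interaction_only=True adds all pairwise interaction
# terms: 20 main effects + 190 pairwise interactions = 210 parameters,
# well under a ~400-parameter budget for >6000 minority-class cases.
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegression(max_iter=5000),
)
model.fit(X, y)
print("number of fitted coefficients:", model[-1].coef_.size)
```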
Gradient boosting can allow for a large number of potential interactions, with the maximum interaction order set by the depth you allow for the trees. Gradient boosting thus can let the data determine which predictor interactions are most important, a potential advantage over defining all the candidate interactions in advance for a standard logistic regression model. Use a slow learning rate to minimize the chance of overfitting the data. Basing the pseudo-residuals on a log-loss criterion makes the process follow the same loss criterion as logistic regression. That avoids the dangers discussed above of the implicit probability cutoff in accuracy-based cost criteria that might otherwise be used in boosting (and that probably were used in your neural net as well). See this answer and its links for more details.
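Here is a minimal sketch with scikit-learn's HistGradientBoostingClassifier, on synthetic stand-in data, showing the pieces mentioned above: a slow learning rate, limited tree depth to cap the interaction order, a log-loss criterion, and evaluation with proper scoring rules rather than accuracy:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, log_loss

# Hypothetical stand-in data with an interaction and ~12% positives;
# replace with your own predictors and outcome.
rng = np.random.default_rng(2)
X = rng.normal(size=(50000, 20))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] * X[:, 1] + 0.5 * X[:, 2] - 2))))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Slow learning rate plus early stopping to limit overfitting; max_depth caps
# how high-order the data-driven interactions can be (depth 3 allows up to
# 3-way splits per tree). The default classification loss here is the log-loss,
# so boosting follows the same criterion as logistic regression rather than an
# accuracy-style criterion with a hidden 0.5 cutoff.
gbm = HistGradientBoostingClassifier(
    learning_rate=0.05,
    max_depth=3,
    max_iter=2000,
    early_stopping=True,
    validation_fraction=0.1,
    random_state=0,
)
gbm.fit(X_tr, y_tr)

p_hat = gbm.predict_proba(X_te)[:, 1]
print("AUC:     ", roc_auc_score(y_te, p_hat))
print("log-loss:", log_loss(y_te, p_hat))
```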