I'm going to stress that, in the absence of a well-defined analysis plan or protocol for handling such values, the answer is: you leave them in. You report the unadulterated results as the primary analysis: the one in which the p-value is viewed as answering the main question. If it is necessary and instructive to discuss results from excluding high-leverage points, that is a secondary or post-hoc analysis and carries considerably less evidential weight; it is more of a "hypothesis-generating" result than a "hypothesis-confirming" one.
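As a concrete illustration of that workflow, here is a minimal sketch in Python with statsmodels. The simulated data, the conventional 2p/n leverage cutoff, and all names are purely illustrative assumptions, not a prescription:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 1.0 + 0.5 * x + rng.normal(size=100)
X = sm.add_constant(x)

# Primary analysis: all observations, reported as the main result.
primary = sm.OLS(y, X).fit()

# Secondary (post-hoc, hypothesis-generating) analysis: refit without
# high-leverage points, here flagged by a conventional 2p/n cutoff.
leverage = primary.get_influence().hat_matrix_diag
keep = leverage <= 2 * X.shape[1] / X.shape[0]
secondary = sm.OLS(y[keep], X[keep]).fit()

print("primary slope:  ", primary.params[1], "p =", primary.pvalues[1])
print("secondary slope:", secondary.params[1], "p =", secondary.pvalues[1])
```

Both fits are reported, but only the first carries the p-value that answers the stated question.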
The reason for not excluding such values is that doing so compromises the interpretation of the results and the reproducibility of your analysis. When you make ad hoc decisions about which values are and are not worth keeping, you cannot trust that another statistician would make the same ones. Throwing observations out is bad science. In doing so, you quietly revise your hypothesis (because you have defined your population differently than originally stated), and the new "population" is paradoxically defined by what you happened to observe. The p-value then no longer means what people think it means and is, in a way, a falsified result.
This raises the question of what role diagnostic statistics should play. It may sound like I'm advocating never using them; it's quite the opposite. Running diagnostics is valuable insofar as it helps you understand the assumptions inherent in the model. As Box said, "All models are wrong, but some are useful." Even when the true trend is non-linear, a linear approximation is sometimes close enough to provide rules of thumb worth using to guide decision making. Take the relationship between lead exposure at birth and adult IQ. Very few children, if any, have zero exposure to lead; virtually all of us have been exposed in such a way that our IQ has been significantly diminished from what it could otherwise have been. In any sample of such individuals, you would almost certainly find one or more highly influential observations with low lead exposure and high IQ. Think about the difference in the hypotheses that are ultimately tested when such individuals are excluded from, versus kept in, the primary analysis.
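To make that scenario concrete, here is a sketch with simulated, entirely hypothetical numbers: one rare low-exposure, high-IQ individual dominates the influence diagnostics, and dropping that point quietly changes which population the slope describes:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
lead = rng.uniform(2.0, 10.0, size=60)             # most have non-trivial exposure
iq = 110 - 2.0 * lead + rng.normal(scale=5.0, size=60)

# One rare near-zero-exposure individual with a high IQ.
lead = np.append(lead, 0.1)
iq = np.append(iq, 118.0)

X = sm.add_constant(lead)
fit = sm.OLS(iq, X).fit()
cooks_d = fit.get_influence().cooks_distance[0]
print("max Cook's distance:", cooks_d.max(), "at obs", cooks_d.argmax())

# Dropping that point answers a different question: the slope now
# describes only the moderately-to-highly exposed subpopulation.
mask = np.arange(len(lead)) != cooks_d.argmax()
fit_dropped = sm.OLS(iq[mask], X[mask]).fit()
print("slope with point:", fit.params[1], " without:", fit_dropped.params[1])
```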
When diagnostics indicate problematic observations, you need to address a number of questions:
Are there unknown sources of variation or covariation within subgroups, e.g., correlation between household members, or a wave of lab assays run by a contracted lab with poorly calibrated equipment? (See the first sketch after this list.)
Does the mean model hold approximately? Would the hypothesis be tested more accurately with a more flexible modeling approach, such as smoothing splines or even higher-order polynomial terms? (See the second sketch below.)
Is variance weighting sufficiently accounted for? In least-squares modeling, this means standard errors are calculated from homoscedastic data, or else robust standard errors are used. GLMs automatically reweight the data according to a probability model for the outcome; in that case, is the probability model correct? (See the third sketch below.)
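For the first question, one common check is to compare conventional standard errors with cluster-robust ones; if they diverge, within-subgroup correlation is likely being ignored. A minimal sketch, assuming simulated household clusters (all numbers illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_households, per_hh = 40, 4
hh = np.repeat(np.arange(n_households), per_hh)
hh_effect = rng.normal(scale=2.0, size=n_households)[hh]  # shared household noise
x = rng.normal(size=n_households * per_hh)
y = 1.0 + 0.5 * x + hh_effect + rng.normal(size=n_households * per_hh)

X = sm.add_constant(x)
naive = sm.OLS(y, X).fit()                                 # ignores clustering
clustered = sm.OLS(y, X).fit(cov_type="cluster",
                             cov_kwds={"groups": hh})      # cluster-robust SEs

print("naive SE:    ", naive.bse[1])
print("clustered SE:", clustered.bse[1])
```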
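For the second question, one simple check (splines are often preferable in practice) is to nest the linear fit inside a low-order polynomial and test whether the extra flexibility is needed. A sketch under a mildly non-linear simulated truth:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)   # mildly non-linear truth

X_lin = sm.add_constant(x)
X_cub = sm.add_constant(np.column_stack([x, x**2, x**3]))

fit_lin = sm.OLS(y, X_lin).fit()
fit_cub = sm.OLS(y, X_cub).fit()

# F-test of the linear model against the cubic alternative.
f_stat, p_value, df_diff = fit_cub.compare_f_test(fit_lin)
print("F =", f_stat, "p =", p_value)
```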
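For the third question, a quick sensitivity check in the least-squares case is to refit with heteroscedasticity-consistent standard errors and see whether inference changes. A sketch, again with simulated data where the variance grows with x:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 5, size=200)
y = 1.0 + 0.5 * x + rng.normal(scale=0.5 + 0.5 * x, size=200)  # variance grows with x

X = sm.add_constant(x)
homosk = sm.OLS(y, X).fit()                  # assumes constant variance
robust = sm.OLS(y, X).fit(cov_type="HC3")    # heteroscedasticity-consistent SEs

print("conventional SE:", homosk.bse[1])
print("HC3 robust SE:  ", robust.bse[1])
```

A large gap between the two suggests the homoscedasticity assumption, not the data, is the problem.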