build model with complicated types of feature variables

Question

I have been asked to build a model to predict the life span of a material based on several features. The features fall into the following categories:

1) The feature variables just have 0 or 1 value

2) The feature variables are ordered variables, such as a={1,2,3,4,5} or b={0, 200, 210, 840,…}

3) The feature variables are continuous

4) The feature variables are not exactly ordered, such as c={42, 43.4, 57, -1, 0, 0,…}

The predicted life span is a numerical value.

I would like some advice on how to handle these mixed types of features. Can multiple regression handle this kind of scenario? Are there any other statistical models that can help?

Ordinal predictors are covered [here](http://stats.stackexchange.com/questions/33413/continuous-dependent-variable-with-ordinal-independent-variable). But what's "not exactly ordered" mean? — Scortchi - Reinstate Monica, Oct 14 '14 at 19:59

score 0 · Accepted Answer · answered Oct 14 '14 at 03:48


A question like this sounds like we may be picking wires in the bomb without understanding what they actually do :). As a side note to the question: multiple regression, GLMs, and similar models would typically be fitted to examine which modeled variables affect an outcome, and perhaps to what degree.

That is not the problem domain here. What you are asking is not so much a predictive question as one of estimation. In particular, you should be conducting a survival analysis with respect to some well-defined time-to-event criterion.
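For readers unfamiliar with the term, the standard quantities a survival analysis estimates can be written down as follows. The symbols ($T$ for the random time to end-of-life, $f$ for its density, $S$ for the survival function, $h$ for the hazard) are textbook notation introduced here for illustration, not something taken from the question.

```latex
S(t) = \Pr(T > t), \qquad
h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}, \qquad
S(t) = \exp\!\left(-\int_0^t h(u)\,du\right)
```

"Life span" then corresponds to a summary of $S(t)$, such as the median time at which $S(t)$ drops to $0.5$.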

In other words, what is the defined end-of-life for the material? The time at which it breaks? Crumbles? Goes poof? Or would we examine something more microscopic, such as material decay, oxidation, etc that would render the risk of the material too great to consider further use? This is the first thing you need to know.

Once your question is refined and definitions provided, you need to understand your data. In order to determine an end-of-life based upon some definition, it is pretty important to have data that contain observations of end-of-life. That is, you need a history.
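To make "a history" concrete, here is a minimal sketch of the Kaplan-Meier estimator, one common nonparametric way to estimate a survival curve from such data. It is only an illustration: the function name `kaplan_meier` and the numbers in `hours` and `failed` are hypothetical, and the sketch assumes each specimen is recorded as a time on test plus a flag saying whether end-of-life was actually observed (specimens still intact when observation stopped are right-censored).

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct failure time.

    durations: time on test for each specimen
    observed:  True if end-of-life was seen, False if the specimen was
               still intact when observation stopped (right-censored)
    """
    failures = Counter(t for t, e in zip(durations, observed) if e)
    leaving = Counter(durations)  # failures plus censorings at each time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = failures.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk specimens that survive t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Everyone who failed or was censored at t leaves the risk set.
        at_risk -= leaving[t]
    return curve


# Hypothetical data: hours on test and whether failure was observed.
hours = [5, 8, 8, 12, 15, 20]
failed = [True, True, False, True, False, True]

curve = kaplan_meier(hours, failed)
for t, s in curve:
    print(f"t={t:>3}  S(t)={s:.3f}")
```

Censored specimens never trigger a drop in the curve, but they do count in the risk set until they leave it, which is why the flag matters as much as the duration.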

If you do not have this, but have well defined criteria for what end-of-life is considered to be, then your problem becomes rather Bayesian - in the sense that you will need to incorporate variables, rules, and even expert opinion to inform what you expect to be an end-of-life outcome without having observed it yet. This sort of problem is also a data gathering venture, such that when the data overwhelm the prior information (i.e. you capture actual observations of 'end-of-life'), the observations themselves inform the survival estimation.
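A toy sketch of what "the data overwhelm the prior" can look like, under assumptions that are mine and not part of the question: lifetimes are taken to be exponentially distributed with an unknown failure rate, and expert opinion is encoded as a Gamma prior on that rate (shape and rate parameters). With that conjugate pair the posterior is again a Gamma, with the number of observed failures added to the shape and the total time on test added to the rate. The function names and all numbers are hypothetical.

```python
def update_gamma_prior(shape, rate, durations, observed):
    """Posterior (shape, rate) of a Gamma prior on an exponential failure rate.

    Censored specimens contribute time on test but no failure.
    """
    n_failures = sum(1 for e in observed if e)
    total_time = sum(durations)
    return shape + n_failures, rate + total_time


def mean_lifetime(shape, rate):
    """Mean of 1/lambda when lambda ~ Gamma(shape, rate); needs shape > 1."""
    if shape <= 1:
        raise ValueError("mean lifetime is undefined for shape <= 1")
    return rate / (shape - 1)


# Hypothetical expert prior: expected lifetime of about 100 hours.
prior = (3.0, 200.0)

# A few observations (the third specimen was still intact at 60 hours).
small = update_gamma_prior(*prior, [40, 55, 60], [True, True, False])

# Many observations: 200 specimens that all failed at 50 hours.
large = update_gamma_prior(*prior, [50] * 200, [True] * 200)

print("prior only:      ", mean_lifetime(*prior))
print("3 specimens:     ", mean_lifetime(*small))
print("200 specimens:   ", mean_lifetime(*large))
```

With three specimens the estimate still sits close to the expert's 100 hours; with 200 observed failures it is pulled almost entirely to the 50 hours seen in the data. A real material would likely need a more flexible lifetime model (Weibull is a common choice), which no longer has this simple closed-form update.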


pmiddlet


Some shrewd observations here; it's often useful to step back from the question as posed & to examine the context. But I'm puzzled by your apparent contrast of prediction + multiple regression, GLMs &c. on the one hand with estimation + survival analysis on the other. The approach of defining a model, estimating its parameters from data, & using the model with the estimated parameters to make predictions is had in common by techniques that fall under the umbrella of survival analysis & techniques that don't, & questions about how to represent predictors relate to Cox or Weibull regression ... – Scortchi - Reinstate Monica Oct 15 '14 at 09:33
... as much as to multiple linear regression & GLMs. – Scortchi - Reinstate Monica Oct 15 '14 at 09:36
I contrast these simply due to the ubiquitous, but inconsistent use of the term 'predictive' as both an 'explanatory' and 'temporal' term. Sure, the OP may be trying to predict the probability of failure over n items, but i suspect from the OQ the OP may not understand this, and may be pulling at straws for an answer which is clearly a time-to-event question ('lifespan'). My suspicion is that the OP may be of a CS/ML background (the term 'features' being one such note). I may be too presumptive here, but i felt the need to make the distinction early on. – pmiddlet Oct 16 '14 at 21:09
Yes, of course you're correct. I made this artificial distinction simply due to the well known inconsistent use of the term 'predictive' as both an explanatory and temporal term outside of statistical nomenclature. Sure, the OP may be trying to predict the probability of failure over time (and implicitly over n items). But my suspicion is that the OP may have a CS/ML background (the term 'features' stood out), and will likely use a different set of terms. The need to cater language to different groups who are conducting stat analyses, but use a different vernacular, isn't that unusual nowadays – pmiddlet Oct 16 '14 at 21:38
Reposted above reply in a clearer manner and couldn't get the edit to take. Sorry for the revised post. – pmiddlet Oct 16 '14 at 21:40
