Inferring relative importance of features in driving a dependent variable

Question

In an auctions website I maintain, auctions are listed most-recent-first. There are 20 auctions per page. A user can click next in the footer to view older auctions.

It's early days, so there's currently no other way to search for an auction. The most-recent-first view is the only discovery mechanism. I do provide a filter by city, but its results are also ordered most-recent-first.

I need to infer which features are driving bid submission in this auction website (so that I can improve bid submission rates).

My dataset comprises auctions that were all alive for 7 whole days. My initial plan was to apply logistic regression to this dataset with unique_bids_per_day as the dependent variable. I had a very useful discussion about that here.

In a nutshell, I was advised that if the dependent variable followed the Bernoulli or binomial distribution, then logistic/binomial regression could be useful. So I did a quick analysis to check the distribution of unique_bids across the 7 days an auction is live. Results suggest that most bids come within 24 hours of auction submission (or creation).

This is unsurprising given how the website is organized (described at the start).

So wouldn't this mean that unique_bids_per_day does not follow a binomial distribution (the probability of getting a bid is not constant over the life of an auction)? If so, that would jeopardize using logistic regression in this scenario. What should I do instead to infer which features are driving bids? An illustrative answer would be great.

Note: the features are both categorical and numeric.

This is the head of the data (summarized; the actual data has more features). unique_clicks_per_day is actually unique_bids_per_day.

The natural log of days_since_submission looks slightly bimodal (plot not reproduced here).

What are the possible values that `unique_bids_per_day` can take on? If it is not 0 and 1 then without recoding, you cannot use this variable in a logistic regression. If you could provide a snapshot of the first few lines of your data it would help folks at the site better address your modeling questions. — Matt Barstead, Aug 08 '17 at 16:50
@MattBarstead: Added the `head` at the end of the question. One way to recode `unique_bids_per_day` could be to calculate `median_bids_per_day`, and then classify all values above median as `1`, and `0` otherwise. However, this way, I'd lose some information. Could there be a better way to accomplish the inference I want? Responders to the question I've linked to seemed to imply Binomial regression is the way to go (which I assume is the same as logistic in this case). — Hassan Baig, Aug 08 '17 at 17:24
I guess a fundamental question is what does the value .142857 represent in the first row of `unique_clicks_per_day`? It can't be number of clicks because it is not an integer and it doesn't seem like it is a proportion as there is a value in the same vector that exceeds 1. (it is also repeated in the third row of data though that could be a coincidence). — Matt Barstead, Aug 08 '17 at 18:49
@MattBarstead: it's basically `total unique clicks` garnered in 7 days, divided by `7`. 7 days is the life of a single auction. — Hassan Baig, Aug 08 '17 at 19:04
Okay, understood. One option is to skip the linear transformation and keep your measure as `total unique clicks`. Each auction sounds like it lasts for the same length of time, so there is nothing to be gained analytically by standardizing to a daily click rate (it only alters the interpretation). The untransformed variable is a count variable (it has to be a non-negative integer), so you can use a generalized linear model, specifically a Poisson regression, which uses the Poisson family with a log link function. — Matt Barstead, Aug 09 '17 at 01:43
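
To illustrate the point in the last comment, that rescaling by a constant 7 days changes only the interpretation, here is a minimal R sketch on simulated data. Everything in it (sample size, the coefficients used to simulate, the `days` vector) is an assumption for illustration and not taken from the asker's data. It fits the same Poisson model once to the raw count and once, via an offset, to the per-day rate.

```r
# Simulated data: all numbers here are hypothetical.
set.seed(42)
n <- 200
descrp <- rpois(n, 20)                               # e.g. words in a description
total_clicks <- rpois(n, exp(0.5 + 0.03 * descrp))   # counts over the whole week
days <- rep(7, n)                                    # every auction lives 7 days

# Model the weekly count directly.
fit_total <- glm(total_clicks ~ descrp, family = poisson)

# Model the daily rate: same counts, with log(days) as an offset.
fit_rate <- glm(total_clicks ~ descrp + offset(log(days)), family = poisson)

# The slope is the same in both fits; only the intercept shifts, by log(7).
slope_diff <- unname(coef(fit_total)["descrp"] - coef(fit_rate)["descrp"])
intercept_diff <- unname(coef(fit_total)["(Intercept)"] -
                         coef(fit_rate)["(Intercept)"])
print(c(slope_diff = slope_diff, intercept_diff = intercept_diff, log7 = log(7)))
```

An offset only becomes necessary when auctions are observed for different lengths of time; with a fixed 7-day window the plain count model carries the same information.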

score 1 · Accepted Answer · answered Aug 09 '17 at 02:18

I am not well versed in Matlab but I am confident that you can run a generalized linear model in the program. A quick Google search led me to this site with what looks like relevant code for such a model.

I'll walk you through a Poisson regression using R. The basic properties and output of the analysis should be the same.

First, I simulated some truncated data (I tried to give it some similar properties to your data set):

> head(dat)
  total_clicks descrp
1            4     23
2            1     19
3            3     22
4            7     21
5            0     14
6            1     15
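
The simulation code itself is not shown. As a rough sketch, data with this shape could be generated along the following lines. The seed, the distribution of `descrp`, and the coefficients 0.5 and 0.035 are assumptions chosen to land near the output reported in this answer; only the sample size of 100 is implied by the 99 null degrees of freedom in that output.

```r
# Hypothetical reconstruction: not the code actually used for the answer.
set.seed(1)
n <- 100                                        # null deviance has 99 df, so n = 100

# Description length in words, kept at 1 or above (assumed distribution).
descrp <- pmax(round(rnorm(n, mean = 20, sd = 4)), 1)

# Poisson counts whose log-mean is linear in description length.
lambda <- exp(0.5 + 0.035 * descrp)             # assumed coefficients
total_clicks <- rpois(n, lambda)

dat <- data.frame(total_clicks, descrp)
head(dat)
```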

My goal here is to determine whether total clicks over the 7 days can be predicted by the number of words in the description.

Here is the model and output from R:

> fit<-glm(total_clicks~descrp, family = 'poisson', data=dat)
> summary(fit)

Call:
glm(formula = total_clicks ~ descrp, family = "poisson", data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7788  -0.7558  -0.1032   0.6439   2.5157  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  0.50713    0.26060   1.946  0.05165 . 
descrp       0.03516    0.01250   2.812  0.00492 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 116.82  on 99  degrees of freedom
Residual deviance: 108.88  on 98  degrees of freedom
AIC: 402.55

Number of Fisher Scoring iterations: 5
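
One quick check worth making on output like this is whether the Poisson assumption of variance equal to the mean looks plausible. A rough rule of thumb is to compare the residual deviance to its degrees of freedom: a ratio near 1 is consistent with the Poisson model, while a ratio well above 1 hints at overdispersion. The sketch below uses only the two numbers printed in the summary; the variable names are mine.

```r
# Values copied from the summary output.
residual_deviance <- 108.88
residual_df <- 98

# Rough dispersion check: close to 1 is what a Poisson model expects.
dispersion_ratio <- residual_deviance / residual_df
print(round(dispersion_ratio, 3))

# With the fitted object at hand, a Pearson-based estimate is more common:
#   sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
```

If the ratio were clearly larger than 1, refitting with `family = quasipoisson` (or a negative binomial model) would be a common next step, since it leaves the coefficient estimates unchanged but widens the standard errors.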

Briefly, the model suggests there is a positive relation between the number of words in a description and the number of unique clicks received. However, the coefficient is on the log scale, so its value does not have an obvious interpretation. To convert it back to the original units of measure, we undo the log (the link function for Poisson regression) by exponentiating the coefficient.

In R code this is how you exponentiate:

> exp(fit$coefficients)
(Intercept)      descrp 
   1.660519    1.035782

This gives you a value for the intercept (the estimated number of clicks when your predictor or predictors equal 0) and the slope. Some call this latter exponentiated coefficient the incidence rate ratio (IRR); others refer to it as the event rate ratio (ERR). Regardless of the terminology used, the interpretation is the same: for each additional word in a description, the expected click count is multiplied by a factor of 1.036. Another way of thinking about this value, which may be more intuitive, is that each additional word is associated with a 3.6% increase in the expected number of clicks (on average).
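
The same arithmetic can be extended to approximate confidence intervals and to changes of more than one word. The sketch below works directly from the estimates and standard errors printed in the summary above, so it needs no data; the Wald-type interval (estimate plus or minus about 1.96 standard errors, then exponentiated) is my choice of method, not something the original output reports.

```r
# Estimates and standard errors copied from the summary output.
est <- c("(Intercept)" = 0.50713, descrp = 0.03516)
se  <- c("(Intercept)" = 0.26060, descrp = 0.01250)

# Rate ratios with approximate 95% Wald intervals (computed on the log
# scale, then exponentiated).
z <- qnorm(0.975)
irr <- exp(est)
ci_lower <- exp(est - z * se)
ci_upper <- exp(est + z * se)
print(round(cbind(IRR = irr, lower = ci_lower, upper = ci_upper), 3))

# Percent change in expected clicks per additional word.
pct_change <- (irr["descrp"] - 1) * 100

# Effects multiply: ten extra words scale the expected count by exp(10 * b).
ten_words <- exp(10 * est["descrp"])
print(c(pct_change, ten_words))
```

With the fitted object available, `exp(confint.default(fit))` should give the same Wald intervals without retyping the numbers.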

Thanks for the detailed answer, Matt. I have 2 follow-up questions about the fundamentals of Poisson. **1)** In the Poisson mass function, lambda is the rate per unit time (and assumed to be steady). In my case, clicks decay as an auction ages, i.e. their rate is not steady. Wouldn't that compromise the underlying assumption of the distribution? If so, what do we do about it? **2)** A trivial question: Poisson regression works fine with categorical features too, right? — Hassan Baig, Aug 09 '17 at 11:52
Though I understand where question #1 comes from, I am not sure that its answer threatens your analysis in a meaningful way. Sure, the rate at the start of the week is not the same as the rate at the end of the week. However, your goal is not to find the "true" click rate in your population of auctions. It is to see whether certain features predict the number of clicks over a pre-defined window of time: one week. If you want to factor the decay rate into your model, or even model it explicitly, there are ways to do so using multilevel growth models. — Matt Barstead, Aug 09 '17 at 12:36
For question #2 - yes, you can incorporate categorical predictors in a Poisson regression using the same general strategies you would for any regression model. — Matt Barstead, Aug 09 '17 at 12:38
I also made a separate question about this here: https://stackoverflow.com/questions/45796641/fixing-typeerror-in-poisson-regression-using-python?noredirect=1#comment78551311_45796641 — Hassan Baig, Aug 21 '17 at 12:41
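
On the second follow-up (categorical features), a minimal R sketch may help. It assumes a hypothetical three-level `city` factor alongside `descrp`; the data, city labels, and effect sizes are all simulated for illustration. R expands the factor into dummy variables automatically, and `drop1()` gives one likelihood-ratio test per term, which is one simple way to compare how much each feature contributes.

```r
# Simulated data: city labels and effect sizes are hypothetical.
set.seed(7)
n <- 300
city <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
descrp <- rpois(n, 20)
city_effect <- c(A = 0, B = 0.3, C = -0.2)      # assumed log-scale effects
lambda <- exp(0.5 + 0.03 * descrp + city_effect[as.character(city)])
dat <- data.frame(total_clicks = rpois(n, lambda), descrp, city)

# The factor enters the formula like any other predictor.
fit <- glm(total_clicks ~ descrp + city, family = poisson, data = dat)

# cityB and cityC are rate ratios relative to the reference level, city A.
print(round(exp(coef(fit)), 3))

# Likelihood-ratio test for dropping each term (the whole factor at once).
term_tests <- drop1(fit, test = "Chisq")
print(term_tests)
```

The reference level is simply the first factor level; `relevel()` can change it if a different baseline city is more natural.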
