In a study of crowdfunding, I am investigating the relationship between the reputation of a person seeking funding for a project (a continuous variable) and the proportion of funding target received (i.e., funding received/funding target*100).

My data, however, are right-censored: some of the crowdfunding projects are still in the funding stage, so they may yet meet or exceed their funding target by the end of their funding period.

For the moment, I am using linear regression as follows in R:

lm(proportionFunded ~ reputationBorrower + controlVariables, data=subSample)

where subSample contains all projects that have expired, i.e., whose remaining funding days are zero.

I understand that this is not the right approach, as it ignores the remaining projects (about 25%) that are still running (i.e., that have not yet expired).

Is a Cox proportional hazards model better suited for this purpose? If so, could you suggest which additional variables I need to compute and the syntax for fitting the Cox model in R?

Thank you!

SanMelkote
    Even ignoring censoring, please notice that the "proportion funded" is a variable in $[0,1]$; it would therefore make sense to model it using beta regression. – usεr11852 Feb 02 '20 at 10:57

2 Answers

I think survival analysis would be a sensible choice here, given the fact that this process unfolds over time; that the process has a binary outcome (as you've construed it); and that some of your observations are right censored. If you go that route, you would need to make at least two significant changes.

  • Reframe your outcome of interest as a binary event, e.g., the achievement of full funding. You could then move your continuous measure of funding to date to the other side of the equation, to incorporate information about partial successes into the analysis.
  • Add time to the mix. This would probably be something like "time since founding," where you can put all the cases on a comparable scale, even if they occurred at different points in calendar time.
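
As a rough illustration of that reframing, the sketch below builds the two columns a survival model needs from a toy data frame. The column names `daysSinceLaunch` and `fullyFunded` and all the values are made up for illustration; they are not taken from the question's data.

```r
library(survival)

# Toy data: one row per project. All names and values are hypothetical.
projects <- data.frame(
  daysSinceLaunch    = c(10, 30, 22, 30, 15),     # days to full funding, or to last observation
  proportionFunded   = c(100, 64, 120, 100, 35),  # percent of target raised so far
  reputationBorrower = c(4.1, 2.3, 4.8, 3.9, 3.0)
)

# Event indicator: 1 = target reached, 0 = not reached when last observed
projects$fullyFunded <- as.integer(projects$proportionFunded >= 100)

# Surv() pairs each time with its event indicator; censored times print with "+"
with(projects, Surv(daysSinceLaunch, fullyFunded))
```

Projects with `fullyFunded == 0` enter the analysis as right-censored observations rather than being dropped.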

In R, the workhorse package for survival analysis is, appropriately, survival. For an overview of available tools and links to relevant documents, check out the CRAN Task View on survival analysis.

I'm not going to try to write code to perform an analysis when I don't have, and am not familiar with, the relevant data or research question. I will say, though, that you'll probably want to look into designs and code for analyses with time-varying covariates, i.e., where values of some or all of your independent variables change over the course of time between project launch and the end of observation (via success or censoring). That would be a must if you want to consider partial funding as a covariate.
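
Purely to show the data layout such an analysis uses, and not as an analysis of the actual data, here is a sketch of the counting-process format with simulated values. Each project contributes one row per interval, and `Surv(tstart, tstop, event)` tells `coxph` when each covariate value applies. The variable names, the single update at day 10, and the simulation settings are all assumptions made for the example.

```r
library(survival)
set.seed(1)

n          <- 60
reputation <- rnorm(n)                 # simulated reputation score
fundTime   <- rexp(n, rate = 0.08)     # simulated days to full funding
pctDay10   <- runif(n, 0, 90)          # simulated percent funded at day 10

# Interval (0, 10]: partial funding is 0 at launch for every project
first <- data.frame(id = 1:n, tstart = 0, tstop = pmin(fundTime, 10),
                    event = as.integer(fundTime <= 10),
                    pctFunded = 0, reputation = reputation)

# Interval (10, 20]: only projects still unfunded at day 10; observation stops at day 20
late   <- fundTime > 10
second <- data.frame(id = which(late), tstart = 10,
                     tstop = pmin(fundTime[late], 20),
                     event = as.integer(fundTime[late] <= 20),
                     pctFunded = pctDay10[late], reputation = reputation[late])

long <- rbind(first, second)

# pctFunded changes between a project's rows; reputation stays fixed
fit <- coxph(Surv(tstart, tstop, event) ~ reputation + pctFunded, data = long)
summary(fit)
```

With real data the intervals would follow however often funding levels were recorded; the survival package's `tmerge` function can help build such a data set.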

The alternative would be to estimate an initial-conditions model with reputation as the (sole?) covariate and time to full funding as the outcome of interest. If you go that route, you just need to ensure that your measure of reputation is based strictly on information available prior to project launch.

If you really want to complicate things, you could try a competing-risks approach, where you simultaneously consider the likelihood of two competing events: achievement of full funding, or project termination short of full funding. Project "death" isn't the same as right censoring of "live" projects, and this approach would allow your estimates to reflect that fact.
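
A minimal sketch of what that could look like with the survival package, using simulated data. The outcome labels ("live", "funded", "failed") and the simulation settings are invented for the example. When the status passed to `Surv` is a factor, its first level is treated as censoring and the remaining levels as competing event types.

```r
library(survival)
set.seed(2)

n          <- 200
reputation <- rnorm(n)
tFunded <- rexp(n, rate = 0.03 * exp(0.5 * reputation))  # latent days to full funding
tFailed <- rexp(n, rate = 0.02)                          # latent days to termination
tCensor <- runif(n, 5, 60)                               # days observed for live projects

time <- pmin(tFunded, tFailed, tCensor)
outcome <- factor(
  ifelse(time == tCensor, "live", ifelse(time == tFunded, "funded", "failed")),
  levels = c("live", "funded", "failed")  # first level = censored
)

# Aalen-Johansen estimate of the cumulative incidence of each outcome
cuminc <- survfit(Surv(time, outcome) ~ 1)

# Cause-specific Cox model for full funding; failures count as censored here
fitFunded <- coxph(Surv(time, outcome == "funded") ~ reputation)
summary(fitFunded)
```

The cause-specific model answers a question about the rate of funding among projects still at risk; a Fine-Gray model would be the alternative if the cumulative incidence itself is the target.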

ulfelder

The "time" variable does not have to be a true time (although it could be). Nor is the proportion really confined to the interval from 0 to 1, even if the values in your dataset are, because projects can be funded beyond their target. The main assumption of the Cox model is proportional hazards, which is fairly weak: covariates scale a common baseline hazard by a constant factor, and the shape of that baseline hazard is unrestricted.

coxph in the R survival package can be used to fit the model, and cox.zph in the same package tests the proportional hazards assumption. Try the code below, and see ?Surv regarding the status vector, which is the additional input needed to indicate whether each observation is censored or not.

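library(survival)
# status: 1 = final proportion observed (project expired), 0 = right-censored (still running)
# the data passed here must include the still-running projects, not only the expired ones
coxph(Surv(proportionFunded, status) ~ reputationBorrower + controlVariables,
  data = subSample)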
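
To make the fit-then-check workflow concrete, here is a sketch on simulated data. The data frame `allProjects`, the column `daysRemaining`, and the simulation settings are assumptions for the example; only `proportionFunded`, `reputationBorrower`, and `status` follow the names used above.

```r
library(survival)
set.seed(3)

n <- 200
allProjects <- data.frame(
  reputationBorrower = rnorm(n),
  daysRemaining      = sample(c(0, 0, 0, 7), n, replace = TRUE)  # roughly 25% still running
)

# Simulated final percent funded; higher reputation lowers the "hazard" of stopping
finalPct <- rexp(n, rate = 0.02 * exp(-0.3 * allProjects$reputationBorrower))

# Running projects have only reached part of their final percentage so far
running <- allProjects$daysRemaining > 0
allProjects$proportionFunded <- ifelse(running, finalPct * runif(n, 0.2, 0.9), finalPct)
allProjects$status <- as.integer(!running)  # 1 = expired (observed), 0 = censored

fit <- coxph(Surv(proportionFunded, status) ~ reputationBorrower, data = allProjects)
summary(fit)  # a negative coefficient means higher reputation goes with more funding

zp <- cox.zph(fit)
zp            # small p-values cast doubt on proportional hazards
```

Because the "time" axis here is the funding percentage, a lower hazard corresponds to reaching a higher percentage, which is why the sign of the coefficient reads the opposite way from a coefficient in lm.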

For more information, there is an example of using the Cox model here: http://www.sthda.com/english/wiki/cox-proportional-hazards-model and an example of using cox.zph in the question "Extended Cox model and cox.zph". The survival package also has a number of vignettes (PDF documents) that you can look at.

The main alternative would be an accelerated failure time model, a parametric model that can be fit with survreg (in the same package).
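
A sketch of such a fit on simulated data follows. The log-normal distribution and the simulation settings are assumptions chosen for the example; `survreg` supports other distributions as well (see ?survreg).

```r
library(survival)
set.seed(4)

n <- 200
reputationBorrower <- rnorm(n)

# Simulated final percent funded, log-normal with a reputation effect
finalPct <- exp(4 + 0.3 * reputationBorrower + 0.5 * rnorm(n))

# About a quarter of projects are still running and only partly through their funding
running          <- runif(n) < 0.25
proportionFunded <- ifelse(running, finalPct * runif(n, 0.2, 0.9), finalPct)
status           <- as.integer(!running)  # 1 = expired (observed), 0 = censored

aft <- survreg(Surv(proportionFunded, status) ~ reputationBorrower,
               dist = "lognormal")
summary(aft)

# Multiplicative change in percent funded per unit increase in reputation
exp(coef(aft)["reputationBorrower"])
```

Unlike the Cox coefficient, the AFT coefficient acts directly on the scale of the outcome, which can be easier to report.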

G. Grothendieck