
I am currently working on a simple linear time series regression model that looks like this: $$P_{t}=\beta _{0}+\beta _{1}X_{t}+\varepsilon _{t}$$ However, I am unsure how to deal with the very heavy tails that show up in this regression model.

Plotting the data against each other results in the following graph:

Data plotted against each other

A Q-Q plot of the regression residuals gives the following graph:

QQPlot - Regression Model
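For reference, the points behind such a Q-Q plot can be computed by hand. The following is only a minimal sketch using the Python standard library; the function names (`ols_residuals`, `normal_qq_points`, `excess_kurtosis`) and the plotting positions $(i + 0.5)/n$ are illustrative assumptions, not necessarily what produced the graph above.

```python
from statistics import NormalDist, mean, pstdev


def ols_residuals(x, y):
    """Residuals of the bivariate OLS fit y = b0 + b1 * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def normal_qq_points(residuals):
    """Pairs (theoretical normal quantile, standardized sample quantile)."""
    n = len(residuals)
    centre, spread = mean(residuals), pstdev(residuals)
    sample = sorted((r - centre) / spread for r in residuals)
    std_normal = NormalDist()
    # Assumed plotting positions (i + 0.5) / n; other conventions exist.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sample))


def excess_kurtosis(residuals):
    """Moment-based excess kurtosis; values well above 0 suggest heavy tails."""
    centre = mean(residuals)
    m2 = mean((r - centre) ** 2 for r in residuals)
    m4 = mean((r - centre) ** 4 for r in residuals)
    return m4 / m2 ** 2 - 3.0
```

Heavy tails show up as the lowest sample quantiles falling below, and the highest rising above, the corresponding theoretical quantiles.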

Now, in essence I have two questions regarding this issue:

  1. I would assume that the first graph implies non-linearity of the regression model. Is this a reasonable assumption, or can one still assume linearity? If not, how do I establish linearity in this specific case? I have tried my best, yet I cannot find a solution.

  2. In this case (bivariate linear regression), both graphs basically display the same thing. Is this correct?
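One rough, purely descriptive way to probe the linearity question in item 1 is to fit separate OLS slopes over low, middle and high ranges of $X_t$ and compare them. The sketch below assumes equal-count bins and at least two distinct $X_t$ values per bin; the names `ols_slope` and `slopes_by_range` are hypothetical, and this is not a formal test.

```python
from statistics import mean


def ols_slope(x, y):
    """Slope b1 of the bivariate OLS fit y = b0 + b1 * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def slopes_by_range(x, y, n_bins=3):
    """Sort observations by x, cut into equal-count bins, fit one slope per bin."""
    pairs = sorted(zip(x, y))
    size = len(pairs) // n_bins
    slopes = []
    for k in range(n_bins):
        start = k * size
        # The last bin absorbs any leftover observations.
        end = (k + 1) * size if k < n_bins - 1 else len(pairs)
        chunk = pairs[start:end]
        slopes.append(ols_slope([p[0] for p in chunk], [p[1] for p in chunk]))
    return slopes
```

Similar slopes across the bins are consistent with a single straight line; clearly different slopes at the low or high end point towards non-linearity.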

I am thankful for any comment or even an answer on how to deal with this issue. I tried reading the threads touching this topic, but I did not find them very helpful regarding my questions.

EDIT 1: The dependent variable is electricity price and the independent variable is load data. The endogeneity problem here is another issue, which I just have to deal with, as I am explicitly supposed to use the load as the independent variable. The nature of the data also implies that the fat tails are not caused by measurement errors, but by extreme fluctuations in the electricity market.

EDIT 2: As requested, several time series plots.

Independent variable against time:

Time series plot of the independent variable

Dependent variable against time:

Time series plot of the dependent variable

ACF (Regression Model as described above):

ACF plot - Regression Model

PACF (Regression Model as described above):

PACF plot - Regression Model

I wanted to account for seasonality with dummy variables. The ACF/PACF charts indicate AR(1), at least to my understanding. I was planning to apply the Cochrane-Orcutt method in order to eliminate serial correlation in the error terms.
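To make the plan concrete, the Cochrane-Orcutt iteration for the bivariate model can be sketched as follows. This is a minimal sketch under the assumption of AR(1) errors, $\varepsilon_t = \rho\,\varepsilon_{t-1} + u_t$ with $|\rho| < 1$; the function names `ols` and `cochrane_orcutt` and the stopping rule are illustrative, and the seasonal dummies are left out for brevity.

```python
from statistics import mean


def ols(x, y):
    """Intercept and slope of the bivariate OLS fit y = b0 + b1 * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def cochrane_orcutt(x, y, tol=1e-8, max_iter=100):
    """Iterate between estimating rho and refitting on quasi-differenced data."""
    n = len(y)
    b0, b1 = ols(x, y)  # start from plain OLS
    rho = 0.0
    for _ in range(max_iter):
        resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
        # AR(1) coefficient: regress e_t on e_{t-1} without an intercept.
        num = sum(resid[t] * resid[t - 1] for t in range(1, n))
        den = sum(resid[t - 1] ** 2 for t in range(1, n))
        new_rho = num / den
        # Quasi-difference both series; the first observation is dropped.
        x_star = [x[t] - new_rho * x[t - 1] for t in range(1, n)]
        y_star = [y[t] - new_rho * y[t - 1] for t in range(1, n)]
        a0, b1 = ols(x_star, y_star)
        b0 = a0 / (1.0 - new_rho)  # transformed intercept is b0 * (1 - rho)
        converged = abs(new_rho - rho) < tol
        rho = new_rho
        if converged:
            break
    return b0, b1, rho
```

The returned `rho` can be compared with the lag-1 value in the ACF chart as a rough sanity check on the AR(1) reading.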

kjetil b halvorsen
shenflow
  • At most, normal distribution is an assumption (better read as ideal condition) for the errors, **not** any of the variables that enter a regression. Nothing in your _analysis_ has time series flavour and the most relevant graph is the scatter plot which poses a challenging mixture of mostly linear behaviour and some marked departures from that for low and high values. The most useful next graph is one not here, plotting the two variables against time. The best model will surely be found by considering the processes underlying your system, about which nothing is said here. – Nick Cox Nov 15 '17 at 17:24
  • In addition to answering @NickCox's points, it might help to say what the variables are, as substantive knowledge often helps to explain the distribution. – mdewey Nov 15 '17 at 17:27
  • Thank you, I entirely reedited the question. Yet the problem I have is the linearity of the regression model: how does plotting the data against time help with that? Isn't that the way to go when dealing with autocorrelation? – shenflow Nov 15 '17 at 18:42
  • Minimally, you've described what you have done as a time series regression model, but as you've explained it, your model ignores time. (Subscripting here is just cosmetic; you're assuming or asserting instantaneous response, so that lags don't bite.) That may be sensible but without seeing the data as time series we have no way to check on or advise on whether it is sensible. Time series people might also reasonably ask for autocorrelation, cross-correlation etc. – Nick Cox Nov 15 '17 at 19:34
  • Alright, got it - I edited the question and included the time series charts. I also included an ACF as well as a PACF chart. I was planning to account for seasonality by adding dummy variables and to eliminate autocorrelation by applying cochrane-orcutt. I thought the problem of **one explanatory variable** displaying a non-linear relationship (even if I expanded the model to a multiple regression model by adding dummy variables) would violate the assumption of a linear model. That is why I thought other information would not add any value to the issue I originally formulated. – shenflow Nov 15 '17 at 21:05
  • You seem to have a daily cycle in there. The data seem to extend over about 1 year so there may be seasonality (usual sense, dependent on time of year) too. That being so, your time series graph and correlation functions would be easier to think about if you plotted using labels that are multiples of 24; there is no special or helpful meaning to 20 or 2000 for hourly data. To your main point, I hope time series experts chime in with good ideas. – Nick Cox Nov 15 '17 at 22:42
  • Spelling is Cochrane-Orcutt, both people no longer with us. – Nick Cox Nov 15 '17 at 22:42
  • What you seem to be describing is a demand curve and not a time series. You probably have multiple demand curves depending on the season you are in. I don't think time matters here in the sense of $P_{t+1}=f(P_t,X_t)$. I do think time matters because you are likely looking at multiple equations broken by time, so there probably is at least one structural break if not season breaks. I think the problem is you are just misspecified. – Dave Harris Nov 16 '17 at 00:46
  • The load data I am dealing with is residual (that is, load minus the electricity fed into the system from renewable energy). Plus, I am assuming price-inelastic demand. The fluctuation is hugely influenced by renewable energy generation. That being said, I actually do not think I am estimating demand functions (please explain if you still think so). I basically just want to measure the influence of the independent variable on the dependent variable in a descriptive way; I do not want to forecast anything. – shenflow Nov 16 '17 at 06:52
  • And just for me to understand - looking at the ACF/PACF plots I would assume AR(1), you are saying this is the wrong conclusion in this case? – shenflow Nov 16 '17 at 07:09
  • It looks like you have a circadian cycle. Have you considered a cosinor model? – Roland Nov 16 '17 at 07:36
  • @Roland Actually, I am not familiar with cosinor models. Are you referring to the seasonality issue? – shenflow Nov 16 '17 at 08:34
  • @DaveHarris I thought about the autocorrelation/structural break issue you pointed out. Maybe a regime switching model would be a solution I guess? – shenflow Nov 16 '17 at 08:57
  • @shenflow Yes. You want to get time series thinking out of your mind. These are not time series. I would begin by graphing the data under a variety of cuts, such as annual, summer, and so forth. You can also choose to let the data choose your cut points. Still, I would graph the data without imposing any models. Due to the nature of the data, it is apt to spike at the extremes anyway. You may also pick up extra data on the percentage of renewable resources in the mix. It should change your curve. – Dave Harris Nov 17 '17 at 17:18
  • Actually, these are time series. This is hourly data for a specific year. What makes you conclude that these are not time series? The percentage of renewable resources in the mix is 0%. The load data is **residual**, so renewable resources are excluded. – shenflow Nov 17 '17 at 17:23
  • @DaveHarris The plots against hours might be a bit confusing, since they might give the impression that the data are not a time series. This is just because I plotted against 8760 hours (I do not even know why I did that) instead of just displaying the data in a plot. To be clear: I have hourly data for the residual load and the electricity price in a specific year. – shenflow Nov 17 '17 at 17:27
  • @shenflow it is not a time series, it is data collected across time. Plot price versus load and drop time. Time only impacts the curve if it is a proxy for other things such as it being warmer in the summer. Time is not a variable here, but it may be a proxy for other effects. – Dave Harris Nov 17 '17 at 17:52
  • Could someone else comment on this? To my understanding, time series data is defined by the fact that it is data collected over a period of time and thus indexed by time. Basic definitions in, for example, "Introductory Econometrics" by Wooldridge confirm this definition. Take for example static regression models: the variables are contemporaneously dated, and thus time is not necessarily a variable *per se* in a regression model that is applied to time series data. – shenflow Nov 17 '17 at 18:34
  • @DaveHarris are you implying this is panel data and not time series data? – shenflow Nov 17 '17 at 19:22
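
For readers unfamiliar with the cosinor model raised in the comments: a single-component cosinor describes a daily cycle as $y_t = M + A\cos(2\pi t/24 - \varphi)$, with mesor $M$, amplitude $A$ and acrophase $\varphi$. The sketch below is illustrative only. It assumes evenly spaced hourly data covering a whole number of days (as 8760 hours would), in which case the least-squares fit has a closed form; the function name `fit_cosinor` is hypothetical.

```python
import math


def fit_cosinor(y, period=24):
    """Single-component cosinor fit; returns (mesor, amplitude, peak time)."""
    n = len(y)
    if n == 0 or n % period != 0:
        raise ValueError("series must cover a whole number of periods")
    omega = 2 * math.pi / period
    mesor = sum(y) / n
    # Over complete cycles the intercept, cosine and sine regressors are
    # mutually orthogonal, so the OLS coefficients reduce to these sums.
    a = 2 / n * sum(v * math.cos(omega * t) for t, v in enumerate(y))
    b = 2 / n * sum(v * math.sin(omega * t) for t, v in enumerate(y))
    amplitude = math.hypot(a, b)
    # Time index within the cycle at which the fitted curve peaks.
    peak = (math.atan2(b, a) / omega) % period
    return mesor, amplitude, peak
```

For example, `fit_cosinor(prices)` on a list of hourly prices (a hypothetical variable) would give the average level, the size of the daily swing and the hour at which the fitted cycle peaks.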
