
I have been struggling to fit a sine curve to my data.

My data looks like:

frame = data.frame(hour = c(0, 1, 2, ... 24), value = (numbers between 0 and 500))

I have the following model:

[image: model summary output]

It doesn't explain much: the adjusted R-squared is only 0.1836.

Here is the plot: [image: graph]

Histogram of log(value): [image: histogram]

Plot of log(value): [image: plot]

As a time series: [image: time series]

compguy24
  • Judging from your plot, you should not expect a higher R^2 value, because you have a whole range of responses for a single hour. – Karolis Koncevičius Nov 22 '14 at 07:00
  • @KarolisKoncevičius so how can I better model this then? – compguy24 Nov 22 '14 at 07:01
  • If you don't have external causal variables that *explain* why a particular value at 10am is high rather than low, you won't be able to model it better. "Noise" is whatever you don't know about, and you apparently have a lot of it. – Stephan Kolassa Nov 22 '14 at 07:09
  • For fitting sine curves you might want to look here: http://stats.stackexchange.com/questions/60500/how-to-find-a-good-fit-for-semi-sinusoidal-model-in-r Regarding R^2: your example is a little inconsistent. Your data only has hours, yet your summary table shows days and months. Unless day and month can explain some of the variance in your last plot, then, as @StephanKolassa said, there is not much you can do to increase R^2. – Karolis Koncevičius Nov 22 '14 at 07:13
  • Your response appears to be strictly positive. You might be better off with a GLM perhaps. What does the response consist of? – Glen_b Nov 22 '14 at 08:59
  • @Glen_b it's air pollution data plotted against hour of the day -- did I answer your question? – compguy24 Nov 22 '14 at 09:30
  • What does it look like on the log scale? – Glen_b Nov 22 '14 at 09:49
  • @Glen_b plot and histogram of log(value) added – compguy24 Nov 22 '14 at 11:16
  • There's at least some suggestion from that information that a gamma model might be reasonable for the conditional distribution. Alternatively, you might want to consider some kind of time series model in the logs (possibly AR, say, perhaps with regressors as well). It may be that there's simply a lot of noise. One thing that has me curious is why a period like 50*hour? Was that deliberately chosen? – Glen_b Nov 22 '14 at 15:00
  • This should cross-reference your previous question: http://stats.stackexchange.com/questions/124816/transform-time-dependent-data Using a GLM was one suggestion in that thread. – Nick Cox Nov 23 '14 at 10:24
  • The use of sine and cosine of 50 and 100 $\times$ hour is bizarre. Plot these functions over the observed range to see that they repeat many times in the course of a daily cycle. Terms in 2 $\pi$ hour/24 repeat once per day. If day is day of month 1..31, what physical effect does it capture? If month is month of year 1..12, why is it not treated periodically? (If these are not the correct definitions what are they?) – Nick Cox Nov 23 '14 at 10:32
  • If you have no drivers such as meteorological conditions, industry, traffic, or business activity, then it is inevitable that you can **only** explain what can be captured by direct or indirect functions of time. (Not a criticism; very likely the data are not easily available.) – Nick Cox Nov 23 '14 at 10:36
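
The suggestions in the comments (harmonic terms in 2 pi hour/24, and a gamma GLM for a strictly positive response) could be sketched in R roughly as follows. This is only an illustration under stated assumptions: the data are simulated stand-ins, not the asker's pollution data; hours run 0 to 23 rather than 0 to 24; and the names s1, c1, fit_lm, fit_glm, amplitude and peak_hour are hypothetical. Only the first harmonic (one cycle per day) is included.

```r
# Simulated stand-in data (NOT the asker's data): hourly readings over 30 days,
# strictly positive, one cycle per day, gamma-distributed around the mean.
set.seed(1)
hour  <- rep(0:23, times = 30)
mu    <- exp(4 + 0.5 * sin(2 * pi * hour / 24) + 0.3 * cos(2 * pi * hour / 24))
value <- rgamma(length(hour), shape = 2, rate = 2 / mu)  # mean equals mu
frame <- data.frame(hour = hour, value = value)

# Harmonic terms with a 24-hour period, so they repeat exactly once per day.
frame$s1 <- sin(2 * pi * frame$hour / 24)
frame$c1 <- cos(2 * pi * frame$hour / 24)

# Option 1: ordinary least squares on the log scale.
fit_lm <- lm(log(value) ~ s1 + c1, data = frame)

# Option 2: gamma GLM with a log link.
fit_glm <- glm(value ~ s1 + c1, family = Gamma(link = "log"), data = frame)

# a*sin(t) + b*cos(t) = R*cos(t - phi), with R = sqrt(a^2 + b^2) and
# phi = atan2(a, b); phi converted to hours gives the time of the daily peak.
b <- coef(fit_glm)
amplitude <- unname(sqrt(b["s1"]^2 + b["c1"]^2))
peak_hour <- unname((atan2(b["s1"], b["c1"]) * 24 / (2 * pi)) %% 24)

summary(fit_glm)
```

Even with the period set correctly, a fit like this can only recover the average daily cycle. If the spread of values within a single hour is large, as in the plots above, R^2 will stay low without additional explanatory variables.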
