I'm developing an application that is fed with continuous data while older data is discarded. I'm using some algorithms to compute simple linear regression on these data with Perl.
Basically that works well, but for big numbers I'm seeing numeric effects that cause things like taking the square root of negative numbers. I'm looking for suggestions which algorithms to use that are both, efficient and numerically stable.
First the algorithms used (package LR
, derived from some code dated 1992):
sub init($)
{
my $self = shift;
$self->{'n'} = 0; # number of samples
$self->{'sx'} = 0.0; # sum of x
$self->{'sx2'} = 0.0; # sum of x^2
$self->{'sxy'} = 0.0; # sum of x*y
$self->{'sy'} = 0.0; # sum of y
$self->{'sy2'} = 0.0; # sum of y^2
}
sub add_sample($$$)
{
my ($self, $_x, $_y) = @_;
++($self->{'n'});
$self->{'sx'} += $_x;
$self->{'sy'} += $_y;
$self->{'sxy'} += $_x * $_y;
$self->{'sx2'} += $_x**2;
$self->{'sy2'} += $_y**2;
}
sub B($)
{
my $self = shift;
my $d = $self->{'n'} * $self->{'sx2'} - $self->{'sx'}**2;
return 1 if ($d == 0);
return ($self->{'n'} * $self->{'sxy'} - $self->{'sx'} * $self->{'sy'}) / $d;
}
sub A($)
{
my $self = shift;
if ((my $n = $self->{'n'}) > 0) {
return ($self->{'sy'} - $self->B * $self->{'sx'}) / $n;
}
croak "A: invalid n";
}
sub r($)
{
my $self = shift;
my $s = ($self->{'n'} * $self->{'sx2'} - $self->{'sx'}**2)
* ($self->{'n'} * $self->{'sy2'} - $self->{'sy'}**2);
return 1 if ($s == 0);
return ($self->{'n'} * $self->{'sxy'} - $self->{'sx'} * $self->{'sy'})
/ sqrt($s);
}
# covariance
sub cov($)
{
my $self = shift;
if ((my $n = $self->{'n'}) > 1) {
return ($self->{'sxy'} - $self->{'sx'} * $self->{'sy'} / $n)
/ ($n - 1);
}
croak "cov: invalid n";
}
sub sigma($)
{
my $self = shift;
if ((my $n = $self->{'n'}) > 1) {
my $sy = $self->{'sy'};
my $var = ($self->{'sy2'} - ($sy * $sy) / $n) / ($n);
return sqrt($var)
if ($var >= 0);
croak "sigma: negative var ($var)";
return 0;
}
croak "sigma: invalid n";
return 0;
}
sub new
{
my $class = shift;
my $self = {
'n' => undef, # number of samples
'sx' => undef, # sum of all x
'sx2' => undef, # sum af all x^2
'sxy' => undef, # sum of all x * y
'sy' => undef, # sum of all y
'sy2' => undef # sum of all y^2
};
bless $self, $class;
init($self);
return $self;
}
Frequent problems are:
- "sigma: negative var" (like -3.6e-7)
- "sigma: negative s2"
- "r: negative s" (like -560661624)
Finally, here's some sample numeric data (use columns 1 and 3, obviously):
"time(2)" "capacity(3)" "available(10)"
1582183652 15357962 3556620
1582183729 15357962 3556620
1582183801 15357962 3556620
1582185601 15357962 3554537
1582187401 15357962 3553028
1582189201 15357962 3551650
1582191001 15357962 3550107
1582192801 15357962 3548724
1582194601 15357962 3545775
1582196401 15357962 3535712
1582198201 15357962 3532808
1582200001 15357962 3530688
1582201801 15357962 3529146
1582203601 15357962 3620938
1582205401 15357962 3619426
1582207201 15357962 3618046
1582209001 15357962 3615828
1582210801 15357962 3614460
1582212601 15357962 3613074
1582214401 15357962 3613051
1582216201 15357962 3612108
1582218002 15357962 3612085
1582219801 15357962 3610363
1582221601 15357962 3610340
1582223401 15357962 3609532
1582225201 15357962 3604908
1582227001 15357962 3601223
1582228801 15357962 3599850
1582230601 15357962 3598986
1582232401 15357962 3598291
1582234201 15357962 3595594
1582236001 15357962 3593270
1582237801 15357962 3592459
1582239601 15357962 3591055
1582241401 15357962 3585715
1582243201 15357962 3584115
1582245001 15357962 3577642
1582246801 15357962 3575523
1582248602 15357962 3572615
1582250401 15357962 3570502
1582252201 15357962 3568953
1582254001 15357962 3568930
1582255801 15357962 3568801
1582257601 15357962 3568778
1582259401 15357962 3567983
1582261201 15357962 3567240
1582263001 15357962 3567105
1582264801 15357962 3567081
1582266601 15357962 3565554
1582268402 15357962 3564858
1582270201 15357962 3562456
1582272001 15357962 3560362
1582704497 15357962 3169346
1582722625 15357962 3201017
1582722834 15357962 3199608
1582723801 15357962 3197253
1582725601 15357962 3192441
1582727401 15357962 3186576
1582729202 15357962 3181764
1582731001 15357962 3175658
1582732801 15357962 3170904
1582734601 15357962 3166522
1582736401 15357962 3162655
1582738201 15357962 3158955
1582740001 15357962 3157452
1582741801 15357962 2740200
1582743601 15357962 2110691
1582745401 15357962 1775101
1582747201 15357962 1774428
1582749001 15357962 1773604
1582750801 15357962 3145823
1582752601 15357962 3140821
1582754401 15357962 3137691
1582756201 15357962 3137304
1582758001 15357962 3137308
1582759801 15357962 3128067
1582761602 15357962 3126471
1582763401 15357962 3120125
1582765201 15357962 3118018
1582767001 15357962 3115086
1582768802 15357962 3113679
1582770601 15357962 3112744
1582772401 15357962 3111363
1582774201 15357962 3109631
1582776001 15357962 3108954
1582777801 15357962 3108161
1582779601 15357962 3106755
1582781401 15357962 3106601
1582783201 15357962 3105909
1582785001 15357962 3105081
Obvious use of the code is like this:
my $lr = LR->new();
$lr->add_sample($x, $y);
...
my ($a, $b, $r) = ($lr->A(), $lr->B(), $lr->r());