
I'm developing an application that is continuously fed data while older data is discarded. I'm using a few algorithms to compute a simple linear regression over these data in Perl.

Basically this works well, but for large numbers I'm seeing numeric effects that cause things like taking the square root of a negative number. I'm looking for suggestions for algorithms that are both efficient and numerically stable.

First, the algorithms used (package LR, derived from some code dated 1992):

package LR;

use strict;
use warnings;
use Carp;

sub init($)
{
    my $self = shift;

    $self->{'n'}   = 0;    # number of samples
    $self->{'sx'}  = 0.0;  # sum of x
    $self->{'sx2'} = 0.0;  # sum of x^2
    $self->{'sxy'} = 0.0;  # sum of x*y
    $self->{'sy'}  = 0.0;  # sum of y
    $self->{'sy2'} = 0.0;  # sum of y^2
}

sub add_sample($$$)
{
    my ($self, $_x, $_y) = @_;

    ++($self->{'n'});
    $self->{'sx'}  += $_x;
    $self->{'sy'}  += $_y;
    $self->{'sxy'} += $_x * $_y;
    $self->{'sx2'} += $_x**2;
    $self->{'sy2'} += $_y**2;
}

sub B($)
{
    my $self = shift;
    my $d = $self->{'n'} * $self->{'sx2'} - $self->{'sx'}**2;

    return 1 if ($d == 0);
    return ($self->{'n'} * $self->{'sxy'} - $self->{'sx'} * $self->{'sy'}) / $d;
}

sub A($)
{
    my $self = shift;

    if ((my $n = $self->{'n'}) > 0) {
        return ($self->{'sy'} - $self->B * $self->{'sx'}) / $n;
    }
    croak "A: invalid n";
}

sub r($)
{
    my $self = shift;

    my $s = ($self->{'n'} * $self->{'sx2'} - $self->{'sx'}**2)
      * ($self->{'n'} * $self->{'sy2'} - $self->{'sy'}**2);

    return 1 if ($s == 0);
    return ($self->{'n'} * $self->{'sxy'} - $self->{'sx'} * $self->{'sy'})
        / sqrt($s);
}

# covariance
sub cov($)
{
    my $self = shift;

    if ((my $n = $self->{'n'}) > 1) {
        return ($self->{'sxy'} - $self->{'sx'} * $self->{'sy'} / $n)
            / ($n - 1);
    }
    croak "cov: invalid n";
}

sub sigma($)
{
    my $self = shift;

    if ((my $n = $self->{'n'}) > 1) {
        my $sy = $self->{'sy'};
        my $var = ($self->{'sy2'} - ($sy * $sy) / $n) / $n;

        return sqrt($var) if ($var >= 0);
        croak "sigma: negative var ($var)";
    }
    croak "sigma: invalid n";
}

sub new
{
    my $class = shift;
    my $self = {
        'n'   => undef,         # number of samples
        'sx'  => undef,         # sum of all x
        'sx2' => undef,         # sum of all x^2
        'sxy' => undef,         # sum of all x * y
        'sy'  => undef,         # sum of all y
        'sy2' => undef          # sum of all y^2
    };
    bless $self, $class;
    init($self);
    return $self;
}

Frequent problems are:

  • "sigma: negative var" (like -3.6e-7)
  • "sigma: negative s2"
  • "r: negative s" (like -560661624)
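These failures come from catastrophic cancellation: with x values around 1.6e9, each x**2 is around 2.5e18, so the running sum sx2 can no longer hold its low-order digits in a double, and precisely those digits are what differences like n*sx2 - sx**2 depend on. A self-contained sketch of the effect (the timestamps here are synthetic, chosen only to mirror the scale of the real data below):

```perl
use strict;
use warnings;

# 1000 synthetic timestamps, 1800 s apart, at the same scale as the real
# data below (the values themselves are made up for illustration).
my $n = 1000;
my @x = map { 1_582_183_652 + 1800 * $_ } 0 .. $n - 1;

# One-pass, textbook formula: Sxx = sx2 - sx**2 / n
my ($sx, $sx2) = (0.0, 0.0);
for my $v (@x) { $sx += $v; $sx2 += $v**2; }
my $sxx_naive = $sx2 - $sx**2 / $n;

# Two-pass: compute the mean first, then sum squared deviations.
my $mean = $sx / $n;
my $sxx_stable = 0.0;
$sxx_stable += ($_ - $mean)**2 for @x;

# For an arithmetic sequence with step d, Sxx is exactly
# d**2 * n * (n**2 - 1) / 12.  The two-pass result matches it closely;
# the one-pass result may be off by a noticeable amount.
my $sxx_exact = 1800**2 * $n * ($n**2 - 1) / 12;
printf "naive:  %.1f\nstable: %.1f\nexact:  %.1f\n",
       $sxx_naive, $sxx_stable, $sxx_exact;
```

A two-pass formula is not directly usable for streaming data, but it shows where the precision goes; the online fix is the centered update discussed in the comments.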

Finally, here's some sample numeric data (use columns 1 and 3, obviously):

"time(2)"   "capacity(3)"   "available(10)"
1582183652  15357962    3556620
1582183729  15357962    3556620
1582183801  15357962    3556620
1582185601  15357962    3554537
1582187401  15357962    3553028
1582189201  15357962    3551650
1582191001  15357962    3550107
1582192801  15357962    3548724
1582194601  15357962    3545775
1582196401  15357962    3535712
1582198201  15357962    3532808
1582200001  15357962    3530688
1582201801  15357962    3529146
1582203601  15357962    3620938
1582205401  15357962    3619426
1582207201  15357962    3618046
1582209001  15357962    3615828
1582210801  15357962    3614460
1582212601  15357962    3613074
1582214401  15357962    3613051
1582216201  15357962    3612108
1582218002  15357962    3612085
1582219801  15357962    3610363
1582221601  15357962    3610340
1582223401  15357962    3609532
1582225201  15357962    3604908
1582227001  15357962    3601223
1582228801  15357962    3599850
1582230601  15357962    3598986
1582232401  15357962    3598291
1582234201  15357962    3595594
1582236001  15357962    3593270
1582237801  15357962    3592459
1582239601  15357962    3591055
1582241401  15357962    3585715
1582243201  15357962    3584115
1582245001  15357962    3577642
1582246801  15357962    3575523
1582248602  15357962    3572615
1582250401  15357962    3570502
1582252201  15357962    3568953
1582254001  15357962    3568930
1582255801  15357962    3568801
1582257601  15357962    3568778
1582259401  15357962    3567983
1582261201  15357962    3567240
1582263001  15357962    3567105
1582264801  15357962    3567081
1582266601  15357962    3565554
1582268402  15357962    3564858
1582270201  15357962    3562456
1582272001  15357962    3560362
1582704497  15357962    3169346
1582722625  15357962    3201017
1582722834  15357962    3199608
1582723801  15357962    3197253
1582725601  15357962    3192441
1582727401  15357962    3186576
1582729202  15357962    3181764
1582731001  15357962    3175658
1582732801  15357962    3170904
1582734601  15357962    3166522
1582736401  15357962    3162655
1582738201  15357962    3158955
1582740001  15357962    3157452
1582741801  15357962    2740200
1582743601  15357962    2110691
1582745401  15357962    1775101
1582747201  15357962    1774428
1582749001  15357962    1773604
1582750801  15357962    3145823
1582752601  15357962    3140821
1582754401  15357962    3137691
1582756201  15357962    3137304
1582758001  15357962    3137308
1582759801  15357962    3128067
1582761602  15357962    3126471
1582763401  15357962    3120125
1582765201  15357962    3118018
1582767001  15357962    3115086
1582768802  15357962    3113679
1582770601  15357962    3112744
1582772401  15357962    3111363
1582774201  15357962    3109631
1582776001  15357962    3108954
1582777801  15357962    3108161
1582779601  15357962    3106755
1582781401  15357962    3106601
1582783201  15357962    3105909
1582785001  15357962    3105081

Typical use of the code looks like this:

my $lr = LR->new();
$lr->add_sample($x, $y);
...
my ($a, $b, $r) = ($lr->A(), $lr->B(), $lr->r());
U. Windl
  • My guess is that the sum of squares loses more significant digits than the squared sum, especially for big absolute numbers. – U. Windl Feb 27 '20 at 13:07
  • A possible solution could be scaling: First make all numbers (if positive) zero-based (offsetting them), then pre-scale them to be <= 1. So the effects of losing significant digits should be minimized. After calculating the regression, scale back and un-bias them. How does that sound? – U. Windl Feb 27 '20 at 13:09
  • That is indeed one piece of a robust numerical solution. But there are a great many additional pitfalls: you can compute the sums of squares with far more precision; you can solve the Normal equations without losing as much precision (that's what the Cholesky decomposition is all about); and much more. A full solution would require a long chapter in a numerical analysis textbook. And since creating a least-squares solver from scratch should be considered just a learning exercise, consulting a good text might be a good next step. – whuber Feb 27 '20 at 14:02
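The offsetting idea from the comments can be carried out incrementally. Below is a minimal sketch of a centered, Welford-style online update; StableLR and its field names are made up for illustration and are not part of the original LR package. Because it tracks running means and centered co-moments, no two huge, nearly equal sums are ever subtracted:

```perl
use strict;
use warnings;

# Sketch of a centered, Welford-style online regression.  StableLR and
# its field names are illustrative; it is not the original LR package.
package StableLR {
    sub new {
        my $class = shift;
        return bless { n => 0, mean_x => 0, mean_y => 0,
                       c_xx => 0, c_yy => 0, c_xy => 0 }, $class;
    }

    sub add_sample {
        my ($self, $x, $y) = @_;
        my $n  = ++$self->{n};
        my $dx = $x - $self->{mean_x};   # deviations from the OLD means
        my $dy = $y - $self->{mean_y};
        $self->{mean_x} += $dx / $n;
        $self->{mean_y} += $dy / $n;
        # old deviation times NEW deviation: each c_xx/c_yy increment is
        # dx*dx*(n-1)/n >= 0, so the variances can never go negative
        $self->{c_xx} += $dx * ($x - $self->{mean_x});
        $self->{c_yy} += $dy * ($y - $self->{mean_y});
        $self->{c_xy} += $dx * ($y - $self->{mean_y});
    }

    # Same interface as LR: slope B, intercept A, correlation r
    sub B { my $s = shift; $s->{c_xx} ? $s->{c_xy} / $s->{c_xx} : 1 }
    sub A { my $s = shift; $s->{mean_y} - $s->B * $s->{mean_x} }
    sub r {
        my $s = shift;
        my $d = $s->{c_xx} * $s->{c_yy};
        return $d > 0 ? $s->{c_xy} / sqrt($d) : 1;
    }
}
```

Since the c_xx and c_yy increments are non-negative, sqrt() never sees the negative arguments that sigma and r currently croak on. Note that discarding old samples from a window reintroduces some cancellation, though far less than the raw-sums formulas.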

1 Answer


Most software (such as R and MATLAB) uses QR decomposition to solve linear regression by default.

Please check this post for details (especially J.M.'s answer):

What algorithm is used in linear regression?

In addition, I would suggest reading the Convex Optimization book, Appendix C.2 (page 665), for details on complexity and numerical stability.
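To make the QR idea concrete for the two-parameter regression in the question, here is a sketch in plain Perl using modified Gram-Schmidt on the n-by-2 design matrix [1 x]; qr_linreg is a made-up helper name, and a real application would use a linear-algebra library rather than hand-rolled QR:

```perl
use strict;
use warnings;

# Hypothetical helper: fit y = A + B*x by thin QR factorization of the
# design matrix [1 x] (two Gram-Schmidt steps), then back-substitution.
# Assumes at least two samples with non-constant x.
sub qr_linreg {
    my ($xs, $ys) = @_;
    my $n = @$xs;

    # Column 1 is all ones: q1 = 1/r11 with r11 = sqrt(n)
    my $r11 = sqrt($n);

    # r12 = q1 . x ; v = x - r12*q1 ; r22 = |v| ; q2 = v/r22
    my $r12 = 0;
    $r12 += $_ / $r11 for @$xs;
    my @v  = map { $_ - $r12 / $r11 } @$xs;   # r12/r11 is the mean of x
    my $ss = 0;
    $ss += $_**2 for @v;
    my $r22 = sqrt($ss);

    # c = Q' * y
    my ($c1, $c2) = (0, 0);
    for my $i (0 .. $n - 1) {
        $c1 += $ys->[$i] / $r11;
        $c2 += $v[$i] / $r22 * $ys->[$i];
    }

    # Back-substitute R * (A, B)' = (c1, c2)'
    my $B = $c2 / $r22;
    my $A = ($c1 - $r12 * $B) / $r11;
    return ($A, $B);
}
```

Note that orthogonalizing against the column of ones subtracts the mean of x (r12/r11), which is exactly why QR sidesteps the cancellation hidden in the raw sums of squares.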

Haitao Du
  • OK, I was hoping someone could point me at the root of the problem. It'll take some time to read all the recommended readings. – U. Windl Feb 27 '20 at 09:32