30

Is there any command-line tool that accepts a stream of numbers (in ASCII format) on standard input and gives basic descriptive statistics for that stream, such as min, max, average, median, RMS, quantiles, etc.? Ideally the output should be parseable by the next command in the pipeline. The working environment is Linux, but other options are welcome.

mbaitoff
  • 757
  • 1
  • 8
  • 16
  • 1
    I would recommend taking a look at [|STAT](http://hcibib.org/perlman/stat/). That's pretty old software, yet it is very convenient for such things. There's also [pyp](https://code.google.com/p/pyp/), and [several](http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/) [other](http://www.drbunsen.org/explorations-in-unix/) Un*x tools. – chl Sep 24 '13 at 15:34
  • @chl The |STAT link is broken. Can you update it or make it an answer, please? – Léo Léopold Hertz 준영 Jul 16 '15 at 09:54
  • 1
    @Masi Yup, it looks like the page no longer exists. Here is an [updated link](http://old.sigchi.org/~perlman/stat/). – chl Jul 16 '15 at 10:05
  • http://stackoverflow.com/questions/9789806/command-line-utility-to-print-statistics-of-numbers-in-linux || http://serverfault.com/questions/548322/tool-to-do-statistics-in-the-linux-command-line – Ciro Santilli 新疆再教育营六四事件法轮功郝海东 Oct 12 '15 at 11:14

12 Answers

28

You can do this with R, which may be a bit of overkill...

EDIT 2: [Oops, looks like someone else hit on Rscript while I was retyping this.] I found an easier way. Installed along with R should be Rscript, which is meant to do exactly what you're trying to do. For example, if I have a file bar which has a list of numbers, one per line:

Rscript -e 'summary (as.numeric (readLines ("stdin")))' < bar

will send the numbers in the file to R and run R's summary command on them, returning something like:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    2.25    3.50    3.50    4.75    6.00 

You could also do something like:

Rscript -e 'quantile (as.numeric (readLines ("stdin")), probs=c(0.025, 0.5, 0.975))'

to get quantiles. And you could obviously chop off the first line of output (which contains labels) with something like:

Rscript -e 'summary (as.numeric (readLines ("stdin")))' < bar | tail -n +2

I'd highly recommend doing what you want in interactive R first, to make sure you have the command correct. When I tried this and left out the closing parenthesis, Rscript returned nothing -- no error message, no result, just nothing.

For the record, the file bar contains:

1
2
3
4
5
6
Michael Mior
  • 101
  • 4
Wayne
  • 19,981
  • 4
  • 50
  • 99
  • So, I should prepend those `R` commands to my stream? – mbaitoff Mar 20 '12 at 15:27
  • @mbaitoff: Yes. For my test, I created a file `foo` which contained the `summary (as.numeric (readLines()))` as its first line, then one numeric data item per line for the rest of the file. The `readLines()` is just reading from stdin (which is all of what follows, until the end of file). – Wayne Mar 20 '12 at 16:01
  • Looks like we're seriously stuck on `R` in both answers, and it seems to be a huge tool for a tiny task. Well, the answers work, but anyway, is there anything else besides `R`? – mbaitoff Mar 21 '12 at 05:12
  • 2
    @mbaitoff: You could use Python with `scipy`, especially if you already use Python. If you use/like Clojure (lisp based on JVM, http://clojure.org/), there's the `Incanter` (http://incanter.org/) statistical environment built on that. You could also try gnu Octave. – Wayne Mar 21 '12 at 13:14
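
A minimal stdin-to-summary sketch along the lines of the scipy suggestion above, assuming Python 3 with SciPy installed (scipy.stats.describe reports count, min/max, mean, variance, skewness, and kurtosis):

seq 1 10 | python3 -c 'import sys; from scipy import stats; print(stats.describe([float(x) for x in sys.stdin]))'

The same idea works with the standard-library statistics module if SciPy is not available.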
26

Try "st":

$ seq 1 10 | st
N   min   max   sum   mean  stddev
10  1     10    55    5.5   3.02765

$ seq 1 10 | st --transpose
N       10
min     1
max     10
sum     55
mean    5.5
stddev  3.02765

You can also see the five number summary:

$ seq 1 10 | st --summary
min  q1    median   q3    max
1    3.5   5.5      7.5   10

You can download it here:

https://github.com/nferraz/st

(DISCLAIMER: I wrote this tool :))

user2747481
  • 161
  • 2
  • 3
11

R provides a command called Rscript. If you have only a few numbers that you can paste on the command line, use this one-liner:

Rscript -e 'summary(as.numeric(commandArgs(TRUE)))' 3 4 5 9 7

which results in

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
3.0     4.0     5.0     5.6     7.0     9.0 

If you want to read from standard input, use this:

echo 3 4 5 9 7 | Rscript -e 'summary(as.numeric(read.table(file("stdin"))))'

If the numbers on standard input are separated by line breaks (i.e. one number per line), use

Rscript -e 'summary(as.numeric(read.table(file("stdin"))[,1]))'

One can create aliases for these commands:

alias summary='Rscript -e "summary(as.numeric(read.table(file(\"stdin\"))[,1]))"'
du -s /usr/bin/* | cut -f1 | summary
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.0     8.0    20.0    93.6    44.0  6528.0
whuber
  • 281,159
  • 54
  • 637
  • 1,101
Arnaud A
  • 173
  • 6
  • +1: Sorry I'd just found Rscript and edited my answer to include this, so we've ended up with a similar answer. Your `read.table` idea is a nice way to get around one-item-per-line. – Wayne Mar 20 '12 at 18:59
  • Ok, thanks for the acknowledgement and the +1. – Arnaud A Mar 20 '12 at 22:45
7

datamash is another great option. It's from the GNU Project.

If you have Homebrew/Linuxbrew, you can do:

brew install datamash
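
A hedged usage sketch (min, max, mean, and median are standard per-field datamash operations; exact output formatting may vary by version):

seq 1 10 | datamash min 1 max 1 mean 1 median 1

This should print the four statistics tab-separated (1, 10, 5.5, 5.5 for this input), which makes the output easy to feed into the next command in a pipeline.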

Owen
  • 143
  • 1
  • 6
  • It's amazing for calculating stats over access logs. For example `cat only_times.txt | datamash --header-out mean 1 perc:50 1 perc:90 1 perc:95 1 perc:97 1 perc:99 1 perc:100 1 | column -t`. – Pavel Patrin Jul 29 '21 at 11:20
5

Yet another tool that can be used for calculating statistics and viewing the distribution in ASCII mode is ministat. It's a tool from FreeBSD, but it is also packaged for popular Linux distributions like Debian/Ubuntu.

Usage example:

$ cat test.log 
Handled 1000000 packets.Time elapsed: 7.575278
Handled 1000000 packets.Time elapsed: 7.569267
Handled 1000000 packets.Time elapsed: 7.540344
Handled 1000000 packets.Time elapsed: 7.547680
Handled 1000000 packets.Time elapsed: 7.692373
Handled 1000000 packets.Time elapsed: 7.390200
Handled 1000000 packets.Time elapsed: 7.391308
Handled 1000000 packets.Time elapsed: 7.388075

$ cat test.log| awk '{print $5}' | ministat -w 74
x <stdin>
+--------------------------------------------------------------------------+
| x                                                                        |
|xx                                   xx    x x                           x|
|   |__________________________A_______M_________________|                 |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   8      7.388075      7.692373       7.54768     7.5118156    0.11126122
dr.
  • 101
  • 1
  • 3
  • Because simple statistics are so easy to calculate, it's tempting to write an ad-hoc solution time and again. For a programmer it's a habit. I guess I've written hundreds of one-liners in octave, jq, r, shell or awk to calculate basic statistics. For me, this particular answer proved to be the most practicable solution. It's small, simple, ubiquitous, and my distro's package repository has it. Using it might actually become a habit that sticks (finally). – wnrph Dec 10 '21 at 14:00
2

There is sta, which is a C++ variant of st, also referenced in these comments.

Being written in C++, it's fast and can handle large datasets. It's simple to use, includes a choice of unbiased or biased estimators, and can output more detailed information such as the standard error.
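
Assuming it follows the same read-numbers-from-stdin convention as st (an assumption based on the description above, not verified here), basic usage would look like:

seq 1 10 | sta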

You can download sta at github.

Disclaimer: I'm the author of sta.

Simon
  • 101
  • 3
2

There is also simple-r, which can do almost everything that R can, but with fewer keystrokes:

https://code.google.com/p/simple-r/

To calculate basic descriptive statistics, one would have to type one of:

r summary file.txt
r summary - < file.txt
cat file.txt | r summary -

Doesn't get any simple-R!

user30888
  • 1
  • 1
1

You might also consider using clistats. It is a highly configurable command-line tool for computing statistics over a stream of delimited input numbers (a usage sketch follows the feature lists below).

I/O options

  • Input data can be from a file, standard input, or a pipe
  • Output can be written to a file, standard output, or a pipe
  • Output uses headers that start with "#" to enable piping to gnuplot

Parsing options

  • Signal, end-of-file, or blank line based detection to stop processing
  • Comment and delimiter character can be set
  • Columns can be filtered out from processing
  • Rows can be filtered out from processing based on numeric constraint
  • Rows can be filtered out from processing based on string constraint
  • Initial header rows can be skipped
  • Fixed number of rows can be processed
  • Duplicate delimiters can be ignored
  • Rows can be reshaped into columns
  • Strictly enforce that only rows of the same size are processed
  • A row containing column titles can be used to title output statistics

Statistics options

  • Summary statistics (Count, Minimum, Mean, Maximum, Standard deviation)
  • Covariance
  • Correlation
  • Least squares offset
  • Least squares slope
  • Histogram
  • Raw data after filtering
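
As a hedged sketch of the simplest invocation, relying only on the stdin/pipe support listed above (the default set of output statistics may differ by version):

seq 1 10 | clistats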

NOTE: I'm the author.

dpmcmlxxvi
  • 111
  • 4
1

Just in case, there's datastat

https://sourceforge.net/p/datastat/code/

a simple program for Linux that computes simple statistics from the command line. For example,

cat file.dat | datastat

will output the average value across all rows for each column of file.dat. If you need to know the standard deviation, min, max, you can add the --dev, --min and --max options, respectively.
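
For instance, combining the options named in this answer (a sketch; consult the tool's documentation for the full list of flags):

cat file.dat | datastat --dev --min --max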

datastat can also aggregate rows based on the value of one or more "key" columns.

It's written in C++, runs fast with a small memory footprint, and can be piped nicely into other tools such as cut, grep, sed, sort, awk, etc.

Tommaso
  • 11
  • 3
0

Too much memory and processor power, folks. Using R for something like this is roughly like getting a sledgehammer to kill a mosquito. Use your favorite language and implement a provisional means algorithm. For the mean: $$\bar{x}_n = \frac{(n-1)\,\bar{x}_{n-1} + x_n}{n},$$ and for the variance: $$s^2_n = \frac{S_n}{n-1}, \qquad S_n = S_{n-1} + (x_n-\bar{x}_{n-1})(x_n-\bar{x}_n).$$

Take $\bar{x}_0 = S_0 = 0$ as starting values. Modifications are available for weighted analyses. You can do the computations with two double precision reals and a counter.
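
As a sketch, the recursions above translate directly into a few lines of awk, reading one number per line from stdin:

# Welford-style running mean and sample variance; awk variables start at 0,
# matching the xbar_0 = S_0 = 0 initialization above
awk '{ n++; d = $1 - mean; mean += d / n; S += d * ($1 - mean) }
     END { if (n > 1) printf "n=%d mean=%g var=%g sd=%g\n", n, mean, S/(n-1), sqrt(S/(n-1)) }'

Piping seq 1 10 into this program should report a mean of 5.5 and a sample variance of about 9.167.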

Dennis
  • 1,710
  • 9
  • 15
  • Try it with some $x_i$ equivalent to `FLOAT_MAX-1.0` or with $x_i-x_{i+1}$ very small but $x_i-x_{i-1}$ very large. – mbaitoff Jul 18 '14 at 02:59
  • This is actually what [clistats](https://github.com/dpmcmlxxvi/clistats) does (see answer for details and other features). – dpmcmlxxvi Jun 25 '15 at 19:46
0

Stumbled across this old thread looking for something else. I wanted the same thing, couldn't find anything simple, so I did it in Perl. It's fairly trivial, but I use it multiple times a day: http://moo.nac.uci.edu/~hjm/stats

example:

 $ ls -l | scut -f=4 | stats                
Sum       9702066453
Number    501
Mean      19365402.1017964
Median    4451
Mode      4096  
NModes    15
Min       0
Max       2019645440
Range     2019645440
Variance  1.96315423371944e+16
Std_Dev   140112605.91822
SEM       6259769.58393047
Skew      10.2405932543676
Std_Skew  93.5768354979843
Kurtosis  117.69005473429

(scut is a slower, but arguably easier-to-use, version of cut: http://moo.nac.uci.edu/~hjm/scut, described here: http://moo.nac.uci.edu/~hjm/scut_cols_HOWTO.html)

0

Another tool: tsv-summarize from eBay's TSV Utilities. Supports many of the basic summary statistics, like min, max, mean, median, quantiles, standard deviation, MAD, and a few more. It is intended for large datasets and supports multiple fields and grouping by key. Output is tab separated. An example for the sequence of numbers 1 to 1000, one per line:

$ seq 1000 | tsv-summarize --min 1 --max 1 --median 1 --sum 1
1   1000    500.5   500500

Headers are normally generated from a header line in the input. If the input has no header one can be added using the -w switch:

$ seq 1000 | tsv-summarize -w --min 1 --max 1 --median 1 --sum 1
field1_min  field1_max  field1_median   field1_sum
1   1000    500.5   500500

Disclaimer: I'm the author.

JonDeg
  • 101
  • 3