30

Is there any command-line tool that accepts a stream of numbers (in ASCII format) on standard input and gives basic descriptive statistics for that stream, such as min, max, average, median, RMS, quantiles, etc.? Ideally the output should be parseable by the next command in the pipeline. The working environment is Linux, but other options are welcome.

mbaitoff
  • 757
  • 1
  • 8
  • 16
  • 1
    I would recommend taking a look at [|STAT](http://hcibib.org/perlman/stat/). That's pretty old software, yet it is very convenient for such things. There's also [pyp](https://code.google.com/p/pyp/), and [several](http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/) [other](http://www.drbunsen.org/explorations-in-unix/) Un*x tools. – chl Sep 24 '13 at 15:34
  • @chl The |STAT link is broken. Can you update it or make it an answer, please? – Léo Léopold Hertz 준영 Jul 16 '15 at 09:54
  • 1
    @Masi Yup, it looks like the page no longer exists. Here is an [updated link](http://old.sigchi.org/~perlman/stat/). – chl Jul 16 '15 at 10:05
  • http://stackoverflow.com/questions/9789806/command-line-utility-to-print-statistics-of-numbers-in-linux || http://serverfault.com/questions/548322/tool-to-do-statistics-in-the-linux-command-line – Ciro Santilli 新疆再教育营六四事件法轮功郝海东 Oct 12 '15 at 11:14

12 Answers

28

You can do this with R, which may be a bit of overkill...

EDIT 2: [Oops, looks like someone else hit on Rscript while I was retyping this.] I found an easier way. Installed along with R should be Rscript, which is meant to do exactly what you're trying to do. For example, if I have a file bar which has a list of numbers, one per line:

Rscript -e 'summary (as.numeric (readLines ("stdin")))' < bar

will send the numbers in the file to R and run R's summary command on them, returning something like:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    2.25    3.50    3.50    4.75    6.00 

You could also do something like:

Rscript -e 'quantile (as.numeric (readLines ("stdin")), probs=c(0.025, 0.5, 0.975))'

to get quantiles. And you could obviously chop off the first line of output (which contains labels) with something like:

Rscript -e 'summary (as.numeric (readLines ("stdin")))' < bar | tail -n +2

I'd highly recommend doing what you want in interactive R first, to make sure you have the command correct. When I tried this and left out the closing parenthesis, Rscript returned nothing -- no error message, no result, just nothing.

For the record, the file bar contains:

1
2
3
4
5
6
Michael Mior
  • 101
  • 4
Wayne
  • 19,981
  • 4
  • 50
  • 99
  • So, I should prepend those `R` commands to my stream? – mbaitoff Mar 20 '12 at 15:27
  • @mbaitoff: Yes. For my test, I created a file `foo` which contained the `summary (as.numeric (readLines()))` as its first line, then one numeric data item per line for the rest of the file. The `readLines()` is just reading from stdin (which is all of what follows, until the end of file). – Wayne Mar 20 '12 at 16:01
  • Looks like we're seriously stuck on `R` in both answers, and it seems to be a huge tool for a tiny task. Well, the answers work, but anyway, is there anything else besides `R`? – mbaitoff Mar 21 '12 at 05:12
  • 2
    @mbaitoff: You could use Python with `scipy`, especially if you already use Python. If you use/like Clojure (lisp based on JVM, http://clojure.org/), there's the `Incanter` (http://incanter.org/) statistical environment built on that. You could also try gnu Octave. – Wayne Mar 21 '12 at 13:14
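
A minimal stdin-to-summary sketch along the lines of the scipy suggestion above, assuming Python 3 with SciPy installed (scipy.stats.describe reports count, min/max, mean, variance, skewness, and kurtosis):

seq 1 10 | python3 -c 'import sys; from scipy import stats; print(stats.describe([float(x) for x in sys.stdin]))'

The same idea works with the standard-library statistics module if SciPy is not available.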
26

Try "st":

$ seq 1 10 | st
N   min   max   sum   mean  stddev
10  1     10    55    5.5   3.02765

$ seq 1 10 | st --transpose
N       10
min     1
max     10
sum     55
mean    5.5
stddev  3.02765

You can also see the five number summary:

$ seq 1 10 | st --summary
min  q1    median   q3    max
1    3.5   5.5      7.5   10

You can download it here:

https://github.com/nferraz/st

(DISCLAIMER: I wrote this tool :))

user2747481
  • 161
  • 2
  • 3
11

R provides a command called Rscript. If you have only a few numbers that you can paste on the command line, use this one-liner:

Rscript -e 'summary(as.numeric(commandArgs(TRUE)))' 3 4 5 9 7

which results in

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
3.0     4.0     5.0     5.6     7.0     9.0 

If you want to read from standard input, use this:

echo 3 4 5 9 7 | Rscript -e 'summary(as.numeric(read.table(file("stdin"))))'

If the numbers on standard input are separated by line breaks (i.e. one number per line), use

Rscript -e 'summary(as.numeric(read.table(file("stdin"))[,1]))'

One can create aliases for these commands:

alias summary='Rscript -e "summary(as.numeric(read.table(file(\"stdin\"))[,1]))"'
du -s /usr/bin/* | cut -f1 | summary
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.0     8.0    20.0    93.6    44.0  6528.0
whuber
  • 281,159
  • 54
  • 637
  • 1,101
Arnaud A
  • 173
  • 6
  • +1: Sorry I'd just found Rscript and edited my answer to include this, so we've ended up with a similar answer. Your `read.table` idea is a nice way to get around one-item-per-line. – Wayne Mar 20 '12 at 18:59
  • Ok, thanks for the acknowledgement and the +1. – Arnaud A Mar 20 '12 at 22:45
7

datamash is another great option. It's from the GNU Project.

If you have Homebrew/Linuxbrew, you can do:

brew install datamash
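
A hedged usage sketch (min, max, mean, and median are standard per-field datamash operations; exact output formatting may vary by version):

seq 1 10 | datamash min 1 max 1 mean 1 median 1

This should print the four statistics tab-separated (1, 10, 5.5, 5.5 for this input), which makes the output easy to feed into the next command in a pipeline.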

Owen
  • 143
  • 1
  • 6
  • It's amazing for calculating stats over access logs. For example `cat only_times.txt | datamash --header-out mean 1 perc:50 1 perc:90 1 perc:95 1 perc:97 1 perc:99 1 perc:100 1 | column -t`. – Pavel Patrin Jul 29 '21 at 11:20
5

Yet another tool that can be used for calculating statistics and viewing the distribution in ASCII mode is ministat. It's a tool from FreeBSD, but it is also packaged for popular Linux distributions like Debian/Ubuntu.

Usage example:

$ cat test.log 
Handled 1000000 packets.Time elapsed: 7.575278
Handled 1000000 packets.Time elapsed: 7.569267
Handled 1000000 packets.Time elapsed: 7.540344
Handled 1000000 packets.Time elapsed: 7.547680
Handled 1000000 packets.Time elapsed: 7.692373
Handled 1000000 packets.Time elapsed: 7.390200
Handled 1000000 packets.Time elapsed: 7.391308
Handled 1000000 packets.Time elapsed: 7.388075

$ cat test.log| awk '{print $5}' | ministat -w 74
x <stdin>
+--------------------------------------------------------------------------+
| x                                                                        |
|xx                                   xx    x x                           x|
|   |__________________________A_______M_________________|                 |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   8      7.388075      7.692373       7.54768     7.5118156    0.11126122
dr.
  • 101
  • 1
  • 3
  • Because simple statistics are so easy to calculate, it's tempting to write an ad-hoc solution time and again. For a programmer it's a habit. I guess I've written hundreds of one-liners in octave, jq, r, shell or awk to calculate basic statistics. For me, this particular answer proved to be the most practicable solution. It's small, simple, ubiquitous, and my distro's package repository has it. Using it might actually become a habit that sticks (finally). – wnrph Dec 10 '21 at 14:00
2

There is sta, which is a C++ variant of st, also referenced in these comments.

Being written in C++, it's fast and can handle large datasets. It's simple to use, includes a choice of unbiased or biased estimators, and can output more detailed information such as the standard error.
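
Assuming it follows the same read-numbers-from-stdin convention as st (an assumption based on the description above, not verified here), basic usage would look like:

seq 1 10 | sta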

You can download sta at github.

Disclaimer: I'm the author of sta.

Simon
  • 101
  • 3
2

There is also simple-r, which can do almost everything that R can, but with fewer keystrokes:

https://code.google.com/p/simple-r/

To calculate basic descriptive statistics, one would have to type one of:

r summary file.txt
r summary - < file.txt
cat file.txt | r summary -

Doesn't get any simple-R!

user30888
  • 1
  • 1
1

You might also consider using clistats. It is a highly configurable command-line tool for computing statistics over a stream of delimited input numbers (a usage sketch follows the feature lists below).

I/O options

  • Input data can be from a file, standard input, or a pipe
  • Output can be written to a file, standard output, or a pipe
  • Output uses headers that start with "#" to enable piping to gnuplot

Parsing options

  • Signal, end-of-file, or blank line based detection to stop processing
  • Comment and delimiter character can be set
  • Columns can be filtered out from processing
  • Rows can be filtered out from processing based on numeric constraint
  • Rows can be filtered out from processing based on string constraint
  • Initial header rows can be skipped
  • Fixed number of rows can be processed
  • Duplicate delimiters can be ignored
  • Rows can be reshaped into columns
  • Strictly enforce that only rows of the same size are processed
  • A row containing column titles can be used to title output statistics

Statistics options

  • Summary statistics (Count, Minimum, Mean, Maximum, Standard deviation)
  • Covariance
  • Correlation
  • Least squares offset
  • Least squares slope
  • Histogram
  • Raw data after filtering
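
As a hedged sketch of the simplest invocation, relying only on the stdin/pipe support listed above (the default set of output statistics may differ by version):

seq 1 10 | clistats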

NOTE: I'm the author.

dpmcmlxxvi
  • 111
  • 4
1

Just in case, there's datastat

https://sourceforge.net/p/datastat/code/

a simple program for Linux that computes simple statistics from the command line. For example,

cat file.dat | datastat

will output the average value across all rows for each column of file.dat. If you need to know the standard deviation, min, max, you can add the --dev, --min and --max options, respectively.
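
For instance, combining the options named in this answer (a sketch; consult the tool's documentation for the full list of flags):

cat file.dat | datastat --dev --min --max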

datastat can also aggregate rows based on the value of one or more "key" columns.

It's written in C++, runs fast with a small memory footprint, and can be piped nicely into other tools such as cut, grep, sed, sort, awk, etc.

Tommaso
  • 11
  • 3
0

Too much memory and processor power, folks. Using R for something like this is roughly like getting a sledgehammer to kill a mosquito. Use your favorite language and implement a provisional means algorithm. For the mean: $$\bar{x}_n = \frac{(n-1)\,\bar{x}_{n-1} + x_n}{n},$$ and for the variance: $$s^2_n = \frac{S_n}{n-1}, \qquad S_n = S_{n-1} + (x_n-\bar{x}_{n-1})(x_n-\bar{x}_n).$$

Take $\bar{x}_0 = S_0 = 0$ as starting values. Modifications are available for weighted analyses. You can do the computations with two double precision reals and a counter.
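
As a sketch, the recursions above translate directly into a few lines of awk, reading one number per line from stdin:

# Welford-style running mean and sample variance; awk variables start at 0,
# matching the xbar_0 = S_0 = 0 initialization above
awk '{ n++; d = $1 - mean; mean += d / n; S += d * ($1 - mean) }
     END { if (n > 1) printf "n=%d mean=%g var=%g sd=%g\n", n, mean, S/(n-1), sqrt(S/(n-1)) }'

Piping seq 1 10 into this program should report a mean of 5.5 and a sample variance of about 9.167.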

Dennis
  • 1,710
  • 9
  • 15
  • Try it with some $x_i$ equivalent to `FLOAT_MAX-1.0` or with $x_i-x_{i+1}$ very small but $x_i-x_{i-1}$ very large. – mbaitoff Jul 18 '14 at 02:59
  • This is actually what [clistats](https://github.com/dpmcmlxxvi/clistats) does (see answer for details and other features). – dpmcmlxxvi Jun 25 '15 at 19:46
0

Stumbled across this old thread looking for something else. I wanted the same thing, couldn't find anything simple, so I did it in Perl. It's fairly trivial, but I use it multiple times a day: http://moo.nac.uci.edu/~hjm/stats

example:

 $ ls -l | scut -f=4 | stats                
Sum       9702066453
Number    501
Mean      19365402.1017964
Median    4451
Mode      4096  
NModes    15
Min       0
Max       2019645440
Range     2019645440
Variance  1.96315423371944e+16
Std_Dev   140112605.91822
SEM       6259769.58393047
Skew      10.2405932543676
Std_Skew  93.5768354979843
Kurtosis  117.69005473429

(scut is a slower, but arguably easier-to-use, version of cut: http://moo.nac.uci.edu/~hjm/scut, described here: http://moo.nac.uci.edu/~hjm/scut_cols_HOWTO.html)

0

Another tool: tsv-summarize from eBay's TSV Utilities. Supports many of the basic summary statistics, like min, max, mean, median, quantiles, standard deviation, MAD, and a few more. It is intended for large datasets and supports multiple fields and grouping by key. Output is tab separated. An example for the sequence of numbers 1 to 1000, one per line:

$ seq 1000 | tsv-summarize --min 1 --max 1 --median 1 --sum 1
1   1000    500.5   500500

Headers are normally generated from a header line in the input. If the input has no header one can be added using the -w switch:

$ seq 1000 | tsv-summarize -w --min 1 --max 1 --median 1 --sum 1
field1_min  field1_max  field1_median   field1_sum
1   1000    500.5   500500

Disclaimer: I'm the author.

JonDeg
  • 101
  • 3