Is there any command-line tool that accepts the flow of numbers (in ascii format) from standard input and gives the basic descriptive statistics for this flow, such as min, max, average, median, RMS, quantiles etc? The output is welcome to be parseable by the next command in command-line chain. Working environment is Linux, but other options are welcome.
-
I would recommend taking a look at [|STAT](http://hcibib.org/perlman/stat/). It's pretty old software, yet it is very convenient for such things. There's also [pyp](https://code.google.com/p/pyp/), and [several](http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/) [other](http://www.drbunsen.org/explorations-in-unix/) Un*x tools. – chl Sep 24 '13 at 15:34
-
@chl The |STAT link is broken. Can you update it or make it an answer, please? – Léo Léopold Hertz 준영 Jul 16 '15 at 09:54
-
@Masi Yup, it looks like the page no longer exists. Here is an [updated link](http://old.sigchi.org/~perlman/stat/). – chl Jul 16 '15 at 10:05
-
http://stackoverflow.com/questions/9789806/command-line-utility-to-print-statistics-of-numbers-in-linux || http://serverfault.com/questions/548322/tool-to-do-statistics-in-the-linux-command-line – Ciro Santilli 新疆再教育营六四事件法轮功郝海东 Oct 12 '15 at 11:14
12 Answers
You can do this with R, which may be a bit of overkill...
EDIT 2: [OOPS, looks like someone else hit on `Rscript` while I was retyping this.] I found an easier way. Installed with R should be `Rscript`, which is meant to do what you're trying to do. For example, if I have a file `bar` which has a list of numbers, one per line:
Rscript -e 'summary (as.numeric (readLines ("stdin")))' < bar
will send the numbers in the file into R and run R's `summary` command on the lines, returning something like:
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 2.25 3.50 3.50 4.75 6.00
You could also do something like:
Rscript -e 'quantile (as.numeric (readLines ("stdin")), probs=c(0.025, 0.5, 0.975))'
to get quantiles. And you could obviously chop off the first line of output (which contains labels) with something like:
Rscript -e 'summary (as.numeric (readLines ("stdin")))' < bar | tail -n +2
I'd highly recommend doing what you want in interactive R first, to make sure you have the command correct. In trying this, I left out the closing parenthesis and Rscript returns nothing -- no error message, no result, just nothing.
(For the record, file bar contains:
1
2
3
4
5
6

-
@mbaitoff: Yes. For my test, I created a file `foo` which contained the `summary (as.numeric (readLines()))` as its first line, then one numeric data item per line for the rest of the file. The `readLines()` is just reading from stdin (which is all of what follows, until the end of file). – Wayne Mar 20 '12 at 16:01
-
Looks like we're seriously stuck with `R` in both answers, and it seems to be a huge tool for a tiny task. Well, the answers work, but anyway, is there something else besides `R`? – mbaitoff Mar 21 '12 at 05:12
-
@mbaitoff: You could use Python with `scipy`, especially if you already use Python. If you use/like Clojure (a Lisp on the JVM, http://clojure.org/), there's the `Incanter` (http://incanter.org/) statistical environment built on that. You could also try GNU Octave. – Wayne Mar 21 '12 at 13:14
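For completeness, the same kind of summary can be sketched without R at all, using only `awk` (a minimal sketch; it uses the usual sample-standard-deviation formula sqrt((Σx² − (Σx)²/N)/(N−1)), so its output matches tools that report the sample, not population, standard deviation):

```shell
# Minimal sketch: basic descriptive stats in plain awk, no extra tools needed.
seq 1 10 | awk '
  { s += $1; ss += $1 * $1            # running sum and sum of squares
    if (NR == 1 || $1 < min) min = $1 # track minimum
    if (NR == 1 || $1 > max) max = $1 # track maximum
  }
  END {
    mean = s / NR
    print "N", NR
    print "min", min
    print "max", max
    print "mean", mean
    print "stddev", sqrt((ss - s * s / NR) / (NR - 1))  # sample stddev
  }'
```

This prints one labeled statistic per line, so the output stays easy to parse in a pipeline.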
Try "st":
$ seq 1 10 | st
N min max sum mean stddev
10 1 10 55 5.5 3.02765
$ seq 1 10 | st --transpose
N 10
min 1
max 10
sum 55
mean 5.5
stddev 3.02765
You can also see the five number summary:
$ seq 1 10 | st --summary
min q1 median q3 max
1 3.5 5.5 7.5 10
You can download it here:
(DISCLAIMER: I wrote this tool :))

-
Welcome to the site, @user2747481. Would you mind fleshing this answer out a bit? We would like our answers to be mostly self-contained. Since you are new here, you may want to read our [about page](http://stats.stackexchange.com/about), which contains information for new users. – gung - Reinstate Monica Sep 05 '13 at 15:52
-
Thanks! As of 2019 `st` is available via Homebrew: `brew install st` – Noah Sussman Jan 27 '19 at 23:19
-
Beware that `st` may also refer to `simple terminal`. – Skippy le Grand Gourou Feb 06 '19 at 10:21
-
Nice to have something that installs with minimal dependencies; it worked even from a centos-minimal image! – lost Oct 09 '20 at 11:06
R provides a command called Rscript. If you have only a few numbers that you can paste on the command line, use this one liner:
Rscript -e 'summary(as.numeric(commandArgs(TRUE)))' 3 4 5 9 7
which results in
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.0 4.0 5.0 5.6 7.0 9.0
If you want to read from the standard input use this:
echo 3 4 5 9 7 | Rscript -e 'summary(as.numeric(read.table(file("stdin"))))'
If the numbers on standard input are separated by newlines (i.e. one number per line), use
Rscript -e 'summary(as.numeric(read.table(file("stdin"))[,1]))'
One can create aliases for these commands:
alias summary='Rscript -e "summary(as.numeric(read.table(file(\"stdin\"))[,1]))"'
du -s /usr/bin/* | cut -f1 | summary
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 8.0 20.0 93.6 44.0 6528.0
-
+1: Sorry I'd just found Rscript and edited my answer to include this, so we've ended up with a similar answer. Your `read.table` idea is a nice way to get around one-item-per-line. – Wayne Mar 20 '12 at 18:59
-
datamash is another great option. It's from the GNU Project.
If you have homebrew / linuxbrew you can do:
brew install datamash

-
It's amazing for calculating stats over access logs. For example `cat only_times.txt | datamash --header-out mean 1 perc:50 1 perc:90 1 perc:95 1 perc:97 1 perc:99 1 perc:100 1 | column -t`. – Pavel Patrin Jul 29 '21 at 11:20
Yet another tool that can be used for calculating statistics and viewing the distribution in ASCII mode is ministat. It's a tool from FreeBSD, but it is also packaged for popular Linux distributions like Debian/Ubuntu.
Usage example:
$ cat test.log
Handled 1000000 packets.Time elapsed: 7.575278
Handled 1000000 packets.Time elapsed: 7.569267
Handled 1000000 packets.Time elapsed: 7.540344
Handled 1000000 packets.Time elapsed: 7.547680
Handled 1000000 packets.Time elapsed: 7.692373
Handled 1000000 packets.Time elapsed: 7.390200
Handled 1000000 packets.Time elapsed: 7.391308
Handled 1000000 packets.Time elapsed: 7.388075
$ cat test.log | awk '{print $5}' | ministat -w 74
x <stdin>
+--------------------------------------------------------------------------+
| x |
|xx xx x x x|
| |__________________________A_______M_________________| |
+--------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 8 7.388075 7.692373 7.54768 7.5118156 0.11126122

-
Because simple statistics are so easy to calculate, it's tempting to write an ad-hoc solution time and again. For a programmer it's a habit. I guess I've written hundreds of one-liners in octave, jq, r, shell or awk to calculate basic statistics. For me, this particular answer proved to be the most practicable solution. It's small, simple, ubiquitous, and my distro's package repository has it. Using it might actually become a habit that sticks (finally). – wnrph Dec 10 '21 at 14:00
There is sta, which is a C++ variant of st, also referenced in these comments.
Being written in C++, it's fast and can handle large datasets. It's simple to use, includes the choice of unbiased or biased estimators, and can output more detailed information such as the standard error.
You can download sta at github.
Disclaimer: I'm the author of sta.

There is also simple-r, which can do almost everything that R can, but with fewer keystrokes:
https://code.google.com/p/simple-r/
To calculate basic descriptive statistics, one would have to type one of:
r summary file.txt
r summary - < file.txt
cat file.txt | r summary -
Doesn't get any simple-R!

You might also consider using clistats. It is a highly configurable command-line tool for computing statistics from a stream of delimited input numbers.
I/O options
- Input data can be from a file, standard input, or a pipe
- Output can be written to a file, standard output, or a pipe
- Output uses headers that start with "#" to enable piping to gnuplot
Parsing options
- Signal, end-of-file, or blank line based detection to stop processing
- Comment and delimiter character can be set
- Columns can be filtered out from processing
- Rows can be filtered out from processing based on numeric constraint
- Rows can be filtered out from processing based on string constraint
- Initial header rows can be skipped
- Fixed number of rows can be processed
- Duplicate delimiters can be ignored
- Rows can be reshaped into columns
- Strictly enforce that only rows of the same size are processed
- A row containing column titles can be used to title output statistics
Statistics options
- Summary statistics (Count, Minimum, Mean, Maximum, Standard deviation)
- Covariance
- Correlation
- Least squares offset
- Least squares slope
- Histogram
- Raw data after filtering
NOTE: I'm the author.

Just in case, there's datastat
https://sourceforge.net/p/datastat/code/
a simple program for Linux computing simple statistics from the command-line. For example,
cat file.dat | datastat
will output the average value across all rows for each column of file.dat. If you need to know the standard deviation, min, max, you can add the --dev, --min and --max options, respectively.
datastat has the possibility to aggregate rows based on the value of one or more "key" columns.
It's written in C++, runs fast and with small memory occupation, and can be piped nicely with other tools such as cut, grep, sed, sort, awk, etc.

Too much memory and processor power, folks. Using R for something like this is roughly like getting a sledgehammer to kill a mosquito. Use your favorite language and implement a provisional means algorithm. For the mean: $$\bar{x}_n = \frac{(n-1)\,\bar{x}_{n-1} + x_n}{n},$$ and for the variance: $$s^2_n = \frac{S_n}{n-1}, \qquad S_n = S_{n-1} + (x_n-\bar{x}_{n-1})(x_n-\bar{x}_n).$$
Take $\bar{x}_0 = S_0 = 0$ as starting values. Modifications are available for weighted analyses. You can do the computations with two double precision reals and a counter.
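The update rules above can be sketched as a one-pass `awk` program (a minimal sketch; `awk` works in double precision, matching the "two double precision reals and a counter" description):

```shell
# One-pass (Welford-style) mean and sample variance, following the
# recurrences above: m is the running mean, S accumulates squared
# deviations, n is the counter. Both start at zero.
seq 1 10 | awk '
  { n++
    d = $1 - m           # x_n - mean_{n-1}
    m += d / n           # running mean update
    S += d * ($1 - m)    # S_n = S_{n-1} + (x_n - mean_{n-1})(x_n - mean_n)
  }
  END { print "mean", m; print "variance", S / (n - 1) }'
```

Unlike the naive sum-of-squares approach, this form avoids catastrophic cancellation when the values are large and close together, which is exactly the failure mode raised in the comment below.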

-
Try it with some $x_i$ equivalent to `FLOAT_MAX-1.0` or with $x_i-x_{i+1}$ very small but $x_i-x_{i-1}$ very large. – mbaitoff Jul 18 '14 at 02:59
-
This is actually what [clistats](https://github.com/dpmcmlxxvi/clistats) does (see answer for details and other features). – dpmcmlxxvi Jun 25 '15 at 19:46
Stumbled across this old thread looking for something else. I wanted the same thing, couldn't find anything simple, so I did it in Perl; it's fairly trivial, but I use it multiple times a day: http://moo.nac.uci.edu/~hjm/stats
example:
$ ls -l | scut -f=4 | stats
Sum 9702066453
Number 501
Mean 19365402.1017964
Median 4451
Mode 4096
NModes 15
Min 0
Max 2019645440
Range 2019645440
Variance 1.96315423371944e+16
Std_Dev 140112605.91822
SEM 6259769.58393047
Skew 10.2405932543676
Std_Skew 93.5768354979843
Kurtosis 117.69005473429
(scut is a slower, but arguably easier-to-use, version of cut): http://moo.nac.uci.edu/~hjm/scut described here: http://moo.nac.uci.edu/~hjm/scut_cols_HOWTO.html
Another tool: tsv-summarize from eBay's TSV Utilities. Supports many of the basic summary statistics, like min, max, mean, median, quantiles, standard deviation, MAD, and a few more. It is intended for large datasets and supports multiple fields and grouping by key. Output is tab separated. An example for the sequence of numbers 1 to 1000, one per line:
$ seq 1000 | tsv-summarize --min 1 --max 1 --median 1 --sum 1
1 1000 500.5 500500
Headers are normally generated from a header line in the input. If the input has no header, one can be added using the `-w` switch:
$ seq 1000 | tsv-summarize -w --min 1 --max 1 --median 1 --sum 1
field1_min field1_max field1_median field1_sum
1 1000 500.5 500500
Disclaimer: I'm the author.
