If R were reprogrammed from scratch today, what changes would be most useful to the statistics community?

Question

Many people in the statistics community and other academic fields use R as their primary language for data analysis and statistical computing. It is a wonderful and versatile language that has become extremely popular across both academia and industry. The language has an interesting history: it evolved as an improvement on the S language produced by Bell Labs (see e.g., Chambers 2020). While it is a versatile language, the base version of R has a few well-known drawbacks, such as difficulties dealing with "big data", lack of labels on variables, etc. This base functionality is often supplemented by popular packages, but new users can have difficulty learning the required methods.

Since R was developed essentially as an updated reprogramming of a previous language, it is natural to wonder if there might ever be an effort to create a new language that seeks to build on R. In the event that such a project were ever to occur, what kinds of changes would be most useful to the statistical community?

I just added two separate answers but maybe it should be all collected into a single answer? — Sextus Empiricus, Jan 11 '21 at 11:18
I'm a huge R fan. R, however, is simply great because of its huge environment. It has, virtually, no actual advantages over Julia (which I'm a huge fan as well). So, basically, if R were to be redesigned, it should simply be Julia in my opinion. — Firebug, Jan 11 '21 at 13:11
The language is awful. The value comes from the size of its community that wrote all the packages. At this point the best would be to port libraries to a better language — Aksakal, Jan 11 '21 at 13:30
It's a little dated, but https://www.burns-stat.com/pages/Tutor/R_inferno.pdf is a good summary of the issues with R as a language. — Jordan Bentley, Jan 11 '21 at 17:22
I'd advise against building a new language on top of R. Spend a while reading R's documentation and you find that a lot of functions behave in silly ways because they didn't want to break compatibility with S. Start from scratch. — J. Mini, Jan 11 '21 at 22:01
This is a theoretic question, while the practical solution is [Julia](https://julialang.org/) that is a modern alternative to R. — Tim, Jan 11 '21 at 23:21

score 25 · Answer 1 · answered Jan 11 '21 at 08:49

More consistency in parameter names. For instance:

matrix() has a parameter dimnames.
write.table() has parameters row.names and col.names (with dots, and no dimnames parameter).
There are functions rownames() and colnames(), without dots.

Yes, this is a tiny detail. But I have been using R on a daily basis for almost 20 years now, and I still have to look at ?matrix each and every time, because I try to set row.names and am surprised when this doesn't work.
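As a small illustration of the mismatch described above, here is a sketch using only base R. The object names `m`, `csv_lines` and `attempt` are made up for the example.

```r
# matrix() takes one `dimnames` list; it has no row.names/col.names argument.
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2")))

# The accessor functions are spelled without dots.
rownames(m)
colnames(m)

# write.table() uses dotted argument names for the same idea.
csv_lines <- capture.output(
  write.table(m, sep = ",", row.names = TRUE, col.names = TRUE)
)

# Passing the dotted name to matrix() is rejected as an unused argument.
attempt <- tryCatch(
  matrix(1:4, nrow = 2, row.names = c("r1", "r2")),
  error = function(e) conditionMessage(e)
)
print(attempt)
```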

answered Jan 11 '21 at 08:49

Stephan Kolassa

Arguably, this is one of the few changes that could be done today :) – Firebug Jan 11 '21 at 16:44
+1 I have gotten used to typing the first letters and pausing a short moment to give RStudio the time to hint the correct name. – Sextus Empiricus Jan 11 '21 at 17:28
(+1) I still type `loadRDS` or `writeRDS` often enough to be slightly annoyed by the inconsistency in function names. – Scortchi - Reinstate Monica Jan 11 '21 at 18:34
`dimnames` in a matrix come from `array()`, which is more general and can have more than two dimensions, hence a need for more name vectors. `rownames` and `colnames` are just shortcuts for a matrix. And `row.names` is generic and used in all other situations. So at least some of those inconsistencies cannot be resolved: if we add `rownames` and `colnames` to a `matrix()` it becomes inconsistent with `array()`. – Karolis Koncevičius Jan 11 '21 at 19:24
@KarolisKoncevičius Only dimnames is necessary, all should have it. In its absence, alias into other arguments. – Firebug Jan 11 '21 at 21:09
@Firebug - We can use `dimnames()` all the time with matrices, arrays, and even data frames. So as far as I see - all do have it. But for me `rownames()` and `colnames()` are too convenient to give up. – Karolis Koncevičius Jan 11 '21 at 21:21
@KarolisKoncevičius `write.table` does not have `dimnames` – Firebug Jan 11 '21 at 21:28
This issue extends beyond just parameters to functions. It also applies to the functions themselves. Why do we have `readline` and `readLines` and `colMeans` and `weighted.mean`? These aren't special library functions, these examples of a lack of any consistent naming scheme are all in base R. – J. Mini Jan 11 '21 at 22:06

score 16 · Answer 2 · answered Jan 11 '21 at 14:32

Useful error messages

Compared to other languages (e.g. Python) it is very difficult to track down bugs based on error messages. Error messages are often not even informative about what part of the code causes the bug.
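A minimal sketch of the kind of message meant here, using only base R. The two wrapper functions `row_totals` and `run_analysis` are hypothetical and exist only to create a call chain.

```r
# Hypothetical two-level call chain; the bug is passing a plain vector
# where apply() needs an object with dimensions.
row_totals   <- function(x) apply(x, 1, sum)
run_analysis <- function(x) row_totals(x)

err <- tryCatch(run_analysis(1:3), error = function(e) e)

# The message refers to dim(X), an internal argument name of apply(),
# and says nothing about run_analysis() or row_totals().
msg <- conditionMessage(err)
print(msg)

# The condition object does record the failing call; in interactive use,
# traceback() shows the full chain after an uncaught error.
print(conditionCall(err))
```

So the information needed to locate the problem exists, but the user has to know to ask for it.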

Optional static typing

Easy way to make sure that i is a number (as it is supposed to be) and not a data frame.
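In current R the closest substitute is a runtime check at the top of the function. The sketch below assumes that approach (it is not static typing, since the check only runs when the function is called), and `scale_by` is a made-up example function.

```r
# Runtime argument check standing in for a type declaration.
scale_by <- function(x, i) {
  stopifnot(is.numeric(i), length(i) == 1L)  # i must be a single number
  x * i
}

ok <- scale_by(1:3, 2)

# Passing a data frame for i is caught, but only when the call is executed.
bad <- tryCatch(
  scale_by(1:3, data.frame(a = 1)),
  error = function(e) conditionMessage(e)
)
print(bad)
```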

Some (maybe optional) way to get rid of bugs caused by scoping issues

For example I want to be able to tell a function that it should work just with its arguments and under no circumstances try to find variables in other environments (I'm looking at you global environment).
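To make the problem concrete, here is a sketch in base R. The names `rate`, `add_interest` and `strict` are invented for the example, and resetting the function's enclosing environment to `baseenv()` is just one possible workaround, not an established convention.

```r
rate <- 0.5

# `rate` is not an argument, so R silently looks it up in the enclosing
# environment where the function was defined.
add_interest <- function(x) x * (1 + rate)
from_enclosing <- add_interest(100)

# Workaround sketch: make the function see only base R and its arguments.
strict <- add_interest
environment(strict) <- baseenv()

res <- tryCatch(strict(100), error = function(e) conditionMessage(e))
print(res)   # the lookup of `rate` now fails instead of succeeding silently
```

The drawback of this workaround is that functions from attached packages are also no longer visible inside `strict` unless they are called with `pkg::name`.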

Native support of C++ extensions

Rcpp is a wonderful way to extend R to get performance gains but suffers from the problem that natively R supports only C (not C++). This limits severely what you can do with Rcpp and makes extending R through new packages more difficult than it has to be.

Of course, addressing any of these concerns would require a complete re-design of the language so R wouldn't really be R any longer.

edited Jan 11 '21 at 15:09

answered Jan 11 '21 at 14:32

Andreas Dzemski

Could you add references about two points (the fundamental decisions that make error messages less useful and the scoping issues). To me these are new. – Sextus Empiricus Jan 11 '21 at 14:39
@SextusEmpiricus re error messages. I've deleted the speculation about why this may be difficult to address. Maybe this can be addressed within the current framework. E.g., the error messages within the tidyverse are a lot better than base R errors. – Andreas Dzemski Jan 11 '21 at 15:11
@SextusEmpiricus re scoping issues. maybe there is a better word for the problem I am describing? – Andreas Dzemski Jan 11 '21 at 15:12
Do you mean that `test = function() {return(pi)}` will return whatever value `pi` has in the global environment and it should be instead explicitly defined inside the function or otherwise an error should be returned? – Sextus Empiricus Jan 11 '21 at 15:55
@SextusEmpiricus Yes, exactly. Trips me up almost every time I refactor code. I guess it would maybe suffice to have a linter that complains about references to variables that are not defined within a function. Maybe something like that already exists? – Andreas Dzemski Jan 11 '21 at 19:52
R's bad error messages need to be seen to be believed. You might want to add an example of just how bad it gets. You'll get an error about something called `dim(X)` and not be told what line triggered it, what `X` is, or which function in your code called `dim`. – J. Mini Jan 11 '21 at 22:09

score 12 · Answer 3 · answered Jan 11 '21 at 11:16

Standalone executable

To execute the code you need to have R installed. This is similar to Python, which does however have some programs that can turn Python code into executables.

This makes it more difficult to share programs with users who do not have R installed.

answered Jan 11 '21 at 11:16

Sextus Empiricus

Shiny is sort-of a workaround to this, no? – JTH Jan 11 '21 at 17:15
@JTH shiny allows you to make interactive applications, but not a standalone executable. – Sextus Empiricus Jan 11 '21 at 17:25
I am pretty sure Python executable is just a bundle with portable Python and the .py source code. Certainly this can be done for R, as well. – lvella Jan 11 '21 at 17:58
I am not a python expert but I saw examples of programs that turn python into c++ code (and from that you can make an executable). https://stackoverflow.com/questions/5458048/ – Sextus Empiricus Jan 11 '21 at 18:37
I agree with @Sextus here: "can be done" is not the same as "is automated as a feature of the base program for simple use by a new user". – Ben Jan 11 '21 at 21:56

score 11 · Answer 4 · answered Jan 11 '21 at 13:37

Built-in reproducible environments

If R were designed from scratch, it would be great to have a built-in way to reproducibly use packages and have multiple versions of the same package installed, and bundle information about which packages the code was run with in a single file that could be used to rerun this code with identical packages. Ideally without requiring you to install the same package multiple times.

There are plenty of packages out there to create reproducible R environments, which causes fragmentation, and users do have to use one for their code to be properly reproducible.
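As a rough sketch of the "single file listing the packages the code was run with" idea, the following uses only base R to write and re-read such a file. The file layout and the names `pkgs`, `lock` and `lock_file` are assumptions for illustration; this only records versions and does nothing to install or isolate them, which is the part that would need built-in support.

```r
# Packages the script depends on (here just two that ship with R).
pkgs <- c("stats", "utils")

# Record the version of each package that is currently installed.
versions <- vapply(pkgs, function(p) as.character(packageVersion(p)),
                   character(1))
lock <- data.frame(package = pkgs, version = unname(versions),
                   stringsAsFactors = FALSE)

# Store the record next to the script (a temporary file in this sketch).
lock_file <- tempfile(fileext = ".csv")
write.csv(lock, lock_file, row.names = FALSE)

# Later, or on another machine: read it back and compare.
recorded <- read.csv(lock_file, colClasses = "character")
mismatch <- recorded$package[recorded$version != versions[recorded$package]]
print(mismatch)   # packages whose installed version differs from the record
```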

answered Jan 11 '21 at 13:37

Erik A

I don't think this needs a built-in solution (meaning that it could be incorporated in current R, no need for redesign). Just like Julia has Pkg Project.toml + Manifest.toml, and Python has pip requirements.txt, R could just as well have an environment solution to be used with CRAN registries. – Firebug Jan 11 '21 at 19:06
@Firebug [1/2] I mainly encounter this problem when using random scripts generated by colleagues months to years ago, not when using packages, so integration in packages wouldn't help me. The problem is: I want to do something. Colleague X has done that thing a year ago. I get his script. Try to run it, tons of errors. Worked for him in the past. I install checkpoint, `checkpoint("2019-01-01")`, less errors. Try different dates, never 0 errors. Hours of debugging later, apparently that colleague had a way outdated version of one specific package, and the script works. – Erik A Jan 11 '21 at 20:39
your example is exactly about packages though. An environment like in Python or Julia would solve everything. – Firebug Jan 11 '21 at 20:42
[2/2] If we look at Python, Conda fixes this issue by forcing users to run their stuff in an environment, and by easily generating a file listing all packages installed in that environment. It has its flaws too. What I'd truly like is a built-in way to use "environments", and without integration into R itself, we get stuff like Packrat, which would be nice if people would actually use it, but mostly is a hassle so no-one I know actually uses. While it could be integrated into R Studio instead of R that would make scripts dependent on an IDE which is usually a bad idea. – Erik A Jan 11 '21 at 20:49
Python user nosing in. Forgive me if this is entirely stupid, but can you not run R in Docker to get reproducibility? – user1717828 Jan 11 '21 at 21:08
@user1717828 Of course you can. Will I ever be able to convince my colleagues to do so and archive docker images instead of scripts? Nope :( – Erik A Jan 11 '21 at 21:46

score 9 · Answer 5 · answered Jan 11 '21 at 13:28

Preserving/translating existing `R` packages

Probably the greatest present advantage of R over other statistical computing programs is that it has a huge repository of well-developed packages that perform a broader class of statistical tasks than is available in other programs. In the event that there were any attempt to reprogram a new version from scratch, it would be important to preserve as much of this as possible as valid code that would be compatible with a new program. Consequently, in the event that there is any change in the base program that would render later code obsolete, it would be useful to have a parallel method of "translation" of code into the new program.

score 7 · Answer 6 · answered Jan 11 '21 at 17:01

Bring data.table like syntax to data.frame

data.table's syntax (DT[i, j, by]) is so useful and such a faithful extension of data.frame that it should just be built in at this point. (If we are willing to entertain breaking changes).
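For comparison, here is what a filter, aggregate and group operation looks like on a plain data.frame in base R today, with the corresponding data.table call shown only as a comment. The data and column names are made up for the example.

```r
df <- data.frame(
  g    = c("a", "a", "b", "b"),
  x    = c(1, 2, 3, 10),
  keep = c(TRUE, TRUE, TRUE, FALSE)
)

# With data.table this would be a single call (not run here):
#   DT[keep == TRUE, .(x = sum(x)), by = g]

# Base R needs a separate subsetting step and a formula interface.
kept <- df[df$keep, ]
res  <- aggregate(x ~ g, data = kept, FUN = sum)
print(res)
```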

answered Jan 11 '21 at 17:01

JTH

What's the advantage of having it built-in versus the current package distribution? In my opinion, base-R comes very bloated already. – Firebug Jan 11 '21 at 18:16
I'm only suggesting that it would be nice to have data.table's _syntax_ when working on data.frames. The full force of data.table (e.g. reference semantics) might not be suitable for base R. – JTH Jan 11 '21 at 19:06

score 6 · Answer 7 · answered Jan 11 '21 at 11:15

Object oriented programming

OOP tools were not initially included in the language. Currently there are both S3 and S4 object systems, which leads to a lack of consistency across different code (a problem that is more general than just OOP).
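To show how different the two systems look side by side, here is a sketch defining the same tiny "person" object in each. The class and generic names (`person3`, `Person4`, `describe`, `describe4`) are invented for the example.

```r
library(methods)

# S3: a class is just an attribute, and methods are found by naming convention.
describe <- function(x, ...) UseMethod("describe")
describe.person3 <- function(x, ...) paste("S3 person:", x$name)
p3 <- structure(list(name = "Ada"), class = "person3")

# S4: classes and generics are declared formally, and slots use @.
setClass("Person4", representation(name = "character"))
setGeneric("describe4", function(x, ...) standardGeneric("describe4"))
setMethod("describe4", "Person4",
          function(x, ...) paste("S4 person:", x@name))
p4 <- new("Person4", name = "Ada")

out3 <- describe(p3)
out4 <- describe4(p4)
print(c(out3, out4))
```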

answered Jan 11 '21 at 11:15

Sextus Empiricus

score 5 · Answer 8 · answered Jan 11 '21 at 13:21

Standard object classes/structures for common statistical outputs

There are some special object types that have been developed in R to represent particular kinds of statistical outputs. For example, there are objects of class htest that are used to represent the outputs of a hypothesis test, and objects of class lm, glm, etc., used for the outputs of statistical models. However, there are a number of common statistical outputs that do not have special classes/structures developed. As a result, they tend to be represented in an ad hoc manner. It would be useful for common statistical outputs to have a defined class and structure in the base program, with consistent elements and printing method.

Here are some examples of particular outputs that would benefit from having a developed class/structure, with associated custom print methods, etc. Giving these and other important statistical outputs a standard class/structure would allow users to develop, compute and print these outputs in a way that includes all required information and gives user-friendly print output.

Sets could be represented by appropriate objects such as is presently in the sets package. Having sets as objects in the program would be useful for a number of statistical outputs.
Confidence intervals/sets could be represented as an object of class ci that includes a set object giving the confidence interval/set, the confidence level, the name/description of the parameter or quantity for the interval, and any other required information.
Highest density regions (HDRs) could be represented as an object of class hdr that includes the set object giving the HDR, the coverage probability, and any other required information.
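A rough sketch of what such a class could look like, written as an S3 class in base R. Everything about `ci` here (the constructor `new_ci`, its fields, and the print format) is hypothetical, and it stores plain interval endpoints rather than a set object from the sets package.

```r
# An existing standard class for comparison: t.test() returns an "htest".
tt <- t.test(c(5.1, 4.9, 5.6, 5.8, 6.0))
class(tt)

# Hypothetical constructor for a confidence-interval class.
new_ci <- function(lower, upper, level, parameter) {
  stopifnot(is.numeric(lower), is.numeric(upper), lower <= upper,
            level > 0, level < 1)
  structure(list(lower = lower, upper = upper,
                 level = level, parameter = parameter),
            class = "ci")
}

# Consistent formatting and printing for every object of class "ci".
format.ci <- function(x, digits = 3, ...) {
  sprintf("%g%% CI for %s: [%s, %s]", 100 * x$level, x$parameter,
          format(x$lower, digits = digits), format(x$upper, digits = digits))
}
print.ci <- function(x, ...) {
  cat(format(x, ...), "\n")
  invisible(x)
}

ci_mean <- new_ci(tt$conf.int[1], tt$conf.int[2],
                  attr(tt$conf.int, "conf.level"), "mean")
print(ci_mean)
```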

score 5 · Answer 9 · edited Jan 11 '21 at 16:31

Multithreading by default

R was built as a single-threaded application, but we can do better these days. Sadly, Microsoft R is pretty much discontinued now... it had many benefits over the original. https://mran.microsoft.com/documents/rro/multithread
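For context, the parallelism that base R offers today is opt-in and process-based rather than multithreaded. A minimal sketch with the bundled parallel package follows; note that `mclapply()` forks worker processes and only runs serially on Windows, so the worker count below is an assumption chosen to keep the example portable.

```r
library(parallel)

# mclapply() cannot fork on Windows, so fall back to one worker there.
n_workers <- if (.Platform$OS.type == "windows") 1L else 2L

# The caller has to ask for parallel execution explicitly.
squares <- mclapply(1:4, function(i) i^2, mc.cores = n_workers)
result <- unlist(squares)
print(result)
```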

edited Jan 11 '21 at 16:31

Firebug

answered Jan 11 '21 at 16:21

niemiro

Note that MRO is mainly faster because it uses Intel MKL, and you can compile R using Intel MKL yourself (see [Intel (Linux)](https://software.intel.com/content/www/us/en/develop/articles/using-intel-mkl-with-r.html) and [Stack Overflow (Windows)](https://stackoverflow.com/questions/38090206/linking-intels-math-kernel-library-mkl-to-r-on-windows) on it). Of course, having a nice prebuilt binary that would get updated often, or even better, an installer that has this as an option would be great, but this would by no means require a reprogramming from scratch. – Erik A Jan 11 '21 at 17:45

score 5 · Answer 10 · answered Jan 11 '21 at 16:39

Less reliance on C/C++/Fortran, aka solve the Two-Language Problem

One of the major drawbacks of R is that the actual performant code is mostly written in other languages (C/C++ and even Fortran). This makes development and tinkering way harder (since now new users need to learn at least two, not one, language).

Julia, for example, is Julia all the way down to the LLVM layer. This makes a novice Julia user proficient in both high and low-level functionalities necessary to actually develop a package or simply help improving other packages (not to say that the low-level complexity is easy, but you at least know the language already, to the point that it's not uncommon for newbies to contribute features to the core language).

So, if pure R could be made performant enough, the Two-Language Problem would be overcome. How could that be done? It is easier said than done. Julia took a stance regarding type inference and JIT (just-in-time, i.e. at runtime) compilation, so R might need to give up some of its features to achieve that. Luckily, R (like some other languages) has followed in the footsteps of JIT-compiled languages, and part of this is already available.
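The part that already exists is the byte-code compiler in the bundled compiler package, which recent R versions apply automatically. The sketch below compiles a function explicitly just to show the mechanism; `sum_sq` is a made-up example, and byte-code compilation is a much weaker form of optimisation than Julia's type-specialised native code.

```r
library(compiler)

# A deliberately loop-heavy function written in pure R.
sum_sq <- function(n) {
  s <- 0
  for (i in seq_len(n)) s <- s + i^2
  s
}

# Explicit byte-code compilation; the result is still an ordinary R function.
sum_sq_c <- cmpfun(sum_sq)

plain    <- sum_sq(1000)
compiled <- sum_sq_c(1000)
print(c(plain, compiled))
```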

score 2 · Answer 11 · answered Jan 11 '21 at 12:53

Build wrangling functions and labelled data into the base program

As a general rule, it would be nice to move some of the important functionality in key packages into the base program (as was done for the stats package at one stage). In particular, the base objects in the program should be programmed to use some of the useful wrangling functions for "tidy data" (e.g., per Wickham 2017), should allow easy descriptive labels for all variables, and should handle time as a special variable in a way that is useful for tidy analysis of time-series data.

Some of the wrangling functions for tidy data analysis similar to what exists in the tidyverse should be built into the base program. There are a number of functions in that field that assist in wrangling data frames (but their names can be quite odd, owing to the fact they are not in base). All base objects and functions should be programmed with the principles of tidy data in mind, and with core functions for important wrangling steps. I concede that there is a trade-off here --- you don't want to add too many functions and increase complexity, but you want to add enough functions to do key wrangling steps.
Objects such as vectors, matrices and data frames should allow descriptive labels for their variables, similar to what exists in other languages such as Stata. The labels should be in addition to variable names, to allow variable description or labels for printing.
The base program should handle time variables in a way that allows simple ordering of time-series, and standard operations you want to do with time variables. Presently most of this is in packages in R such as lubridate and zoo.
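To illustrate the label point with current base R: a label can be attached as an attribute, but nothing in base treats it specially, so it is lost on ordinary operations such as sorting rows. The attribute name `"label"` and the example data are assumptions for this sketch.

```r
df <- data.frame(bmi = c(21.5, 27.3, 24.0))

# Attach a descriptive label as a plain attribute on the column.
attr(df$bmi, "label") <- "Body mass index (kg/m^2)"
label_before <- attr(df$bmi, "label")

# Base R does handle dates natively, so ordering by time works directly.
df$visit <- as.Date(c("2021-03-01", "2021-01-11", "2021-02-15"))
df_sorted <- df[order(df$visit), ]
gaps <- diff(df_sorted$visit)     # days between consecutive visits

# But the label did not survive the row subsetting used for sorting.
label_after <- attr(df_sorted$bmi, "label")
print(label_before)
print(label_after)
print(gaps)
```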

edited Jan 12 '21 at 08:06

answered Jan 11 '21 at 12:53

Ben

[IMO, the tidyverse is an active disvalue to R](https://chat.stackexchange.com/transcript/18?m=53969787#53969787), so while I'm all in favor of descriptive labels, I'm *not* in favor of the additional baggage the tidyverse entails (and that adds to R's learning curve). – Stephan Kolassa Jan 11 '21 at 16:34
@StephanKolassa, I agree. IMHO, Norman Matloff does a good job of discussing the issues [here](https://github.com/matloff/TidyverseSkeptic/blob/master/READMEFull.md). – gung - Reinstate Monica Jan 11 '21 at 16:58
@StephanKolassa It's rare to find someone that shares that sentiment about the tidyverse (at least this is my impression). [Norman Matloff](https://github.com/matloff/TidyverseSkeptic) is a prominent tidyverse critic who makes a lot of good points in my opinion. Personally, I'm a bit torn: on one side, some tasks are really convenient to achieve with the tidyverse, but it's a bit of a chore to learn all the functions (which are not easy, imo). I also don't like that there is a cultural split among R users. – COOLSerdash Jan 11 '21 at 16:59
As I mentioned, when regarding `data.table`, what's the advantage of incorporating tidy-stuff in base-R? Having something like a pipe operator built-in is cool, but the whole tidy thing could be left for packages. – Firebug Jan 11 '21 at 18:19
(Actually, I'm a little surprised to see here that others have a similar opinion on this that I do. You definitely have to be into the tidyverse if you want to be one of the cool kids these days.) – gung - Reinstate Monica Jan 11 '21 at 18:47
In hindsight, my answer probably did overstate exactly what you should take from the "tidyverse", and I agree that some aspects can add to the learning curve. Nevertheless, what I had in mind was having base functions that do the major data-wrangling steps for data frames, which would be very useful. I have edited the answer to make a smaller claim with respect to these items. – Ben Jan 11 '21 at 21:53
IMHO, many of the criticisms of the "tidyverse" (e.g., concerns about complexity) are things that would actually be solved if the key functions were programmed at the base level of `R` using simple names. For example, there are a number of useful wrangling functions that change the shape/structure of a data-frame in useful ways, but are laborious to program in loops in base `R`. Having wrangling functions in the base program would seem to me to reduce the complexity, rather than increasing it. – Ben Jan 12 '21 at 01:49
Ben, I think [your last comment](https://stats.stackexchange.com/questions/504386/if-r-were-reprogrammed-from-scratch-today-what-changes-would-be-most-useful-to/504409?noredirect=1#comment932345_504409) is spot on. Descriptive labels and built-in data wrangling functions would be a step forward. Then again, there are already a number of such functions, like `reshape()`. I always wonder whether the fact that I can't get `reshape()` to work at the first try reflects a shortcoming in its design, or in my ossified brain. Probably the latter. – Stephan Kolassa Jan 12 '21 at 07:55
@StephanKolassa: `reshape(ossified brain)` – Ben Jan 12 '21 at 08:03
@StephanKolassa, IMHO `reshape()` is a great function. But the arguments have names that are a little opaque & the documentation is really not up to the task. – gung - Reinstate Monica Jan 16 '21 at 13:15

score 1 · Answer 12 · answered Jan 11 '21 at 17:58

Add more protected names

pi <- 3 should probably not be allowed.
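A short sketch of what happens today, plus the closest existing mechanism. The assignment does not change base R's constant; it creates a new variable that masks it. `lockBinding()` can protect a name, but only in an environment you control, and the `constants` environment and `tau` below are made up for the example.

```r
# Allowed: this creates a variable `pi` that masks the built-in constant.
pi <- 3
masked_area <- pi * 2^2

# Removing the mask makes the built-in value visible again.
rm(pi)
restored <- pi

# Existing opt-in protection: lock a binding in your own environment.
constants <- new.env()
assign("tau", 2 * pi, envir = constants)
lockBinding("tau", constants)

attempt <- tryCatch(
  assign("tau", 6, envir = constants),
  error = function(err) conditionMessage(err)
)
print(attempt)
```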

answered Jan 11 '21 at 17:58

JTH

You don't need to redesign R to achieve that though – Firebug Jan 11 '21 at 18:17

score 0 · Answer 13 · answered Jan 11 '21 at 23:16

Replace packages by standardized functions

There are so many packages, and the definitions of functions differ between packages. For the same problem there are different functions from different packages, with similar names but different details. You do not actually know what happens when you apply a function, and you lose control of your code. If you want to know what a function does, the help is often very scarce, or only a paper is given as a reference. A function without documentation is a risk for any user.

It would be better to choose the most useful functions in a selection process, standardize and modify them, and put them all in a default system with standardized help. This would also reduce redundant functions and increase order. Currently R looks like a multiworld construction kit that needs refurbishing.
