What are some good interview questions for statistical algorithm developer candidates?

Question

I'm interviewing people for a position of algorithm developer/researcher in a statistics/machine learning/data mining context.

I'm looking for questions to ask to determine, specifically, a candidate's familiarity, understanding and fluidity with the underlying theory, e.g. basic properties of expectation and variance, some common distributions, etc.

My current go-to question is: "There is an unknown quantity $X$ which we would like to estimate. To this end we have estimators $Y_1, Y_2, \ldots, Y_n$ which, given $X$, are all unbiased and independent, and each has a known variance $\sigma_i^2$, different for each one. Find the optimal estimator $Y=f(Y_1,\ldots, Y_n)$ which is unbiased and has minimal variance."

I'd expect any serious candidate to handle it with ease (given some time to work out the calculations), and yet I'm surprised at how many candidates who are supposedly from relevant fields fail to make even the smallest bit of progress. I thus consider it a good, discriminative question. The only problem with this question is that it is the only one I have.

What other questions can be used for this? Alternatively, where can I find a collection of such questions?

For many machine learning people (including good ones), that question is way out of their comfort zone. This is an obvious statistician question. — Marc Claesen, May 03 '16 at 14:44
This question is legitimately borderline on / off topic. However, it has many views, several upvotes, an answer w/ several upvotes, &, moreover, is CW. It could stay open, IMO. — gung - Reinstate Monica, May 03 '16 at 16:41
@MarcClaesen: I've seen statistics PhDs fail on it too. Anyway, ML and stats are very closely related, and I do want to rule out those for whom this question is outside their comfort zone (even if they may be good for other positions). — Meni Rosenfeld, May 03 '16 at 18:50
The go-to question might be worded in a confusing manner. For example, the use of $X$ with a capital would make $X$ seem random. But since you're mentioning minimum variance, it would seem like you want $X$ to be non-random (in which case, why does the variance of the estimators not have a written dependence on $X$?) — Batman, May 03 '16 at 19:04
A point of caution, Google did a big study of their internal HR process and found that interviewer scores didn't correlate at all with subsequent job performance!! My impression of the literature here is that (1) puzzle type questions are the absolute worst, serving only to make the interviewer feel smart (i.e. 0 forecasting power) and (2) resume, experience based questions may have predictive value. Past performance forecasts future performance & you may want to focus questions to ascertain what their past performance was, but the interview is far less informative than interviewers think. — Matthew Gunn, May 03 '16 at 19:44
Just ask calculus questions. You'll be surprised how many candidates can't take integrals. — Aksakal, May 03 '16 at 19:57
@meni I am intrigued: what would you, as an interviewer, consider an A+ answer to your question? — Repmat, May 03 '16 at 20:05
A lot of basic knowledge questions will filter out clearly incompetent candidates ("what is a confidence interval?"). However, the best way to see if a person will fit the position is to be transparent with the kinds of problems you want to solve. So just present a few things you want them to work on and ask for their ideas on how they would approach the problem if: 1) They have one week to code a solution. 2) They have a few months to code a solution. Don't be vague, actually show them a few tables of your data (instead of doing a thought experiment on the board). — Alex R., May 03 '16 at 23:03
@Batman: When I'm actually asking it I spend more time explaining the question, and I'm there to answer any questions and make sure everything is clear. — Meni Rosenfeld, May 04 '16 at 08:04
@Repmat: First, I expect them to realize that taking $Y=\sum_iw_iY_i,\ \sum w_i=1$ is a logical thing to do (I don't expect them to prove that this is the best possible). Then I'd expect them to realize that $w_i$ should be smaller the larger $\sigma_i$ is. Then I expect them to write $\mathbb{V}[Y]=\sum_iw_i^2\sigma_i^2$. Then I expect them to find the $w_i$ for which this is minimal (the answer is $w_i \propto 1/\sigma_i^2$); there are several ways to do that, some with more cumbersome calculations than others. They don't have to get everything instantly; they can take their time and get some hints. — Meni Rosenfeld, May 04 '16 at 08:10
@MeniRosenfeld: Well, you used the word optimal, which is the problem. Let $X$ be a parameter. If $Y_1 = X + S_1$ and $Y_2 = X+ 2 S_2$ where $S_1,S_2$ are i.i.d. Rademacher random variables, $Y_1,Y_2$ are independent given $X$ and both are unbiased estimators of $X$ with different variance. However, knowing whether $Y_2 - Y_1$ is 3, 1, -1 or -3 tells you what $S_1$ and $S_2$ are, and thus you can estimate $X$ exactly (i.e. with variance zero). However, your linear estimator will have positive variance since $w_i^2, \sigma_i^2 >0$ for some $i$. — Batman, May 04 '16 at 09:52
@Batman: Interesting observation. But the point is that the function $f$ should give an unbiased estimator whatever the distributions of $Y_i$ are (under the constraints). You can construct a better estimator for this particular distribution, but it will not generalize. I'm fairly confident a weighted average is the only way to guarantee unbiasedness (I'll be happy to be corrected). And anyway, the candidates that fail don't do so because they're sophisticated enough to think of edge cases - it's because they don't know basic stuff such as variance being additive for independent variables... — Meni Rosenfeld, May 04 '16 at 11:49
Unbiasedness is guaranteed by having the weights sum to unity. However, even limiting your solution to linear combinations of the estimators, it's almost always going to be the case that multiple estimators based on the same data will be *highly* correlated. (If they are truly independent, then they would be applied to disjoint, independent subsets of the data.) It's not at all evident that a linear combination of estimators will be optimal, though. — whuber, May 05 '16 at 21:14
Find out if the candidate to develop statistical algorithms knows anything about the fundamentals of numerical mathematical calculation and software, i.e., numerical analysis, to include effects of finite precision roundoff error, and tolerances and convergence for iterative processes. Would you believe std being computed as sqrt(abs(var)) or sqrt(max(var,0)), as the fix to having gotten a sqrt of negative argument error using a numerically unstable method to calculate variance ... in multi-billion dollar weapon systems? Many R packages in CRAN are totally unsound and unreliable numerically. — Mark L. Stone, May 05 '16 at 21:32
using software libraries like R's glmulti or MPC would engage the question, but perhaps in a more "do the job" and not as much "first principles derivation" as the asker seems to be looking for. — EngrStudent, Mar 15 '19 at 07:29
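
The weighted-average solution described in the comments above ($w_i \propto 1/\sigma_i^2$, with $\mathbb{V}[Y]=\sum_i w_i^2\sigma_i^2$) can be sketched in a few lines of Python. This is only an illustration, not part of the original thread: the function names `inverse_variance_weights`, `combined_variance` and `combine` are hypothetical, and the sketch assumes independent, unbiased estimators with known positive variances, restricted to linear combinations (which, as the comments note, need not be optimal among all estimators).

```python
from fractions import Fraction


def inverse_variance_weights(variances):
    """Return weights w_i proportional to 1/sigma_i^2, normalized to sum to 1.

    Assumes every variance is known and strictly positive.
    """
    if not variances or any(v <= 0 for v in variances):
        raise ValueError("variances must be a non-empty sequence of positive numbers")
    # Exact rational arithmetic keeps the illustration free of rounding noise.
    precisions = [1 / Fraction(v) for v in variances]
    total = sum(precisions)
    return [p / total for p in precisions]


def combined_variance(weights, variances):
    """Variance of sum_i w_i * Y_i, assuming the Y_i are independent."""
    return sum(w * w * Fraction(v) for w, v in zip(weights, variances))


def combine(estimates, variances):
    """Inverse-variance weighted average of the individual estimates."""
    weights = inverse_variance_weights(variances)
    return sum(w * Fraction(y) for w, y in zip(weights, estimates))
```

For two estimators with variances 1 and 4, the weights come out as 4/5 and 1/5 and the combined variance is 4/5, compared with 5/4 for the plain average with equal weights.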

score 12 · Answer 1 · edited Apr 13 '17 at 12:44

What do you want your statistical developer to do?

The US Army says "train like you will fight, because you will fight like you were trained". Test them on what you want them to do all day long. Really, you want them to "create value" or "make money" for the company.

Boss 101

Think "show me the money."

Money grows on trees called employees. You put in a "dime" (their wages) and they pay you a "quarter" (their value).
If you can't relate their job to how they make money for the company then neither you nor they are doing their job correctly.

Note: If your symbolic manipulation question doesn't cleanly connect to the "money" then you might be asking the wrong question.

There are 3 things every employee has to do to be an employee:

Be actually able to do the job
Work well with the team
Be willing/motivated to actually do the job

If you don't get these down rock solid, no other answer is going to do you any good.

If you can replace them with a good piece of software or a well-trained teenager, then you will eventually have to do it, and it will cost you.

Data 101

What they should be able to do:

use your internal flavors of software (network, os, office, presentation, and analysis)
use some industry standard flavors of software (Excel, R, JMP, MatLab, pick_three)
get the data themselves. They should know basic data sets for basic tasks. They should know repositories. They should know which famous data is used for which task. Fisher Iris. Pearson Crab. ... there are perhaps 20 elements that should go here. UCI, NIST, NOAA.
They should know the rules of handling data. Binary data (T/F) has very different information content than categorical (A, B, C, D) or continuous data. Proper handling of the data by data type is important.
A few basic statistical tasks include: are these two the same or different (aka cluster/classify); how does this relate to that (regression/fitting, including linear models, GLM, radial basis, difference equations); is it true that "x" (hypothesis testing); how many samples do I need (acceptance sampling); how do I get the most data from few/cheap/efficient experiments (statistical design of experiments). Disclaimer: I'm an engineer, not a statistician. You might ask them the question "What are the different fundamental tasks, and how do you test that the statistician can do them efficiently and correctly?"
access/use the data themselves. This is about formats and tools. They should be able to read from CSV, XLSX (Excel), SQL, and pictures (also HDF5 and Rdata). If you have a custom format, they should be able to read through it and work with the tools quickly and efficiently. They should know the strengths and weaknesses of each format. CSV is quick to use, has been around forever, and suits fast prototyping, but it is bloated, inefficient and slow to run.
process the data properly, using best practices, and not committing sins. Don't throw away data, ever. Don't fit binomial data with a continuous line. Don't defy physics.
come up with results that are repeatable and reproducible. Some folks say "there are lies, damn lies, and statistics" but not at my company. The same good input gives the same good output. The output isn't a number, it is always a business decision that informs a technical action and results in a business result. Different tests may set the dial at 5.5, or 6.5, but the capability is always above 1.33.
present findings in the language and at the level that the decision makers, and/or minion-developers, and/or themselves in a year, can understand with the fewest errors. A beautiful thing is being able to explain it so your grandma gets it. This (link) is my answer, but I like it.

Analytic zingers:

I think impossible questions are great. They are impossible for a reason. Being able to tell right out of the gate whether something is impossible is a good thing. Knowing why, having some ways of engaging it, or being able to ask a different question can be better.

Other CV questions. (link) On reddit. (link) others (link)

BTW: this was a good question. I might have to update this answer over time.

This seems to be a good answer, for a different question than the one I asked. I didn't ask how to pick good employees (I'd probably ask something like that on workplace.se if I needed to), I asked about testing a specific qualification. — Meni Rosenfeld, May 03 '16 at 18:56
