Statistical analysis of relational database: is it possible and how?

Question

I have been struggling with flat file databases and corresponding statistical packages for almost 20 years now (from Excel to SPSS, then Stata, and currently R).

However, I have always had to convert complex and multidimensional relational databases (eg in Access or MySQL) to often overly simplified flat sheet databases, which is at best time consuming (but often means reducing the amount of information available for each analysis).

Indeed, the approach I have always followed is the typical one of converting a relational database through specific queries into one or more flat file databases. While this is simple enough for most analyses, especially univariate and bivariate, it may become more confusing for multivariable and multivariate analyses, as it requires taking multiple and complex queries, and most importantly often oversimplifying the data themselves.

Now that I am trying to get better acquainted with big data and data science, I wonder whether the shift to big data will also require a shift to data analysis encompassing multiple tables and relations, without diluting the efficiency and power of a relational database by converting it into multiple flat file databases.

So, my question is, simply: is it possible to directly perform complex (eg multivariable) analyses of relational databases? And if yes, how?

This is not a philosophical question (only). For instance, I am now working on a relatively large (reaching 2000 patients) observational study on transcatheter aortic valve implantation for severe aortic stenosis (RISPEVA). It is based on a MySQL electronic case report form which corresponds to 12 separate tables with complex relations and often multiple entries per patient. My approach so far to try to identify predictors of long-term death (eg if looking for a score) has been, as usual, to create multiple tables through queries, and then distill the key features capable of predicting death. This means going through multiple stages of analysis and, at best, it is time consuming.

My fear, however, is that this approach might overlook one or more of the relational features of the data, thus losing precision or accuracy. Could it be done in a different fashion, directly analyzing the relational database as it stands?

Statistics is applied to a question you want to ask of data; you seem to be asking about doing an analysis in the absence of a question which doesn't make sense to me. A relational database consists of a collection of relations (tables) that essentially correspond in R to data frames; what encodes the "structure" of connections between those is the collections of variables that appear in more than one relation. As a result of the way a relational database works, almost any statistical question will be one you want to ask of a particular table ... ctd — Glen_b, Mar 25 '16 at 03:43
(whether in the database or one you construct via the various operations you apply to a database -- joins, projection, selection and so on); you choose or generate the table and then apply the relevant (to your particular question) analysis to that table. As a result I find your question unclear -- if I understand it the answer is obvious (R, for example, is heavily organized around applying statistical analysis to data frames); If I didn't understand it you should be clearer about what exactly it is you want to achieve. Specifically, what sort of a statistical question would you be answering? — Glen_b, Mar 25 '16 at 03:46
SQL Server 2016 will allow you to execute R on tables and queries directly within Management Studio, but I 100% agree with @Glen_b. You need a specific statistical question before you can move forward. — StatsStudent, Mar 25 '16 at 06:11
What exactly do you mean by "directly"? For instance, would using a data frame-like access to a View constitute a "direct" analysis or not? — whuber, Mar 26 '16 at 23:45
@GiuseppeBiondi-Zoccai, forgive my ignorance, but why can't you link your data tables together using common/unique identifiers? Several years ago, I worked for a government infectious disease surveillance program. The surveillance database was a relational database, which kept separate tables for patient reports, information on health providers and health facilities, lab results, death certificates, etc. There are common identifiers to link patients to lab results, patients to health providers, lab results to ordering facilities, and so on. Usually you won't use all of the data (cont.) — Marquis de Carabas, Mar 27 '16 at 01:02
all at once. Merge the tables that contain the variables that you need and go from there. Of course you can merge all the tables together if you need everything, but most of the time you won't. I agree with @Glen_b and StatsStudent above that you need a statistical question. The procedure for merging the tables need not be time consuming. I think it would be easier for you to merge the "raw"/flat files if you can, instead of querying the DB for the different tables and go from there. That way you can have all of your relevant variables and observations too — Marquis de Carabas, Mar 27 '16 at 01:07
What you are talking about is called denormalization in database theory. The missing part in your workflow description is the relational schema for your tables. When you extract the data into one flat (denormalized) table, you use the relations implicitly in your join logic. There are tools that can use this information in a similar way and run statistical calculations on what appears to be data in normal form, almost as if working on the database itself. Qlikview is one example. — Diego, Mar 30 '16 at 02:38

score 4 · Accepted Answer · answered Mar 29 '16 at 17:35

My understanding of your question is that you are interested in methods to uncover multidimensional relationships in data yet are reluctant to take low-dimensional slices of the data for analysis. This is, in a sense, the basis of many machine learning algorithms that use data in high dimensions to make predictions or classifications with often very complex rules that are learned directly from the data.

There are classes of relational methods which perhaps fit more neatly into what you are thinking of, however. For example, the infinite relational model is a Bayesian nonparametric framework for identifying hidden structure across many dimensions in a way that appears to conceptually match what you want. For a sample problem that this might be used for, consider a relational database which contains 3 tables with 3 different primary keys and containing information on a set of cases $S$, a set of patients $P$ and a set of doctors $D$ that performed procedures during these cases. I offer this as a low-dimensional example but all of this can be scaled up to include more data.

Then, suppose that you have an indicator variable denoting whether or not the patient had a good outcome. As shown in the paper I linked, you could simultaneously find partitionings of each of $S$, $D$ and $P$ such that each partition cell contained similar outcomes. This learning is done by performing optimization of the likelihood of the data under a Bayesian model. This might inform you as to which doctors are good or bad, or whether certain patients are particularly troublesome for a given procedure. Again, this framework is flexible and affords a range of generative models for the underlying process.
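To make the idea of "partition cells with similar outcomes" concrete, here is a minimal sketch of how one candidate partitioning could be scored. It is not the full infinite relational model: it only evaluates the Beta-Bernoulli likelihood term for fixed cluster assignments, leaves out the prior over partitions and the search over them, and collapses the cases $S$ into one outcome per doctor-patient pair. The toy data, the function names and the `beta` hyperparameter are all my own assumptions for illustration.

```python
from math import lgamma


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def block_log_marginal(outcomes, doctor_cluster, patient_cluster, beta=1.0):
    """Log marginal likelihood of binary outcomes for fixed cluster assignments.

    outcomes maps (doctor, patient) -> 0 or 1. Each (doctor cluster,
    patient cluster) block gets its own success probability with a
    symmetric Beta(beta, beta) prior, which is integrated out analytically.
    """
    counts = {}
    for (d, p), y in outcomes.items():
        key = (doctor_cluster[d], patient_cluster[p])
        good, bad = counts.get(key, (0, 0))
        counts[key] = (good + y, bad + 1 - y)
    return sum(
        log_beta(good + beta, bad + beta) - log_beta(beta, beta)
        for good, bad in counts.values()
    )


# Hypothetical toy data: outcomes are good (1) unless a "weak" doctor
# treats a "hard" patient.
doctors = ["d1", "d2", "d3", "d4"]
patients = ["p1", "p2", "p3", "p4"]
weak, hard = {"d3", "d4"}, {"p3", "p4"}
outcomes = {
    (d, p): 0 if (d in weak and p in hard) else 1
    for d in doctors
    for p in patients
}

# Candidate 1: two doctor clusters and two patient clusters.
structured = block_log_marginal(
    outcomes,
    {"d1": 0, "d2": 0, "d3": 1, "d4": 1},
    {"p1": 0, "p2": 0, "p3": 1, "p4": 1},
)
# Candidate 2: everyone lumped into a single cluster.
single = block_log_marginal(
    outcomes,
    {d: 0 for d in doctors},
    {p: 0 for p in patients},
)
print(f"structured: {structured:.2f}, single cluster: {single:.2f}")
```

The partitioning that separates weak doctors and hard patients scores higher than the single-cluster one, which is the signal an inference procedure would follow when searching over partitions.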

This may be more complex than what you desired - it's a bit of a jump from Excel or SPSS to writing custom inference code in another programming language. Still, it's how I would approach this problem.

score 1 · Answer 2 · answered Dec 06 '16 at 21:52

So, my question is, simply: is it possible to directly perform complex (eg multivariable) analyses of relational databases? And if yes, how?

I have a problem with this question. First, the role of a database is mainly efficient storage and retrieval for ETL (extract, transform, load), not statistical analysis. To ask what statistical method can be applied on the format of a database (or format of data) ignores the statistical question (see @Glen_b's comments).

If you received an Access file, or relational database extract, its format was probably put in place by a software engineer who cared mostly about the efficiency of storage and retrieval, not about the types of analysis. Now, depending on your statistical question/hypothesis, this data will need to be transformed in one way or another. It should be noted that statistical models weren't constructed to be applied to data formats (e.g. text files, hdfs, relational databases, graph databases); that is not the concern of most statistical models.

My fear, however, is that this approach might overlook one or more of the relational features of the data, thus losing precision or accuracy. Could it be done in a different fashion, directly analyzing the relational database as it stands?

This is vague. What is meant by "relational features"? Are you referring to relational tables?

Let's assume you have a demographics table with patient id and other stuff. Now, let's assume you have your repeated measures for your patients by patient id. Then if you know enough SQL, these "relational features" can be combined using a simple JOIN (left, right, inner, etc.).

This is how you would combine your relational tables (which I assume is what you mean by "relational features"). Often, this can be challenging especially if you have a database with 100+ tables/views. But even then, you shouldn't need all those tables; if you have a well defined question, you should be able to narrow the problem down to 2-5 tables.
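As a rough sketch of the JOIN described above, here is a self-contained example using Python's built-in `sqlite3` module with an in-memory database. The table names, column names and values are made up for illustration and are not taken from the RISPEVA schema.

```python
import sqlite3

# Hypothetical schema: one row per patient, plus repeated measures per patient.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE demographics (
        patient_id INTEGER PRIMARY KEY,
        age INTEGER,
        sex TEXT
    );
    CREATE TABLE measurements (
        patient_id INTEGER REFERENCES demographics(patient_id),
        visit INTEGER,
        gradient_mmhg REAL
    );
    INSERT INTO demographics VALUES (1, 81, 'F'), (2, 77, 'M'), (3, 84, 'F');
    INSERT INTO measurements VALUES (1, 1, 42.0), (1, 2, 11.5), (2, 1, 38.0);
""")

# LEFT JOIN keeps every patient, including patient 3, who has no measurements.
rows = conn.execute("""
    SELECT d.patient_id, d.age, d.sex, m.visit, m.gradient_mmhg
    FROM demographics AS d
    LEFT JOIN measurements AS m ON m.patient_id = d.patient_id
    ORDER BY d.patient_id, m.visit
""").fetchall()

for row in rows:
    print(row)
```

The result is a long-format table with one row per patient visit. Patients without any measurement still appear, with `None` in the measurement columns, so nothing is silently dropped by the merge.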

Now, after merging your data, it may turn out that your observations are clustered by, say, patient id. For a scenario like this, a hierarchical or mixed effects model would be appropriate. Such a model captures the fact that certain observations belong to the same individual and are therefore not independent. Reference: http://www.ats.ucla.edu/stat/mult_pkg/glmm.htm
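Before fitting a mixed model, a quick way to see whether clustering by patient matters is to estimate the intraclass correlation (ICC). The sketch below is not a mixed effects model itself; it uses the classical one-way ANOVA estimator, assumes a balanced design (same number of measurements per patient), and runs on made-up numbers.

```python
from statistics import mean

# Hypothetical long-format rows after a JOIN: (patient_id, measurement).
rows = [
    (1, 10.0), (1, 11.0), (1, 9.0),
    (2, 20.0), (2, 21.0), (2, 19.0),
    (3, 15.0), (3, 14.0), (3, 16.0),
]

# Group the measurements by patient.
groups = {}
for pid, y in rows:
    groups.setdefault(pid, []).append(y)

n = len(groups)                        # number of patients
k = len(next(iter(groups.values())))   # measurements per patient (balanced)
grand = mean(y for _, y in rows)

# Between-patient and within-patient mean squares from a one-way ANOVA.
msb = k * sum((mean(ys) - grand) ** 2 for ys in groups.values()) / (n - 1)
msw = sum(
    (y - mean(ys)) ** 2 for ys in groups.values() for y in ys
) / (n * (k - 1))

# Share of the total variance attributable to differences between patients.
icc = (msb - msw) / (msb + (k - 1) * msw)
print(f"ICC = {icc:.3f}")
```

An ICC close to 1, as in this toy example, means that most of the variation is between patients, so treating the rows as independent observations would be misleading.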

In summary, a physical format shouldn't sway the type of analysis you conduct. As an analyst, it is your responsibility to ensure that the format of the data is appropriate for the analysis. For this reason there are tools for transforming your data so that it can capture the necessary context for your statistical analysis. Reference: http://vita.had.co.nz/papers/tidy-data.pdf
