I have been struggling with flat file databases and corresponding statistical packages for almost 20 years now (from Excel to SPSS, then Stata, and currently R).
However, I have always had to convert complex and multidimensional relational databases (eg in Access or MySQL) to often overly simplified flat sheet databases, which is at best time consuming (but often means reducing the amount of information available for each analysis).
Indeed, the approach I have always followed is the typical one of converting a relational database through specific queries into one or more flat file databases. While this is simple enough for most analyses, especially univariate and bivariate, it may become more confusing for multivariable and multivariate analyses, as it requires taking multiple and complex queries, and most importantly often oversimplifying the data themselves.
Now that I try to get more acquainted with big data and data science, I wonder whether the shift to big data will require also a shift to data analysis encompassing multiple tables and relations, without diluting the efficiency and power of a relational database when it is converted into multiple flat file databases.
So, my question is, simply: is it possible to directly perform complex (eg multivariable) analyses of relational databases? And if yes, how?
This is not a philosophical question (only). For instance, I am now working on a relatively large (reaching 2000 patients) observational study on transcatheter aortic valve implantation for severe aortic stenosis (RISPEVA). It is based on a MySQL electronic case report form which corresponds to 12 separate tables with complex relations and often multiple entries per each patient. My approach so far to try to identify predictors of long-term death (eg if looking for a score) has been, as usual, to create multiple tables through queries, and then distill the key features capable of predicting death. This means going through multiple stages of analysis and, at best, it is time consuming.
My fear is however that it might overlook one or more of the relational features of the data, and thus loosing precision or accuracy. Could it be done in a different fashion, directly analyzing the relational database as it stands?