
I am currently working to design an SQL Server 2016 (MSSQL) based platform to handle an OLTP dataset that will grow beyond the petabyte level. It will be used for specific types of analysis that require trends to be discovered using various methods and tools (incl. R). Various sources will feed the database(s) on a 'live' basis, and other data will be ingested in batches. Due to the high transaction volumes, the number of concurrent users projected (>250) and the way data will be consumed by the users (more on that later), we need this solution to be highly performant and scalable. It is obvious the data needs to be partitioned on a few levels to support the data consumers.

The users will be running trend-analysis type workloads over daily, weekly, monthly and multi-year ranges. Most data will be supplied with date fields, but customer names, account numbers and transaction types are also in scope for trend analysis.

My question to you all is as follows: what would your strategy be for designing a proper partitioning solution? What questions would you ask, and what would you look for in the answers? How would you handle maintenance on indexes and the like? What would you factor into the design?
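
To make this a bit more concrete, here is a minimal sketch of the kind of date-range partitioning and per-partition index maintenance I have in mind. The table name, columns, filegroup mapping and boundary values are placeholders only, not the real schema.

    -- Hypothetical example: RIGHT-range partition function on the date key.
    CREATE PARTITION FUNCTION pf_TxnDate (date)
    AS RANGE RIGHT FOR VALUES ('2017-01-01', '2017-02-01', '2017-03-01');

    -- A real design would map partitions to dedicated filegroups rather than PRIMARY.
    CREATE PARTITION SCHEME ps_TxnDate
    AS PARTITION pf_TxnDate ALL TO ([PRIMARY]);

    -- Placeholder fact table, clustered on the partitioning column plus a surrogate key.
    CREATE TABLE dbo.Transactions
    (
        TxnId     bigint        NOT NULL,
        TxnDate   date          NOT NULL,
        AccountNo varchar(34)   NOT NULL,
        TxnType   varchar(20)   NOT NULL,
        Amount    decimal(19,4) NOT NULL,
        CONSTRAINT PK_Transactions PRIMARY KEY CLUSTERED (TxnDate, TxnId)
    ) ON ps_TxnDate (TxnDate);

    -- Index maintenance can then target a single partition instead of the whole table.
    ALTER INDEX PK_Transactions ON dbo.Transactions
        REBUILD PARTITION = 2;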

Oowww, and dropping everything into a data lake (read: swamp) or going for a different platform is not an option. Also, I am not at liberty to discuss the particulars of the project or the data involved, so please don't ask. Just know that it is highly confidential financial and personal data and that we will be doing forensic analysis (using R, Power BI and/or other BI tooling) in compliance with legal requirements that have been imposed on us. I will not share any other details beyond this, sorry.

  • I would like to know the retention of online (production) data you will keep, the growth per day, and the number of columns in the table before I suggest anything. – Rajesh Ranjan Jan 23 '17 at 08:23
  • We are currently in the process of collecting those details, but just to give everyone an indication of the scale of things, a few 'highlights':
    - 150+ batched ETL flows
    - 20+ live streams of data
    - ingress volume of data in excess of 350GB per month
    - retention of 7 years for most data
    – MvdMunnik Jan 23 '17 at 08:33
  • Would this be a warehouse (OLAP) or production (OLTP)? – Rajesh Ranjan Jan 23 '17 at 08:36
  • OLTP based system (yeah, I know.... ) – MvdMunnik Jan 23 '17 at 08:41
  • Will the data be updated and deleted frequently? Or is your data quite stale? – Stijn Wynants Jan 23 '17 at 10:05
  • Most data will not be modified after it is loaded. A good portion of the data, however, will be stored with a build-up of history (active/inactive dates) as it gets reloaded/updated (think client information, address changes and such). If I had to put a ratio on it, 30% of the data will be susceptible to changes, but with a low rate of data churn. – MvdMunnik Jan 23 '17 at 10:18
  • You have your work cut out for you! It sounds like you need an OLAP system to do the trend analysis, although storing the data twice will be MASSIVE; getting any sort of query performance with this much data on an OLTP solution is unlikely. Is there a possibility to split the workload? Do you have the time/resources to create an OLAP solution as well? – blobbles Jan 23 '17 at 10:32
  • I would look into operational analytics (columnstore); I have had some great results with that on a multi-TB database. And look for a good partition key, I would say a date range; your maintenance will then be done per partition. Partitioning itself is not primarily a performance enhancement. So effectively the feed of your data is an OLTP system, but the destination is more of an OLAP? (A sketch of such a columnstore index is appended after the comments.) – Stijn Wynants Jan 23 '17 at 10:36
  • @blobbles No, the loads cannot be split, and yes, it is sort of a 'hybrid' platform where typical OLAP-type workloads are combined with OLTP-type loads and data flows. Combine that with a significant number of users running concurrent workloads, and you can see why I am looking around for people to speak with and share ideas. – MvdMunnik Jan 23 '17 at 10:59
  • 1
    350GB per month * 7 years << petabyte. Where does the rest of the volume come from? – Michael Green Jan 23 '17 at 11:02
  • @StijnWynants ColumnStore is something I intend to use, but not for everything, as that is not feasible. So partitioning is key, hence my question. Any thoughts on how to design a platform to maximize performance using partitioning? My current thinking is along these lines (see the boundary-management sketch after the comments):
    - 70 daily partitions
    - 42 weekly partitions
    - 12 monthly partitions
    - 5+1 yearly partitions
    – MvdMunnik Jan 23 '17 at 11:06
  • @MichaelGreen We will be securely storing the source files alongside the database, and the 350GB/month is just for a few of the sources in phase 1 of the project. My current expectation for the entire platform is well above a TB per month, and it is growing. In addition, we are planning an initial load of >350TB onto the solution (depending on the amount of history we want to load, this may increase significantly). – MvdMunnik Jan 23 '17 at 11:14
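
Regarding the operational-analytics suggestion in the comments, a minimal sketch of what that could look like on the placeholder table from the question: an updatable nonclustered columnstore index (new in SQL Server 2016) created on the same partition scheme, so analytical scans read column-compressed data while OLTP writes keep going to the rowstore. Object names are the same placeholders as above.

    -- Hypothetical: partition-aligned nonclustered columnstore index on the placeholder table.
    CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_Transactions
        ON dbo.Transactions (TxnDate, AccountNo, TxnType, Amount)
        ON ps_TxnDate (TxnDate);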
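
And for the daily/weekly/monthly/yearly layout proposed in the comments, a rough sketch (reusing the placeholder pf_TxnDate/ps_TxnDate objects) of how the boundaries could be rolled by a scheduled job: new daily boundaries are added with SPLIT RANGE, and boundaries that age out of the daily window are collapsed into the coarser ranges with MERGE RANGE.

    -- Hypothetical sliding-window maintenance, run daily.
    -- 1. Tell the partition scheme which filegroup will hold the next new partition.
    ALTER PARTITION SCHEME ps_TxnDate NEXT USED [PRIMARY];

    -- 2. Add tomorrow's boundary, creating a new, empty daily partition at the head.
    ALTER PARTITION FUNCTION pf_TxnDate() SPLIT RANGE ('2017-03-02');

    -- 3. Collapse the oldest daily boundary into its neighbour once it falls out of
    --    the 70-day window (the same pattern rolls weekly into monthly, monthly into yearly).
    ALTER PARTITION FUNCTION pf_TxnDate() MERGE RANGE ('2017-01-01');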