How to design a partitioning strategy for a VLDB OLTP solution (SQL Server 2016, >100TB init. size)

Question

I am currently working to design a MSSQL 2016 based platform to handle a dataset (OLTP based) that will grow above the PetaByte level. It will be used for specific types of analysis that will require trends to be discovered using various methodes and tool (incl. R). There will be various sources feeding the database(s) on a 'live' basis as well as batches of data that will be ingested on a batch basis. Due to the high volumes of transactions, that number of concurrent users projected (>250) and the way data will be consumed by the users (later more), we need this solution to be high performant and scalable. It is obvious the data needs to be partitioned on a few levels to support the data consumers.

The users will be running trend analysis type workloads on daily, weekly, monthly and multi year ranges. Most data will be supplied with date fields, but customer name, account numbers and transaction types, are also in scope for doing trend analysis.

My question to you all is as follows, what would your strategy be for designing a proper partitioning solution? What questions would you ask and what would you look for in the answers? How would you handle maintenance on indexes and such.... What would you factor into the design?

Oowww and dropping everything into a datalake (read: swamp) or going for a different platform is not an option. Also, I am not at liberty to discuss the particulars of the project or the data involved so please don't ask. Just know it is highly confidential financial and personal data and we will be doing forensic analysis (using R, PowerBI and/or other BI tooling) in compliance with lawful requirements that have been imposed on us. I will not share any other details beyond this, sorry.

I would like to know retention of online (on production) data you will keep, growth/day and number of columns in the table before I suggest you something. — Rajesh Ranjan, Jan 23 '17 at 08:23
We are currently in a process of collecting those details. But just to give everyone an indication of the scale of things a few 'highlights'.
Will the data be updated and deleted frequently? Or is your data quite stale? — Stijn Wynants, Jan 23 '17 at 10:05
Most data will not be modified after it is loaded. A good portion of the data however will be stored with a build up of history (active/in-active dates) as it gets reloaded/updated (think client information, address changes and such). If I had to put a ratio on it, 30% of data will be susceptible for changes but with a low rate of data churn. — MvdMunnik, Jan 23 '17 at 10:18
You have your work cut out for you! Sounds like you need an OLAP to do the trend analysis though, although storing the data twice will be MASSIVE, having any sort of performance on queries with this much data on an OLTP solution is unlikely. Is there a possibility to split the workload? Do you have time/resources to create an OLAP solution as well? — blobbles, Jan 23 '17 at 10:32
I would look into operational analytics(columnstore) have had some great results on a multi-tb database with that, and look for a good partition key. I would say on a date range. Your maintenance will be done by partition. This will not be used for performance enhancement. So actually the feed of your data is an OLTP system, but the destination is more a OLAP? — Stijn Wynants, Jan 23 '17 at 10:36
@blobbles No, loads cannot be split and yes, it is sort of a 'hybrid' platform where typical OLAP type workloads are combined with OLTP type loads and data flows. Combine that with significant amounts of users that are running concurrent workloads, and you can see why I looking around for people to speak and share ideas with. — MvdMunnik, Jan 23 '17 at 10:59
350GB per month * 7 years << petabyte. Where does the rest of the volume come from? — Michael Green, Jan 23 '17 at 11:02
@StijnWynants ColumnStore is something I intend on using but not for everything as that is not feasible. So partitioning is key, and hence my question. Any thoughts on how to select design a platform to maximize performance using partitioning? My current thoughts are leaning towards the following:
- 70 daily partitions
- 42 weekly partitions
- 12 monthly partitions
- 5+1 yearly partitions — MvdMunnik, Jan 23 '17 at 11:06
@MichaelGreen We will be secure storing source files with the database and the 350GB/m is just for a few of the sources in phase 1 of the project. My current expectation for the entire platform is well above the TB/Month but it is growing. Next to this we are planning an initial load onto the solution of a >350TB (depending on the amount of history we want to load this may increase significantly). — MvdMunnik, Jan 23 '17 at 11:14

Rajesh Ranjan · Accepted Answer · 2019-07-12T03:40:34.443

I would suggest you to go through the article describes some important prerequisites and suggestion for OLTP databases.

http://nerdtechies.com/2016/12/05/improve-write-performance-sql-server-database/

For the loading process use BULK INSERT and for normal insert user WITH(ROWLOCK).

https://technet.microsoft.com/en-us/library/dd425070.aspx

Partitioning

What you need to know.

What would be retention of online data?
What is growth/day?
On what basis reports users pull maximum (daily/weekly/Monthly or Yearly)?
Archiving Policy.

--I have experience on 2 TB table with 50GB growth/day, one month data on production rest on WH. So suggesting accordingly.

If 70-80 % Usage of daily basis analytical reports. I would suggest to go for daily basis partition as there would be huge amount of data. It will perform faster but to generate weekly, monthly and yearly report you'll have lengthy query.

If there is 50-50 ratio between daily,weekly and monthly analytical then go for monthly partitioning. In this case Daily and Weekly basis reports will perform slower than day basis partitioning, because there would be lot of records to filter from the month. But you'll have quite simple query.

Partitioning by considering retention of online data make Archiving Policy easier.

Index

As table will be partitioned you should create partitioned indexes on the table.To create partitioned index you need to include partition base column in the indexes. Until you don't create partitioned indexes on the partitioned table you wouldn't get performance benefits.

Creating Indexes on separate filegroups will result in good performance for reports. So create separate Partition scheme,function, on separate filegroupes for indexes same as created for table.

Better to go for Column Stored Index on (Base_Partition_Column,customer name, account numbers, transaction types, Financial Column) on Index_Partition_Scheme.

Create indexes with FILLFACTOR=80

Creating partitioned indexes make index maintenance easier. Instead of rebuilding or reorganizing complete index you can perform maintenance task for particular partition of the index which minimize maintenance duration for big tables.

To do it you can track index fragmentation and row count of the partitions. It will help you to find out indexes on which partition should be rebuild.

Maintenance schedule depends on data size, how much off duration you have to perform the maintenance activities and how long SQL Server takes to finish the task. It would be better to test you maintenance plan on test environment first with same amount of data then move for production if it finish within the off hour you have.

Thanks

Thanks for your reaction @Rajesh-Ranjan, I was hoping I would get more new insights, but you mentioned pretty much the same things I was considering and accounting for in the plans. Good to know, I am apparently not missing anything big (so far, lol). Oowww and I actually know and speak with a few of the people that wrote that MS article you mentioned. Small world! :-) — MvdMunnik, Jan 24 '17 at 13:24

How to design a partitioning strategy for a VLDB OLTP solution (SQL Server 2016, >100TB init. size)

1 Answers1