2

Which is easier to maintain as Disaster Recovery on a remote site for a 2-node SQL Server Cluster

  1. Log Shipping
  2. Single Instance of Always on Availability Group Instance.

Any suggestion will be highly appreciated.

SQL_NoExpert
  • 1,005
  • 10
  • 29
  • You should clarify on the AG configuration you are speaking about. An SQL Server FCI cluster with AG replication or an WSFC hosting SQL Server AG with 2 replicas? –  Jun 15 '18 at 00:18

4 Answers4

2

My experience is that an AlwaysOn high availability group, HAG, is easier to maintain than log shipping. Lets set some variables for our scenario.

1.) Your version and edition of SQL Server is SQL Server 2016 Enterprise Edition. We are not on the latest SQL Server patch level. We need to apply a newly released service pack.

2.) The SQL Server database environment exists at a typical small to medium sized company (The active passive AlwaysOn build that you described in your question leads me to this conclusion). This company has a nightly or weekly maintenance window but like most companies the service level agreement is either four or five nines of up time (The database must be available either 99.99% or 99.999% of the time).

Lets look at how each feature allows you, the DBA, to keep the I.T. manager happy by maintaining those up time numbers.

An AlwaysOn HAG gives you the flexibility of being able to fail over back and forth between your AlwaysOn replicas. My assumption is that the primary and secondary replicas are not in close physical proximity to each other. Perhaps they are in different data centers or different racks within a data center. Regardless, If I need to perform maintenance on the primary I can set the availability mode from asynchronous to synchronous, fail over, perform my maintenance, fail back over, and set the availability mode back to asynchronous (https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/availability-modes-always-on-availability-groups). With a properly configured AlwaysOn HAG listener the end user will experience little to no interruption and you'll get to easily maintain that SLA.

Now lets take a look at Log Shipping. Without doing a cut and paste from the Microsoft books online there are five steps that have to be manually performed before you can failover. They can be found here... https://docs.microsoft.com/en-us/sql/database-engine/log-shipping/fail-over-to-a-log-shipping-secondary-sql-server .

Considering that most incidents and or maintenance windows occur after hours would you rather have an AlwaysOn high availability group that you can easily failover back and forth between replicas or try and get the primary and secondary databases into sync by applying transaction log backups?

h24601
  • 130
  • 2
  • 10
  • Appreciate your answer. In our current environment we already have a 2 node active passive cluster in place. We want to add a dr server that will be located on a remote location. Our options are either log shipped from the cluster or set up a always on cluster with the existing 2 mode cluster that will act as a Dr site. I understand both options dont support automatic failover from cluster but failing over to always on from sql failover cluster is much easier then failing over to a secondary log shipped instance . – SQL_NoExpert Feb 28 '18 at 19:50
  • @user9394033 Thanks for the additional information. I would recommend AlwaysOn asynchronous mode for the DR. Is the DR database server joined to the same domain as the other two database servers? – h24601 Feb 28 '18 at 19:56
  • All will be under same domain. My only concern is what if my cluster totally breaks and i manually fail over to my always on replica, will my application be automatically connected like a typical cluster instance or i will need to change the connection stings manually to point towards the always on replica – SQL_NoExpert Feb 28 '18 at 19:59
  • @user9394033 The HAG Listener will redirect the database connection to the primary replica. By cluster do you mean the windows failover cluster or both of the cluster nodes, DB1 and DB2, fail? If DB1 and DB2 nodes go down and you failover to the DR node and your application is using the listener host name in its database connection string then the connection will go to the DR site. If the WSFC goes up in smoke then you'll need to change the application's database connection string to specifically point at the DR database instance. – h24601 Feb 28 '18 at 20:10
  • Well noted so I will need a listener, and then if the whole wfcs breaks then the connection will need to be manually configured. – SQL_NoExpert Feb 28 '18 at 20:13
  • If your database client supports the multi subnet Failover feature you could transparently failover. Otherwise you could also manually change the DNS name (and restart application).however keep in mind that if you connect your application. Across long distance to the dB instance in a dr site latency might kill you. And a automatically applied change will replicate human errors automatically to your dr site (I.e. you don’t have a delayed apply) – eckes Mar 01 '18 at 00:57
  • One more question for example if the cluster nodes are down then there still be a manual intervention as the connection string should need to be pointed towards the listener or the dr node since there is no automatic failover, – SQL_NoExpert Mar 02 '18 at 18:01
1

Log shipping is by far easier to maintain than an Availability Group for the scenario you described.

The ability to fail over back and forth quickly using an AG is great for high availability, but often disaster recovery RTO (recovery time objective) is much longer because hey, it's a disaster. While the high availability goal might be to have it all back up in 60 seconds, the disaster recovery goal may be 60 minutes. Log shipping should be able to handle that easily.

Adding a remote, third AG node in asynchronous commit mode may itself be an easy task, but maintaining that can get tricky. Things I've seen go wrong with my own AGs include:

With Availability Groups, there are a lot more things that can go wrong and a lot more places (DMVs, error logs) to look for troubleshooting information: https://blogs.msdn.microsoft.com/sql_server_team/troubleshooting-high-hadr_sync_commit-wait-type-with-always-on-availability-groups/

Also, here are some issues related with just patching Availability Groups (granted this list is a little older now): https://www.brentozar.com/archive/2015/02/patching-sql-server-availability-groups/

Oh, and there's the added complexity/setup/maintenance of Windows Server Failover Clusters with AGs, too. (Note: SQL Server 2017+ doesn't always require a WSFC.)

Consider not only the technology part of maintenance, but also the staffing. For full production coverage, you ought to have at least two DBAs with AG experience. That kind of experience is hard to find and commands a higher salary. Compare that to log shipping, where most DBA candidates will have that experience and there's no salary premium that comes with it.

In short, practically everything about Availability Groups is higher maintenance except the process of failing over to the DR node. If you have a tight RTO for your DR node, then you may have to go with async mirroring or even an AG. Likewise, if you have an experienced DBA team that can handle AGs going wrong, then maybe go that route. Otherwise, log shipping will be the easier technology to maintain.

Doug Lane
  • 608
  • 6
  • 14
0

Log shipping is by far the easier to maintain and troubleshoot technology for DR. I've used both extensively and would much prefer log shipping.

Why?

Because with a little scripting you can easily automate failover procedures. Both a nice failover where you take a tail-log backup and copy it over (using the agent jobs already in existence) restore it (again with jobs that already exist) and restore with recovery.

There is little that can go wrong with log shipping. And when it does it is easy to fix.

Sure there is no fancy GUI to monitor log shipping, but it isn't hard and there are agent jobs that do indeed check your jobs. Out of the box.

AOAG can go down and be down and you won't know about it until after you need it UNLESS you know how to setup the alerting. I know this because I inherited servers that had AOAG that had stopped syncing.. months before I even knew these servers existed.

Log Shipping sets up alert jobs, you make sure your email alerts are working and bam, it bothers you endlessly until you fix it.

I've used Log Shipping, Mirroring, AOAG all for DR.. I much prefer Log Shipping.

tinlyx
  • 2,735
  • 4
  • 29
  • 53
TedS
  • 1
  • You can script AlwaysOn as well. I helped script an automated restore process that restores databases and adds them to AlwaysOn. – Alen Jun 14 '18 at 17:56
  • Note that availability groups will beat log shipping on RPO in cases where the primary is unrecoverable (i.e. the D in DR). By how much depends on how frequently you have the log shipping jobs running. – db2 Jun 14 '18 at 18:10
  • How long does the backup and restore of the tail-log backup take? Most organizations want a fail-over done in a few seconds. – Alen Jun 14 '18 at 18:13
  • Alen, the question was for DR. In that case you don't need instant failover. it is, after all, a disaster. This isn't for HA. – TedS Jun 14 '18 at 19:32
0

AlwaysOn. If the application is configured properly for the listener, you can have a fail-over every hour and no one is going to care and no complaints from users in 99% of cases. I've also seen legacy apps that don't support it work almost flawlessly during a fail-over. And you can separate your databases into several groups each with it's own listener and fail them over independently of each other. Or if you have virtually no overhead on your VM environment, you can put a problem database into it's own group so a fail-over won't affect other databases.

AlwaysOn also supports Read-Only. If your developers code a read only connection into their apps you can point applications at your secondary for selects. Handy feature if your primary is on an overloaded hyper-visor with other applications.

I haven't used mirroring in a few months, but from what I remember if you failed over a single database then you needed to make manual changes to DNS.

The one thing I miss about mirroring is the GUI they had that told you how far it was behind if you had some issue. There were a few times we had replication issues I couldn't figure out. And then I looked at the mirroring admin tool and saw that some database was 50GB or whatever behind in sending log data. And it kept a history. Pretty good if you're working in an organization with a tight budget and no monitoring software you can use the mirroring history as part of your troubleshooting or answer questions to management about issues.

Alen
  • 519
  • 3
  • 9