In my previous post in my ongoing series on SQL Server database corruption I covered a list of best practices for responding to database corruption when it happens. In that post I mentioned that I’d provide a follow-up post on the specifics of how to execute page-level restore operations from within SQL Server. Page Level Restore Operations As documented in Books Online, SQL Server’s RESTORE DATABASE command has an optional parameter or switch that instructs it to only restore one or more pages instead of an entire database – which is very handy and helpful in dealing with situations in which only a hand-full of pages have been corrupted by some sort of problem with the IO subsystem. Accordingly, the basic order of operations for this kind of operation is as follows Kick the database into single user mode. You can set the database to RESTRICTED_USER, but if you have end-users or applications connecting using logins that map to the db_owner role, they’ll still be able to connect to the database. So use SINGLE_USER mode instead and do NOT close the query window you use to set to SINGLE_USER. To set the database to SINGLE_USER mode, execute the following: ALTER DATABASE
SET SINGLE_USER WITH ROLLBACK AFTER 10 SECONDS GO And note that the WITH ROLLBACK AFTER n SECONDS option can/will kill connections and on-going operations. It’s documented in here (towards the end of the page). Likewise note that to pull this recovery operation off you’re, OBVIOUSLY, kicking everyone out of the database. Then, START your recovery process by backing up the tail-end of the log file – or the portion of your log file that hasn’t already been backed up. (This is CRITICAL as it ensures that all operations up to a certain point have been accounted for. And you’ll want to do this AFTER switching the database to SINGLE_USER mode – otherwise you risk allowing a few operations to ‘sneak’ by – which means that end-users or ...Read the rest of entry >>
In both of the sessions that I covered in SQL Server Pro’s eLearning Event entitled Practical SQL Server Improvements for Businesses, I made mention of additional context, guidance, and links that could be used to better get a handle on SQL Server 2012 AlwaysOn functionality as well as Contained Databases and Indirect Checkpoints. SQL Server 2012 AlwaysOn One of the key things I tried to iterate in both of my sessions is that while it’s a bit silly for Microsoft to have called improvements/successors to both Clustering (AlwaysOn Failover Cluster Instances) and Mirroring (AlwaysOn Availability Groups) by the same AlwaysOn ‘moniker’, the reality IS that BOTH successors do, indeed, provide some amazing benefits in the sense that they BOTH offer not only HA benefits, but DR benefits as well. Moreover, what makes AlwaysOn Availability Groups so powerful is the fact that DBAs can now use this single technology (and/or this single technology in conjunction with AlwaysOn Failover Cluster Instances) to achieve solutions that enable High Availability and which provide Disaster Recovery coverage as well. All from the same set of tooling. But, as cool as both AlwaysOn Failover Cluster Instances (FCI) and Availability Groups are, they represent a number of changes and new potential pitfalls and gotchas that you really need to be comfortable with before just ‘willy nilly’ rolling these solutions out into your environment. Accordingly, the goal of both of my sessions was to call out the BENEFITs of these new technologies and provide a quick overview of some of the LOGICAL shortcomings that you’ll encounter when using them. (And to me, the BIGGEST problem with AlwaysOn Availability Groups isn’t a technical issue (though there are plenty of those). Instead, it’s the fact that AlwaysOn Availability Groups are ONLY supported in SQL Server 2012 Enterprise Edition.) Still, having the ability to create synchronous-commit availability replicas that can be used for automatic failover...
Read the rest of entry >>
In my recent SQL Server Pro Webcast (sponsored by Red Gate), I blurred through a number of details regarding how to Avoid 5 Common SQL Server Backup Mistakes. The event is/was free, and will be online for on-demand viewing for roughly 3 months after the date given. (Though it typically takes a day or so for it to become available for on-demand viewing after the live presentation date.) Accordingly, I wanted to provide some additional, follow-up, resources to provide some additional context on many of the things that I addressed. SQL Server Backup APIs Early on in my presentation, I mentioned that it’s essential that third party backup solutions (or, more specifically: ‘plugins’ for SQL Server that are provided as ‘adapters’ for backup solutions that are typical NON SQL Server focused – but which want to provide some additional features and options) directly leverage the APIs provided by Microsoft for handling SQL Server backups. And again, this ISN’T an issue with all of the big/major 3rd party SQL Server backup solutions out there. Instead, it’s a concern with OLDER non-SQL Server backup solutions that also offer ‘plugins’ that ‘fake’ SQL Server backups without using the VDI and VSS APIs that are outlined in a bit more detail here. And the reason I keep stressing the ‘plugin’ angle here, is that the use of 3rd party SQL Server backups is typically a BEST PRACTICE in most environments. RPOs and RTOs It’s impossible to stress the importance of Recovery Point Objectives and Recovery Time Objectives. If you don’t have them, you haven’t communicated to management what they should expect in the case of a disaster. Consequently, even if you pull off the most spectacular and herculean recovery known to man following a disaster, you’ll still end up having to deal with cranky end-users AND management if you haven’t communicated RPOs and RTOs. In fact, RPOs and RTOs are such a big deal that I decided to make them the focus of my very first blog post on my P...
Read the rest of entry >>
In a previous post we looked at the importance of backups when it comes to dealing with logical (human or application caused) or physical (IO subsystem caused) corruption. In this post we’ll look at some best practices for responding to physical corruption when it happens or is detected. Responding to Corruption When it Happens When corruption is encountered, there’s a definite order of operations to follow as a means of both decreasing the potential for data loss and as a means for decreasing down-time while remedying the situation. That said, all forms of corruption are NOT created equally and are NOT the same in potential scope or impact. For examples, sometimes corruption occurs within typical data pages that store the actual data needed by your database to keep it running and viably storing data. Or, at other times, corruption might simply occur within an index or other simple structure that is effectively an ‘ancillary’ feature of your database – meaning that you MIGHT be able to recover from this form of corruption by merely dropping the index and recreating it (with effectively no down time and only a brief ‘disruption’ in the form of the locks/blocks necessary to replace the index). Yet, at other times, corruption COULD happen within key (allocation) pages used by the database to MAP where your data, indexes, or other KEY structures are found – and in cases where corruption occurs in these highly specialized pages (such as page 9 – the boot page), this corruption can be FATAL – meaning that there is NO option for recovery other than to revert to backups. As such, the following, ordered, list outlines key considerations to address as a means of ‘escalating’ through available options in order to try and minimize the scope and down-time associated with recovering from corruption. No, REALLY: Don’t Reboot Your Server. While some weird Windows issues will go away by rebooting the box, the presence of corruption will NOT go away if you reboot your ser...
Read the rest of entry >>
In Part VII of this series on SQL Server Database corruption we touched upon how backups can be used as an additional means of early detection for corruption – by making sure to validate the checksums of all data being backed up. The Importance of Backups In Recovering from Corruption But, did you know that backups are your BEST line of defense in recovering from MOST forms of database corruption? (If you’ve been following along in previous posts, then you already knew this.) Granted, there is one form of corruption that I’ve encountered that CAN safely be corrected without the use of backups (or, at least, I’ve seen cases where I was able to correct it without the absolute NEED for backups – even though I had them standing by as a precaution and even though I FIRST tested these corrections on copies of my databases made from backups). But this specific type of corruption really isn’t related to the kinds of IO subsystem corruption that have been the focus of this series of posts – and, instead, this type of ‘corruption’ really deals with a question of meta-data getting a bit out of sync and needing to be updated or refreshed via the use of DBCC UPDATEUSAGE(). Otherwise, for an excellent overview of what your options are for dealing with corruption WHEN you don’t have viable backups available, check out Paul Randal’s excellent Tech Ed Session where he covers various mechanisms of recovery without using backups. (And note that while I’m making the case from Paul’s video that you SHOULD have backups available – given what’s required otherwise (and given how low your potential for success can be without backups) – Paul makes similar arguments but ALSO addresses VERY advanced scenarios where UP-TIME is more important than DATA INTEGRITY). Hopefully, though, in watching that presentation you’ll come away with a much more healthy respect for how important backups can be, and for how you need to be regularly testing your backups to make sure that they’re viable fo...
Read the rest of entry >>
In previous posts we looked at how to enable regular checks for corruption and how to set up alerts for IO subsystem problems when they occur. But, as covered previously, one of the key means for properly dealing with corruption is to detect it early – so that you have more options for correcting it at your disposal that you would if it’s allowed to go unnoticed for long periods of time. Consequently, in this post we’ll take a peek at how you can use backups as an additional means of detecting corruption. Bonus Points: Leveraging Backups in the War on Corruption Simply stated: you need backups. While Mirroring and AlwaysOn do provide cool features that will allow you to recover from some forms of corruption auto-magically, the reality is that backups are still your best line of defense against certain forms of disaster. Consequently, since you need backups, there’s no reason you couldn’t, or shouldn’t, leverage them in the war on corruption – as a way to more quickly discover physical corruption after it happens. So, given that it’s already assumed that you’ve got regular FULL/DIFFERENTIAL backups plus regular transaction log backups on hand to meet your target RPOs and RTOs (and assuming that you’re regularly testing those backups to make sure that they’re viable), the fact that backups – by their very nature – are coming into direct contact with data that has been modified means that they’re a great, additional, means of detecting corruption that might have occurred or gone unnoticed during write operations. Enabling CHECKSUMs as part of your Backup Process When it comes to using backups as a means of detecting corruption, the key thing to note about backups is that they’re designed to copy pages (or blocks of data) from one file location to another – as a means of creating a duplicate copy (or backup) of SQL Server data. Accordingly, whenever a FULL backup kicks off, IF it could be directed to CHECK all pages and make sure they were valid, that would be...
Read the rest of entry >>
Previously we looked at how to set up alerts for cases where SQL Server encounters issues with the storage subsystem. Setting those up is a key component to staying alerted to when problems happen – but another great mechanism for detecting corruption early is to set up regular checks for corruption. Regular Checks for Corruption Of course, in some environments, you may find that SQL Server will write data to disk and then may not use it for long periods of time afterwards. Therefore, a best practice in addressing the possibility of corruption is to run regular checks (typically once or twice a week as possible) to aggressively verify data integrity and validate many forms of constraints present in the database itself. To do this, you’ll just want to take advantage of an insanely powerful and robust tool that ships with SQL SERVER, known or accessed via DBCC CHECKDB (and associated) commands. And while the scope of this tool and its capabilities, optimizations, and functionality are really beyond the scope of this document (simply because an entire book could easily be written about DBCC CHECKDB), documentation for this tool can be found here. (Note too that DBCC CHECKDB is also mentioned/used later in this document not just as a means of detecting corruption, but REPORTING on the exact specifics about the scope, nature, impact, and type of corruption found when it is indeed found.) Otherwise, when it comes to regularly detecting corruption, a best practice is simply to run DBCC CHECKDB to run on a regular basis – as a scheduled job. So, for example, assume that you wanted to run DBCC CHECKDB against your database(s) every Sunday and Wednesday nights (periods where performance isn’t critical and, for example, periods right before making FULL/DIFFERENTIAL backups). To institute these checks, you’d just need to create a new SQL Server Agent Job as follows: Then, after giving the new Job a name in the General tab (and assigning an owner – SysAdmin is the bes...
Read the rest of entry >>
Continuing on from our last post where we looked at setting up CHECKSUM verification as one means of helping more readily detect corruption early-on, we’ll now take a look at how to have SQL Server notify you of any instances where it runs into problems with the IO subsystem. Configuring Alerts for Storage Problems In addition to being able to quickly and easily detect corruption problems once CHECKSUM verification has been enabled, SQL Server is also robust enough to be able to watch out for other kinds of disk errors. In this regard, SQL Server can be set up to pro-actively send alerts when it encounters simple errors such as read-write failures, or even when it proactively attempts to issue multiple re-read attempts against ‘sticky’ or problematic data (that is typically a clear sign of impending problems at the IO subsystem level). Therefore, to take advantage of SQL Server’s innate ability to track these kinds of errors, all you need to do is: Ensure that you’ve got SQL Server’s Database Mail functionality set up and properly configured. (Setting this up is outside the scope of this document, but is easy enough to do by right-clicking on the node and walking through a wizard where you establish connectivity and settings that enable SQL Server to ‘talk’ to an SMTP server within your organization). Create at least one SQL Server Agent Operator – as per the screenshot below: And, again, the creation of operators is outside the scope of this document – but is easy enough to do by right-clicking the Operators node and walking through the process of creating at least a ‘catch all’ operator such as ‘Admins’ or ‘General’ that can be mapped to something like alerts@yourcompany.com or some other monitored email address where alerts can be sent. Configure SQL Server to forward instances of common IO errors ‘up’ to a SQL Server Agent Operator so that you can be notified of errors AS they occur. To implement this ...
Read the rest of entry >>
In previous posts within this series, we’ve already looked at how you can’t really prevent corruption. Instead, in order to be able to better deal with corruption, you need to be able to detect it early. To that end, there are actually a number of different ways to enable early (or early-ish) detection of corruption when it happens. And, in this post, we’ll look at the use of CHECKSUM Page Verification as the first of these methods. Enabling CHECKSUM for Page Verification When configured correctly, SQL Server can instantly spot when data that it is pulling off of disks was saved improperly – or corrupted. To enable this functionality, you just need to ensure that desired databases are configured to use the CHECKSUM option for page verification – which you can set on SQL Server 2005 and above. To do this, simply query your server as follows, and review the names of all databases returned – as they are not configured for CHECKSUM verification:
.sqlcode {
font-size: 13px;
font-weight: bold;
font-family: monospace;;
white-space: pre;
-o-tab-size: 4;
-moz-tab-size: 4;
-webkit-tab-size: 4;
}
.sqlcomment {
color: #00aa00;
}
.sqlstring {
color: #aa0000;
}
.sqlfunction {
color: #aa00aa;
}
.sqlkeyword {
color: #0000aa;
}
.sqloperator {
color: #777777;
} USE master
GO
SELECT NAME
,page_verify_option_desc
FROM master.sys.databases
WHERE page_verify_option_desc != 'CHECKSUM'
GO
Then, to enable (or toggle) these database to use CHECKSUM verification, execute the following statement per database that you wish to ‘convert’ to using CHECKSUM verification:
.sqlcode {
font-size: 13px;
font-weight: bold;
font-family: monospace;;
white-space: pre;
-o-tab-size: 4;
-moz-tab-size: 4;
-webkit-tab-size: 4;
}
.sqlcomment {
color: #00aa00;
}
.sqlstring {
color: #aa0000;
}
.sqlfunction {
color: #aa00aa;
}
.sqlkeyword {
color: #0000aa;
}
.sqloperator {
color: #777777;
}
...
Read the rest of entry >>
In previous posts we touched upon an overview of what causes corruption and looked at ways to simulate corruption. In this post we’ll talk about more about corruption – in terms of addressing how you can prevent it. (Though, if you’ve been paying attention in previous posts, you’ll probably guess what’s coming up). How to Prevent SQL Server Database Corruption In answer to the question: “How can I prevent corruption?” the simple, and honest, answer is: “You really can’t”. As covered in previous posts, corruption is ‘more or less’ a random problem that occurs at the disk subsystem level and really can’t be prevented. (Or as I like to tell all of my consulting clients: “There’s a reason that EVERY major Operating System out there ships with something like CHKDSK.”) So, while disk/storage errors should be rare within any production system, the fact is that these kinds of errors aren’t so much a question of IF they’ll happen, but WHEN they’ll happen – simply because the reality is that magnetic storage is insanely complex and subject to the occasional ‘hiccup’ that can result in corruption. However, given that corruption SHOULD be a relatively rare problem that you will encounter in your environment, it IS safe to say that if you keep bumping into seemingly-regular or semi-regular problems with corruption, it’s fair to say that you may need to start becoming concerned about whether there’s a problem with your hardware or possibly even drivers and so on. (The problem, however, is that troubleshooting these kinds of problems is insanely hard – so if you do start to recognize regular problems with corruption it may be time to start thinking about migrating critical systems to new hardware.) However, even though you can’t prevent corruption, that doesn’t mean that you are powerless against its effects. In fact, not only does SQL Server readily accept or anticipate that disk subsystem errors can and WILL occur, but it also provides a number of great tools and function...
Read the rest of entry >>