Database corruption: DBCC CHECKDB for very large databases
We know that SQL Server data is stored on file system storage. There is constant input/output (I/O) interaction between SQL Server and the storage subsystem, both in memory and on disk. The I/O subsystem plays a major role: in the vast majority of cases, database corruption originates in the I/O subsystem (controllers, disks, drivers and so on).
In this post, I am sharing a few things:
1. How important CHECKDB is.
2. How to fine-tune and use checkDB for VLDBs.
3. Methods of troubleshooting corruption issues.
Storage / VM Admin: I sent you a graph showing that IOPS on the server peak once a week, between these days.
DBA Admin: Yes, we run the checkDB job for the VLDB once a week on those days. That is probably the cause: checkDB reads every allocated page in the database, so it consumes a lot of IOPS.
Storage / VM Admin: It is a huge spike for the VMs. Can you disable it next week to see whether the IOPS drop? You could then run it once a month.
DBA Admin: No, this check is very important for data consistency and integrity.
In the end, we changed the checkDB to run once a month. Everything went well until the day the database reported corruption.
Now what? We restored the latest full backup under a different database name and ran checkDB. Surprise: the backup was corrupted too. The corruption was severe, so the only options were to restore from a backup or run REPAIR_ALLOW_DATA_LOSS, and the application would not work after a repair with data loss. We recovered using another method on the application side, but that is a different story.
The point is that checkDB is very important. Run it at least before taking a full backup; that gives us a minimum level of protection against corruption.
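As a minimal sketch of "check first, then back up", the two commands below could be placed in consecutive SQL Agent job steps, with the backup step configured to run only when the check step succeeds. The database name BigDB and the backup path are placeholders, not names from this post.

```sql
-- Step 1: full consistency check; report errors only, no informational messages.
-- BigDB is a placeholder database name.
DBCC CHECKDB (N'BigDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Step 2 (run only when step 1 succeeded): full backup.
-- WITH CHECKSUM verifies existing page checksums while the pages are read
-- and writes a checksum for the backup itself. The path is a placeholder.
BACKUP DATABASE [BigDB]
TO DISK = N'X:\Backup\BigDB_full.bak'
WITH CHECKSUM, STATS = 10;
```

Note that WITH CHECKSUM on the backup is a complement to checkDB, not a replacement for it.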
Best options for a very large database (VLDB). The database is 10 TB+ and checkDB runs for more than two days. How do we reduce the run time? I had this problem with one of my databases. I tried different approaches and got some good advice from Paul Randal.
Me: CheckDB on my 10 TB database runs for more than two days, but it took 8+ hours when I excluded the non-clustered indexes. I hope I can go with that, since I can recreate a non-clustered index if it gets corrupted.
A response from Paul Randal: Sure – you can do that, but you won’t know when your indexes are corrupt until queries start failing or getting wrong results. I don’t recommend it. Backup, copy, restore, checkdb is the way to do it, or split the checks up using DBCC CHECKTABLE.
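For reference, the two approaches discussed in this exchange look roughly like this. NOINDEX skips the intensive checks of non-clustered indexes on user tables, while DBCC CHECKTABLE checks a single table together with its indexes. BigDB and dbo.Orders are placeholder names chosen for illustration.

```sql
-- Faster variant: non-clustered indexes on user tables are not checked in depth,
-- so corruption in them goes unnoticed until queries fail or return wrong results.
DBCC CHECKDB (N'BigDB', NOINDEX) WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Alternative from the reply: split the work and check one table at a time.
USE [BigDB];
DBCC CHECKTABLE (N'dbo.Orders') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```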
1. Initially, I skipped the non-clustered indexes, since they can be dropped and recreated if they get corrupted. This will definitely reduce the run time. In my case: from 2 days to 8 hours.
2. Use Ola Hallengren's checkDB script (DatabaseIntegrityCheck). It has many parameters you can use to tune the check.
For example, if a 10 TB database has a 500 GB table that is not critical, you can skip it, for instance when it is very old dump data that can be re-imported from the original source file. (That is a special case.)
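A hedged sketch of such a call is shown below. It assumes Ola Hallengren's maintenance solution is installed in the current database; BigDB and dbo.OldDump are placeholder names, and the parameter values should be verified against the documentation of the version you have installed.

```sql
-- Assumes dbo.DatabaseIntegrityCheck (Ola Hallengren's maintenance solution)
-- exists in the current database. BigDB and dbo.OldDump are placeholders.
EXECUTE dbo.DatabaseIntegrityCheck
    @Databases     = 'BigDB',
    -- Run the individual checks instead of one CHECKDB so tables can be filtered.
    @CheckCommands = 'CHECKALLOC,CHECKTABLE,CHECKCATALOG',
    -- All objects except the old, re-importable dump table.
    @Objects       = 'ALL_OBJECTS,-BigDB.dbo.OldDump',
    -- Do not start new commands after 8 hours (value is in seconds).
    @TimeLimit     = 28800,
    @LogToTable    = 'Y';
```

Excluding a table means corruption in it will not be detected, so this only makes sense for data you can reload from its source.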
3. Split the check over several days. There are two methods: 1. File and filegroup checks. 2. A split checkDB for the default case, a database with a single MDF filegroup.
Method 1: Files and filegroups. You can run “DBCC CHECKFILEGROUP”. This is the easy one; you only need to work out how much data has to be checked each day. If the filegroups are not all the same size, combine them so that each day's run is roughly equal.
Method 2: For single-file VLDBs, use a split checkDB.
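A small sketch of Method 1: first list the filegroups by size to plan the daily schedule, then check one filegroup per run. BigDB and FG_Archive are placeholder names.

```sql
USE [BigDB];

-- Size per filegroup, to decide which filegroups to check on which day.
-- sys.database_files.size is expressed in 8 KB pages.
SELECT fg.name AS filegroup_name,
       SUM(CAST(df.size AS bigint)) * 8 / 1024 / 1024 AS size_gb
FROM sys.filegroups AS fg
JOIN sys.database_files AS df
    ON df.data_space_id = fg.data_space_id
GROUP BY fg.name
ORDER BY size_gb DESC;

-- Check a single filegroup of the current database (placeholder name).
DBCC CHECKFILEGROUP (N'FG_Archive') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```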
“Figure out your largest tables (by number of pages) and split the total number into 7 buckets, such that there are a roughly equal number of database pages in each bucket.”
Example: Put the largest tables in the first bucket and spread the remaining tables over the other buckets, then run the DBCC commands per the post quoted above, one bucket per day. If your largest table has 100,000 pages, every bucket should hold roughly 100,000 pages.
4. One more interesting case from Argenis Fernandez: a non-clustered index on a sparse column makes the checkDB run time worse.
5. A post by Aaron Bertrand covers trace flag usage and more.
You can use and combine methods 1 to 5 for VLDBs to reduce the run time. Test which option works for your business and use that one, but do not skip running checkDB.
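One way to build the buckets is sketched below. It is my own illustration, not the script from the quoted post: it takes the used page count of every user table from sys.dm_db_partition_stats, walks the tables from largest to smallest, and always assigns the next table to the bucket that currently holds the fewest pages. BigDB is a placeholder name.

```sql
USE [BigDB];

-- Seven empty buckets, one per day of the week.
DECLARE @Buckets TABLE (bucket_id int PRIMARY KEY, total_pages bigint NOT NULL);
INSERT INTO @Buckets (bucket_id, total_pages)
VALUES (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0), (7, 0);

DECLARE @Assignment TABLE (
    table_name nvarchar(517) NOT NULL,
    pages      bigint        NOT NULL,
    bucket_id  int           NOT NULL
);

DECLARE @table_name nvarchar(517), @pages bigint, @target int;

-- User tables only, largest first (system tables and indexed views are not covered).
DECLARE table_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT QUOTENAME(SCHEMA_NAME(t.schema_id)) + N'.' + QUOTENAME(t.name),
           SUM(ps.used_page_count)
    FROM sys.tables AS t
    JOIN sys.dm_db_partition_stats AS ps
        ON ps.object_id = t.object_id
    GROUP BY t.schema_id, t.name
    ORDER BY SUM(ps.used_page_count) DESC;

OPEN table_cursor;
FETCH NEXT FROM table_cursor INTO @table_name, @pages;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Pick the bucket with the fewest pages so far.
    SELECT TOP (1) @target = bucket_id
    FROM @Buckets
    ORDER BY total_pages, bucket_id;

    UPDATE @Buckets
    SET total_pages = total_pages + @pages
    WHERE bucket_id = @target;

    INSERT INTO @Assignment (table_name, pages, bucket_id)
    VALUES (@table_name, @pages, @target);

    FETCH NEXT FROM table_cursor INTO @table_name, @pages;
END

CLOSE table_cursor;
DEALLOCATE table_cursor;

-- One DBCC CHECKTABLE command per table, grouped by day.
SELECT bucket_id,
       table_name,
       pages,
       N'DBCC CHECKTABLE (N''' + REPLACE(table_name, N'''', N'''''')
           + N''') WITH NO_INFOMSGS, ALL_ERRORMSGS;' AS command
FROM @Assignment
ORDER BY bucket_id, pages DESC;
```

DBCC CHECKTABLE alone does not cover allocation and catalog consistency, so DBCC CHECKALLOC and DBCC CHECKCATALOG still need to be scheduled on some of the days.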
Steps to identify the corruption:
Step 1: Run DBCC checkDB at least once a week as an agent job; it will report any corruption.
Step 2: Check the error logs daily. If you have a centralised server, automate an email that checks the error log every hour for critical errors, or create an alert with a notification to an operator.
If the database has any corruption, note the database name from the error message.
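As a sketch of the alert option in Step 2, the script below creates SQL Agent alerts for the I/O-related errors 823, 824 and 825 and notifies an operator by e-mail. The choice of these three error numbers and the operator name DBA Team are my assumptions; the operator must already exist and Database Mail must be configured. The final query lists the pages SQL Server has already flagged as suspect.

```sql
-- Alerts for I/O-related errors; alert names and operator are placeholders.
EXEC msdb.dbo.sp_add_alert
    @name = N'Error 823 - I/O error',
    @message_id = 823,
    @severity = 0,
    @include_event_description_in = 1;
EXEC msdb.dbo.sp_add_notification
    @alert_name = N'Error 823 - I/O error',
    @operator_name = N'DBA Team',
    @notification_method = 1;   -- 1 = e-mail

EXEC msdb.dbo.sp_add_alert
    @name = N'Error 824 - logical consistency I/O error',
    @message_id = 824,
    @severity = 0,
    @include_event_description_in = 1;
EXEC msdb.dbo.sp_add_notification
    @alert_name = N'Error 824 - logical consistency I/O error',
    @operator_name = N'DBA Team',
    @notification_method = 1;

EXEC msdb.dbo.sp_add_alert
    @name = N'Error 825 - read retry',
    @message_id = 825,
    @severity = 0,
    @include_event_description_in = 1;
EXEC msdb.dbo.sp_add_notification
    @alert_name = N'Error 825 - read retry',
    @operator_name = N'DBA Team',
    @notification_method = 1;

-- Pages that SQL Server has recorded as suspect, with the affected database.
SELECT DB_NAME(database_id) AS database_name,
       file_id,
       page_id,
       event_type,
       error_count,
       last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;
```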
Step 3: If you find any error, run a consistency check with options that return the exact corruption messages: DBCC CHECKDB ('DBname') WITH NO_INFOMSGS, ALL_ERRORMSGS
CheckDB reports the errors with a hint about the minimum repair option that can fix the corruption. This is only a suggestion from SQL Server; in some memory-level corruption cases, restarting the SQL service fixes the problem without an actual REPAIR_ALLOW_DATA_LOSS run. But you need to know which cases a restart can fix.
Step 4: If you have good experience with the error and think you can fix it without data loss, you can try. Examples: for non-clustered index corruption, drop and recreate the index; for some memory corruption, restart the SQL service.
There are cases where a restart of the SQL service fixes the inconsistency.
My case: I had a database that showed as online in “sys.master_files”, and the data and log files were present in the file system, but I could not see any tables. These were the errors:
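For the non-clustered index case, a minimal sketch could look like the one below. CheckDB error messages report an object ID and an index ID; the IDs, the table dbo.Orders, the index IX_Orders_CustomerId and its key column are all placeholders that must be replaced with the values from your own output and the index's real definition.

```sql
USE [DBname];

-- Translate the object ID and index ID from the checkDB message into names.
-- 123456789 and 3 are placeholder values.
SELECT OBJECT_SCHEMA_NAME(i.object_id) AS schema_name,
       OBJECT_NAME(i.object_id)        AS table_name,
       i.name                          AS index_name,
       i.type_desc
FROM sys.indexes AS i
WHERE i.object_id = 123456789
  AND i.index_id  = 3;

-- If it is a plain non-clustered index (index_id > 1, not backing a constraint),
-- script out its definition first, then drop and recreate it.
DROP INDEX [IX_Orders_CustomerId] ON [dbo].[Orders];

CREATE NONCLUSTERED INDEX [IX_Orders_CustomerId]
ON [dbo].[Orders] ([CustomerId]);
```

Afterwards, run checkDB again to confirm that the errors are gone.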
Msg 1823, Level 16, State 2, Line 1
A database snapshot cannot be created because it failed to start.
Msg 7928, Level 16, State 1, Line 1
The database snapshot for online checks could not be created. Either the reason is given in a previous error or one of the underlying volumes does not support sparse files or alternate streams. Attempting to get exclusive access to run checks offline.
Msg 5030, Level 16, State 12, Line 1
The database could not be exclusively locked to perform the operation.
Msg 7926, Level 16, State 1, Line 1
Check statement aborted. The database could not be checked as a database snapshot could not be created and the database or table could not be locked. See Books Online for details of when this behavior is expected and what workarounds exist. Also see previous errors for more details.
Msg 9001, Level 21, State 1, Line 1
The log for database ‘DB’ is not available. Check the event log for related error messages. Resolve any errors and restart the database.
The operating system returned error 21(The device is not ready.) to SQL Server during a read at offset 0x00001e49c26000 in file ‘F:\Microsoft SQL Server\DATA\BI.mdf’. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
We had some glitches in the storage subsystem. After they were fixed and the SQL service was restarted, the database was accessible again.
If you have no idea what the error means, or need help from the SQL Server corruption experts, you can ask them at https://twitter.com/#sqlhelp. I have done so many times.
Read Gail Shaw’s post http://www.sqlservercentral.com/articles/Corruption/65804/
What matters most is that we have good, non-corrupted, up-to-date backups on hand for all production servers.
Once we have identified database corruption, what basic steps can we take? There are different levels of corruption, and each needs its own steps. Here are some basic steps that you can try.
1. Restore the database on a different server and storage subsystem and run checkDB there. For VLDBs this option is often not practical, since it needs a large amount of additional storage and the restore takes a long time.
2. If you have up-to-date backups (including a tail-log backup), restore them in sequence.
3. If there is no other option, that is, you do not have a backup and the lower repair levels cannot fix the problem, then as a last resort use “REPAIR_ALLOW_DATA_LOSS”, which repairs the database at the cost of losing data.
4. Some corruption cannot be fixed even by REPAIR_ALLOW_DATA_LOSS, so the only good way out is to restore a good backup. It is very important to back up the database after a clean checkDB and to run test restores frequently.
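Steps 2 and 3 above can be sketched as follows. This is an outline under assumptions, not a recovery runbook: DBname, the backup paths and the number of log backups are placeholders, and the sequence assumes the full recovery model with an unbroken log backup chain.

```sql
-- Step 2: tail-log backup, then restore the chain in sequence.
-- NO_TRUNCATE allows a best-effort log backup of a damaged database;
-- NORECOVERY leaves the database in the RESTORING state.
BACKUP LOG [DBname]
TO DISK = N'X:\Backup\DBname_tail.trn'
WITH NO_TRUNCATE, NORECOVERY;

RESTORE DATABASE [DBname]
FROM DISK = N'X:\Backup\DBname_full.bak'
WITH NORECOVERY;

RESTORE LOG [DBname]
FROM DISK = N'X:\Backup\DBname_log_01.trn'
WITH NORECOVERY;

-- The tail-log backup goes last and brings the database online.
RESTORE LOG [DBname]
FROM DISK = N'X:\Backup\DBname_tail.trn'
WITH RECOVERY;

DBCC CHECKDB (N'DBname') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Step 3, last resort only: repair with data loss.
-- Repair options require the database to be in single-user mode.
ALTER DATABASE [DBname] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB (N'DBname', REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS, ALL_ERRORMSGS;
ALTER DATABASE [DBname] SET MULTI_USER;
```

Before a repair with data loss, take a backup of the corrupt database as it is, so that there is something to go back to if the repair makes things worse.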
How can we prevent database corruption? There is no way to prevent it entirely, but we can do proactive DBA work:
Run checkDB weekly, before a full backup
Do rotational test restores of the databases
Schedule a DBCC CHECKDB 🙂
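A test restore can be sketched as below: verify the backup file, restore it under a different name, run checkDB on the copy and drop it again. All names and paths are placeholders; the logical file names in the MOVE clauses must match what RESTORE FILELISTONLY returns, and WITH CHECKSUM on the verify step assumes the backup was taken WITH CHECKSUM.

```sql
-- Logical file names stored in the backup, needed for the MOVE clauses below.
RESTORE FILELISTONLY
FROM DISK = N'X:\Backup\DBname_full.bak';

-- Quick readability check of the backup file; this is not a substitute for a real restore.
RESTORE VERIFYONLY
FROM DISK = N'X:\Backup\DBname_full.bak'
WITH CHECKSUM;

-- Restore under a different name, on different storage if possible.
RESTORE DATABASE [DBname_RestoreTest]
FROM DISK = N'X:\Backup\DBname_full.bak'
WITH MOVE N'DBname'     TO N'Y:\RestoreTest\DBname_RestoreTest.mdf',
     MOVE N'DBname_log' TO N'Y:\RestoreTest\DBname_RestoreTest.ldf',
     RECOVERY, STATS = 10;

-- A clean result here shows that this backup was restorable and consistent.
DBCC CHECKDB (N'DBname_RestoreTest') WITH NO_INFOMSGS, ALL_ERRORMSGS;

DROP DATABASE [DBname_RestoreTest];
```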