AlwaysON database not synchronizing suspect mode

by Muthukkumaran kaliyamoorthy Published on: September 17, 2016
Comments: No Comments
Tags:
Categories:AlwaysON

 

I got a call the database, not online/available mode, we suspect there was a corruption. One of my AlwaysON secondary database went suspect mode, it’s because of the log file and drive was full. I tried to resume the database. ALTER DATABASE dbname SET HADR RESUME;.

It went “in recovery“ phase and fails. Since, it does not have even 1 MB space in the log file and drive to do a recovery phase.

The drive has a good capacity for log, even though there was a huge transaction from application and the important one transaction log backup failed for five hours. I have a transaction log backup every five minutes. Since it is a high load critical OLTP database. The log full caused by a transaction log backup failure and the active massive open transactions. We had issues with NetBackup storage.

Error 3414 is a generic error, it’s triggered by SQL, when the database is going to suspect mode. It does not mean that we had a corruption and need to run and follow the corruption procedure. All I want to say read the error log and understand the issue, work on accordingly. This was fixed after adding a storage space on the drive with HADR resume and “7407 transactions rolled forward in database”.

In the primary:

Error: 9002, Severity: 17, State: 2.

The transaction log for database ‘DB’is full due to ‘LOG_BACKUP’.

 

From MSDN: https://msdn.microsoft.com/en-us/library/ff877972.aspx

SUSPEND_FROM_REDO = An error occurred during the redo phase

SUSPEND_FROM_APPLY = An error occurred when writing the log to file (see error log)

In the secondary:

Error: 3414, Severity: 21, State: 1.

An error occurred during recovery, preventing the database ‘DBName’ (7:0) from restarting. Diagnose the recovery errors and fix them, or restore from a known good backup. If errors are not corrected or expected, contact Technical Support.

For more on 3414: https://support.microsoft.com/en-in/kb/2015741

In our case, it is a different issue and error: 3414 is not applicable.

Errors from the error log

Error: 18210, Severity: 16, State: 1.

BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup device ‘VNBU0-94692-3252-1473459870′. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).

Error: 17053, Severity: 16, State: 1.

L:\SQL_Log\DBNAME_LOG.LDF: Operating system error 112(There is not enough space on the disk.) encountered.

Error: 5149, Severity: 16, State: 3.

MODIFY FILE encountered operating system error 112(There is not enough space on the disk.) while attempting to expand the physical file ‘L:\SQL_Log\DBNAME_LOG.LDF’.

AlwaysOn Availability Groups data movement for database ‘DBName’ has been suspended for the following reason: “system” (Source ID 4; Source string: ‘SUSPEND_FROM_APPLY‘). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.

Error: 3041, Severity: 16, State: 1.

ALTER DB param option: RESUME

AlwaysOn Availability Groups data movement for database ‘DBName’ has been resumed. This is an informational message only. No user action is required.

AlwaysOn Availability Groups connection with primary database established for secondary database ‘DBName’ on the availability replica ‘SERVER’ with Replica ID: {3284e6a0-ca68-41ed-92ca-759757477e54}. This is an informational message only. No user action is required.

Error: 35285, Severity: 16, State: 1.

The recovery LSN (256072:1024000:1) was identified for the database with ID 7. This is an informational message only. No user action is required.

 

Error: 3313, Severity: 21, State: 1.

During redoing of a logged operation in database ‘DBName’, an error occurred at log record ID (256072:1023498:1). Typically, the specific failure is previously logged as an error in the Windows Event Log service. Restore the database from a full backup, or repair the database.

AlwaysOn Availability Groups data movement for database ‘DBName’ has been suspended for the following reason: “system” (Source ID 2; Source string: ‘SUSPEND_FROM_REDO’). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.

Error: 3414, Severity: 21, State: 1.

An error occurred during recovery, preventing the database ‘DBName’ (7:0) from restarting. Diagnose the recovery errors and fix them, or restore from a known good backup. If errors are not corrected or expected, contact Technical Support.

Error: 926, Severity: 14, State: 1.

Database ‘DBName’ cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.

 

ALTER DB param option: RESUME

AlwaysOn Availability Groups data movement for database ‘DBName’ has been resumed. This is an informational message only. No user action is required.

Nonqualified transactions are being rolled back in database DBName for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.

State information for database ‘DBName’ – Hardended Lsn: ‘(256072:1024000:1)’    Commit LSN: ‘(0:0:0)’    Commit Time: ‘Jan  1 1900 12:00AM’

AlwaysOn Availability Groups connection with primary database terminated for secondary database ‘DBName’ on the availability replica ‘SERVER’ with Replica ID: {3284e6a0-ca68-41ed-92ca-759757477e54}. This is an informational message only. No user action is required.

State information for database ‘DBName’ – Hardended Lsn: ‘(256072:1024000:1)’    Commit LSN: ‘(0:0:0)’    Commit Time: ‘Jan  1 1900 12:00AM’

Starting up database ‘DBName’.

Recovery of database ‘DBName’ (7) is 0% complete (approximately 903 seconds remain). Phase 1 of 3. This is an informational message only. No user action is required.

Recovery of database ‘DBName’ (7) is 0% complete (approximately 902 seconds remain). Phase 1 of 3. This is an informational message only. No user action is required.

Recovery of database ‘DBName’ (7) is 0% complete (approximately 902 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.

 

After adding a storage space on the partition/drive. The database went throw the recovery phase and back in online ALTER DATABASE dbname SET HADR RESUME;.

AlwaysOn Availability Groups connection with primary database established for secondary database ‘DBName’ on the availability replica ‘SERVER’ with Replica ID: {3284e6a0-ca68-41ed-92ca-759757477e54}. This is an informational message only. No user action is required.

Error: 35285, Severity: 16, State: 1.

The recovery LSN (256072:1024000:1) was identified for the database with ID 7. This is an informational message only. No user action is required.

7407 transactions rolled forward in database ‘DBName’ (7:0). This is an informational message only. No user action is required.

Recovery completed for database DBName (database ID 7) in 24 second(s) (analysis 1561 ms, redo 13390 ms, undo 0 ms.) This is an informational message only.

No user action is required.

AlwaysOn Availability Groups connection with primary database established for secondary database ‘DBName’ on the availability replica ‘SERVER’ with Replica ID: {3284e6a0-ca68-41ed-92ca-759757477e54}. This is an informational message only. No user action is required.

Error: 35285, Severity: 16, State: 1.

The recovery LSN (256072:1024000:1) was identified for the database with ID 7. This is an informational message only. No user action is required.

CHECKDB for database ‘DBName’ finished without errors on 2016-09-03 04:56:19.650 (local time). This is an informational message only; no user action is required.

 

At the same time, I had WSFC issue, my secondary node went offline in the fail over cluster manager. But, the node is online and ping. I tried to bring online, but no luck. I had some automatic fail-overs and node was up and down. The fail over cluster manager was not working and even does not show the cluster name. I manually tried to connect the cluster by clicking –> connect to clusters –> Typed the name –> Timeout and cannot able to connect.

I validated the cluster status and vote by PowerShell. The vote was zero, which means node cannot able to talk to the cluster. The cluster works fine, since it is a windows server 2012 with dynamic quorum.

I tried to validate the cluster by clicking –> validate configuration –> next –> Enter name

–> Click browse button –> the node of secondary has not added and other nodes are added.

See the image, the cluster name does not come in in the cluster manager and validation does not add the secondary node. It only added a primary and witness/quorum node.

I have no clue and the cluster errors are generic and reboot the secondary server, which shows the cluster name and all the nodes online in the WSFC. There was some glitch in the secondary node, which made the disconnection in the WSFC.

Errors from the alwaysON health check:

Message: A connection timeout has occurred while attempting to establish a connection to availability replica ‘server’ with id [3284E6A0-CA68-41ED-92CA-759757477E54]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance.

Statement: ALTER AVAILABILITY GROUP [Groupname] FAILOVER;

From Cluster log

The state of the local availability replica in availability group ‘GroupName’has changed from ‘SECONDARY_NORMAL’ to ‘RESOLVING_PENDING_FAILOVER’.

The state changed because of a user initiated failover.  For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log.

AlwaysOn Availability Groups connection with primary database terminated for secondary database ‘DB’on the availability replica ‘server’ with Replica ID: {3284e6a0-ca68-41ed-92ca-759757477e54}. This is an informational message only. No user action is required.

AlwaysOn: The local replica of availability group ‘GroupName’is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

The state of the local availability replica in availability group ‘GroupName’has changed from ‘RESOLVING_PENDING_FAILOVER’ to ‘RESOLVING_NORMAL’.

The state changed because the availability group is coming online.  For more information, see the SQL Server error log,

Windows Server Failover Clustering (WSFC) management console, or WSFC log.

AlwaysOn Availability Groups connection with primary database terminated for secondary database ‘DB’ on the availability replica ‘server’ with Replica ID: {3284e6a0-ca68-41ed-92ca-759757477e54}. This is an informational message only. No user action is required.

Muthukkumaran kaliyamoorthy

I’m currently working as a SQL server DBA in one of the top MNC. I’m passionate about SQL Server And I’m specialized in Administration and Performance tuning. I’m an active member of SQL server Central and MSDN forum. I also write articles in SQL server Central. For more Click here

More Posts - Website

TempDB database is Full and Optimization

by Muthukkumaran kaliyamoorthy Published on: September 13, 2016
Comments: 2 Comments
Tags: , , ,
Categories:DBA

What is TempDB and best practice for TempDB

TempDB is the system database and it is per instance. It is a common and shared by all other databases. All the temporary activities are done here and yes, definitely the TempDB will become full and occupy more space depends on the temporary tasks, which we are running. There are many activities can happen in tempDB.

Best practice, create a TempDB in separate disk with the estimated initial file size, those are old days and now most of us using disk array. The spindles and HDDS/SSDs are striped through RAID and shared across LUNs and pools, check with your infra team about the disk configuration from the storage. How the disk is mounted or presented to the server. Is it a dedicated drive (commonly in DAS) or a disk array (commonly in SAN). DAS or SAN, SAN is most common industry standard that is being used. We had both DAS and SAN, it depends on the database application. Definitely a RAID level has been used, check with that as well. A common recommendation will be RAID 1+0, which is costly. It is always good to know, what kind of storage and storage vendor, we are using it to plan and test that perfectly. Some servers we use advanced Automated Tiered Storage, which is good some without tired etc. DBAs have limited knowledge in the storage, but it is good to learn from our SAN & VM admins :-) .

What are all the activities are done in Tempdb and which is occupying the space

The list is very big (User Objects, Internal Objects & Version Stores) are stored in temporarily in tempDB.  Read the TechNet article for more.

If it gets full often, we need to capture the tasks that are hitting to tempDB and need to plan accordingly. 

Find out the Tempdb physical file usage

SELECT SUM(unallocated_extent_page_count) AS [free pages],
SUM(unallocated_extent_page_count
+ user_object_reserved_page_count
+ internal_object_reserved_page_count
+ mixed_extent_page_count
+ version_store_reserved_page_count) * (8.0/1024.0/1024.0) AS [Total TempDB SizeInGB]
, SUM(unallocated_extent_page_count * (8.0/1024.0/1024.0)) AS [Free TempDB SpaceInGB]
,unallocated_extent_page_count
,user_object_reserved_page_count
,internal_object_reserved_page_count
,mixed_extent_page_count
FROM tempdb.sys.dm_db_file_space_usage
--where [FreeTempDBSpaceInGB]>50
group by unallocated_extent_page_count,user_object_reserved_page_count,internal_object_reserved_page_count,mixed_extent_page_count

TempDB DMV:

Monitor the disk space used by the user objects, internal objects, and version stores in the tempdb files.

sys.dm_db_file_space_usage – Returns space usage information for each file in the database.

Allocation or deallocation activity in tempdb at the session or task level

sys.dm_db_session_space_usage - Returns the number of pages allocated and deallocated by each session for the database.

sys.dm_db_task_space_usage – Returns page allocation and deallocation activity by task for the database.

These views can be used to identify large queries, temporary tables, or table variables that are using a large amount of tempdb disk space.

For more:TechNet article diagnosing tempdb Disk Space Problems

 

Useful queries to find out who is using my TempDB

Following is code originally from Gianluca Sartori and Deepak Biswal. I just copied here. The MSDN article has more code, have a look at it and use the same based on your case. This code really cool and helped me a lot.

The following query will only show the Active request joining from sys.dm_exec_requests DMV. If the query finished, you cannot get that by using this.

;WITH task_space_usage AS (
-- SUM alloc/delloc pages
SELECT session_id,
request_id,
SUM(internal_objects_alloc_page_count) AS alloc_pages,
SUM(internal_objects_dealloc_page_count) AS dealloc_pages
FROM sys.dm_db_task_space_usage WITH (NOLOCK)
WHERE session_id <> @@SPID
GROUP BY session_id, request_id
)
SELECT TSU.session_id,
TSU.alloc_pages * 1.0 / 128 AS [internal object MB space],
TSU.dealloc_pages * 1.0 / 128 AS [internal object dealloc MB space],
EST.text,
-- Extract statement from sql text
ISNULL(
NULLIF(
SUBSTRING(
EST.text,
ERQ.statement_start_offset / 2,
CASE WHEN ERQ.statement_end_offset < ERQ.statement_start_offset THEN 0 ELSE( ERQ.statement_end_offset - ERQ.statement_start_offset ) / 2 END
), ''
), EST.text
) AS [statement text],
EQP.query_plan
FROM task_space_usage AS TSU
INNER JOIN sys.dm_exec_requests ERQ WITH (NOLOCK)
ON TSU.session_id = ERQ.session_id
AND TSU.request_id = ERQ.request_id
OUTER APPLY sys.dm_exec_sql_text(ERQ.sql_handle) AS EST
OUTER APPLY sys.dm_exec_query_plan(ERQ.plan_handle) AS EQP
WHERE EST.text IS NOT NULL OR EQP.query_plan IS NOT NULL
ORDER BY 3 DESC, 5 DESC

The following query will show the session space usage. If the session is closed, you cannot get that by using this. 

SELECT DES.session_id AS [SESSION ID],
Db_name(DDSSU.database_id) AS [DATABASE Name],
host_name AS [System Name],
program_name AS [Program Name],
login_name AS [USER Name],
status,
( user_objects_alloc_page_count * 8 ) AS
[SPACE Allocated FOR USER Objects (in KB)],
( user_objects_dealloc_page_count * 8 ) AS
[SPACE Deallocated FOR USER Objects (in KB)],
( internal_objects_alloc_page_count * 8 ) AS
[SPACE Allocated FOR Internal Objects (in KB)],
( internal_objects_dealloc_page_count * 8 ) AS
[SPACE Deallocated FOR Internal Objects (in KB)],
cpu_time AS [CPU TIME (in milisec)],
total_scheduled_time AS
[Total Scheduled TIME (in milisec)],
total_elapsed_time AS
[Elapsed TIME (in milisec)],
( memory_usage * 8 ) AS [Memory USAGE (in KB)],
CASE is_user_process
WHEN 1 THEN 'user session'
WHEN 0 THEN 'system session'
END AS [SESSION Type],
row_count AS [ROW COUNT]
FROM tempdb.sys.dm_db_session_space_usage AS DDSSU
INNER JOIN sys.dm_exec_sessions AS DES
ON DDSSU.session_id = DES.session_id
ORDER BY [space allocated for internal objects (in kb)] DESC

If you find the session ID, which is using more temp space. Pass the session ID to find the code which is taking more TempDB. One of my other case, A poorly written query used more than 100GB of space.

SELECT TEXT
FROM sys.dm_exec_connections
CROSS APPLY sys.dm_exec_sql_text(most_recent_sql_handle)
WHERE session_id = (181)

 

How to solve the Tempdb contention and performance

Easy way to find the tempdb contention by using Whoisactive and more workout for TempDB.

You can also see the overall system wait type.

Latch contention can occur in tempDB allocation pages of GAM, SGAM and PFS. It is a common one, but when we have a lot of hits, it creates a performance issue, since the wait queue increases.

If you see a lot wait_type is PAGELATCH or PAGEIOLATCH with tempDB: PFS, GAM and SGAM ([wait_type]:[database_name]:[file_id](page_type)), then we have contention that needs to be fixed to improve the performance.

 

We can fix the contention by creating more data file with equal size, more data file will give more allocation pages (GAM, SGAM and PFS per data file).

If you cannot create a file equal size, since the one file can be very big, we can use trace flag 1117. It forces other files in the filegroup to grow it. It applies to other databases as well.

The easiest way to alleviate tempdb allocation contention is to enable trace flag 1118 and to add more tempdb data files. 

http://www.sqlskills.com/blogs/paul/correctly-adding-data-files-tempdb/

 

The number of data files is it depends. Recommendation from Paul Randal.

Then if you have less than 8 logical cores, create the same number of data files as logical cores. If you have more than 8 logical cores, create 8 data files and then add more in chunks of 4 if you still see PFS contention. Make sure all the tempdb data files are the same size too.

Temporary and permanent fix

We can reduce the file size by shrinking the files. Try to shrink the files first, if it is not shrinking, free the procedure cache and shrink again. You can FREEPROCCACHE more than one time until, you see or want to reduce space used by files.

It is a temporary fix and it does not need a recycle of SQL service. Shrinking the tempDB is fine, but not more frequently, since, it may leads an external disk fragmentation.

USE TEMPDB;
GO
DBCC SHRINKFILE (TEMPDEV,10000) -- Initial size with 10 GB
DBCC FREEPROCCACHE
DBCC SHRINKFILE (TEMPDEV,10000)

When we run more FREEPROCCACHE with a shrink, sometimes you can see the following error, Not sure the error looks like an internal file allocation, it can be fixed by increasing the file size + some MB. Ex: 1000MB file can increased by 1005MB.

File ID 1 of database ID 2 cannot be shrunk as it is either being shrunk by another process or is empty.

Msg 0, Level 11, State 0, Line 0

A severe error occurred on the current command. The results, if any, should be discarded.

Sometimes the free cache will not help, when you do not have a free space in the file. Use the following query and check the free space.

USE [tempdb]
SELECT
[name]
,CONVERT(NUMERIC(10,2),ROUND([size]/128.,2)) AS [Size]
,CONVERT(NUMERIC(10,2),ROUND(FILEPROPERTY([name],'SpaceUsed')/128.,2)) AS [Used]
,CONVERT(NUMERIC(10,2),ROUND(([size]-FILEPROPERTY([name],'SpaceUsed'))/128.,2)) AS [Unused]
FROM [sys].[database_files]

Another case, I had a row versioning enabled for the database which prevent the TempDB shrinking, and disabled it temporarily and shrink it. It helped me.

--check which dbs are in snapshot isoation mode
select name,is_read_committed_snapshot_on,snapshot_isolation_state, snapshot_isolation_state_desc,* from sys.databases
--where is_read_committed_snapshot_on =1
order by 1
--Disable snapshot isolation
alter database DBname set READ_COMMITTED_SNAPSHOT off with rollback after 30 seconds
go
use tempdb
go
dbcc shrinkfile (tempdev, 10000) --Shrink with 10GB
go
select (size*8)/1024.0 as FileSizeMB from sys.database_files --check new size
go
--Enable snapshot isolation
alter database DBname set READ_COMMITTED_SNAPSHOT on with rollback after 30 seconds
go

The permanent fix is hard one, we need to get the query or tasks, which all are hitting the tempDB and need to tune those. In my case, it is an 8 TB database and when the predefined maintenance plan runs, it occupies all the tempDB space. I used different method Ola Hallengren’s script. It was a fantastic script, we can have more control than the maintenance plan. I have skipped the non- clustered index and some historical VLT table, which is not a critical one. It reduced the tempDB space and job run time as well. I had a small contention and I did not create more data files.

In another case, one of the developer code uses all tempDB space and it has more contention as well and yes it got reduced after creating a more data files. Use the whoisactive and other script to find out the exact issue and fix based on what you have on the system. Since, I mixed with two more issues in this post.

 

Muthukkumaran kaliyamoorthy

I’m currently working as a SQL server DBA in one of the top MNC. I’m passionate about SQL Server And I’m specialized in Administration and Performance tuning. I’m an active member of SQL server Central and MSDN forum. I also write articles in SQL server Central. For more Click here

More Posts - Website

How to solve the LSN mismatch in SQL server

 

There are many times, we face the LSN mismatch issue in alwaysON and other HA technologies. It is a bit hard to find the missing transaction log backup to apply. Since, there are hundreds of thousands log generated, depends on the transaction log frequency and it can be run in any secondary alwaysON database server.

Think about “the VLDB” and “the backup is in different data center” and database are out of sync in DR site because of LSN mismatch. For VLDB 8 TB database, we cannot take a full or differential backup to fix this. Since, it will take more and more time. Backup is in different data center in CIFS share restore over the WAN will kill the performance and time.

 

LSN: Every record in the SQL Server transaction log is uniquely identified by a log sequence number (LSN). The restore database will work with the sequence of LSN order, no break in log chain.

It is easy if we understand the LSN chains a bit internally. Let me try to show in this post.

Example:

For the first log backup, the first LSN is 100 and Last LSN is 200, second log backup the first LSN should be 200 and last LSN is 300 and third log backup the first LSN should be 300 and last LSN will be some number and the chain goes on.

How to track, where the LSN breaks out and how to fix it. There are many methods, I used. All it depends on the issue and situation. Let me show all the methods and try it out.

Try is in testing not in production: Easy way to break the LSN in AlwaysON database.

Remove a database from an AlwaysON group in a secondary.

ALTER DATABASE [dba3] SET HADR OFF;

Take couple more log backup in primary.

DECLARE @MyFileName varchar(200)
SELECT @MyFileName='\\Sharepatht\dba3_' + REPLACE(convert(nvarchar(20),GetDate(),120),':','-') + '.trn'
--select @MyFileName
BACKUP log dba3 TO DISK=@MyFileName

Try to add the database back to secondary.

ALTER DATABASE [dba3] SET HADR AVAILABILITY GROUP = [AG-Test];

Error:

Msg 1478, Level 16, State 211, Line 1

The mirror database, has insufficient transaction log data to preserve the log backup chain of the principal database.  This may happen if a log backup from the principal database has not been taken or has not been restored on the mirror database.

Use the backup table in MSDB to retrieve the backup history with dynamic SQL.

Note: The filter where clause is important and it will be modified based on my steps. I just uncommented all conditions.

Dynamic Backup script:

 

SELECT  'restore database dba3 from disk= ''' +f.physical_device_name+''' with norecovery',
b.backup_finish_date,b.first_lsn,b.backup_size /1024/1024 AS size_MB,b.type,b.recovery_model,
b.server_name ,b.database_name,b.user_name, f.physical_device_name,b.*
FROM MSDB.DBO.BACKUPMEDIAFAMILY F
JOIN MSDB.DBO.BACKUPSET B
ON (f.media_set_id=b.media_set_id)
WHERE database_name like'dba3'
--and b.backup_finish_date>='2016-08-29 00:00:00.000'
--and b.first_lsn<=40000000007600001
-- and last_lsn= 39000000047800001
--and b.checkpoint_lsn =39000000047800001
--and database_backup_lsn =39000000047800001
--AND B.type='L'
--ORDER BY b.backup_finish_date desc
--order by b.first_lsn  desc

 

First method – A little tough, but good to understand the first and last LSN and the restore failures of LSN mismatch. This can be used for any other DR LSN issues, not only for alwaysON.

Step 1:

Run the dynamic backup script to get the latest transaction log backup from the server. Run this in all alwaysON group replica servers and make sure, you got the latest one.

Uncomment “ORDER BY b.backup_finish_date desc” to display in first line. Get the backup file and pass on it to the following restore command.

restore database dba3 from disk= '\\share\dba3_2016-08-29 09-43-58.trn' with norecovery

It will error and display the required LSN – “includes LSN 41000000017300001 can be restored”, which means, we need the one before the LSN of includes this.

 

To understand better, following is the example of restoring error, which is a very recent log backup and some old log backup restore.

An Old log backup error:

Msg 4326, Level 16, State 1, Line 1

The log in this backup set terminates at LSN 39000000023900001,

which is too early to apply to the database. A more recent log backup that includes LSN 41000000017300001 can be restored.

Msg 3013, Level 16, State 1, Line 1

RESTORE DATABASE is terminating abnormally.

The Latest log backup error:

Msg 4305, Level 16, State 1, Line 1

The log in this backup set begins at LSN 41000000017600001,

which is too recent to apply to the database. An earlier log backup that includes LSN 41000000017300001 can be restored.

Msg 3013, Level 16, State 1, Line 1

RESTORE DATABASE is terminating abnormally. 

I just run the two restore command with the latest and old log backup, one says it is “too early to apply” which means it is a before LSN and other says “too recent to apply” which means it is a after LSN.

Example:

39000000023900001 – Before/early LSN

41000000017300001Required/includes LSN

41000000017600001 – After/Recent LSN

 

It is a little hard to find which backup set holds the required LSN which is included.

If we know the time when it breaks, you can run restore command one by one from that time or before and can find out the backup file. OR run “restore headeronly from disk = ‘\\share\dba3_2016-08-28 07-22-32.bak’” to match closer to the required LSN comparing with first LSN.

Again, it is tough when it generated lots of transaction log backup. The above example the LSN are from the “restore headonly” and “restore database” both are showing the same.

Step 2:

We know the required LSN set”41000000017300001” from the restore error.

Use the dynamic backup script add the condition “and b.first_lsn<=41000000017300001” to get the LSN which is early or smaller than the needed (OR) which includes the required LSN with “order by b.first_lsn  desc” clause.

Copy the first_LSN in the first row “41000000016700001”. It is the breaking point. (Error: log backup that includes LSN “41000000017300001″ can be restored.)

 

Step 3:

Use the dynamic backup script add “and b.first_lsn>=41000000016700001” with “ORDER BY b.backup_finish_date”. It will take all the required backup files to be applied in a sequence.

 

Another Method: It is a bit easy one for alwaysON.

Use the following command it will display the exact required LSN in the “s.truncation_lsn”, which is same as “First_LSN” =41000000016700001. The “s.recovery_lsn & s.last_hardened_lsn” will show the recovery LSN which is same as “includes LSN” = 41000000017300001.

Find the required LSN of using “DMV of alwaysON” and note the truncation_lsn.

 

select database_id,db_name(database_id),name,replica_server_name,synchronization_health_desc,synchronization_state_desc,
s.truncation_lsn,s.recovery_lsn,s.last_hardened_lsn,s.last_sent_time,s.last_received_time,s.last_redone_lsn,
s.end_of_log_lsn,s.last_commit_lsn,s.last_sent_lsn,s.last_received_lsn,s.last_commit_lsn
from sys.dm_hadr_database_replica_states S join  sys.availability_groups G on (s.group_id=g.group_id)
join sys.availability_replicas   GR on (GR.group_id=g.group_id)
--where synchronization_health_desc <>'HEALTHY'
and db_name(database_id)='dba3'
and synchronization_health_desc ='NOT_HEALTHY'
-- change the DB name and health status

For more about columns read the MS DMVs link:

https://msdn.microsoft.com/en-us/library/ff877972.aspx

 

Run the dynamic backup script with the filter clause of the “s.truncation_lsn” “41000000016700001”

SELECT  'restore database dba3 from disk= ''' +f.physical_device_name+''' with norecovery',
b.backup_finish_date,b.first_lsn,b.last_lsn,b.database_backup_lsn,b.checkpoint_lsn,
b.type,b.is_copy_only,b.recovery_model,b.backup_size /1024/1024 AS size_MB,
b.server_name ,b.database_name,b.user_name, f.physical_device_name,b.*
FROM MSDB.DBO.BACKUPMEDIAFAMILY F
JOIN MSDB.DBO.BACKUPSET B
ON (f.media_set_id=b.media_set_id)
WHERE database_name like'dba3'
--and b.backup_finish_date>='2016-08-29 05:50:00.000'
AND b.first_lsn>=41000000016700001
AND B.type='L'
ORDER BY b.backup_finish_date

Copy the restore script and execute in the LSN break server, if you know the log backup only configured and run on the server. Mostly in two node alwaysON, it is configured and run more server see the following and use it.

Join the database into the alwaysON group.

ALTER DATABASE [dba3] SET HADR AVAILABILITY GROUP = [AG-Test];

 

How to fix if the log backup runs more than one secondary replica, when the break happens.

I had a case in more than one replica server, which are two in the primary data center and two are in DR data center.

We can use the above command and steps and can run on all the available replica server and compare the results and can prepare the restore command. But, it will take a little time and maybe some confusion as well. I used “Xp_cmdshell” to get the files from the backup folder and insert into a table to prepare the restore command.

To find the required transaction backup file, we need the required LSN:

1. Get the required LSN either from “DMV of alwaysON” OR from restore error 2. Get the backup file name by using “Dynamic backup script” by passing the required LSN. Filter and compare it with the “Xp_cmdshell” results.

It is a different test, so the LSN will differ from the previous one. 

1. You can use the above command “DMV of alwaysON” to find the required LSN, but it will show all the replica server, you just note the unhealthy of database server of required LSN from “s.truncation_lsn”

2. Use “Dynamic Backup script” and filter it “AND b.first_lsn>=41000000021500001”  with “ORDER BY b.backup_finish_date”. Note the “f.physical_device_name”. That is “\\share\dba3_2016-08-30 05-10-00.trn

Step 1: Enable the xp_cmdshell

 

EXEC sp_configure 'show advanced options', 1; 
RECONFIGURE; 
EXEC sp_configure 'xp_cmdshell', 1; 
RECONFIGURE;

Step 2: Create table to export the backup files

--drop table tbl_backup_filename
create table tbl_backup_filename (id int identity, Bak_filename varchar(500))
insert into tbl_backup_filename
exec xp_cmdshell 'dir \\share\SQL-Backup\QA\DBA_Test /b /O:D' -- Bare format sort by date

Step 3: Compare the backup file name with the export backup file name and note the identity ID.

select * from tbl_backup_filename
--\\share\dba3_2016-08-30 05-10-00.trn
select * from tbl_backup_filename where Bak_filename like '%dba3_2016-08-30 05-10-00%'

Step 4: Run the dynamic SQL restore command to generate the restore script with the condition of identity number, which you got it from above step.

select 'restore headeronly from disk=''\\sclfilip13\MFGProcess\SQL-Backup\QA\DBA_Test\'+Bak_filename+'''', *
from tbl_backup_filename where id >=76
-- DB
select 'restore database dba3 from disk=''\\sclfilip13\MFGProcess\SQL-Backup\QA\DBA_Test\'+Bak_filename+''' with norecovery', *
from tbl_backup_filename where id >=76

Step 5: Disable the xp_cmdshell

 

EXEC sp_configure 'xp_cmdshell', 0; 
RECONFIGURE; 
EXEC sp_configure 'show advanced options', 0; 
RECONFIGURE;

 

If you have a backup folder separate for a replica server, then you need to export all and can compare it with all replica servers.

If anyone interested in the test script, please drop me a note. It is always great to share the my findings and learning with you all.

Happy learning!

 

Muthukkumaran kaliyamoorthy

I’m currently working as a SQL server DBA in one of the top MNC. I’m passionate about SQL Server And I’m specialized in Administration and Performance tuning. I’m an active member of SQL server Central and MSDN forum. I also write articles in SQL server Central. For more Click here

More Posts - Website

AlwaysON database NOT SYNCHRONIZED and RECOVERY PENDING

AlwaysON database NOT SYNCHRONIZED and RECOVERY PENDING

 

I have recently come across an issue that, one of my alwaysON secondary replica databases are went into “NOT SYNCHRONIZED and RECOVERY PENDING” state. It is a geo cluster alwaysON with 4+2 nodes configured in both synchronous and asynchronous mode. 

The issue is when the database removed from the primary replica, with the secondary disconnection the higher database IDs on the secondary went into “NOT SYNCHRONIZED and RECOVERY PENDING” state, but the lower database IDs are good and synchronous state only.

I just checked with one of the Microsoft MSFT and he said it considers as a bug. “Since, it occurs a DDL change on the Availability Group as the primary, while a secondary replica server is down”.

I just reproduced the same. It is a two node synchronous replica server.

Version: Microsoft SQL Server 2012 (SP1) – 11.0.3000.0

I have not tested it on SQL 2014.

 

Testing:

1. Create five databases.

CREATE DATABASE dba

CREATE DATABASE dba1

CREATE DATABASE dba2

CREATE DATABASE dba3

CREATE DATABASE dba4

SELECT * FROM SYS.SYSDATABASES

2. Created a test availability group and added a five databases in synchronous mode.

3. Stop SQL Instance for Secondary Replica.

4. On the Primary Replica issue TSQL to remove a database from availability group.

 

ALTER AVAILABILITY GROUP [AG-Test] REMOVE DATABASE dba2;

 

5. Start SQL Instance for the Secondary Replica.

You can see that, databases whose IDs are higher than the one removed dba2 are going to “NOT SYNCHRONIZED / RECOVERY PENDING” mode.

DB ID – 9, 10 for dba,dba1 – Good

DB ID – 11 for dba2 – Removed DB

DB ID – 12,13 for dba3, dba4 went into SYNCHRONIZED / RECOVERY PENDING.

 

SELECT DATABASE_ID,DB_NAME(DATABASE_ID),NAME,REPLICA_SERVER_NAME,

SYNCHRONIZATION_HEALTH_DESC,SYNCHRONIZATION_STATE_DESC,S.*

FROM SYS.DM_HADR_DATABASE_REPLICA_STATES S JOIN  SYS.AVAILABILITY_GROUPS G ON (S.GROUP_ID=G.GROUP_ID)

JOIN SYS.AVAILABILITY_REPLICAS   GR ON (GR.GROUP_ID=G.GROUP_ID)

–WHERE SYNCHRONIZATION_HEALTH_DESC <>’HEALTHY’

–AND REPLICA_SERVER_NAME =”

 

The fix and work around is resume other databases that are stuck in secondary replica.

6. On the Secondary replica that has stuck databases, for each stuck database remove (will make it DB into restoring mode) and add (will make it DB into synchronized mode).

Remove a Secondary Database from an AG –> https://msdn.microsoft.com/en-us/library/hh231120.aspx

 

ALTER DATABASE [DBA3] SET HADR OFF

ALTER DATABASE [DBA4] SET HADR OFF

Join a Secondary Database to an AG –> https://msdn.microsoft.com/en-us/library/ff878535.aspx

 

ALTER DATABASE [DBA3] SET HADR AVAILABILITY GROUP = [AG-TEST]

ALTER DATABASE [DBA4] SET HADR AVAILABILITY GROUP = [AG-TEST]

Repeat for each stuck database.

The database that had been removed <dba2> could be in any number of states, but after it finishes its recovery process, it should return to a RESTORING state.

7. On primary replica, Add the removed database –> primary replica –> right click AG databases–>select database –> join only –> connect the secondary –> finish.

In case, if you get any following error, check the backup table and restore a log backup with norecovery mode in all the secondary replica. Which are all missed or run during that time.

Msg 1478, Level 16, State 211, Line 1

The mirror database, ‘DBA2’, has insufficient transaction log data to preserve the log backup chain of the principal database.  This may happen if a log backup from the principal database has not been taken or has not been restored on the mirror database.

 

Added: Thanks Amit Banerjee for your reply. I just reported to the connect.

https://connect.microsoft.com/SQLServer/feedback/details/3022019

 

 

Muthukkumaran kaliyamoorthy

I’m currently working as a SQL server DBA in one of the top MNC. I’m passionate about SQL Server And I’m specialized in Administration and Performance tuning. I’m an active member of SQL server Central and MSDN forum. I also write articles in SQL server Central. For more Click here

More Posts - Website

AlwaysON move database without breaking HADR

Move database without breaking alwaysON

 

This post is going to show the database movement from one drive to another drive, without breaking the database from alwaysON configuration. An application has created the many databases to both primary and secondary replica servers to the default location of C drive.

There are methods like detach/attach, backup/restore & alter database. For alwaysON HADR servers, the best method is Alter database. Since it is in the mirror/sync mode. Detach/attach will not work and backup/ restore, we need to break databases from the HADR.

Note:

It’s a two node synchronous replica, if you have more replica, you should plan each well.

If you have any standalone database (Without adding DBs into HADR), you need to plan for the downtime. Since, the secondary replica servers need to take down.

In primary, for high transaction system, make sure you have good space for transaction files.

Steps:

Before going to start the database movement, write the script for each step and you can write a dynamic SQL for larger number databases. That’s what I did, since I had many databases. It will minimize the time.

 

ALTER DATABASE <DB name> SET HADR suspend

GUI: Expand the AG group and right click the DB –>suspend data movement

  • Change the Readable secondary to ‘NO’ for all the secondary replicas, otherwise you will get an error.
  • To use ALTER DATABASE, the database must be in a writable state in which a checkpoint can be executed.

Right click the primary replica alwaysON group –> properties –> Readable secondary  –> No

 

  • Note down all the files and file location from the system tables.

 

select db_name(a.database_id),a.name,a.physical_name,size/128.0 AS CurrentSizeMB,*

from sys.master_files a join sys.databases b

on a.database_id =b.database_id

–where a.physical_name like ‘c%’

–and a.database_id >4

–and type_desc <>’rows’

order by a.name

 

  • On the mirror server run the “ALTER DATABASE <DB name> MODIFY FILE” command. You need to run this for each database.

 

use master

ALTER DATABASE <DB_name> MODIFY FILE (NAME =<logical name>

,FILENAME =’F:\SQL_DATA\DB_name.mdf’)

 

  • On the secondary replica server stop the SQL Server instance.

 

  • Move the database file (MDF & LDF) files to the changed location (Cut & Paste).
  • Start the SQL Server instance and check the file locations using the above query.

 

  • In primary replica server resume the database by using the following ALTER DATABASE statement:

 

ALTER DATABASE <DB name> SET HADR Resume

GUI: Expand the AG group and right click the DB –>resume data movement

 

  • Change the Readable secondary to ‘Yes’

Fail over and repeat the steps for the partner server.

Additionally, if you add any files in the primary and the folder name is incorrect in the secondary, the database will go into suspend mode.

Just check the error log, you can get more info on, why the database is suspended mode.

Error: 5123, Severity: 16, State: 1.

CREATE FILE encountered operating system error 3(The system cannot find the path specified.) while attempting to open or create the physical file ‘L:\SQL_log\DB_name.ndf’.

 

Created the folder in secondary replica and resume the database, the resume database command will create a data file.

ALTER DATABASE <db name> SET HADR RESUME;

 

Added:

After a reply from Richard L. Dawson – I thought to add this as well. Thanks for that. This will help others as well.

You cannot take the database offline in both primary and secondary replica, unless we remove from alwaysON. Removing the database in the primary will make your secondary database into the recovery mode it is a default configuration.

Same as for secondary, you cannot remove it, it will make the databases into recovery mode and moreover the database will not remove from AG group.

Msg 1468, Level 16, State 1, Line 1

The operation cannot be performed on database “DB” because it is involved in a database mirroring session or an availability group. Some operations are not allowed on a database that is participating in a database mirroring session or in an availability group.

Msg 5069, Level 16, State 1, Line 1

ALTER DATABASE statement failed.

So moving physical database files needs a down/offline, we cannot move database and it requires the database need to be standalone not in the AG. We can run the alter database command and maybe can try this “DBCC SHRINKFILE (A, EMPTYFILE)”. We know, how tough this is. http://www.bobpusateri.com/archive/2013/03/moving-a-database-to-new-storage-with-no-downtime/

 

We can remove the database from AG and can do this like, the process we follow for the normal database. If you have only one database and can afford downtime. Yes, we can do this. (OR) we can completely start from the scratch. There are many methods.

(1.Remove database from AG 2. Alter database and modify the file 3. Set offline 4. Cut & paste the physical file to the new location 5. Set back to the database online 6. Add the database into AG)

Note: If any log backup run during this process the LSN will not match with all the secondary replica, you need to restore with norecoery and need to add the database.

http://www.sqlserverblogforum.com/2016/08/how-to-solve-the-lsn-mismatch-in-sql-server/

 

 

Muthukkumaran kaliyamoorthy

I’m currently working as a SQL server DBA in one of the top MNC. I’m passionate about SQL Server And I’m specialized in Administration and Performance tuning. I’m an active member of SQL server Central and MSDN forum. I also write articles in SQL server Central. For more Click here

More Posts - Website

Performance Tuning Series – Main Part

 

Performance tuning is a tricky part. We need to analysis all points of view. Sometimes it is an easy fix and can be solved in a minute and sometimes it take a day or more to find and fix.

I received many emails for performance tuning help. It has a variety of questions, that how can we tune a query and how to find out what is going on in the server, since my application team is does not give a slow running code. The query only slowdown in some time period. My server is banging all time above 95% of resources.

Nowadays, every business has a tool that tiger an alert that resource (CPU, RAM) goes beyond the threshold, that’s mostly from server/Infra team and it’s easy to pull a month or a year report from the server and can ask DBAs, why would the server running always with high resources for the past few days/ months.

It’s always a changeling job for DBAs to tune their system, we even keep all maintenance jobs like re-indexing and statistics update OR adding more RAM, CPU will not help, there are many things can come up. All we need to know, how to gather the information from server, database and code point of view and analysis them one by one.

Most of the time poorly written and designed query and schema will be the issue. Why I am saying so, you will come to know after reading this post.

(I mean the poorly build code is, it is not written in a set based, bad schema design, no indexes used in the where, join clause, pulling all the data even if it has a some column for filtering, high fragmentation, out of stats & parameter sniffing etc.)

We need to use techniques and need to capture the exact issue.

 

To cover this post, will make me to write some other post as well. Please refer the following post as well to get some more idea on the tuning.

Capturing the SQL server queries by using Whoisactive

Capturing the SQL server queries by using server side trace  (Upcoming post: The link will be updated)

Tune query by looking execution plan and effective of the index creation (Upcoming post: The link will be updated)

 

The technique I/we DBAs mostly use:  F&F – Find and Fix.

  1. Whoisactive powerful script – This will capture everything, what are all running on the server/ DB side, if we did not have much information from the application team use this.
  2. SQL server execution plan – Once, you have found the code, which is taking more time, go and run in the SSMS with actual execution plan and see what is going on behind the screen. Even, the Whoisactive has plan info, if you pass the @get_plans = 1 parameter.

I am classifying in three parts to get a clear picture on this.

  1. Server level analysis and tuning
  2. Database level analysis and tuning
  3. Query level analysis and tuning

Before Whoisactive and sp_blitz, I tend to check perfmon and the size of the database, server and database level parameter changes, more DMVs etc. The two scripts give more control to DBAs.

There is no rule that, we should check first something and second something so on. Like (Execution plan, index used or not, fragmentation, count (*) table, stats ect..)

It’s always depends on environment. In some cases, the query might run for more than 10 hours or SP has 100 lines of code or the query might behave differently in runtime etc. We might take a help of an estimated plan to know the query without actual run OR take a look at query line by line.

 

Server and Database level analysis and tuning

More or less the server and database level checks are same. In the deeper it will take to query level only. That is what we keep saying the best practice of t-SQL writing will help the system at the top.

In common, we mostly get unknown parameter, I mean we need to find why the resource usages are peak, without having any information.

 

  • Check the SQL server error log

EXEC Sp_readerrorlog 0

The error log is important to check, which logs all the information.

This is a fantastic script from Paul Randal, which shows the overall wait type percentage and additionally the explanation of each wait URL.

 

WITH [Waits] AS
(SELECT
[wait_type],
[wait_time_ms] / 1000.0 AS [WaitS],
([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS],
[signal_wait_time_ms] / 1000.0 AS [SignalS],
[waiting_tasks_count] AS [WaitCount],
100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage],
ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum]
FROM sys.dm_os_wait_stats
WHERE [wait_type] NOT IN (
N'BROKER_EVENTHANDLER', N'BROKER_RECEIVE_WAITFOR',
N'BROKER_TASK_STOP', N'BROKER_TO_FLUSH',
N'BROKER_TRANSMITTER', N'CHECKPOINT_QUEUE',
N'CHKPT', N'CLR_AUTO_EVENT',
N'CLR_MANUAL_EVENT', N'CLR_SEMAPHORE',
-- Maybe uncomment these four if you have mirroring issues
N'DBMIRROR_DBM_EVENT', N'DBMIRROR_EVENTS_QUEUE',
N'DBMIRROR_WORKER_QUEUE', N'DBMIRRORING_CMD',
N'DIRTY_PAGE_POLL', N'DISPATCHER_QUEUE_SEMAPHORE',
N'EXECSYNC', N'FSAGENT',
N'FT_IFTS_SCHEDULER_IDLE_WAIT', N'FT_IFTSHC_MUTEX',
-- Maybe uncomment these six if you have AG issues
N'HADR_CLUSAPI_CALL', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION',
N'HADR_LOGCAPTURE_WAIT', N'HADR_NOTIFICATION_DEQUEUE',
N'HADR_TIMER_TASK', N'HADR_WORK_QUEUE',
N'KSOURCE_WAKEUP', N'LAZYWRITER_SLEEP',
N'LOGMGR_QUEUE', N'MEMORY_ALLOCATION_EXT',
N'ONDEMAND_TASK_QUEUE',
N'PREEMPTIVE_XE_GETTARGETSTATE',
N'PWAIT_ALL_COMPONENTS_INITIALIZED',
N'PWAIT_DIRECTLOGCONSUMER_GETNEXT',
N'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP', N'QDS_ASYNC_QUEUE',
N'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP',
N'QDS_SHUTDOWN_QUEUE', N'REDO_THREAD_PENDING_WORK',
N'REQUEST_FOR_DEADLOCK_SEARCH', N'RESOURCE_QUEUE',
N'SERVER_IDLE_CHECK', N'SLEEP_BPOOL_FLUSH',
N'SLEEP_DBSTARTUP', N'SLEEP_DCOMSTARTUP',
N'SLEEP_MASTERDBREADY', N'SLEEP_MASTERMDREADY',
N'SLEEP_MASTERUPGRADED', N'SLEEP_MSDBSTARTUP',
N'SLEEP_SYSTEMTASK', N'SLEEP_TASK',
N'SLEEP_TEMPDBSTARTUP', N'SNI_HTTP_ACCEPT',
N'SP_SERVER_DIAGNOSTICS_SLEEP', N'SQLTRACE_BUFFER_FLUSH',
N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
N'SQLTRACE_WAIT_ENTRIES', N'WAIT_FOR_RESULTS',
N'WAITFOR', N'WAITFOR_TASKSHUTDOWN',
N'WAIT_XTP_RECOVERY',
N'WAIT_XTP_HOST_WAIT', N'WAIT_XTP_OFFLINE_CKPT_NEW_LOG',
N'WAIT_XTP_CKPT_CLOSE', N'XE_DISPATCHER_JOIN',
N'XE_DISPATCHER_WAIT', N'XE_TIMER_EVENT')
AND [waiting_tasks_count] > 0
)
SELECT
MAX ([W1].[wait_type]) AS [WaitType],
CAST (MAX ([W1].[WaitS]) AS DECIMAL (16,2)) AS [Wait_S],
CAST (MAX ([W1].[ResourceS]) AS DECIMAL (16,2)) AS [Resource_S],
CAST (MAX ([W1].[SignalS]) AS DECIMAL (16,2)) AS [Signal_S],
MAX ([W1].[WaitCount]) AS [WaitCount],
CAST (MAX ([W1].[Percentage]) AS DECIMAL (5,2)) AS [Percentage],
CAST ((MAX ([W1].[WaitS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgWait_S],
CAST ((MAX ([W1].[ResourceS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgRes_S],
CAST ((MAX ([W1].[SignalS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgSig_S],
CAST ('https://www.sqlskills.com/help/waits/' + MAX ([W1].[wait_type]) as XML) AS [Help/Info URL]
FROM [Waits] AS [W1]
INNER JOIN [Waits] AS [W2]
ON [W2].[RowNum] <= [W1].[RowNum]
GROUP BY [W1].[RowNum]
HAVING SUM ([W2].[Percentage]) - MAX( [W1].[Percentage] ) < 95; -- percentage threshold
GO

 


See the results: SOS_SCHEDULER_YIELD – 34%, CXPACKET - 13% waits, Do you think which means we need to add more CPU, Since the wait type indicates there is a high CPU, No, it does not mean that we need to go and add more CPU, It means further analysis needed to find the query which is waiting for those waits and why it is waiting and how can we make that do not wait. Yes, it can be tuned and the wait type will automatically reduce.

  • Check the currently running sessions
Use Exec Whoisactive OR DMVs 
select db_name (r.database_id),r.command,r.wait_resource,s.status,s.session_id,
r.blocking_session_id,r.last_wait_type,r.open_transaction_count,
* from sys.dm_exec_sessions s join sys.dm_exec_requests r
on s.session_id =r.session_id
where s.session_id >=51 and s.status <>'sleeping'and s.session_id <>@@SPID
--and r.blocking_session_id<>0

 

  • Check the server and database level parameter default and changes

Use, Brent Ozar’s script https://www.brentozar.com/blitz/

It is an awesome script, which will help us to quickly analysis the server and database level non default parameter checks with suggestion etc. They have improved a lot. Please check the link and provide the email address and other details. They will send you the ZIP file location to your email address.

 

  • Check the database level running query by DMVs OR Whoisactive

The best way is to use Whoisactive or there are many DMVs, just do some googling.

  • Use the server side traces to capture the information of particular events

Upcoming post: I am going to write, how to collect the queries by using server side trace.

 

Query level tuning and code re-write

Query level tuning is more important, most of the time poorly built queries eats up all the resources, even if you have loads of CPU and RAM and no fragmentation and with updated stats.

 

There are some cases that, we get in real time we need to speed up the code without an actual run.

The following step is not exactly same that, you should follow in all the time. I just used this to show more information for this particular post.

I captured this query by using “Whoisactive”. And there are many like this. And all cannot be shown here. See the wait type and it is almost matching with server over all wait.

I just took the third query, see how it performs and can be tuned and rewrite better.

It is a simple union all with join two tables. It does not have a clustered index, since there is no candidate/ high selectivity column for a clustered index.

 

SELECT DISTINCT j.* FROM
(SELECT temp81.*,[54-80_table].[Wafer Sort Test Program (High Temp)], [54-80_table].[Wafer Sort Test Program (High Temp)_REV],
[54-80_table].[Wafer Sort Test Program (Low Temp)], [54-80_table].[Wafer Sort Test Program (Low Temp) Rev]
FROM temp81 LEFT OUTER JOIN [54-80_table]
ON temp81.[54-80] = [54-80_table].Component
WHERE [54-80_table].[BOM_STATUS]  IN (‘ACT’,'C-ACT’,'CPROD’,'PROD’,'PPROD’,'QUAL’,'RISK’)

UNION ALL

SELECT temp62.*,Null,Null, [54-80_table].[Wafer Sort Test Program (High Temp)], [54-80_table].[Wafer Sort Test Program (High Temp)_REV],
[54-80_table].[Wafer Sort Test Program (Low Temp)], [54-80_table].[Wafer Sort Test Program (Low Temp) Rev]
FROM temp62 LEFT OUTER JOIN [54-80_table] ON temp62.[54-80] = [54-80_table].Component
WHERE [54-80_table].[BOM_STATUS]  IN (‘ACT’,'C-CT’,'CPROD’,'PROD’,'PPROD’,'QUAL’,'RISK’)) j

 

The general query process. The optimization phase is an important one, since we can do something to make the SQL optimizer to right choice.  (By updating stats, recompilation, filtered statistics & query hint etc). I will cover that in advanced tuning.

Query parsed –> Normalized as a query tree –> Complied/optimized –> Executed.

The first important two factors need to be looked in plan 1. Cost 2. Actual vs execution number of rows. I will show this in an advanced tuning article in detail about the rows difference, data skew etc and how to solve that.

1. Check the execution plan, how the query runs behind the screen to get some idea, where the cost is high and scan of the objects, join types etc.

Execution plan shows high cost for parallelism and sort. If we reduce that, it could be run better.

The normal query run time: 46:57 minutes.

2. Check the tables involved in the code and get the count (*) – number of records of the table. This will help us to target the big table regarding the fragmentation and statistics up to date.

Total number of records in the table.

select count(*) from temp81 with (nolock) –4984266

select count(*) from temp62 with (nolock) — 6250470

select count(*) from [54-80_table] with (nolock) – 217

 

It is a heap and no non-clustered index as well – Statistics is nil maybe, some auto stats for column and fragmentation is low.

SELECT OBJECT_NAME(F.OBJECT_ID) obj,i.name ind,
f.avg_fragmentation_in_percent,
f.page_count,table_schema,
avg_page_space_used_in_percent, forwarded_record_count
FROM SYS.DM_DB_INDEX_PHYSICAL_STATS (DB_ID('Ops_matrix'),NULL,NULL,NULL,null) F
JOIN SYS.INDEXES I
ON(F.OBJECT_ID=I.OBJECT_ID)AND i.index_id=f.index_id
JOIN INFORMATION_SCHEMA.TABLES S with (nolock)
ON (s.table_name=OBJECT_NAME(F.OBJECT_ID))
where f.database_id=DB_ID('Ops_matrix')
and OBJECT_NAME(F.OBJECT_ID) in ('temp81','temp62','54-80_table')
/*WHERE f.avg_fragmentation_in_percent>15 and f.page_count>=1000
AND f.database_id=DB_ID('Ops_matrix')
AND OBJECTPROPERTY(I.OBJECT_ID,'ISSYSTEMTABLE')=0
and alloc_unit_type_desc = 'in_row_data'
order by f.avg_fragmentation_in_percent desc
*/

/*
ALTER TABLE temp81 REBUILD;

UPDATE STATISTICS dbo.temp81
*/

 

 

3. Check the table structure of the data type, clustered and non-clustered indexes. Select some top 10 rows and check the values stored in the column, match the data type as well, there are many cases I have seen the data type and table designed badly. Sometimes, the LOB datatype or the larger size has chosen, but the content stored in the column is very small.

Example: create table T10 (n ntext, n1 char( 700))

Note: SQL Server retains the 900-byte limit for the maximum total size of all index key columns.

 

sp_help ‘tablename’

Almost all of the column of the table is nvarchar(510) and it’s a Unicode characters for business needs.

 

4. Re-index the index and heap table.

There is no index to rebuild. I rebuilt the table/heap and updated the stats for table.

ALTER TABLE temp81 REBUILD;
UPDATE STATISTICS dbo.temp81

 

 

5. Create the index for the where clause, join columns and see the result.

Created index, but SQL optimizer is not using it. Since, the optimizer is smart and it thinks to retrieve all the data (select *), the table scan is best with some sort for “union all” and “distinct”. Thus, it always chooses a best plan based on cost, what it has from query tree and the statistics. This can happen many cases, the optimizer will skip the indexes.

create index ix_temp81_54_80 on temp81 ([54-80])
create index ix_temp62_54_80 on temp62 ([54-80])
create index ix_54_80_table_Component on [54-80_table] (Component)
create index ix_54_80_table_BOM_STATUS on [54-80_table] (BOM_STATUS)

 

6. Try using some hint like, MAXDOP, Index hint etc. I am not saying the parallel plan plan is bad, but there are cases that serial plan works best. And we have seen that, the query is waiting for CXPACKET.

 

Forcing the Index to use. Still, there are some lookups we can create some covered index, but here the case is different 20+ columns with nvarchar(510).

See the query run time is 16:11 minutes.

See the MAXDOP run time is 20:24 minutes.

 

7. Filter the data by some condition, generally by date column.

It does not have date column or any other column to limit the data return.

 

8. Check anything can be changed in the query like, select required column, remove some sort operation and add some filer option.

Removed the distinct – Parallel query run time is 13:14 and by using (MAXDOP 1) – 10:00 minutes.

Removed select * - select only a required columns. The run time is better 02:02 minutes.

9. For complex query, take out it as a batch and run it individually. Like, run the two queries separately, run with “union all” and run the full query with “distinct” and see the difference.

Query tuning steps with the execution time for better visibility. It can be reduced even lower than 2 minutes.

Steps Performed Execution Time in Minutes
Query execution time 47
Update Stats and Heap Rebuild 26
Force MAXDOP option 20
Force Index option 16
Without Distinct keyword 10
With Minimal column 2

If the tables temp62 & temp81 has the clustered index, then that is totally different. I am sure that it will run quickly. See my post: http://www.sqlserverblogforum.com/2016/03/how-important-is-clustered-index-for-a-table/.

As a rule of thumb, every table should have a clustered index and yes, it will speed up all operations are on the table. If you know the table is going to be growing bigger and it does not have a candidate for clustered index, I suggest to create a dumpy identity column with a clustered index. The important factor is how effectively we choose the column for the clustered index. Generally a Surrogate key – Ex: IDENTITY

 

I do not think that, the user is going to read “6562117” records. And it returns 50 columns as an output. This is what we always say that , do not run some dump code in the production. If you have no condition to limit, at least get some top 1000 records. Think about the query runs 45 minutes and per day it is called 10 times.

If we analysis all in query and database level. Make a benchmark of server resources (CPU, RAM) usage and storage disk and we can check with the infra team to analysis from their side. But, in the real world, that is not the case most of the time. If you really have a good team and they are ready to add resources and make some changes in storage layers, then you can test that. Since, it is very easy to change the storage layers like, putting into SSDs etc. There is nothing can see from end user point of view. Yes, I have a good team and had tested that.

Again, we need to investigate all the points of view and test everything and make a benchmark of it. The database engine is smart and it will pick and work effectively, if we have good schema and query design.

I cannot add all the images, since, already it is a long post. I hope the tuning steps will help. And there are many great links, I have added to the post. Happy learning!

Muthukkumaran kaliyamoorthy

I’m currently working as a SQL server DBA in one of the top MNC. I’m passionate about SQL Server And I’m specialized in Administration and Performance tuning. I’m an active member of SQL server Central and MSDN forum. I also write articles in SQL server Central. For more Click here

More Posts - Website

page 1 of 8»

Welcome , today is Friday, September 30, 2016