To mitigate a failure of the primary site, the four custom_application applications are failed over to the secondary (DR) site.

A Windows 2008 tie-breaker server, located at a third site, checks the primary cluster over SSH every 5 seconds.

If a check fails, it retries after 10 seconds. If the retry also fails, it asks the DR custom_application server to confirm that the primary site has failed.

If both the tie-breaker and the DR server agree that the primary site is down, the tie-breaker server executes the failover scripts located on the DR servers, which perform the failover.
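
The agreement rule described above can be sketched as a small decision function. This is an illustration only: the production logic lives in ssh-fail.ps1, and the function name decide_failover and its three return strings are hypothetical. The status string Primary_App_Is_Up is the one used by the scripts in this document.

```shell
#!/bin/bash
# Hypothetical sketch of the tie-breaker decision rule.
#   $1 = result of the tie-breaker's own check of the primary
#   $2 = result of the DR server's check of the primary
decide_failover() {
    local tiebreaker_view="$1"
    local dr_view="$2"

    if [ "$tiebreaker_view" = "Primary_App_Is_Up" ]; then
        # The primary looks healthy from the third site
        echo "no_action"
    elif [ "$dr_view" = "Primary_App_Is_Up" ]; then
        # Only the tie-breaker cannot see the primary: assume a link failure
        echo "link_failure"
    else
        # Both observers fail to see the primary: fail over
        echo "failover"
    fi
}
```

Requiring the second opinion from the DR site is what keeps a broken link between the third site and the primary from triggering an unnecessary failover.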

Operating Systems

  • Application servers are RHEL 5.9
  • Steward (tie-breaker) server is Windows 2008

The tie-breaker server requires...

  • PowerShell
  • .NET
  • a user account
  • a scheduled task for each monitored cluster
  • the PowerShell monitoring script
  • an SSH wrapper module (PowerShell to .NET) using the SSH.NET library
  • a log directory
  • an SSH key pair whose public key is installed on the primary and the DR servers (to enable passwordless login)
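
Passwordless login relies on the tie-breaker's public key being present in the authorized_keys file of the login account on each Linux server. A minimal sketch of installing a key idempotently is shown below. The function name install_pubkey and its arguments are hypothetical and not part of the original scripts; the permissions are the ones OpenSSH normally expects.

```shell
#!/bin/bash
# Hypothetical helper: add a public key to an account's authorized_keys.
#   $1 = path to the public key file (for example a copy of id_rsa.pub)
#   $2 = home directory of the login account
install_pubkey() {
    local pubkey_file="$1"
    local home_dir="$2"
    local auth_file="$home_dir/.ssh/authorized_keys"
    local key

    key=$(cat "$pubkey_file") || return 1

    mkdir -p "$home_dir/.ssh"
    chmod 700 "$home_dir/.ssh"
    touch "$auth_file"
    # Append the key only if that exact line is not already present
    grep -qxF -e "$key" "$auth_file" || echo "$key" >> "$auth_file"
    chmod 600 "$auth_file"
}
```

It would be run on each primary and DR server as the login account, for example: install_pubkey /tmp/id_rsa.pub /home/failover-user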

Files

Windows Steward server

id_rsa              ## ssh private key to allow login to the linux servers
id_rsa.pub          ## ssh public key to allow login to the linux servers
Renci.SshNet.dll    ## SSH.NET library from http://sshnet.codeplex.com/
SSH-Sessions.psd1   ## module manifest for the powershell SSH.NET wrapper from http://www.powershelladmin.com/wiki/SSH_from_PowerShell_using_the_SSH.NET_library
SSH-Sessions.psm1   ## powershell wrapper module for SSH.NET
ssh-fail.ps1        ## powershell script to log in and check custom_application servers from Seamus Murray
schedule.xml        ## exported windows scheduled task to run the script

Application Primary Server

PrimaryTest.sh      ## bash script executed locally on the Application Primary servers from Seamus Murray

Application DR Server

ConfirmPriDown.sh   ## remotely connects to Application Primary Server and runs PrimaryTest.sh from Seamus Murray
MountStart.sh       ## Starts the Application application
Break-n-Mount.sh    ## only on the index servers: remotely connects to the NetApp controller, breaks the mirror, maps the LUN and mounts it

NetApp controller

id_rsa.pub          ## ssh public key of SnapMirror-user on the 2 custom_application index DR servers

Diagrams


Script contents.................

ssh-fail.ps1

#ssh-fail.ps1        
#powershell script to login and check custom_application servers from Seamus Murray

Param(
  [string]$cluster  
)
Import-Module  SSH-Sessions


#Start-Transcript -path C:\ssh-fail2\logs\transcript.txt -append


##set debug level

#$debug=1
#if ( $debug -eq 1 ) {
#               # Debug log file
#               $deboutfile=$scriptdir+"\outputlog."+$siteid+".log"
#               start-transcript -path $deboutfile -force
#               # Shows a trace of each line being run with variables as variables
#               set-psdebug -trace 1
#
#}

#Exit early unless a known cluster name was passed as the argument
$validClusters = 'Indexer', 'DataBase', 'FrontEndA', 'FrontEndB'
if ( $validClusters -notcontains $cluster )
{
    Write-Host "You must specify the cluster to test using an argument: $($validClusters -join ', ')"
    exit 1
}

##           IP                    Function             Hostname          Role    
Switch ($cluster)
{
FrontEndA {
$server_P1='10.10.10.51' #a    Application FrontEnd     SERVERFW1TS    QLD-APP-5
$server_P2='10.10.10.52' #a    Application FrontEnd     SERVERFW2TS    QLD-APP-6
$server_VIP='10.10.10.53'#a    Application FrontEnd     VIPA
$server_DR='10.10.11.32' #a    Application FrontEnd     SERVERFW3TS    SYD-APP-3
}
FrontEndB{
$server_P1='10.10.10.55' #b    Application FrontEnd     SERVERFW3TS    QLD-APP-7
$server_P2='10.10.10.56' #b    Application FrontEnd     SERVERFW4TS    QLD-APP-8
$server_VIP='10.10.10.57'#b    Application FrontEnd     VIPB 
$server_DR='10.10.11.33' #b    Application FrontEnd     SERVERFW2TS    SYD-APP-4
}
Indexer{
$server_P1='10.10.10.77' #c    Application Indexer      SERVERIX1TS    QLD-APP-1
$server_P2='10.10.10.78' #c    Application Indexer      SERVERIX2TS    QLD-APP-2
$server_VIP='10.10.10.76'#c    Application Indexer      VIPI
$server_DR='10.10.11.17' #c    Application Indexer      SERVERIX3TS    SYD-APP-1
}
DataBase{
$server_P1='10.10.10.85' #d    Application DataBase     SERVERHD1TS    QLD-APP-3
$server_P2='10.10.10.86' #d    Application DataBase     SERVERHD2TS    QLD-APP-4
$server_VIP='10.10.10.84'#d    Application DataBase     VIPD
$server_DR='10.10.11.18' #d    Application DataBase     SERVERHD3TS    SYD-APP-2
}
}

##set log file to local directory  e.g. 2012-01-01_0000_10.10.10.51_custom_applicationfail.log
$Logfile = "c:\ssh-fail2\logs\$(get-date -uformat %Y-%m-%d_%H%M)"+"_$server_P1"+"_custom_applicationfail.log"

#Write_Host $Logfile
#LogWrite  $env:PSModulePath
Function LogWrite
{
   Param ([string]$logstring)

   Add-content $Logfile -value $logstring
}

$stamped  = "$(Get-Date)" + " starting script "
LogWrite $stamped





#Remote scripts executed on the linux servers but called from this script
#You must specify the argument "Up" (case sensitive) for this test to succeed
#If you want to simulate this test failing, just change the argument
$ApplicationPrimaryTest='/home/failover-user/custom_applicationfailover0.1/primary-test.sh Up'

#Specify which server the DR should test by passing a single argument: P1, P2 or VIP
#The various IPs are stored both locally in this file and in ApplicationConfirmPriDown.sh on the respective DR servers
#If you want to simulate this test failing, just change the argument to something else
$ApplicationConfirmPriDown='/home/failover-user/custom_applicationfailover0.1/primary-confirm-fail.sh VIP'

#This command needs to execute the start-up script via sudo. This requires either a tty, which "SSH-Sessions" doesn't provide,
#or editing /etc/sudoers to disable requiretty
#This script varies slightly between the Application FrontEnds and the Application Indexers
#On the Application Indexers the NetApp mirrored LUNs need to be broken and mounted; this is handled by the
#Break-n-Mount.sh script, called from within the ApplicationMountStart.sh executed on the DR servers
$ApplicationMountStart='/home/failover-user/custom_applicationfailover0.1/initiate-dr.sh Start'

#Called from within $ApplicationMountStart on the Indexer DR servers
#$Break-n-Mount='/home/failover-user/custom_applicationfailover0.1/break-snap-mirror.sh RESYNC'





while ($true)
{
    New-SshSession -ComputerName $server_P1 -Username 'failover-user'  -KeyFile 'C:\ssh-fail2\id_rsa' # | out-null
    New-SshSession -ComputerName $server_DR -Username 'failover-user'  -KeyFile 'C:\ssh-fail2\id_rsa' # | out-null


    try
    {
        #Write_Host "Testing ApplicationPrimaryTest on $server_P1 1st loop"
        $stamped = "$(Get-Date)" + " Testing ApplicationPrimaryTest on $server_P1 1st loop"
        LogWrite  $stamped
        $CmdOutput1 = Invoke-SshCommand -ComputerName $server_P1 -Command $ApplicationPrimaryTest -Quiet
    }
    catch [Exception]
    {
        $CmdOutput1 = "SSH_SESSION_FAILED"
        #Write_Host "ERROR: $CmdOutput1 during 1st loop" -foregroundcolor white -backgroundcolor red
        $stamped  = "$(Get-Date)" + " ERROR: $CmdOutput1 during 1st loop"
        LogWrite $stamped
    }


    #Check Primary Server for status
    if ( $CmdOutput1 -ne 'Primary_App_Is_Up' ) { 
        #Write_Host "ERROR: Primary Failure Detected"
        #Write_Host "Waiting 10 seconds before retrying"
        $stamped =  "$(Get-Date)" + " ERROR: Check Primary did not return Primary_App_Is_Up   Waiting 10 seconds before retrying"        
        LogWrite $stamped
        sleep 10


        try
        {
            #Write_Host "Testing ApplicationPrimaryTest on $server_P1 2nd loop"
            $CmdOutput1 = Invoke-SshCommand -ComputerName $server_P1 -Command $ApplicationPrimaryTest -Quiet
        }
        catch [Exception]
        {
            $CmdOutput1 = "SSH_SESSION_FAILED"
            #Write_Host "$CmdOutput1 during 2nd loop" -foregroundcolor white -backgroundcolor red
            $stamped =  "$(Get-Date)" + " $CmdOutput1 during 2nd loop"
            LogWrite $stamped
        }

        #Check Primary Server for status after a previous failure
        if ( $CmdOutput1 -ne 'Primary_App_Is_Up' ) {
            #Write_Host "ERROR: Primary Failure Detected 2 times"
            $stamped =  "$(Get-Date)" + " ERROR: Check Primary did not return Primary_App_Is_Up after 2 tries"
            LogWrite $stamped

            #Ask the DR server to run its own check of the primary
            try
            {
                $CmdOutput2 = Invoke-SshCommand -ComputerName $server_DR -Command $ApplicationConfirmPriDown -Quiet
            }
            catch [Exception]
            {
                #Reset the result so a value from an earlier loop iteration is not reused
                $CmdOutput2 = "SSH_SESSION_FAILED"
                #Write_Host "ERROR:ssh session to $server_DR has failed. Unable to execute ApplicationConfirmPriDown"
                $stamped =  "$(Get-Date)" + " ERROR:ssh session to $server_DR has failed. Unable to execute ApplicationConfirmPriDown"
                LogWrite $stamped
            }

            #This host has failed 2 times to see Primary_App_Is_Up; if the DR server does not report it either, initiate DR
            if ( $CmdOutput2 -ne 'Primary_App_Is_Up' ) {
                #Write_Host "ERROR: DR Server $server_DR is also reporting Primary Failure....... Need to Initiate DR"
                $stamped =  "$(Get-Date)" + " ERROR: DR Server is also reporting Primary Failure....... Need to Initiate DR"
                LogWrite $stamped


                try
                {
                    $CmdOutput3 = Invoke-SshCommand -ComputerName $server_DR -Command $ApplicationMountStart  -Quiet
                }
                catch [Exception]
                {
                    $CmdOutput3 = "SSH_SESSION_FAILED"
                    #Write_Host "ERROR:ssh session to $server_DR has failed. Unable to execute ApplicationMountStart"
                    $stamped =  "$(Get-Date)" + " ERROR:ssh session to $server_DR has failed. Unable to execute ApplicationMountStart"
                    LogWrite $stamped
                }


                if ( $CmdOutput3 -ne 'App_Started' ) {     
                #Write_Host "ERROR: DR server $server_DR Failed to start the App"
                $stamped =  "$(Get-Date)" + " ERROR: DR server $server_DR Failed to start the App"
                LogWrite $stamped


                }
                else {
                   #Write_Host "DR server $server_DR has started the App"
                   $stamped =  "$(Get-Date)" + " DR server $server_DR has started the App"
                   LogWrite $stamped
                   $stamped =  "$(Get-Date)" + " Nothing else to do...Failover script self terminating"
                   LogWrite $stamped
                   break
                }

        }
        else {
            #Write_Host "DR server $server_DR is reporting Primary $server_P1 is OK: Nothing To Do"
            #Write_Host "Assuming the link between me and the Primary server has failed"
            $stamped =  "$(Get-Date)" + " Assuming the link between me and the Primary server has failed"
            LogWrite $stamped
        } 
    }
              else {
                #Write_Host "Primary server $server_P1 is OK: Nothing To Do             2nd test"   
                $stamped =  "$(Get-Date)" + " Primary server $server_P1 is OK: Nothing To Do             2nd test" 
                LogWrite $stamped
                   } 
    }
    else {
        #Write_Host "Primary server $server_P1 is OK: Nothing To Do             1st test"   
        $stamped =  "$(Get-Date)" + " Primary server $server_P1 is OK: Nothing To Do             1st test" 
        LogWrite $stamped
    }

    sleep 5

}

Remove-SshSession -RemoveAll

#Stop-Transcript

schedule.xml

#schedule.xml
#exported windows scheduled task to run the script
<?xml version="1.0" encoding="UTF-16"?>
<Task version="1.2" xmlns="http://schemas.microsoft.com/windows/2004/02/mit/task">
  <RegistrationInfo>
    <Date>2012-01-01T12:30:10</Date>
    <Author>ABCDEFG123\seamus</Author>
  </RegistrationInfo>
  <Triggers>
    <RegistrationTrigger>
      <Repetition>
        <Interval>PT15M</Interval>
        <StopAtDurationEnd>false</StopAtDurationEnd>
      </Repetition>
      <ExecutionTimeLimit>PT1H</ExecutionTimeLimit>
      <Enabled>true</Enabled>
    </RegistrationTrigger>
  </Triggers>
  <Principals>
    <Principal id="Author">
      <UserId>ABCDEFG123\Administrator</UserId>
      <LogonType>Password</LogonType>
      <RunLevel>HighestAvailable</RunLevel>
    </Principal>
  </Principals>
  <Settings>
    <MultipleInstancesPolicy>StopExisting</MultipleInstancesPolicy>
    <DisallowStartIfOnBatteries>false</DisallowStartIfOnBatteries>
    <StopIfGoingOnBatteries>true</StopIfGoingOnBatteries>
    <AllowHardTerminate>true</AllowHardTerminate>
    <StartWhenAvailable>true</StartWhenAvailable>
    <RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>
    <IdleSettings>
      <StopOnIdleEnd>true</StopOnIdleEnd>
      <RestartOnIdle>false</RestartOnIdle>
    </IdleSettings>
    <AllowStartOnDemand>true</AllowStartOnDemand>
    <Enabled>true</Enabled>
    <Hidden>false</Hidden>
    <RunOnlyIfIdle>false</RunOnlyIfIdle>
    <WakeToRun>false</WakeToRun>
    <ExecutionTimeLimit>PT1H</ExecutionTimeLimit>
    <Priority>7</Priority>
  </Settings>
  <Actions Context="Author">
    <Exec>
      <Command>C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe</Command>
      <Arguments>-ExecutionPolicy RemoteSigned -NoProfile -NonInteractive -File "C:\ssh-fail2\slc-tie-breaker.ps1" FrontEndA</Arguments>
      <WorkingDirectory>C:\ssh-fail2\</WorkingDirectory>
    </Exec>
  </Actions>
</Task>

PrimaryTest.sh...bash script locally executed on the Application Primary servers

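#!/bin/bash

# Report the primary as up only when called with the exact argument "Up"
if [ "$1" = "Up" ]; then
    Is_App_Up="Primary_App_Is_Up"
else
    Is_App_Up="Primary_App_Is_Down"
fi
echo "$Is_App_Up"

# Record when the check was last run
date >> /tmp/log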
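
As written, PrimaryTest.sh only echoes a status derived from its argument, which is convenient for simulating failures but does not probe the application. One possible extension is sketched below, under the assumption that the application's init script supports a status action that exits 0 when the service is running. That assumption, the STATUS_CMD variable and the primary_status function are all hypothetical and not part of the original scripts.

```shell
#!/bin/bash
# Hypothetical: command used to probe the application. It is assumed to exit 0
# when the application is running; the real check is site-specific.
STATUS_CMD="${STATUS_CMD:-/etc/init.d/custom_application-heavy status}"

primary_status() {
    # Intentionally unquoted so that the command and its argument are split
    if $STATUS_CMD >/dev/null 2>&1; then
        echo "Primary_App_Is_Up"
    else
        echo "Primary_App_Is_Down"
    fi
}
```

The output strings are kept identical to those of PrimaryTest.sh so that the tie-breaker's comparison against Primary_App_Is_Up would continue to work unchanged.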

ConfirmPriDown.sh....bash script remotely connects to Application Primary Server and runs PrimaryTest.sh

#!/bin/bash

# Addresses of the primary cluster this DR server backs up
# (the values shown are those of the FrontEndA cluster)
VIP=10.10.10.53
P1=10.10.10.51
P2=10.10.10.52

if [[ "$1" = "VIP" ]]; then TEST="$VIP"
elif [[ "$1" = "P1" ]]; then TEST="$P1"
elif [[ "$1" = "P2" ]]; then TEST="$P2"
else
    echo "Usage: You must define which node to test {VIP,P1,P2}"
    exit 1
fi

# Run the primary's own test script over ssh and compare its output
if [[ "$(ssh -q -C "$TEST" /home/failover-user/custom_applicationfailover0.1/primary-test.sh Up)" = "Primary_App_Is_Up" ]]; then
    echo Primary_App_Is_Up
else
    echo test-failed
fi
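
If the primary host itself is unreachable, the ssh call in ConfirmPriDown.sh may block until the TCP connection attempt times out, and it could prompt for a password if key login fails. A possible hardening is sketched below using the standard OpenSSH options BatchMode and ConnectTimeout. The 5 second timeout, the check_primary function and the SSH_CMD override (present only so the function can be exercised without a network) are assumptions, not part of the original script.

```shell
#!/bin/bash
# Hypothetical override so the function can be tested with a stub
SSH_CMD="${SSH_CMD:-ssh}"

# $1 = address of the primary node to test
check_primary() {
    local host="$1"
    local result

    # BatchMode=yes: never prompt for a password; ConnectTimeout=5: give up after 5 seconds
    result=$("$SSH_CMD" -q -o BatchMode=yes -o ConnectTimeout=5 "$host" \
        /home/failover-user/custom_applicationfailover0.1/primary-test.sh Up 2>/dev/null) || result=""

    if [ "$result" = "Primary_App_Is_Up" ]; then
        echo Primary_App_Is_Up
    else
        echo test-failed
    fi
}
```

A bounded timeout matters here because the tie-breaker is waiting on this answer before it decides whether to start the failover.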


MountStart.sh....bash script that starts the Application application on the DR server

#!/bin/bash

LOG=/home/failover-user/custom_applicationfailover0.1/startlog

echo >> "$LOG"
echo "$(date) start" >> "$LOG"

if [ "$1" = "Start" ]; then
    echo "$(date) $0 $1" >> "$LOG"
    echo "sudo /etc/init.d/custom_application-heavy restart" >> "$LOG"
    # Append (rather than overwrite) the init script output to the log
    /usr/bin/sudo /etc/init.d/custom_application-heavy restart >> "$LOG" 2>&1
    # Capture the exit status immediately, before another command resets $?
    rc=$?

    if [ "$rc" -eq 0 ]; then
        # The tie-breaker compares this output against 'App_Started'
        echo "App_Started"
    else
        echo "$0 Failed to restart app"
    fi
else
    echo "$(date) Usage $0 Start" >> "$LOG"
fi

echo "$(date) finished" >> "$LOG"

#/usr/bin/sudo /etc/init.d/custom_application-heavy restart
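
The start script calls sudo from an SSH session that has no tty and cannot answer a password prompt. A sudoers fragment along the following lines would allow that. It is a sketch under assumptions: the NOPASSWD rule and the restriction to the single restart command are not taken from the original configuration, and any change to /etc/sudoers should be made with visudo.

```text
# Allow failover-user to use sudo without a tty
Defaults:failover-user !requiretty

# Allow failover-user to restart the application as root without a password prompt
failover-user ALL=(root) NOPASSWD: /etc/init.d/custom_application-heavy restart
```

Scoping the Defaults line to failover-user leaves requiretty in force for every other account.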