To mitigate a failure of the primary site, the 4 custom_application applications are failed over to the secondary (DR) site.
A Windows 2008 tie-breaker server located at a third site monitors the primary cluster via SSH every 5 seconds.
If it detects a failure, it asks the DR custom_application server to confirm that the primary site has failed.
If both the tie-breaker and the DR server agree that the primary site is down, the tie-breaker server executes the failover scripts located on the DR servers, which perform the failover.
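The agreement rule described above can be sketched as a small bash function. This is only an illustration: the function name decide_action and the "up"/"down"/"suspect-link" strings are hypothetical and do not appear in the real scripts, which exchange the token Primary_App_Is_Up instead.

```shell
#!/bin/bash
# Sketch of the two-party agreement rule (hypothetical names and tokens).
# $1 = what the tie-breaker observed for the primary ("up" or "down")
# $2 = what the DR server observed for the primary ("up" or "down")
decide_action() {
    local tiebreaker_view="$1" dr_view="$2"
    if [ "$tiebreaker_view" = "up" ]; then
        echo "none"            # primary answered the tie-breaker, nothing to do
    elif [ "$dr_view" = "up" ]; then
        echo "suspect-link"    # only the tie-breaker lost sight of the primary
    else
        echo "failover"        # both observers agree the primary is down
    fi
}

decide_action down down   # prints: failover
```

Requiring a second observer means that a broken link between the third site and the primary site is not, on its own, enough to trigger a failover.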
Operating Systems
- Application servers are RHEL 5.9
- Steward server is Windows 2008
The tie-breaker server requires:
- PowerShell
- .NET Framework
- user account
- scheduled task for each monitored cluster
- PowerShell script
- SSH wrapper script (PowerShell to .NET) using the SSH.NET library
- log directory
- public SSH key of the primary and the DR servers (to enable passwordless login)
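Passwordless login depends on a public key being listed in the target account's authorized_keys file. The following is a minimal sketch of how that could be done on each Linux server, assuming the tie-breaker's id_rsa.pub has already been copied across. The function name install_pubkey and its arguments are hypothetical and not part of the original scripts.

```shell
#!/bin/bash
# Sketch: authorise a public key for an account, without adding duplicates.
# install_pubkey PUBKEY_FILE HOME_DIR   (hypothetical helper)
install_pubkey() {
    local pubkey_file="$1" home_dir="$2"
    local ssh_dir="$home_dir/.ssh"
    local auth_keys="$ssh_dir/authorized_keys"

    # sshd ignores keys when these permissions are too open (StrictModes)
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
    touch "$auth_keys" && chmod 600 "$auth_keys" || return 1

    # Append only if the exact key line is not already present
    if ! grep -qxF -f "$pubkey_file" "$auth_keys"; then
        cat "$pubkey_file" >> "$auth_keys"
    fi
}

# Hypothetical usage on a Linux server, run as failover-user:
# install_pubkey /tmp/id_rsa.pub "$HOME"
```

Running the helper twice leaves a single copy of the key, so it is safe to repeat during setup.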
Files
Windows Steward server
id_rsa ## SSH private key to allow login to the Linux servers
id_rsa.pub ## SSH public key to allow login to the Linux servers
Renci.SshNet.dll ## SSH.NET library from http://sshnet.codeplex.com/
SSH-Sessions.psd1 ## module manifest of the PowerShell wrapper for SSH.NET from http://www.powershelladmin.com/wiki/SSH_from_PowerShell_using_the_SSH.NET_library
SSH-Sessions.psm1 ## script module of the same wrapper (the cmdlet implementations)
ssh-fail.ps1 ## PowerShell script to log in and check the custom_application servers, from Seamus Murray
schedule.xml ## exported Windows scheduled task to run the script
Application Primary Server
PrimaryTest.sh ## bash script executed locally on the Application Primary servers, from Seamus Murray
Application DR Server
ConfirmPriDown.sh ## remotely connects to Application Primary Server and runs PrimaryTest.sh from Seamus Murray
MountStart.sh ## Starts the Application application
Break-n-Mount.sh ## index servers only: remotely connects to the NetApp controller, breaks the mirror, maps the LUN and mounts it
NetApp controller
id_rsa.pub ## ssh public key of SnapMirror-user on the 2 custom_application index DR servers
Diagrams
Script contents
ssh-fail.ps1
#ssh-fail.ps1
#powershell script to login and check custom_application servers from Seamus Murray
Param(
[string]$cluster
)
Import-Module SSH-Sessions
#Start-Transcript -path C:\ssh-fail2\logs\transcript.txt -append
##set debug level
#$debug=1
#if ( $debug -eq 1 ) {
# # Debug log file
# $deboutfile=$scriptdir+"\outputlog."+$siteid+".log"
# start-transcript -path $deboutfile -force
# # Shows a trace of each line being run with variables as variables
# set-psdebug -trace 1
#
#}
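#Validate the cluster name; exit with a non-zero code instead of using break outside a loop
$validClusters = 'Indexer', 'DataBase', 'FrontEndA', 'FrontEndB'
if ( $validClusters -notcontains $cluster )
{
Write-Host "You must specify the cluster to test as an argument: Indexer, DataBase, FrontEndA or FrontEndB"
exit 1
}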
## IP Function Hostname Role
Switch ($cluster)
{
FrontEndA {
$server_P1='10.10.10.51' #a Application FrontEnd SERVERFW1TS QLD-APP-5
$server_P2='10.10.10.52' #a Application FrontEnd SERVERFW2TS QLD-APP-6
$server_VIP='10.10.10.53'#a Application FrontEnd VIPA
$server_DR='10.10.11.32' #a Application FrontEnd SERVERFW3TS SYD-APP-3
}
FrontEndB{
$server_P1='10.10.10.55' #b Application FrontEnd SERVERFW3TS QLD-APP-7
$server_P2='10.10.10.56' #b Application FrontEnd SERVERFW4TS QLD-APP-8
$server_VIP='10.10.10.57'#b Application FrontEnd VIPB
$server_DR='10.10.11.33' #b Application FrontEnd SERVERFW2TS SYD-APP-4
}
Indexer{
$server_P1='10.10.10.77' #c Application Indexer SERVERIX1TS QLD-APP-1
$server_P2='10.10.10.78' #c Application Indexer SERVERIX2TS QLD-APP-2
$server_VIP='10.10.10.76'#c Application Indexer VIPI
$server_DR='10.10.11.17' #c Application Indexer SERVERIX3TS SYD-APP-1
}
DataBase{
$server_P1='10.10.10.85' #d Application DataBase SERVERHD1TS QLD-APP-3
$server_P2='10.10.10.86' #d Application DataBase SERVERHD2TS QLD-APP-4
$server_VIP='10.10.10.84'#d Application DataBase VIPD
$server_DR='10.10.11.18' #d Application DataBase SERVERHD3TS SYD-APP-2
}
}
##set log file in the local log directory, e.g. 2012-01-01_0000_10.10.10.51_custom_applicationfail.log
$Logfile = "c:\ssh-fail2\logs\$(get-date -uformat %Y-%m-%d_%H%M)"+"_$server_P1"+"_custom_applicationfail.log"
#Write_Host $Logfile
#LogWrite $env:PSModulePath
Function LogWrite
{
Param ([string]$logstring)
Add-content $Logfile -value $logstring
}
$stamped = "$(Get-Date)" + " starting script "
LogWrite $stamped
#Remote Scripts executed on the linux servers but called from this script
#You must specify the argument "Up" case sensitive for this test to succeed
#If you want to simulate this test failing just change the argument
$ApplicationPrimaryTest='/home/failover-user/custom_applicationfailover0.1/primary-test.sh Up'
#Specify which server the DR should test by passing a single argument: P1, P2 or VIP
#The various IPs are stored both locally in this file and in ApplicationConfirmPriDown.sh on the respective DR servers
#If you want to simulate this test failing just change the argument to something else
$ApplicationConfirmPriDown='/home/failover-user/custom_applicationfailover0.1/primary-confirm-fail.sh VIP'
#This command needs to execute the start-up script via sudo. That requires either a tty, which "SSH-Sessions" doesn't provide, or editing /etc/sudoers to disable requiretty
#This script varies slightly between the Application FrontEnds and the Application Indexers
#On the Application Indexers the NetApp mirrored LUNs need to be broken and mounted; this is handled by the
#Break-n-Mount.sh script called from within the ApplicationMountStart.sh executed from the DR servers
$ApplicationMountStart='/home/failover-user/custom_applicationfailover0.1/initiate-dr.sh Start'
#called from within $ApplicationMountStart on the Indexer DR servers
#$Break-n-Mount='/home/failover-user/custom_applicationfailover0.1/break-snap-mirror.sh RESYNC'
while("forever")
{
New-SshSession -ComputerName $server_P1 -Username 'failover-user' -KeyFile 'C:\ssh-fail2\id_rsa' # | out-null
New-SshSession -ComputerName $server_DR -Username 'failover-user' -KeyFile 'C:\ssh-fail2\id_rsa' # | out-null
try
{
#Write_Host "Testing ApplicationPrimaryTest on $server_P1 1st loop"
$stamped = "$(Get-Date)" + " Testing ApplicationPrimaryTest on $server_P1 1st loop"
LogWrite $stamped
$CmdOutput1 = Invoke-SshCommand -ComputerName $server_P1 -Command $ApplicationPrimaryTest -Quiet
}
catch [Exception]
{
$CmdOutput1 = "SSH_SESSION_FAILED"
#Write-Host "ERROR: $CmdOutput1 during 1st loop" -foregroundcolor white -backgroundcolor red
$stamped = "$(Get-Date)" + " ERROR: $CmdOutput1 during 1st loop"
LogWrite $stamped
}
#Check Primary Server for status
if ( $CmdOutput1 -ne 'Primary_App_Is_Up' ) {
#Write_Host "ERROR: Primary Failure Detected"
#Write_Host "Waiting 10 seconds before retrying"
$stamped = "$(Get-Date)" + " ERROR: Check Primary did not return Primary_App_Is_Up Waiting 10 seconds before retrying"
LogWrite $stamped
sleep 10
try
{
#Write_Host "Testing ApplicationPrimaryTest on $server_P1 2nd loop"
$CmdOutput1 = Invoke-SshCommand -ComputerName $server_P1 -Command $ApplicationPrimaryTest -Quiet
}
catch [Exception]
{
$CmdOutput1 = "SSH_SESSION_FAILED"
#Write-Host "$CmdOutput1 during 2nd loop" -foregroundcolor white -backgroundcolor red
$stamped = "$(Get-Date)" + " $CmdOutput1 during 2nd loop"
LogWrite $stamped
}
#Check Primary Server for status after a previous failure
if ( $CmdOutput1 -ne 'Primary_App_Is_Up' ) {
#Write_Host "ERROR: Primary Failure Detected 2 times"
$stamped = "$(Get-Date)" + " ERROR: Check Primary did not return Primary_App_Is_Up after 2 tries"
LogWrite $stamped
#If this host fails 2 times to determine that Primary_App_Is_Up, ask the DR server to also run the check
try
{
$CmdOutput2 = Invoke-SshCommand -ComputerName $server_DR -Command $ApplicationConfirmPriDown -Quiet
}
catch [Exception]
{
#Record the failure so a stale result from an earlier pass is never reused
$CmdOutput2 = "SSH_SESSION_FAILED"
#Write-Host "ERROR:ssh session to $server_DR has failed. Unable to execute ApplicationConfirmPriDown"
$stamped = "$(Get-Date)" + " ERROR:ssh session to $server_DR has failed. Unable to execute ApplicationConfirmPriDown"
LogWrite $stamped
}
if ( $CmdOutput2 -ne 'Primary_App_Is_Up' ) {
#Write-Host "ERROR: DR Server $server_DR is also reporting Primary Failure....... Need to Initiate DR"
$stamped = "$(Get-Date)" + " ERROR: DR Server is also reporting Primary Failure....... Need to Initiate DR"
LogWrite $stamped
try
{
$CmdOutput3 = Invoke-SshCommand -ComputerName $server_DR -Command $ApplicationMountStart -Quiet
}
catch [Exception]
{
$CmdOutput3 = "SSH_SESSION_FAILED"
#Write-Host "ERROR:ssh session to $server_DR has failed. Unable to execute ApplicationMountStart"
$stamped = "$(Get-Date)" + " ERROR:ssh session to $server_DR has failed. Unable to execute ApplicationMountStart"
LogWrite $stamped
}
if ( $CmdOutput3 -ne 'App_Started' ) {
#Write_Host "ERROR: DR server $server_DR Failed to start the App"
$stamped = "$(Get-Date)" + " ERROR: DR server $server_DR Failed to start the App"
LogWrite $stamped
}
else {
#Write_Host "DR server $server_DR has started the App"
$stamped = "$(Get-Date)" + " DR server $server_DR has started the App"
LogWrite $stamped
$stamped = "$(Get-Date)" + " Nothing else to do...Failover script self terminating"
LogWrite $stamped
break
}
}
else {
#Write_Host "DR server $server_DR is reporting Primary $server_P1 is OK: Nothing To Do"
#Write_Host "Assuming the link between me and the Primary server has failed"
$stamped = "$(Get-Date)" + " Assuming the link between me and the Primary server has failed"
LogWrite $stamped
}
}
else {
#Write_Host "Primary server $server_P1 is OK: Nothing To Do 2nd test"
$stamped = "$(Get-Date)" + " Primary server $server_P1 is OK: Nothing To Do 2nd test"
LogWrite $stamped
}
}
else {
#Write_Host "Primary server $server_P1 is OK: Nothing To Do 1st test"
$stamped = "$(Get-Date)" + " Primary server $server_P1 is OK: Nothing To Do 1st test"
LogWrite $stamped
}
sleep 5
}
Remove-SshSession -RemoveAll
#Stop-Transcript
schedule.xml
#schedule.xml
#exported Windows scheduled task to run the script
<?xml version="1.0" encoding="UTF-16"?>
<Task version="1.2" xmlns="http://schemas.microsoft.com/windows/2004/02/mit/task">
<RegistrationInfo>
<Date>2012-01-01T12:30:10</Date>
<Author>ABCDEFG123\seamus</Author>
</RegistrationInfo>
<Triggers>
<RegistrationTrigger>
<Repetition>
<Interval>PT15M</Interval>
<StopAtDurationEnd>false</StopAtDurationEnd>
</Repetition>
<ExecutionTimeLimit>PT1H</ExecutionTimeLimit>
<Enabled>true</Enabled>
</RegistrationTrigger>
</Triggers>
<Principals>
<Principal id="Author">
<UserId>ABCDEFG123\Administrator</UserId>
<LogonType>Password</LogonType>
<RunLevel>HighestAvailable</RunLevel>
</Principal>
</Principals>
<Settings>
<MultipleInstancesPolicy>StopExisting</MultipleInstancesPolicy>
<DisallowStartIfOnBatteries>false</DisallowStartIfOnBatteries>
<StopIfGoingOnBatteries>true</StopIfGoingOnBatteries>
<AllowHardTerminate>true</AllowHardTerminate>
<StartWhenAvailable>true</StartWhenAvailable>
<RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>
<IdleSettings>
<StopOnIdleEnd>true</StopOnIdleEnd>
<RestartOnIdle>false</RestartOnIdle>
</IdleSettings>
<AllowStartOnDemand>true</AllowStartOnDemand>
<Enabled>true</Enabled>
<Hidden>false</Hidden>
<RunOnlyIfIdle>false</RunOnlyIfIdle>
<WakeToRun>false</WakeToRun>
<ExecutionTimeLimit>PT1H</ExecutionTimeLimit>
<Priority>7</Priority>
</Settings>
<Actions Context="Author">
<Exec>
<Command>C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe</Command>
<Arguments>-ExecutionPolicy RemoteSigned -NoProfile -NonInteractive -File "C:\ssh-fail2\slc-tie-breaker.ps1" FrontEndA</Arguments>
<WorkingDirectory>C:\ssh-fail2\</WorkingDirectory>
</Exec>
</Actions>
</Task>
PrimaryTest.sh...bash script locally executed on the Application Primary servers
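#!/bin/bash
# Stub check: reports the primary as up only when called with the argument "Up"
if [ "$1" = "Up" ]; then
Is_App_Up="Primary_App_Is_Up"
else
Is_App_Up="Primary_App_Is_Down"
fi
echo "$Is_App_Up"
# Record the time of each check
date >> /tmp/log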
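As listed, PrimaryTest.sh only echoes a result based on its argument, which is convenient for simulating failures. A minimal sketch of how the same token could be derived from a real health check follows. The function name report_primary_status is hypothetical, and the example assumes the init script supports a "status" action, which the source does not confirm.

```shell
#!/bin/bash
# Sketch: run a health-check command and print the token the tie-breaker expects.
# report_primary_status CHECK_COMMAND [ARGS...]   (hypothetical helper)
report_primary_status() {
    # With no check command there is nothing to trust, so report down
    if [ $# -eq 0 ]; then
        echo "Primary_App_Is_Down"
        return 1
    fi
    if "$@" >/dev/null 2>&1; then
        echo "Primary_App_Is_Up"
    else
        echo "Primary_App_Is_Down"
    fi
}

# Hypothetical usage, assuming the init script implements "status":
# report_primary_status /etc/init.d/custom_application-heavy status
```

Keeping the output tokens unchanged means the PowerShell side would not need to change.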
ConfirmPriDown.sh....bash script that remotely connects to the Application Primary Server and runs PrimaryTest.sh
#!/bin/bash
VIP=10.10.10.53
P1=10.10.10.51
P2=10.10.10.52
if [[ "$1" = "VIP" ]]; then TEST="$VIP"
elif [[ "$1" = "P1" ]]; then TEST="$P1"
elif [[ "$1" = "P2" ]]; then TEST="$P2"
else
echo "Usage: You must define which node to test {VIP,P1,P2}"
# A usage error is a failure, so return a non-zero exit code
exit 1
fi
if [[ $(ssh -q -C "$TEST" /home/failover-user/custom_applicationfailover0.1/primary-test.sh Up) = "Primary_App_Is_Up" ]] ; then
echo Primary_App_Is_Up
else
echo test-failed
fi
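When the primary site is really down, the ssh call in ConfirmPriDown.sh may hang until the TCP connection times out. A hedged sketch of the same check with a bounded wait is shown below, using the standard OpenSSH options BatchMode and ConnectTimeout. The function name confirm_primary, the 5 second timeout and the optional second argument (a replacement for the ssh command, useful for dry runs) are assumptions for illustration.

```shell
#!/bin/bash
# Sketch: confirm the primary's state over ssh without waiting indefinitely.
# confirm_primary HOST [SSH_COMMAND]   (hypothetical helper)
REMOTE_TEST='/home/failover-user/custom_applicationfailover0.1/primary-test.sh Up'

confirm_primary() {
    local host="$1" ssh_cmd="${2:-ssh}"
    local answer
    # BatchMode=yes stops ssh prompting for a password;
    # ConnectTimeout=5 (an assumed value) bounds the connection attempt
    answer=$("$ssh_cmd" -q -o BatchMode=yes -o ConnectTimeout=5 "$host" "$REMOTE_TEST" 2>/dev/null)
    if [ "$answer" = "Primary_App_Is_Up" ]; then
        echo "Primary_App_Is_Up"
    else
        echo "test-failed"
    fi
}

# Hypothetical usage against the VIP address from the script above:
# confirm_primary 10.10.10.53
```

Any outcome other than the exact token, including a timeout or an authentication failure, is reported as test-failed, matching the behaviour of the original script.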
MountStart.sh....bash script executed on the Application DR Server to start the application
#!/bin/bash
LOG=/home/failover-user/custom_applicationfailover0.1/startlog
echo >> "$LOG"
echo "$(date) start" >> "$LOG"
if [ "$1" = "Start" ]
then
echo "$(date)   $0   $1" >> "$LOG"
echo "sudo /etc/init.d/custom_application-heavy restart" >> "$LOG"
# Append the restart output to the log (do not overwrite it) and capture the exit status immediately
/usr/bin/sudo /etc/init.d/custom_application-heavy restart >> "$LOG" 2>&1
rc=$?
if [ "$rc" -eq 0 ]
then
Start_App="App_Started"
echo "$Start_App"
else
echo "$0 Failed to restart app"
fi
else
echo "$(date) Usage $0 Start" >> "$LOG"
fi
echo "$(date) finished" >> "$LOG"
#/usr/bin/sudo /etc/init.d/custom_application-heavy restart
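The App_Started result depends on the exit status of the restart command, and `$?` always refers to the most recent command, so it has to be saved before anything else (such as a log line) runs. A minimal sketch of that capture-then-log pattern follows. The function name run_and_report and the App_Start_Failed token are hypothetical. Only App_Started is a token the tie-breaker script actually checks for.

```shell
#!/bin/bash
# Sketch: run a command, log its output, and report based on its own exit status.
# run_and_report LOGFILE COMMAND [ARGS...]   (hypothetical helper)
run_and_report() {
    local logfile="$1"; shift
    "$@" >> "$logfile" 2>&1
    local rc=$?    # saved before any other command can change $?
    echo "$(date) ran: $* (exit $rc)" >> "$logfile"
    if [ "$rc" -eq 0 ]; then
        echo "App_Started"
    else
        echo "App_Start_Failed"    # hypothetical token, not in the original
    fi
    return "$rc"
}

# Hypothetical usage on a DR server:
# run_and_report /tmp/startlog /usr/bin/sudo /etc/init.d/custom_application-heavy restart
```

Logging after the status has been saved keeps the log complete without affecting the reported result.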