Advanced Technical Support – System p
7/7/2008
© 2007 IBM Corporation

HACMP Concept and Implementation

Contents
– HACMP basic concepts
– Introduction to new features in HACMP 5.x
– Common HA architectures
– Day-to-day administration
– Q&A

Hardware Failures and System Outages
Hardware is now very reliable, and hardware failures account for only a minority of system outages.
– Several studies place the proportion between 20% and 45%
– Human error, software error and planned maintenance cause the majority of service outages

HACMP (High Availability Cluster Multi-Processing)
– Why is high availability needed?
– What is HACMP?
  High Availability:
  • Maximize system availability (uptime)
  • Minimize system downtime
  Multi-Processing:
  • Multiple applications can run on the nodes of a cluster
  • Data is shared or accessed concurrently
– The purpose of HACMP
  • Eliminate single points of failure (SPOF) to achieve high availability
– High availability is fault resilient, not fault tolerant

High Availability & Fault Tolerance

Solution: Standalone
  Availability benefits: Journaled File System, Dynamic CPU Deallocation, Service Processor, Redundant Power, Redundant Cooling, ECC Memory, Hot Swap Adapters, Dynamic Kernel
  Downtime: Couple of days
  Data availability: Good as your last full backup
  Relative cost: 1

Solution: High Availability Clusters
  Availability benefits: Redundant Servers, Redundant Networks, Redundant Network Adapters, Heartbeat Monitoring, Failure Detection, Failure Diagnosis, Automated Fallover, Automated Reintegration
  Downtime: Depends, but typically 3 mins
  Data availability: Last transaction
  Relative cost: 2-3

Solution: Fault Tolerant Computers
  Availability benefits: Lock Step CPUs, Hardened Operating System, Hot Swap Everything, Continuous Restart
  Downtime: In theory, none
  Data availability: No loss of data
  Relative cost: 10+

Fundamental HACMP Concepts
Topology: Physical "networking-centric" components
Resources: Entities which are being made highly available
Resource group: A collection of resources which HACMP controls as a single unit
Resource group policies:
– Startup policy: determines on which node the resource group is activated
– Fallover policy: determines the target when there is a failure
– Fallback policy: determines fallback behavior
Customization: The process of augmenting HACMP, typically by implementing scripts

HACMP's Topology Components
The topology components consist of a cluster, nodes, and the network technology which connects them together.

HACMP's Resource Components

Networking Review: IPAT
HACMP uses IP Address Takeover (IPAT) to keep networking resources (service IP labels, persistent labels) highly available.
• There are 2 types of IPAT:
  – IPAT via IP Aliasing:
    • HACMP adds the service IP address to an (AIX) interface IP address using AIX's IP aliasing feature:
      ifconfig en0 alias 192.168.1.2
  – IPAT via IP Replacement:
    • HACMP replaces an (AIX) interface IP address with the service IP address:
      ifconfig en0 192.168.1.2

Networking Review: Configuration Rules
• Non-service IP addresses
  – Define these addresses in the /etc/hosts file and configure them in the HACMP topology as communication interfaces
  – Using heartbeat over IP interfaces
    • To enable accurate diagnosis of network component failures, each IP address defined on a node's interfaces must be in a different logical IP subnet (this address is configured in AIX)
    • There must be at least one subnet in common with all nodes
  – Using heartbeat over IP alias
    • Removes subnet restrictions on all addresses
• Service IP addresses
  – Define service addresses in /etc/hosts and in HACMP resources
    • HACMP configures them in AIX when needed
  – IPAT via IP Aliasing:
    • They must not be in the same logical IP subnet as any of the non-service IP addresses
  – IPAT via IP Replacement:
    • Each service IP label must be in the same subnet as one of the non-service labels
    • There must be at least as many NICs on each node as there are service IP labels
    • All service IP labels must be in the same subnet

Just What Does HACMP Do?
HACMP functions:
– Monitor the states of nodes, networks, and network adapters/devices
– Strive to keep resource groups highly available
– Optionally, HACMP can monitor the state of the application(s) and can be customized to react to every possible failure

What Happens When Something Fails?
How the cluster responds to a failure depends on what has failed, what the resource group's fallover policy is, and whether there are any resource group dependencies:
– Typically another equivalent component takes over the duties of the failed component (for example, another node takes over from a failed node)

What Happens When a Problem is Fixed?
How the cluster responds to the recovery of a failed component depends on what has recovered, what the resource group's fallback policy is, and what resource group dependencies there are:
– Typically, administrators need to indicate/confirm that the fixed component is approved for use. Some components are integrated automatically, for instance when a communication interface recovers.

Resource Group Behavior
Non-concurrent
– Standby with/without fallback
– Mutual takeover (very popular)
Concurrent
– The application must be designed to run simultaneously on multiple nodes
– This has the potential for essentially zero downtime and is designed for fault tolerance and high performance
– The application must be specifically written for the environment
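
The fallover and fallback behavior of a non-concurrent resource group (standby with or without fallback) can be illustrated with a small toy model. This is only a sketch in Python, not HACMP code: the `ResourceGroup` class, the functions `pick_target`, `node_failed` and `node_recovered`, and the node names are all hypothetical. The model assumes a simple priority-ordered node list and ignores resource group dependencies, networks, and administrator confirmation.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class ResourceGroup:
    """Toy model of a non-concurrent resource group (illustrative only)."""
    name: str
    node_priority: List[str]      # participating nodes, highest priority first
    fallback: bool                # True: move back to a higher-priority node
    owner: Optional[str] = None   # node currently hosting the group, if any


def pick_target(rg: ResourceGroup, up_nodes: Set[str]) -> Optional[str]:
    """Return the highest-priority participating node that is up, or None."""
    for node in rg.node_priority:
        if node in up_nodes:
            return node
    return None


def node_failed(rg: ResourceGroup, up_nodes: Set[str], failed: str) -> None:
    """Fallover: if the owner fails, move the group to the next available node."""
    up_nodes.discard(failed)
    if rg.owner == failed:
        rg.owner = pick_target(rg, up_nodes)


def node_recovered(rg: ResourceGroup, up_nodes: Set[str], recovered: str) -> None:
    """Fallback: on recovery, move the group back only if the policy allows it."""
    up_nodes.add(recovered)
    if recovered not in rg.node_priority:
        return  # node does not participate in this resource group
    if rg.owner is None:
        # The group was offline; bring it up on the best available node.
        rg.owner = pick_target(rg, up_nodes)
    elif rg.fallback and (
        rg.node_priority.index(recovered) < rg.node_priority.index(rg.owner)
    ):
        rg.owner = recovered


if __name__ == "__main__":
    up = {"nodeA", "nodeB"}
    rg = ResourceGroup("rg_app", ["nodeA", "nodeB"], fallback=True, owner="nodeA")
    node_failed(rg, up, "nodeA")
    print(rg.owner)  # nodeB
    node_recovered(rg, up, "nodeA")
    print(rg.owner)  # nodeA
```

With `fallback=False` the same sequence leaves the group on `nodeB` after `nodeA` recovers, which corresponds to "standby without fallback". A mutual takeover configuration could be modeled as two such groups with reversed priority lists.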