TwitterMessaging的架构演化之路1AgendaBackgroundLayeredArchitectureDesignDetailsPerformanceScale@TwitterQ&APublish-SubscribeOnlineservices-10sofmillisecondsTransactionlog,Queues,RPCNearreal-timeprocessing-100sofmillisecondsChangepropagation,StreamComputingDatadeliveryforanalytics-seconds~minutesLogcollection,aggregationTwitterMessagingat2012KestrelDeferredRPCKestrelGizzardKestrelDatabaseBookKeeperSearchMySQLKafkaHDFSScribeCoreBusinessLogic(tweets,fanouts…)OnlineServices-KestrelKestrelSimplePerformwell(aslongasqueuefitsinmemory)Fan-outQueues:OnepersubscriberReliablereads-peritemtransactionCrossDCreplicationOnlineServices-KestrelLimitationsKestrelLimitationsDurabilityishardtoachieve-EachqueueisaseparatefileAddingsubscribersisexpensiveSeparatephysicalcopyofthequeueforeachfanoutRead-behinddegradesperformance-ToomanyrandomI/OsScalespoorlyas#queuesincreaseStreamComputing-KafkaKafkaThroughput/LatencythroughsequentialI/OwithsmallnumberoftopicsAvoiddatacopying-DirectNetworkI/O(sendfile)BatchCompressionCrossDCreplication(Mirroring)StreamComputing-KafkaLimitationKafkaLimitationReliesonfilesystempagecacheLimit#topics:IdeallyoneorhandfultopicsperdiskPerformancedegradeswhensubscriberfallsbehind-ToomuchrandomI/ONodurabilityandreplication(0.7)ProblemsEachofthesystemscamewiththeirmaintenanceoverheadSoftwareComponents-backend,clientsandinteropwiththerestofTwitterstackManageabilityandSupportability-deployment,upgrades,hardwaremaintenanceandoptimizationTechnicalknow-howRethinkthemessagingarchitectureUnifiedStack-tradeoffsforvariousworkloadsDurablewrites,intra-clusterandgeo-replicationMultitenancyScaleresourcesindependently-CostefficiencyEaseofmanageabilityLayeredArchitectureDataModelSoftwareStackDataFlowLogStream0100100111010sequenceofbytes1010101110100010111011001010101101101011010110110101Entry:abatchofrecordsLogSegmentXLogSegmentX+1LogSegmentYDLSN:(LSSN,Eid,Sid)SequenceIDTransactionIDLayeredArchitectureBookieBookieBookieColdStorage(HDFS)MetadataStore(ZooKeeper)PERSISTENTSTORAGEAPPLICATIONSTATELESSSERVINGCOREBookieBookKeeperWriteClientsOwnershipTrackerWriteProxyWriterReadClientsRoutingServiceReadProxyReaderMessagingFlowWriteClientWriteProxyBookieBookieReadProxyReadClientReadClientReadClient1.writerecords4.acknowledge2.transmitbuffer3.Flush-Writeabatchedentrytobookies5.Commit-WriteControlRecordBookie6.Longpollread7.SpeculativeRead8.CacheRecords9.LongpollreadDesignDetailsConsistencyGlobalReplicatedLogConsistencyLastAddConfirmed=ConsistentviewsamongreadersFencing=Consistentviewsamongwriters01234789LastAddPushed101112Consistency-LastAddPushedWriterAddentriesConsistency-LastAddConfirmed01234789101112LastAddConfirmedReaderReaderLastAddPushedWriterWriterOwnershipChangedAddentriesAckAddsFencingInprogressConsistency-FencingNewWriterBookieBookieBookieBookieBookieCompletedLogSegmentXCompletedLogSegmentX+1InprogressLogSegmentX+20.OwnershipChangedOldWriter1.GetLogSegmentsCompletedLogSegmentX2.3CompleteInprogressLogSegmentInprogressLogSegmentX+12.1FenceInprogressLogSegment2.2writerejected3newinprogressConsistency-OwnershipTrackingOwnershipTracking(LeaderElection)ZooKeeperEphemeralZnodes(leases)AggressiveFailureDetection(withinasecond)TickTime=500(ms)SessionTimeout=1000(ms)GlobalReplicatedLogRegionAwareDataPlacementCrossRegionSpeculativeReadsBookieRegion1ZKWriteProxyBookieBookieRegion3ZKWriteProxyReaderGlobalReplicatedLogWriterOwnershipTrackerRegion2ZKWriteProxyRegionAwarePlacementPolicySpeculativeReadRegionAwareDataPlacementPolicyHierarchicalDataPlacementDataisspreaduniformlyacrossavailableregionsEachregionusesrackawareplacementpolicyAcknowledgeonlywhenthedataispersistedinmajorityofregionsCrossRegionSpeculativeReadsReaderconsultsdataplacementpolicyforreadorderFirst:thebookienodethatisclosesttotheclientSecond:theclosestnodethatisinadifferentfailuredomain-differentrackThird:thebookienodeinadifferentclosestregion...PerformanceLatencyvsThroughputScalabilityEfficiencySupportvariousworkloadswithlatency/throughputtradeoffsLatencyandthroughputunderdifferentflushpoliciesScalewithmultiplestreams(singlenodevsmultiplenodes)Under100krps,latencyincreasedwithnumberofstreamsincreasedonasinglehybridproxyEachstreamwrites100rps.Throughputincreasedlinearlywithnumberofstreams.ScalewithlargenumberoffanoutreadersAnalyticapplicationwrites2.45GBpersecond,whilethedatahasbeenfanoutto40xtothereaders.Messaging@TwitterUseCasesDeploymentScaleApplicationsatTwitterManhattanKey-ValueStoreDurableDeferredRPCReal-timesearchindexingSelf-ServedPub-SubSystem/StreamComputingReliablecrossdatacenterreplication...ScaleatTwitterOneglobalcluster,andafewlocalclusterseachdcO(10^3)bookienodesO(10^3)globallogstreamsandO(10^4)locallogstreamsO(10^6)livelogsegmentsPub-Sub:deliverO(1)trillionrecordsperday,roughlyaccountingforO(10)PBperdayLessonsthatwelearnedMakefoundationdurableandconsistentDon’ttrustfilesystemThinkofworkloadsandI/OisolationKeeppersistentstateassimpleaspossible...DistributedLogisthenewmessagingfoundationLayeredArchitectureSeparatedStatelessServingfromStatefulStorageScaleCPU/Memory