HPCC: High Precision Congestion Control
Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, Minlan Yu

High-performance networking is desired
High-performance storage
• HDD → SSD → NVMe
• Higher throughput, lower latency
• 1M IOPS / 50~100 µs
• CPU → GPU, FPGA, ASIC
• Faster compute, lower latency
• E.g. latency 10 µs
Distributed deep learning
Resource disaggregation
• More network load
• Need ultra-low latency: 3-5 µs
[Figure: compute and storage nodes connected by the network]

Two essential steps to high performance
[Figure: three hosts, each with App, Stack, and NIC, attached to the Network; CC at each host]
Many ongoing efforts on hardware offloading: RDMA, SmartNIC
More severe congestion
So congestion control (CC) is very important!

Problems of state-of-the-art
• State-of-the-art CC in hardware-offloading solutions
  o DCQCN [SIGCOMM 15]: ECN-based
  o TIMELY [SIGCOMM 15]: delay-based
• Face problems in operating large-scale RDMA networks
• Root cause is the CC: CC is immature in high-speed networks
• Stability issues: incast/failure cause PFC storms or deadlock
  ➢ Slow convergence
• Challenge in reconciling BW-hungry apps and latency-sensitive apps
  ➢ Standing queue
• Parameter tuning takes months
  ➢ CC parameters are hard to tune

Problem 1: slow convergence
• Operators’ problem
  o Buffer overflow happens during incast/failure
    ► Use PFC to prevent loss: good performance on average, but threat from PFC storms and deadlock
    ► Disable PFC: bad performance on average
• Root cause: CC cannot control the buffer well, because of slow convergence
  o Imprecise feedback (ECN, delay) cannot tell the rate mismatch
  o CC resolves congestion slowly, by halving the rate per RTT
  o More severe in higher-speed networks, because the buffer fills faster
• Ideally, adjust to the right rate in one RTT
  o So loss is rare, and PFC is not needed

Problem 2: standing queues
• Operators’ problem:
  o Hard to run BW-hungry apps and latency-sensitive apps in the same cluster
[Figure: ML and storage nodes sharing one network; ~5 µs base RTT, 100 µs latency]

Problem 2: standing queues
• Root cause is CC
• Feedback (ECN/delay) relies on the queue
• CC intentionally keeps standing queues
Adds 20~50 µs queueing delay: 4~10x base RTT!

Problem 3: complex parameter tuning
• Tradeoffs: utilization, stability, throughput, latency
• Many factors affect the tradeoffs
  o Traffic patterns, failure scenarios, and network architecture
• Feedback (ECN/delay) is imprecise
  o CC has to use heuristics to guess (1) network condition; (2) rate adjustments
• Many parameters in such heuristics
  o e.g., 15 parameters in DCQCN
  o Tuning takes several months

Problems of state-of-the-art
• Slow convergence
  ➢ No precise feedback indicating how much to increase/decrease
• Standing queue
  ➢ Feedback relies on queue
• Complex parameter tuning
  ➢ No precise feedback, so heuristics are needed: lots of parameters
➢ One fundamental issue: coarse-grained feedback
What if we have precise feedback?

HPCC: use INT as precise feedback
• In-band network telemetry (INT) provides many details per packet
• Broadcom & Barefoot have INT in recent products
• Widely used for diagnosis and monitoring in production
[Figure: packets travel from Sender to Receiver over Link-1 and Link-2; each link adds INT data to the packet]

How good can CC do with INT as the feedback?
We design HPCC to answer this question

HPCC solves the 3 problems
Using INT as the precise feedback
• Fast convergence
  Sender knows the precise rate to adjust to, on every ACK
• Near-zero queue
  Feedback does not rely on queue
• Few parameters
  Precise feedback, so no need for heuristics, which require many parameters

HPCC: overview and challenges
[Figure: packets travel from Sender to Receiver over Link-1 and Link-2, collecting INT data; the ACK returns it to the sender]
Adjust rate per ACK
Challenge 1: Feedback delay
  - Pkt/ACK may get delayed
Challenge 2: Overreaction
  - Different ACKs bring overlapping feedback

Challenge 1: tolerate feedback delay
[Figure: timelines from 0 to 2T showing when feedback arrives, without congestion and with congestion, for rate-based schemes (DCQCN/TIMELY) and for HPCC]
• Rate-based (DCQCN/TIMELY): under congestion the high rate persists until feedback arrives
  o Feedback delay can be much longer than T, because T is very low and queueing delay dominates (T is the base RTT)
• HPCC: each sender uses a window to limit inflight bytes: W = target_rate × T

Challenge 1: tolerate feedback delay
Need a new way of measuring congestion:
• Rate-based scheme
  - Use rate mismatch
• HPCC
  - Rate mismatch is a bad measurement
  - Actual sending rate is non-constant, due to the inflight bytes limit

Estimating inflight bytes
• Use total inflight bytes to measure congestion
  o Each flow needs to estimate the total inflight bytes
  o Use INT to estimate the total inflight bytes
[Figure: sender, bottleneck link with bandwidth B, base RTT T, and a queue]
total inflight bytes ≈ qlen + txRate × T
  - qlen and txRate are available in INT
  - Each sender estimates independently in a distributed way

Controlling inflight bytes
• Compute adjustment based on inflight bytes
  o Use MIMD to quickly adjust to the right rate

Challenge 2: Fast reaction without overreaction
[Figure: packets 1 to 5 in flight; pipe volume = 6]
• Per-ACK reaction: overreaction
  o Ack 1: W = W0 / 1.5
  o Ack 2: W = W0 / 1.5 / 1.3
  o Overlap: overreaction
• Per-RTT reaction: no overreaction
  o Ack 1: W = W0 / 1.5; next to react: 6
  o Ack 6: W = W0 / 1.5 / 1.1
  o No overlap: no overreaction
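The inflight-bytes estimate and the per-ACK versus per-RTT reactions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the HPCC implementation: the names `LinkINT`, `utilization`, `PerAckSender`, and `PerRttSender` are hypothetical, and dividing the window by the ratio of estimated inflight bytes to B × T is one plausible reading of "use MIMD", which the slides do not spell out.

```python
from dataclasses import dataclass


@dataclass
class LinkINT:
    """Per-link fields assumed to be carried in INT (hypothetical layout)."""
    qlen: float       # queue length, bytes
    tx_rate: float    # output rate of the link, bytes/second
    bandwidth: float  # link bandwidth B, bytes/second


def estimate_inflight(link: LinkINT, base_rtt: float) -> float:
    """Slide formula: total inflight bytes ~= qlen + txRate * T."""
    return link.qlen + link.tx_rate * base_rtt


def utilization(links, base_rtt: float) -> float:
    """Assumption: normalise inflight bytes by B * T, take the worst link."""
    return max(estimate_inflight(l, base_rtt) / (l.bandwidth * base_rtt)
               for l in links)


class PerAckSender:
    """Reacts to every ACK, so overlapping feedback is applied repeatedly."""

    def __init__(self, window: float):
        self.window = window

    def on_ack(self, acked_seq: int, u: float) -> float:
        self.window /= u  # multiplicative adjustment on every ACK
        return self.window


class PerRttSender:
    """Reacts once per RTT: ignores ACKs for packets sent before the
    last adjustment, because their feedback overlaps with it."""

    def __init__(self, window: float, next_seq: int = 1):
        self.window = window
        self.next_seq = next_seq  # sequence number of the next packet to send
        self.next_to_react = 0    # the first ACK always triggers a reaction

    def on_send(self) -> int:
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def on_ack(self, acked_seq: int, u: float) -> float:
        if acked_seq >= self.next_to_react:
            self.window /= u
            # Only packets sent from now on carry fresh feedback.
            self.next_to_react = self.next_seq
        return self.window


# Replay of the slide example: packets 1..5 are in flight, W0 = 6.
per_ack = PerAckSender(6.0)
per_rtt = PerRttSender(6.0)
for _ in range(5):
    per_rtt.on_send()

per_ack.on_ack(1, 1.5)
per_ack.on_ack(2, 1.3)      # overlapping feedback is applied again
per_rtt.on_ack(1, 1.5)      # reacts; next to react becomes 6
per_rtt.on_ack(2, 1.3)      # ignored: packet 2 was sent before the reaction
per_rtt.on_send()           # packet 6
per_rtt.on_ack(6, 1.1)      # fresh feedback, reacts again
```

With these inputs the per-ACK sender ends at W0 / 1.5 / 1.3 after only two ACKs, while the per-RTT sender holds W0 / 1.5 until the ACK for packet 6 arrives and then moves to W0 / 1.5 / 1.1, matching the two cases on the slides.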