The Evolution of Twitter's Search Engine Architecture: From Tens of Millions to a Trillion-Scale Index


The Road to a Complete Tweet Index
@yz

Outline
1. Current Scale of Twitter Search
2. The History of Twitter Search Infra
3. Complete Tweet Index
4. Search Engine Applications
5. Outlook

1. Current Scale of Twitter Search

- More than 2 billion search queries per day.
- Hundreds of millions of Tweets are indexed per day.
- Hundreds of billions of Tweets have been sent since the company's founding in 2006.
- Our Complete Tweet Index is served by thousands of instances, each with 256 GB RAM and 2 TB SSD.
- But… our search infrastructure is currently supported by only a small number of engineers and SREs. We are hiring!

2. The History of Twitter Search Infra

2010: Realtime Search
- Powered by replicated MySQL instances and MySQL text matching.
- Hundreds of Tweets per second.
- A few thousand queries per second.
- Basic text search: no fancy tokenization, no search assistance, slow geo search, etc.
- Many incidents and downtimes. MySQL master/slave dying was particularly problematic.

2011: Launched Lucene-based search engine: Earlybird*
- Lucene API, but custom data structures optimized for in-memory operations and Realtime search.
- Novel concurrent and lock-free memory models: concurrently writing and searching an index segment.
- Contains about 7 days of Tweets.
* ~jimmylin/publications/Busch_etal_ICDE2012.pdf

Earlybird vs Lucene/ElasticSearch

| Earlybird | Lucene/ElasticSearch |
| Optimized for in-memory data structures | Optimized for disks |
| Optimized for Realtime indexing and updates | Relatively slow Realtime indexing and updates |
| Optimized for Tweets | Indexes general documents |
| Facet & Term Statistics support | N/A when we built Earlybird |
| Highly optimized for JVM garbage collection | Generates relatively more garbage |
| Thrift Query/Schema/Doc APIs | JSON Query/Schema/Doc APIs |

2011: Retired MySQL text matching, but still utilized MySQL to pipe data into Earlybird.
[Diagram: Tweet Firehose (JSON); Ingestion (Preprocessing, Analysis, Tokenization, Partitioning, etc); Replicated MySQL; Earlybird indices. Hash Partitioning: Tweet ID % number of partitions]

2012: Eliminated Single Points of Failure via partitioning, decreasing the impact of MySQL master/slave failures.
[Diagram: Tweet Firehose (JSON); multiple Ingestion/Ingester stages (Preprocessing, Tokenization, Partitioning, etc); multiple Replicated MySQL instances; Earlybird indices. Hash Partitioning: Tweet ID % number of partitions]

2013-2015: Eliminating the use of MySQL as our data bus.
[Diagram: Raw Tweets (JSON); Twitter's Partitioned, Replicated, High-performance Messaging System; Ingesters (Preprocessing, Tokenization, Partitioning, etc); DistributedLog (Twitter's Open Source replicated log service); Earlybird indices]

3. Complete Tweet Index

Complete Tweet Index Motivation
- Be able to search for any Tweet ever published, not just Tweets from the latest 7 days (approx. 300x scaling).

Existing Architecture Challenges
- Small team: limited number of engineers and SREs.
- The Realtime search in-memory architecture cannot hold hundreds of billions of Tweets in RAM. We just do not have enough RAM, and even if we did, it would not be cost effective.
- Scaling is non-trivial: the Realtime search architecture has a roughly fixed size (7 days of Tweets), but the Complete Tweet Index needs to grow bigger each day.
- Ingestion parallelism is low and fixed: parallelism is achieved via partitioning, so 20 partitions means 20 parallel ingestion pipelines.

Complete Tweet Index Design Goals
- Index every Tweet ever published.
- Modularity: shared source code and tests between the Realtime and Complete Tweet Index where possible, which created a cleaner system in less time.
- Scalability: expands in place gracefully as more Tweets are added.
- Cost effectiveness: using the same RAM technology for the complete index would have been prohibitively expensive.
- Highly parallel ingestion: ability to fully rebuild the index in a reasonable amount of time.
- Simple interface: we wanted a simple interface that hides the underlying partitions so that internal clients can treat the cluster as a single endpoint.

Complete Tweet Index Design Overview
Batch
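
The history slides give the partitioning rule as "Tweet ID % number of partitions" and note that 20 partitions means 20 parallel ingestion pipelines. The following is a minimal sketch of that rule only, assuming integer Tweet IDs. The function names `partition_for` and `route` are hypothetical and are not taken from Earlybird or any Twitter code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def partition_for(tweet_id: int, num_partitions: int) -> int:
    """Return the partition index for a Tweet: tweet_id % num_partitions.

    Hypothetical helper; it only mirrors the rule quoted on the slides.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return tweet_id % num_partitions


def route(tweet_ids: Iterable[int], num_partitions: int) -> Dict[int, List[int]]:
    """Group Tweet IDs by partition, one bucket per ingestion pipeline."""
    buckets: Dict[int, List[int]] = defaultdict(list)
    for tweet_id in tweet_ids:
        buckets[partition_for(tweet_id, num_partitions)].append(tweet_id)
    return dict(buckets)


if __name__ == "__main__":
    # With 20 partitions there are at most 20 buckets, hence 20 pipelines.
    print(route([100, 101, 120, 141], num_partitions=20))
```

Under a modulo scheme like this, changing `num_partitions` reassigns most Tweet IDs to different partitions, so the partition count, and with it the number of ingestion pipelines, is not something one changes casually.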
