Principles of Distributed Search Engines

jupitertoy
2020-02-25

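Technical Principles of Distributed Search Engines
Meng Tao (孟涛)

Lucene technical principles
- Indexing & searching
- Basic concepts
- The analysis process
- Concurrency and thread safety
- Inverted index
- The FST data structure
- Segmented indexing
- Scoring: VSM and TF-IDF
- Lucene file structure

Basic concepts
Term, Field, Document, Index, Query, Analyzer, IndexWriter, IndexReader, IndexSearcher, Directory

index demo
search demo

The analysis process
Analysis runs at two points in time: when the index is built, and when QueryParser builds a Query.
Token - a lexical unit.
WhitespaceAnalyzer - splits text on whitespace and applies no further normalization to the resulting tokens.
SimpleAnalyzer - splits text at non-letter characters and lowercases every token. Because digits are not letters, this analyzer drops numeric characters.
StopAnalyzer - like SimpleAnalyzer, but also removes common stop words.
StandardAnalyzer - Lucene's default core analyzer. It contains a good deal of logic for recognizing certain kinds of tokens, such as company names, email addresses and host names. It also lowercases tokens and removes stop words and punctuation.
At indexing time, the tokens produced by analysis are the terms that get indexed. Only indexed terms can be found by a search.

The analysis process: examples
"The quick brown fox jumped over the lazy dog"
WhitespaceAnalyzer: [The][quick][brown][fox][jumped][over][the][lazy][dog]
SimpleAnalyzer: [the][quick][brown][fox][jumped][over][the][lazy][dog]
StopAnalyzer: [quick][brown][fox][jumped][over][lazy][dog]
StandardAnalyzer: [quick][brown][fox][jumped][over][lazy][dog]

"XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer: [XY&Z][Corporation][-][xyz@example.com]
SimpleAnalyzer: [xy][z][corporation][xyz][example][com]
StopAnalyzer: [xy][z][corporation][xyz][example][com]
StandardAnalyzer: [xy&z][corporation][xyz@example.com]

Concurrency and thread safety
- Any number of read-only IndexReader instances can open the same index.
- An index can be opened by only one writer at a time. A file lock (the write.lock file) guarantees that only one writer can write at any given moment.
- Multiple threads can safely share an IndexReader and an IndexWriter.

Inverted index
"In computer science, an inverted index (also referred to as postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents (named in contrast to a Forward Index, which maps from documents to content). The purpose of an inverted index is to allow fast full text searches, at a cost of increased processing when a document is added to the database." (Wikipedia)

FST
Advantages:
- saves space
- fast lookup
Example keys: cat, deep, dogs

Segmented indexing - cache, flush & commit
When a document is added or deleted, the change is first written to memory. It is written to disk only on a flush.
A flush is triggered when:
a. The buffer occupies more than a given amount of memory. Enabled by default.
b. The number of buffered documents exceeds a given value. Disabled by default.
c. The number of deleted terms exceeds a given value. Disabled by default.
Commit. Content becomes visible to an IndexReader only after it has been committed. The steps of a commit:
a. Perform a flush.
b. Sync the newly created files.
c. Sync the segments_N file. Once this sync completes, the changes are visible to an IndexReader.
d. Delete old commits.
Each commit produces a new segment.

Segment merging (segMerge)
If the index contains too many segments, IndexWriter selectively merges them.
The merge policy groups segments by size. Its parameters:
mergeFactor: when the number of segments of roughly equal size reaches this value, a merge starts.
minMergeSize: all segments smaller than this value are treated as being of roughly equal size and are merged together.
maxMergeSize: a segment larger than this value no longer takes part in merges.
maxMergeDocs: a segment containing more documents than this value no longer takes part in merges.

Scoring - the vector space model (VSM)
Documents and queries are represented as vectors V(d) and V(q) in a multi-dimensional space.
The VSM score is the cosine similarity of the document vector V(d) and the query vector V(q).

Scoring - TF-IDF
Weight = tf * idf
TF - term frequency: how often the term occurs in the document.
IDF - inverse document frequency: derived from the number of documents that contain the term. The fewer documents contain it, the more a match on it counts.

Lucene file structure
fnm - field name file. All field names used by the documents in the segment, with their options.
tis - term infos file. Entries are sorted first by field name in alphabetical order, then by term value within each field. Each record includes the term's document frequency, that is, the number of documents containing the term.
tii - term infos index file. An index into tis that speeds up lookups.
frq - frequency file. For each term, one entry per document containing it, together with the number of times the term occurs in that document (the term frequency).
prx - position file. Stores the positions at which each term occurs in each document. Not stored if omitTF is true.
nrm - normalization file. Normalization values computed at indexing time. Shorter fields receive a larger weight.
fdt - field data. The contents of stored fields.
fdx - field index.
tvf & tvd & tvx - term vector information.
Question to think about: why not write directly to disk?

Distributed ElasticSearch
- ElasticSearch vs Lucene
- REST API
- failover & scale horizontally
- Distributed document storage
- Distributed search
- Building a data model

ElasticSearch vs Lucene
Lucene is an open-source information retrieval library. Indexing and search operations have to be implemented by the user on top of it.
ElasticSearch is an open-source search engine built on Lucene. It uses Lucene to carry out indexing and retrieval and is a complete full-text search engine. It also:
a. provides a RESTful API
b. provides distributed storage and can easily be scaled out to hundreds of machines and petabytes of data

REST API
- index
- get a document
- search lite
- Query DSL

REST API - index management
- create an index
- delete an index: DELETE /my_index
- create an alias

REST API
- cluster health
- node stats: ://localhost:9200/_nodes/stats

node - a node is a running instance of Elasticsearch.
cluster - a cluster consists of one or more nodes with the same cluster.name that are working together to share their data and workload.
master node - One node in the cluster is elected to be the master node, which is in charge of managing cluster-wide changes like creating or deleting an index, or adding or removing a node from the cluster. The master node does not need to be involved in document-level changes or searches, which means that having just one master node will not become a bottleneck as traffic grows. Any node can become the master.

sharding
shard setting
single node cluster

failover
Two-node cluster: the second node has joined the cluster, and three replica shards have been allocated to it, one for each primary shard. That means that we can lose either node, and all of our data will be intact.

scale horizontally
Three-node cluster: shards have been reallocated to spread the load.
Increasing the number_of_replicas to 2

Error handling
Master node failed:
a. Elect a new master.
b. The master node promotes the corresponding replica shards on Node 2 and Node 3 to primary shards.

Indexing process
1. The client sends the request to Node 1.
2. Node 1 routes on the document's _id and determines that the document belongs to shard 0. It forwards the request to Node 3, which holds the primary shard of shard 0.
3. Node 3 executes the request. If it succeeds, Node 3 forwards the request in parallel to the replica shards of shard 0. Once those succeed, Node 3 reports success to Node 1, and Node 1 reports success to the client.
By default, the primary shard requires a quorum, or majority, of shard copies to be available before even attempting a write operation. The allowed values for consistency are one (just the primary shard), all (the primary and all replicas), or the default quorum, or majority, of shard copies.

Retrieval process
1. The client sends a get request to Node 1.
2. The node uses the document's _id to determine that the document belongs to shard 0. Copies of shard 0 exist on all three nodes. On this occasion, it forwards the request to Node 2.
3. Node 2 returns the document to Node 1, which returns th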
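
To make the analyzer comparison and the inverted-index definition from the Lucene part more concrete, here is a minimal Python sketch. It is not Lucene code: the function names are hypothetical stand-ins that only imitate the described behaviour of WhitespaceAnalyzer, SimpleAnalyzer and StopAnalyzer, and STOP_WORDS is a tiny assumed list, not Lucene's real stop-word set.

```python
import re
from collections import defaultdict

# Assumption: a small illustrative stop-word list, not Lucene's actual one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def whitespace_tokens(text):
    """Split on whitespace only, with no normalization."""
    return text.split()


def simple_tokens(text):
    """Split at non-letter characters and lowercase each token."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def stop_tokens(text):
    """Same as simple_tokens, then drop stop words."""
    return [t for t in simple_tokens(text) if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in stop_tokens(text):
            index[term].add(doc_id)
    return index


sentence = "The quick brown fox jumped over the lazy dog"
docs = {1: sentence, 2: "A lazy cat", 3: "Quick brown dogs"}
index = build_inverted_index(docs)

print(whitespace_tokens(sentence))
print(simple_tokens("XY&Z Corporation - xyz@example.com"))
print(stop_tokens(sentence))
print(sorted(index["lazy"]))  # ids of the documents containing "lazy"
```

Only terms that survive analysis end up as keys of the index, which is why a search for "the" would find nothing here.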
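
The cache, flush and commit steps described for segmented indexing can be pictured with a toy model. ToyIndexWriter and all of its members are hypothetical names invented for this sketch. It imitates only the visibility rule (buffered and flushed documents stay invisible until a commit) and uses a document-count trigger for flushing, which the notes say is disabled by default in Lucene.

```python
class ToyIndexWriter:
    """Toy model of buffer -> flush -> commit. Not Lucene's implementation."""

    def __init__(self, max_buffered_docs=2):
        self.max_buffered_docs = max_buffered_docs  # assumed flush trigger
        self.buffer = []      # documents held in memory only
        self.flushed = []     # segments written out but not yet committed
        self.committed = []   # segments visible to readers

    def add_document(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.max_buffered_docs:
            self.flush()

    def flush(self):
        """Turn the in-memory buffer into a new (still invisible) segment."""
        if self.buffer:
            self.flushed.append(tuple(self.buffer))
            self.buffer = []

    def commit(self):
        """Flush, then publish every flushed segment to readers."""
        self.flush()
        self.committed.extend(self.flushed)
        self.flushed = []

    def visible_docs(self):
        """What a reader opened on the latest commit would see."""
        return [doc for segment in self.committed for doc in segment]


writer = ToyIndexWriter(max_buffered_docs=2)
for doc in ["a", "b", "c"]:
    writer.add_document(doc)

before_commit = writer.visible_docs()
writer.commit()
after_commit = writer.visible_docs()
print(before_commit, after_commit, len(writer.committed))
```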
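
The scoring part above combines TF-IDF weights with the cosine of the angle between V(d) and V(q). The sketch below works through that idea on three tiny documents. The smoothed idf formula is a common textbook variant chosen for illustration, not Lucene's exact scoring formula, and every name in the code is hypothetical.

```python
import math
from collections import Counter


def idf(term, doc_freq, num_docs):
    """Smoothed inverse document frequency (assumed textbook variant)."""
    return math.log((1 + num_docs) / (1 + doc_freq.get(term, 0))) + 1.0


def tf_idf_vector(tokens, doc_freq, num_docs):
    """Sparse vector mapping term -> tf * idf."""
    counts = Counter(tokens)
    return {t: c * idf(t, doc_freq, num_docs) for t, c in counts.items()}


def cosine(v1, v2):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


docs = {
    "d1": ["lucene", "index", "search"],
    "d2": ["lucene", "search", "search", "engine"],
    "d3": ["distributed", "storage"],
}
num_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))

query_vec = tf_idf_vector(["search", "engine"], doc_freq, num_docs)
scores = {
    doc_id: cosine(tf_idf_vector(tokens, doc_freq, num_docs), query_vec)
    for doc_id, tokens in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A document that shares no term with the query scores zero, and a document that repeats a query term and also contains the rarer term ranks highest.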
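
Step 2 of the indexing process (routing on the document's _id) and the consistency setting can be sketched as follows. Elasticsearch documents the routing rule as a hash of the routing value modulo the number of primary shards. Here zlib.crc32 is only a stand-in for the real hash function, so the shard numbers will not match a real cluster. The quorum arithmetic follows the usual majority formula, the function names are hypothetical, and any special cases in the real system are ignored.

```python
import zlib


def route_to_shard(doc_id, num_primary_shards):
    """Pick a shard as hash(_id) % number_of_primary_shards.

    Assumption: crc32 stands in for the hash Elasticsearch actually uses.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def quorum(num_replicas):
    """Majority of the shard copies: one primary plus its replicas."""
    return (1 + num_replicas) // 2 + 1


def write_allowed(active_copies, num_replicas, consistency="quorum"):
    """Check whether enough shard copies are available to attempt a write."""
    if consistency == "one":
        required = 1
    elif consistency == "all":
        required = 1 + num_replicas
    elif consistency == "quorum":
        required = quorum(num_replicas)
    else:
        raise ValueError("unknown consistency: " + consistency)
    return active_copies >= required


shard = route_to_shard("doc-42", 3)
print(shard, quorum(2), write_allowed(1, 2), write_allowed(2, 2))
```

Because the shard is derived from the _id and the number of primary shards, the same document always maps to the same shard as long as that number stays fixed.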