社区发现算法⼯工作简介!-机器学习算法班@sumnous_t2014.12.14主要内容!!•社区发现算法的发展、简介•我的社区发现算法相关⼯工作•⺴⽹网络的社交与信息连接度WebWebWebWebWeb1.0Web2.0Web3.0以信息为中⼼心SocialWeb-以⼈人为中⼼心SemanticWeb-让机器去理解⺴⽹网络上⼀一切数据、信息、内容的含义。•研究背景与研究意义•研究背景:复杂⺴⽹网络是复杂系统的抽象,现实中许多复杂系统都可以⽤用复杂⺴⽹网络的相关特性进⾏行描述和分析。图,⺴⽹网络中的节点表⽰示系统中的个体,边表⽰示个体之间的关系。如,社会关系⺴⽹网络,万维⺴⽹网,⾷食物链,基因⺴⽹网,城市交通⺴⽹网络,电⼒力⺴⽹网,电路⺴⽹网。对复杂⺴⽹网络的研究⼀一直是许多领域的研究热点,其中社区结构是复杂⺴⽹网络中得⼀一个普遍特征,整个⺴⽹网络是由许多个社区组成的。•研究意义:社区发现:更准确的定位社会群体,以便⽤用于⼲⼴广告的精准投放,推荐商品,推荐好友,等等。异构⺴⽹网络社区发现:如在学术⺴⽹网络中依据指定研究主题寻找权威作者、在商务⺴⽹网络中针对特定产品查找营销群体等,还可以在复杂的学术⺴⽹网络中划分出关系紧密的作者群体,在多种⼈人际关系中分析出志同道合的朋友圈等。⼩小世界⺴⽹网络SmallWorld•1998,邓肯·⽡瓦茨(DuncanWatts)和斯蒂⽂文·斯特罗加茨(StevenStrogatz),⽡瓦茨-斯特罗加茨模型(WS模型)•特征路径⻓长度短(两个节点的路径⻓长度的平均值)•⾼高集聚系数(⼀一个节点的集聚系数等于与它相连的节点中相互连接的点对数与总点对数的⽐比值)•六度分割理论,⼀一百五⼗十法则•凯⽂文⻉贝肯游戏(平均的“⻉贝肯数”是2.981,最⼤大的也仅仅是8)•Facebook六度分隔理论变为「四度」(4.74,7.21亿)⽆无标度⺴⽹网络Scale-free•⼀一个⺴⽹网络的度分布,是当随机地从⺴⽹网络中抽取⼀一个节点时,与这个节点相连的节点数(叫做这个节点的度)d的概率分布。•⽆无尺度⺴⽹网络的度分布满⾜足幂律分布,也就是说d=k的概率正⽐比于k的某个幂次(⼀一般是负的):•什么是社区?•Aprecisedefinitionofwhata“community”reallyisdoesnotexistyet.OneofthemostwidelyacceptedanduseddefinitionsisthatgivenbyNewmanandGirvan(2004):•Acommunityisasubgraphcontainingnodeswhicharemoredenselylinkedtoeachotherthantotherestofthegraphorequivalently,agraphhasacommunitystructureifthenumberoflinksintoanysubgraphishigherthanthenumberoflinksbetweenthosesubgraphs.•同⼀一社区内的节点与节点之间的连接很紧密,⽽而社区与社区之间的连接⽐比较稀疏。图⽚片来源于⺴⽹网络图⽚片来源于⺴⽹网络•0OldMethods!•1CliquePercolation!•2LocalExpansionandOptimization!•3DynamicalAlgorithm!•4LabelPropagationAlgorithm!•5Other0 Old Methods•Clustering(Nodesimilarity)•GraphCut(Mgroups,lessedgesintra-community)•Modularity-BasedMethod(isNP-hardtooptimize)[Newman,2006]-Greedy-SimulatedAnnealing•Girvan-NewmanAlgorithm(Betweenness,split)•SpectralMethod(在同⼀一个社区内的节点,它在拉普拉斯矩阵中的特征向量近似。将节点对应的矩阵特征向量(与特征值和特征向量有关的都叫谱)看成空间坐标,将⺴⽹网络节点映射到多维向量空间去,然后就可以运⽤用传统的聚类算法将它们聚集成社团。)EdgeBetweennessBetweennessforPartitioningBetweennessofNodesGirvan-NewmanPartitioningAlg.1 Clique Percola5on•CPM(CliquePercolationMethod)!•Identifyallthek-cliquesinnetwork.Cliqueisanintensivelystructure,eachtwonodesareconnectedinaclique.Connecttwonodesiftheirtwok-cliquestheybelongtosharek-1members.[Palla,2005,Nature]•TimeComplexity:polynomial(NP-completemaximalcliquesfinding)•CFinder•Failedinlargesocialnetworks2 Local Expansion and Op5miza5on -‐ LFM•LFM!•expandsacommunityfromarandomseednodetoformanaturalcommunityuntilfitnessfunctionislocallymaximal.[Lancichinetti,2009,NewJ.Phys.]•fitnessfunction:!•O(ncs2),wherencisthenumberofcommunities,sistheaveragesizeofcommunities,computationcomplexityisdependedonparameter\alpha.•worst-casecomplexity:O(n2)2 Local Expansion and Op5miza5on -‐ GCE•GCE(GreedyCliqueExpansion)!•takesallmaximumcliquesasinitialseedstogreedilyexpandthefitnessfunctiontofindoverlappingcommunities.[Lee,2010]•Greedyexpansioncomplexity:O(|E|M),Misthenumberofcliquestobeexpanded.•mergecomplexity:O(2(|C1|+|C2|)-1)(notsure)•最⼤大团问题(MaximumCliqueProblem,MCP)NP-complete2 Local Expansion and Op5miza5on -‐ EAGLE•EAGLE!•Allmaximalcliquesisasinitialcommunities,mergedbymaximumsimilarity-dendrogram.Theoptimalcutonthedendrogramisdeterminedbytheextendedmodularitywithaweightbasedonthenumberofoverlappingmemberships.[Shen,2008]•ExtendedModularity:!•whereOiisthenumberofcommunitiestowhichnodeibelongs.•O(n2+(h+n)s),wheresisthenumberofmaximalcliques,histhenumberofpairsofmaximalcliqueswhichareneighbors.2 Local Expansion and Op5miza5on -‐ OSLOM•OSLOM(OrderStatisticsLocalOptimizationMethod)optimizeslocallythestatisticalsignificanceinformationofaclusterwithrespecttorandomfluctuationwithExtremeandOrderStatistics.Itteststhestatisticalsignificanceofaclusterwithrespecttoaglobalnullmodel.Itcandealwithweighted,directededges,overlappingcommunities,hierarchiesanddynamiccommunities.[Lancichinetti,2011]•worst-casecomplexity:O(n2)!!3 Dynamical Algorithm•InfoMAP!•Themapequationframework!•randomwalk:optimalcompressingtheinformationonthestructureofthegraphbyoptimizingaqualityfunction,MinimumDescriptionLength.[Rosvall,2009] Label Propaga5on Algorithm•SLPA!•isageneralspeaker-listenerbasedinformationpropagationprocess.[Xie,2012]-setamemoryforeachnodetostorehistorylabels-eachneighborofselectednode(listener)randomlyselectsalabelwithprobabilityproportionaltotheoccurrencefrequencyofthislabelinitsmemoryandsendstheselectedlabeltothelistener-thelisteneraddsthemostpopularlabelreceivedtoitsmemory-usethresholdrtodeletelowerfrequencyseeinglabels,andoutputcommunities•O(t|E|),linear,tisthepredefinedmaximumnumberofiterations.•LPA-SLPA-LabelRank-LabelRankT[Xie,2013],linear,O(m)5 Other•MixtureModel:Content•Heterogeneousnetwork•Benchmark:LFRbenchmark•Real-worldSocialNetworks•EVALUATION(ground-truth):Q,NMI,Precision,RecallMywork•同构⺴⽹网络-重叠社区发现算法(PHSE)•异构⺴⽹网络:•改进的标签传递社区发现算法(iSLPA)•带有语义的社区发现算法e.g.基于社区发现的⽤用户⽂文本数据的话题模型•同构⺴⽹网络重叠社区发现算法•重叠社区发现算法,提出⼀一种基于适应度函数以及贪⼼心种⼦子扩展的并⾏行重叠社区发现算法。“AnImprovedParallelHybridSeedExpansion(PHSE)MethodforDetectingHighlyOverlappingCommunitiesinSocialNetworks,TingWang,XuQian