AssistantProfessorDSLeeFoundationFellowSchoolofInformationSystemsSingaporeManagementUniversityDec.11,2015FeidaZhuFoundingDirectorPinnacleLabforAnalyticsDBS-SMULabforLifeAnalyticsSingaporeManagementUniversity – Insight from scale • What can big data tell us that small data cannot? – Knowledge from enrichment • What important knowledge can we learn from enriching small data with big data? – Agility from real-‐?me responsiveness • What are the values of being real-‐?me? VOLUMEVARIETYVELOCITY企业内部数据通常只以交易纪录为基础 Transac?on-‐based 总量和覆盖有限 Limited coverage 只反映用户生活的局部和侧面 Fragmented par?al perspec?ve 静态,低频 Sta?c, low frequency 孤立单一的客户视图,只见人 Isolated view of individual user 外部社交媒体大数据能展现交易行为的上下文场景 Context-‐based 海量的社会级覆盖 Societal scale 提供用户的多角度全景式洞察 Mul?-‐facet insight 动态实时,高频 Dynamic, high frequency 能综合考虑丰富真实社交关系 Network-‐embedded user view 123123§ § § § § 我今天要联系50个潜在客户卖“少儿险”,告诉我打哪些电话? 为什么是这些人? 你今天最应该打这些人的电话! 因为这些人最关心孩子,而且最近有增长趋势,现在是最好时机! § § § § § § § !!#!$!%!&!!#$%&'()'$%*+',-,-(.'$/0,0*%.0-1.$+,2#%#*3,('*)4*5.3.%1'()*+,-.(*./0)11-23+401(34+,5'()*+,-.(*./0)+23+401(34+,5Figure1.MutualReachability.!!!#!!$!!%!!&!!'!!(!!!&!!!!!&!!!#$%'()*+&'+,#$$*-.(/$&./()01.2/#$3*,#4(/.%,5#&,##)Figure2.FriendshipRetainability.!#$%$&%!!'(!')!')(!')!')(!'))!'))(!'&!'&(!'&)!'&)(!'*!'*(!'*)!'*)(!'#!'#(!'#)!'#)(!'+!'+(!'+)!'+)($'!!#$%&'!Figure3.CommunityAffinity.nitymemberswhodonothavedirecttwo-wayfollowlinkswiththetargetuseryetdohavestrongconnec-tionswithotheroff-linecommunitymembers.Thefol-lowingexperimentfurtherillustratestheprinciple.Foreachofour65Twittertargetuseru,weexamineeachuservinu’sneighborhood,andcounthowmanyoff-linefriendsofuhavedirecttwo-wayfollowlinkswithv.WerankthembythecountandcomputeAUC(AreaUnderROCCurve)oftheranklistbasedontheground-truthoff-linefriendsofu.Figure3showsthatformostusers(52outof65),theAUCvalueisgreaterthan0.8.Thismeansoff-linefriendsindeedsharemoredirecttwo-wayfollowlinkswithotheroff-linefriends,exhibitingmuchstrongercommunityaffinitythanonlinefriends.Hereweusedirecttwo-wayfollowlinksasanindicationofgreaterconnectionstrength.Principle3.CommunityAffinity.Givenatar-getuseru,forauserv∈Nku,letS={w|w∈Cu!Nkv},thelargerthecardinalityofS,themorelikelywehavev∈CuwithrespecttoNku.ALGORITHMWithincorporatingthethreeprinciples,weproposeouralgorithmbasedontheideaofrandomwalkwithrestart(RWR).Itisdefinedin[6]withthefollowingequation.⃗ri=(1−c)˜W⃗ri+c⃗ei(1)Inourproblemsetting,giventheTwitternetworkG=(V,E),atargetuseru∈Vandanumberk,wefocusonG’ssubgraphGkuinducedbyNku{u},whichissim-plifiedasGuwhenkisfixed.AprobabilitytransitionmatrixWisdefinedforV(Gu)suchthat,fortwonodesv,w∈V(Gu),theentryW(v,w)denotestheprobabil-ityofvtransmittingtowatanystep.InaccordancewithPrinciple(II),wedefineW(v,w)asW(v,w)=#1|F1v→|ifw∈F1v→0ifw̸∈F1v→(2)InEquation1,˜WisthetransposeoftheprobabilitytransitionmatrixWasdefinedabove.⃗eiisthestartingindicatorvectorsuchthatei,i=1andei,j=0wherei̸=j.⃗riistheprobabilityvectorfornodeisuchthatri,jistheprobabilityoftransmittingtonodejfromi.cisrestartprobability.Ithasbeenshownthat⃗ricanbecomputediterativelyanditfinallyconverges[6].Whenitconverges,thesteady-stateprobabilityvector⃗rire-flectsthebandwidthofinformationfloworiginatedfromuseritouserjforeveryj∈V(Gu).Weusethissteady-stateprobabilitytodefinetheclosenessscoreci,jfortwousersiandj:ci,j=ri,j∗rj,i(3)TheclosenessscorethusdefinedsatisfiesPrinciple(I).Wenextexplorehowtotakeadvantageoftheoff-linecommunitytoidentifyotherunknownmembers,imple-mentingPrinciple(III).Theideaistodiscovertheoff-linecommunityiteratively,addingnewmembersintotheknownsetineachround.Forthatpurpose,weintroduceanauxiliarydummynode,ˆv,toprovideathresholdtocutthenewoff-linecommunitybound-aryforeachround.ˆvisconstructedasavirtualnodesuchthat(I).ˆvandthetargetuserufolloweachother,i.e.,ˆv∈F1u←!F1u→,(II).ˆvonlyassociateswithu,i.e.,foreachv∈(Nku\{u}),ˆv̸∈(F1v←F1v→),and(III).thenumberoffollowersofˆvissettobetheme-dianofthenumberoffollowersofallusersinu’sk-hopnetworkwiththehubusersexcluded,i.e.,|F1ˆv→|=medianv∈(Nku\H){|F1v→|}.Hubusers,denotedasH,re-fertothoseaccountswithmorethan2000followers,whichtypicallybelongtocelebrities,newsmedia,etc.Thedummynodeisdefinedinsuchawayastosetthelower-boundcaseforanoff-linefriend.Itsimulatesthescenarioinwhichthetargetuserufindsbychancethisrandomuserˆvwhohasnoconnectionswithu’soff-linecommunity.Findinghim/herinteresting,ufollowsˆv,whothenalsofollowsbacksomehow.Assuch,ˆvrepre-sentsaconnectiontoualmostasweakasanyoff-linereal-lifefriendshouldbe.Onahighlevel,thealgorithmworksiniterationsasfollows.Givenatargetuseru,computetheclosenessscorebetweenuandalltheotherusersaswellasˆv.Arankinglistofalltheuserstogetherwithˆvindecreas-ingorderoftheclosenessscoreisthusgenerated.AllProblem: Given a TwiNer follow network of a target user, iden?fy the user’s offline community by examining the follow linkage alone. Informa.on should be able to flow in both direc.ons within a small distance