Mining Users’ Interest from MicroBlogs
Yinglin Wang
Professor, Department of Computer Science & Technology
Shanghai University of Finance and Economics
08 August, 2014
2014 Sino-Finnish Summer School on Social-Media Data Analysis
Shanghai Univ. of Finance & Economics

1 Motivation
2 Representation
3 Topic model
4 Methods and experiments
5 Conclusions

Motivation
- Large amount of content
- Explosive growth
- Huge number of users
- For normal users: find useful information, products, collaborators
- For information providers: send information to the target users accurately
- Challenges: user interest modeling - the core of personalized services

Motivation (cont.)
- Recommendation systems: for selling products (including retailers, news agencies, movie websites), and for finding people who have the same interests (in social activities or academic research)
- Information search/retrieval: to help information providers know what information you really need
[Figure labels: Movie website, News agency, Retailers, Social networks]

How to analyze the user interests?
- What kind of data can we use?
- How can we represent user interests?
- What is an effective algorithm to calculate UIs?

What kind of data can we use?
The data that can be used for analyzing user interests:
- Historical activities: purchase records, search records
- Explicit input: a list of keywords given by users, ratings of documents, movies, music, etc.
- Implicit feedback: online browsing behavior of users, e.g., mouse movement, reading time, print, bookmark, copy & paste, scroll, visiting links on a page
Problems faced:
- Cold-start problem, when little or nothing is known about a new user
- Data is renewed slowly and does not fully reflect user interests
Information provided by third parties, such as MicroBlogs or Wikipedia, can help, by integrating the results obtained from multiple sources.

Social media data can be used:
- MicroBlogs are a new resource for obtaining users’ interests
- Real time: data is renewed quickly
- Huge volume of data, and easy to obtain
- Facebook had 500 million users in 2010 and 901 million users in April 2012
- Survey of GlobalWebIndex:
  - Tencent Qzone (腾讯QQ空间): 286 million users, 66% of China’s internet population
  - Sina Weibo (新浪微博): 264 million users
[Figure: The general view of microblog data]

How can we represent user interests?

Keywords and weights:
Keywords   (雕塑) Sculpture   (雕刻) Carving   (水彩) Watercolor   (素描) Sketch
Weights    0.58               0.78             0.45                0.48

1. VSM - Vector Space Model (Vector, Bag of words)
- Vector: Each dimension in the vector corresponds to a separate term. If a term reflects the user’s interest, its value in the vector is non-zero. The value can be boolean, indicating for instance that a user has visited the item or understood the concept, or it can be an integer/fraction value indicating the degree of interest in the concept.
- Bag of Words: Another similar, widely used approach is the keyword-based user model, which holds a bag of words representing (usually) user interests.

User Interests / User Model Representations
We need to reduce the dimension of words.
Keyword extraction methods: TFIDF
- TFIDF = TF * IDF
- TF: term frequency, f(w, d): the number of times that term w occurs in document d
- IDF: inverse document frequency, a measure of how much information the word w provides, that is, whether the term is common or rare across all documents. IDF can be obtained by dividing the total number of documents by the number of documents containing the term w, and then taking the logarithm of that quotient.

User Interests / User Model Representations (cont.)
- TFIDF intuition: If a word or phrase appears many times in one document and rarely appears in other articles, the word discriminates well between classes and is suitable for classification.
- Another keyword extraction method, TextRank: inspired by the PageRank method, it calculates a weight for each word.

2. Concept based model (ontology based model, network based model)
[Figure: An illustration of the user ontology]
[Figure: A partial domain ontology for the Italian soccer teams]
Excerpted from Xing Jiang and Ah-Hwee Tan, “Learning and Inferencing in User Ontology for Personalized Semantic Web Services” – 26, 2006, Edinburgh, UK.

User Interests / User Model Representations (cont.)
3. Topic model
The most common topic model is Latent Dirichlet Allocation (LDA): a type of statistical model for discovering the abstract topics that occur in a collection of documents.
- A topic is an abstract concept, which is characterized by a distribution over words: P(w1, w2, …, wn | t)
- A document is represented as a random mixture distribution over latent topics: P(t1, t2, …, tm | d)
Advantages:
• A low-dimensional representation of the document
• The semantic information hidden behind the words can be discovered, e.g., ‘computer’ and ‘microcomputer’, or ‘automobile’ and ‘car’, have the same meaning and hence belong to the same topic.

User Interests / User Model Representations (cont.)
An example to show the advantage of topic model
The following two sentences will be regarded as completely irrelevant under the VSM model, but with a topic model the relationship can be discovered.
Doc1: If it dated back to 2006, would Yun Ma and Zhiyuan Yang cooperate?
Doc2: Alibaba Group and Yahoo signed a share repurchase agreement.
- Topic 1: Yun Ma (马云), Alibaba (阿里巴巴), Taobao (淘宝), ……
- Topic 2: Zhiyuan Yang (杨致远), Yahoo (雅虎), Portal (门户网站), ……
- Topic 3: cooperate, agreement, contract, ….
Thus, doc1 and doc2 are closely related based on the topics!
Note: Yun Ma - the founder of Taobao; Zhiyuan Yang - the founder of Yahoo

User interest model
We can use topics to model users’ interests: P(t1, t2, …, tm | u)
- The probability distribution over topics for users
- The probability distribution over words for topics
[Figure: An example for the representation of user interests by topic model]

Mining interest from MicroBlogs:
Most of them use keyword representations, which differ only in the way they choose the k