Mining User Interests from Microblog Data

Mining Users' Interest from MicroBlogs
Yinglin Wang, Professor
Department of Computer Science & Technology
Shanghai University of Finance and Economics
08 August, 2014
2014 Sino-Finnish Summer School on Social-Media Data Analysis

Outline
1. Motivation
2. Representation
3. Topic model
4. Methods and experiments
5. Conclusions

Motivation
- Large amount of content, explosive growth, huge number of users
- Challenges:
  - For normal users: find useful information, products, collaborators
  - For information providers: send information to the target users accurately
- User interest modeling is the core of personalized services

Motivation (cont.)
- Recommendation systems: for selling products (including retailers, news agencies, movie websites), and for finding people who have the same interests (in social activities or academic research)
- Information search/retrieval: to help information providers know what information you really need
- Examples: movie websites, news agencies, retailers, social networks

How to analyze user interests?
- What kind of data can we use?
- How can we represent user interests?
- What is an effective algorithm to calculate user interests (UIs)?

The data that can be used for analyzing user interests:
- Historical activities: purchase records, search records
- Explicit input: a list of keywords given by users; ratings of documents, movies, music, etc.
- Implicit feedback: online browsing behavior of users, e.g., mouse movement, reading time, print, bookmark, copy & paste, scroll, visiting links on a page
Problems faced:
- Cold-start problem: little or nothing is known about a new user
- Data renews slowly and does not fully reflect user interests
Information provided by third parties, such as MicroBlogs or Wikipedia, can help, by integrating the results obtained from multiple sources.

What kind of data can we use?
Social media data can be used. MicroBlogs are:
- a new resource for obtaining users' interests
- real time; the data renews quickly
- huge in volume and easy to obtain
Facebook had 500 million users in 2010 and 901 million users in April 2012.
Survey of GlobalWebIndex:
- Tencent Qzone (腾讯QQ空间): 286 million users, 66% of China's internet population
- Sina Weibo (新浪微博): 264 million users
[Figure: the general view of microblog data]

How can we represent user interests?
Keywords and weights:

Keyword              Weight
Sculpture (雕塑)      0.58
Carving (雕刻)        0.78
Watercolor (水彩)     0.45
Sketch (素描)         0.48

User Interests / User Model Representations

1. VSM: Vector Space Model (vector, bag of words)
Vector: Each dimension in the vector corresponds to a separate term. If a term reflects the user's interest, its value in the vector is non-zero. The value can be boolean, indicating for instance that a user has visited the item or understood the concept, or it can be an integer or fractional value indicating the degree of concern about the concept.
Bag of words: Another similar, widely used approach is the keyword-based user model, which holds a bag of words that (usually) represents the user's interests.

We need to reduce the dimension of words.
Keyword extraction method: TFIDF
TFIDF = TF * IDF
- TF: term frequency, f(w, d): the number of times that term w occurs in document d
- IDF: inverse document frequency, a measure of how much information the word w provides, that is, whether the term is common or rare across all documents. IDF can be obtained by dividing the total number of documents by the number of documents containing the term w, and then taking the logarithm of that quotient.
TFIDF intuition: If a word or phrase appears many times in a document and rarely appears in other articles, the word has very good class-distinction ability and is suitable for classification.
Another keyword extraction method, TextRank: inspired by the PageRank method, it calculates a weight for each word.

2. Concept-based model (ontology-based model, network-based model)
[Figure: an illustration of the user ontology]
[Figure: a partial domain ontology for the Italian soccer teams]
Excerpted from Xing Jiang and Ah-Hwee Tan, "Learning and Inferencing in User Ontology for Personalized Semantic Web Services", –26, 2006, Edinburgh, UK.

3. Topic model
The most common topic model is Latent Dirichlet Allocation (LDA): a type of statistical model for discovering the abstract topics that occur in a collection of documents.
- A topic is an abstract concept, which is characterized by a distribution over words: P(w1, w2, …, wn | t)
- A document is represented as a random mixture distribution over latent topics: P(t1, t2, …, tm | d)
Advantages:
• A low-dimensional representation of the document
• The semantic information hidden behind the words can be discovered, e.g., 'computer' and 'microcomputer', or 'automobile' and 'car', have the same meaning and hence belong to the same topic.

An example to show the advantage of the topic model
The following two sentences will be regarded as completely irrelevant under the VSM model, but with a topic model the relationship can be discovered.
Doc1: If we went back to 2006, would Yun Ma and Zhiyuan Yang cooperate?
Doc2: Alibaba Group and Yahoo signed a share repurchase agreement.
Topic 1: Yun Ma (马云), Alibaba (阿里巴巴), Taobao (淘宝), ……
Topic 2: Zhiyuan Yang (杨致远), Yahoo (雅虎), Portal (门户网站), ……
Topic 3: cooperate, agreement, contract, ….
Thus, doc1 and doc2 are closely related based on the topics!
Note: Yun Ma is the founder of Taobao; Zhiyuan Yang is the founder of Yahoo.

User interest model
We can use topics to model users' interests:
- The probability distribution over topics for users: P(t1, t2, …, tm | u)
- The probability distribution over words for topics
[Figure: an example of the representation of user interests by a topic model]

Mining interest from MicroBlogs: Most existing methods use a keyword representation and differ only in the way they choose the k
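
The TFIDF weighting described in the representation part (term count in a document multiplied by the logarithm of total documents over documents containing the term) can be sketched as follows. This is a minimal illustration, not the implementation used in the talk: the function name `tf_idf`, the natural logarithm, and the toy `posts` are all assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # f(w, d): raw count of term w in document d
        # IDF = log(total documents / documents containing w)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

# Hypothetical, already-tokenized toy posts (not data from the talk).
posts = [
    ["sculpture", "carving", "sculpture", "museum"],
    ["watercolor", "sketch", "museum"],
    ["football", "match", "museum"],
]
weights = tf_idf(posts)
top = sorted(weights[0].items(), key=lambda kv: kv[1], reverse=True)
print(top)
```

In this toy corpus "museum" occurs in every post, so its IDF is log(1) = 0 and it drops out as a keyword, while "sculpture", which is frequent in the first post and absent elsewhere, gets the highest weight there.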
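
The Doc1/Doc2 example above can be made concrete with a small comparison of cosine similarity in word space and in topic space. This is only a sketch under strong assumptions: each keyword is hard-assigned to one of the three topics listed on the slide through the hypothetical table `TOPIC_OF`, whereas a real LDA model would infer soft topic distributions from data. The keyword lists for the two documents are also hand-picked for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Assumed hard keyword-to-topic assignment, taken from the slide's topic lists.
TOPIC_OF = {
    "Yun Ma": "topic1", "Alibaba": "topic1", "Taobao": "topic1",
    "Zhiyuan Yang": "topic2", "Yahoo": "topic2", "Portal": "topic2",
    "cooperate": "topic3", "agreement": "topic3", "contract": "topic3",
}

# Hand-picked keywords of the two example sentences.
doc1 = ["Yun Ma", "Zhiyuan Yang", "cooperate"]
doc2 = ["Alibaba", "Yahoo", "agreement"]

# Word space (VSM): the documents share no term.
word_sim = cosine(Counter(doc1), Counter(doc2))

# Topic space: replace every keyword by its topic before counting.
topic_sim = cosine(
    Counter(TOPIC_OF[w] for w in doc1),
    Counter(TOPIC_OF[w] for w in doc2),
)
print(word_sim, topic_sim)
```

Because the two documents have no keyword in common, the word-space similarity is 0, while in topic space both documents cover the same three topics and the similarity is 1 up to floating-point rounding. The same topic-count vector, normalized to sum to 1, is one simple way to picture a distribution over topics such as P(t1, t2, …, tm | u) for a user.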
