Automatic Categorization of Web documents Based on



Automatic Categorization of Web documents Based on Application Ontology

A Thesis Proposal
Presented to the Department of Computer Science
Brigham Young University

In Partial Fulfillment of the Requirements for the Degree Master of Science

by Linus W. Kwong
April 7, 1999

I Introduction

Web users rely on World Wide Web search engines, such as Yahoo! and AltaVista, to retrieve Web documents of interest. Whether a search engine provides categories for a user to click on or a query facility for a user to type in keywords, the Web documents retrieved still suffer from poor precision (i.e., too many irrelevant documents are retrieved) and poor recall (i.e., too many relevant documents are omitted) [CGRU97]. Alternative approaches to the categorization of Web documents therefore become necessary.

This thesis proposes the automatic categorization of Web documents with respect to an application ontology¹. An application ontology [ECLS98] has two parts: (i) an ontological model instance (derived from Conceptual Model), which consists of sets of objects, relationships among the objects, and constraints over the objects; and (ii) data frames. A data frame represents each object set in (i) in the form of possible contextual keywords and constants.

This thesis restricts itself to application ontologies which are (i) data rich, i.e., contain a number of identifiable constants, such as dates, names, and account numbers; and (ii) narrow in ontological breadth, i.e., have a relatively small ontology [ECJ+98]. In this thesis, we focus on four application ontologies that satisfy these restrictions, namely car advertisements, job advertisements, obituaries, and university course descriptions. We focus on retrieving Web documents with multiple records, i.e., documents in which each record contains a group of information relevant to a domain of interest [YJ98].

¹ Since an application ontology defines a domain of interest and an (IR) category is a domain of interest, application ontologies and categories will be used interchangeably in this thesis proposal.

In this thesis, we determine the relevance of a Web document with multiple records to a particular application ontology by using two mathematical vector-space based IR models: the Vector Space Model (VSM) and the Clustering Model (CM). An assumption [Salton88] applied to these two IR models is that there exists a set of n different terms which represent a category and a document in the category.

• VSM [Salton88]. The VSM interprets each of the n terms in the category as an axis of an n-dimensional vector space. The VSM represents k Web documents as k n-dimensional vectors and the category as an n-dimensional category vector in the n-dimensional vector space. The coefficients of each of the k n-dimensional vectors (the category vector, respectively) are the frequencies of the n terms in the corresponding Web document (the category, respectively).

• CM [SM83]. Like VSM, CM also interprets each of the n terms in the category as an axis of an n-dimensional vector space and represents k Web documents as k n-dimensional vectors in the n-dimensional vector space. However, CM differs from VSM in that CM creates clusters, i.e., sets of Web documents, based on the “similarity” among their corresponding n-dimensional vectors. CM represents each cluster C as an n-dimensional vector, whose coefficients are the average frequencies of the n terms which are found in each of the Web documents in C. This vector is called the term-centroid vector. When the term-centroid vector and the category vector point in the same or nearly the same direction, the Web documents in C (represented by the term-centroid vector) are relevant to the ontology (represented by the category vector).

Similar work on automatically categorizing Web documents has appeared. [ITN96, CGRU97] both submit a query to a search engine and collect the set of documents which the search engine returns. They also define a category by giving a set of n different terms. For each returned document, they determine the frequencies with which these terms appear in the document. The probability that [ITN96, CGRU97] classify a document as belonging to a category depends on these frequencies. Our automatic process is similar, but differs in two ways:

(1) The creation of the search engine query. [ITN96, CGRU97] require a user to create the search engine query manually. Our categorization program, however, automatically extracts a set of object set names from the ontological model instance in an application ontology and forms a query which is the logical OR of all the object set names.

(2) The creation of terms that describe a category. [CGRU97] define a category by manually extracting terms from a pre-classified set of Web pages from search engines, such as Yahoo! and Infoseek. [ITN96] define a category by using existing information science terminology (subject dictionary), and some terms that describe the category using thesauri. Our approach uses an object-oriented ontology to define a category. We use keywords and keyword-associated constants (which are automatically extracted from the data frames in the ontology) to describe the category.

We conducted an experiment on twenty pre-classified obituary Web documents retrieved from Yahoo!. The experimental results show 90% recall and 97% precision, and our approach enhances the precision of existing search engines in retrieving Web documents. An increase in precision has two consequences: (a) it saves users’ time when browsing retrieved Web documents; (b) it means retrieved Web documents can be used as input to the data extraction tools proposed in [RL94, KSa97, Sod97, HGMC+97, ECLS98].

II Thesis Statement

This thesis proposes the automatic categorization of Web documents with respect to an application ontology. The approach uses application ontologies, the Vector Space IR Model, and the Cluste