Automatic Categorization of Web Documents Based on Application Ontology

A Thesis Proposal
Presented to the
Department of Computer Science
Brigham Young University

In Partial Fulfillment
of the Requirements for the Degree
Master of Science

by
Linus W. Kwong
April 7, 1999

I Introduction

Web users rely on World Wide Web search engines, such as Yahoo! and AltaVista, to retrieve Web documents of interest. Whether a search engine provides categories for a user to click on or a query facility for a user to type in keywords, the Web documents retrieved still suffer from poor precision (i.e., too many irrelevant documents are retrieved) and poor recall (i.e., too many relevant documents are omitted) [CGRU97]. Alternative approaches to the categorization of Web documents therefore become necessary.

This thesis proposes the automatic categorization of Web documents with respect to an application ontology¹. An application ontology [ECLS98] has two parts: (i) an ontological model instance (derived from a conceptual model), which consists of sets of objects, relationships among the objects, and constraints over the objects; and (ii) data frames. A data frame represents each object set in (i) in the form of possible contextual keywords and constants.

This thesis restricts itself to application ontologies which are (i) data rich, i.e., contain a number of identifiable constants, such as dates, names, and account numbers; and (ii) narrow in ontological breadth, i.e., have a relatively small ontology [ECJ+98]. In this thesis, we focus on four application ontologies that satisfy these restrictions, namely car advertisements, job advertisements, obituaries, and university course descriptions. We focus on retrieving Web documents with multiple records, i.e., each of these records should contain a group of information relevant to a domain of interest [YJ98].

¹ Since an application ontology defines a domain of interest and an (IR) category is a domain of interest, the terms "application ontology" and "category" are used interchangeably in this thesis proposal.

In this thesis, we determine the relevance of a Web document with multiple records to a particular application ontology by using two mathematical vector-space based IR models: the Vector Space Model (VSM) and the Clustering Model (CM). An assumption [Salton88] applied to these two IR models is that there exists a set of n different terms which represent a category and a document in the category.

• VSM [Salton88]. The VSM interprets each of the n terms in the category as an axis of an n-dimensional vector space. The VSM represents k Web documents as k n-dimensional vectors and the category as an n-dimensional category vector in the n-dimensional vector space. The coefficients of each of the k n-dimensional vectors (the category vector, respectively) are the frequencies of the n terms in the corresponding Web document (the category, respectively).

• CM [SM83]. Like VSM, CM also interprets each of the n terms in the category as an axis of an n-dimensional vector space and represents k Web documents as k n-dimensional vectors in the n-dimensional vector space. However, CM differs from VSM in that CM creates clusters, i.e., sets of Web documents, based on the "similarity" among their corresponding n-dimensional vectors. CM represents each cluster C as an n-dimensional vector, whose coefficients are the average frequencies of the n terms which are found in each of the Web documents in C. This vector is called the term-centroid vector. When the term-centroid vector and the category vector point in the same or nearly the same direction, the Web documents in C (represented by the term-centroid vector) are relevant to the ontology (represented by the category vector).

Similar work on automatically categorizing Web documents has appeared. [ITN96, CGRU97] both submit a query to a search engine and collect the set of documents which the search engine returns. They also define a category by giving a set of n different terms. For each returned document, they determine the frequencies with which these terms appear in the document. The probability that [ITN96, CGRU97] classify a document as belonging to a category depends on these frequencies. Our automatic process is similar, but differs in two ways:

(1) The creation of the search engine query. In [ITN96, CGRU97], a user must manually create the search engine query. Our categorization program, however, automatically extracts a set of object set names from the ontological model instance in an application ontology. It forms the query as the logical OR of all the object set names.

(2) The creation of terms that describe a category. [CGRU97] define a category by manually extracting terms from a pre-classified set of Web pages from search engines, such as Yahoo! and Infoseek. [ITN96] define a category by using existing information science terminology (a subject dictionary) and some terms that describe the category using thesauri. Our approach uses an object-oriented ontology to define a category. We use keywords and keyword-associated constants (which are automatically extracted from the data frames in the ontology) to describe the category.

We conducted an experiment on twenty pre-classified obituary Web documents retrieved from Yahoo!. The experimental results show 90% recall and 97% precision, and our approach enhances the precision of existing search engines in retrieving Web documents. An increase in precision has two consequences: (a) it saves users' time when browsing retrieved Web documents; (b) it means retrieved Web documents can be used as input to the data extraction tools proposed in [RL94, KSa97, Sod97, HGMC+97, ECLS98].

II Thesis Statement

This thesis proposes the automatic categorization of Web documents with respect to an application ontology. The approach uses application ontologies, the Vector Space IR Model, and the Cluste