Searching the Web by Constrained Spreading Activation

Fabio Crestani
Puay Leng Lee

Department of Computing Science
University of Glasgow
Glasgow G12 8QQ, Scotland
Tel. +44-(0)141-330 6292
Fax. +44-(0)141-330 4913
Email: {fabio,leepl}@dcs.gla.ac.uk

Running Title: Searching the Web by Constrained Spreading Activation.

Keywords (from ACM Computing Classification System): hypertext/hypermedia, information search and retrieval, spreading activation, query formulation, intelligent agents.

Abstract

Intelligent Information Retrieval is concerned with the application of intelligent techniques, such as semantic networks, neural networks and inference nets, to Information Retrieval. The field of research has seen a number of applications of Constrained Spreading Activation (CSA) techniques on domain knowledge networks. However, there has never been any application of these techniques to the World Wide Web. The Web is a very important information resource, but users find that looking for a relevant piece of information in the Web can be like "looking for a needle in a haystack". We were therefore motivated to design and develop a prototype system, WebSCSA (Web Search by CSA), that applies a CSA technique to retrieve information from the Web using an ostensive approach to querying similar to query-by-example. In this paper we describe the system and its underlying model. Furthermore, we report on an experiment carried out with human subjects to evaluate the effectiveness of WebSCSA. We tested whether WebSCSA improves retrieval of relevant information on top of Web search engine results and how well WebSCSA serves as an agent browser for the user. The results of the experiments are promising, and show that there is much potential for further research on the use of CSA techniques to search the Web.

1 Introduction

This paper is concerned with the application of Constrained Spreading Activation (CSA) techniques for retrieving information from the World Wide Web (hereafter referred to as the Web). The Web presents a formidable store of information. It is an interconnected system of over 7 million sites and their pages (in December 1998) accessible through browsers like Mosaic, Netscape Navigator or Microsoft's Internet Explorer. Although the Web is one of the newer additions to the Internet, it has gained popularity very quickly, becoming the second most frequently used feature of the Internet, the most widely used one being electronic mail (Berners-Lee et al., 1992).

The information stored in the Web differs from the information traditionally dealt with by Information Retrieval (IR) systems in several aspects.

• Information organization. The Web is not organized, in the sense that associated or similar documents are not placed in close physical proximity like the collections in a physical library or stored in some archive. Internet directories, like Yahoo!, help organize links to similar documents to ease the retrieval problem, but the categorization process is often done manually and this is expensive and time-consuming. Since the Web is a hypertext/hypermedia system and we do not possess the resources which Internet directories do, the natural way of reaching similar documents from given documents would be to traverse the links on the latter. A retrieval tool for the Web should exploit the links in the Web documents (i.e. Web pages) in its search for documents relevant to a user request.
• Information range. Some conventional IR systems contain specialized information, such as medical documentation or patents. Hence, IR systems can sometimes exploit domain knowledge to enhance retrieval performance. In contrast, the subject range of information on the Web is very wide. Any retrieval program built for the Web must be flexible enough to retrieve information on a wide range of subjects and written in different natural languages. Retrieval models that exploit associations between documents are appropriate for retrieval on the Web because these models do not dictate the topical range of relevant information provided at the beginning of the search. They simply search for similar information regardless of the topic of the query (Ellis, 1996).

• Change of content. The Web is a very dynamic information collection. Every second, changes are being made to existing Web pages, and pages are added to or deleted from the Web. Conventional IR systems are less dynamic and there is much more control over the changes made to the document collection. A retrieval system for the Web should be able to retrieve documents that are up-to-date and should not rely (at least not completely) on indexes that could become outdated very quickly.

In this paper we present a prototype Web search system that exploits the above distinctions between documents usually managed by IR systems and those managed by the Web. The underlying IR model of this prototype is a variation of the model known as Associative Retrieval. Associative Retrieval was first introduced by Salton (1968) and is concerned with exploiting associations between information items at retrieval time. Associations are first determined using citations or statistical techniques (such as term co-occurrence) and then used by complex retrieval functions. In the work presented in this paper we do not use citations or statistical associations, but we use the existing associations represented by hypertext links between Web documents. What we consider important in Associative Retrieval is the idea behind this form of retrieval, i.e. that it is possible to retrieve relevant documents by retrieving those that are explicitly associated with some that the user knows to be relevant.

The work presented in this paper integrates Associative Retrieval with Ostensive Retrieval. This novel approach to IR was proposed by Campbell and Van Rijsbergen (1996) and is concerned