PIDALION: Implementation issues of a Java-based Multimedia Search Engine over the web

Dimitris E. Charilas, Ourania I. Markaki
National Technical University of Athens, Department of Electrical and Computer Engineering

Keywords: multimedia content, queries, content-based retrieval, multimedia crawler, metadata, image histogram, hierarchical presentation

Abstract - Fuelled by the rapid expansion of broadband connectivity and increasing interest in online multimedia-rich applications, the growth of digital multimedia content has skyrocketed. Among other effects, this growth is compounding the need for more effective methods for searching multimedia information. The automated web search engines currently in use rely only on text descriptions and, as a result, provide matches of poor quality in the case of multimedia content. The services of a multimedia search engine are therefore a possibility that internet users still lack. Thus, the scope of this paper is to present an implementation approach for a personalized web-based multimedia search engine in the Java programming language. This approach combines the characteristics of current search engines with new, innovative features that aim at both quick system response and better search results. In this paper the reader can find an analytical presentation of all the components required to form a multimedia search engine, as well as indications on how to implement key algorithms and functions.

1. INTRODUCTION

The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, and so is the number of new users inexperienced in the art of web research. It is estimated that 1-2 Exa-Bytes (millions of Tera-Bytes) of new information are created each year over the Web. This huge amount of information is anticipated to grow by a factor of 10 in the following two years. Automated search engines that rely on keyword matching usually return too many low-quality matches. The situation is worse as far as multimedia content is concerned. The most popular search engine, Google [1], relies only on keywords to search for images and does not contain any information on semantic content.

Content-based image retrieval (CBIR) systems try to solve this problem. Many CBIR systems have recently been proposed and implemented in the literature. Examples include the QBIC system [2], where colour information is exploited; the PicToSeek system [3], which combines colour and shape invariant features to perform image retrieval; and Virage [4], which allows users to manually regulate the importance of the extracted descriptors according to their own perception. Fuzzy organization of the descriptors is proposed in [5] for increasing the retrieval precision at a certain recall value, while 3D searching is discussed in [6]. Applications of content-based retrieval systems are examined in [7], while in [8] a system regarding music access is proposed. Personalized retrieval is examined in the work presented in [9]. Last but not least, Marvel, the latest and more intelligent content-based search engine, developed by the IBM research centre, USA, in 2004 [10], tries to increase retrieval precision by incorporating semantic annotation in the media volumes. However, all the adopted approaches have only static and local access to the system’s database and thus cannot retrieve content from the web [11]. Furthermore, the aforementioned works focus on the algorithms for efficient content-based retrieval and not on the practical issues regarding the implementation of a large-scale multimedia search engine over the Web.

So far, several different techniques for making distributed multimedia content searchable have been proposed. In [12] there is information on the techniques of checking the outgoing links, analyzing the referring page, mining for textual information in the media file, and utilizing metadata using the Dublin Core metadata model or the MPEG-7 standard.

This paper focuses on describing a multimedia search engine that combines features from existing search engines and enhances their functionalities through innovative algorithms and mechanisms. Our goal is not only to describe the system’s architecture and interconnectivity, but also to explain how the algorithms can be implemented in Java code. The proposed system, named PIDALION, runs in a Windows environment, while the Java Server Pages (JSP) and Java Servlets technologies are adopted to ensure the system’s interoperability and dynamic behaviour. The system’s database runs on SQL Server 2000.

One of the key features of the proposed search engine is the provision of fully personalized retrieval services: users of PIDALION may share their personal content either with all web users or within the frame of groups, as well as maintain a personal profile, where their preferences are stored. Personalized retrieval can be achieved through the creation of social groups and the use of dynamic relevance feedback mechanisms, which tailor the system’s performance to the current user’s preferences.

This paper is organized as follows: Section 2 presents the system’s architecture, explaining briefly the role of each main component. Sections 3 to 7 present the functionality, architecture and key features and innovations of each component. Key algorithms are depicted in the form of pseudo-code. Finally, in Section 8 the issues covered in this paper are summarized and future expansions are proposed.

2. SYSTEM OVERVIEW

The platform described in this paper consists of the following subsystems:
• The multimedia crawling subsystem, whose role is to index multimedia content and handle the updating of the indexing process
• The multimedia metadata subsystem, which extracts metadata from multimedia content, according to the MPEG-7 descriptors achieving in this wa