10 1965 International Conference on Computational

整理文档很辛苦,赏杯茶钱您下走!

免费阅读已结束,点击下载阅读编辑剩下 ...

阅读已结束,您可以下载文档离线阅读编辑

资源描述

101965InternationalConferenceonComputationalLinguisticsNEASURENENTOFSI~IILARITYI~ETWI!ENNOUNSKennethE.llarperTileIOiNDCorporation1700MainStreetSantablonica,California9041)6AJ;STt',A(?TAstudywasr~adeoftiledegreeofsimilaritybetweenpairsofRussiannouns,asexpressedbytheirtendencytooccurinsentenceswithidentical~,,ordsinidenticalsyntacticrelationships.Asimilaritymatrixwaspreparedforfortynouns;foreachpairofnounsthenumberofshared(i)adjectivedependents,(ii)noundependents,and(iii)noungovernorswasautomaticallyretrievedfrommachine-processedtext.Thesimilaritycoefficientforeachpair~;asdeterminedastheratioofthetotalofsuchshared~'ordstotheproductofthefrequenciesofthetwonounsinthetext.The78~pairswererankedaccordingtothiscoefficient.Thetextcomprised12(1,~00runningwordsofphysicstextprocessedatTheRANDCorporation;thefrequenciesofoccurrenceofthefortynounsinthistextrangedfrom42to328.Theresultssuggestthatthesampleoftextisofsufficientsizetobeusefulfortheintendedpurpose.Manynounpairswithsimilarproperties(synonymy,antonym),,derivationfromdistributionallysimilarverbs,etc.)arecharacterizedbyhighsimilaritycoefficients;theconverseisnotobserved.Therelevanceofvarioussyntacticrela-tionshipsascriteriaformeas~rementisdiscussed.[larper1MEASURENIiNTOFSIMILARITYBETWEENNOUNSI.INTRODUCTIONOneofthegoalsofstudiesinDistributionalSemanticsistheestablishmentofwordclassesonthebasisoftheobservedbehaviorofwordsinwrittentexts.Aconvenientandsignificantwayofdiscussingbehaviorofwordsisintermsofsyntacticrelationship.Attheoutset,infact,itisnecessarythatwetreatawordintermsofitsSyntacticallyRelatedWords(SRW).Inagiventext,eachwordbearsagivensyntacticrelationshiptoafinitenum-berofotherwords;e.g.,afinitenumberofwords(nounsandpronouns)appearassubjectforeachactiveverb;anothergroupofnounsandpronounsareusedasdirectobjectofeachtransitiveverb;otherwordsoftheclass,adverb,appearasmodifiersofagivenverb.IneachinstancewemayspeakoftherelatedwordsasSRWofagivenverb,sothatinourexamplethreedifferent~ofSRWemerge;agivenSRWisthendefinedintermsbothofwordclassandspecificrelationshiptotheverb.(AgivennounmayofcoursebelongtotwodifferenttypesofSRW,e.g.,asbothsubjectandobjectofthesameverb.)Distributionally,wemaycomparetwoverbsintermsoftheirSRN.TheobjectiveofthepresentstudyistotestthepremisethatsimilarwordstendtohavethesameSRW.Thispremiseistested,notwithverbs,asinthel,arperaboveexample,butwithnouns.Ourprocedureis(i)tofindinagiventextthreetypesofSRWforasmallgroupofnouns,(2)tofindthenumberofSill;Tsharedbyeachpairofnounsformedfromthegroup,and(3)toexpressthesimilaritybetweenindividualnouns)andgroupsofnouns,asafunctionoftheirsharedSRI~.Anotherexample:itmightturnoutthatinagiventextthenounsaandb(avocadoandcherry)sharesuchadjectivemodifiersasripe,whereasnounsc)'andd(chairandfurniture)haveincommontheadjectivemodifiermodern.Thesefactswouldleadustoconcludethataandbaresimi-lar,thatcanddaresimilar)thataandcarelesssimilar,etc.Anumberofquestionsarise:Whatissimilarityanyway?DowordsthataresimilarinmeaningreallyshareasignificantnumberofSRWinagiventext?Whatisasignificantnumber?DonotdissimilarwordsalsohavemanycommonSRW?flowmuchtextisnecessaryinordertoestab-lishpatternsofwordbehavior?Whatistheeffectofmultiple=meaninginwords,andofusing,textsfromdiffer=entsubjectareas?Thepresentinvestigationshouldberegardedasanexperimentdesignedtothrowsomelightonthesequestions;novalidityisclaimedfortheresultsobtained.Ouraudacityinattemptingtheexperimentatallisbasedonthreefactors:thepossessionofatextinalimitedfield(physics),theforeknowledgethatthemultiple=llarper3meaningprobler:lismininlal,andthecapabilityforautomaticprocessingoftext.(Thelatterisclearlyanecessity,inviewo£thesizeandcomplexityoftheproblem.)Thereadermaywellconcludethattheexperimentprovesnothing.Wewouldhope,however,thatsuchanopinionwouldnotprecludeacriticaljudgmentoftheproceduresemployed,orthesuspensionofdisbeliefiftheresultsdonotcorrespondwithhisexpectations.2.PROCIiDIIRI']TilepresentstudywasbasedonaseriesofarticlesfromRussianphysicsjournals,comprisingapproximately120)000runningwords(some500pages).Theprocessinp,ofthiste.xthasbeendescribedelsewhere,(1'2)ltere,wenoteonlythateachsentenceofthistextisrecordedonmagnetictape,togetherwiththefollowinginformationforeachoccurrenceinthesentence:itspartofspeech,itswordnumber(anidentificationnumberinthemachineglossary},anditssyntacticgovernorordependent(i£any)inthesentence.AretrievalprogramappliedtothistexttapethenyieldedinformationabouttheSRI'iforwordsinwhichwewereinterested.Forconvenienceandeconomy,allwordsinthemachineprintoutforthisstudyareidentifiedbywordnumber,ratherthanintheirnatural-language£orv).Inourstudywechosetodealwit]~theSRI~offortyRussiannouns,hereincalledTest~ords{TW).Thenumberltarper4iscompletelyarbitrary;tileparticularnounschosen(seeTable1)a'erepresumedtoformdifferentsemanticgroupings.Table1givesonepossiblegroupingofthesewords;thecriteriaforgroupingaremoreorlessobvious,althoughthereadermayeasilyformdifferentgroups,byexpandingorcontractingthegroupsthatwehavedesignated.Theonlypurposeofgroupingistoprovideaweakmeasureofc

1 / 23
下载文档,编辑使用

©2015-2020 m.777doc.com 三七文档.

备案号:鲁ICP备2024069028号-1 客服联系 QQ:2149211541

×
保存成功