

Cluster-Based Pattern Recognition in Natural Language Text

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science

by Shmuel Brody

under the supervision of Prof. Naftali Tishby

August 2005

Acknowledgements

I would like to thank my adviser Prof. Tishby for his guidance and assistance in producing this work, and for his suggestions and positive input. I would also like to thank Beata Beigman Klebanov for her constant help and advice throughout this work, including (but definitely not limited to) the contribution of the parsed data used here. Also deserving of thanks are my family, for their support, and especially my grandmother, Rose Brody, for her confidence in my achievements.

Abstract

This work presents the Clustered Clause structure, which uses information-based clustering and dependencies between sentence components to provide a simplified and generalized model of a grammatical clause. We show that this representation, which is based on dependencies within the sentence, enables us to detect complex textual relations at a higher level of context. The relations we detect are of interest in themselves, as linguistic phenomena, and are also highly suited for use in certain linguistic and cognitive tasks. We define and search for several types of patterns, moving from basic patterns to more complex ones, from patterns within the sentence to those involving entire sentences. Examples of recognized patterns of each type are presented, and also descriptions of several interesting phenomena detected by our method. We assess the quality of the results, and demonstrate the importance of the clustering and dependency model we chose. The principles behind our method are largely domain-independent, and can therefore be applied to other forms of structured sequential data as well.

Table of Contents

1 Introduction
  1.1 The Problem
  1.2 Overview
  1.3 Related Previous Work
    1.3.1 Syntax and Distributional Information as Measures of Semantics
    1.3.2 Relations from Patterns and Templates
    1.3.3 Feature Sets and Similarity Measures
    1.3.4 Uses of Similarity and Relatedness Measures
    1.3.5 Semantic Databases
    1.3.6 Relationships Involving a Higher Level of Context
    1.3.7 Patterns Containing Cluster Units
    1.3.8 Novel Aspects of this Work
  1.4 Importance and Motivation
    1.4.1 Cognition & World Knowledge Acquisition
    1.4.2 Automated Rule Acquisition
    1.4.3 Query Enhancement
    1.4.4 Implication & Entailment
    1.4.5 Anaphora Resolution
2 The Work Setup
  2.1 The Clause Model
    2.1.1 MINIPAR's Sentence Structure
    2.1.2 The Simplified Clause Structure
  2.2 The Clustering
    2.2.1 Clustering Methods
  2.3 The Information Bottleneck Concept
    2.3.1 The Sequential IB Clustering Method
    2.3.2 The Variables and Use of the Clause Model
  2.4 The Clustered-Clause Representation
  2.5 Pattern Definition
  2.6 Evaluation Method
3 The Procedure
  3.1 The Data
  3.2 Preprocessing
  3.3 Clustering
  3.4 Simple Pattern Detection
  3.5 Complex Pattern Detection
  3.6 Reducing Generalized Patterns to Specific Ones
  3.7 Significance Calculation
4 Results
  4.1 Clustering Results
    4.1.1 The Clusters
    4.1.2 Evaluating the Quality of the Clustering
    4.1.3 The Clustered Clauses
  4.2 Intra-Clause Patterns
  4.3 Inter-Clause Patterns
    4.3.1 Patterns within Three Clauses (t=3)
    4.3.2 Longer-Range Patterns (t=6, t=9)
  4.4 Complex Patterns
  4.5 The Influence of Clustering
5 Discussion
  5.1 Conclusions
  5.2 Other Areas of Application
  5.3 Possible Extensions and Improvements
    5.3.1 Re-inserting Removed Words
    5.3.2 Different Data Set
    5.3.3 Different Evaluation Method
    5.3.4 Richer Sentence Model
6 Appendix A - Clustering Results
  6.1 Subject Clusters
  6.2 Verb Clusters
  6.3 Object Clusters
7 Appendix B - Intra-Clause Patterns
  7.1 Subject-Verb Patterns
  7.2 Verb-Object Patterns
  7.3 Subject-Object Patterns
  7.4 Word-Word Patterns
    7.4.1 Language Phrase Patterns
    7.4.2 World Patterns
    7.4.3 Patterns Resulting from Parser Misclassification
    7.4.4 Patterns Specific to the Corpus
8 Appendix C - Inter-Clause Patterns
  8.1 Patterns within Three Clauses (t=3)
9 Appendix D - Complex Patterns
  9.1 Patterns with Subject-Subject Anchor
  9.2 Patterns with Verb-Verb Anchor
  9.3 Patterns with Object-Object Anchor
  9.4 Patterns with Subject-Object Anchor
  9.5 Patterns with Object-Subject Anchor
Bibliography

1 Introduction

1.1 The Problem

This work is concerned with the problem of detecting patterns in sequential data. When we deal with sequences where each point in the sequence can have one of a very large number of values, such patterns are often difficult to detect. The difficulty stems mainly from the problem of data sparseness, meaning that our sequence is not (and usually, cannot feasibly be) long enough to give a true representation of the value distribution. The large number of values presents another problem: since we are usually looking for patterns which should be applicable to a large part of the data, finding a pattern which applies to a small number of values is of little use to us.

The procedure we present here is designed to solve both these problems, and facilitate the pattern detection task. It combines the use of clustering via mutual information with a model chosen to fit both our specific pattern detection task and the data we deal with.

An important example of such a situation is pattern detection in text. If we view the text as a sequence of sentences, it is easy to see that finding patterns between sentences is very difficult. We hardly ever encounter the same exact sentence more than once, so at first glance, no patterns exist between whole sentences. We are well aware, however, that patterns do exist, but on the semantic level, rather than the purely lexical one. Since semantic inf
