Text Mining is about...



Text Mining: Techniques and Applications
Choochart Haruechaiyasak, Ph.D.
Human Language Technology (HLT)
National Electronics and Computer Technology Center (NECTEC)
Rev. 8 March 2007

Lecture Outline
- Overview: Scope and Tasks
- Technique: Information Extraction (IE)
- Application: Tech Mining, the application of text mining to science and technology (S&T) information

Overview: Scope and Tasks

Text Mining is about...
"Sifting through vast collections of unstructured or semistructured data beyond the reach of data mining tools, text mining tracks information sources, links isolated concepts in distant documents, maps relationships between activities, and helps answer questions."
Tapping the Power of Text Mining, Communications of the ACM, Sept. 2006

Humans VS. Computers
- Humans: Ability to distinguish and apply linguistic patterns to text. Could overcome language difficulties such as slang, spelling variations, and contextual meaning.
- Computers: Ability to process text in large volumes at high speed. Could sift through a large collection of texts to find simple statistics and relationships among terms in an instant.
- Text mining requires a combination of both: humans' linguistic capability (NLP) + computers' speed and accuracy (Data Mining).

NLP + Data Mining Tasks
NLP:
- Lexical/Morphological Analysis
- Tagging/Chunking
- Named Entities Recognition (NER)
- Syntactic Analysis (Shallow parsing)
- Word Sense Disambiguation
- Semantic Analysis
- Reference Resolution
- Discourse Analysis
Data Mining:
- Classification (supervised learning)
- Clustering (unsupervised learning)
- Association Rule Mining
- Sequential Pattern Analysis
- Regression Analysis
- Dependency Modeling
- Change and Deviation Detection

Text Mining Tasks
- Information extraction: Analyze unstructured text and identify key phrases and relationships within the text.
- Topic detection and tracking: Filter and present only documents relevant to the user profile.
- Summarization: Reduce the content by retaining only its main points and overall meaning.
- Categorization: Automatically classify documents into predefined categories.
- Clustering: Group similar documents based on their similarity.

Text Mining Tasks (cont'd)
- Concept Linkage: Connect related documents by identifying their shared concepts, helping users find information they perhaps wouldn't have found through traditional search methods.
- Information Visualization: Represent documents or information in graphical formats for easy browsing, viewing, or searching.
- Question and answering (Q&A): Search and extract the best answer to a given question.

Example: Concept Linkage
Biomedicine: Co-occurrence of terms

Example: Concept Linkage
Biomedicine: Entities & Relationship

Example: Search Result Clustering
vivisimo.com

Example: Question & Answering
ask.com

Example: Information Visualization
kartoo.com

Technique: Information Extraction

What is Information Extraction?
- First, note a common misunderstanding: Information Retrieval doesn't retrieve information.
- You have an information need, but what you get back isn't information but documents, which you hope have the information.
- Information extraction is one approach to going further for a special case:
  - There's some relation you're interested in.
  - Your query is for elements of that relation.
  - A limited form of natural language understanding.

Information Extraction (IE)
- Identify specific pieces of information (data) in an unstructured or semi-structured textual document.
- Transform unstructured information in a corpus of documents or web pages into a structured database.
- Applied to different types of text:
  - Newspaper articles
  - Web pages
  - Scientific articles
  - Newsgroup messages
  - Classified ads
  - Medical notes

Applications
- Job postings/resumes
- Seminar announcements
- Company information from the web
- Continuing education course info from the web
- University information from the web
- Apartment rental ads
- Molecular biology information from MEDLINE

Extracting Corporate Information
- Data automatically extracted from marketsoft.com
- Source web page. Color highlights indicate type of information (e.g., red = name).
- E.g., information need: Who is the CEO of MarketSoft?
Source: Whizbang! Labs / Andrew McCallum

Shopping: Commercial Information
- Need this price
- Title
- A book, not a toy

Product Information

Difficult Because of Textual Inconsistency
Digital Cameras:
- Image Capture Device: 1.68 million pixel 1/2-inch CCD sensor
- Image Capture Device: Total Pixels Approx. 3.34 million, Effective Pixels Approx. 3.24 million
- Image sensor: Total Pixels: Approx. 2.11 million-pixel
- Imaging sensor: Total Pixels: Approx. 2.11 million 1,688 (H) x 1,248 (V)
- CCD Total Pixels: Approx. 3,340,000 (2,140 [H] x 1,560 [V])
- Effective Pixels: Approx. 3,240,000 (2,088 [H] x 1,550 [V])
- Recording Pixels: Approx. 3,145,000 (2,048 [H] x 1,536 [V])
These all came off the same manufacturer's website!!

Classified Advertisements (Real Estate)
Background: Advertisements are plain text

<ADNUM>2067206v1</ADNUM>
<DATE>March 02, 1998</DATE>
<ADTITLE>MADDINGTON $89,000</ADTITLE>
<ADTEXT>
OPEN 1.00 - 1.45<BR>
U 11/10 BERTRAM ST<BR>
NEW TO MARKET Beautiful<BR>
3 brm freestanding<BR>
villa, close to shops & bus<BR>
Owner moved to Melbourne<BR>
ideally suit 1st home buyer,<BR>
investor & 55 and over.<BR>
Brian Hazelden 0418958996<BR>
R WHITE LEEMING 93323477
</ADTEXT>

What you search for in real estate advertisements:
- Towns: you might think easy, but:
  - Real estate agents: Coldwell Banker, Mosman
  - Phrases: Only 45 minutes from Parramatta
  - Multiple property ads have different towns
- Money: want a range not a textual ma
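
The real estate advertisement above shows why price is better handled as a number than as text: once the amount is parsed, a range query becomes possible. The following is a minimal sketch of pattern-based extraction in Python, not the method used in the lecture. It assumes the ad's tag names (ADNUM, DATE, ADTITLE, ADTEXT, BR) are angle-bracket markup, and the helper names `field` and `extract_ad` and the regular expressions are hypothetical choices made for illustration.

```python
import re

# Shortened copy of the sample ad. The angle-bracket tags are an assumption;
# the extracted slide text shows only the tag names.
AD = (
    "<ADNUM>2067206v1</ADNUM><DATE>March 02, 1998</DATE>"
    "<ADTITLE>MADDINGTON $89,000</ADTITLE>"
    "<ADTEXT>OPEN 1.00 - 1.45<BR>U 11/10 BERTRAM ST<BR>"
    "NEW TO MARKET Beautiful<BR>3 brm freestanding<BR>"
    "villa, close to shops & bus</ADTEXT>"
)


def field(tag, text):
    """Return the text between <tag> and </tag>, or None if the tag is absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def extract_ad(text):
    """Turn one semi-structured ad into a structured record (hypothetical schema)."""
    title = field("ADTITLE", text) or ""
    body = field("ADTEXT", text) or ""
    # Price: a dollar sign followed by digits with optional thousands separators.
    price = re.search(r"\$\s?(\d[\d,]*)", title + " " + body)
    # Bedrooms: a number followed by "br" or "brm", as in "3 brm".
    bedrooms = re.search(r"(\d+)\s*brm?\b", body, flags=re.IGNORECASE)
    return {
        "ad_id": field("ADNUM", text),
        "date": field("DATE", text),
        "price": int(price.group(1).replace(",", "")) if price else None,
        "bedrooms": int(bedrooms.group(1)) if bedrooms else None,
    }


record = extract_ad(AD)
print(record)
```

With the price stored as an integer, a filter such as `80000 <= record["price"] <= 100000` expresses a range directly. Town names are deliberately left out of this sketch, since the slide points out that agent names and phrases such as "Only 45 minutes from Parramatta" make them much harder to match reliably.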
