Data Quality Issues in Constructing Knowledge Graph
Quality Control in Knowledge Graph Construction
Zhixu Li (李直旭), Advanced Data Analytics Research Center, Soochow University
2017.07.13, Fudan University

Outline
  - Introduction to DQ
  - Computational DQ Problems
  - Data Quality Issues in Constructing KG
  - Data Cleaning in KG
  - Entity Linking in KG
  - Data Imputation in KG
  - Conclusions

Data Explosion

Knowledge Explosion?

DQ Problems in DBLP
  - Polysemes: 10+ different "Wei Wang"
  - Synonyms: "Pei Lee" and "Pei Li"

Difficult Names in Google Search
  (Source: Data Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann)

Another Example with KBs
  - Different schemas: e.g., "Sex" - "Gender", "Phone/Fax" - "Phone" + "Fax"
  - Inconsistent values: e.g., "0/1" - "F/M"
  - Missing values

Six DQ Dimensions

The Taxonomy of DQ Problems

Computational Data Quality Problems
  - Data Integration
  - Schema Mapping
  - Record Matching
  - Data Cleaning
  - Data Imputation
  - Data Provenance
  - Data Uncertainty
  - Data Constraints

Data Cleaning in KG
  Open IE - Knowledge Graph
  - Bootstrapping mechanisms, e.g.: KnowItAll, SnowBall, ProBase, …
  - However, the accuracy decreases sharply after several iterations.

Data Cleaning in KG
  A major reason: semantic drift happens
  (a) Semantic-based bootstrapping mechanism
  (b) Syntax-based bootstrapping mechanism

Data Cleaning in KG
  Mainstream approaches
  - Mutual Exclusion Bootstrapping (PACLING'07): drop those instances belonging to mutually exclusive classes.
  - Type Checking (WSDM'10): check the type of an entity for correctness.
  - Random Walk Ranking (ICDM'06): construct a graph, then do random walk ranking.
  - Pattern-Relation Duality Ranking (WSDM'11): the quality of a pattern (tuple) can be determined by the tuples (patterns) it extracts.
  - A Model based on Detected Drifting Points (EDBT'14)

Data Cleaning in KG
  Mutual Exclusion Bootstrapping
  - Pros and cons: high precision, low recall
  Figure (example):
    Positives: Canada, Egypt, France, …
      Patterns: "war with ×", "ambassador to ×", "war in ×", "occupation of ×"
      Instances: Planet Earth, Freetown, North Africa
    Negatives: Asia, Europe, London, Florida, …
      Patterns: "nations like ×", "countries other than ×", "country like ×"
      Instances: Pakistan, Sri Lanka, Greece, Russia

Data Cleaning in KG
  Type Checking
  - Checking types of relevant entities
  - Pros and cons: high precision, low recall
  Figure (example):
    Pattern: "X, which is based in Y"
    Type-checking arguments: "… companies such as Pillar …", "… cities like San Jose …"
    (Pillar, San Jose): OK
    (Inclined pillar, foundation plate): NO

Data Cleaning in KG
  Random Walk based Cleaning
  Figure: example graph with 12 nodes (1-12).

  $$\vec{r}_i = c\,W\,\vec{r}_i + (1 - c)\,\vec{e}_i$$

  where $\vec{r}_i$ is the ranking vector, $\vec{e}_i$ is the starting vector (1 at the start node, 0 elsewhere), $W$ is the adjacent matrix (normalized, with entries such as 1/3, 1/4 and 1/2 in the example), and $(1 - c)$ is the restart probability.
  Worked example on the 12-node graph, with $c = 0.9$ and restart probability $0.1$:
    ranking vector = (0.13, 0.10, 0.13, 0.22, 0.13, 0.05, 0.05, 0.08, 0.04, 0.03, 0.04, 0.02)

Data Cleaning in KG
  Pattern-Relation Duality
  - Idea: the quality of a pattern (tuple) can be determined by the tuples (patterns) it extracts.
  - Cons: still cannot reach high precision and recall
  Figure: the same 12-node graph, with one random walk on precision (RW on Precision) and one on recall (RW on Recall); F-Score = Precision + Recall; ranking with F-Score.
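The mutual exclusion idea from the Mutual Exclusion Bootstrapping slide (drop instances that belong to mutually exclusive classes) can be sketched as follows. This is a minimal illustration, not the PACLING'07 algorithm itself: the function name `drop_mutually_exclusive`, the class labels "country" and "city", and the extracted sets are all assumptions made for the example, and discarding a conflicting instance from both classes is one simple reading of the rule.

```python
def drop_mutually_exclusive(extractions, exclusive_pairs):
    """Remove instances that were extracted for two mutually exclusive classes.

    extractions: dict mapping a class name to the set of instances extracted for it.
    exclusive_pairs: iterable of (class_a, class_b) pairs that cannot share instances.
    Returns a new dict; a conflicting instance is dropped from both classes.
    """
    conflicted = {cls: set() for cls in extractions}
    for a, b in exclusive_pairs:
        overlap = extractions.get(a, set()) & extractions.get(b, set())
        conflicted.setdefault(a, set()).update(overlap)
        conflicted.setdefault(b, set()).update(overlap)
    return {cls: items - conflicted[cls] for cls, items in extractions.items()}


# Hypothetical bootstrapping output: "London" has drifted into the country class.
extractions = {
    "country": {"Canada", "Egypt", "France", "London"},
    "city": {"London", "Freetown"},
}
cleaned = drop_mutually_exclusive(extractions, [("country", "city")])
print(cleaned)
```

Because only instances claimed by two exclusive classes are removed, the filter rarely discards a correct instance, but it cannot catch a drifted instance that no competing class has extracted. That matches the high precision, low recall trade-off noted on the slide.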
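The Type Checking example (accepting "Pillar, San Jose" but rejecting "Inclined pillar, foundation plate" for the pattern "X, which is based in Y") can be approximated with a simple textual evidence check. The sketch below assumes that type evidence comes from phrases such as "companies such as X" and "cities like Y" found in a corpus. The function names, the pattern lists, and the two corpus sentences are hypothetical, and plain substring matching stands in for whatever matching the WSDM'10 method actually uses.

```python
# Hypothetical type-evidence templates, modelled on the phrases shown on the slide.
COMPANY_PATTERNS = ["companies such as {entity}"]
CITY_PATTERNS = ["cities like {entity}"]


def has_type_evidence(entity, patterns, corpus):
    """Return True if any sentence contains a type pattern instantiated with the entity."""
    needles = [p.format(entity=entity).lower() for p in patterns]
    return any(needle in sentence.lower() for sentence in corpus for needle in needles)


def type_check_based_in(x, y, corpus):
    """Accept the tuple (x, y) for 'X, which is based in Y' only if
    x has company evidence and y has city evidence in the corpus."""
    return (has_type_evidence(x, COMPANY_PATTERNS, corpus)
            and has_type_evidence(y, CITY_PATTERNS, corpus))


# Made-up sentences built around the fragments quoted on the slide.
corpus = [
    "Investors backed companies such as Pillar last year.",
    "Rents rose in cities like San Jose.",
]
print(type_check_based_in("Pillar", "San Jose", corpus))
print(type_check_based_in("Inclined pillar", "foundation plate", corpus))
```

A tuple is kept only when both arguments have explicit type evidence, so correct tuples whose arguments never appear in such phrases are rejected as well. This is one way to see why the slide reports high precision but low recall.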
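The ranking equation on the Random Walk based Cleaning slide, r_i = c W r_i + (1 - c) e_i, can be solved by repeatedly applying its right-hand side until the vector stops changing. The sketch below is a plain-Python illustration under stated assumptions: W is taken to be the column-normalized adjacency matrix, the 5-node graph is hypothetical (the edges of the slide's 12-node graph are not recoverable), and the function names are made up for this example.

```python
def column_normalize(adj):
    """Divide each column of the adjacency matrix by its sum (assumed normalization)."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def random_walk_with_restart(adj, start, c=0.9, tol=1e-10, max_iter=1000):
    """Iterate r = c * W * r + (1 - c) * e until the L1 change falls below tol."""
    n = len(adj)
    w = column_normalize(adj)
    e = [0.0] * n
    e[start] = 1.0          # starting vector: all mass on the start node
    r = e[:]
    for _ in range(max_iter):
        new_r = [c * sum(w[i][j] * r[j] for j in range(n)) + (1 - c) * e[i]
                 for i in range(n)]
        delta = sum(abs(a - b) for a, b in zip(new_r, r))
        r = new_r
        if delta < tol:
            break
    return r


# Hypothetical undirected graph with edges 0-1, 0-2, 1-2, 2-3, 3-4.
graph = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
scores = random_walk_with_restart(graph, start=0, c=0.9)
for node, score in enumerate(scores):
    print(node, round(score, 3))
```

With c = 0.9 the walker restarts with probability 0.1, as in the slide's worked example. Nodes that are poorly connected to the start node end up with low scores, which is the property that makes the ranking usable for flagging drifted instances.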