Knowledge Graphs and Transportation Big Data
Yanghua Xiao, Fudan University

Outline
• About us
• What is a knowledge graph?
• Why study knowledge graphs?
• How are knowledge graphs built?
• How are knowledge graphs applied?

Our Lab
– Knowledge Works@FUDAN
Knowledge Works is a studio focusing on building and managing large-scale, high-quality knowledge graphs, as well as on applications of knowledge graphs in text understanding, intelligent search and robot brains.
– Graph Data Management Lab@FUDAN
GDM@FUDAN focuses on studying and developing effective and efficient solutions for managing and mining graph data, aiming at understanding real graphs and supporting real applications built upon large real graphs. Recently, we have been especially interested in knowledge graphs and their applications.

Our Mission: The construction, management and application of large-scale knowledge graphs

Knowledge Graph
A kind of semantic network that consists of entities/concepts as well as their semantic relationships. Higher coverage of entities and concepts, more abundant semantic relationships, more automatic construction and higher accuracy are expected.

The key to intelligent information processing
KGs have shown their potential in solving problems such as search intent understanding, relationship explanation and user profiling. They are of great business value in intelligent search, intelligent software, cyber security and intelligent business.

The key to building a machine that thinks like a human
KGs provide the background knowledge a machine needs to understand language and think like a human.

Knowledge graph construction and application
• Recommendation using KG (DASFAA 2015)
• User profiling by KG (ICDM 2015, CIKM 2015)
• Knowledge reorganization (CIKM 2014)
• Categorization by KG (CIKM 2015)
• Summarization by KG (IJCAI 2015)
• Verb-centric KG construction (AAAI 2016)
• Cross-lingual type inference (DASFAA 2016)

Graph Management
• Big graph systems (SIGMOD 2012)
• Overlapping community search (SIGMOD 2013)
• Local community search (SIGMOD 2014)
• Big graph partitioning (ICDE 2014)
• Shortest distance query (VLDB 2014)

Graph Analytics
• Models for symmetry (Physical Review E 2008)
• Simplifying graphs by symmetry (Physical Review E 2008, "substantial contribution")
• Complexity/distance measurement by symmetry (Pattern Recognition 2008, Physica A 2008)
• Using symmetry to reduce index size (EDBT 2009, "pioneering work")
• Using symmetry for social network anonymization (EDBT 2010)

Research Outline
Q1: Real graphs are symmetric. Why are they symmetric, and how can symmetry be used in real applications?
Q2: Real graphs are big, some with billions of nodes. How can these big graphs be managed and analyzed efficiently and effectively?
Q3: Real graphs are semantically rich. How can knowledge graphs (KGs) be constructed, and how can they be used in search, recommendation and inference?

Our knowledge bases (kw.fudan.edu.cn)
1. CN-DBpedia
CN-DBpedia is an effort to extract structured information from Chinese encyclopedia sites, such as Baidu Baike, and make this information available on the Web. CN-DBpedia allows you to ask sophisticated queries against Chinese encyclopedia sites, and to link other datasets on the Web to Chinese encyclopedia data.
2. ProbasePlus
Probase is a web-scale taxonomy that contains 10 million concepts/entities and 16 million isA relations. ProbasePlus is an updated taxonomy that adds more isA relations inferred from the original Probase. Both are useful for conceptualization, reasoning, etc.
3. VerbBase
A verb pattern is a probabilistic semantic representation of a verb. We introduce verb patterns to represent verbs' semantics, such that each pattern corresponds to a single sense of the verb.
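We constructed verb patterns with consideration of their generality and specificity.

Knowledge Graphs
• A knowledge graph is a form of large-scale knowledge representation that expresses entities of all kinds and the semantic relationships among them.
– Higher coverage of entities and concepts
– Richer semantic relationships
– Highly automated construction
– Relatively high data quality
• Why knowledge graph research matters
– (Semantic gap) Provides rich background knowledge for semantic matching
– (Machine brain) Provides rich background knowledge for the machine brain
• Yago, WordNet, FreeBase, Probase, NELL, CYC, DBPedia, …
CN-DBpedia entity relationship graph, from kw.fudan.edu.cn

Transportation Knowledge Graph
• How to understand big data
– Association and understanding
– Data understanding lacks background knowledge
• Applications of transportation big data
– Cloned license plate detection
– Profiling of problem vehicles
– Association queries and analysis

Introduction to the CN-DBpedia System

What is CN-DBpedia
– CN-DBpedia aims to build the largest Chinese knowledge graph. It extracts information mainly from the plain-text pages of Chinese encyclopedia websites (such as Baidu Baike, Hudong Baike and Chinese Wikipedia). After filtering, fusion, inference and other operations, the result is high-quality structured data that both machines and people can access.
– Baidu Baike / CN-DBpedia

Advantages of CN-DBpedia
• Structured data
• Higher data quality
– Attribute fusion
– Normalization of date attribute values
– Entity segmentation within attribute values
– Named entity recognition
• More semantic relationships
– Cross-lingual entity linking
– Cross-lingual entity type inference
• Provides an API
• Provides graphical visualization
Since going online in December 2015, it has passed 6.8 million visits and inquiries.

Distributed anti-blocking crawler
• Architecture: a star network topology.
• Multi-OS support: written in C#, so it can run on non-Windows machines using Mono.
• Multi-subnet support: communication uses TCP messages, so the master and the workers only need network connectivity to each other. No advance network configuration, such as setting up passwordless SSH tunnels, is required.
• Easy deployment: each worker is deployed as a single executable file.
• Multi-language support: executables or scripts with an agreed input/output format serve as the Map or Reduce programs, so these can be written in many languages.
• High fault tolerance: a worker crash does not affect the running task, and the master recovers from failures using a snapshot mechanism.
• Fully automatic anti-blocking mechanism
• Basic crawler: a basic list-based crawler.
• Breadth-first search crawler: the crawler keeps following the hyperlinks found in web pages.
• Cookies support: the crawler can use cookies to fetch content that is available only after login.
• Proxy support: the crawler can automatically find usable free HTTP proxies on the network (a separate module) and crawl through them.
• Ajax parsing: after simple configuration, the crawler can parse and crawl complex Ajax pages.
• Mobile crawling: crawls content from mobile apps such as WeChat, Jike (即刻) and Toutiao (今日头条).

Web page parsing and extraction
• Pages on encyclopedia websites are highly structured and accurate, which makes extraction easier

Relation extraction
• From each element tree, we extract the entity name, the summary relation, attribute relations, category relations, and so on
• These relations all exist in the element tree in a relatively structured form and can be extracted with general rules.
– Entity name: from title
– Summary relation: from the div class="lemma-summary" tag
– Attribute relations: from the div class="basic-info" tag
– Category relations: from the div id="open-tag" tag

Normalization of attributes and values
• Birth date
– 出生日期, 出生年月, 出生时间, 生日, …
• Chinese name
– 中文名, 中文名称, 名称, …
• Region
– 所属地区, 所在地区, 所属区域, 所属地, 区域所属, …
• 1905年(乙巳年)9月14日 → 1905-09-14
• 1905.9.14 → 1905-09-14
• 1987年2月 → 1987-02-00
• 1987年 → 1987-00-00

Attribute fusion
• Goal
– The same attribute appears in different surface forms; merge the forms that denote the same attribute
– For example, "出生日期" (birth date) and "出生年月" (birth year and month)
• Challenges
– Entities have thousands upon thousands of attributes; completely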