Abstract
Data mining and the Semantic Web are two different avenues leading to the same goal: efficient retrieval of knowledge from large compact or distributed databases, or from the Internet. Knowledge in this context means the synergistic interaction of information (data) and its relationships (correlations). The major difference between the two approaches is where the complexity is placed.
Each approach has its own advantages and disadvantages, and the two can be integrated with each other to reduce the drawbacks of both.
This paper explores the integration of the Semantic Web and data mining in different fields of application.
1 Introduction
Information resources on the Web are mostly natural-language text documents suited to human consumption. The problem with this kind of Web is that its content is not machine-accessible. Search engines try to establish connections between documents, but their use is associated with serious problems such as "high recall, low precision", "low or no recall", and "results are highly sensitive to vocabulary", among others.
Ongoing efforts to overcome these problems, and to structure Web content so that it can be queried and information can be retrieved from it, go in two directions.
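The terms "recall" and "precision" used above have standard definitions in information retrieval: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The following is a minimal sketch of these two measures; the function name and the document identifiers are hypothetical and chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the documents the search engine returned
    relevant:  identifiers of the documents that actually answer the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# "High recall, low precision": every relevant document is found,
# but it is buried among many irrelevant ones.
returned = ["d%d" % i for i in range(10)]  # hypothetical result list
wanted = ["d0", "d1"]                      # hypothetical relevant set
p, r = precision_recall(returned, wanted)
```

In this illustrative case all relevant documents are returned, yet most of the result list is noise, which is the situation the first quoted problem describes.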
One solution is to use the content as it is and to apply information retrieval and text mining to categorize textual resources. These approaches either (i) predefine a metric on a document space in order to cluster 'nearby' documents into meaningful groups ('unsupervised categorization' or 'text clustering'), or (ii) adapt a metric on a document space to a manually predefined sample of documents assigned to a list of target categories, so that new documents can also be assigned labels from that list ('supervised categorization' or 'text classification') [1]. In this approach data and knowledge are represented with simple mechanisms, typically HTML, and without metadata. Relatively complex data mining algorithms then have to be used, such as decision trees, rule induction, and so on. This method has its advantages and its problems. The advantage is that document categorization with this method is relatively cheap; the problem is that the quality of the categorization for larger sets of target categories, as well as the understandability of its results, is often quite low.
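The two variants (i) and (ii) can be illustrated with a small sketch. It assumes a bag-of-words representation with cosine similarity as the metric, a greedy single-pass clustering for the unsupervised case, and a nearest-centroid rule for the supervised case. These are illustrative choices, not the methods of [1], and all function names, the threshold value, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-split text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """(i) Unsupervised: group 'nearby' documents under a predefined metric.

    Each document joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of document indices.
    """
    vecs = [vectorize(d) for d in docs]
    clusters = []
    for i, v in enumerate(vecs):
        for members in clusters:
            if cosine(v, vecs[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def train(labeled_docs):
    """(ii) Supervised: build one summed term vector per target category."""
    centroids = {}
    for text, label in labeled_docs:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids


def classify(text, centroids):
    """Assign a new document the label of its most similar category vector."""
    v = vectorize(text)
    return max(centroids, key=lambda label: cosine(v, centroids[label]))


docs = [
    "ontology semantic web metadata",
    "semantic web ontology languages",
    "decision trees rule induction",
    "rule induction decision algorithms",
]
groups = cluster(docs)
model = train([(docs[0], "semantic_web"), (docs[2], "data_mining")])
label = classify("rule induction with trees", model)
```

Note that neither function needs any metadata: both work directly on the raw text, which is why this route is cheap to set up but gives results that are hard to interpret as the number of target categories grows.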
An alternative approach is to represent Web content in a form that is more easily machine-processable, using the Semantic Web. Data and knowledge are represented with complex mechanisms, typically XML, and with plenty of metadata. In this approach thesauri and ontologies, which are conceptual structures, are constructed. The advantage is that the quality of manual metadata may be very good, and relatively simple, low-complexity algorithms can be used at retrieval request time. However, the cost of building an ontology and adding manual metadata is typically one or several orders of magnitude higher than for automatic approaches, and the complexity of metadata design and maintenance at system design time is large [2].
The first approach can be referred to as the data mining approach and the second as the Semantic Web approach. The two can be integrated with each other to reduce the drawbacks of both.
This paper presents some uses of the Semantic Web and data mining. The next section introduces an ontology-based framework for text mining. Section 3 discusses applications of the Semantic Web and data mining in healthcare. Section 4 concludes the paper.
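The following sketch illustrates why retrieval can be simple once metadata and an ontology exist. It assumes metadata stored as plain (subject, predicate, object) triples and a tiny hand-written concept hierarchy. The vocabulary ("about", "subClassOf"), the concept names, and the document identifiers are hypothetical, and no particular Semantic Web library or serialization is implied.

```python
# Hypothetical manually authored metadata and ontology, as triples.
TRIPLES = {
    ("doc1", "about", "DecisionTree"),
    ("doc2", "about", "Ontology"),
    ("doc3", "about", "RuleInduction"),
    ("DecisionTree", "subClassOf", "DataMiningMethod"),
    ("RuleInduction", "subClassOf", "DataMiningMethod"),
    ("Ontology", "subClassOf", "ConceptualStructure"),
}


def subclasses(concept, triples):
    """Return the concept plus every concept declared (transitively) below it."""
    found = {concept}
    frontier = [concept]
    while frontier:
        parent = frontier.pop()
        for s, p, o in triples:
            if p == "subClassOf" and o == parent and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def documents_about(concept, triples):
    """Retrieve documents annotated with the concept or any of its subclasses."""
    targets = subclasses(concept, triples)
    return sorted(s for s, p, o in triples if p == "about" and o in targets)


hits = documents_about("DataMiningMethod", TRIPLES)
```

The query itself is a lookup plus a walk over the hierarchy. The effort sits in authoring and maintaining the triples, which matches the cost profile described above.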