Paper Revision: Data Mining and Semantic Web



Data mining and the Semantic Web are two different avenues leading to the same goal: efficient retrieval of knowledge from large compact or distributed databases, or from the Internet. Knowledge in this context means the synergistic interaction of information (data) and its relationships (correlations). The major difference between the two approaches is where the complexity is placed.

These two approaches each have their own advantages and disadvantages, and they can be integrated with each other to diminish the drawbacks of both.

This paper explores the integration of the Semantic Web and data mining in different fields of application.




1 Introduction

Information resources on the Web are mostly natural language documents in text format, suitable for human consumption. The problem with this kind of Web is that its content is not machine-accessible. Search engines try to establish connections between documents, but there are serious problems associated with their use, such as "high recall, low precision", "low or no recall", results that are highly sensitive to vocabulary, and other problems.

Ongoing efforts to overcome these problems, and to structure Web content so that it can be queried and information retrieved from it, proceed in two directions.

One solution is to use the content as it is and to apply information retrieval and text mining techniques that categorize textual resources. These approaches either (i) predefine a metric on a document space in order to cluster "nearby" documents into meaningful groups ("unsupervised categorization" or "text clustering"), or (ii) adapt a metric on a document space to a manually predefined sample of documents assigned to a list of target categories, so that new documents can also be assigned labels from that list ("supervised categorization" or "text classification").[1] In this approach, data and knowledge are represented with simple mechanisms, typically HTML, and without metadata. Relatively complex data mining algorithms have to be used, such as decision trees, rule induction ... This method has its advantages and its problems. The advantage is that document categorization is relatively cheap. The problem is that the quality of the categorization for larger sets of target categories, as well as the understandability of the results, is often quite low.
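The two categorization styles described above can be illustrated with a small sketch. It is not the method of [1]: it assumes a plain bag-of-words representation with cosine similarity as the metric, a nearest-centroid rule for the supervised case, and a greedy single-pass grouping for the unsupervised case. All function names, the threshold value, and the sample documents are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a whitespace-tokenized document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(doc, labelled):
    """Supervised categorization (assumed nearest-centroid rule).

    `labelled` is a list of (text, label) pairs prepared by hand.
    """
    centroids = {}
    for text, label in labelled:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    vec = vectorize(doc)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


def cluster(docs, threshold=0.3):
    """Unsupervised categorization (assumed greedy single-pass grouping).

    A document joins the most similar existing cluster if the similarity
    reaches `threshold`; otherwise it starts a new cluster.
    """
    clusters = []  # list of (centroid, members) pairs
    for doc in docs:
        vec = vectorize(doc)
        best, best_sim = None, threshold
        for entry in clusters:
            sim = cosine(vec, entry[0])
            if sim >= best_sim:
                best, best_sim = entry, sim
        if best is None:
            clusters.append((Counter(vec), [doc]))
        else:
            best[0].update(vec)
            best[1].append(doc)
    return [members for _, members in clusters]


if __name__ == "__main__":
    training = [
        ("heart disease patient treatment", "health"),
        ("stock market price trading", "finance"),
    ]
    print(classify("patient heart surgery", training))
    print(cluster([
        "heart disease treatment",
        "heart disease patient",
        "stock market price",
        "stock market trading",
    ]))
```

Neither function needs metadata: both work directly on the raw text, which is why this route is cheap to set up but gives results that are harder to interpret as the number of target categories grows.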

An alternative approach is to represent Web content in a form that is more easily machine-processable, using the Semantic Web. Data and knowledge are represented with complex mechanisms, typically XML, and with plenty of metadata. In this approach, thesauri and ontologies, which are conceptual structures, are constructed. The advantage is that the quality of manual metadata can be very good, and relatively simple, low-complexity algorithms can be used at retrieval request time. However, the cost of building an ontology and adding manual metadata is typically one or several orders of magnitude higher than for automatic approaches, and the complexity of metadata design and maintenance at system design time is large.[2]

The first approach can be referred to as the data mining approach and the second as the Semantic Web approach. The two can be integrated with each other to diminish the drawbacks of both.

This paper presents some uses of the Semantic Web and data mining. The next section introduces an ontology-based framework for text mining. Section 3 discusses applications of the Semantic Web and data mining in healthcare. Section 4 concludes the paper.
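To make the trade-off concrete, the following sketch shows how simple retrieval can become once documents carry manual metadata tied to an ontology. It is an illustrative assumption, not the system described in [2]: the concept hierarchy, the document identifiers, and the function names are all hypothetical, and a real Semantic Web system would use a standard representation rather than Python dictionaries.

```python
# Hypothetical ontology: each concept maps to its direct parent concept.
SUBCLASS_OF = {
    "cardiology": "medicine",
    "oncology": "medicine",
    "medicine": "science",
}

# Hypothetical manual metadata: each document is annotated with one concept.
ANNOTATIONS = {
    "doc1": "cardiology",
    "doc2": "oncology",
    "doc3": "science",
}


def is_a(concept, ancestor):
    """Return True if `concept` equals `ancestor` or is one of its descendants."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False


def retrieve(query_concept):
    """Return documents annotated with the query concept or any subconcept."""
    return sorted(
        doc for doc, concept in ANNOTATIONS.items() if is_a(concept, query_concept)
    )


if __name__ == "__main__":
    print(retrieve("medicine"))
```

The retrieval step is only a walk up the hierarchy per document. The expensive part is hidden in the two dictionaries, which stand in for the ontology and the manual annotations that someone has to design and maintain.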


