Paper Revision: Data Mining and Semantic Web

enlunwen

8 years ago

Abstract

Data mining and the Semantic Web are two different avenues to the same goal: efficient retrieval of knowledge from large databases, whether compact or distributed, or from the Internet. Knowledge in this context means the synergistic interaction of information (data) and its relationships (correlations). The major difference between the two approaches is where the complexity is placed.

Each approach has its own advantages and disadvantages, and the two can be integrated with each other to reduce the drawbacks of both.

This paper explores the integration of the Semantic Web and data mining in different fields of application.

1 Introduction

Information resources on the Web are mostly natural language text documents suited to human consumption. The problem with this kind of Web is that its content is not machine-accessible. Search engines try to establish connections between documents, but their use brings serious problems such as "high recall, low precision", "low or no recall", and "results are highly sensitive to vocabulary".

Ongoing efforts to overcome these problems, and to structure Web content so that it can be queried and information can be retrieved from it, go in two directions.

One solution is to use the content as it is and to apply information retrieval and text mining to categorize textual resources. These approaches either (i) predefine a metric on a document space in order to cluster 'nearby' documents into meaningful groups ('unsupervised categorization' or 'text clustering'), or (ii) adapt a metric on a document space to a manually predefined sample of documents assigned to a list of target categories, so that new documents can also be assigned labels from the target list ('supervised categorization' or 'text classification').[1] In this approach data and knowledge are represented with simple mechanisms, typically HTML, and without metadata, so relatively complex data mining algorithms such as decision trees and rule induction have to be used. The advantage is that document categorization with this method is relatively cheap. The problem is that the quality of the categorization for larger sets of target categories, as well as the understandability of its results, is often quite low.
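The supervised variant (ii) can be illustrated with a minimal sketch. It assumes documents are already available as term-frequency vectors, uses cosine distance as the distance measure, and assigns a new document the label of its nearest labeled example. The function names, the sample vectors and the labels are all hypothetical, and this is not the method of [1].

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def classify(vector, labeled_examples):
    """Assign the label of the nearest labeled example (1-nearest-neighbour)."""
    nearest = min(labeled_examples, key=lambda ex: cosine_distance(vector, ex[0]))
    return nearest[1]

# Hypothetical manually labeled sample: (term-frequency vector, category).
examples = [([3, 0, 1], "sports"), ([0, 4, 1], "finance")]
label = classify([2, 0, 0], examples)
```

The same distance function could drive the unsupervised variant (i) by grouping vectors that lie close to each other instead of comparing them with labeled examples.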

An alternative approach is to represent Web content in a form that is more easily machine-processable, using the Semantic Web. Data and knowledge are represented with complex mechanisms, typically XML, and with plenty of metadata. In this approach thesauri and ontologies, which are conceptual structures, are constructed. The advantage is that the quality of manual metadata can be very good and relatively simple, low-complexity algorithms can be used at retrieval time. However, the cost of building an ontology and adding manual metadata is typically one or several orders of magnitude higher than for automatic approaches, and metadata design and maintenance add considerable complexity at system design time.[2]

The first approach can be called the data mining approach and the second the Semantic Web approach. The two can be integrated with each other to reduce the drawbacks of both.

This paper presents some uses of the Semantic Web and data mining. The next section introduces an ontology-based framework for text mining. Section 3 discusses applications of the Semantic Web and data mining in healthcare. Section 4 concludes the paper.

2 An ontology-based Framework for Text Mining

This framework was constructed by S. Bloehdorn, P. Cimiano, A. Hotho and S. Staab [1]. It uses text mining to learn a target ontology from text documents and then uses the same target ontology to improve the effectiveness of both supervised and unsupervised text categorization approaches.

The architecture builds upon the Karlsruhe Ontology and Semantic Web Infrastructure (KAON), a general, multi-functional open source ontology management infrastructure and tool suite developed at Karlsruhe University. The framework gives definitions for the core ontology, subconcepts and superconcepts, domain and range, the lexicon for an ontology, and the knowledge base. The main component of the framework, responsible for creating and maintaining ontologies, is "TextToOnto". It employs text mining techniques such as term clustering and matching of lexico-syntactic patterns, as well as other resources of a general nature such as WordNet [1]. It has three main components. The Ontology Management Component provides basic ontology management such as editing, browsing and evolution of ontologies. The Algorithm Library Component incorporates a number of text mining methods. The Coordination Component is used to interact with the different ontology learning algorithms from the algorithm library.

2.1 Ontology-based Text Clustering and Classification

The demand for systems that automatically classify text documents into predefined thematic classes, or detect clusters of documents with similar content, is urgent because of the ever growing amount of textual information available electronically. Existing text categorization systems have typically used the Bag-of-Words model, an information retrieval model in which single words or word stems are used as features for representing document content. In this paradigm documents are represented as bags of terms. The absolute frequency of term $t$ in document $d$ is given by $\mathrm{tf}(d, t)$, and term vectors are denoted $\vec{t}_d = (\mathrm{tf}(d, t_1), \ldots, \mathrm{tf}(d, t_m))$.
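As a rough illustration of the Bag-of-Words representation, the sketch below builds term vectors from absolute term frequencies. The whitespace tokenization, the function name `term_vector` and the two sample documents are assumptions made for the example; a real system would also apply stemming.

```python
from collections import Counter

def term_vector(document, vocabulary):
    """Return (tf(d, t_1), ..., tf(d, t_m)) for the terms in `vocabulary`."""
    counts = Counter(document.lower().split())
    return [counts[t] for t in vocabulary]

# Two made-up documents; the vocabulary is every distinct word, sorted.
docs = [
    "the heart pumps blood",
    "blood carries oxygen to the heart and the brain",
]
vocabulary = sorted({w for d in docs for w in d.lower().split()})
vectors = [term_vector(d, vocabulary) for d in docs]
```

Each position in a vector corresponds to one vocabulary term, so two documents can be compared position by position.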

To exploit the background knowledge about concepts that the ontology model provides, term vectors are extended with new entries for the ontological concepts c that appear in the document set.
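One way to picture this extension is to append concept frequencies to the term frequencies. The sketch below assumes a hypothetical lexicon that maps single terms to concepts; the names `LEXICON` and `extended_vector` and the sample concepts are illustrative and do not come from the framework in [1].

```python
from collections import Counter

# Hypothetical lexicon mapping terms to ontology concepts (assumption).
LEXICON = {"heart": "Organ", "brain": "Organ", "blood": "BodyFluid"}

def extended_vector(tokens, terms, concepts, lexicon=LEXICON):
    """Term frequencies followed by frequencies of the concepts they map to."""
    tf = Counter(tokens)
    cf = Counter(lexicon[t] for t in tokens if t in lexicon)
    return [tf[t] for t in terms] + [cf[c] for c in concepts]

tokens = "the heart pumps blood to the brain".split()
terms = ["blood", "brain", "heart", "pumps"]
concepts = ["BodyFluid", "Organ"]
vec = extended_vector(tokens, terms, concepts)
```

Here "heart" and "brain" both contribute to the `Organ` entry, so two documents that mention different organs still share a feature.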

The process of extracting concepts from texts has five steps:

1. Candidate Term Detection: an algorithm that maps multi-word expressions to the most appropriate concept.
2. Syntactical Patterns: uses the part-of-speech tags of the words.
3. Morphological Transformations.
4. Word Sense Disambiguation.
5. Generalization: the last step goes from the specific concepts found in the text to more general concept representations.
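The generalization step can be sketched as a walk up the superconcept hierarchy. The hierarchy `SUPER`, the function name and the `levels` parameter are assumptions for illustration; the source does not specify how far concepts are generalized.

```python
# Hypothetical superconcept relation: concept -> direct superconcept.
SUPER = {"Aspirin": "Analgesic", "Analgesic": "Drug", "Drug": "Substance"}

def generalize(concept, levels, super_of=SUPER):
    """Return the concept plus up to `levels` of its superconcepts."""
    chain = [concept]
    while levels > 0 and chain[-1] in super_of:
        chain.append(super_of[chain[-1]])
        levels -= 1
    return chain

chain = generalize("Aspirin", 2)
```

The returned concepts could then be counted as additional entries in the extended term vector.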

3 Semantic Web and data mining in Healthcare

This section discusses the use of the Semantic Web and data mining in healthcare. Section 3.1 gives an overview, 3.2 discusses using semantic dependencies to mine depressive symptoms from consultation records, and 3.3 discusses the requirements for ontologies in medical data integration.

3.1 Overview

The Web has become a major vehicle for research and practice related activities of healthcare researchers and practitioners, because it offers so many resources and so much potential in their specialized professional fields []. A tremendous amount of information and knowledge exists on the Web, waiting to be discovered, shared and utilized, and research on improving the quality of life through the Web has become attractive. Both healthcare researchers and practitioners need a lot of information to carry out their healthcare related activities and practices, whether that means drug prescriptions that can effectively cure patients' illnesses or correct and efficient medical and clinical procedures and services. Information technology has played an important and critical role in this field for many years. By using the Semantic Web and mining technologies, researchers and practitioners in healthcare from different countries can not only share their information by exchanging XML-based ontologies, but also collaborate effectively on healthcare research projects and work closely together as a team. By focusing on semantic-based information, they will have better access to the knowledge and information required to prescribe drugs and medical procedures effectively in order to prevent or treat dangerous and infectious diseases. Researchers and practitioners in healthcare have access to databases of the latest diseases, their symptoms, treatments, diagnosis analysis and other important information. This kind of information can be structured in a more understandable and machine-interpretable way by using Semantic Web languages. If this is done successfully, the resulting ontology or RDF can be fed into an inference engine, which can make new discoveries that are useful for patient treatment procedures or for general healthcare activities.

Ontologies play a key role in describing the semantics of data, both in traditional knowledge engineering and in the emerging Semantic Web. Since an ontology defines the exact nature of every resource in its domain and the relationships among these resources, it becomes much simpler to extract users' needs and usage tendencies.
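To give a feel for what feeding RDF into an inference engine means, here is a minimal forward-chaining sketch over subject-predicate-object triples. It applies only two RDFS-style rules (subclass transitivity and type inheritance), and the triples, predicate names and function name are made up for the example. A production system would use an existing reasoner rather than this loop.

```python
def infer(triples):
    """Forward-chain two RDFS-style rules until no new triple appears."""
    facts = set(triples)
    while True:
        new = set()
        for (a, p, b) in facts:
            for (c, q, d) in facts:
                if b != c:
                    continue
                # A subClassOf B, B subClassOf D  =>  A subClassOf D
                if p == "subClassOf" and q == "subClassOf":
                    new.add((a, "subClassOf", d))
                # x type B, B subClassOf D  =>  x type D
                if p == "type" and q == "subClassOf":
                    new.add((a, "type", d))
        if new <= facts:
            return facts
        facts |= new

# Hypothetical healthcare triples.
triples = [
    ("Influenza", "subClassOf", "ViralInfection"),
    ("ViralInfection", "subClassOf", "InfectiousDisease"),
    ("case42", "type", "Influenza"),
]
closure = infer(triples)
```

From three stated facts the loop derives that `case42` is also an instance of `ViralInfection` and `InfectiousDisease`, which is the kind of implicit knowledge a query can then use.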

3.2 Using semantic dependencies to Mine Depressive Symptoms from Consultation Records

Many psychiatric Web sites have developed psychiatric screening services for mental health care and crisis prevention. People can use these services to consult professionals about depressive symptoms, get a preliminary assessment of the severity of their symptoms, and receive health education via email or other communication media. With current systems, analyzing consultation records and making suggestions takes a lot of the professionals' time, and the Semantic Web can help to solve this problem. A new system should offer a service that first understands what kind of depressive symptoms people are experiencing and the semantic relations between the symptoms; it could then offer further diagnostic and educational services. In [4] a framework is suggested for mining depressive symptoms and their relations from consultation records.

In this framework depressive symptoms are embedded in a single sentence or in a discourse segment (that is, successive sentences describing the same depressive symptom). The Hamilton Depression Rating Scale (HDRS) is used as domain knowledge, and data mining methods are used to identify the symptoms. The mining task is decomposed into two subtasks:

Identify discourse segments by grouping the successive sentences with the same semantic label.
Discover semantic relations that hold between discourse segments.
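The first subtask can be sketched with `itertools.groupby`, which groups consecutive items that share a key. The sketch assumes each sentence already carries a semantic label; how that label is inferred is outside this example, and the sentences and label names are invented.

```python
from itertools import groupby

def discourse_segments(labeled_sentences):
    """Group successive (sentence, label) pairs that share the same label."""
    return [
        (label, [sentence for sentence, _ in group])
        for label, group in groupby(labeled_sentences, key=lambda pair: pair[1])
    ]

# Hypothetical consultation record with one label per sentence.
record = [
    ("I cannot fall asleep.", "insomnia"),
    ("I wake up at 3 am.", "insomnia"),
    ("I feel hopeless.", "depressed_mood"),
    ("Sleep is still bad.", "insomnia"),
]
segments = discourse_segments(record)
```

Only successive sentences are merged, so the last sentence forms its own segment even though an earlier segment has the same label.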

In this framework semantic-dependency, lexical-cohesion, and domain-ontology knowledge sources are integrated to mine depressive symptoms and their relations. To identify the discourse segments, each sentence's semantic dependencies are modeled using a semantic dependency graph (SDG). In the SDG, the head word of each sentence is used to label the sentence. The head word is the central element on which the other elements depend, and a dependency relation is a relation between a word token and its head in the sentence. The semantic dependencies in the SDG provide the significant features for inferring a semantic label for each sentence. Four kinds of semantic relations are discovered among the discourse segments, each signaled by typical cue words:

Cause-effect: because, therefore
Contrast: however, but
Joint: and, also
Temporal sequence: before, after
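As a simple illustration of how such cue words could signal a relation, the sketch below looks up the first cue word that occurs in a segment. This lookup is an assumption made for the example and is much cruder than the framework in [4], which combines several knowledge sources; the names `CUES` and `relation_for` are hypothetical.

```python
import re

# Cue words taken from the list above, mapped to their relation type.
CUES = {
    "because": "cause-effect", "therefore": "cause-effect",
    "however": "contrast", "but": "contrast",
    "and": "joint", "also": "joint",
    "before": "temporal sequence", "after": "temporal sequence",
}

def relation_for(segment_text):
    """Return the relation of the first cue word in the text, else None."""
    for word in re.findall(r"[a-z]+", segment_text.lower()):
        if word in CUES:
            return CUES[word]
    return None

relation = relation_for("Therefore, I stopped going to work.")
```

A segment without any cue word yields `None`, which a fuller system would have to resolve by other means.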

The experiments in [4] show that the framework identifies significant features for the task of mining depressive symptoms and their semantic relations to support interactive psychiatric services. The semantic-dependency structure captures the intra-sentential information, the lexical cohesion captures the inter-sentential information, and the domain ontology models the domain knowledge. Integrating these knowledge sources is a promising approach to the mining task.

3.3 The requirements for ontologies in medical data integration

Information technology is widely adopted in modern medical practice, especially to support digitized equipment, administrative tasks, and data management. Computational techniques, however, do not use much of this medical information in research or practice, because medicine is a knowledge-based discipline that relies greatly on observed similarities rather than on the application of precise rules. In [5] the Health-e-Child (HeC) project is conducted to demonstrate that integrating medical information in novel ways yields immediate benefit for clinical research and practice. It aims to develop an integrated platform for European paediatrics, providing seamless integration of traditional and emerging sources of biomedical information as part of a longer-term vision for large-scale information-based research and training, and informed policy making.

Vertical integration of data means establishing a coherent view of the child's health to which information from each vertical level contributes, from molecular through cellular to individual. Data shared among spatially separated clinicians, and information produced in different departments or multiple hospitals, are brought together for the purpose of creating statistically significant samples, studying population characteristics and sharing knowledge among clinicians. The emphasis of the Health-e-Child requirements process is therefore on "universality of information", and its cornerstone is the integration of information across biomedical abstractions, whereby all layers of biomedical information are 'vertically integrated' to provide a unified view of a child's biomedical and clinical condition.

An ontology is a formal specification of a shared conceptualization. This means that an ontology represents a shared, agreed and detailed model of a problem domain. One advantage of ontologies is their ability to resolve semantic heterogeneity that is present within the data. Ontologies define links between different types of semantic knowledge, and the fact that they are both machine-processable and human-understandable is especially useful in this regard. Many ontologies exist today, especially in the biomedical domain. However, they are often limited to one level of vertical integration, and it would not be sensible to reuse them in their entirety. To build an appropriate ontology for HeC, the relevant parts of the available ontologies are extracted and integrated into a coherent whole, which captures most of the HeC domain; the attributes of HeC that are still missing are modeled separately. The integration process involves identifying similarities between ontologies in order to determine, in a (semi-)automatic manner, which concepts and properties represent similar notions across heterogeneous data samples.

As mentioned above, the use of an ontology and an inference engine can aid query enhancement, which provides clinicians with more targeted information. Use of the ontology makes it possible to take basic queries from users and translate them into more complex, context-aware searches, and it minimizes the load on the system because fewer searches are necessary. Query optimization also assists in this regard: the HeC ontology is used to semantically alter the initial query in order to find a more efficient execution path within the database. Both query enhancement and query optimization are crucial for delivering intuitive data access to clinicians while ensuring the scalability and overall stability of the system.
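Query enhancement of this kind can be sketched as expanding each query term with its subconcepts from the ontology, so that one enhanced search replaces several basic ones. The `SUBCONCEPTS` table and the function name are hypothetical; the actual HeC ontology and query machinery are not described in enough detail here to reproduce them.

```python
# Hypothetical fragment of an ontology: concept -> direct subconcepts.
SUBCONCEPTS = {
    "cardiomyopathy": ["dilated cardiomyopathy", "hypertrophic cardiomyopathy"],
    "arthritis": ["juvenile idiopathic arthritis"],
}

def enhance_query(terms, subconcepts=SUBCONCEPTS):
    """Expand each query term with its subconcepts, keeping order, no repeats."""
    expanded = []
    for term in terms:
        for candidate in [term] + subconcepts.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

query = enhance_query(["cardiomyopathy", "fever"])
```

Terms that the ontology does not know, such as "fever" in this example, pass through unchanged.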

4 Conclusion

This paper surveyed applications of the Semantic Web and data mining in different fields. The applications reviewed show that data mining methods can be very useful for ontology construction, and that the constructed ontology can in turn be used for classification in data mining. The use of ontologies in healthcare has a significant effect and can contribute to a better standard of life.

References

[1] S. Bloehdorn, P. Cimiano, A. Hotho and S. Staab. An Ontology-based Framework for Text Mining. 2004.
[2] V. Milutinovic. Data Mining versus Semantic Web. http://galeb.etf.bg.ac.yu/vm
[3] Weider D. Yu and Soumya R. Jonnalagadda. Semantic Web and Mining in Healthcare.
[4] Chung-Hsien Wu and Liang-Chih Yu. Using Semantic Dependencies to Mine Depressive Symptoms from Consultation Records.
[5] A. Anjum, P. Bloodsworth, A. Branson, T. Hauer, R. McClatchey, K. Munir, D. Rogulin and J. Shamdasani. The Requirements for Ontologies in Medical Data Integration: A Case Study.
