Text Mining through Semi Automatic Semantic Annotation

Nadzeya Kiyavitskaya 1, Nicola Zeni 1, Luisa Mich 2, James R. Cordy 3, and John Mylopoulos 4

1 Dept. of Information and Communication Technology, University of Trento, Italy {nadzeya,
2 Dept. of Computer and Management Sciences, University of Trento, Italy
3 School of Computing, Queen's University, Kingston, Canada
4 Dept. of Computer Science, University of Toronto, Ontario, Canada

Abstract. The Web is the greatest information source in human history. Unfortunately, mining knowledge out of this source is a laborious and error-prone task. Many researchers believe that a solution to the problem can be founded on semantic annotations that need to be inserted in web-based documents and guide information extraction and knowledge mining. In this paper, we further elaborate a tool-supported process for semantic annotation of documents based on techniques and technologies traditionally used in software analysis and reverse engineering for large-scale legacy code bases. The outcomes of the paper include an experimental evaluation framework and empirical results based on two case studies adopted from the Tourism sector. The conclusions suggest that our approach can facilitate the semi-automatic annotation of large document bases.

Keywords: semantic annotation, large-scale document analysis, conceptual schemas, software analysis.

1 Introduction

The Web is the greatest information source in human history. Unfortunately, mining knowledge out of this source is a laborious and error-prone task, much like looking for the proverbial needle in a haystack. Many researchers believe that a solution to the problem can be founded on semantic annotations that need to be inserted in web-based documents and guide information extraction and knowledge mining. Such annotations use terms defined in an ontology.
We are interested in mining knowledge from the Web, and we use semantic annotations as the basis for that mining. However, adding semantic annotations to documents is also a laborious and error-prone task. To help the annotator, we are developing tools that facilitate the annotation process by making a first pass at the documents, inserting annotations on the basis of textual patterns. The annotator can then make a second pass, manually improving the annotations.

The main objective of this paper is to present a tool-supported methodology that semi-automates the semantic annotation process for a set of documents with respect to a semantic model (ontology or conceptual schema). In this work we propose to approach the problem using highly efficient methods and tools proven effective in the software analysis domain for processing billions of lines of legacy software source code [2]. In fact, document analysis for the Semantic Web and software code analysis have striking similarities in their needs:

• robust parsing techniques, given that real documents rarely match given grammars;
• a semantic understanding of source text, on the basis of a semantic model;
• semantic clues drawn from a vocabulary associated with the semantic model;
• contextual clues drawn from the syntactic structure of the source text;
• inferred semantics from exploring relationships between identified semantic entities and their properties, contexts and other related entities.

On the basis of these considerations, we have adapted software analysis techniques to the more general problem of semantic annotation of text documents. Our initial hypothesis is that these methods can attain the same scalability for analysis of textual documents as for software code analysis. In this work we extend and generalize the process and architecture of the prototype semantic annotation tool presented earlier in [3].
The contribution of this work also includes an evaluation framework for semantic annotation tools, as well as two real-world case studies: accommodation advertisements and Tourist Board web sites. For the first experiment, we use a small conceptual schema derived from a set of user queries. For the second experiment, we adopt more elaborate conceptual schemas reflecting a richer semantic domain. Our evaluation of both applications uses a three-stage evaluation framework which takes into account:

• standard accuracy measures, such as Recall, Precision, and F-measure;
• productivity, i.e. the fraction of time spent for annotation when the human is assisted by our tool vs. time spent for manual annotation "from scratch"; and
• a calibration technique which recognizes that there is no such thing as "correct" and "wrong" annotations, as human annotators also differ among themselves on how to annotate a given document.

The rest of the paper is organized as follows. Our proposed annotation process and the architecture of our semantic annotation system are introduced in section 2. The two case studies are presented in section 3, and section 4 describes the evaluation setup and experimental results. Section 5 provides a short comparative overview of semantic annotation tools and conclusions are drawn in section 6.

2 Methodology

Our method for semantic annotation of documents uses the generalized parsing and structural transformation system TXL [4], the basis of the automated Year 2000 system LS/2000 [5]. TXL is a programming language specially designed to allow by-example rapid prototyping of language descriptions, tools and applications. The system accepts as input a grammar and a document, generates a parse tree for the input document, and applies transformation rules to generate output in a target format. The architecture of our solution (Fig. 1) is based on the LS/2000 software analysis architecture, generalized to allow for easy parameterization by a range of semantic domains.

Fig. 1. Architecture of our semantic annotation process. (Flow: Documents, Parse, Preparsed documents, Markup, Annotated documents, Map, Database of annotations. Component labels, grouped in the figure as domain independent or domain dependent: object grammars, phrase grammar, XML grammar, annotation schema (category indicators), schema grammar, database schema.)

The architecture explicitly factors out reusable domain independent knowledge such as the structure of basic entities (e-mail and web addresses, dates, and other word-equivalent objects) and language structures (document, paragraph, sentence and phrase structure), shown on the left hand side, while allowing for easy change of semantic domain, characterized by vocabulary (category word and phrase lists) and semantic model (entity-relationship schema and interpretation), shown on the right.

The process consists of three phases. In the first stage, an approximate ambiguous context-free grammar is used to efficiently obtain an approximate phrase structure parse of the source text using the TXL parsing engine. Using robust parsing techniques borrowed from compiler technology [6], this stage results in a deterministic maximal parse. As part of this first stage, basic entities are recognized. The parse is linear in the length of the input and runs at compiler speeds.

In the second stage, initial semantic annotation of the document is derived using a wordlist file specifying both positive and negative indicators for semantic categories. Indicators can be both literal words and phrases and names of parsed entities. Phrases are marked up once for each category they match; thus, at this stage a sentence or phrase may end up with many different semantic markups. Vocabulary lists are derived from the semantic model for the target domain.
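As an illustration only, the second-stage markup can be approximated outside TXL. The sketch below is written in Python rather than TXL and is not the authors' implementation. The category names follow the paper's examples, but the indicator words, the `INDICATORS` table, and the helper names are hypothetical, and the rule that a negative indicator vetoes a category is an assumption about how positive and negative indicators interact.

```python
import re
from typing import Dict, List
from xml.sax.saxutils import escape

# Hypothetical wordlist: per category, words that suggest it ("positive")
# and words that rule it out ("negative"). Invented for illustration.
INDICATORS: Dict[str, Dict[str, List[str]]] = {
    "Accommodation": {
        "positive": ["hotel", "apartment", "room"],
        "negative": ["chat room"],
    },
    "FoodAndRefreshment": {
        "positive": ["restaurant", "dinner", "wine"],
        "negative": [],
    },
}


def contains(phrase: str, indicator: str) -> bool:
    """True if the indicator occurs in the phrase as a whole word or word sequence."""
    pattern = r"\b" + re.escape(indicator) + r"\b"
    return re.search(pattern, phrase, flags=re.IGNORECASE) is not None


def categories_for(phrase: str) -> List[str]:
    """Categories with at least one positive and no negative indicator (assumed rule)."""
    matched = []
    for category, lists in INDICATORS.items():
        has_positive = any(contains(phrase, w) for w in lists["positive"])
        has_negative = any(contains(phrase, w) for w in lists["negative"])
        if has_positive and not has_negative:
            matched.append(category)
    return matched


def mark_up(phrase: str) -> List[str]:
    """One XML-style markup per matching category, so a phrase may get several."""
    text = escape(phrase)  # escape &, < and > so the output stays well-formed
    return [f"<{c}>{text}</{c}>" for c in categories_for(phrase)]


if __name__ == "__main__":
    for line in mark_up("Dinner at the hotel restaurant"):
        print(line)
```

A phrase that matches two categories yields two markups, mirroring the statement that a phrase is marked up once for each category it matches.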
The second stage uses the structural pattern matching and source transformation capabilities of the TXL engine, in the same way as for software markup, to yield a preliminary marked-up text in XML form.

The third stage uses the XML marked-up text to populate an XML database schema, derived from the semantic model for the target domain. Sentences and phrases with multiple markups are cloned using TXL source transformation to appear as multiple copies, one for each different markup, before populating the database. In this way we do not prejudice one interpretation as being preferred.

The outputs of our process are both the XML marked-up text and the populated database. The populated database can be queried by a standard SQL database engine.

3 Experimental Case Studies

Our case studies involve two applications in the Tourism area. Tourism is a very broad sector of the economy which comprises many heterogeneous domains: accommodation and eating facilities, sports, means of transport, historical sites, tourist attractions, medical services and other areas of human activity. Information available from heterogeneous data sources must be integrated in order to allow effective interoperability of tourism information systems and to enable knowledge mining for the variety of roles and services that characterize such a compound sector (e.g. composition of services for tourist packages). This is where semantic annotations come in handy.

3.1 Accommodation Ads

As a first full experiment in the application of our new method, we have been working in the domain of travel documents, and in particular with published advertisements for accommodation. This domain is typical of the travel domain in general and poses many problems commonly found in other text markup problems, such as: partial and malformed sentences; abbreviations and short-forms; location-dependent vocabulary; monetary units; date and time conventions, and so on.
In the first case study we used a set of several hundred advertisements for accommodation in Rome drawn from an online newspaper. The task was to identify and mark up the categories of semantic information in the advertisements according to a given accommodation conceptual schema (Fig. 2), which was reduced by hand to an XML schema for input to our system. The desired result was a database with one instance of the schema for each advertisement in the input, and the marked-up original advertisements. To adapt our semantic annotation methodology to this experiment, the domain-related wordlists were constructed by hand from a set of examples.

Fig. 2. Conceptual schema for accommodation ads.

3.2 Tourist Board Web Pages

In the second case study we pursued two main goals: to demonstrate the generality of our method over different domains, and to verify the scalability of our approach on a richer semantic model and larger natural language documents. For this purpose, we considered the web sites of Tourist Boards in the province of Trentino (Italy)¹ as input documents. In contrast to the classified ads, this domain presents a number of specific problematic issues: free unrestricted vocabulary; differently structured text; a rich semantic model covering the content of web sites.

This experiment was run in collaboration with the marketing experts of the etourism² group of the University of Trento. From the point of view of the tourism marketing experts, the high-level business goal of this case study was to assess the communicative efficacy of the web sites based on content quality or informativity, that is, how comprehensively the web site covers relevant topics according to the strategic goals of the Tourist Board. In order to assess the communicative efficacy, we performed semantic annotation of the web pages, revealing the presence of information important for a Tourist Board web site to be effective.
The list of semantic categories and their descriptions was provided by the tourism experts (Fig. 3).

Geography: Climate; Weather predictions; Land Formation; Lakes and Rivers; Landscape
Local products: Local handcrafting; Agricultural products; Gastronomy
Culture: Traditions and customs; Local history; Festivals; Population; Cultural institutions and associations; Libraries; Cinemas; Local literature; Local prominent people
Artistic Heritage: Places to visit: museums, castles; Tickets, entrance fees, guides
Sport: Sporting events; Sport infrastructure; Sport disciplines
Accommodation: Places to stay; How to book; How to arrive; Prices; Availability
Food and refreshment: Places to eat; Dishes; Degustation; Time tables; How to book
Wellness: Wellness centers; Wellness services
Services: Transport, schedules; Information offices; Terminals, stations, airports; Travel agencies

Fig. 3. Relevant categories for communicative efficacy of a Tourist Board web site.

In this second experiment, we adapted our annotation framework to the new domain by replacing the domain-dependent components with respect to this specific task. For this purpose, the initial rough schema provided by the domain experts was transformed into a richer conceptual schema consisting of about 130 concepts systematized into a hierarchy and connected by semantic relations (see the partial view in Fig. 4³).

Fig. 4. A slice of the conceptual schema showing semantic (placement in the hierarchy, relationships, attributes) and syntactic (keywords or patterns) information associated with concepts. This view shows only is-a relations, because this type of relation is essential in guiding the annotation process. The complete model includes many more relations apart from taxonomical ones.

Domain dependent vocabulary was derived semi-automatically, expanding concept definitions with the synonyms provided by the WordNet⁴ database and on-line Thesaurus⁵ and mined from a set of sample documents.
The total number of keywords collected was 507, and an additional four object patterns were re-used from the previous application to detect such entities as monetary amounts, e-mails, web addresses and phone numbers.

To begin this experiment we downloaded the English version of 13 Tourist Board web sites using offline browser software⁶. For some of them (which are generated dynamically) we had to apply a manual screen-scraping technique. Then two human annotators and the tool were given text fragments for annotation. The required result was a database with one instance of the schema for each Tourist Board web site, and the marked-up original text (Fig. 5).

<FoodAndRefreshment>Bread and wine snack in the shade of an elegant park.</FoodAndRefreshment>
<FoodAndRefreshment>Dinner at the La Luna Piena restaurant, consisting of the Il Piatto del Vellutaio</FoodAndRefreshment>
<ArtisticHeritage>Museo del Pianoforte Antico: guided visit and concert proposed within the Museum Nights programme on the 3, 10, 17 and 24 of August.</ArtisticHeritage>

Fig. 5. Example of XML-marked up content of a tourism web site.

³ The visualization tool RDFGravity:
⁶ SurfOffline 1.4:

4 Experimental Evaluation

4.1 Evaluation Framework

The performance of semantic annotation tools is usually evaluated similarly to information extraction systems, i.e. by comparing with a reference correct markup and calculating recall and precision metrics. In order to evaluate our initial experimental results, we designed a three-stage validation process.
At each stage, we calculated a number of metrics [7] for the tool's automated markup compared to manually-generated annotations:

• Recall evaluates how well the tool performs in finding relevant items;
• Precision shows how well the tool performs in not returning irrelevant items;
• Fallout measures how quickly precision drops as recall is increased;
• Accuracy measures how well the tool identifies relevant items and rejects irrelevant ones;
• Error rate demonstrates how much the tool is prone to accept irrelevant items and reject relevant ones;
• F-measure is a harmonic mean of recall and precision.

In the first step of our evaluation framework, we compare the system output directly with manual annotations. We expect that the quality of manual annotations constitutes an upper bound for automatic document analysis. Of course, this type of evaluation can't be applied on a large scale for cost reasons.

In the second step, we check whether the use of the automatic tool increases the productivity of human annotators. We recorded the time used for manual annotation of the original textual documents and compared it to the time used for manual correction of the automatically annotated documents. The percentage difference of these two measures shows how much time can be saved when the tool assists the human annotator.

Finally, in our third step we take into account disagreement between annotators to interpret the automatically obtained annotations. Then, we compare system results against the final human markup made by correcting the automatically generated markup.

4.2 Experimental Results

Experiment 1: Accommodation Ads. The details of our evaluation for the accommodation ads application can be found in [2]. Here we only note that, as a result of this first experiment, even without local knowledge and using a very small vocabulary and only a few TXL rules, we obtained results comparable to some of the best heavyweight methods, albeit on a very limited domain.
The performance of our untuned experimental tool was also already very fast, handling for example 100 advertisements in about 1 second on a 1 GHz PC.

Experiment 2: Tourist Board Web Pages. As the semantic model in this experiment was fairly extensive, we could not afford to have human annotators properly handle all of the entities of the rich domain schema. Accordingly, in our evaluation we considered only general categories in the annotation schema (Geography, Sport, Culture, Artistic Heritage, Local Products, Wellness, Accommodation, Food and Refreshment, Services). For these we performed simple metrics-based validation (Tables 1a, b, c) and calibration of the results taking into account inter-annotator disagreement (Table 2) for the entire set of paragraphs.

Table 1a. Evaluating system annotation vs. human Annotator 1.
Topics: Geography, Artistic Heritage, Sport, Local Products, Culture, Accommodation, Food & Refreshment, Wellness, Service. Measures: Recall, Precision, Fallout, Accuracy, Error, F-measure.

Table 1b. Evaluating system annotation vs. human Annotator 2.
Topics: Geography, Artistic Heritage, Sport, Local Products, Culture, Accommodation, Food & Refreshment, Wellness, Service. Measures: Recall, Precision, Fallout, Accuracy, Error, F-measure.

Table 1c. Evaluating system annotation vs. humans: average category scores.
Columns: Tool vs. A1, Tool vs. A2. Measures: Recall, Precision, Fallout, Accuracy, Error, F-measure.

Table 2. Comparing system results vs. human annotators.
Columns: A2 vs. A1, Tool vs. A1, A1 vs. A2, Tool vs. A2. Measures: Recall, Precision, Fallout, Accuracy, Error, F-measure.

As shown in Table 2, for the given annotation schema the task turned out to be difficult both for the system and for the humans due to the vague definitions of the semantic categories. For example, text about local food may be associated with either or both of the Local Products category and the Food and Refreshment category, depending on the context. Explicit resolution of such ambiguities in the expert definition would improve the results.
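For readers who want to recompute the six measures named in Section 4.1, the sketch below derives them from the four confusion counts for one category. The formulas are the standard information retrieval definitions and are assumed here rather than quoted from the paper or from [7]; the `Counts` type, the function names, and the example numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Counts:
    """Confusion counts for one category, tool output vs. a reference markup."""
    tp: int  # items the tool annotated that the reference also annotated
    fp: int  # items the tool annotated that the reference did not
    fn: int  # items the reference annotated that the tool missed
    tn: int  # items neither the tool nor the reference annotated


def ratio(numerator: float, denominator: float) -> float:
    """Safe division: returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def measures(c: Counts) -> Dict[str, float]:
    """Standard definitions (assumed) of the six measures used in the evaluation."""
    total = c.tp + c.fp + c.fn + c.tn
    recall = ratio(c.tp, c.tp + c.fn)
    precision = ratio(c.tp, c.tp + c.fp)
    return {
        "recall": recall,
        "precision": precision,
        "fallout": ratio(c.fp, c.fp + c.tn),
        "accuracy": ratio(c.tp + c.tn, total),
        "error": ratio(c.fp + c.fn, total),
        "f_measure": ratio(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    # Invented counts, purely to exercise the formulas.
    for name, value in measures(Counts(tp=8, fp=2, fn=4, tn=86)).items():
        print(f"{name}: {value:.3f}")
```

With these definitions, accuracy and error always sum to one, and the F-measure lies between recall and precision.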
In interpreting the results of this case study, we must also take into account that the diversity in accuracy metrics is partially caused by the different experience of the annotators in the tourism area. If we compare the difference in F-measure scores, as the most aggregate characteristic, the overall difference in performance between the system and the humans is approximately 10%.

In the second stage of evaluation, the human annotators were observed to use 72% less time to correct automatically annotated text than they spent on their original unassisted annotations.

In the third stage, the human annotators corrected the automatically marked-up documents; the results of comparison to the final human markup are given in Tables 3a, b, c and the calibration to human performance in Table 4.

Table 3a. Evaluating system annotation vs. human Annotator 1 as assisted by the tool.
Topics: Geography, Artistic Heritage, Sport, Local Products, Culture, Accommodation, Food & Refreshment, Wellness, Service. Measures: Recall, Precision, Fallout, Accuracy, Error, F-measure.

Table 3b. Evaluating system annotation vs. human Annotator 2 as assisted by the tool.
Topics: Geography, Artistic Heritage, Sport, Local Products, Culture, Accommoda