A corpus-based approach for the induction of ontology lexica

Sebastian Walter, Christina Unger, and Philipp Cimiano
Semantic Computing Group, CITEC, Bielefeld University

Abstract. While there are many large knowledge bases (e.g. Freebase, Yago, DBpedia) as well as linked data sets available on the web, they typically lack lexical information stating how the properties and classes are realized lexically. If at all, typically only one label is attached to these properties, thus lacking any deeper syntactic information, e.g. about syntactic arguments and how these map to the semantic arguments of the property, as well as about possible lexical variants or paraphrases. While there are lexicon models such as lemon that allow a lexicon to be defined for a given ontology, the cost involved in creating and maintaining such lexica is substantial, requiring a high manual effort. Towards lowering this effort, in this paper we present a semi-automatic approach that exploits a corpus to find occurrences in which a given property is expressed, generalizing over these occurrences by extracting dependency paths that can be used as a basis to create lemon lexicon entries. We evaluate the resulting automatically generated lexica with respect to DBpedia as dataset and Wikipedia as corresponding corpus, both in an automatic mode, by comparing to a manually created lexicon, and in a semi-automatic mode in which a lexicon engineer inspected the results of the corpus-based approach, adding them to the existing lexicon where appropriate.

Keywords: ontology lexicalization, corpus-based approach, lemon

1 Introduction

The structured knowledge available on the web is increasing. The Linked Data Cloud, consisting of a large amount of interlinked RDF datasets, has been growing steadily in recent years, now comprising more than 30 billion RDF triples. Popular and huge knowledge bases exploited for various purposes are Freebase, DBpedia, and Yago.
Search engines such as Google are by now also collecting and exploiting structured data, e.g. in the form of knowledge graphs that are used to enhance search results. As the amount of structured knowledge available keeps growing, intuitive and effective paradigms for accessing and querying this knowledge become more and more important. An appealing way of accessing this growing body of knowledge is through natural language. In fact, in recent years several researchers have developed question answering systems that provide access to the knowledge in the Linked Open Data Cloud (e.g. [8], [13], [14], [2]). Further, there have been some approaches to applying natural language generation techniques to RDF in order to verbalize the knowledge contained in RDF datasets (e.g. [10], [12], [4]). All such systems require knowledge about how properties, classes and individuals are verbalized in natural language. The lemon model [9] has been developed for the purpose of creating a standard format for publishing such lexica as RDF data. However, the creation of lexica for large ontologies and knowledge bases such as the ones mentioned above involves a high manual effort. Towards reducing the costs involved in building such lexica, we propose a corpus-based approach which is capable of automatically inducing an ontology lexicon given a knowledge base or ontology and an appropriate (domain) corpus. Our approach is intended to be deployed in a semi-automatic fashion, proposing a set of lexical entries for each property and class which are then validated by a lexicon engineer, e.g. using a web interface such as lemon source. As an example, consider the property dbpedia:spouse as defined in DBpedia. In order to be able to answer natural language questions such as "Who is Barack Obama married to?",
we need to know the different lexicalizations of this property, such as to be married to, to be the wife of, and so on. Our approach is able to find such lexicalizations on the basis of a sufficiently large corpus. The approach relies on the fact that many existing knowledge bases are populated with instances, i.e. with triples relating entities through properties such as dbpedia:spouse. Our approach uses such triples, e.g. (dbpedia:Barack_Obama, dbpedia:spouse, dbpedia:Michelle_Obama), to find occurrences in a corpus where both entities, the subject and the object, are mentioned in one sentence. On the basis of these occurrences, we use a dependency parser to parse the relevant context and generate a set of lexicalized patterns that very likely express the property or class in question.

The paper is structured as follows: in Section 2 we present the general approach, distinguishing the case of inducing lexical entries for properties from that of inducing entries for classes. The evaluation of our approach with respect to 80 pseudo-randomly selected classes and properties is presented in Section 3. Before concluding, we discuss some related work in Section 4.

2 Approach

Our approach (available at https://github.com/swalter2/knowledgelexicalisation) is summarized in Figure 1. The input is an ontology and the output is a lexicon in lemon format for the input ontology. In addition, it relies on an RDF knowledge base as well as a (domain) corpus.

Fig. 1: System overview

The processing differs for properties and classes. In what follows, we describe the processing of properties; the processing of classes, which does not rely on the corpus, is explained in Section 2.5. For each property to be lexicalized, all triples from the knowledge base containing this property are retrieved.
The labels of the subject and object entities of these triples are then used to search the corpus for sentences in which both occur. Based on a dependency parse of these sentences, patterns are extracted that serve as the basis for the construction of lexical entries. In the following, we describe each of the steps in more detail.

2.1 Triple retrieval

Given a property, the first step consists in extracting from the RDF knowledge base all triples containing that property. In the case of DBpedia, for the property dbpedia:spouse, for example, the retrieved triples include the following:

(resource:Barack_Obama, dbpedia:spouse, resource:Michelle_Obama)
(resource:Alexandra_of_Denmark, dbpedia:spouse, resource:Edward_VII)
(resource:Hilda_Gadea, dbpedia:spouse, resource:Che_Guevara)
(resource:Mel_Ferrer, dbpedia:spouse, resource:Audrey_Hepburn)

Throughout the paper we use the prefixes dbpedia and resource for http://dbpedia.org/ontology/ and http://dbpedia.org/resource/, respectively.

2.2 Sentence extraction and parsing

For each triple (s, p, o) extracted for a property p, we retrieve all sentences from the domain corpus in which the labels of both entities s and o occur. This step is performed relying on an inverted index. An example sentence extracted from Wikipedia for the subject/object pair Barack Obama and Michelle Obama is the following:

The current First Lady is Michelle Obama, wife of Barack Obama.

Each of the retrieved sentences is parsed with the pre-trained Malt dependency parser. In order to avoid parsing errors, entity occurrences are replaced with a single word; for example, Queen Silvia of Sweden is replaced with QueenSilviaofSweden, which ensures that it is tagged as a named entity. Once a sentence has been parsed, the dependency parse is added to the index, which speeds up processing when the same sentence is retrieved again later. From the dependency parses, we extract all paths that connect the entities in question.
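The path extraction step can be sketched as a breadth-first search over the dependency graph. This is a minimal stand-in, not the authors' implementation: the edge list and its attachments below are simplified for illustration and do not reproduce the actual Malt parser output.

```python
from collections import deque

def shortest_dependency_path(edges, start, end):
    """BFS over an undirected view of a dependency graph.

    edges: list of (head, dependent, relation) triples.
    Returns the alternating token/relation sequence connecting
    start and end, or None if the tokens are not connected.
    """
    adj = {}
    for head, dep, rel in edges:
        adj.setdefault(head, []).append((dep, rel))
        adj.setdefault(dep, []).append((head, rel))
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == end:
            return path
        for neighbor, rel in adj.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [rel, neighbor]))
    return None

# Simplified dependency edges for
# "The current First Lady is MichelleObama, wife of BarackObama."
edges = [
    ("is", "Lady", "nsubj"),
    ("is", "MichelleObama", "root"),   # simplified attachment
    ("MichelleObama", "wife", "appos"),
    ("wife", "of", "prep"),
    ("of", "BarackObama", "pobj"),
]
print(shortest_dependency_path(edges, "MichelleObama", "BarackObama"))
# → ['MichelleObama', 'appos', 'wife', 'prep', 'of', 'pobj', 'BarackObama']
```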
For the sentence above, for example, the following path connecting Barack Obama and Michelle Obama is found (via the dependency arcs appos, prep and pobj):

MichelleObama -appos-> wife -prep-> of -pobj-> BarackObama

2.3 Pattern generation, postprocessing and filtering

On the basis of the discovered dependency paths, patterns are generated by abstracting over the specific entities occurring in the parse. The above-mentioned path would, for instance, be generalized to:

x -appos-> wife -prep-> of -pobj-> y

In addition, the generalized patterns are postprocessed, e.g. by removing determiners such as "the". To avoid unnecessary noise, only patterns with a length of at least three but not more than six tokens are accepted. Also, if the entities x or y are related to another token by nn, i.e. are modifiers, the pattern is not considered. Additional processing steps, such as subsuming similar patterns under a single one, are planned but not yet implemented. Finally, for each property we compute the relative frequency of the found patterns, i.e. the number of sentences that yielded a certain pattern in relation to the overall number of sentences for that property. We then consider only those patterns that occur at least twice and surpass a certain threshold, which is determined empirically in Section 3.3 below.

2.4 Generation of lexical entries

All patterns found by the above process whose relative frequency is above a given threshold are then transformed into a lexical entry in lemon format.
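The frequency-based filtering at the end of the pattern step can be sketched as follows. The function name and the toy sentence data are illustrative, not the authors' implementation:

```python
from collections import Counter

def filter_patterns(pattern_per_sentence, threshold):
    """Keep patterns that occur at least twice and whose relative
    frequency (sentences yielding the pattern divided by all sentences
    retrieved for the property) surpasses the threshold."""
    counts = Counter(pattern_per_sentence)
    total = len(pattern_per_sentence)
    return {p: c / total for p, c in counts.items()
            if c >= 2 and c / total > threshold}

# One extracted pattern per retrieved sentence for dbpedia:spouse (toy data)
sentences = ["x wife of y"] * 6 + ["x married y"] * 3 + ["x met y"]
print(filter_patterns(sentences, 0.2))
# → {'x wife of y': 0.6, 'x married y': 0.3}
```

The singleton pattern "x met y" is dropped by the occurs-at-least-twice condition regardless of the threshold.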
For instance, the above-mentioned pattern is stored as the following entry:

:wife a lemon:LexicalEntry ;
   lexinfo:partOfSpeech lexinfo:noun ;
   lemon:canonicalForm [ lemon:writtenRep "wife"@en ] ;
   lemon:synBehavior [ rdf:type lexinfo:NounPPFrame ;
                       lexinfo:copulativeArg :x_appos ;
                       lexinfo:prepositionalObject :y_pobj ] ;
   lemon:sense [ lemon:reference <http://dbpedia.org/ontology/spouse> ;
                 lemon:subjOfProp :x_appos ;
                 lemon:objOfProp :y_pobj ] .

:y_pobj lemon:marker [ lemon:canonicalForm [ lemon:writtenRep "of"@en ] ] .

This entry comprises a part of speech (noun), a canonical form (the head noun wife), a sense referring to the property spouse in the ontology, and a syntactic behavior specifying that the noun occurs with two arguments: a copulative argument corresponding to the subject of the property, and a prepositional object corresponding to the object of the property, accompanied by the marker of. From a standard lexical point of view this syntactic behavior might look odd: rather than being elements that are locally selected by the noun, the specified arguments should be seen as elements that occur in a prototypical syntactic context of the noun. They are explicitly named because it would otherwise be impossible to specify the mapping between syntactic and semantic arguments. The specific subcategorization frame is determined by the kind of dependency relations that occur in the pattern. Currently, our approach covers nominal frames (e.g. activity and wife of), transitive verb frames (e.g. loves), and adjectival frames (e.g. Spanish).

2.5 Lexicalization of classes

The lexicalization process for classes differs from that for properties in that the corpus is not used. Instead, for each class in the ontology, its label is extracted as lexicalization. In order to also find alternative lexicalizations, we consult WordNet for synonyms. For example, for the class http://dbpedia.org/ontology/Activity with label activity, we find the additional synonym action, leading to the following two entries in the lemon lexicon (as linguistic ontology we use ISOcat, http://isocat.org; in the examples, however, we use the LexInfo vocabulary, http://www.lexinfo.net/ontology/2.0/lexinfo.owl, for better readability):

:activity a lemon:LexicalEntry ;
   lexinfo:partOfSpeech lexinfo:noun ;
   lemon:canonicalForm [ lemon:writtenRep "activity"@en ] ;
   lemon:sense [ lemon:reference <http://dbpedia.org/ontology/Activity> ] .

:action a lemon:LexicalEntry ;
   lexinfo:partOfSpeech lexinfo:noun ;
   lemon:canonicalForm [ lemon:writtenRep "action"@en ] ;
   lemon:sense [ lemon:reference <http://dbpedia.org/ontology/Activity> ] .

These entries specify a part of speech (noun), together with a canonical form (the class label or its synonym) and a sense referring to the class URI in the ontology.

3 Evaluation

In this section, we describe the methodology used in our evaluation as well as the evaluation measures, followed by a presentation and discussion of the results. Note that we evaluate our methodology in terms of how well it can support the creation of a lexicon. The extracted patterns could of course also be used to find new instances of a relation within an information extraction paradigm; however, an evaluation of this potential use is out of the scope of the current paper.

3.1 Methodology and dataset

We evaluate our approach in two modes: fully automatic and semi-automatic. In the automatic mode, we evaluate the results of our corpus-based lexicon induction method by comparing the automatically generated lexicon with a manually constructed lexicon for DBpedia. The manually constructed lexicon was created by two persons not directly involved in the development and evaluation of the approach presented in this paper.
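The class lexicalization step can be sketched as follows. The SYNONYMS dictionary is a hypothetical stand-in for a real WordNet lookup (e.g. via NLTK's wordnet corpus); the function name is illustrative:

```python
# Hypothetical stand-in for a WordNet synonym lookup
SYNONYMS = {"activity": ["action"]}

def lexicalize_class(class_uri, label):
    """Return (lemma, part_of_speech, reference) tuples for a class:
    the class label itself plus any synonyms found for it."""
    lemmas = [label] + SYNONYMS.get(label, [])
    return [(lemma, "noun", class_uri) for lemma in lemmas]

entries = lexicalize_class("http://dbpedia.org/ontology/Activity", "activity")
print(entries)
# → [('activity', 'noun', 'http://dbpedia.org/ontology/Activity'),
#    ('action', 'noun', 'http://dbpedia.org/ontology/Activity')]
```

Each resulting tuple corresponds to one lemon entry of the shape shown above.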
In particular, these lexicon engineers did not have access to the results of the algorithm proposed here when creating their lexica. For the evaluation in the semi-automatic mode, the above-mentioned lexicon engineers and one of the authors inspected the automatically generated lexica and added every lexical entry that was appropriate and missing to their manually created lexicon. In this evaluation mode we thus compare the automatically generated lexicon with a superset of the manually constructed lexica. By this, we do not penalize our approach for finding lexical entries that are correct but not contained in the manually constructed lexicon, which represents a fair evaluation with respect to the targeted setting, in which a lexicon engineer validates the automatically constructed lexical entries. For the purposes of evaluation, we selected a training set for parameter tuning and a test set for evaluation, each consisting of 10 DBpedia classes and 30 DBpedia properties, in a largely pseudo-random fashion, in the sense that we randomly selected properties from different frequency ranges, from properties with very few instances to properties with many instances. We then filtered out those that turned out either to have no instances (leaving in only one empty property per set, meltingPoint and sublimationPoint, in order to be able to evaluate possible fallback strategies) or not to have an intuitive lexicalization, e.g. espnId. The properties selected for training have instances ranging from 15 upwards, while the properties in the test set have instances ranging from 9 upwards. The training and test sets are also used in the ontology lexicalization task of the QALD-3 challenge at CLEF. We use the training set to determine the threshold, and then evaluate the approach on the unseen properties in the test set.
3.2 Evaluation measures

For each property, we evaluate the automatically generated lexical entries by comparing them to the manually created lexical entries along two dimensions: i) lexical precision, lexical recall and lexical F-measure, and ii) lexical accuracy. Along the first dimension, we evaluate how many of the gold standard entries for a property are generated by our approach (recall), and how many of the automatically generated entries are among the gold standard entries (precision), where two entries count as the same lexicalization if their lemma, part of speech and sense coincide. Lexical precision P_lex and recall R_lex for a property p are thus defined as follows:

P_lex(p) = |entries_auto(p) ∩ entries_gold(p)| / |entries_auto(p)|

R_lex(p) = |entries_auto(p) ∩ entries_gold(p)| / |entries_gold(p)|

where entries_auto(p) is the set of entries for the property p in the automatically constructed lexicon, and entries_gold(p) is the set of entries for p in the manually constructed gold lexicon. The F-measure F_lex(p) is then defined as the harmonic mean of P_lex(p) and R_lex(p), as usual.

The second dimension, lexical accuracy, is necessary in order to evaluate whether the specified subcategorization frame and its arguments are correct, and whether these syntactic arguments have been mapped correctly to the semantic arguments (domain and range) of the property in question. The accuracy of an automatically generated lexical entry l_auto for a property p w.r.t.
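The lexical precision, recall and F-measure defined above can be computed directly over sets of (lemma, part of speech, sense) triples. The function name and the example entries below are illustrative:

```python
def lexical_prf(entries_auto, entries_gold):
    """Lexical precision, recall and F-measure over two sets of entries,
    where an entry is identified by (lemma, part of speech, sense)."""
    overlap = len(entries_auto & entries_gold)
    p = overlap / len(entries_auto) if entries_auto else 0.0
    r = overlap / len(entries_gold) if entries_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0   # harmonic mean
    return p, r, f

auto = {("wife", "noun", "dbpedia:spouse"),
        ("married", "verb", "dbpedia:spouse"),
        ("met", "verb", "dbpedia:spouse")}
gold = {("wife", "noun", "dbpedia:spouse"),
        ("married", "verb", "dbpedia:spouse"),
        ("husband", "noun", "dbpedia:spouse")}
p, r, f = lexical_prf(auto, gold)
print(p, r, round(f, 3))
```

Here two of the three generated entries are in the gold standard, and two of the three gold entries are found, so precision and recall are both 2/3.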
the corresponding gold standard entry l_gold is therefore defined as:

A_p(l_auto) = ( frameeq(l_auto, l_gold)
              + |args(l_auto) ∩ args(l_gold)| / |args(l_gold)|
              + ( Σ_{a ∈ args(l_auto)} map(a) ) / |args(l_auto)| ) / 3

where frameeq(l_1, l_2) is 1 if the subcategorization frame of l_1 is the same as that of l_2 and 0 otherwise, where args(l) returns the syntactic arguments of l's frame, and where map(a) is 1 if a in l_auto has been mapped to the same semantic argument of p as in l_gold, and 0 otherwise. When comparing the argument mapping of the automatically generated entry with that of the gold standard entry, we only consider the class of the argument, i.e. whether it is a subject or an object. This abstracts from the specific type of subject (e.g. copulative subject) and object (e.g. indirect object, prepositional object, etc.) and therefore allows for an evaluation of the argument mappings independently of the correctness of the frame and frame arguments. The lexical accuracy A_lex(p) for a property p is then computed as the mean of the accuracy values of all generated lexicalizations. All measures are computed for each property and then averaged over all properties; in the sections below, we report only the average values.

3.3 Results and discussion

Figure 2a shows the results for the 30 training properties in automatic mode in terms of P_lex, R_lex, and F_lex, depending on the threshold. Accuracy is not plotted, as it is not influenced by the threshold. Neither are results for the classes, as they also do not vary with it (recall is 0.73, precision is 0.55, and accuracy 0.9). The threshold value is the likelihood that a specific pattern occurs given all the sentences expressing the property in question. On the basis of these results, we identify two values of interest: a low value around 0.2, which leads to high recall, and a high value around 5.0, which results in a drop in recall but an increase in precision.
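The accuracy formula can be sketched as follows. The function and argument names are illustrative; the example argument sets use LexInfo-style labels:

```python
def accuracy(frame_eq, args_auto, args_gold, correct_mappings):
    """Lexical accuracy A_p of one generated entry: the average of frame
    identity, argument overlap, and the fraction of the entry's
    arguments mapped to the correct semantic argument.

    frame_eq: 1 if the subcategorization frames match, else 0
    args_auto / args_gold: sets of syntactic arguments
    correct_mappings: arguments of the auto entry mapped to the same
                      semantic argument (subject/object) as in gold
    """
    arg_overlap = len(args_auto & args_gold) / len(args_gold)
    map_score = len(correct_mappings) / len(args_auto)
    return (frame_eq + arg_overlap + map_score) / 3

# Right frame, both arguments recovered, one of two mapped correctly
a = accuracy(1, {"copulativeArg", "prepositionalObject"},
                {"copulativeArg", "prepositionalObject"},
                {"copulativeArg"})
print(a)  # (1 + 1 + 0.5) / 3
```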
Having in mind a semi-automatic scenario, in which a lexicon engineer validates and, if necessary, corrects the automatically generated lexical entries, we put more emphasis on recall, as it is easier and faster to filter out wrong entries than to discover and add missing ones. Figure 2b gives the results in terms of the average precision P_lex, recall R_lex, and F-measure F_lex, as well as the average accuracy A_lex, for the test set in both evaluation modes, for both relevant threshold values, and on all 40 URIs. As with the training set, precision increases and recall decreases for higher threshold values. In automatic mode, roughly half of the gold standard entries are generated, usually with fair precision and accuracy, together with an additional number of lexicalizations, ranging from 2 to 500, that are not in the gold standard. Of these additional lexicalizations, on average 1.4 entries are correct and were added to the gold standard lexicon. This improved precision and recall by roughly 0.2, and accuracy even by 0.3 to 0.4. The property programmingLanguage is an example of a property that performs quite badly in terms of precision: six out of seven gold standard lexicalizations are found, leading to a recall of 0.85 and an accuracy of 0.96, but more than 500 wrong lexicalizations are created as well, yielding a very low precision. The main reason is that the entity labels are not yet preprocessed.

Fig. 2: (a) Results on the training set: R_lex, P_lex, F_lex as a function of the pattern likelihood threshold; (b) results on the test set: R_lex, P_lex, F_lex, A_lex in full and semi-automatic mode.