Semantic Web Technologies – Ontological Engineering

Summary of week five for the course Knowledge Engineering with Semantic Web Technologies 2015 by Harald Sack at OpenHPI.

Ontology Engineering

Pyramid for knowledge management:

  • Data: raw data, facts about event
  • Information: a message to change the receiver’s perception
  • Knowledge: experience, context, information + semantics
  • Wisdom: application of knowledge in context

In general it makes sense to follow some methodologies because creating an ontology is quit complex.

Ontology Learning

Can we create ontologies automatically? Ways to do this:

  • via text mining from text
  • via linked data mining from e.g. RDF graphs
  • concept learning in Description Logics and OWL (related to linked data mining, but also)
  • crowdsourcing via Amazon Mechanical Turk or games with purposein short there are three steps: term extraction – conceptualization – evaluation. Actual challenges in Ontology Learning:
  • Heterogenity
  • Uncertainty: the quality is low, you cannot be sure whether the information is right or not
  • You need consistency because otherwise you cannot do reasoning
  • Scalability: make sure that it is scalable
  • Quality: you neet to evaluate it and make sure it is right
  • Interactity: you need to involve users to help you improve the ontologies

Ontology Alignment
What is it? You try to find similarities between ontologies in order to combine them. But: an ontology only models reality, it is NOT the reality. The problems are similar to natural language: you run into ambiguities. You can also have problems with different conventions (time in seconds vs. time in time points), different granularities and different points of views.

You have differences on the syntactical, terminological. semantical or semiotic (pragmatic) level

Ontology Evaluation

This is the quality of an ontology in respect to a particular criterion. There are two basic principles:

  • Verification: it encoding and implementation correct (more the formal side)
  • Validation: how good is the model and how well does it match reality?

Criteria for validation:

  • correctness:
    • Accuracy (precision and recall)
    • completeness
    • consistency
  • quality:
    • Adaptibility
    • Clarity
    • computational efficiency
  • organizaional fitness (how well does it integrate in my software/organisation
  • conciseness

Artificial Intelligence – Breadth-First and Depth-First-Search

Summary of week one of the MOOC Artificial intelligence at edx.

The lecture deals with artificial intelligence, which means in this context, that we want to program computers that act logically, we do not want them to act like humans and we also do not want to build an artificial brain to understand how human thinking works.

The first problems are search problems, like finding the shortest path to a point in grid. For this you can use two types of search:

  • Breadth-First-Search: You try out every point around you and if do not reach the goal, you try the points next to these points and so on. The good thing about this is, you find the shortest way, the bad is it takes very long.
  • Depth-First-Search: You go down one way until the end (for instance, you always take a left turn) and if this does not work, you try another path. The good thing about this is that you probably find the solution faster than with Breadth-First, but you may not find the best solution.

In real-world-problems you sometimes also have costs, that you have to add to the problems, for example if you want to find the shortest path between two cities by train, you probably want to use the shortest distance and not the connection with least steps (edges in the graph).

A real improvement can be made if you add heuristics to your model, like you calculate the distance to your goal. Using this you can combine the  this heuristics with Breadth-First-Search in order to find the best solution faster because you know when your search brings you further away from your goal.

Semantic Web Technologies – RDF

Summary of week one for the course Knowledge Engineering with Semantic Web Technologies 2015 by Harald Sack at OpenHPI.

The first week covers the basic principles of the technologies for the semantic web, especially RDF, which is one of the languages you can use for encoding information semantically. The basic principle behind the technologies are triples, which consist of subject-predicate-object. So you encode all your knowledge in that way, for example: Pluto – discovered – 1930.

One problem is that because of the syntax the expressions tend to be very long, so you can use abbreviations with namespaces like in XML or turtle, which helps you also to shorten your syntax.

Mr. Sack claims that with Semantic Web technologies you can go one step further into a web of data, because it is very easy to create data that is machine-readable. He gives a lot of examples using DBPedia. This site also provides a good interface, where you can download data in different machine-readable formats like xml or json. Example-page for Pluto.

Named Entity Recognition

Named Entities are nouns that – simply speaking – refer to something in the real world. An example would be the noun Los Angeles, which refers to a city in the US, unlike the noun apple, which describes a fruit. For tasks in information retrieval it is very useful to know whether a noun refers to a named entity or not because it is a common task in search to find named entities, for example if you want to make a trip to Los Angeles. It will then be important because people do not want to find information about the two words los and angeles, but information about this particular city.

So how do you recognize it? There are several techniques, that are used and combined. One way is analyzing parts of speech and trying to detect when a certain pattern of for example two nouns occurs (like USB device). Another way is to look at the sentences for keywords that may refer to named entities and then analyze if in this part of the sentence there are named entities. For instance, in patent retrieval new inventions are described in a way that does not make clear what it really means in order to make the patent claims broader. For instance, a floppy disc drive can be

At the end you can combine these techniques with machine learning, so you can mark named entities at a data set and let an algorithm learn, which of the nouns are named entities.

Another difficulty is the mapping of named entities. For example you have a text about politics in Germany. This text talks about the chancellor of Germany. You can use this information, but you still do not know if Angela Merkel or one of her predecessors. You will need more information to figure out about whom this person is talking, like the date when the article was written. Another awesome example is Java, which is an island and a programming language. There is also a book that uses this ambiguity. It is named Java ist auch eine Insel – Java is also an island.

You can find more information about this topic for example at Marrero et al. and more general information at Wikipedia.

Query Clarity

In IR you got your query and from this query you get a result. But how good is this result? One way to measure this is by calculating the clarity of the result. The clarity means – generally speaking – how much the found results differ. You can measure this when you look a the result sets and try to find out how much the words in the found documents differ. Query clarity can tell you, how much ambiguity you have in your query.

Of course there are different ways to calculate the query clarity. The basic model is the one introduced by Cronen-Townsend et al. Others are the Improved Clarity Score by Hauff et al. and the Simplified Clarity Score by He and Ounis.