Named Entity Recognition

Named Entities are nouns that – simply speaking – refer to something in the real world. An example would be the noun Los Angeles, which refers to a city in the US, unlike the noun apple, which describes a fruit. For tasks in information retrieval it is very useful to know whether a noun refers to a named entity or not because it is a common task in search to find named entities, for example if you want to make a trip to Los Angeles. It will then be important because people do not want to find information about the two words los and angeles, but information about this particular city.

So how do you recognize it? There are several techniques, that are used and combined. One way is analyzing parts of speech and trying to detect when a certain pattern of for example two nouns occurs (like USB device). Another way is to look at the sentences for keywords that may refer to named entities and then analyze if in this part of the sentence there are named entities. For instance, in patent retrieval new inventions are described in a way that does not make clear what it really means in order to make the patent claims broader. For instance, a floppy disc drive can be

At the end you can combine these techniques with machine learning, so you can mark named entities at a data set and let an algorithm learn, which of the nouns are named entities.

Another difficulty is the mapping of named entities. For example you have a text about politics in Germany. This text talks about the chancellor of Germany. You can use this information, but you still do not know if Angela Merkel or one of her predecessors. You will need more information to figure out about whom this person is talking, like the date when the article was written. Another awesome example is Java, which is an island and a programming language. There is also a book that uses this ambiguity. It is named Java ist auch eine Insel – Java is also an island.

You can find more information about this topic for example at Marrero et al. and more general information at Wikipedia.