Deniz Yuret posted a cool presentation on his blog about how artificial intelligence and deep learning evolved. In 2036 we will see whether there has been a fourth revolution in AI or not 😉
Today I want to present a paper which made me think about Digital Humanities. It is called “Algorithmic Criticism” by Stephen Ramsay.
Unlike most other papers, which only focus on new algorithms and new data, this one also focuses on how the two parts of the digital humanities can be combined. He wants to develop a criticism (normally a method used in the humanities) that is based on algorithms.
He argues that even in literary research it could be possible to have approaches that are a lot more empirical, meaning you have an experiment and quantitative measurements to prove your claims. Another important point he makes is that computers might not be ready for that kind of analysis yet (the paper is from 2005 though), but may be in the future, so he believes that these methods will become available.
One of his central points is that every critic reads a text using his own assumptions and “sees an aspect” (Wittgenstein) in the text. So the feminist reader sees a feminist aspect of the text, and the “algorithmic” reader can likewise see the aspect of the computer, or read the text as transformed by a computer. At the end, the paper presents some research applying tf-idf measures to the novel The Waves by Virginia Woolf.
I really like the idea of treating a machine's processing of a text as one particular way of reading it, similar to a human reader, who is also not completely effective and free of bias. This is also good for the NLP researcher, because it lets you admit that the judgement the computer gives is not free of bias either, for instance when you change the parameters of your algorithm.
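To make the tf-idf measure mentioned above a bit more concrete, here is a minimal sketch in plain Python. It is not Ramsay's code: treating each speaker's text as one document is my assumption, the toy sentences are made up (they are not quotes from The Waves), and the formula (relative term frequency times log(N / df)) is just one common variant. Changing that variant is exactly the kind of parameter choice that changes what the "algorithmic reader" sees.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    documents: list of token lists. Uses relative term frequency and
    idf = log(N / df), one common variant among several.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))

    scores = []
    for tokens in documents:
        tf = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


# Toy "speakers" (hypothetical text, only for illustration).
speakers = {
    "A": "the sea the waves the light".split(),
    "B": "the garden the birds the light".split(),
    "C": "the sea the ship the storm".split(),
}
scores = tf_idf(list(speakers.values()))
for name, doc_scores in zip(speakers, scores):
    top = max(doc_scores, key=doc_scores.get)
    print(name, top, round(doc_scores[top], 3))
```

A word that every speaker uses (here "the") gets a score of zero, while words that only one speaker uses are ranked highest for that speaker.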
In this post I want to share a few things that I came across during the last couple of weeks and that I think are worth sharing:
There is a new episode of Open Science Radio. This is a German podcast about Open Science and other related stuff. They also have some episodes. One thing they talk about is Jeffrey Leek, a researcher in (bio-)statistics who wrote a book about being a modern scientist, which you can get for free or for a donation. He also teaches Data Science classes via Coursera. I can also highly recommend episode 59 of Open Science Radio about OpenML, which I think is a very cool project as well.
Google is open sourcing a tool for visualizing high-dimensional data as part of TensorFlow (the Embedding Projector). The standard visualization shows word vectors. In my opinion this visualization is a little tricky, because things that appear to be close in the three-dimensional view are not necessarily close in the real vector space with a couple of hundred dimensions. But it is still a nice tool for exploring how word vectors behave on a very large dataset that you do not even have to train yourself. You can also use the tool for plotting other high-dimensional data.
Summary of week four for the course Knowledge Engineering with Semantic Web Technologies 2015 by Harald Sack at OpenHPI.
RDFS Semantics: we need this because there was no formal description of the semantics, and so the same queries gave back different results. So you add semantics. Every triple encoded in RDF is a statement, and an RDF graph is also a statement.
OWL: based on a description logic, it consists of classes, properties and individuals (instances of classes).
OWL 2 comes in different flavors: three profiles (EL, RL, QL), DL (based on description logic) and Full (DL is decidable, Full is not). There are different syntaxes for writing ontologies. The most important (and shortest) are Manchester Syntax and Turtle.
Classes, properties and individuals in OWL are comparable with the ones in RDFS.
OWL contains NamedIndividuals, which can be introduced directly.
Deeper knowledge about OWL in the extra lectures.
Summary of week three for the course Knowledge Engineering with Semantic Web Technologies 2015 by Harald Sack at OpenHPI.
This lecture deals with ontologies. If you want to speak a common language, you need:
- common symbols and concepts (Syntax)
- agreement about their meaning (Semantics)
- classification of concepts (Taxonomy)
- associations and relations of concepts (Thesauri)
- rules and knowledge about which relations are allowed and make sense (Ontologies) (Dr. Harald Sack: Knowledge Engineering with Semantic Web Technologies presentation slides; Lecture 3: Ontologies and Logic 3.1 Ontologies Basics, Autumn 2015)
We define knowledge as a subset of true beliefs. Ontologies are a formal representation of this. In philosophy, ontology is defined as the study of the nature of being and existence and of the basic categories of beings. In Computer Science: an ontology is an explicit formal specification of a shared conceptualization (Thomas Gruber, A Translation Approach to Portable Ontology Specifications). Its principles are:
- conceptualization: a model of the world
- explicit: must be specified
- formal: must be machine-readable
- shared: all must understand it the same way
You can divide ontologies in two ways:
- On their level of generality (top level ontologies that categorize everything in the world (example by John F. Sowa)), or specific for a model, a task or an application
- On their level of expressivity (how much can you get out of an ontology?)
The next thing we need is formal logic. We need it because with formal logic we can deduce things automatically, which we cannot do using informal logic. We need propositional logic (propositions that are either true or false) and first order logic to do this. For a formula the following terms are defined:
- tautological: every interpretation is a model (the formula is true under all interpretations)
- satisfiable: a model exists for the formula
- refutable: an interpretation exists which is not a model
- unsatisfiable: no model exists
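For propositional logic, the four terms above can be checked mechanically by trying every interpretation. The following is a small sketch under that assumption; the function names `classify` and `implies` are mine and not from the lecture, and the approach only works for propositional formulas, not for first order logic.

```python
from itertools import product


def classify(formula, variables):
    """Classify a propositional formula by checking every interpretation.

    formula: function taking one bool per variable and returning a bool.
    variables: list of variable names (only its length is used).
    """
    results = [
        formula(*values)
        for values in product([True, False], repeat=len(variables))
    ]
    return {
        "tautological": all(results),       # every interpretation is a model
        "satisfiable": any(results),        # at least one model exists
        "refutable": not all(results),      # at least one non-model exists
        "unsatisfiable": not any(results),  # no model exists
    }


def implies(a, b):
    """Material implication a -> b."""
    return (not a) or b


print(classify(lambda p: p or not p, ["p"]))             # excluded middle
print(classify(lambda p, q: implies(p, q), ["p", "q"]))  # contingent formula
print(classify(lambda p: p and not p, ["p"]))            # contradiction
```

Note that the terms overlap: every tautology is also satisfiable, and every unsatisfiable formula is also refutable.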
The next lecture was about the tableaux algorithm, which is used for automated reasoning. You basically build a decision tree and prove by refutation. Lecture 8 deals with description logic. It is important to know that there are several description logics out there. We do not use first order logic to build our ontologies because it would be too bulky. Part 9 deals with different assumptions for the logic:
- No Unique Name Assumption: in description logic individuals can have more than one name, so you have to state explicitly when two names refer to different individuals.
- Open World Assumption: in an empty ontology, everything is possible. You define only what is forbidden. In contrast, under the Closed World Assumption everything that cannot be shown to be true is false, so you have to define everything while creating it -> databases.
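The difference between the two assumptions can be sketched with a tiny fact store. This is only an illustration: the statements and the function names are hypothetical, and a real reasoner derives "false" from axioms rather than from a list of negative facts.

```python
# Hypothetical knowledge: one statement known to hold, one known not to hold.
known_true = {("Sputnik", "satelliteOf", "Earth")}
known_false = {("Sputnik", "satelliteOf", "Mars")}


def ask_closed_world(statement):
    """Closed world (databases): whatever is not known to be true is false."""
    return statement in known_true


def ask_open_world(statement):
    """Open world: a statement is only false if that is known; else unknown."""
    if statement in known_true:
        return True
    if statement in known_false:
        return False
    return None  # unknown: neither stated nor forbidden


question = ("Moon", "satelliteOf", "Earth")
print(ask_closed_world(question))  # the database simply answers "no"
print(ask_open_world(question))    # the ontology cannot decide
```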
Summary of week two for the course Knowledge Engineering with Semantic Web Technologies 2015 by Harald Sack at OpenHPI.
Reification allows you to refer to statements. For this, a statement itself gets a URI. This allows you to make statements about statements and assumptions about assumptions (e.g. Sherlock Holmes thinks that the gardener murdered the butler).
RDFS or RDF Schema takes this further. It adds more semantic expressivity, so you can get more knowledge out of the graph. It is also the simplest of the modelling languages (OWL is another, but will be covered later) and describes vocabularies for RDF. What can we do with it? We can build classes to model structures (planet is a class, satellite is a subclass of planet, artificial satellite is a subclass…, Earth is a planet, the Moon is a satellite and a satellite of Earth, Sputnik is an artificial satellite and an artificial satellite of Earth). From this we can infer some information, like: an artificial satellite is also a satellite. Sputnik is an artificial satellite of Earth, so it is also a satellite of Earth!
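The Sputnik inference can be sketched in a few lines of plain Python. This is a minimal sketch and not a full RDFS reasoner: it only applies the subclass and subproperty rules, and the `ex:` names (including the property `ex:artificialSatelliteOf`) are hypothetical identifiers I made up for the example.

```python
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"
TYPE = "rdf:type"


def infer(triples):
    """Apply three RDFS-style rules until no new triple appears."""
    graph = set(triples)
    while True:
        new = set()
        for s, p, o in graph:
            for s2, p2, o2 in graph:
                # (x type C), (C subClassOf D) => (x type D)
                if p == TYPE and p2 == SUBCLASS and o == s2:
                    new.add((s, TYPE, o2))
                # (C subClassOf D), (D subClassOf E) => (C subClassOf E)
                if p == SUBCLASS and p2 == SUBCLASS and o == s2:
                    new.add((s, SUBCLASS, o2))
                # (x P y), (P subPropertyOf Q) => (x Q y)
                if p2 == SUBPROP and p == s2:
                    new.add((s, o2, o))
        if new <= graph:  # nothing new was derived: fixed point reached
            return graph
        graph |= new


triples = {
    ("ex:ArtificialSatellite", SUBCLASS, "ex:Satellite"),
    ("ex:artificialSatelliteOf", SUBPROP, "ex:satelliteOf"),
    ("ex:Sputnik", TYPE, "ex:ArtificialSatellite"),
    ("ex:Sputnik", "ex:artificialSatelliteOf", "ex:Earth"),
}
closure = infer(triples)
for triple in sorted(closure - triples):
    print(triple)  # only the triples that were inferred
```

The two printed triples are the facts nobody wrote down explicitly: Sputnik is a satellite, and Sputnik is a satellite of Earth.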
The rest of the lecture focused on SPARQL. This is the language for querying knowledge bases stored in RDF. It is syntactically similar to SQL, but works somewhat differently because in RDF we are dealing with graphs. You can use it via an endpoint, which is an RDF database that has a SPARQL protocol layer and answers over HTTP. It offers you:
- extraction of data
- construction of new graphs
- update graphs
- logical entailment (inferences)
Queries are written as triple patterns in Turtle syntax plus variables. You essentially define a subgraph pattern that the query has to match. Query example at DBpedia. As you can see, the syntax is close to SQL, and it also offers filters to reduce the number of results. If you want to try out SPARQL, OpenHPI recommends using Fuseki or Wikidata.
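The idea of matching a subgraph pattern can be sketched without any SPARQL engine. The following is a simplified illustration in plain Python, not how Fuseki or any real endpoint works internally; the graph and the `ex:` names are hypothetical. Terms starting with `?` are variables, as in SPARQL.

```python
def match(patterns, graph, binding=None):
    """Yield every variable binding under which all patterns occur in graph."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # All three positions matched: continue with remaining patterns.
            yield from match(rest, graph, candidate)


graph = [
    ("ex:Moon", "ex:satelliteOf", "ex:Earth"),
    ("ex:Sputnik", "ex:satelliteOf", "ex:Earth"),
    ("ex:Sputnik", "rdf:type", "ex:ArtificialSatellite"),
    ("ex:Earth", "rdf:type", "ex:Planet"),
]

# Roughly: SELECT ?s ?p WHERE { ?s ex:satelliteOf ?p .
#                               ?p rdf:type ex:Planet .
#                               ?s rdf:type ex:ArtificialSatellite }
query = [
    ("?s", "ex:satelliteOf", "?p"),
    ("?p", "rdf:type", "ex:Planet"),
    ("?s", "rdf:type", "ex:ArtificialSatellite"),
]
results = list(match(query, graph))
print(results)
```

Each additional triple pattern narrows the result, much like an extra join condition would in SQL.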
This is not about NLP, but I think it is worth sharing, so here we go 😉
I really like the project PiMusicbox. The Raspberry Pi is just the perfect device to host a music server, especially the model B1, which is a little bit slow when it comes to video playback anyway. But there are a few drawbacks: updates are quite hard and you cannot easily customize it. So I set up the system in a different way, directly from scratch on Raspbian. In the next steps I show you how.
Thoughts before you start
This is what comes to my mind if someone asks me whether she should take the normal version of PiMusicbox or mine.
- You can update all the time. You have to reinstall PiMusicbox with every new release.
- You can customize it as you want. The normal PiMusicbox does not provide Podcasts or Files.
- Configuration is done manually. You should be able to connect to your Pi via ssh. Alternatively use the Websettings package.
- You do not have an integrated shutdown button. But you can use RaspiCheck.
- It is slower. In PiMusicbox a few tweaks are made to speed up booting of the system. I did not do that.
Download Raspbian minimal image
You can get this directly from the website of Raspbian: https://www.raspberrypi.org/downloads/raspbian/
Install mopidy + run as system service
In order to start it when the system boots, you only have to type (source)
sudo systemctl enable mopidy
I used the Spotify, Podcast and File extensions, because these are all I use with the Raspi. You can get an overview in the Mopidy documentation, which lists all extensions for playback (it calls them backend extensions). Generally speaking you can install all extensions either via pip or via apt, depending on which repository they are in. This is sometimes a little bit confusing, but you’ll find everything, I am sure 😉
Also, you need an extension for an HTML frontend. I used the one made for the PiMusicbox. An overview can be found here.
Mounting your external storage automatically
For this I used usbmount, a small program that mounts external storage devices automatically. This can of course be done via scripts as well, but I did not want to mess around with scripts, so I used this approach.
If you run mopidy as a system service, its config file is not in your user’s directory but at /etc/mopidy/mopidy.conf.
To shorten this paragraph I just paste my config file here and make some comments:
[core]
cache_dir = /var/cache/mopidy
config_dir = /etc/mopidy
data_dir = /var/lib/mopidy

[logging]
config_file = /etc/mopidy/logging.conf
debug_file = /var/log/mopidy/mopidy-debug.log

[local]
data_dir = /var/lib/mopidy/local
media_dir = /var/lib/mopidy/media

[m3u]
playlists_dir = /var/lib/mopidy/playlists

[musicbox_webclient]
enabled = true

[spotify]
# your username
username =
# your password
password =
# better sound quality
bitrate = 320

[http]
# VERY important. Otherwise you cannot reach it from outside
hostname = 0.0.0.0

[mpd]
# VERY important. Otherwise you cannot reach it from outside
hostname = 0.0.0.0

[podcast]
# only need to activate it.
enabled = true
# Path where your .opml-file with all your podcasts is
browse_root =

[file]
enabled = true
# this is where your external storage is mounted via usbmount
media_dirs = /media/usb
show_dotfiles = false
follow_symlinks = false
metadata_timeout = 1000
I hope this tutorial helped you.
Bonus: Setting up SMB share
If you have a hard disk connected to your Raspi, you can easily share the files across your whole network. I used the tutorial by putokaz to install and configure it. I just configured it like the share for the torrent files at the end of the tutorial, but this is up to you.
Summary of week six for the course Knowledge Engineering with Semantic Web Technologies 2015 by Harald Sack at OpenHPI.
Linked Data Engineering
In general it is difficult to get data, because it is distributed over different databases and you need different APIs to get at it -> data islands. Applying semantic web technologies gives you a standardized interface to access this data. This allows easier reuse and sharing of data.
Tim Berners-Lee: “Value of data increases if data is connected to other sources.”
There are four principles for linked data:
- Use URIs as names (not only web pages, but also real objects, abstract concepts and so on)
- Use HTTP, so people can look up the names, but also machines
- Provide useful information using RDF and SPARQL
- Include links to other URIs, so people can discover more things
If you want to create linked open data, you should have:
- data available with an open licence
- machine-readable format
- non-proprietary format
- use open standards from W3C
- link to other data sources
Tour through the linked data cloud (Uni Mannheim):
- Government data (data.gov.uk)
- media data
- user-generated content (semanticweb.org)
- linguistic data
- bibliographic data (bibsonomy.org)
- life sciences
- cross-domain (dbpedia.org, w3.org, lexvo.org)
- social networking (quitter.se)
- geographic (geonames.org)
All these sources are held together by ontologies. Examples of ontologies:
- OWL (owl:sameAs or owl:equivalentClass)
- SKOS (simple knowledge organization system) applied for definitions and mappings of vocabularies and ontologies. Allows you to give relations like narrower or broader, relations and matches.
- umbel (upper mapping and binding exchange layer) maps into DBPedia, geonames and Wikipedia
Linked Data Programming
How to publish data for the Semantic Web? The best way is via a SPARQL endpoint such as OpenLink Virtuoso, Sesame or Fuseki. These endpoints are RESTful web services that return results as JSON, XML and so on. Another way is via Linked Data endpoints (Pubby, Jetty). These are overlays on top of a SPARQL endpoint. Another way is via D2R servers, which translate data from non-RDF databases into RDF data. A place to look up which data sources are available is datahub.io.
Metadata and Semantic Annotation
Semantic Annotation: you attach semantic data to your source. Formal:
- subject of the annotation, (a book, represented by isbn-number)
- object of the annotation, the author
- predicate that defines the type of relationship, e.g. that the author is the author of the book
- context, in which the annotation is made (who did the annotation and when?)
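As a data structure, such an annotation is basically a triple plus its context. A minimal sketch in Python could look like this; the class, its field names and all example values (the placeholder ISBN, the annotator name, the date) are hypothetical and are not taken from the Open Annotation vocabulary.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Annotation:
    subject: str    # what is annotated, e.g. a book identified by its ISBN
    predicate: str  # the type of relationship
    obj: str        # the value of the annotation, e.g. the author
    annotator: str  # context: who made the annotation
    created: date   # context: when it was made


note = Annotation(
    subject="urn:isbn:0000000000",  # placeholder, not a real ISBN
    predicate="dc:creator",
    obj="George Orwell",
    annotator="some-user",          # hypothetical annotator
    created=date(2015, 11, 1),      # hypothetical date
)
print(note)
```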
Open Annotation Ontology (developed by W3C)
Named Entity Resolution
When we do semantic annotation we want to get the meaning of a string, such as additional information (you annotate Neil Armstrong and get more information about him). The main problem is ambiguity: if you enter “Armstrong” in a search engine, you also get pictures of Lance Armstrong and Louis Armstrong. Context helps us to narrow down the search and overcome this problem.
Resolution: mapping the word to a knowledge base in order to solve ambiguity
Recognition: locating and classifying entities into predefined categories like names, persons, organizations
Example: Armstrong landed the eagle on the moon
From that you build every possible combination of candidates for these entities, and if there are co-occurrences in the texts, you can find the best matches. Another way is to look at DBpedia, where you can see which of the possible candidates are connected to each other.
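The second approach (pick the combination of candidates that is best connected in the knowledge base) can be sketched like this. It is a toy illustration only: the candidate lists and the `links` set are made up and not taken from DBpedia, and real systems use weighted scores instead of simply counting links.

```python
from itertools import combinations, product

# Hypothetical candidate entities for each mention in the example sentence.
candidates = {
    "Armstrong": ["Neil Armstrong", "Lance Armstrong", "Louis Armstrong"],
    "eagle": ["Eagle (lunar module)", "Eagle (bird)"],
    "moon": ["Moon"],
}

# Hypothetical links between entities in the knowledge base.
links = {
    frozenset({"Neil Armstrong", "Eagle (lunar module)"}),
    frozenset({"Neil Armstrong", "Moon"}),
    frozenset({"Eagle (lunar module)", "Moon"}),
}


def disambiguate(candidates, links):
    """Choose the candidate combination with the most links among its members."""
    mentions = list(candidates)
    best, best_score = None, -1
    for combo in product(*(candidates[m] for m in mentions)):
        score = sum(
            frozenset(pair) in links for pair in combinations(combo, 2)
        )
        if score > best_score:
            best, best_score = combo, score
    return dict(zip(mentions, best))


print(disambiguate(candidates, links))
```

Trying every combination grows exponentially with the number of mentions, so this brute-force version is only meant to show the idea.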
When you use a search engine, you will also get ambiguous results. With semantically annotated texts, you can overcome this ambiguity. Based on this you can do entity-based IR, which is language independent. You could also include information from the underlying knowledge base or use content-based navigation and filtering (filter pictures vs. drawings).
You can use it for:
- query string refinement (like auto-completion, query enrichment)
- cross referencing (additional information for the user taken from the knowledge base)
- fuzzy search (give nearby results, helpful if you have very few results)
- exploratory search (visualize and navigate in the search space)
- reasoning (complement search results with implicitly given information)
Another example is entity-based search. You match a query against semantically annotated documents (simple entity matching). You can also use similarities, like the one between Buzz Aldrin and Neil Armstrong (similarity-based entity matching).
Relationship-based entity matching: you have the entities astronaut and Apollo 11. There are also relationships between astronaut, Apollo 11 and Neil Armstrong.
-> These results can complement your search!
Another approach is selecting named entities directly, so you click on the entity that you want. Examples are the articles at blog.yovisto.com
Extension of traditional search and Semantic Search
- Retrieval: You look for something specific (like a book) and know how to specify it
- Exploration: You already read “1984” and want to read a book that is close to this one. In a library you would ask the librarian, who will tell you what to read next. We want to have this in our search system as well! In a traditional library you can also look at the shelves and maybe find another book that is similar.
For whom is it made?
- People that are unfamiliar with the domain
- People who are unsure about the ways to achieve their goals
- People who are unsure about their goals, you want to find something, but you cannot specify it
You can build graphs from the semantic information you have in order to give the user more information about the original result (more books by one author). You could also get broader results (you read a book by Jules Verne and get as a recommendation books by H.G. Wells, who was influenced by Jules Verne). Another example: start with Neil Armstrong -> Apollo 11 and the other crew members -> Apollo 11 is part of the Apollo program and you find other Apollo missions -> you find Apollo 13 and find out that there was an accident -> you find the crash of the space ship “Challenger”.
Summary of week five for the course Knowledge Engineering with Semantic Web Technologies 2015 by Harald Sack at OpenHPI.
Pyramid for knowledge management:
- Data: raw data, facts about event
- Information: a message to change the receiver’s perception
- Knowledge: experience, context, information + semantics
- Wisdom: application of knowledge in context
In general it makes sense to follow an established methodology, because creating an ontology is quite complex.
Can we create ontologies automatically? Ways to do this:
- via text mining from text
- via linked data mining from e.g. RDF graphs
- concept learning in Description Logics and OWL (related to linked data mining, but also)
- crowdsourcing via Amazon Mechanical Turk or games with a purpose
In short there are three steps: term extraction, conceptualization, evaluation. Current challenges in Ontology Learning:
- Uncertainty: the quality is low, you cannot be sure whether the information is right or not
- You need consistency because otherwise you cannot do reasoning
- Scalability: make sure that it is scalable
- Quality: you need to evaluate it and make sure it is right
- Interactivity: you need to involve users to help you improve the ontologies
What is it? You try to find similarities between ontologies in order to combine them. But: an ontology only models reality, it is NOT the reality. The problems are similar to natural language: you run into ambiguities. You can also have problems with different conventions (time in seconds vs. time in time points), different granularities and different points of views.
You have differences on the syntactical, terminological, semantical or semiotic (pragmatic) level.
This is the quality of an ontology in respect to a particular criterion. There are two basic principles:
- Verification: are the encoding and implementation correct? (more the formal side)
- Validation: how good is the model and how well does it match reality?
Criteria for validation:
- Accuracy (precision and recall)
- computational efficiency
- organizational fitness (how well does it integrate into my software/organisation?)
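Precision and recall, the accuracy criterion from the list above, can be computed by comparing what an ontology states against a reference. The sketch below assumes that such a trusted reference ("gold standard") exists, which is often the hard part in practice; the example triples are made up.

```python
def precision_recall(predicted, gold):
    """Compare predicted statements against a gold standard.

    precision: share of predicted statements that are correct
    recall:    share of gold statements that were found
    """
    predicted, gold = set(predicted), set(gold)
    correct = predicted & gold
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical learned ontology vs. hypothetical reference ontology.
learned = {
    ("Sputnik", "type", "Satellite"),
    ("Moon", "type", "Satellite"),
    ("Earth", "type", "Satellite"),  # wrong statement
}
reference = {
    ("Sputnik", "type", "Satellite"),
    ("Moon", "type", "Satellite"),
    ("Earth", "type", "Planet"),     # missed by the learned ontology
}
precision, recall = precision_recall(learned, reference)
print(round(precision, 2), round(recall, 2))
```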
Summary of week one of the MOOC Artificial Intelligence at edX.
The lecture deals with artificial intelligence. In this context that means we want to program computers that act logically; we do not want them to act like humans, and we also do not want to build an artificial brain to understand how human thinking works.
The first problems are search problems, like finding the shortest path to a point in grid. For this you can use two types of search:
- Breadth-First-Search: You try out every point around you, and if you do not reach the goal, you try the points next to these points and so on. The good thing about this is that you find the shortest way (in number of steps); the bad thing is that it can take very long, because you explore in all directions.
- Depth-First-Search: You go down one way until the end (for instance, you always take a left turn), and if this does not work, you try another path. The good thing about this is that you may find a solution faster than with Breadth-First, but you may not find the best solution.
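Both strategies can be written as the same loop; the only difference is whether the oldest or the newest partial path is expanded next. This is a minimal sketch on a made-up 3x3 grid (`#` is a wall); the grid, the neighbour order and the function names are my own choices and not from the course.

```python
from collections import deque

GRID = [
    "S.G",
    ".#.",
    "...",
]
START, GOAL = (0, 0), (0, 2)


def neighbours(pos):
    """Yield the free cells right, below, left of and above pos."""
    r, c = pos
    for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]) and GRID[nr][nc] != "#":
            yield (nr, nc)


def search(start, goal, breadth_first):
    """Return a path from start to goal, or None if there is none."""
    frontier = deque([[start]])  # partial paths waiting to be expanded
    visited = {start}
    while frontier:
        # Queue (oldest first) gives BFS, stack (newest first) gives DFS.
        path = frontier.popleft() if breadth_first else frontier.pop()
        if path[-1] == goal:
            return path
        for nxt in neighbours(path[-1]):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None


bfs_path = search(START, GOAL, breadth_first=True)
dfs_path = search(START, GOAL, breadth_first=False)
print("BFS:", bfs_path)
print("DFS:", dfs_path)
```

On this grid Breadth-First goes straight along the top row, while Depth-First walks all the way around the wall before it reaches the goal.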
In real-world problems you sometimes also have costs that you have to add to the problem. For example, if you want to find the shortest path between two cities by train, you probably want the shortest distance and not the connection with the fewest steps (edges in the graph).
A real improvement can be made if you add heuristics to your model, for example an estimate of the distance to your goal. You can combine this heuristic with Breadth-First-Search in order to find the best solution faster, because you notice when your search takes you further away from your goal.
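One standard way to combine path costs with a distance heuristic is A* search, which always expands the path with the lowest "cost so far plus estimated rest". Note that this is my choice of algorithm for the illustration; the text above only speaks of adding a heuristic. The grid is made up, every step costs 1, and the Manhattan distance is used as the estimate.

```python
import heapq


def a_star(grid, start, goal):
    """Return a cheapest path on a grid with walls ('#'), or None."""

    def h(pos):
        # Manhattan distance: never overestimates on a 4-connected grid.
        return abs(pos[0] - goal[0]) + abs(pos[1] - goal[1])

    # Heap entries: (cost so far + estimate, cost so far, position, path)
    frontier = [(h(start), 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, cost, pos, path = heapq.heappop(frontier)
        if pos == goal:
            return path
        r, c = pos
        for nr, nc in ((r, c + 1), (r + 1, c), (r, c - 1), (r - 1, c)):
            if not (0 <= nr < len(grid) and 0 <= nc < len(grid[0])):
                continue  # outside the grid
            if grid[nr][nc] == "#":
                continue  # wall
            new_cost = cost + 1
            if new_cost < best_cost.get((nr, nc), float("inf")):
                best_cost[(nr, nc)] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + h((nr, nc)), new_cost, (nr, nc), path + [(nr, nc)]),
                )
    return None


grid = [
    "S..#",
    ".#..",
    "...G",
]
path = a_star(grid, (0, 0), (2, 3))
print(path)
```

With a heuristic that always returns 0, the same code behaves like a plain cost-based search without guidance, which shows that the heuristic only changes the order in which paths are tried.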