Shelter.nu : Thinking of Knowledge Management? Maybe you shouldn't; tips, tricks and plum pudding


Filesystems, taxonomies, ontologies, persistent identifiers and plum pudding

More often than not our information is spread far and wide; it is poorly organised, lacks sensible metadata and is semantically challenging to sift through. So we concoct various centralised and smart solutions to this widespread problem; portals, data-banks, databases, repositories, search engines, agents, wikis, blogs, and so forth. They have different names and work differently, even serving quite different versions of the paradigm "How to serve information in the best possible way", but they all have at least a few goals in common;

Centralise the point of access to your information.
Have as much information as possible for you to access.
Make that information as relevant as possible to you.

The rest is technology. But first; the problem!

The filesystem and the directory structure

Inspired by the second use case of Thomas Passin's "Explorer's Guide to the Semantic Web", here is a simple way to make any directory structure with files in it make a whole lot more sense. It is quite common in intranets, for example, to have a wealth of files spread throughout a big directory structure, and then slap some kind of web interface on top to try to make sense of it all. Mostly these attempts fail, and I don't think I have to argue strongly for this; people all over the world talk about this specific problem.

What's wrong with a directory structure, then? Let's assume you have one, and you've got tons of files in it. Let's go out on a limb and even pretend that your structure is, well, structured in a clever way. Let's even be crazy enough to assume that files placed in these directories belong there, that documentation really is in the documentation folder, and so on. Yes, it is an impossibility, I know, but bear with me for a moment.

Exhibit A : the file "//intranet/projects/pro1/docs/specs/specs-v1_2.doc"

Exhibit B : the file "//intranet/projects/pro2/specification-v2_2.rtf"

Exhibit C : the file "//intranet/division/div6/policy/development.doc"

Are there any relationships between these files? If we simply navigate through the structures knowing what we're looking for, then we might get some glimpse of relationships, but most of the time I would claim we don't know. So, let's break the path names down into traditional property values;

A:intranet, A:projects, A:pro1, A:docs, A:specs, A:specs-v1_2, A:doc

B:intranet, B:projects, B:pro2, B:specification-v2_2, B:rtf

C:intranet, C:division, C:div6, C:policy, C:development, C:doc
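The breakdown above is mechanical enough to sketch in a few lines of Python. This is only an illustration of the idea, not a tool from this article; the function name `explode_path` is made up for the occasion, and it assumes forward-slash paths like the exhibits.

```python
from pathlib import PurePosixPath


def explode_path(path: str) -> list[str]:
    """Split a path into directory names, the file's base name and its extension."""
    parts = [p for p in path.split("/") if p]  # drop the empty pieces left by "//"
    if not parts:
        return []
    filename = PurePosixPath(parts[-1])
    tokens = parts[:-1] + [filename.stem]
    if filename.suffix:
        tokens.append(filename.suffix.lstrip("."))
    return tokens


exhibits = {
    "A": "//intranet/projects/pro1/docs/specs/specs-v1_2.doc",
    "B": "//intranet/projects/pro2/specification-v2_2.rtf",
    "C": "//intranet/division/div6/policy/development.doc",
}

for label, path in exhibits.items():
    # Prints one line per exhibit, e.g. "A:intranet, A:projects, ..."
    print(", ".join(f"{label}:{token}" for token in explode_path(path)))
```

Note that nothing here knows what any of the tokens mean; it only chops strings.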

This gives us some assumed and distant relationships between files, and we already can do simple queries such as "find all files starting with 'spec' in projects", but what we really want to do is to see the close relationships between these files, dig into the semantic values of those relationships to uncover more exact patterns of recognising the meat. Let's take our first step into the world of Knowledge Representation, and create an ontology based on a list of the most frequently used words;

Documentation, specification, policy, division, project

Let's talk a little bit about this. In the world of Knowledge Representation (misguidedly also referred to as Knowledge Management) there is a misconception that "unique identification of things leads to understanding of what those things are". This thought is what drives people to create ontologies, and the bigger and more complex the ontology, the bigger and better our understanding will apparently become. As an example we point to ourselves and say "Homo sapiens sapiens", find this item in our massive ontology and how it relates to other things, and derive some meaning from this, such as that a "Homo erectus" was a distant ancestor and that I don't think I have many "Homo sexual" tendencies. And so forth.

But this is all a bit misleading, because in the real world these are very different things, namely two biological classes and one sexual preference. Apples and oranges, as they say, but it is even worse than that; it is apples and chapels. They are not related, apart from some similarity in the order and indexes of the characters in those words of the English language. See where this leads us? Down a very common but misleading path.

We think that similarities in certain things give semantic similarity in meaning, but it is a trap. Let's return to our exhibits;

A:intranet, A:projects, A:pro1, A:docs, A:specs, A:specs-v1_2, A:doc

B:intranet, B:projects, B:pro2, B:specification-v2_2, B:rtf

Quick, what are the similarities between these two files? They're on the intranet, they're both under a project, they both have filenames starting with 'spec', and they're both textual documents (doc and rtf). Good, good, but what does that mean? Is that valuable information? In fact, did all this info gain us any knowledge about those files at all that we couldn't have acquired from simply browsing the directory structure?

The first step in understanding something is to group the similarities. We all know this, so let's do that;

A:intranet, B:intranet, C:intranet = A, B and C are on the intranet

A:projects, B:projects = A and B are in a project

A:specs-v1_2, B:specification-v2_2 = A and B might be specifications

A:doc, C:doc = A and C are documents, possibly B

These words are typically what we want to put in our ontology. Let's go back to it;

Documentation, specification, policy, division, project

Yes, this makes sense; we can now start matching some stuff up, because we live in a perfect world where people do the perfect thing? Before I answer that silly self-referential question, let's have a look at the group for all things mismatched;

A:pro1, A:docs, A:specs, B:pro2, B:rtf, C:division, C:div6, C:policy, C:development
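For the curious, the grouping can be sketched as a tiny inverted index in Python. The names `group_by_token` and `files_with_prefix` are my own for illustration. Exact matching only finds tokens that are spelled identically, so it leaves a few more tokens unshared than the hand-made list above; the hunch that A and B "might be specifications" needs the looser prefix check.

```python
from collections import defaultdict

# The exploded path tokens of the three exhibits.
files = {
    "A": ["intranet", "projects", "pro1", "docs", "specs", "specs-v1_2", "doc"],
    "B": ["intranet", "projects", "pro2", "specification-v2_2", "rtf"],
    "C": ["intranet", "division", "div6", "policy", "development", "doc"],
}


def group_by_token(files):
    """Map each token to the set of file labels it appears in."""
    index = defaultdict(set)
    for label, tokens in files.items():
        for token in tokens:
            index[token].add(label)
    return index


def files_with_prefix(index, prefix):
    """Labels of files that have at least one token starting with the prefix."""
    labels = set()
    for token, owners in index.items():
        if token.startswith(prefix):
            labels |= owners
    return sorted(labels)


index = group_by_token(files)
shared = {t: sorted(owners) for t, owners in index.items() if len(owners) > 1}
unshared = sorted(t for t, owners in index.items() if len(owners) == 1)

print(shared)                            # tokens found in more than one file
print(files_with_prefix(index, "spec"))  # the "might be specifications" guess
print(unshared)                          # the mismatched leftovers
```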

Getting some form of taxonomical "knowledge" from the grouped list is easy enough, and works well for visual humans browsing through lists like this. This is the primary way Knowledge Systems today work. Meaning, they're not really there to understand or interpret or otherwise push us on our way; they're simply systems for indexing and sorting and displaying things, hopefully in the best given way considering we're doing the pulling.

A few problems to consider, though; what if something is a 'project' instead of 'projects' (note the plural), or if someone writes 'spesification' instead of 'specification' (note the spelling mistake), or maybe they're Norwegian and write 'spesifikasjon', and maybe they thought the specs were more related to the 'systems' folder, put them there, and called the file 'systems_spec_001'? What if? What if?

Yes, we can create a humongous ontology to cater for all these things. Is that maintainable? Hell no.
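One cheap trick that avoids listing every variant by hand is approximate string matching. The sketch below is my own aside, not something this article builds on; it uses Python's standard `difflib`, and the 0.8 cutoff is an arbitrary assumption. It catches the plural and the spelling mistake, but not the Norwegian word or the misfiled 'systems_spec_001', which is rather the point.

```python
import difflib

# The five ontology words from earlier, lower-cased for comparison.
ONTOLOGY = ["documentation", "specification", "policy", "division", "project"]


def closest_keyword(token, cutoff=0.8):
    """Return the ontology word most similar to the token, or None if none is close."""
    matches = difflib.get_close_matches(token.lower(), ONTOLOGY, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for token in ["projects", "spesification", "spesifikasjon", "systems_spec_001"]:
    print(token, "->", closest_keyword(token))
```

Lower the cutoff and more variants sneak in, along with more false matches; there is no setting that turns string similarity into meaning.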

First stab at the ontology

So, let's talk about that second ungrouped list, then. That's where the interesting things show up, because - and this may come as a shock to some people - we humans don't always put things where they belong, name them the best possible way or otherwise give hints to what they might be. We get things like 'pro1'. What the heck is that?

Now, this is where a dictionary / thesaurus / ontology sometimes is handy. We look up 'pro1' in our 'Corporate Ontology' (a magical system, indeed), and find out that it is designated to the following;

The project "FIB"
The software application that runs our finances
The nickname for Andy

In our grouped list, one of the keywords (which they shall be known as from now on) was 'project', so we could place a rather safe bet on our first option. Seems trivial enough, doesn't it? We can create simple systems that traverse structures and match exploded keywords with items from a dictionary or thesaurus. Perfect.

But hang on; what if option 1 and 2 were the same? Or what if Andy was the name of a project and not some person? At this stage, we simply don't know, but it doesn't take too many of such false assumptions to render our "knowledge system" useless. The dictionary or the thesaurus must be totally clear about what they are describing. A word or an explanation means nothing to a sorting computer, and it is obvious that we need something more.
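The "safe bet" and its weakness can both be seen in a small Python sketch. Everything here is hypothetical; the 'Corporate Ontology' is a plain dictionary, the "kind" labels are my own additions (the list above only gives the meanings), and chopping a trailing 's' is a crude stand-in for real word normalisation.

```python
# Hypothetical ontology entry for 'pro1'; the "kind" values are assumptions.
CORPORATE_ONTOLOGY = {
    "pro1": [
        {"meaning": 'The project "FIB"', "kind": "project"},
        {"meaning": "The software application that runs our finances", "kind": "application"},
        {"meaning": "The nickname for Andy", "kind": "person"},
    ],
}


def singular(token):
    """Crude normalisation: 'projects' -> 'project'."""
    return token[:-1] if token.endswith("s") else token


def disambiguate(token, context_tokens):
    """Prefer meanings whose kind appears among the file's other keywords.

    Falls back to every candidate when the context supports none of them.
    """
    candidates = CORPORATE_ONTOLOGY.get(token, [])
    context = {singular(t.lower()) for t in context_tokens}
    supported = [c for c in candidates if c["kind"] in context]
    return supported if supported else candidates


# Exhibit A sits under 'projects', so the bet lands on the first option.
print(disambiguate("pro1", ["intranet", "projects", "docs", "specs"]))

# Without that hint we are left with all three meanings and no way to choose.
print(disambiguate("pro1", ["intranet"]))
```

The code happily returns an answer either way; whether the answer is true depends entirely on what a person put into the dictionary.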

Um, that means you. The person. You assign the metadata 'project' to the file in question. This is obviously not going to happen to a filesystem with hundreds of thousands of files. We need better means of matching resources (the files in our filesystem) with our ontology.

"Ah! I know! Uh, uh, pick me! Pick me!"

Yes, you in the back ... ? Using Persistent Identifiers as a means to make sure that we're talking about the absolute and specific same things in our ontologies? Well, yes, you obviously didn't read the beginning of this article, but I'll recap; Persistent Identifiers are fine and well in a closed environment where you can control all measures, including the ontology, your users, your stakeholders, your objects and the things addressing them, the weather, the pitch of a perfect A (440Hz, no less) and a host of other measures.

There is a thick grey fog between your ontology and your data which cannot be expressed easily; knowledge. Because even when you - as a person sitting in front of the computer - know what bits belong to what part of the ontology, the computer has no idea. None. Zilch. Our quest for a "knowledge system" needs more than the tools we've discussed; the data itself, the ontology, a dictionary or a thesaurus, and persistent identifiers together can form quite good systems of course, and if you want a somewhat complex categorisation system, it's perfect. But it still has nothing to do with knowledge. You or someone - as a person - have to provide this.

Knowledge Representation

We need some piece of information to represent something, so that when I talk about my mum's plum pudding, I don't mean the one you find down the supermarket, but my mum's. Not as easy as it sounds.

First of all, who am I? We need some kind of representation for the object which is me. Let's create one;

https://shelter.nu

That's my homepage, so is that a good representation? Nah, not really, because there is more to my homepage than just 'me' (whatever that might be). Let's try to be a bit more specific;

https://shelter.nu/people/alexander.johannesen

Is this a piece of information we can safely use to pinpoint that we're talking about 'me'? Well, it is good for a lot of purposes, but in this ever-changing world we can't be too sure about the following;

Is there only one Alexander Johannesen in the world right now?
Has there been, and will there ever be, only one Alexander Johannesen in the world?

And once we've solved that;

Will https://shelter.nu/people ever mention any other Alexander Johannesens than the one we're trying to define?

Here we have the problem of a) identifying one single instance of something, and b) authentication and validity of the pointer to that identity. And to find the right plum pudding we need to go through this exercise with me, my mum and the plum pudding. There is an identity crisis afoot.

Identification

More and more people are trying to establish a set of rules for identity instead of going down the classic persistent identifier route. Here's a rule;

If person name is 'Alexander Johannesen', born in Norway, born in 1971 AD, married to 'Julie Anne Johannesen', lives in 'Canberra, Australia' in the year '2004' using the 'Anno Domini' system.
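Written out as code, such a rule is just a set of property values that a record must match. This Python sketch is only an illustration; the record type, the field names and the second record are all made up, and the years are taken to be Anno Domini.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonRecord:
    # Hypothetical fields, chosen to mirror the rule above. Years are AD.
    name: str
    birth_country: str
    birth_year: int
    spouse: str
    residence: str
    residence_year: int


# The identity rule: every listed property must have exactly this value.
RULE = {
    "name": "Alexander Johannesen",
    "birth_country": "Norway",
    "birth_year": 1971,
    "spouse": "Julie Anne Johannesen",
    "residence": "Canberra, Australia",
    "residence_year": 2004,
}


def matches_rule(record, rule):
    """True when the record agrees with the rule on every property it names."""
    return all(getattr(record, field) == value for field, value in rule.items())


me = PersonRecord("Alexander Johannesen", "Norway", 1971,
                  "Julie Anne Johannesen", "Canberra, Australia", 2004)
# An invented namesake who shares the name but nothing else.
namesake = PersonRecord("Alexander Johannesen", "Denmark", 1950,
                        "Unknown", "Copenhagen, Denmark", 2004)

print(matches_rule(me, RULE))        # the rule picks out this record
print(matches_rule(namesake, RULE))  # same name, different person
```

The more properties the rule names, the fewer records can slip through by accident, but somebody still has to collect all those properties.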

We're getting a bit closer. Genealogists have this down pat. But there is a problem, and that is that there's more to the world than people, and we're not really all that great at tracking this kind of metadata for things that aren't people. We have newspapers and books with editions and dates and authors, and this can be tracked fairly well, but what about our project documentation? Umm, yes, the stuff we really need to know now just before the board meeting is somewhere out there, and no one bothered to give a lot of metadata info about that file. Where is that fabled Knowledge Management system when we need it?

If we need to create persistent identifiers for all things, then that's all that we'll have time doing, because there is context that also needs persistent identifiers that map our project specification; people writing it, people who edit it, the people who thought of it, the resources needed to do it, the hours spent doing it, the picture that explains the concept of it, the graph that tells us the validity of it, the speech held by Mr. Dawson about why it's a bad idea, his colleagues who agree with him, and - of course - the black-list of people who disagree. And we need to map the conventions, such as editing the specs, writing the specs, agreeing to them, promoting them, and so forth. And then we need corporate associations, such as DL3 is a higher position than an AB2, James is a curator, DL3, project manager and a promoter of things, the elevators are in the B section of the 3rd department, and the toilets are bloody out of order! Sigh. There is this, and so much much more.

Some practical thoughts

Surely some middle ground can be found. Surely using a bit of this and a bit of that is good enough to locate my mum's plum pudding well enough for general use? Do we really need to map everything in the universe to make some small assumed guesses work? Of course not; there are thousands of systems doing that already. Assumption is the next black, you know, so let's get back to our examples and perform a little black art.

We've identified (pardon the pun) that the stuff we're interested in doing extra stuff with is this;

A:pro1, A:docs, A:specs, B:pro2, B:rtf, C:division, C:div6, C:policy, C:development

A simple and time-consuming solution to this is to assign some graduate student to add every single instance to our ontology, ask senior staff some weighted questions, and put the acquired "knowledge" into the ontology as well. Maybe this will work for you, but if you've got more than 5 people in your organisation, I'd suggest you look to other solutions.

Here are a few things we can do to squeeze a little more out of our ungrouped list;

A list of all projects you know of that we might talk about; pro1, pro2, FImlBLE, whatever they are called. Make sure the list never changes. For persistence, use a thesaurus and update the "use instead" links as you go.
A list of all people you know that we might talk about. Make sure this list never changes either, see above. If people quit it doesn't mean the document they wrote magically disappears too.
If you have a search engine, compare the list of "searched for words" against our ungrouped list. There might be hints there to what things need to be addressed.
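Two of these tips are easy to sketch in Python. Treat this as a rough illustration under my own assumptions; the thesaurus is a plain dictionary of "use instead" links, the retired names in it are invented, and the search log is just a list of words.

```python
def preferred_term(term, use_instead):
    """Follow 'use instead' links until we reach a term that has none."""
    seen = set()
    while term in use_instead:
        if term in seen:
            raise ValueError(f"circular 'use instead' chain at {term!r}")
        seen.add(term)
        term = use_instead[term]
    return term


def search_hints(searched_words, ungrouped_tokens):
    """Words people searched for that also sit in our ungrouped list."""
    ungrouped = {t.lower() for t in ungrouped_tokens}
    return sorted({w.lower() for w in searched_words} & ungrouped)


# Hypothetical thesaurus: old names are never deleted, only redirected.
USE_INSTEAD = {
    "project-one": "pro1-legacy",
    "pro1-legacy": "pro1",
}

UNGROUPED = ["pro1", "docs", "specs", "pro2", "rtf",
             "division", "div6", "policy", "development"]

print(preferred_term("project-one", USE_INSTEAD))             # follows both links
print(search_hints(["Pro1", "budget", "policy"], UNGROUPED))  # overlap with the list
```

Because old terms are redirected rather than removed, a document tagged years ago with a retired project name still resolves to something current.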

Within the scope of this entry, there isn't too much more to do. We need a graduate student for sure. But, if you were to pursue a better tomorrow to enjoy my mum's one and only true plum pudding, here are a few hints on how you can do Knowledge Representation that tad bit better;

Don't underestimate the human interfaces to your systems. Use more resources on usability testing than on flashy design. The simpler the system is to use, the simpler it will be for people to add value to it.

That rule needs a line on its own, to give it some air and a pause to ponder. Read it five times, think seriously about it, and then read on;

First, create a filesystem harvester that can create simple ontologies.
Add a processing ontology to your framework, and use resource metadata extracting tools to get more data.
Never create specialist solutions! Never ever! Ever! Generalize your framework to cater for absolutely everything.
Design plugin architectures in every system you have, so that importing and exporting of metadata is there.
Install Google. 'Nough said.
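To give the first hint some shape, here is a minimal Python sketch of what a filesystem harvester could look like. It is a guess at the general idea under my own assumptions; the function name `harvest` and the token index it builds are invented for illustration, and the "ontology" is nothing more than a map from path keywords to files.

```python
import os
import tempfile
from collections import defaultdict
from pathlib import Path


def harvest(root):
    """Walk a directory tree and index every file under its path keywords."""
    root = Path(root)
    index = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = (Path(dirpath) / name).relative_to(root)
            # Directory names, then the file's base name and extension.
            tokens = list(rel.parts[:-1]) + [rel.stem]
            if rel.suffix:
                tokens.append(rel.suffix.lstrip("."))
            for token in tokens:
                index[token.lower()].add(rel.as_posix())
    return index


# Demo on a throwaway copy of the three exhibits.
EXHIBITS = [
    "projects/pro1/docs/specs/specs-v1_2.doc",
    "projects/pro2/specification-v2_2.rtf",
    "division/div6/policy/development.doc",
]

with tempfile.TemporaryDirectory() as tmp:
    for rel_path in EXHIBITS:
        target = Path(tmp) / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    index = harvest(tmp)

for keyword in ("projects", "doc"):
    print(keyword, "->", sorted(index[keyword]))
```

Keeping the output as a plain keyword-to-files map is deliberate; it can be fed into whatever dictionary, thesaurus or metadata extractor comes next without the harvester knowing about any of them.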

In my next followup to this, I'll present a simple filesystem harvester that can create simple ontologies, ready to be downloaded. Stay tuned.

Permalink (Wed, 10 November 2004 13:00:00 GMT)| Comments (0) | Knowledge and information
