Metadata Indexing and Structure: Bringing Order to the Chaos

Having a lot of data, in whatever form it might come – traditional paper records, microfilm or fiche or any of the multitude of electronic formats in use – is presumably a good idea, otherwise why bother having it? And indeed, we live in an information society, where pretty much everything we do is dependent in one way or another on someone’s ability to access large quantities of data very quickly and accurately.

But the utility of that data – ours, the utility company’s, Amazon’s, the state department of motor vehicles, your health care provider, indeed anybody we transact any business of any kind with – is utterly dependent upon being able to actually find the data when it’s needed. That means you need some way of organizing it so that it can be found. And the larger the data set, the more this is so. You can shuffle through a stack of a couple of dozen sheets of paper to find what you need without too much difficulty. Make it a hundred and it’s a lot tougher. A couple of thousand, and it’s very difficult and time-consuming – and error-prone. A million – not a chance. And the reality is that a commercial records system, even a small one, is going to look a lot more like that million pages – or a whole lot more – than it is a couple of thousand, whether it’s electronic or paper or a mix.

The implications of that are straightforward, but profound: you need a method of organization, otherwise you’ll never find things – most of your information will be effectively lost to you. And, if you have an organizational method, the utility of your information directly corresponds to the quality of your method. Indices, metadata and data structures are all about bringing that order to your data set.

Is your metadata a signpost, or a map?

The problem might best be considered by analogy. If you’ve ever driven a car in a very big, very old city – London, say – you’ll appreciate that it’s very difficult, to say the least. The streets are not organized according to any scheme, they’re all just a jumble that has accrued over a thousand years. It can be done, thankfully, but only because the streets each has a metadata tag in the form of a name. And so, very painfully, those of us who have not memorized the entire map (as London cab drivers, indeed do) can sort of navigate our way around if we have a map or someone gives us detailed directions.

Absent street names, it would be pretty much impossible. But because this metadata is in no way organized, its value is limited. That’s why the cab drivers memorize the whole lot.

Compare that to my city, Denver. The entire city – indeed, the entire metro area – is laid out on a grid. That organizational fact alone simplifies navigation. Colfax Avenue, for example, runs straight east-west for 40 miles. But the grid is also numbered. There’s a street corner that is the zero point – zero east-west, zero north-south. And in addition to a name, every street has a number – 100, 200 and so on.

So if I tell you that Pennsylvania Avenue is 500 east, you know pretty much exactly where that is. And if I tell you that my office is at 4340 South Pennsylvania, you know that it’s 5 blocks east, 43 blocks south. And they went even further, naming blocks of streets as consecutively numbered avenues after trees, historical figures, and so on. The cross-streets in my neighborhood are all named after colleges – I’m a couple of blocks from Pennsylvania and Oxford. Once you know the system and the naming conventions, Denver is a very easy town to get around in. And that’s because the entire city has a consistent metadata scheme, and the objects in it are laid out in an ordered, systematic way.

Managing data sets isn’t too far removed from this sort of comparison. Absent some sort of information about what’s in the data set, you’re left to rummage around randomly, like driving through London without a map and no street signs. So you need some metadata. And the more consistency and order you can put into that metadata, the more effective it becomes for you.

Filing systems – metadata for paper files

A well-designed and well-managed paper filing system is highly illustrative of this. There are an assortment of filing systems in common use – numeric, alpha-numeric, terminal digit and so on, but they all achieve some of the same important goals:

Files have systematic, predictable metadata tags attached to them
Files are stored in systematic, predictable locations
Files on the same or similar topics are either physically grouped (by location) or logically grouped (by the coding on the file labels)

Special-purpose systems like terminal-digit systems may not seem logical to the uninitiated, but they do indeed accomplish the above. They must, otherwise you’d never find anything. And if you’ve ever dealt with a physical filing system where they’ve become careless and imprecise about their labeling and filing, you appreciate how important that system and its consistent application are.

Applying those lessons to metadata for electronic systems

The same logic applies equally to electronic systems. When you look at something like the file structure in Windows Explorer, or the MAC OS Finder, you’re looking at the direct electronic analogue of that paper filing system. There was a fad for a bit whose central tenet was that you didn’t need any sort of systematic data structure, that freeform metadata tags were the only needed tool; but it went away pretty quickly, because freeform metadata quickly becomes inadequate as the number of data objects in the collection grows. If you consider it in the context of a paper filing system, you quickly realize why: If you have say, a thousand filing cabinets to fill, filling them all in random order with unlabeled file folders results in a useless system.

Randomly label the folders with whatever comes to mind at the moment and it’s a bit better… but not much. You’d still have a terrible time finding a particular folder. It’s only when you systematically label the folders and systematically file them that the setup really begins to function. Precisely the same thing is true of electronic systems. The data objects aren’t necessarily contiguous physically, but the metadata scheme ties them together logically to achieve the same outcome.

You often hear someone say “We don’t use indices, we have a metadata schema.” Is metadata somehow different than an index?

No, it’s not: an index is a kind of metadata schema. Consider this simple index:

Accounting
- Accounts payable
- Accounts receivable
Human resources
- Applications and resumes
- Personnel files

Every data object in this system will have at least two metadata tags associated with it, for example, ‘human resources’ and ‘personnel files’. The first one places the object in a particular group of data objects, and the second one puts it in a smaller subgroup. Each personnel file would then have at least one additional metadata tag in the form of a name or employee ID number to permit identification of a particular file. When we see such a system represented as a folder structure, it’s important to remember that the folder structure doesn’t actually exist anyplace – the objects in it are typically stored on the hard drive – often hard drives – randomly. It’s just a graphical representation of a structured, hierarchical metadata scheme. When we drop a file into a folder, we’re really just attaching a metadata tag to it.

Empowering a better search experience with deep metadata

The beauty of a well-designed electronic system is that this logical hierarchy can be paired with all sorts of additional metadata – date and time stamps, keywords, author, the list is virtually endless. And with a good search engine, these metadata fields can be paired, sorted, filtered and displayed in a multitude of ways that can’t be done with a physical filing system, allowing you to search in powerful ways – “Show me all of the payable invoices created by Joe Smith between January 10 and July 7 that relate to Acme Corp.”

But to put that power to use, your scheme has to be systematic and ordered. It’s the predictability and consistency that make for the power. If we don’t consistently label them as invoices or we don’t consistently put Joe’s name to them, we can’t do this.

So what’s the best way to lay out an index? Well, there isn’t one. Consider this simple index for business tax forms. Which structure is better: by form number, or by year?

Structured by form number

Tax returns
- Form 1120
  - 2018
  - 2019
  - 2020
- Form 941
  - 2018
  - 2019
  - 2020
- Form 940
  - 2018
  - 2019
  - 2020

Structured by year

Tax returns
- 2018
  - Form 1120
  - Form 940
  - Form 941
- 2019
  - Form 1120
  - Form 940
  - Form 941
- 2020
  - Form 1120
  - Form 940
  - Form 941

Which one is better depends on the nature of the searches within the system. Your goal in building the index is to provide the shortest search path possible to anyone searching the system. You can’t optimize the index for every kind of search, so you optimize it for the most common.

So, if you mostly look for all tax forms for a single year, the second scheme makes more sense. If you’re mostly dealing with a single kind of form over multiple years, the first makes more sense. As a result, a scheme optimized for day-to-day accounting work is likely to look very different than one optimized for responding to audits or lawsuits.

Success is all in planning

So, in building the index and other metadata tags, you must first get a sense of who is doing the searching, how they do their work and how they search for things. The same is true for the index terms themselves – they have to be meaningful to your users, otherwise they’ll be hunting around randomly. It’s also worth noting that in a good electronic system you may be able to reorder your index and display it in other ways as well – reordered, combined with other metadata fields, flattened out, moved around and whatever.

So, in our accounting example, you could have both layouts available as needed. The ability to make these dynamic changes to the display allows powerful searching unattainable otherwise. But always, the keys are thoughtfulness and care in devising your metadata terms and structure, and a high degree of consistency in its application. You need not have a powerful engine to have an effective index and metadata scheme; and a powerful engine without a good metadata scheme isn’t going to perform that powerfully. The challenge is always in building the intellectual capital – index, data structure, metadata set or whatever you call it – that will drive the system. Build it well, and it will be effective in any environment, because it contains the key information and the key relationships that allow effective searching.

For more on how metadata can be used to enable better operations, check out this webinar recording where Susan Cisco joined us for an Out of the Box Live! session titled Retention Beyond the Curve