Structured data vs Unstructured data

Giving Structure to Your Data

Posted by Suewon Bahng on October 12, 2015

Unstructured data

The other day I explained the rough concept of PKM (Personal Knowledge Management) to a friend of mine, along with how Civilizer can help with PKM activities. He argued that using a tool like Civilizer is overkill and that a simple tool like Windows Notepad is enough.

As far as I’m aware, note-taking applications like EverNote are widely used for activities similar to PKM. That makes sense, because EverNote is sophisticated and useful enough for the job. But Windows Notepad? I don’t think so. The data model that Windows Notepad works with is just a sequence of text. In other words, it is a form of unstructured data.

Before discussing what unstructured data is, we need to discuss what structured data is. Different people define and interpret structured and unstructured data in slightly different ways, so note that what’s described here is just my interpretation.

Structured Data

Structured data is data that follows carefully predefined rules and formats, which can be used to identify, categorize, link (relate), or constrain the data. These additional attributes let humans or machines collect, track, and filter the data effectively.

A typical example of highly structured data is a relational database: distinct columns representing different attributes of a table, primary keys to identify a particular entity, foreign keys for relationships, constraints to keep errors from slipping in, and so on. A carefully structured data model lets you leverage the power of SQL queries, so that humans or machines can effectively extract useful information from the database.
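To make those pieces concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are my own assumptions for illustration only; they don't come from any real system.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two small, related tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE job (
            id    INTEGER PRIMARY KEY,          -- primary key: identifies one job
            title TEXT NOT NULL UNIQUE
        );
        CREATE TABLE person (
            id      INTEGER PRIMARY KEY,        -- primary key: identifies one person
            name    TEXT NOT NULL,
            address TEXT NOT NULL,
            age     INTEGER CHECK (age >= 0),   -- constraint: rejects impossible values
            job_id  INTEGER NOT NULL REFERENCES job(id)  -- foreign key: relationship
        );
        INSERT INTO job (id, title) VALUES (1, 'engineer'), (2, 'baker');
        INSERT INTO person (name, address, age, job_id) VALUES
            ('Alice', '1 Main Street', 34, 1),
            ('Bob',   '2 Oak Avenue',  41, 2);
    """)
    return conn


def names_with_job(conn, title):
    """Use the structure to ask a precise question: who has this job?"""
    rows = conn.execute(
        "SELECT person.name FROM person "
        "JOIN job ON job.id = person.job_id "
        "WHERE job.title = ? ORDER BY person.name",
        (title,),
    ).fetchall()
    return [name for (name,) in rows]


def insert_is_rejected(conn, sql, params):
    """Return True if the database refuses the row because it breaks a rule."""
    try:
        conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False
```

Calling `names_with_job(build_demo_db(), "engineer")` picks out only the matching people, and an insert with a negative age or a `job_id` that doesn't exist is refused by the database itself rather than by any checking code of mine.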

Giving Structure to Your Data

A bunch of plain text, which is what Windows Notepad deals with, follows hardly any significant rules. Raw text is so primitive that there is not much Windows Notepad can do for you. The best it can offer is simple text search. Raw text has almost no structure.

Let’s imagine a database that contains a single table with a single column, where every row is just raw text covering a wide range of content, such as people’s names, addresses, and jobs. The useful features of SQL can’t do much for this imaginary database because the data has almost no useful structure. That means your smart SQL skills are useless on this data. What a waste!

Having only a single column means there is only one attribute to identify a particular data entity, and that attribute is actually a mixture of other attributes (name, address, job, …). Text data in Windows Notepad looks like this: a mass of diverse information blended together. This mixed form of data is usually difficult to identify, filter, or modify. Say we have data of the form A + B, and the two always go together. When we need only A, we have no choice but to retrieve A + B altogether, which is inefficient. And when we want to modify only B, we need to be very careful that the modification doesn’t touch A, which is error-prone and stressful.
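The single-column table imagined above can be sketched side by side with a structured one. This is only an illustration in Python with sqlite3; the table names (`blob_notes`, `person`) and the sample people are made up for the example.

```python
import sqlite3


def build_tables():
    """Store the same facts twice: once as blended text, once split into columns."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Unstructured: one column holding name + address + job mixed together.
        CREATE TABLE blob_notes (raw TEXT);
        INSERT INTO blob_notes (raw) VALUES
            ('Tom Baker, 5 Hill Road, engineer'),
            ('Ann Lee, 9 Baker Street, teacher'),
            ('Sam Cho, 3 Lake Drive, baker');

        -- Structured: one attribute per column.
        CREATE TABLE person (name TEXT, address TEXT, job TEXT);
        INSERT INTO person (name, address, job) VALUES
            ('Tom Baker', '5 Hill Road', 'engineer'),
            ('Ann Lee', '9 Baker Street', 'teacher'),
            ('Sam Cho', '3 Lake Drive', 'baker');
    """)
    return conn


def bakers_from_blob(conn):
    """Text search is all we have, and it cannot tell a job from a surname or a street."""
    rows = conn.execute(
        "SELECT raw FROM blob_notes WHERE raw LIKE '%baker%' ORDER BY raw"
    ).fetchall()
    return [raw for (raw,) in rows]


def bakers_from_columns(conn):
    """With a dedicated job column we can ask for exactly the attribute we mean."""
    rows = conn.execute(
        "SELECT name FROM person WHERE job = 'baker' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def change_job(conn, name, new_job):
    """Modify only the job (B); the name and address (A) are never touched."""
    conn.execute("UPDATE person SET job = ? WHERE name = ?", (new_job, name))
```

With this sample data, the text search returns all three rows, because "baker" also appears in a surname and a street name (SQLite's LIKE ignores ASCII case by default), while the column query returns only the one person whose job is actually baker. Likewise, `change_job` updates one attribute without any risk of damaging the others.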

On the other hand, Excel expresses its data in rows and columns. This row-column structure is what makes Excel such useful and powerful software, even if the structure seems simple and trivial. My point is this: giving some structure to your data gives it extra value and usefulness.

Structured data and its short term cost

Basically, we need to convert unstructured data (such as a collection of text) into a structured form to make it useful to us. Generally, the more structured your data is, the more value you can extract from it. You can argue that unstructured data can be valuable in itself if it contains useful information. That’s right, but it would be more correct to say that it is “potentially” valuable. I agree that what the data contains is important. But how we extract that useful information from the data is important as well. Unstructured data has an arbitrary format and follows arbitrary rules, and that poses a big challenge to us.

Most structured data comes with the extra cost of converting it from its original unstructured form. There have been plenty of part-time jobs for entering data written on paper or in documents into database consoles. These jobs are basically about converting unstructured data into structured data. Paying people to do the conversion is the traditional way. Some data or content companies may have automation tools that do the conversion instead of human labor, but developing and maintaining those tools is not free either.
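As a rough idea of what such an automation tool does, here is a tiny sketch in Python. It assumes, purely for illustration, that a well-formed note line looks like "name, address, job"; real notes are rarely that tidy, and the `Person` type and `convert_lines` function are hypothetical names of mine.

```python
from typing import NamedTuple


class Person(NamedTuple):
    """The structured form we want to end up with."""
    name: str
    address: str
    job: str


def convert_lines(lines):
    """Split free-text lines into structured records plus a pile for manual review."""
    converted, needs_review = [], []
    for line in lines:
        parts = [part.strip() for part in line.split(",")]
        if len(parts) == 3 and all(parts):
            # The line happens to follow the assumed "name, address, job" shape.
            converted.append(Person(*parts))
        elif line.strip():
            # Anything else still needs a human to read and enter it.
            needs_review.append(line)
    return converted, needs_review
```

Even in this toy version, every line that doesn't fit the expected shape lands in `needs_review`, which is exactly where the human labor (and the cost) comes back in.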

However, generally speaking, those costs are short term and worth paying compared to the long-term benefits we get from the conversion. That’s why so many businesses have been paying to convert their data.

PKM and Structured data

The friend who argued that Windows Notepad is useful enough is a web engineer. Meanwhile, I know someone who personally uses database software to manage data for his job. Even though he is not an engineer, he writes SQL statements in a database console to store and retrieve his data efficiently. (The data is for his own duties and not worth sharing with coworkers.) I guess some of his coworkers probably don’t understand why he does that. While some non-technical people understand the concept of structured data, some engineers don’t even care about it. I wonder why.

My friend who advocates Windows Notepad might think personal data is so trivial that it’s overkill to adopt database concepts for PKM. In my opinion, the benefits of structured data don’t depend on whether the data is for personal use or enterprise use. Besides, I don’t think anyone’s personal data is that trivial these days. Even if it is for personal use, your data might grow to a significant volume as you create and edit it over your career. Scalability is not a concern reserved for enterprise data these days, at least in my opinion. There is no rule that only enterprise data should be in a structured form.

I mentioned in an earlier blog post that whether or not to practice PKM is the primary question, and which tool to choose is a secondary one. If you think you’ll be happy with Windows Notepad, I recommend you try PKM with it. Maybe combining it with extra tools like grep can make it a little more productive, but it still doesn’t sound convincingly productive to me.

As discussed earlier, structuring your data has a caveat: it costs effort and time in the short term. That may be why many people hate tagging or organizing their data. But it has far more long-term value, and the short-term cost can be significantly reduced once you’ve got the knack.