At Celaton we have been talking about the different types of data for many years, and despite the plethora of definitions available, there is still some confusion over what exactly unstructured data is.
It is important first to define data, as there is certainly a lot of it in the world and it can mean different things to different people. The Cambridge Dictionary defines data as:
‘Information, especially facts or numbers, collected to be examined and considered and used to help decision-making or information in an electronic form that can be stored and used by a computer’.
In Celaton’s world, data is anything that is submitted to an organisation. That might be facts or numbers, as per the dictionary definition, but it is also invoices, sales orders, claims and customer correspondence. Data is anything that an organisation receives, whether via post, email, web form or any other channel, and that needs to be processed and understood in order to perform a further action and gain insight. In this sense, data is fundamental to an organisation’s daily operational activities.
For an organisation looking to streamline the processing of documents with automation technologies such as Robotic Process Automation (RPA) or Intelligent Process Automation (IPA), there are three main types of data to consider: ‘Structured’, ‘Semi-Structured’ and ‘Unstructured’. These categories reflect how complex the documents are for the respective technology to process.
Structured data is defined by TechTarget as ‘data that has been organised into a formatted repository, typically a database’. This type of data is typically found in spreadsheets or in formatted tables, and traditional Digital Process Automation (DPA) solutions such as OCR and RPA have already proven very effective at processing it. DPA solutions take the appropriate manual tasks within a process and use computer systems to organise and perform them more efficiently, eliminating unnecessary repetitive work by having the computer system carry it out instead. DPA works well for structured data because the data fields within the document are fixed and there is little variation in templates, making it simple to apply pre-programmed rules to process the information.
Semi-structured data, on the other hand, is data that contains semantic tags but does not conform to the structure associated with typical relational databases. For organisations, semi-structured data is the most common type and is often found within invoices, sales orders and some forms. The data contained within these documents can often move around the page; for example, one supplier's sales order format may differ from another's, which typically makes these documents more labour intensive to process. Because of this variation, traditional DPA solutions, with their rules-based approach to data identification and extraction, may struggle to process semi-structured documents. For example, within an Accounts Payable process, an organisation might receive tens of thousands of invoices from a wide variety of suppliers that need processing for payment. It is challenging for traditional DPA solutions to process such a wide variety of documents at volume, because every new format received requires an amendment to a template, and reprogramming the software each time costs time and money.
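The template problem described above can be illustrated with a short Python sketch. It is an illustration only, not how any particular DPA product works: the rule, the `extract_invoice_number` function and both sample invoices are hypothetical. The rule assumes the invoice number always sits on the first line after the label "Invoice No:", which holds for one supplier's layout and fails for another's.

```python
import re
from typing import Optional

# Hypothetical rule written for one supplier's layout: the invoice number
# is always on the first line, after the literal label "Invoice No:".
TEMPLATE_RULE = re.compile(r"^Invoice No:\s*(\S+)")


def extract_invoice_number(document: str) -> Optional[str]:
    """Apply the fixed template rule to the first line of the document."""
    lines = document.splitlines()
    first_line = lines[0] if lines else ""
    match = TEMPLATE_RULE.match(first_line)
    return match.group(1) if match else None


# Two made-up invoices carrying the same kind of information in different layouts.
supplier_a = "Invoice No: A-1001\nTotal: 250.00"
supplier_b = "ACME Ltd\nTotal due: 250.00\nInv. ref A-2002"

result_a = extract_invoice_number(supplier_a)  # the rule matches this layout
result_b = extract_invoice_number(supplier_b)  # the rule finds nothing here
```

Supporting the second supplier means writing and maintaining another rule, and the same is true of every new layout that arrives.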
IPA platforms, such as Celaton’s inSTREAM™, enable organisations to process documents with higher variation and at high volume, because they apply Machine Learning algorithms within an approach called ‘Human in the Loop’. inSTREAM learns through the natural consequence of processing and through collaborating with human operators who teach it about each new document or exception. This means there is no need for the platform to be reprogrammed with every new document type received, which not only reduces cost and time but also significantly improves process optimisation and scalability.
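The general idea of learning from operators rather than being reprogrammed can be sketched in a few lines of Python. This is a deliberately simplified toy and makes no claim about how inSTREAM or any other IPA platform is implemented: a plain lookup table of field labels stands in for a Machine Learning model, and `LabelMemory`, `process` and `ask_operator` are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict


class LabelMemory:
    """Toy stand-in for a learning component: remembers which labels mark a field."""

    def __init__(self) -> None:
        # Starts out knowing only one supplier's label.
        self.labels: Dict[str, str] = {"Invoice No:": "invoice_number"}

    def extract(self, document: str) -> Dict[str, str]:
        """Return every known field whose label starts a line of the document."""
        fields: Dict[str, str] = {}
        for line in document.splitlines():
            for label, field in self.labels.items():
                if line.startswith(label):
                    fields[field] = line[len(label):].strip()
        return fields


def process(document: str, memory: LabelMemory,
            ask_operator: Callable[[str], str]) -> Dict[str, str]:
    """Extract fields; on an exception, ask a human and remember the answer."""
    fields = memory.extract(document)
    if "invoice_number" not in fields:
        # Exception: the operator points out the label used in this layout.
        label = ask_operator(document)
        memory.labels[label] = "invoice_number"
        fields = memory.extract(document)
    return fields


# Hypothetical operator who recognises the label "Inv. ref".
questions_asked = []


def operator(document: str) -> str:
    questions_asked.append(document)
    return "Inv. ref"


memory = LabelMemory()
first = process("ACME Ltd\nInv. ref A-2002", memory, operator)   # operator is asked
second = process("ACME Ltd\nInv. ref B-3003", memory, operator)  # handled from memory
```

In this sketch the operator is consulted once for the new layout, and later documents in the same layout are handled without any code change.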
The final category, unstructured data, is defined as having no predefined format and is typically text-heavy and written in the human voice. This makes it much more difficult to collect, process, and analyse. Organisations tend to receive unstructured data within customer correspondence and in some claims. As such, this data is often linked to customer experience and so delays in processing can impact a company’s reputation and potential competitive advantage.
Despite the complexity of unstructured data, IPA platforms can be applied to process these documents because of their use of Machine Learning and ‘Human in the Loop’. However, it is important to note that in some instances manual processing may still be required because of the complexity of the document. For example, it may be handwritten, or too few documents of that type may be received for the platform to learn from. In these instances, it may not be cost-effective to deploy Intelligent Automation technology, as it can be difficult to achieve ROI.
In conclusion, however broad and confusing the terminology surrounding data might be, organisations can no longer ignore the important role that processing data effectively plays in business success. By understanding the different types of documents and the data contained within them, organisations can better identify the most effective technology applications for their processes and achieve sustainable long-term efficiencies.