Big data is a buzzword in the IT industry and is often associated with personal data collected by large- and medium-scale enterprises. The term describes the large-scale acquisition of data through various types of systems. These large quantities of data are often fed into pattern recognition and predictive behavioral systems.
A common misconception is that big data necessarily infringes on personal privacy. In fact, appropriate analysis of structured, semi-structured and unstructured data can enhance the user's personal experience, predict useful behaviors and potentially help make smart business decisions.
Although big-data analytics remains underutilized, it is finding its way into digital and mobile forensics. The data is often located on seized or analyzed artifacts in semi-structured form, typically in application databases. It can also be found as unstructured data when investigating officers acquire a digital snapshot from the storage chipsets of mobile devices, computers or any other electronic devices with digital storage. After successful extraction, the collected data is then migrated to servers with very little, if any, intelligence processing.
The four main pillars of big data are often defined as volume (amount of data), variety (multiple types of data), velocity (speed of data processing) and value (not a tangible price but the valuable intelligence that can be collected from the data).
The four pillars of big data can be restated in relation to digital forensics in the following way:
Volume refers to the amount of data collected from one or more seized devices.
Variety refers to the different types of files or data present on the medium (for example, allocated data from known file systems and unallocated data from volume and file slack space).
Velocity concerns the time needed to acquire the data initially and then to process it.
Value, finally, is not the resale value of the data but the value of the actual intelligence collected when the data is processed correctly.
Challenges and adaptations in digital forensics
Big (Data) Storage
There is no formal threshold for the size of data that can be referred to as “big data.” Its classification is much more complex than a single number, and what counts as big data shifts as data sizes keep changing. That said, the literature generally accepts a dataset of 1 terabyte (TB) as qualifying as “big data.”
According to the International Data Corporation (IDC), every person online will create an average of 1.7 megabytes of new data every second by 2020, and only 37% of all big data could be analyzed. This leaves a plethora of untapped information that law enforcement agencies could process to solve crime more efficiently.
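To put the quoted rate in perspective, the short calculation below converts 1.7 megabytes per second into a daily volume and estimates how long one person would take to reach the 1 TB mark mentioned earlier. It is only a back-of-the-envelope sketch: the decimal units (1 TB = 1,000,000 MB) and the names used are assumptions made for illustration.

```python
# Back-of-the-envelope arithmetic on the figure quoted above.
# Assumption: decimal units (1 GB = 1,000 MB, 1 TB = 1,000,000 MB).
MB_PER_SECOND = 1.7
SECONDS_PER_DAY = 24 * 60 * 60
MB_PER_GB = 1_000
MB_PER_TB = 1_000_000


def daily_volume_mb(rate_mb_per_s: float = MB_PER_SECOND) -> float:
    """Data created per person per day, in megabytes."""
    return rate_mb_per_s * SECONDS_PER_DAY


def days_to_reach(target_mb: float, rate_mb_per_s: float = MB_PER_SECOND) -> float:
    """Days needed for one person to generate target_mb of data."""
    return target_mb / daily_volume_mb(rate_mb_per_s)


if __name__ == "__main__":
    print(f"Per day: {daily_volume_mb() / MB_PER_GB:.2f} GB")
    print(f"Days to reach 1 TB: {days_to_reach(MB_PER_TB):.1f}")
```

At that rate a single person would produce roughly 147 GB a day and pass 1 TB in about a week, which illustrates why a fixed size threshold for "big data" ages quickly.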
Structured vs. unstructured data processing
Structured data usually refers to data that has a defined, known structure. This could be numbers, dates, groups of words or simply strings accessed within the storage medium. It is data regularly tapped into during an investigation and generally stored in database files.
Structured data can be processed using MSAB’s innovative tools, which parse users’ data from supported and unsupported applications, decode encoded dates and times and present stored data in a readable form, while concentrating on the database file format and known structures.
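As a rough illustration of what decoding encoded dates and times from an application database can involve, the sketch below reads rows from an SQLite database and converts raw integer timestamps into readable UTC times. It does not represent MSAB's tools or their internals. The table name, column names and sample row are hypothetical; the three epochs (Unix 1970, Apple Cocoa 2001, WebKit 1601) are widely used conventions in application databases.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Common reference points for integer timestamps in application databases.
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
COCOA_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)   # Apple "absolute time"
WEBKIT_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)  # Chromium/WebKit


def decode_timestamp(raw: int, encoding: str) -> datetime:
    """Convert a raw integer timestamp into a timezone-aware UTC datetime."""
    if encoding == "unix_s":
        return UNIX_EPOCH + timedelta(seconds=raw)
    if encoding == "unix_ms":
        return UNIX_EPOCH + timedelta(milliseconds=raw)
    if encoding == "cocoa_s":
        return COCOA_EPOCH + timedelta(seconds=raw)
    if encoding == "webkit_us":
        return WEBKIT_EPOCH + timedelta(microseconds=raw)
    raise ValueError(f"unknown timestamp encoding: {encoding}")


def read_messages(conn: sqlite3.Connection, encoding: str = "unix_ms"):
    """Return (sender, body, decoded time) tuples from a hypothetical table."""
    rows = conn.execute(
        "SELECT sender, body, sent_at FROM messages ORDER BY sent_at"
    )
    return [(s, b, decode_timestamp(t, encoding)) for s, b, t in rows]


if __name__ == "__main__":
    # Hypothetical stand-in for a database file recovered from a device.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (sender TEXT, body TEXT, sent_at INTEGER)")
    conn.execute(
        "INSERT INTO messages VALUES (?, ?, ?)",
        ("alice", "hello", 1_500_000_000_000),
    )
    for sender, body, sent in read_messages(conn):
        print(sender, body, sent.isoformat())
```

In practice the encoding has to be established per application and per column, since the same integer decodes to very different dates under different epochs and units.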
Unstructured data is information that either does not have a pre-defined data model or cannot be structured in an orderly fashion (such as in ordered rows and columns as found in databases). Unstructured data can include text in all forms, emails, video, audio files, web pages and social media. Making sense of unstructured data is often done by implementing complex search queries that extract the data and present it in a more orderly structure. The search queries enable an examiner to find data that shares the context of the investigation.
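One simple form such a search query can take is a pattern search over raw bytes, for example over a physical image of a device's storage. The sketch below is a minimal illustration under stated assumptions, not a description of any particular forensic product: the two patterns are deliberately simplified, and the sample buffer is invented for the example.

```python
import re

# Simplified byte patterns; real investigations would need broader,
# carefully validated expressions.
PATTERNS = {
    "email": re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "url": re.compile(rb"https?://[^\s\x00<>]+"),
}


def search_raw(data: bytes):
    """Return (offset, label, text) hits, sorted by byte offset."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(data):
            text = match.group().decode("ascii", errors="replace")
            hits.append((match.start(), label, text))
    return sorted(hits)


if __name__ == "__main__":
    # Invented buffer standing in for a chunk of unallocated space.
    sample = (
        b"\x00\x13junk\x00contact: bob@example.org\x00"
        b"\xff\xfevisit http://example.com/page\x00tail"
    )
    for offset, label, text in search_raw(sample):
        print(f"{offset:#06x}  {label:<5}  {text}")
```

Recording the byte offset alongside each hit keeps the result traceable to its location in the acquired image, so the unstructured source can be presented as an ordered table of findings.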
To generate tangible intelligence from big data, a structured intelligence extraction process is used to examine the data and present the examiner with an actionable dataset. The actions implemented are usually completed in phases. The main aim of these phases is to put structured, unstructured and semi-structured data in a tangible format that resembles the evidence examined. It is perceived that a high proportion of law enforcement agencies do not process their stored data for actual intelligence that can later be used in behavioral analysis, in assessing the possibility of re-offending or in digital profiling.