Big data is a buzzword in the IT industry and is often associated with personal data collected by large and medium-sized enterprises. The term is often used to describe the large-scale acquisition of data through various types of systems. These large quantities of data are often fed into pattern recognition and behavior prediction systems.

A common misconception is that big data necessarily involves the infringement of personal privacy. In fact, appropriate analysis of structured, semi-structured and unstructured data can enhance the user's personal experience, predict useful behaviors and potentially support smart business decisions.

Although big-data analytics remains underutilized, it is finding its way into digital and mobile forensics. The data is often found on seized or analyzed artifacts in semi-structured form, typically in application databases. It can also be found as unstructured data when investigating officers acquire a digital snapshot from the storage chips of mobile devices, computers or other electronic devices with digital storage. The collected data is then moved to servers, with very little intelligence processing, if any, after a successful extraction.

The four main pillars of big data are often defined as volume (amount of data), variety (multiple types of data), velocity (speed of data processing) and value (not a tangible price, but the valuable intelligence that can be collected from the data).

The four pillars of big data can be restated in relation to digital forensics as follows:

Volume refers to the amount of data collected from one or more seized devices.

Variety refers to the different types of files or data present within the medium (for example, allocated data from known file systems and unallocated data from volume and file slack space).

Velocity is concerned with the time needed to process the acquired data and, indeed, the time needed to acquire the data in the first place.

Finally, value is not the resale value of the data, but the value of the actual intelligence obtained when the data is processed correctly.

Challenges and adaptations in digital forensics

Big (data) storage

There is no formal threshold for the size of data that can be referred to as “big data.” Its classification is more complex than a single number, not least because the size of typical datasets keeps changing. That said, some of the literature accepts a dataset of 1 terabyte (TB) as qualifying as “big data.”

According to the International Data Corporation (IDC), every person online will create an average of 1.7 megabytes of new data every second by 2020, and only 37% of all big data could be analyzed, leaving a plethora of untapped information that law enforcement agencies could process to solve crime more efficiently.
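To put the per-second figure into perspective, the short calculation below converts it into a daily volume per person. It is a rough sketch only: it assumes decimal units (1 GB = 1,000 MB), and applying the 37% share to a single person's data is an illustrative assumption, not something the cited figures state.

```python
# Back-of-the-envelope conversion of the quoted 1.7 MB/s figure.
MB_PER_SECOND = 1.7             # figure quoted in the text above
ANALYZABLE_SHARE = 0.37         # share quoted in the text above
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def daily_volume_mb(mb_per_second: float = MB_PER_SECOND) -> float:
    """Return the data volume created in one day, in megabytes."""
    return mb_per_second * SECONDS_PER_DAY


daily_mb = daily_volume_mb()
daily_gb = daily_mb / 1000                  # decimal gigabytes (assumption)
analyzable_gb = daily_gb * ANALYZABLE_SHARE  # illustrative only

print(f"{daily_mb:,.0f} MB per day, about {daily_gb:.1f} GB")
print(f"of which roughly {analyzable_gb:.1f} GB could be analyzed")
```

At that rate, a single person would generate close to 147 GB of new data per day.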

Structured vs. unstructured data processing

Structured data usually refers to data that has a defined, known structure. This could be numbers, dates, groups of words or simple strings stored on the medium. It is data regularly tapped into during an investigation and is generally stored in database files.

Structured data can be processed using MSAB’s innovative tools, which parse users’ data from supported and unsupported applications, decode encoded dates and times, and present the stored data in a readable form, while concentrating on the database file format and known structures.
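As an illustration of what decoding encoded dates from an application database involves, the sketch below reads integer timestamps from a SQLite table and converts them to UTC. It is not MSAB's implementation. The table, column names and the `encoding` column are hypothetical; in practice an examiner has to determine the encoding from the application's schema. The three epochs shown (Unix, Apple's 2001 epoch and WebKit's 1601 epoch) are common conventions.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Reference epochs for three common timestamp conventions.
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
APPLE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)   # seconds since 2001
WEBKIT_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)  # microseconds since 1601


def decode_timestamp(raw: int, encoding: str) -> datetime:
    """Convert a raw integer timestamp to a timezone-aware UTC datetime."""
    if encoding == "unix_s":
        return UNIX_EPOCH + timedelta(seconds=raw)
    if encoding == "apple_s":
        return APPLE_EPOCH + timedelta(seconds=raw)
    if encoding == "webkit_us":
        return WEBKIT_EPOCH + timedelta(microseconds=raw)
    raise ValueError(f"unknown timestamp encoding: {encoding}")


# Hypothetical application database, built in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (body TEXT, sent_at INTEGER, encoding TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        ("stored as Unix seconds", 1500000000, "unix_s"),
        ("stored as Apple seconds", 521692800, "apple_s"),
        ("stored as WebKit microseconds", 13144473600000000, "webkit_us"),
    ],
)

rows = conn.execute(
    "SELECT body, sent_at, encoding FROM messages ORDER BY rowid"
).fetchall()
conn.close()

decoded = [(body, decode_timestamp(raw, enc)) for body, raw, enc in rows]
for body, moment in decoded:
    print(f"{moment.isoformat()}  {body}")
```

All three rows encode the same moment, which shows why reading the raw integer without knowing the epoch can place an event decades away from when it actually occurred.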

Unstructured data is information that either does not have a pre-defined data model or cannot be structured in an orderly fashion (such as in the ordered rows and columns found in databases). Unstructured data can include text in all forms, emails, video, audio files, web pages and social media. Making sense of unstructured data is often done by running complex search queries that extract the data and present it in a more usable structure. These search queries enable an examiner to find data that matches the context of the investigation.
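A minimal sketch of such a search query is shown below. It uses regular expressions to pull email addresses and phone numbers out of free text. The patterns are deliberately simple and are assumptions made for illustration; real tools use far more robust matching, and the sample text is invented.

```python
import re

# Deliberately simple patterns, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def extract_entities(text: str) -> dict:
    """Return the unique email addresses and phone numbers found in text."""
    emails = set(EMAIL_RE.findall(text))
    # Strip spaces and punctuation so differently formatted copies of the
    # same number collapse into one entry.
    phones = {re.sub(r"[^\d+]", "", match) for match in PHONE_RE.findall(text)}
    return {"emails": sorted(emails), "phones": sorted(phones)}


sample = (
    "Call me on +46 70 123 45 67 tonight. "
    "If I do not answer, mail bob@example.com instead."
)
entities = extract_entities(sample)
print(entities)
```

The result is a small structured record built from unstructured text, which can then be sorted, filtered or compared with other evidence.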

Intelligence prediction

To generate tangible intelligence from big data, a structured intelligence extraction process is used to examine the data and present the examiner with an actionable dataset. This process is usually completed in phases. The main aim of these phases is to put structured, semi-structured and unstructured data into a tangible format that resembles the evidence examined. It is perceived that a high proportion of law enforcement agencies do not process their stored data for actual intelligence that could later be used in behavioral analysis, in assessing the possibility of re-offending or in digital profiling.

Forensic solutions:

Training

A key element of utilizing big data is understanding how to process the knowledge it contains, and the best way to gain that understanding is by applying investigative and extraction techniques. MSAB offers data analysis training courses that help digital forensic examiners make the most of the knowledge residing within structured and unstructured data, and give them the skills they need to understand and utilize what they are seeing.

Ecosystem approach

Ecosystems are regularly utilized to support the processing lifecycle of big data. These systems typically start by identifying the media containing the data, then identify the extraction profile needed for each type and process the data once the extraction is completed. The processing techniques ensure that structured and unstructured data are indexed and categorized. Upon completion of this process, the advanced indexing processes run queries that make big data much more presentable to examiners during the investigation.
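The indexing and categorizing step can be sketched as follows. This is a toy model and not how any particular product works: the record layout, the `kind` field and the sample records are hypothetical. It groups extracted records by type and builds an inverted index, so that a keyword query only touches the records containing every search term.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def categorize(records: list) -> dict:
    """Group record ids by their (hypothetical) 'kind' field."""
    by_kind = defaultdict(list)
    for rec_id, rec in enumerate(records):
        by_kind[rec["kind"]].append(rec_id)
    return dict(by_kind)


def build_index(records: list) -> dict:
    """Build an inverted index: token -> set of record ids containing it."""
    index = defaultdict(set)
    for rec_id, rec in enumerate(records):
        for token in tokenize(rec["text"]):
            index[token].add(rec_id)
    return index


def search(index: dict, query: str) -> set:
    """Return ids of records that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return hits


# Invented records standing in for data extracted from seized devices.
records = [
    {"kind": "sms", "text": "Meet at the harbour at nine"},
    {"kind": "email", "text": "Invoice for the harbour warehouse"},
    {"kind": "note", "text": "Buy milk"},
]

categories = categorize(records)
index = build_index(records)
print(categories)
print(search(index, "harbour"))
print(search(index, "harbour warehouse"))
```

Building the index once at processing time is what allows later queries to return quickly, because each search reduces to a few set lookups instead of a scan of every record.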

Learn more in this whitepaper: How taking the ecosystem approach to mobile forensics helped the Metropolitan Police fight terrorism.

Practical application

Examiners concerned with the correct processing of big data should combine customized vendor solutions, vendor-neutral software and their own extraction knowledge, so that the solutions used to digest the data are heterogeneous in nature and maximize the results they can see.

Conclusion

Big data analytics can provide significant advantages to forensic examiners, especially when the content is processed correctly. The examiner would generally utilize advanced search techniques built into the software to address individual items within emails or search history. These techniques and more are built into the acquisition software to present instant results to an examiner when they search for names, phone numbers or data strings. However, forensic examiners who only store the data and extract specific datasets are missing out on the potential of intelligence analysis. The analysis techniques assist law enforcement agencies in predicting behavioral patterns, or in identifying whether a suspect has been involved in other forms of crime or has links to other persons known to the agency.
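As a closing illustration of the link analysis described above, the sketch below cross-references the contacts found on a seized device against a list of persons already known to an agency. All names and numbers are invented. The matching rule is a simplifying assumption: it compares digits only, so it does not reconcile national and international formats of the same number.

```python
def normalize(number: str) -> str:
    """Keep digits only, so formatting differences do not block a match."""
    return "".join(ch for ch in number if ch.isdigit())


def find_links(device_contacts: list, known_persons: dict) -> dict:
    """Map each device contact to the known person sharing its number."""
    known_by_number = {normalize(num): name for name, num in known_persons.items()}
    links = {}
    for contact_name, number in device_contacts:
        match = known_by_number.get(normalize(number))
        if match is not None:
            links[contact_name] = match
    return links


# Invented data: contacts from a seized phone and an agency's known persons.
device_contacts = [
    ("Uncle J", "+46 70 123 45 67"),
    ("Pizza", "08-555 01 00"),
]
known_persons = {
    "John Doe": "+46701234567",
    "Jane Roe": "+46 70 999 88 77",
}

links = find_links(device_contacts, known_persons)
print(links)
```

Here the contact saved as "Uncle J" resolves to a person already known to the agency, which is the kind of connection that remains hidden when extracted data is stored without further analysis.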