DATA SCIENCE METHODOLOGY FOR
CYBER SECURITY PROJECTS SURVEY

Teena Thomas
May 16, 2021

Abstract

With the growth of multi-user digital environments, there has been an abundance of user-rich data. “The data generated is a reflection of the environment it is produced out of, thus we can use the data we get out of systems to figure out the inner workings of that system” [2]. Big data analytics is the term researchers use for the process of extracting, processing, and storing large amounts of data, and the concept of big data has become an important feature in cybersecurity. Since the main emphasis of cybersecurity is protecting assets, big data serves as a tool toward that end. This paper surveys recent research advances in cybersecurity in relation to big data: how data is protected, and how big data can be used as a tool in cybersecurity.

Section 1: Introduction

As the world breaks new ground in the digital realm, cyber intrusions grow as well. Nations worldwide are using many means to secure their digital territory. Within a span of just 15 years, the amount of data being generated has risen exponentially across various domains, and this in turn makes way for a new era: the era of big data. With technological advances and powers comes the potential for malicious use of those powers. Initially, hacking was done for personal contentment and/or monetary benefits. “However, these days, attacks are more calculated and motivated. Nations are accusing each other of hacking. There is also a significant rise in industrial espionage which can either be from nation-state or competing entities trying to gather information or to take away a competitor’s edge as to increase their own” [2]. As a result of the unpredictable nature of attacks, cybersecurity is an important field within the realm of computer science. Its main aim is to secure attack points and keep the susceptibility of sensitive points to a minimum.

Cybersecurity can be summed up by the PDR paradigm: prevent, detect, and respond. Similarly, big data can be summed up by the 3Vs: volume, velocity, and variety. “Volume represents the fact that the data being generated is enormous, velocity represents the fact that data is being generated at an alarming rate, and variety represents the fact that the data being generated comes in all types of forms” [2]. Data science involves the extraction of information, trends, patterns, and more from immense amounts of data. One important element of data science is data analytics, and there are two main approaches for conducting it.

1.1 Data-Driven Decision Making:

Data-driven decision making (DDD) describes a technique for making decisions based on data analysis rather than solely on an individual’s innate knowledge. DDD is used in the field of cybersecurity at various levels of engagement: “1) decisions which are based on data discovery 2) decisions which are based on frequent decision-making processes particularly at considerable dimensions or massive scale. This kind of decision-making process might gain from an even minor increase in reliability and precision based on information evaluation and data analysis” [1]. Data analytics, along with a solid understanding of essential concepts, aids cybersecurity experts in making data-driven decisions.
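To make this concrete, here is a minimal sketch of a data-driven decision in a security setting: the alert threshold is derived from historical log data rather than chosen by intuition. All the numbers and names here are invented for illustration.

```python
# A minimal sketch of data-driven decision making: instead of picking an
# alerting threshold by gut feel, derive it from historical data.
import numpy as np

# Daily counts of failed logins per user, collected from (hypothetical) logs.
historical_failed_logins = np.array([3, 5, 2, 8, 4, 6, 3, 7, 5, 40, 4, 6])

# Data-driven choice: flag anything above the 95th percentile as anomalous,
# rather than a hard-coded "10 attempts" rule based on a hunch.
threshold = np.percentile(historical_failed_logins, 95)

todays_count = 23
if todays_count > threshold:
    print(f"Alert: {todays_count} failed logins exceeds threshold {threshold:.1f}")
```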

1.2 User Data Discovery:

User Data Discovery (UDD) is the process of creating profiles for users based on previous details about the user as well as the user’s historical information/records. The information can include geographic location details, academic records, private activities, and other personal information. “The primary function of the user profiling process is capturing user’s information about the interest domain. This information may be used to understand more about an individual’s knowledge and skills and to improve user satisfaction or help to make a proper decision. Typically, it evolves data mining techniques and machine learning strategies. UDD process is a type of knowledge discovery in databases or the new version, knowledge data discovery model and requires similar steps to be established” [1]. This is also referred to as user profiling, and it can be done knowledge-based or behavior-based. Real-time recognition of user behavior is a crucial element in cybersecurity projects.
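As a brief illustration, the sketch below mines a behavior-based profile from historical records and then checks a new event against it, in the spirit of real-time recognition. The events and field names are hypothetical.

```python
# A minimal, hypothetical sketch of behaviour-based user profiling (UDD):
# summarise each user's historical activity into a profile, then compare
# new events against it. Events and field names are invented.
from collections import defaultdict
from statistics import mean, pstdev

events = [
    {"user": "alice", "hour": 9,  "country": "US"},
    {"user": "alice", "hour": 10, "country": "US"},
    {"user": "alice", "hour": 11, "country": "US"},
    {"user": "bob",   "hour": 22, "country": "DE"},
]

# Knowledge-discovery step: mine a per-user profile from historical records.
profiles = defaultdict(lambda: {"hours": [], "countries": set()})
for e in events:
    profiles[e["user"]]["hours"].append(e["hour"])
    profiles[e["user"]]["countries"].add(e["country"])

def is_suspicious(event, profile):
    """Real-time recognition: does this event deviate from the profile?"""
    hours = profile["hours"]
    unusual_hour = abs(event["hour"] - mean(hours)) > 2 * (pstdev(hours) or 1)
    unusual_place = event["country"] not in profile["countries"]
    return unusual_hour or unusual_place

new_event = {"user": "alice", "hour": 3, "country": "RU"}
print(is_suspicious(new_event, profiles["alice"]))  # True: odd hour and place
```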

Section 2: Approaches

2.1 KDD Process

KDD stands for Knowledge Discovery in Databases. KDD is used to extract nontrivial information from data stored in databases. It is an iterative process, which means that enhancements can be made, such as the integration of new data, in order to fine-tune the results being generated. The KDD process consists of five steps: selection, pre-processing, transformation, data mining, and interpretation/evaluation.

Selection: generate or produce a target data set, or concentrate on a subset of variables or data samples in a database.

Pre-processing: obtain consistent data by cleaning or pre-processing the selected data.

Transformation: reduce feature dimensionality by applying data transformation methods in this phase.

Data Mining: recognize patterns of interest or behaviours by applying data mining techniques in a specific form (typically prediction).

Interpretation/Evaluation: assess and interpret the mined patterns in this final phase [1].
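As a rough illustration, the five KDD steps can be mapped onto a standard scikit-learn workflow. The data here is synthetic, standing in for records selected from a database; a real project would substitute its own selection query and features.

```python
# A sketch mapping the five KDD steps onto a scikit-learn workflow.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# 1. Selection: produce a target data set (synthetic stand-in for a DB query).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Pre-processing: clean/standardise the selected data.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 3. Transformation: reduce feature dimensionality.
pca = PCA(n_components=5).fit(X_train)
X_train, X_test = pca.transform(X_train), pca.transform(X_test)

# 4. Data mining: fit a predictive model to recognise patterns.
model = LogisticRegression().fit(X_train, y_train)

# 5. Interpretation/evaluation: assess the mined patterns.
print(classification_report(y_test, model.predict(X_test)))
```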

2.1.1 Benefits

  • Every step of the KDD process is laid out in detail, making the process easier to comprehend.
  • Suited for machine learning projects where there is a clear objective/problem to solve.

2.1.2 Problems

  • Algorithm-heavy
  • Suited to problem solving rather than discovery.
  • Does not cover the business understanding and deployment phases.
  • Used less frequently than other methodologies.

2.2 CRISP-DM Process

CRISP-DM stands for the Cross-Industry Standard Process for Data Mining. This process allows for a more structured approach when planning a data mining project. CRISP-DM can be broken into six phases: business understanding, data understanding, data preparation, modelling, evaluation, and deployment.

Business Understanding: in this phase, the main goal of the project, or the problem at hand, is understood, and the knowledge gained is used to come up with a preliminary approach to the data mining problem. It is during this stage that the objectives are set, the project plan is produced, and the business success criteria are determined.

Data Understanding: this is the initial phase of data collection. It is during this phase that the quality, or veracity, of the data is assessed.

Data Preparation: in this phase, all of the activities and tasks needed to construct a dataset from the raw data are decided upon and carried out.
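As a small illustration, typical data preparation tasks might look like the following sketch; the column names and values are invented.

```python
# A small, hypothetical example of data-preparation tasks: cleaning raw
# records and deriving model-ready features. Column names are invented.
import pandas as pd

raw = pd.DataFrame({
    "src_ip":   ["10.0.0.1", "10.0.0.2", None,  "10.0.0.4"],
    "bytes":    [1200,       None,        5400,  300],
    "protocol": ["tcp",      "udp",       "tcp", "icmp"],
})

prepared = (
    raw.dropna(subset=["src_ip"])                   # drop rows missing a key field
       .assign(bytes=lambda d: d["bytes"].fillna(d["bytes"].median()))
       .pipe(pd.get_dummies, columns=["protocol"])  # encode categoricals
)
print(prepared)
```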

Modelling: the modelling techniques and strategies are decided upon in this stage. The parameters and prerequisites are also identified and considered.

Evaluation: at this point, the model or models which seem to provide high quality based on the loss function are thoroughly evaluated, and actions are taken to ensure they generalise to unseen data and correctly achieve the key business goals. The final result is the selection of a sufficient model or models [1].
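A brief sketch of this idea, using synthetic data: score the model on folds of data it never saw during fitting, so the reported quality reflects generalisation rather than training fit.

```python
# Hold out data the model has never seen and check that the score
# generalises, rather than trusting the fit on training data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=12, random_state=3)
scores = cross_val_score(LogisticRegression(), X, y, cv=5)  # 5 hidden folds
print(scores.mean(), scores.std())  # stable scores suggest generalisation
```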

Deployment: this stage means deploying a code representation of the final model or models, in order to evaluate or categorise new data as it arises, generating a mechanism for using new data within the formulation of the original problem. Even if the goal of the model is to provide knowledge or to understand the data, the knowledge acquired has to be organised, structured, and presented in a usable form. This includes all the data preparation steps required to treat raw data in the same way as during model construction [1].
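One plausible way to realise this, sketched below with synthetic data and a hypothetical file path, is to bundle the preparation steps and the final model into a single pipeline, persist it, and let the production side load it to categorise new data exactly as during model construction.

```python
# A minimal deployment sketch: bundle data preparation with the final model,
# persist the bundle, and reuse it for new data. The path is hypothetical.
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The pipeline is the "code representation of the final model": it carries
# the same preprocessing used while the model was built.
deployed = Pipeline([
    ("prep", StandardScaler()),
    ("model", LogisticRegression()),
]).fit(X, y)

joblib.dump(deployed, "final_model.joblib")   # ship to production

# Later, as new data arises, the production side categorises it:
scorer = joblib.load("final_model.joblib")
print(scorer.predict(X[:5]))
```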

2.2.1 Benefits

  • Many case studies have used the CRISP-DM process, so it is a very well-proven process.
  • “There are many methodologies and advanced analytic platforms which are actually based on CRISP-DM steps because the use of a commonly practised methodology gains quality and efficiency” [1].

2.2.2 Problems

  • CRISP-DM’s “details and specifics needed to be updated for cybersecurity projects such as those including Big Data” [1].
  • There are six stages to the CRISP-DM process, but their sequence order is only loosely kept.

2.3 Foundational Methodology for Data Science Process

The Foundational Methodology for Data Science (FMDS) process shares many of its features with the KDD and CRISP-DM processes. It is very well suited to processing very large volumes of data, such as text, images, and data for artificial intelligence. There are ten steps in the FMDS process: business understanding, analytic approach, data requirements, data collection, data understanding, data preparation, modelling, evaluation, deployment, and feedback.

Business Understanding: as in CRISP-DM (Section 2.2), the main goal of the project is understood, the objectives are set, the project plan is produced, and the business success criteria are determined.

Analytic Approach: once the problem has been identified, an appropriate analytic approach, such as a statistical or machine learning technique, is selected in this phase.

Data Requirements: once an appropriate analytic approach has been identified, the data requirements are defined in this phase.

Data Understanding, Data Preparation, Modelling, Evaluation, and Deployment: these five phases proceed exactly as in the CRISP-DM process described in Section 2.2 above.

Feedback: this is the last phase of FMDS, in which the “outcomes from the implemented edition of the analytic model” are used to “analyse and feedback its functionality, performance and efficiency in accordance with the deployment environment” [1].
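A hypothetical sketch of such a feedback loop: compare the deployed model’s performance on fresh, labelled outcomes against the level recorded during evaluation, and loop back to modelling when it degrades. All numbers here are invented.

```python
# A sketch of the FMDS feedback phase: monitor live performance against the
# evaluation-time baseline and trigger retraining when it slips.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
model = LogisticRegression().fit(X[:200], y[:200])

# Recorded during the evaluation phase, before deployment.
baseline = accuracy_score(y[200:], model.predict(X[200:]))

def feedback(model, X_recent, y_recent, tolerance=0.05):
    """Feed deployment-environment outcomes back into the lifecycle."""
    live = accuracy_score(y_recent, model.predict(X_recent))
    # Loop back to data preparation/modelling if performance has slipped.
    return "retrain" if live < baseline - tolerance else "ok"

print(feedback(model, X[250:], y[250:]))
```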

2.3.1 Benefits

  • This process is well suited to large data volumes.
  • This process combines many of the features of the KDD and CRISP-DM processes while providing a number of new practices.

2.3.2 Problems

  • It is not suitable for small data volumes.
  • The process is lengthy in terms of the number of phases.

2.4 Team Data Science Process

The Team Data Science Process (TDSP) is a technique used in data science to conduct predictive analytics. TDSP solutions make use of artificial intelligence and machine learning. There are five stages in TDSP: business understanding, data acquisition and understanding, modeling, deployment, and customer acceptance.

Business Understanding: as in the processes above (Section 2.2), the project’s goal is understood, the objectives are set, the project plan is produced, and the business success criteria are determined. Data visualization is also used to find which cleaning procedures will be required.

Data Acquisition and Understanding: in this stage, data collection begins by applying analytic operations and transferring the data to a target destination. The raw data is then cleaned, and all incomplete or incorrect records are accounted for.

Modelling: as in CRISP-DM (Section 2.2), the modelling techniques and strategies are decided upon, and the parameters and prerequisites are identified and considered.

Deployment: a predictive model and data pipeline are produced in this step. It could be a real-time or a batch analysis model, depending on the required application; a sketch contrasting the two styles follows the phase descriptions below. The final data product should be accredited by the customer.

Customer Acceptance: the final phase is customer acceptance, which is performed by confirming the data pipeline, the predictive model, and the product deployment [1].
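To illustrate the deployment distinction mentioned above, the sketch below (with synthetic data) shows how the same fitted model could serve a real-time path, scoring single events as they arrive, and a batch path, scoring accumulated events on a schedule.

```python
# A sketch of the two deployment styles: real-time vs. batch scoring,
# both reusing the same fitted model. Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=6, random_state=2)
model = LogisticRegression().fit(X, y)

def score_event(event):
    """Real-time path: classify a single incoming event immediately."""
    return int(model.predict([event])[0])

def score_batch(events):
    """Batch path: classify an accumulated set of events on a schedule."""
    return model.predict(events)

print(score_event(X[0]))     # real-time: one event as it arrives
print(score_batch(X[:10]))   # batch: e.g. a nightly analysis run
```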

2.4.1 Benefits

  • The TDSP is based on best practices and successful structure from Microsoft.
  • The TDSP works very well with both exploratory and ad-hoc projects
  • The TDSP is compatible with projects which have already employed CRISP-DM and KDD.
  • The TDSP is customizable based on the project’s size.

2.4.2 Problems

  • There are five stages in TDSP but their sequence order is loosely kept.

Section 3: Analysis

All of the processes described in this paper share four common iterative phases: problem definition/formulation, data gathering, data modeling, and data production. When comparing the KDD process with the CRISP-DM process, it is to be noted that KDD does not include the business understanding and deployment phases. These are two significant phases, crucial for understanding the objectives of the required solutions and for modeling the code into a system or application in order to build a data product. When comparing CRISP-DM with FMDS, CRISP-DM has neither an analytic approach phase nor a feedback phase. The analytic approach phase is where the statistical or machine learning techniques are identified before the data is gathered. FMDS and TDSP are similar in nature, but FMDS contains more detailed steps. The feedback phase of FMDS is where the requirements for improving the data product are created. The detailed steps of FMDS could be more versatile, and thus suited to a wider range of projects, whereas TDSP makes use of very specific Microsoft tools and features to produce machine learning or AI models. “It is significant to consider quality during model simplification by ensuring that decision elements such as missing data reduction, synthetic features generation and unseen data holdout are properly managed. The evaluation, deployment and feedback cycle in the FMDS could bring this need better than simple quality insurance in the evaluation phase of CRISP-DM. Data science lifecycle is very well defined in the FMDS and connections are clearly determined between every stage but TDSP’s stages are all linked together (except customer acceptance) and it is possible to move into any stage from anyone else” [1].

Section 4: Summary and Directions

With the immense growth of data being generated each day, there is also an increased security risk. As with all cybersecurity data science projects, there are four main phases. The first phase is to formulate the security problem definition, since without knowing the problem a solution cannot be formulated. In the second, the required information is compiled based on the problem defined in phase one. “The collected information should be employed in the third step and in an analysis process to provide adequate data which is expected to predict or provide a resolution for the defined problem. The final step is a production step which deploys relevant modules and a system to run the whole process automatically and regularly when it is needed” [1]. There is no clear-cut answer as to which process is best at all times, because the appropriateness of each process depends on the scale of the problem at hand as well as the overall nature of the project.

Section 5: References

[1] Foroughi, F., & Luksch, P. (2018). Data Science Methodology for Cybersecurity Projects. arXiv preprint arXiv:1803.04219.

[2] Rawat, D. B., Doku, R., & Garuba, M. (2019). Cybersecurity in Big Data Era: From Securing Big Data to Data-Driven Security. IEEE Transactions on Services Computing.

[3] Sharma, A. (2019, September 11). Applying Data Science to Cybersecurity Network Attacks & Events. Towards Data Science. Retrieved August 29, 2020, from https://towardsdatascience.com/applying-data-science-to-cybersecurity-network-attacks-events-219fb6312f54

[4] Smart Vision Europe. (2020, June 17). CRISP-DM Methodology. www.sv-europe.com/crisp-dm-methodology/
