“Data science is the transformation of data using mathematics and statistics into valuable insights, decisions, and products”
As data science evolves and gains new “instruments” over time, the core business goal remains focused on finding useful patterns and yielding valuable insights from data. Today, data science is employed across a broad range of industries and helps solve various analytical problems. For example, in marketing, exploring customer age, gender, location, and behavior makes it possible to run highly targeted campaigns and to estimate how likely customers are to make a purchase or leave. In banking, finding outlying client actions aids in detecting fraud. In healthcare, analyzing patients’ medical records can reveal the probability of developing certain diseases.
The data science landscape encompasses multiple interconnected fields that leverage different techniques and tools.
Data mining and the very popular machine learning are not the same thing. Although machine learning is also about creating algorithms to extract valuable insights, it is heavily focused on continuous use in dynamically changing environments and emphasizes adjusting, retraining, and updating algorithms based on previous experience. The goal of machine learning is to constantly adapt to new data and discover new patterns or rules in it. Sometimes this can be achieved without human guidance or explicit reprogramming.
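To make the idea of adapting to new data without explicit reprogramming more concrete, here is a minimal sketch of a model that updates itself one observation at a time. The class name OnlineLinearModel, the learning rate, and the two toy relationships (y = 2x, then y = 1 - x) are assumptions made for illustration only; real systems typically rely on established libraries and far richer data.

```python
class OnlineLinearModel:
    """A one-feature linear model updated by stochastic gradient descent."""

    def __init__(self, learning_rate=0.05):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.weight * x + self.bias

    def update(self, x, y):
        # Nudge the parameters to reduce the squared error on this one example.
        error = self.predict(x) - y
        self.weight -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


model = OnlineLinearModel()

# Phase 1: the (hypothetical) environment follows y = 2x.
for _ in range(500):
    for x in (0.0, 0.5, 1.0):
        model.update(x, 2 * x)
prediction_before_drift = model.predict(1.0)

# Phase 2: the relationship drifts to y = 1 - x. The same update rule keeps
# running, and nobody rewrites the model by hand.
for _ in range(500):
    for x in (0.0, 0.5, 1.0):
        model.update(x, 1 - x)
prediction_after_drift = model.predict(1.0)

print("prediction for x=1 before drift:", round(prediction_before_drift, 3))
print("prediction for x=1 after drift:", round(prediction_after_drift, 3))
```

The point of the sketch is only that the same update step keeps the model aligned with whatever the latest data says, which is the "adjustments, retraining, and updating" emphasis described above.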
Machine learning is the most dynamically developing field of data science today due to a number of recent theoretical and technological breakthroughs. These have driven advances in natural language processing, image recognition, and even the generation of new images, music, and text by machines. Machine learning remains the main “instrument” for building artificial intelligence.
Machine Learning Workflow
Generally, the workflow follows these simple steps:
Collect data. Use your digital infrastructure and other sources to gather as many useful records as possible and unite them into a dataset.
Prepare data. Prepare your data to be processed in the best possible way. Data preprocessing and cleaning procedures can be quite sophisticated, but usually, they aim at filling the missing values and correcting other flaws in data, like different representations of the same values in a column (e.g. December 14, 2016 and 12.14.2016 won’t be treated the same by the algorithm).
Split data. Set aside separate subsets of the data: one to train a model, and others to evaluate how it performs against data it has not seen.
Train a model. Use a subset of historic data to let the algorithm recognize the patterns in it.
Test and validate a model. Evaluate the performance of a model using testing and validation subsets of historic data and understand how accurate the prediction is.
Deploy a model. Embed the tested model into your decision-making framework as a part of an analytics solution or let users leverage its capabilities (e.g. better target your product recommendations).
Iterate. Collect new data after using the model to incrementally improve it.
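The prepare, split, train, and test steps above can be sketched in a few lines of standard-library Python. Everything specific here is an assumption for illustration: the RAW_RECORDS toy dataset, the helper names (parse_date, prepare, split, train, evaluate), the choice to fill missing values with the column mean, and the use of a simple least-squares line as the "model". It is a sketch of the shape of the workflow, not a recommended implementation.

```python
import random
from datetime import datetime
from statistics import mean

# Hypothetical raw records: (date, feature, target). None marks a missing feature.
RAW_RECORDS = [
    ("December 14, 2016", 1.0, 3.0),
    ("12.14.2016", 2.0, 5.0),
    ("12.15.2016", None, 7.2),
    ("December 16, 2016", 4.0, 9.0),
    ("12.17.2016", 5.0, 11.0),
    ("12.18.2016", 6.0, 13.0),
    ("December 19, 2016", 7.0, 15.0),
    ("12.20.2016", 8.0, 17.0),
]


def parse_date(text):
    """Map both date spellings from the 'Prepare data' step to one date object."""
    for fmt in ("%B %d, %Y", "%m.%d.%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def prepare(records):
    """Unify date formats and fill missing feature values with the column mean."""
    fill = mean(x for _, x, _ in records if x is not None)
    return [(parse_date(d), fill if x is None else x, y) for d, x, y in records]


def split(rows, test_fraction=0.25, seed=0):
    """Shuffle reproducibly, then hold out a fraction of rows for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train(rows):
    """Fit y = slope * x + intercept by ordinary least squares."""
    xs = [x for _, x, _ in rows]
    ys = [y for _, _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def evaluate(model, rows):
    """Mean absolute error of the model on the given rows."""
    slope, intercept = model
    return mean(abs(slope * x + intercept - y) for _, x, y in rows)


rows = prepare(RAW_RECORDS)
train_rows, test_rows = split(rows)
model = train(train_rows)
print("slope, intercept:", model)
print("mean absolute error on held-out rows:", evaluate(model, test_rows))
```

Collecting data, deploying the model, and iterating depend on the surrounding infrastructure, so they are left out of the sketch. In practice the iterate step amounts to rerunning the same functions on a dataset extended with newly collected records.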