The world’s most valuable resource is no longer oil but data, with 2.5 quintillion bytes of data generated each day. Because storage has become cheaper thanks to the cloud, this massive volume of data is now kept for future use rather than deleted. The airline industry is no exception, especially when it comes to historical booking data and passengers’ personal information, which can help the industry widen its margins through machine learning methods.
Steps on how to choose quality over quantity:
With such big data available to explore, what are the main steps to ensure that we do not go off track when diving deep into the sea of data, before jumping straight into data modelling? Firstly, we need to outline the expected product outcome from a data science perspective. In the context of the airline industry, outcomes range from marketing to facial recognition. At Travlytix, we are aiming to produce a Marketing Automation product as a start, for example, segmenting customers to help the marketing team produce more targeted campaigns in the future.
Secondly, we need to identify which tables in the database provide information that is aligned with the expected outcome. This is crucial because there will be a large number of tables available, and perhaps only a few are really useful for our initial model(s). As a start, we can choose the most basic table: the one that provides the demographic information of the passengers. We can also discuss further with the marketing team to get their validation and understand their campaign purposes and targets. With that, we are ready to pick several data points from the chosen table.
The next step is to know your data size. This may not be an obvious step, but it is an important one. Before we start processing the data, we need a rough estimate of how big the data can grow in the future, because this will determine the technology we should use to process it. We do not want to apply a technology that is incompatible with the data size. For example, if we are working with large volumes of real-life data, we can leverage Hadoop, as it will help speed up our data processing. Other common platforms include Python, Spark, etc.
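As a rough illustration of this sizing step, the sketch below projects data volume forward under a fixed monthly growth rate and compares it with a single-machine threshold. The function names, the 5% growth rate, and the 100 GB threshold are all hypothetical values chosen for illustration, not figures from Travlytix; a real estimate would use observed growth and your own infrastructure limits.

```python
def project_size_gb(current_gb, monthly_growth_rate, months):
    """Compound the current data size forward by a fixed monthly growth rate."""
    return current_gb * (1 + monthly_growth_rate) ** months


def suggest_tooling(size_gb, single_machine_limit_gb=100.0):
    """Return a coarse tooling suggestion for a projected size.

    The 100 GB default limit is an assumed threshold, not a rule.
    """
    if size_gb <= single_machine_limit_gb:
        return "single-machine tools"
    return "distributed processing"


# Assumed example: 50 GB today, growing 5% per month, projected 3 years ahead.
projected = project_size_gb(current_gb=50.0, monthly_growth_rate=0.05, months=36)
print(f"Projected size: {projected:.1f} GB -> {suggest_tooling(projected)}")
```

Even a crude projection like this can show whether a dataset that fits comfortably on one machine today is likely to outgrow it within the planning horizon.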
Usually, the exact data points that we want are not readily available in the dataset tables, so we need to derive them. This is the fourth step: massaging the raw data to derive new, useful fields. This step can also include data cleansing, such as changing a particular data point to a desired format. In the process of deriving the new data, we will probably need data from other tables, so it is important to identify the unique ID for each table so that tables can be linked easily. For example, suppose we want to know the type of traveler: solo, business, family, etc. By finding the number of passengers per booking, we can quickly categorise each booking into its type.
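The passengers-per-booking derivation can be sketched in a few lines of plain Python. The rows, the field names (`passenger_id`, `booking_id`), and the category labels below are hypothetical. Passenger count alone cannot separate a business traveler from a solo leisure traveler, so this sketch only labels bookings by party size; a finer split would need extra fields from other tables.

```python
from collections import Counter

# Hypothetical rows from a passengers table; booking_id is the assumed
# unique ID that links passengers back to the bookings table.
passengers = [
    {"passenger_id": "P1", "booking_id": "B1"},
    {"passenger_id": "P2", "booking_id": "B2"},
    {"passenger_id": "P3", "booking_id": "B2"},
    {"passenger_id": "P4", "booking_id": "B3"},
    {"passenger_id": "P5", "booking_id": "B3"},
    {"passenger_id": "P6", "booking_id": "B3"},
]


def traveler_type(passenger_count):
    """Map a party size to an illustrative label (labels are assumptions)."""
    if passenger_count == 1:
        return "solo"
    if passenger_count == 2:
        return "pair"
    return "group"


# Derived field: count passengers per booking, then label each booking.
passengers_per_booking = Counter(row["booking_id"] for row in passengers)
booking_types = {
    booking_id: traveler_type(count)
    for booking_id, count in passengers_per_booking.items()
}
print(booking_types)
```

The same grouping step carries over to SQL or Spark once the linking key between the tables is known.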
In conclusion, before we can move to data modelling, we need to know the purpose of the outcome for each data science project. Applying these steps to our huge datasets will save us time in understanding what is available and what is needed. Eventually, we can move on to the data modelling stage smoothly.
Atiqah is a Data Scientist at GoQuo. She has extensive experience in analyzing data from various industries, including cyber-security, oil & gas and investments. Prior to joining GoQuo, she worked with Dell and one of the largest pension funds in Asia, the Employees’ Provident Fund (EPF). She holds a master’s degree in Financial Mathematics from Cass Business School and recently earned a Data Science Associate Certificate from EMC.