DS6: How should I acquire and manage data?
Short version
Connect to your organisation's/company's data cataloguing and integration initiatives.
Promote continuous data integration and data FAIR-ification principles.
Make sure that the data scientist's perspective is taken into account during metadata standardisation processes. At each step of data acquisition, evaluate whether the data are fit for purpose.
Use best practices in exploratory data analysis, data preprocessing (data cleaning, normalising, scaling), and feature engineering.
Promote continuous data preprocessing/feature engineering practices and data versioning.
Use the “feature store” concept, which allows re-use of already pre-processed/cleansed data, promotes collaboration, and removes silos (TO-DO: ADD DETAIL BELOW)
Use the “data passport” concept as an extension to data provenance (TO-DO: ADD DETAIL BELOW)
Long version
Introduction
Data science is an umbrella term that encompasses data management, data analytics, data mining, machine learning, MLOps (machine learning operations) and several other related disciplines.
The main business goal of data science is to increase a company's ability to compete by leveraging data and creating added value from it. Continuous integration and deployment practices are the most effective way to achieve this goal.
There are three significant steps in the added value creation process:
Data acquisition and management – the process of data collection and unification;
Data analysis, predictive analysis, insights generation – the process of exploratory data analysis and pre-processing followed by predictive analysis;
Model and data operations – the process of deploying prediction models in production and making them available to end-users.
At present, a very rough estimate of the time and effort data scientists spend on each step is approximately 80% on data acquisition and management, 15% on actual data/predictive analysis and insights generation, and, as the need arises, 5% on data/model production activities.
This estimate assumes that each step is implemented independently and that data scientists are not included in the decision-making processes as stakeholders.
By following our recommendations, it is possible to improve efficiency and reach a time and effort distribution of roughly 30%, 59%, and 11% respectively, so that most of the resources are spent on data/predictive analytics and insights generation.
The fundamental strategy here is "continuous data integration and management, continuous data/model operations integration and deployment" – in other words, continuous processes throughout all three steps of added value creation.
Data acquisition and management
We can define the technical goal of data acquisition and management as ensuring that data is continuously integrated and standardised, and that data is FAIR – findable, accessible, interoperable, and reusable. Continuity of the process is an important point here.
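As a minimal, illustrative sketch of FAIR-ification in practice (the field names below are assumptions, not a prescribed metadata standard), every dataset can carry a small, machine-readable metadata record covering the properties that make it findable, accessible, interoperable, and reusable:

```python
# Minimal sketch of a FAIR-oriented metadata record.
# Field names and values are illustrative assumptions, not a standard.
FAIR_REQUIRED_FIELDS = [
    "identifier",   # Findable: a persistent, unique identifier (e.g. a DOI)
    "access_url",   # Accessible: where and how the data can be retrieved
    "format",       # Interoperable: an open, well-defined data format
    "vocabulary",   # Interoperable: controlled vocabulary / ontology used
    "licence",      # Reusable: explicit terms of reuse
    "provenance",   # Reusable: how and from what the data was derived
]

dataset_metadata = {
    "identifier": "doi:10.1234/example-dataset",
    "access_url": "https://data.example.org/datasets/42",
    "format": "text/csv",
    "vocabulary": "EFO",
    "licence": "CC-BY-4.0",
    "provenance": "Exported from the LIMS on 2023-01-15, pipeline v2.1",
}

def missing_fair_fields(metadata: dict) -> list:
    """Return the FAIR-critical fields that are absent or empty."""
    return [field for field in FAIR_REQUIRED_FIELDS if not metadata.get(field)]

print(missing_fair_fields(dataset_metadata))  # [] -> nothing missing
```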
A company can achieve the technical goal defined above by applying:
best data modelling practices for data storage (see Figure 1: Data Modelling Process, from https://www.talend.com/resources/data-model-design-best-practices-part-1),
data curation (https://www.aidataanalytics.network/data-science-ai/articles/5-things-to-know-about-data-curation),
industry standards for metadata collection (https://guides.lib.unc.edu/metadata/best-practices ),
controlled vocabularies synchronised with ontologies (https://www.scibite.com/news/how-ontologies-are-unlocking-the-full-potential-of-biomedical-data/ ),
established ETL processes, where ETL is a process that extracts, transforms, and loads data from multiple sources into a data warehouse or other unified data repository (https://www.stitchdata.com/etldatabase/etl-process/) – see the sketch after this list,
semantic enrichment (https://doi.org/10.1145/3462462.3468881).
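To make the ETL idea concrete, here is a minimal sketch of an extract-transform-load step using pandas and SQLite; the source tables, column names, and exchange rate are invented for illustration and do not reflect any particular warehouse setup.

```python
# Minimal ETL sketch using pandas and SQLite (illustrative data only).
import sqlite3
import pandas as pd

def extract() -> pd.DataFrame:
    """Extract: pull raw records from several sources (simulated here)."""
    orders_eu = pd.DataFrame({
        "order_id": [1, 2, 2],
        "order_date": ["2023-01-05", "2023-01-06", "2023-01-06"],
        "amount": [100.0, 250.0, 250.0],
        "currency": ["EUR", "EUR", "EUR"],
    })
    orders_us = pd.DataFrame({
        "order_id": [3, 4],
        "order_date": ["2023-01-07", "not a date"],
        "amount": [80.0, 40.0],
        "currency": ["USD", "USD"],
    })
    return pd.concat([orders_eu, orders_us], ignore_index=True)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Transform: de-duplicate, standardise types, harmonise units."""
    df = raw.drop_duplicates(subset="order_id").copy()
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["amount_eur"] = df["amount"].where(
        df["currency"] == "EUR", df["amount"] * 0.9  # placeholder exchange rate
    )
    return df.dropna(subset=["order_id", "order_date"])

def load(df: pd.DataFrame, db_path: str = "warehouse.db") -> None:
    """Load: write the unified table into the 'warehouse' (SQLite here)."""
    with sqlite3.connect(db_path) as conn:
        df.to_sql("orders", conn, if_exists="replace", index=False)

load(transform(extract()))
```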
Other methodologies to consider include data quality assessment and master data management (MDM), in which business and information technology work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the data.
Figure 2: Data Integration
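As a rough illustration of data quality assessment (the indicators and column names below are illustrative choices, not an MDM standard), a few simple checks on completeness, key uniqueness, and duplication already go a long way:

```python
# Illustrative data quality indicators; column names are assumptions.
import pandas as pd

def quality_report(df: pd.DataFrame, key_column: str) -> dict:
    """Compute a few common data quality indicators for a table."""
    return {
        "rows": len(df),
        "completeness": float(df.notna().mean().mean()),  # share of non-missing cells
        "key_is_unique": bool(df[key_column].is_unique),  # primary key not duplicated
        "duplicate_rows": int(df.duplicated().sum()),
    }

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "country": ["NL", "DE", None, "FR"],
})
print(quality_report(customers, key_column="customer_id"))
# {'rows': 4, 'completeness': 0.875, 'key_is_unique': False, 'duplicate_rows': 0}
```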
Let us assume that the company has an established continuous data integration process. As a result, the data science team does not spend time and effort on manual data integration and can focus efficiently on data/predictive analytics and insights generation. The first task to perform there is data pre-processing, which starts after the data requirement gathering and data selection/generation processes.
Here is the typical data pre-processing workflow (a minimal sketch follows this list):
exploratory data analysis (EDA), when we look for outliers, data distributions, missing values, and balanced or unbalanced categories;
data cleaning, when we deal with the problems identified during EDA;
data transformation – any modification of raw data needed to achieve the required quality, e.g., data normalisation and data scaling.
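The following sketch walks through these three steps on a tiny invented dataset (the column names, values, and thresholds are assumptions made for illustration), using pandas for EDA and cleaning and scikit-learn for scaling:

```python
# Minimal pre-processing sketch: EDA, cleaning, transformation.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age": [25, 32, None, 47, 51, 29, 300],   # 300 is an obvious outlier
    "income": [30_000, 42_000, 38_000, None, 85_000, 36_000, 40_000],
    "segment": ["A", "B", "A", "A", "C", "B", "A"],
})

# 1. Exploratory data analysis: distributions, missing values, class balance.
print(df.describe())
print(df.isna().sum())
print(df["segment"].value_counts(normalize=True))

# 2. Data cleaning: handle the problems found during EDA.
df = df[df["age"].between(0, 120)].copy()                  # drop impossible ages
df["income"] = df["income"].fillna(df["income"].median())  # impute missing income

# 3. Data transformation: scale numeric columns to comparable ranges.
scaler = StandardScaler()
df[["age", "income"]] = scaler.fit_transform(df[["age", "income"]])
print(df.head())
```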
Nowadays, some pre-processing techniques are grouped under the term "feature engineering", particularly when raw data is modified to produce new values (data transformation, data aggregation, etc.).
Here is a list of feature engineering techniques and best practices (a minimal sketch of some of them follows this list):
Embedding techniques – mathematical representations of information, e.g., textual information (https://www.kdnuggets.com/2021/11/guide-word-embedding-techniques-nlp.html) or graph information (https://doi.org/10.1137/20M1386062; https://medium.com/@st3llasia/graph-embedding-techniques-7d5386c88c5);
Encoding methods – conversion of categorical variables to numbers; in other words, embedding techniques for categorical data (https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding);
Grouping operations or aggregation techniques;
Feature split techniques.
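The sketch below illustrates three of these techniques – encoding, aggregation, and feature splitting – on an invented transactions table (the column names and values are assumptions); embedding techniques are omitted because they usually rely on dedicated NLP or graph libraries.

```python
# Feature engineering sketch: encoding, aggregation, feature splitting.
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 25.0, 5.0, 7.5, 60.0],
    "payment_method": ["card", "cash", "card", "card", "transfer"],
    "full_name": ["Ada Lovelace"] * 2 + ["Grace Hopper"] * 3,
})

# Encoding: turn a categorical variable into numeric indicator columns.
encoded = pd.get_dummies(transactions, columns=["payment_method"])

# Grouping / aggregation: summarise raw rows into per-customer features.
per_customer = transactions.groupby("customer_id")["amount"].agg(
    total_spent="sum", mean_amount="mean", n_transactions="count"
)

# Feature split: derive simpler features from a composite column.
transactions[["first_name", "last_name"]] = (
    transactions["full_name"].str.split(" ", n=1, expand=True)
)

print(encoded.head())
print(per_customer)
print(transactions[["first_name", "last_name"]].drop_duplicates())
```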