Data Preprocessing Explained
A machine learning algorithm doesn’t differentiate between preprocessed data and raw data. When you feed a dataset to a machine learning algorithm, it will produce an output even if the output is wrong. With well-preprocessed input data, the algorithm is far more likely to produce correct outputs. With rubbish data, the algorithm produces rubbish output. This principle is called garbage in, garbage out (GIGO). It means that the quality of the output depends on the quality of the input. In this article, we will discuss different data preprocessing techniques that help us create good quality input data for machine learning applications.
What is Data Preprocessing?
Data is the new oil. It is fueling the growth of companies as big as Google and as small as a startup in its ideation stage. However, just as you cannot use crude oil as fuel in your car, you cannot use the raw data available in system logs to build machine learning applications and data analytics tools. You need to transform the data into a suitable format that has the right set of attributes with minimal missing values and outliers.
Data preprocessing is the process through which we get the data into the desired format. It involves several steps like data cleaning, data transformation, and data reduction. We will discuss all these steps in the following sections.
Importance of Data Preprocessing
The raw data that is generated from various sources exists in various formats. It may be in numeric form, text, image, video, or even a combination of these formats. The data may also contain outliers, wrong values, as well as missing values. If the data is in text format, it might contain unwanted characters, URLs, and hashtags that are unnecessary. Similarly, an image may contain unwanted pixels and might be blurred.
Feeding a dataset with these impurities to any machine learning algorithm will not yield good results. If you feed a clean dataset with well-defined attributes to a machine learning application, you are much more likely to get accurate results. Therefore, data preprocessing is an important part of creating machine learning applications and business analytics tools.
Data Preprocessing Techniques
Data preprocessing involves many steps in which we use different data preprocessing techniques to make the data suitable for use. Let us discuss each of these steps one by one.
Data Quality Assessment
Data quality assessment is the first step in the process of data preprocessing. It involves different data preprocessing techniques such as identifying mismatched data types, identifying mixed data values for a single attribute, identifying missing values and identifying the outliers.
Identifying Mismatched Data Types
Different data sources can collect data in different ways. For example, some sources of data can store numbers in integer format. Other sources might use strings to store the numbers in the dataset. In this case, we need to check whether all the values of each attribute in the dataset share the same data type. Otherwise, the machine learning algorithms may not be able to process the data correctly.
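A minimal sketch of such a check in plain Python, assuming the entries are held as a list of dictionaries. The `records` data and the `find_mixed_type_attributes` helper are hypothetical names made up for this illustration, not part of any library.

```python
from collections import defaultdict


def find_mixed_type_attributes(records):
    """Return {attribute: set of type names} for attributes stored with more than one type."""
    types_seen = defaultdict(set)
    for record in records:
        for attribute, value in record.items():
            types_seen[attribute].add(type(value).__name__)
    # Keep only the attributes whose values do not all share one type.
    return {attr: names for attr, names in types_seen.items() if len(names) > 1}


# Hypothetical entries merged from two sources: one stores age as int, the other as str.
records = [
    {"name": "Asha", "age": 34},
    {"name": "Ben", "age": "41"},
    {"name": "Chen", "age": 29},
]

mixed = find_mixed_type_attributes(records)
print(mixed)  # reports that "age" is stored both as int and as str
```

Once a mismatch is reported, you can decide on one target type for the attribute and convert the offending values.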
Identifying Mixed Data Values for a Single Attribute
It is also possible that different data sources use different standards to store the data. For instance, one data source can store the height of a person in centimeters, while another can store the height in inches. During data quality assessment, you need to identify any such mixed data values for a single attribute.
Identifying Missing Values
The entries in the dataset do not always have values for all the attributes. You also need to analyze the entries with missing values in the data quality assessment.
Identifying The Outliers
Outliers in a dataset are entries that differ markedly from the other entries in the dataset. While analyzing data, outliers can strongly affect the statistics as well as the machine learning models. Because of this, we often need to remove the outliers from the dataset. So, you also need to identify whether there are outliers in the dataset while doing a data quality assessment.
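One common way to flag outliers in a numeric attribute is the interquartile range (IQR) rule. The article does not prescribe a specific method, so treat the following as a sketch under that assumption. The `find_outliers_iqr` helper, the multiplier `k=1.5`, and the sample heights are illustrative choices.

```python
import statistics


def find_outliers_iqr(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical heights in centimeters; 250 is a suspicious entry.
heights_cm = [160, 162, 165, 167, 168, 170, 171, 173, 175, 250]
outliers = find_outliers_iqr(heights_cm)
print(outliers)  # [250]
```

The rule only flags candidates. Whether a flagged value is an error or a genuine extreme case still needs a judgment based on the domain.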
In data preprocessing, after data quality assessment and identifying the impurities in the dataset, we perform data cleaning. Let us discuss the data preprocessing techniques involved in data cleaning.
Handling Missing Data
In a dataset, entries can have missing data because of human error, negligence, or problems during data collection. You can handle missing data using the following data preprocessing techniques.
Ignore The Data Entries
If a significant number of attributes in an entry in the dataset are missing, you may choose to discard the entries with missing values. You can use this approach when the number of entries removed from the dataset is very small compared to the entire dataset.
If the number of entries that have missing values is significant, discarding them would throw away too much data. In this case, you need to fill in the values for the missing attributes manually. For numeric values, you can use the mean, median, or mode of the attribute to fill the missing values. For categorical attributes, you can devise a strategy based on the frequency of the values, such as using the most frequent value. Manual entry is an appropriate data preprocessing technique for handling missing values in small datasets.
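The fill strategies described above can be sketched with Python's `statistics` module. This is a minimal illustration that assumes missing values are represented as `None`. The `fill_missing` helper and the sample columns are hypothetical.

```python
import statistics


def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot fill a column that has no observed values")
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)  # most frequent value; works for categories too
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


ages = [25, None, 35, 30, None]            # numeric attribute
cities = ["Pune", "Delhi", None, "Delhi"]  # categorical attribute

filled_ages = fill_missing(ages, "mean")
filled_cities = fill_missing(cities, "mode")
print(filled_ages)    # [25, 30, 35, 30, 30]
print(filled_cities)  # ['Pune', 'Delhi', 'Delhi', 'Delhi']
```

The median is usually the safer choice for numeric attributes that contain outliers, because the mean is pulled toward extreme values.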
After handling missing values, we need to handle the noise in the data.
Cleaning Noisy Data
Noise in the data consists of values that negatively affect data processing. Examples of noise are outliers, unstructured text, and absurd values. Noise in a dataset unnecessarily increases the required storage. It also increases the computational power required for data processing while adversely affecting the results.
As the noise in a dataset can force the machine learning algorithms to produce inaccurate results, we need to clean the noise in the data. For this, we use the following data preprocessing techniques.
Binning, as the name suggests, is used for segregating data into smaller groups (bins). We normally group the data entries based on their similarity. In the process of binning, we first sort the entire dataset based on one or more attributes. After that, the data is divided into groups of similar size. Once the bins are created, we can use different data preprocessing techniques to remove noise from the data.
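As a sketch of the sort-then-group procedure, the example below splits a numeric attribute into bins of near-equal size and then smooths each value to the mean of its bin. Smoothing by bin means is one common follow-up technique and is an assumption here, since the article does not name a specific one. The `smooth_by_bin_means` helper and the sample prices are illustrative, and the result is returned in sorted order.

```python
def smooth_by_bin_means(values, n_bins):
    """Sort the values, split them into n_bins groups of near-equal size,
    and replace each value with the mean of its bin."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    smoothed, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bin_values = ordered[start:end]
        if bin_values:
            bin_mean = sum(bin_values) / len(bin_values)
            smoothed.extend([bin_mean] * len(bin_values))
        start = end
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, n_bins=3)
print(smoothed_prices)
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```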
Clustering is an unsupervised process. It groups the data into different clusters based on their attributes. After clustering, each entry in a group is somewhat similar to the other entries. On the other hand, entries from two different clusters might differ a lot.
Once the clusters are created, outliers are easily discarded from the dataset as they are not included in the clusters having the majority of data entries.
Regression is a data preprocessing technique that is used to identify the data attributes in the dataset that are important for analysis. In regression, we identify the attributes that have no correlation with the target variable. Then, we discard those attributes. This helps us remove redundant attributes from the analysis, which reduces the required computational power and time.
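One way to sketch this idea is to fit a simple least-squares line between each attribute and the target and keep only the attributes that explain some of the target's variation. The `r_squared` and `drop_unrelated` helpers, the `0.05` threshold, and the housing data below are assumptions chosen for illustration. A straight-line fit also only detects linear relationships.

```python
def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line ys ~ a + b * xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column cannot explain any variation
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - residual / syy


def drop_unrelated(columns, target, threshold=0.05):
    """Keep only the attributes whose fitted line explains at least `threshold` of the target."""
    return {name: values for name, values in columns.items()
            if r_squared(values, target) >= threshold}


# Hypothetical housing data: price tracks area, but not the door number.
columns = {
    "area_sqft": [500, 750, 1000, 1250, 1500],
    "door_number": [3, 1, 2, 1, 3],
}
price = [50, 76, 99, 127, 150]

kept = drop_unrelated(columns, price)
print(list(kept))  # ['area_sqft']
```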
Cleaning Text Data
Raw text data contains various components like URLs, short-hands, emojis, and hashtags. This makes raw text data unsuitable for analysis.
If you are going to work with text data, you need to use various data preprocessing techniques to make the text data suitable for analysis.
Removing Unwanted Characters
First, we need to remove the hashtags, punctuation marks, emojis, and other non-alphanumeric characters that might not be useful for analysis.
While removing unwanted characters, we also need to translate the text if it is written in another language. Similarly, we need to convert the entire text data into either uppercase or lowercase characters. This is done because statistical tools treat uppercase and lowercase characters as different.
After removing unwanted characters, translating the data into one language, and making the case of the letters uniform, we perform tokenization. In tokenization, each sentence in the text data is converted into a list of words. For example, “This is a sentence” is converted into [“this”, “is”, “a”, “sentence”].
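The character removal, case conversion, and tokenization steps can be sketched with Python's `re` module. This is a simplified illustration that assumes English text. The `clean_and_tokenize` helper is hypothetical, the example URL is a placeholder, and the character class used here also strips non-ASCII letters, which would be wrong for many other languages.

```python
import re


def clean_and_tokenize(text):
    """Strip URLs and non-alphanumeric characters, lowercase the text, and split it into words."""
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # remove '#', punctuation, emojis, etc.
    return text.lower().split()                  # uniform case, then split on whitespace


tokens = clean_and_tokenize("This is a sentence")
print(tokens)  # ['this', 'is', 'a', 'sentence']

noisy = clean_and_tokenize("Check https://example.com #NLP rocks!!!")
print(noisy)   # ['check', 'nlp', 'rocks']
```

Note that this sketch keeps the word inside a hashtag and removes only the `#` symbol. If hashtags should be dropped entirely, the pattern needs to match the whole tag.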
Removing Stop Words
Stop words are words that carry little semantic meaning on their own in a sentence. For example, words like “the” and “an” are stop words. During data preprocessing, the stop words are usually removed from the data.
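Stop word removal on a tokenized sentence can be sketched as a simple filter. The `STOP_WORDS` set below is a tiny hand-written list for illustration only. Real projects usually rely on a much longer list from an NLP library.

```python
# Illustrative stop word list; deliberately small and not exhaustive.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


filtered = remove_stop_words(["this", "is", "a", "sentence"])
print(filtered)  # ['this', 'sentence']
```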
After removing stop words from the text data, we perform lemmatization. In lemmatization, each word is converted into its root form. For example, “writing” is converted into “write”.
While cleaning text data, you do not need to use all of the data preprocessing techniques discussed above. The right set of techniques depends on the analysis you plan to perform. After data cleaning, we transform the data into a suitable format for analysis.
With data transformation, we convert the data into suitable formats using various data preprocessing techniques.
In aggregation, as the name suggests, we aggregate the data from various cleaned datasets.
Normalization is used to fit the attributes of the dataset into a finite range, generally -1 to 1 or 0 to 1. Normalization makes the data independent of the original range of the attributes. If the data is not normalized, attributes with large ranges can dominate attributes with small ranges, which can lead to inaccurate results.
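A minimal sketch of min-max normalization, one common way to rescale an attribute into a fixed range such as -1 to 1. The `min_max_scale` helper and the sample incomes are illustrative, and other schemes (for example z-score standardization) are equally valid choices.

```python
def min_max_scale(values, new_min=-1.0, new_max=1.0):
    """Linearly rescale the values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute has no spread; place it in the middle of the range.
        return [(new_min + new_max) / 2 for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) * span / (hi - lo) for v in values]


incomes = [20000, 35000, 50000, 80000]
scaled = min_max_scale(incomes)
print(scaled)  # [-1.0, -0.5, 0.0, 1.0]
```

Because the minimum and maximum define the scale, a single extreme outlier can squeeze all other values into a narrow band, so it helps to handle outliers before normalizing.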
After normalization, we select the features that are most important to our analysis. In feature selection, we also drop some of the features if they are highly correlated with one another.
Sometimes, we also divide the data into discrete intervals. This process is called discretization. It is similar to binning. However, it is performed after data cleaning to make analysis easier, and not for noise removal as in the case of binning.
The cost of data analysis grows with the amount of data being analyzed. Therefore, if we can derive the same results using a smaller amount of data, we prefer to do so. To reduce the amount of data, we use the following data preprocessing techniques.
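As an illustration, the sketch below discretizes a numeric attribute into equal-width intervals and returns the interval index for each value. Equal-width intervals are only one possible scheme, and the `discretize_equal_width` helper and sample ages are assumptions made for this example.

```python
def discretize_equal_width(values, n_bins):
    """Map each value to the index (0 .. n_bins - 1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # all values fall into a single interval
    width = (hi - lo) / n_bins
    # The maximum value would land in interval n_bins, so clamp it to the last one.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 22, 35, 47, 58, 63, 78]
age_groups = discretize_equal_width(ages, n_bins=3)
print(age_groups)  # [0, 0, 0, 1, 2, 2, 2]
```

Here the three intervals are 18 to 38, 38 to 58, and 58 to 78, with each lower bound included in its interval.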
Attribute selection is used to select the suitable attributes for analysis. During attribute selection, we perform correlation analysis. After the analysis, we often discard one or more attributes from each group of highly correlated attributes. This has little effect on the analysis, as we can derive results for one attribute using another highly correlated attribute.
In dimensionality reduction, data preprocessing techniques such as principal component analysis and wavelet transform are used to reduce the size of the data.
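The correlation analysis can be sketched by computing the Pearson correlation between pairs of attributes and keeping only one attribute from each highly correlated pair. The `pearson` and `drop_correlated` helpers, the `0.9` threshold, and the sample columns are assumptions for illustration. The greedy strategy keeps whichever attribute appears first.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship to measure
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Keep an attribute only if it is not highly correlated with one already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical attributes: the two height columns carry the same information.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [41, 23, 35, 29, 52],
}

selected = drop_correlated(columns)
print(list(selected))  # ['height_cm', 'age']
```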
In this article, we have discussed various data preprocessing techniques. Data preprocessing is an important step in building machine learning applications. Therefore, you need to analyze the raw data and process it in order to build machine learning applications that provide accurate results.
I hope you enjoyed reading this article. If you are looking to process raw data for analysis, you can read this article on machine learning tools.
Stay tuned for more informative articles.