In data science, we work with data to produce insights that help businesses solve problems. In real-world applications, much of this data is categorical. Attributes such as gender, day of the week, and names can only be expressed as text or category labels. In contrast, most machine learning algorithms and statistical methods work only with numerical data, so we need a way to convert categorical data into a numerical format before processing it. In this article, we will discuss the different types of categorical data with examples and how we can convert them into a numerical format.
What is Categorical Data?
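Categorical data contains data points that represent distinct categories or groups rather than quantities measured on a numeric scale. Some categorical data can be ordered, but the differences between categories cannot be measured numerically. Based on their nature, we divide categorical data into the following types.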
- Nominal data
- Ordinal data
- Binary or Dichotomous Data
In the following sections, we will discuss these categorical data types with examples. However, let us first discuss the features of categorical data.
Features of Categorical Data
Categorical values are usually stored as strings or labels rather than as measured quantities. Due to this, categorical data shows the following properties.
- Categorical data often represents qualitative attributes. Examples may include gender, education, customer satisfaction level, proficiency level, etc.
- We often need to convert categorical data into a numerical format using different encoding methods. However, we can still analyze categorical data directly for the probability of occurrence, frequency, etc.
- We can also visualize categorical data using bar charts and pie charts. We use a bar chart to compare the frequency of each value in categorical data. On the other hand, we can use a pie chart to show the share (percentage) of each categorical value in the data.
- We can also represent categorical data using numeric values. However, the numbers carry no quantitative meaning and work only as labels, so we cannot perform arithmetic operations on such data. In the case of ordinal data, the values can represent the rank or level of the data point.
- Categorical data must have discrete and finite values. Also, each data point should contain only one value for a single attribute. It will be very hard to analyze the data if it contains an infinite number of values or if a data point contains two or more categorical values for a single attribute.
- Categorical data does not have a consistent unit of measurement or a fixed scale. The differences between categories are qualitative rather than quantitative. For example, the difference between “male” and “female” in a gender variable is not measurable in a numeric sense.
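The frequency and percentage analysis mentioned in the list above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library; the `satisfaction` list and its values are made up for the example.

```python
from collections import Counter

# Hypothetical survey responses; the values are made up for illustration.
satisfaction = ["High", "Low", "Medium", "High", "High", "Low"]

# Frequency of each category (what a bar chart would display).
counts = Counter(satisfaction)

# Share of each category (what a pie chart would display).
total = sum(counts.values())
proportions = {label: count / total for label, count in counts.items()}

# The most frequent category (the mode) is well defined for categorical data,
# even though a mean or a sum of the labels is not.
mode_label, mode_count = counts.most_common(1)[0]

print(counts)
print(proportions)
print(mode_label, mode_count)
```

Note that the analysis counts labels and never adds or averages them, which is consistent with the point that arithmetic on categorical values is not meaningful.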
Different Types of Categorical Data
As discussed above, we can divide categorical data broadly into three categories. Let us discuss each of these one by one.
Nominal Data
Nominal data represents names or labels. We use nominal data for attributes such as brand names, colors, places, etc. The nominal values in a dataset have no particular order.
Ordinal Data
As the name suggests, ordinal data represents categorical data that has some inherent order. Examples of ordinal data include the level of education, product ratings, customer satisfaction, etc. For example, survey responses on a Likert scale (ranging from “strongly disagree” to “strongly agree”) are ordinal and are commonly coded as numbers such as 1 to 5.
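As a rough sketch of how ordinal labels can be turned into ranks, the snippet below maps a five-point Likert scale to the integers 1 to 5. The exact label wording and the names `LIKERT_LEVELS` and `likert_rank` are assumptions chosen for illustration.

```python
# Assumed five-point Likert scale, listed from lowest to highest.
LIKERT_LEVELS = [
    "Strongly disagree",
    "Disagree",
    "Neutral",
    "Agree",
    "Strongly agree",
]

# Map each label to its rank (1 = lowest, 5 = highest).
likert_rank = {label: rank for rank, label in enumerate(LIKERT_LEVELS, start=1)}

# Hypothetical survey responses.
responses = ["Agree", "Neutral", "Strongly agree", "Disagree"]

# Encode the responses as ranks.
encoded_responses = [likert_rank[r] for r in responses]

# Because the data is ordinal, sorting by rank is meaningful.
sorted_responses = sorted(responses, key=likert_rank.get)

print(encoded_responses)
print(sorted_responses)
```

The ranks preserve order only. The gap between 1 and 2 is not guaranteed to mean the same as the gap between 4 and 5, so comparisons and sorting are safe while averages should be interpreted with care.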
Binary or Dichotomous Data
Binary or dichotomous data can take only two mutually exclusive values. Examples of binary data include values represented using Pass/Fail, Yes/No, True/False, etc.
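Binary data is commonly converted to 0 and 1. The following is a minimal sketch for Pass/Fail values; the choice of which label maps to 1 is a convention assumed here, and the `results` list is made up.

```python
# Assumed convention: "Pass" is coded as 1 and "Fail" as 0.
PASS_FAIL_CODE = {"Fail": 0, "Pass": 1}

# Hypothetical exam results.
results = ["Pass", "Fail", "Pass", "Pass"]

# Encode each result as 0 or 1.
encoded_results = [PASS_FAIL_CODE[r] for r in results]

# With 0/1 coding, the mean of the codes equals the share of "Pass" values.
pass_rate = sum(encoded_results) / len(encoded_results)

print(encoded_results)
print(pass_rate)
```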
Examples of Categorical Data
Categorical data appears in many real-world settings, as shown in the following examples.
Brand Names
Brand names are represented using nominal data. For example, you can represent the brand names of mobile phones as shown in the following table.
|Sl. No.||Brand Name|
Level of Education
We can represent the level of education using ordinal data because the levels have an inherent order. The following table lists different levels of education from lowest to highest.
|Sl. No.||Level of Education|
In the above table, the level of education from Primary to Doctorate can be represented in an order. Hence, it is an example of ordinal categorical data.
In surveys, we often group numeric attributes such as age, weight, or marks into intervals. Because the resulting intervals have a natural order, data recorded this way can be classified as ordinal data. For example, the following table divides age into intervals.
|Sl. No.||Age Group|
|1||0 to 18 years|
|2||19 to 25 years|
|3||26 to 40 years|
|4||41 years and above|
As you can observe, the above table classifies age into four categories using different intervals. Hence, we can say that these categories are examples of ordinal categorical data.
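The grouping in the table above can be sketched in Python as follows. This is an illustrative helper, not a standard function: the name `age_group_code` is hypothetical, and the code assumes ages are given as whole, non-negative numbers of years.

```python
from bisect import bisect_right

# The four age groups from the table, in order.
AGE_GROUPS = [
    "0 to 18 years",
    "19 to 25 years",
    "26 to 40 years",
    "41 years and above",
]

# First age that belongs to groups 2, 3, and 4 respectively.
LOWER_BOUNDS = [19, 26, 41]


def age_group_code(age):
    """Return the 1-based group number for a whole-number age in years."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_right counts how many lower bounds the age has reached.
    return bisect_right(LOWER_BOUNDS, age) + 1


# Hypothetical ages.
ages = [5, 18, 19, 25, 26, 40, 41, 73]
codes = [age_group_code(a) for a in ages]
labels = [AGE_GROUPS[c - 1] for c in codes]

print(codes)
print(labels)
```

The group numbers keep the order of the intervals, so they can be compared and sorted like any other ordinal value.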
How to Convert Categorical Data to a Numeric Format for Analysis?
Most machine learning algorithms and many statistical methods cannot work directly with categorical data. Therefore, we need to convert the categorical data into a numeric format. For this, we use different encoding techniques such as label encoding, one-hot encoding, integer mapping, entity encoding, binary encoding, etc. All these methods are discussed in this article on encoding categorical data in Python.
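To give a feel for two of these techniques, here is a minimal hand-written sketch of label encoding and one-hot encoding using only the Python standard library. The brand names are placeholders, and assigning labels in alphabetical order is an assumption made for this example.

```python
# Hypothetical nominal data; the brand names are placeholders.
brands = ["Acme", "Globex", "Acme", "Initech"]

# Collect the distinct categories in a fixed (alphabetical) order.
categories = sorted(set(brands))

# Label encoding: each category gets an integer label.
label_of = {category: index for index, category in enumerate(categories)}
label_encoded = [label_of[b] for b in brands]

# One-hot encoding: one 0/1 column per category, with exactly one 1 per row.
one_hot_encoded = [
    [1 if b == category else 0 for category in categories]
    for b in brands
]

print(categories)
print(label_encoded)
print(one_hot_encoded)
```

For nominal data, the integer labels imply an order that does not exist in the data, which is one reason one-hot encoding is often preferred there. In practice, libraries such as pandas (`get_dummies`) and scikit-learn (`LabelEncoder`, `OneHotEncoder`) provide these encodings, so a hand-written version like this is mainly useful for understanding what they do.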
Conclusion
In this article, we discussed categorical data, its types, and examples. To learn more about data mining and machine learning concepts, you can read this article on how to implement the apriori algorithm in Python. You might also like this article on data cleaning.
I hope you enjoyed reading this article. Stay tuned for more informative articles.