Select Page

1.10 Original Variables and Derived Variables


Another way to sort our variables is to distinguish original variables (ones that are based on some direct count, or measurement) from derived variables (ones that are generated by combining information from directly-measured variables).  

A derived variable allows us to express a large amount of information in a single value – this can be effective for enabling direct comparisons, and for data visualization. Also known as feature engineering, a derived variable is  good for machine learning models because a simpler model is quicker to run, easier to explain, and a more efficient way of presenting data. After all, if you can communicate the status of someone’s performance, their health, or their potential, with a single number, this will be far easier to convey than would be the case if you needed four or five separate metrics to achieve this.  

Gross Domestic Product (GDP) is a good example of a well-known derived variable.  If we have macroeconomic data about a country, it might include Consumption, Government Spending, Investment, and Net Exports, each listed as separate columns.  By summing those individual components into GDP, we can compare the size of countries’ economies with a single statistic. Now, using a simple, two-dimensional scatterplot, we can show GDP vs. some other numeric variable, such as life expectancy, inflation, military spending, or average education level.  

Another well-known example of a derived variable is Body Mass Index, or BMI.  BMI is derived from two original variables (mass in kg, and height in meters), using the formula shown below.  BMI is far from perfect (its simple formula implies that bodybuilders and many professional athletes are obese), but its usage in medicine persists, as it may help to flag everyday individuals as being at risk from obesity-related health conditions.

Finally, derived statistics play an important role in sports analytics.  Fans of American football are likely to be familiar with the “Quarterback Rating” or “Passer Rating” statistic.  Derived from five original statistics (Completions, Attempts, Yards, Touchdowns, and Interceptions), it enables commentators, sportswriters, and fans to describe a player’s performance as a single number.  The Passer Rating formula is shown below:

The pandas library makes it fairly easy for us to generate a derived variable from an existing dataframe.  As demonstrated in the example below, we can simply name the new variable on the left-hand side of an assignment statement, and then define the operation to be performed on existing variables on the right-hand side.

After we perform this step, the new variable will move to the right-most column in the dataframe:

With this newly-derived variable in place, Lobster Land management can quickly determine which of the days during that season scored the highest, or the lowest, for this metric.  After setting the index to the ‘Date’ variable, we will first see the 10 dates with the highest ‘Staffing Efficiency’ scores, with the help of pandas’ nlargest() function:

Alternatively, we can use the nsmallest() function to view the 10 days with the lowest ratios of unique visitors to staff hours, as shown below:

In and of themselves, these numbers are hard to interpret without more context.  We do not know, for instance, about Lobster Land’s flexibility with daily staffing decisions, or about particular events that might have led to staffing shortages on particular days.   However, the values shown above could be the basis for further analysis, should Lobster Land management wish to deeply explore its options with regards to optimal staff allocation.