1.25 Filtering, Sorting, Renaming, and Removing


Next, we will take a look at four more fundamental data wrangling operations:  filtering, sorting, renaming, and column removal.  

When we first import Lobster Land’s 2021 data, we are working with all of the original observations.  What if, however, management wishes to see some detailed analysis involving only the days for which there were more than 4000 unique visitors?  We can perform a filtering operation in the manner shown below.
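The original code is not reproduced here, so the sketch below uses a toy stand-in for the Lobster Land data; the dataframe name lobster21 and the column names day_of_week and unique_visitors are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the Lobster Land 2021 data; the column names
# 'day_of_week' and 'unique_visitors' are assumed for illustration.
lobster21 = pd.DataFrame({
    "day_of_week": ["Friday", "Saturday", "Sunday", "Monday"],
    "unique_visitors": [4100, 5200, 3900, 2500],
})

# Keep only the days with more than 4000 unique visitors,
# saving the result under a new name.
lobster21_busyday = lobster21[lobster21["unique_visitors"] > 4000]
print(lobster21_busyday)
```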

Note that we saved this new dataframe with a different name, lobster21_busyday.  

What if, after seeing the detailed look at the days with more than 4000 unique visitors, management asks for yet another condition – can we just zero in on the Saturdays on which more than 4000 unique visitors arrived?  We can do this using the procedure shown below.  Note that here, each condition is included in a separate parenthetical clause.
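A minimal sketch of combining the two conditions, again with assumed dataframe and column names; note that each condition sits inside its own parentheses, and the conditions are joined with &.

```python
import pandas as pd

# Assumed column names for illustration.
lobster21 = pd.DataFrame({
    "day_of_week": ["Friday", "Saturday", "Saturday", "Monday"],
    "unique_visitors": [4100, 5200, 3900, 2500],
})

# Each condition goes in its own parenthetical clause, joined with &.
busy_saturdays = lobster21[
    (lobster21["unique_visitors"] > 4000)
    & (lobster21["day_of_week"] == "Saturday")
]
```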

Next, what if management asks you to modify this yet again – this time, they want to see all of the Fridays, Saturdays, and Sundays on which the park exceeded the 4000 unique visitor threshold.  One handy way to achieve this is by using the isin() method in one of our conditions, passing it a pre-defined list, as shown below.
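A sketch of the isin() approach, again with assumed names; the list of qualifying days is defined first, then passed to isin().

```python
import pandas as pd

# Assumed column names for illustration.
lobster21 = pd.DataFrame({
    "day_of_week": ["Friday", "Saturday", "Sunday", "Monday"],
    "unique_visitors": [4100, 5200, 4300, 4600],
})

# Pre-define the list of days, then pass it to isin().
weekend_days = ["Friday", "Saturday", "Sunday"]
busy_weekend = lobster21[
    (lobster21["unique_visitors"] > 4000)
    & (lobster21["day_of_week"].isin(weekend_days))
]
```

Note that the Monday row is excluded even though it clears the 4000-visitor threshold, because "Monday" is not in the list.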

We can arrange the rows in a dataset based on the values in some particular column, by using the sort_values() function, as shown below.
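A minimal sketch with assumed names; by default, sort_values() arranges rows from smallest to largest value in the chosen column.

```python
import pandas as pd

# Assumed column names for illustration.
lobster21 = pd.DataFrame({
    "day_of_week": ["Friday", "Saturday", "Sunday"],
    "unique_visitors": [4100, 5200, 3900],
})

# Sort rows by visitor count, smallest to largest (the default order).
by_visitors = lobster21.sort_values(by="unique_visitors")
```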

To instead view the rows sorted from largest to smallest, we need to include the ascending=False parameter inside of the sort_values() function.
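The same sketch with the sort order reversed; ascending=False places the largest values at the top.

```python
import pandas as pd

# Assumed column names for illustration.
lobster21 = pd.DataFrame({
    "day_of_week": ["Friday", "Saturday", "Sunday"],
    "unique_visitors": [4100, 5200, 3900],
})

# ascending=False puts the largest visitor counts at the top.
by_visitors_desc = lobster21.sort_values(by="unique_visitors", ascending=False)
```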

Column renaming is a very useful part of your data preparation toolkit.  Oftentimes, you will encounter column names that are vague, wordy, or inconsistent with the standards of your organization.
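One way to rename columns is with the rename() method, passing a dictionary that maps old names to new ones; the names below are assumptions for illustration.

```python
import pandas as pd

# Assumed column names for illustration.
lobster21 = pd.DataFrame({
    "day_of_week": ["Friday"],
    "unique_visitors": [4100],
})

# Map old column names to new ones; columns not mentioned are unchanged.
lobster21 = lobster21.rename(columns={"unique_visitors": "visitors"})
```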

Next, we will take a quick look at yet another very common data preparation operation – column removal.  Whenever you have columns in a dataset that you know you will not need, whether because their values are unnecessary, redundant, or laden with too many NaNs to be useful, you may wish to just remove them entirely.

Cleaning up the dataset by removing such variables can make your subsequent analysis much easier and more streamlined.
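A minimal sketch of column removal with drop(); the column named notes is a hypothetical stand-in for a variable that is mostly NaN and not needed.

```python
import pandas as pd

# 'notes' stands in for a column laden with NaNs that we will not need.
lobster21 = pd.DataFrame({
    "day_of_week": ["Friday", "Saturday"],
    "unique_visitors": [4100, 5200],
    "notes": [None, None],
})

# drop() with the columns= parameter removes the listed columns entirely.
lobster21 = lobster21.drop(columns=["notes"])
```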

In summary, the EDA process yields important insights with implications for our analysis. This table gives you an idea of how those findings can be utilized, but it is by no means an exhaustive list:

| EDA technique | Finding | What does it mean? |
|---|---|---|
| Reviewing the dataset description (if available), and examining the variables’ types | A categorical variable in the dataset is represented by numerical values, and is currently seen as numeric by Python | This variable should be converted to a categorical type so that it can be used in subsequent models and/or visualizations |
| | A dataset description is not readily available, and some variable’s meaning is unclear | If possible, search online for the original dataset source and/or contact the original dataset author |
| Checking for missingness, impossible values, and inconsistent formats | There are no missing values, impossible values, or inconsistent formats | Further analysis can proceed without any need to modify the dataset to address these issues |
| | There are missing values, impossible values, and/or inconsistent formats | The modeler will need to make a judgment call about the best way to handle them. Some data modeling techniques (e.g. regression) require ‘complete cases’ for rows, which could greatly reduce your available data |
| Mean and median | Median and mean are close | The distribution is likely roughly symmetric; if it is also approximately normal, the empirical rule can be applied in subsequent analysis |
| | Median and mean are far apart | Check for the presence of outliers and examine them before deciding your next course of action (boxplots will also help identify outliers). Outliers will need to be considered when building a linear or logistic regression model, or when constructing distance-based machine learning models such as k-nearest neighbors, k-means clustering, or hierarchical clustering |
| Mode | One number or character is more common than the others | The mode can be used to impute missing values |
| Skewness | Normal distribution | The empirical rule can be applied in subsequent analysis |
| | Left or right skewed | If your goal is to use the variable as a predictor in a linear regression model, applying a variable transformation may improve model accuracy (this will be covered in Chapter 16) |
| Correlation | There is a very strong linear relationship between two variables | Informs an analysis; helps identify the risk of multicollinearity among independent variables; impacts linear and logistic regression models |
| | There is a perfect (1.0) correlation between two variables | This merits further investigation. The two variables may be providing redundant information (e.g. a monetary variable coded in one currency, and then again in another, or a distance provided in both miles and kilometers) |