Getting started with artificial intelligence (AI): Part 3 - Transforming dataframes
Nathaniel Tjandra
TLDR
Terraforming means reshaping a planet on a large scale so that it can be inhabited for survival. We’ll begin by terraforming datasets to calculate the cost of survival on the Titanic.
Introduction
Before we begin
Functional programming
Applying Function
Aggregating Data
Transforming Data
Data Analysis
Conclusion
From the SHAP article, we know that people in some groups were more likely to survive when the Titanic sank. But what does it cost to survive the Titanic?
Titanic meets Iceberg (Source: Britannica)
In “Product developers’ guide to getting started with AI, Part 3: Terraforming dataframes”, we’ll look at the price point of a “golden ticket” that gives the best chance of survival. The SHAP values calculated in that article indicate that survival is related to sex, passenger class, fare, and age.
Mage Analyzer Page (Source: SHAP)
Manipulating datasets is a quick and easy way to rearrange data and extract what you need. In this series we’ve gone over how to pick and search through data, so it’s time to look at transforming the underlying data.
It is highly advised to read part 2 before continuing.
In this guide, we’ll be using the Titanic dataset along with Google Colab.
I’ll briefly reuse techniques from the previous parts, such as surfing and extracting, to quickly start us off with an ideal dataframe for applying transformations and functions.
Part 2: Surfing through dataframes
Python supports functional programming: functions are values that can be passed to other functions like any other argument. This is important because later in this guide we’ll be creating functions and passing lambda expressions to apply and transform. If you are already comfortable with Python, you may skip this section. Otherwise, keep reading for a quick refresher on the syntax for defining functions and lambda expressions.
In Python, a function is created by the “def” keyword and takes in a number of arguments.
Basic Adder that adds 1 to the value
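A minimal sketch of such a function is shown here. The name adder and the parameter name value are illustrative choices, not taken from the original code.

```python
def adder(value):
    """Return the given value increased by 1."""
    return value + 1


print(adder(1))  # prints 2
```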
The adder function can be rewritten as a lambda expression, which is a shorthand for a small function.
Lambda expression of the adder
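As a sketch, the same adder written as a lambda expression could look like this. The second half shows the more common use, where the lambda is passed directly to another function (here the built-in map) without being given a name.

```python
# The adder as a lambda expression bound to a name.
adder = lambda value: value + 1
print(adder(1))  # prints 2

# A lambda passed inline to another function.
incremented = list(map(lambda value: value + 1, [1, 2, 3]))
print(incremented)  # prints [2, 3, 4]
```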
For a small operation, like the adder above, a lambda expression is a handy shorthand, especially when it is passed directly to another function. For more complex calculations that are used multiple times, define a function. When in doubt, check whether there is a simpler way and how much repetition will occur.
The simplest way to manipulate a dataframe is with apply. Apply takes in a function and repeats it for either every column or every row of a dataframe, or for every value of a single column. Typical applications are quickly calculating or encrypting data.
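The sketch below illustrates the three ways apply can walk over data. The tiny dataframe is made up for illustration and is not the Titanic dataset; only the column names Fare and Age are borrowed from it.

```python
import pandas as pd

# Made-up values, only for demonstrating apply.
sample = pd.DataFrame({"Fare": [100.0, 80.0, 30.0], "Age": [29.0, 38.0, 19.0]})

# On a single column (a Series), the function is called once per value.
fare_plus_one = sample["Fare"].apply(lambda fare: fare + 1)

# On a dataframe, the function is called once per column by default (axis=0) ...
column_max = sample.apply(lambda column: column.max())

# ... or once per row when axis=1 is given.
row_total = sample.apply(lambda row: row["Fare"] + row["Age"], axis=1)

print(fare_plus_one.tolist())  # prints [101.0, 81.0, 31.0]
print(column_max.to_dict())    # prints {'Fare': 100.0, 'Age': 38.0}
print(row_total.tolist())      # prints [129.0, 118.0, 49.0]
```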
Based on the SHAP values, we form a hypothesis that women and children are more likely to survive. One possible reason is that they could board the lifeboats first. Another is that the upper class areas of the ship were less densely populated, allowing those passengers to escape more quickly than passengers in the lower class.
Lifeboats on the Titanic (Source: DailyMail)
To find the average price point of the winning ticket, a ticket for a young lady in 1st class, we first need to filter down our rows and columns. In the dataframe, “Pclass” represents whether a passenger is located in the 1st class, 2nd class, or 3rd class area of the Titanic. The average is calculated as the sum of the prices divided by the total number, or count, of items, but may also be calculated with the mean method.
Using what we’ve learned in part 2, we filter the rows based on the sex, passenger class, and age columns. We define our filter as:
Having the sex of a female
Passenger class of only 1st class
Age must be no higher than 40 years old
Then we reduce it to only the relevant column: ‘Fare’, the price of the golden ticket.
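The filter and the column selection can be sketched as follows. The dataframe here is a small made-up stand-in that only reuses the Titanic column names, so the numbers will not match the real dataset. The age condition is written as Age <= 40 to match the “young lady” description; that cutoff direction is an assumption.

```python
import pandas as pd

# Made-up rows using the Titanic column names (not the real data).
df = pd.DataFrame({
    "Sex":    ["female", "female", "male", "female", "male", "female", "female"],
    "Pclass": [1, 1, 1, 3, 3, 1, 1],
    "Age":    [29.0, 38.0, 45.0, 22.0, 30.0, 58.0, 19.0],
    "Fare":   [100.0, 80.0, 50.0, 10.0, 8.0, 150.0, 30.0],
})

# One boolean condition per rule, combined with & (each wrapped in parentheses).
is_golden = (df["Sex"] == "female") & (df["Pclass"] == 1) & (df["Age"] <= 40)

# Keep only the matching rows and only the Fare column.
golden_fares = df.loc[is_golden, "Fare"]

print(golden_fares.tolist())  # prints [100.0, 80.0, 30.0]
```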
Then, we take the sum of the ‘Fare’ column and divide by the total number of items.
The total price of all golden tickets is $6484.80
Average price of $113.77
Unlike part 2, where we overwrote the values, here we store the result in a new variable called average_price. This lets us preserve the original data.
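A minimal sketch of the sum-divided-by-count calculation is shown here. The three fares are made-up example values standing in for the filtered ‘Fare’ column; average_price is the variable name from the text, while total_price and ticket_count are illustrative names.

```python
import pandas as pd

# Made-up fares standing in for the filtered golden ticket prices.
golden_fares = pd.Series([100.0, 80.0, 30.0], name="Fare")

total_price = golden_fares.sum()      # sum of all prices
ticket_count = golden_fares.count()   # number of non-missing prices

# Store the result in a new variable so golden_fares itself stays unchanged.
average_price = total_price / ticket_count

print(f"{total_price:.2f}")    # prints 210.00
print(f"{average_price:.2f}")  # prints 70.00
```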
We can confirm this is the same as the result of calculating the mean of the prices.
The mean matches the average price of $113.77
Pandas has many other built-in mathematical functions, such as median.
Median is $86.50
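As a sketch with made-up fares, the built-in mean and median methods can be checked against the manual calculation like this:

```python
import pandas as pd

# Made-up fares standing in for the filtered golden ticket prices.
golden_fares = pd.Series([100.0, 80.0, 30.0], name="Fare")

manual_average = golden_fares.sum() / golden_fares.count()
mean_price = golden_fares.mean()      # same value as the manual average
median_price = golden_fares.median()  # middle value of the sorted prices

print(mean_price == manual_average)  # prints True
print(median_price)                  # prints 80.0
```

The median differs from the mean here because one low fare pulls the mean down, which is also why the two numbers differ on the real data.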
Unfortunately, each of these values has to be calculated in a separate call. That makes apply good for short, one-off functions, but what about running several calculations at once? That’s where aggregate, or agg, shines by removing the repetition.
If you know which aggregations you want ahead of time, use agg instead. When calculating several values together, such as the sum, mean, and standard deviation, aggregate is a neater way to do it than apply.
For instance, if we were to use aggregate instead, we could grab multiple values all at once. For our next section, we’ll need the standard deviation, so let’s calculate that as well. Note: The shorthand is agg, which is functionally equivalent to aggregate.
One-liner for sum, mean, max, and median
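A sketch of such a one-liner, again on made-up fares, passes a list of function names to agg. The standard deviation (“std”) is included alongside the four values named in the caption.

```python
import pandas as pd

# Made-up fares standing in for the filtered golden ticket prices.
golden_fares = pd.Series([100.0, 80.0, 30.0], name="Fare")

# One call returns every requested statistic, indexed by function name.
summary = golden_fares.agg(["sum", "mean", "max", "median", "std"])

print(summary["sum"])     # prints 210.0
print(summary["median"])  # prints 80.0
```

Calling golden_fares.aggregate with the same list gives the same result.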
Another way of manipulating a dataframe is by using transform. This is similar to apply, except that the result always has the same shape as the column or dataframe it was called on. Because the output lines up with the input row for row, it can be stored straight back next to the original values, which lets several operations be chained on the same data.
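Because the result of transform lines up with its input, it must be the same length as the original input. This means that functions such as sum(), mean(), and max()/min() don’t work when passed to transform on a plain column or dataframe, as they condense or aggregate all the data into 1 value. Combined with groupby, however, transform broadcasts each group’s aggregate back to every row of that group, so the length is preserved.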
Calculate individual percentages
Back to the original problem: find out what percentage of passengers have a “golden ticket”. Using transform, we can combine an aggregate, such as the total, with each value in a series to calculate the individual percentages. This makes transform more useful for looking at the finer details.
Calculate individual percentages
As a sanity check, summing the individual results should give 1.0 (100%).
Sanity Check
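The sketch below shows both points on made-up fares: transform returning one percentage per ticket, and transform rejecting a function that collapses the column to a single value. The variable names are illustrative, and the exact error message may differ between pandas versions.

```python
import pandas as pd

# Made-up fares standing in for the filtered golden ticket prices.
golden_fares = pd.Series([100.0, 80.0, 30.0], name="Fare")

# Each fare as a fraction of the total: one output value per input value.
total = golden_fares.sum()
share = golden_fares.transform(lambda fare: fare / total)
print(share.round(3).tolist())  # prints [0.476, 0.381, 0.143]

# Sanity check: the individual fractions add up to (approximately) 1.0.
print(abs(share.sum() - 1.0) < 1e-9)  # prints True

# A reducing function returns a single value, so transform refuses it.
try:
    golden_fares.transform("sum")
    reducer_accepted = True
except ValueError:
    reducer_accepted = False
print(reducer_accepted)  # prints False
```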
To find out how many passengers paid top dollar, first we take the original dataset and calculate the percentages. We leverage transform’s ability to maintain length, along with groupby to group our data.
What slice of the “pie” do the golden ticket passengers make up?
23% of all income on the ship is from golden ticket sales.
What percentage of passengers own a golden ticket?
Only 6% of all passengers purchased a golden ticket.
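One way to sketch this with groupby and transform is shown below. The dataframe is a small made-up stand-in that reuses the Titanic column names, so it produces different percentages from the 23% and 6% reported for the real data. The helper columns Golden and IncomeShare are hypothetical names, and the age cutoff Age <= 40 is an assumption.

```python
import pandas as pd

# Made-up rows using the Titanic column names (not the real data).
df = pd.DataFrame({
    "Sex":    ["female", "female", "male", "female", "male", "female", "female"],
    "Pclass": [1, 1, 1, 3, 3, 1, 1],
    "Age":    [29.0, 38.0, 45.0, 22.0, 30.0, 58.0, 19.0],
    "Fare":   [100.0, 80.0, 50.0, 10.0, 8.0, 150.0, 30.0],
})

# Flag golden ticket holders (hypothetical helper column).
df["Golden"] = (df["Sex"] == "female") & (df["Pclass"] == 1) & (df["Age"] <= 40)

# groupby + transform keeps one row per passenger and fills in the
# total fare of that passenger's group (golden or not golden).
group_total = df.groupby("Golden")["Fare"].transform("sum")
df["IncomeShare"] = group_total / df["Fare"].sum()

# Share of all fare income that comes from golden tickets.
income_share = df.loc[df["Golden"], "IncomeShare"].iloc[0]

# Share of passengers holding a golden ticket (mean of a boolean column).
passenger_share = df["Golden"].mean()

print(f"{income_share:.0%}")     # prints 49%
print(f"{passenger_share:.0%}")  # prints 43%
```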
Key Differences
Transform returns a result that lines up with its input, so the equal-length requirement must be satisfied. Therefore, transform on its own can’t handle aggregate methods (sum, mean, standard deviation, etc.).
Apply is typically given one function at a time, while agg is designed to take several aggregations at once.
Transform is best used to create a new column in a table to see fine detail.
Aggregate and apply are useful for calculating a single summary value.
That’s it! Now you’re ready to tackle future problems in data science. Using your newfound knowledge, I suggest modifying the steps to calculate what percentage of golden ticket holders survive, as your next step in familiarizing yourself with these core AI concepts. As always, stay tuned for future guides where we’ll go over more topics ranging from joining datasets to deploying a machine learning model to the cloud.
I’ve got a Golden Ticket! (Source: South Park)