import pandas as pd9 DataFrames
A DataFrame is like a spreadsheet, but more dynamic and versatile. The Pandas library contains many useful functions for working with DataFrames, and we include Pandas with:
Below are the results of the Rubik’s WCA World Championship 2025 which took place at the Seattle Convention Center on July 6, 2025. The sixteen finalists are contained within a list of sixteen lists below, including their country and their times from five rounds of solving. We use that list to create a Pandas DataFrame with column names:
rubiks_final = [
["Yiheng Wang (王艺衡)","China",5.45,3.80,4.31,3.94,4.43],
["Xuanyi Geng (耿暄一)","China",4.94,4.30,4.24,3.95,5.10],
["Tymon Kolasiński","Poland",4.61,4.89,5.31,4.74,5.80],
["Max Park","USA",6.07,5.21,5.90,6.05,5.81],
["Luke Garrett","USA",10.36,4.68,5.83,6.04,6.27],
["Luke Griesser","USA",6.47,5.67,5.94,5.73,6.70],
["Qixian Cao (曹岂娴)","China",7.00,5.03,5.72,4.41,5.55],
["Timofei Tarasenko","Russia",6.61,4.28,5.58,5.82,5.41],
["Teodor Zajder","Poland",5.57,5.85,6.47,5.09,5.07],
["Neo Cuares","Philippines",6.27,5.23,5.16,6.24,6.44],
["Dylan Miller","USA",5.22,5.46,6.47,7.57,5.61],
["Alexey Tsvetkov","Russia",7.17,6.34,5.15,8.05,6.27],
["Matty Hiroto Inaba","USA",5.11,5.34,5.72,5.35,8.37],
["Seung Hyuk Nahm (남승혁)","South Korea",5.65,6.58,5.80,7.89,6.60],
["Kai-Wen Wang (王楷文)","Chinese Taipei",6.84,6.17,8.67,9.66,8.20],
["Olaf Kuźmiński","Poland",6.62,7.19,5.88,4.76,5.74],
]
columns = ["Name","Country","Round1","Round2","Round3","Round4","Round5"]
df = pd.DataFrame(rubiks_final, columns=columns)
df| Name | Country | Round1 | Round2 | Round3 | Round4 | Round5 | |
|---|---|---|---|---|---|---|---|
| 0 | Yiheng Wang (王艺衡) | China | 5.45 | 3.80 | 4.31 | 3.94 | 4.43 |
| 1 | Xuanyi Geng (耿暄一) | China | 4.94 | 4.30 | 4.24 | 3.95 | 5.10 |
| 2 | Tymon Kolasiński | Poland | 4.61 | 4.89 | 5.31 | 4.74 | 5.80 |
| 3 | Max Park | USA | 6.07 | 5.21 | 5.90 | 6.05 | 5.81 |
| 4 | Luke Garrett | USA | 10.36 | 4.68 | 5.83 | 6.04 | 6.27 |
| 5 | Luke Griesser | USA | 6.47 | 5.67 | 5.94 | 5.73 | 6.70 |
| 6 | Qixian Cao (曹岂娴) | China | 7.00 | 5.03 | 5.72 | 4.41 | 5.55 |
| 7 | Timofei Tarasenko | Russia | 6.61 | 4.28 | 5.58 | 5.82 | 5.41 |
| 8 | Teodor Zajder | Poland | 5.57 | 5.85 | 6.47 | 5.09 | 5.07 |
| 9 | Neo Cuares | Philippines | 6.27 | 5.23 | 5.16 | 6.24 | 6.44 |
| 10 | Dylan Miller | USA | 5.22 | 5.46 | 6.47 | 7.57 | 5.61 |
| 11 | Alexey Tsvetkov | Russia | 7.17 | 6.34 | 5.15 | 8.05 | 6.27 |
| 12 | Matty Hiroto Inaba | USA | 5.11 | 5.34 | 5.72 | 5.35 | 8.37 |
| 13 | Seung Hyuk Nahm (남승혁) | South Korea | 5.65 | 6.58 | 5.80 | 7.89 | 6.60 |
| 14 | Kai-Wen Wang (王楷文) | Chinese Taipei | 6.84 | 6.17 | 8.67 | 9.66 | 8.20 |
| 15 | Olaf Kuźmiński | Poland | 6.62 | 7.19 | 5.88 | 4.76 | 5.74 |
df is short for DataFrame and is a common variable name. While this entire dataset is readable, it is tiny compared with those we will see later.
9.1 Familiarize with Data
One of the first steps when working with a new dataset is to get a sense of its size and structure. The following functions are often the first we will run on any new dataset. Run these on the our Rubik’s DataFrame to get a sense of the functionality of each.
| Pandas | Description | |
|---|---|---|
| Structure | df.shape |
Number of rows and columns |
df.columns |
Column labels | |
df.info() |
Summary of columns | |
| Inspecting | df.head() |
First 5 rows |
df.tail() |
Last 5 rows | |
df.sample(5) |
5 Random rows | |
| Missing Data | df.isnull().sum() |
Number of missing values |
df.isnull().mean() * 100 |
Percent of values missing | |
| Summaries | df.describe() |
|
df.describe(include='all') |
9.2 Slicing a DataFrame
There are several ways to select a subset of a DataFrame.
9.2.1 Selecting a Column
Given our Rubik’s DataFrame as df, we can select all of the countries by specifying the column label:
df['Country']0 China
1 China
2 Poland
3 USA
4 USA
5 USA
6 China
7 Russia
8 Poland
9 Philippines
10 USA
11 Russia
12 USA
13 South Korea
14 Chinese Taipei
15 Poland
Name: Country, dtype: object
We can select multiple columns by specifying a list of columns:
df[ ['Round1', 'Round2', 'Round3'] ]| Round1 | Round2 | Round3 | |
|---|---|---|---|
| 0 | 5.45 | 3.80 | 4.31 |
| 1 | 4.94 | 4.30 | 4.24 |
| 2 | 4.61 | 4.89 | 5.31 |
| 3 | 6.07 | 5.21 | 5.90 |
| 4 | 10.36 | 4.68 | 5.83 |
| 5 | 6.47 | 5.67 | 5.94 |
| 6 | 7.00 | 5.03 | 5.72 |
| 7 | 6.61 | 4.28 | 5.58 |
| 8 | 5.57 | 5.85 | 6.47 |
| 9 | 6.27 | 5.23 | 5.16 |
| 10 | 5.22 | 5.46 | 6.47 |
| 11 | 7.17 | 6.34 | 5.15 |
| 12 | 5.11 | 5.34 | 5.72 |
| 13 | 5.65 | 6.58 | 5.80 |
| 14 | 6.84 | 6.17 | 8.67 |
| 15 | 6.62 | 7.19 | 5.88 |
9.2.2 Selecting Rows and Columns with iloc
The .iloc[] indexer works by selecting rows and columns just like array slicing, by specifying values or ranges. We can select all of the rows and just the round columns with:
df.iloc[:, 2:7]| Round1 | Round2 | Round3 | Round4 | Round5 | |
|---|---|---|---|---|---|
| 0 | 5.45 | 3.80 | 4.31 | 3.94 | 4.43 |
| 1 | 4.94 | 4.30 | 4.24 | 3.95 | 5.10 |
| 2 | 4.61 | 4.89 | 5.31 | 4.74 | 5.80 |
| 3 | 6.07 | 5.21 | 5.90 | 6.05 | 5.81 |
| 4 | 10.36 | 4.68 | 5.83 | 6.04 | 6.27 |
| 5 | 6.47 | 5.67 | 5.94 | 5.73 | 6.70 |
| 6 | 7.00 | 5.03 | 5.72 | 4.41 | 5.55 |
| 7 | 6.61 | 4.28 | 5.58 | 5.82 | 5.41 |
| 8 | 5.57 | 5.85 | 6.47 | 5.09 | 5.07 |
| 9 | 6.27 | 5.23 | 5.16 | 6.24 | 6.44 |
| 10 | 5.22 | 5.46 | 6.47 | 7.57 | 5.61 |
| 11 | 7.17 | 6.34 | 5.15 | 8.05 | 6.27 |
| 12 | 5.11 | 5.34 | 5.72 | 5.35 | 8.37 |
| 13 | 5.65 | 6.58 | 5.80 | 7.89 | 6.60 |
| 14 | 6.84 | 6.17 | 8.67 | 9.66 | 8.20 |
| 15 | 6.62 | 7.19 | 5.88 | 4.76 | 5.74 |
9.2.3 Selecting Rows by Value
Sometimes we only want rows with certain values. We can achieve this with a logical expression. For example:
df[df['Country'] == 'USA']| Name | Country | Round1 | Round2 | Round3 | Round4 | Round5 | |
|---|---|---|---|---|---|---|---|
| 3 | Max Park | USA | 6.07 | 5.21 | 5.90 | 6.05 | 5.81 |
| 4 | Luke Garrett | USA | 10.36 | 4.68 | 5.83 | 6.04 | 6.27 |
| 5 | Luke Griesser | USA | 6.47 | 5.67 | 5.94 | 5.73 | 6.70 |
| 10 | Dylan Miller | USA | 5.22 | 5.46 | 6.47 | 7.57 | 5.61 |
| 12 | Matty Hiroto Inaba | USA | 5.11 | 5.34 | 5.72 | 5.35 | 8.37 |
This one is weird at first. The inner statement df['Country'] == 'USA' creates a series of True and False values, then the outer df[] will provide only rows where the inner series is True.
This approach also works with inequalities. For example, we can find all of the rows with a round 2 result less than 5 seconds:
df[ df['Round2'] < 5 ]| Name | Country | Round1 | Round2 | Round3 | Round4 | Round5 | |
|---|---|---|---|---|---|---|---|
| 0 | Yiheng Wang (王艺衡) | China | 5.45 | 3.80 | 4.31 | 3.94 | 4.43 |
| 1 | Xuanyi Geng (耿暄一) | China | 4.94 | 4.30 | 4.24 | 3.95 | 5.10 |
| 2 | Tymon Kolasiński | Poland | 4.61 | 4.89 | 5.31 | 4.74 | 5.80 |
| 4 | Luke Garrett | USA | 10.36 | 4.68 | 5.83 | 6.04 | 6.27 |
| 7 | Timofei Tarasenko | Russia | 6.61 | 4.28 | 5.58 | 5.82 | 5.41 |
Summary of DataFrame Slicing
| Python | Description |
|---|---|
df['column_name'] |
Select a column by name. |
df[['col1', 'col2']] |
Select multiple columns. |
df.iloc[5:10,:] |
Select rows and columns by index. |
df['Country'] == 'USA |
Select by value. |
9.3 Working with Series
Each column in a DataFrame is a series. For a given column, here we will call it ‘feature’, we can find:
| Pandas | Description |
|---|---|
df['feature'].unique() |
List of unique values |
df['feature'].value_counts() |
Series - Number of occurances |
df['feature'].mean() |
Mean |
df['feature'].median() |
Median |
df['feature'].std() |
Standard deviation |
df['feature'].min() |
Minimum value |
df['feature'].max() |
Maximum value |
For example, the best time on Round 1 of the 2025 World Championship was:
df['Round1'].min()4.61
9.4 Rubik’s Cube 2025 Final Rankings
In a Rubik’s speed-cubing tournament, competitors solve five different shuffles. Standard scoring is to drop their best and worst times from those five solves and average their middle three times. This is called the Average of Middle, or AoM.
One way that we can calculate the AoM and determine rankings for players is as follows:
# Grab the times
times = df.iloc[:,2:7]
# Calculate and insert AoM as a new column
times_sum = times.sum(axis=1)
times_min = times.min(axis=1)
times_max = times.max(axis=1)
df['AoM'] = (times_sum - times_min - times_max) / 3
# Sort
df.sort_values('AoM')| Name | Country | Round1 | Round2 | Round3 | Round4 | Round5 | AoM | |
|---|---|---|---|---|---|---|---|---|
| 0 | Yiheng Wang (王艺衡) | China | 5.45 | 3.80 | 4.31 | 3.94 | 4.43 | 4.226667 |
| 1 | Xuanyi Geng (耿暄一) | China | 4.94 | 4.30 | 4.24 | 3.95 | 5.10 | 4.493333 |
| 2 | Tymon Kolasiński | Poland | 4.61 | 4.89 | 5.31 | 4.74 | 5.80 | 4.980000 |
| 6 | Qixian Cao (曹岂娴) | China | 7.00 | 5.03 | 5.72 | 4.41 | 5.55 | 5.433333 |
| 12 | Matty Hiroto Inaba | USA | 5.11 | 5.34 | 5.72 | 5.35 | 8.37 | 5.470000 |
| 8 | Teodor Zajder | Poland | 5.57 | 5.85 | 6.47 | 5.09 | 5.07 | 5.503333 |
| 7 | Timofei Tarasenko | Russia | 6.61 | 4.28 | 5.58 | 5.82 | 5.41 | 5.603333 |
| 10 | Dylan Miller | USA | 5.22 | 5.46 | 6.47 | 7.57 | 5.61 | 5.846667 |
| 9 | Neo Cuares | Philippines | 6.27 | 5.23 | 5.16 | 6.24 | 6.44 | 5.913333 |
| 3 | Max Park | USA | 6.07 | 5.21 | 5.90 | 6.05 | 5.81 | 5.920000 |
| 4 | Luke Garrett | USA | 10.36 | 4.68 | 5.83 | 6.04 | 6.27 | 6.046667 |
| 5 | Luke Griesser | USA | 6.47 | 5.67 | 5.94 | 5.73 | 6.70 | 6.046667 |
| 15 | Olaf Kuźmiński | Poland | 6.62 | 7.19 | 5.88 | 4.76 | 5.74 | 6.080000 |
| 13 | Seung Hyuk Nahm (남승혁) | South Korea | 5.65 | 6.58 | 5.80 | 7.89 | 6.60 | 6.326667 |
| 11 | Alexey Tsvetkov | Russia | 7.17 | 6.34 | 5.15 | 8.05 | 6.27 | 6.593333 |
| 14 | Kai-Wen Wang (王楷文) | Chinese Taipei | 6.84 | 6.17 | 8.67 | 9.66 | 8.20 | 7.903333 |
Exercises
Given the data from the 2025 Rubik’s WCA World Championship,
- Create a box plot of the times for rounds 1 through 5.
- Create a box plot of the five times for each player. These should all be in one figure.
Your first figure should have five box plots (one for each round). Your second figure should have 15 boxplots (one for each player).
How would the standings of the Rubik’s WCA World Championship 2025 differ if results were based on:
- The average of all five times?
- The average of the three best times?
- The best single time?
Create a dataframe called
dicewhere the first column is 500 random values from 1 to 6, the second column is 500 random values from 1 to 6, and the third column is the sum of the first two columns. The column names should be “Roll1”, “Roll2”, and “Sum”. Create a histogram of the “Sum” column.Netflix
Trending YouTube Statistics
FEV Data