9  DataFrames

A DataFrame is like a spreadsheet, but more dynamic and versatile. The Pandas library contains many useful functions for working with DataFrames, and we include Pandas with:

import pandas as pd

Below are the results of the Rubik’s WCA World Championship 2025 which took place at the Seattle Convention Center on July 6, 2025. The sixteen finalists are contained within a list of sixteen lists below, including their country and their times from five rounds of solving. We use that list to create a Pandas DataFrame with column names:

rubiks_final = [
    ["Yiheng Wang (王艺衡)","China",5.45,3.80,4.31,3.94,4.43],
    ["Xuanyi Geng (耿暄一)","China",4.94,4.30,4.24,3.95,5.10],
    ["Tymon Kolasiński","Poland",4.61,4.89,5.31,4.74,5.80],
    ["Max Park","USA",6.07,5.21,5.90,6.05,5.81],
    ["Luke Garrett","USA",10.36,4.68,5.83,6.04,6.27],
    ["Luke Griesser","USA",6.47,5.67,5.94,5.73,6.70],
    ["Qixian Cao (曹岂娴)","China",7.00,5.03,5.72,4.41,5.55],
    ["Timofei Tarasenko","Russia",6.61,4.28,5.58,5.82,5.41],
    ["Teodor Zajder","Poland",5.57,5.85,6.47,5.09,5.07],
    ["Neo Cuares","Philippines",6.27,5.23,5.16,6.24,6.44],
    ["Dylan Miller","USA",5.22,5.46,6.47,7.57,5.61],
    ["Alexey Tsvetkov","Russia",7.17,6.34,5.15,8.05,6.27],
    ["Matty Hiroto Inaba","USA",5.11,5.34,5.72,5.35,8.37],
    ["Seung Hyuk Nahm (남승혁)","South Korea",5.65,6.58,5.80,7.89,6.60],
    ["Kai-Wen Wang (王楷文)","Chinese Taipei",6.84,6.17,8.67,9.66,8.20],
    ["Olaf Kuźmiński","Poland",6.62,7.19,5.88,4.76,5.74],
]

columns = ["Name","Country","Round1","Round2","Round3","Round4","Round5"]

df = pd.DataFrame(rubiks_final, columns=columns)

df
Name Country Round1 Round2 Round3 Round4 Round5
0 Yiheng Wang (王艺衡) China 5.45 3.80 4.31 3.94 4.43
1 Xuanyi Geng (耿暄一) China 4.94 4.30 4.24 3.95 5.10
2 Tymon Kolasiński Poland 4.61 4.89 5.31 4.74 5.80
3 Max Park USA 6.07 5.21 5.90 6.05 5.81
4 Luke Garrett USA 10.36 4.68 5.83 6.04 6.27
5 Luke Griesser USA 6.47 5.67 5.94 5.73 6.70
6 Qixian Cao (曹岂娴) China 7.00 5.03 5.72 4.41 5.55
7 Timofei Tarasenko Russia 6.61 4.28 5.58 5.82 5.41
8 Teodor Zajder Poland 5.57 5.85 6.47 5.09 5.07
9 Neo Cuares Philippines 6.27 5.23 5.16 6.24 6.44
10 Dylan Miller USA 5.22 5.46 6.47 7.57 5.61
11 Alexey Tsvetkov Russia 7.17 6.34 5.15 8.05 6.27
12 Matty Hiroto Inaba USA 5.11 5.34 5.72 5.35 8.37
13 Seung Hyuk Nahm (남승혁) South Korea 5.65 6.58 5.80 7.89 6.60
14 Kai-Wen Wang (王楷文) Chinese Taipei 6.84 6.17 8.67 9.66 8.20
15 Olaf Kuźmiński Poland 6.62 7.19 5.88 4.76 5.74

df is short for DataFrame and is a common variable name. While this entire dataset is readable, it is tiny compared with those we will see later.

9.1 Familiarize with Data

One of the first steps when working with a new dataset is to get a sense of its size and structure. The following functions are often the first we will run on any new dataset. Run these on the our Rubik’s DataFrame to get a sense of the functionality of each.

Pandas Description
Structure df.shape Number of rows and columns
df.columns Column labels
df.info() Summary of columns
Inspecting df.head() First 5 rows
df.tail() Last 5 rows
df.sample(5) 5 Random rows
Missing Data df.isnull().sum() Number of missing values
df.isnull().mean() * 100 Percent of values missing
Summaries df.describe()
df.describe(include='all')

9.2 Slicing a DataFrame

There are several ways to select a subset of a DataFrame.

9.2.1 Selecting a Column

Given our Rubik’s DataFrame as df, we can select all of the countries by specifying the column label:

df['Country']
0              China
1              China
2             Poland
3                USA
4                USA
5                USA
6              China
7             Russia
8             Poland
9        Philippines
10               USA
11            Russia
12               USA
13       South Korea
14    Chinese Taipei
15            Poland
Name: Country, dtype: object

We can select multiple columns by specifying a list of columns:

df[ ['Round1', 'Round2', 'Round3'] ]
Round1 Round2 Round3
0 5.45 3.80 4.31
1 4.94 4.30 4.24
2 4.61 4.89 5.31
3 6.07 5.21 5.90
4 10.36 4.68 5.83
5 6.47 5.67 5.94
6 7.00 5.03 5.72
7 6.61 4.28 5.58
8 5.57 5.85 6.47
9 6.27 5.23 5.16
10 5.22 5.46 6.47
11 7.17 6.34 5.15
12 5.11 5.34 5.72
13 5.65 6.58 5.80
14 6.84 6.17 8.67
15 6.62 7.19 5.88

9.2.2 Selecting Rows and Columns with iloc

The .iloc[] indexer works by selecting rows and columns just like array slicing, by specifying values or ranges. We can select all of the rows and just the round columns with:

df.iloc[:, 2:7]
Round1 Round2 Round3 Round4 Round5
0 5.45 3.80 4.31 3.94 4.43
1 4.94 4.30 4.24 3.95 5.10
2 4.61 4.89 5.31 4.74 5.80
3 6.07 5.21 5.90 6.05 5.81
4 10.36 4.68 5.83 6.04 6.27
5 6.47 5.67 5.94 5.73 6.70
6 7.00 5.03 5.72 4.41 5.55
7 6.61 4.28 5.58 5.82 5.41
8 5.57 5.85 6.47 5.09 5.07
9 6.27 5.23 5.16 6.24 6.44
10 5.22 5.46 6.47 7.57 5.61
11 7.17 6.34 5.15 8.05 6.27
12 5.11 5.34 5.72 5.35 8.37
13 5.65 6.58 5.80 7.89 6.60
14 6.84 6.17 8.67 9.66 8.20
15 6.62 7.19 5.88 4.76 5.74

9.2.3 Selecting Rows by Value

Sometimes we only want rows with certain values. We can achieve this with a logical expression. For example:

df[df['Country'] == 'USA']
Name Country Round1 Round2 Round3 Round4 Round5
3 Max Park USA 6.07 5.21 5.90 6.05 5.81
4 Luke Garrett USA 10.36 4.68 5.83 6.04 6.27
5 Luke Griesser USA 6.47 5.67 5.94 5.73 6.70
10 Dylan Miller USA 5.22 5.46 6.47 7.57 5.61
12 Matty Hiroto Inaba USA 5.11 5.34 5.72 5.35 8.37

This one is weird at first. The inner statement df['Country'] == 'USA' creates a series of True and False values, then the outer df[] will provide only rows where the inner series is True.

This approach also works with inequalities. For example, we can find all of the rows with a round 2 result less than 5 seconds:

df[ df['Round2'] < 5 ]
Name Country Round1 Round2 Round3 Round4 Round5
0 Yiheng Wang (王艺衡) China 5.45 3.80 4.31 3.94 4.43
1 Xuanyi Geng (耿暄一) China 4.94 4.30 4.24 3.95 5.10
2 Tymon Kolasiński Poland 4.61 4.89 5.31 4.74 5.80
4 Luke Garrett USA 10.36 4.68 5.83 6.04 6.27
7 Timofei Tarasenko Russia 6.61 4.28 5.58 5.82 5.41

Summary of DataFrame Slicing

Python Description
df['column_name'] Select a column by name.
df[['col1', 'col2']] Select multiple columns.
df.iloc[5:10,:] Select rows and columns by index.
df['Country'] == 'USA Select by value.

9.3 Working with Series

Each column in a DataFrame is a series. For a given column, here we will call it ‘feature’, we can find:

Pandas Description
df['feature'].unique() List of unique values
df['feature'].value_counts() Series - Number of occurances
df['feature'].mean() Mean
df['feature'].median() Median
df['feature'].std() Standard deviation
df['feature'].min() Minimum value
df['feature'].max() Maximum value

For example, the best time on Round 1 of the 2025 World Championship was:

df['Round1'].min()
4.61

9.4 Rubik’s Cube 2025 Final Rankings

In a Rubik’s speed-cubing tournament, competitors solve five different shuffles. Standard scoring is to drop their best and worst times from those five solves and average their middle three times. This is called the Average of Middle, or AoM.

One way that we can calculate the AoM and determine rankings for players is as follows:

# Grab the times 
times = df.iloc[:,2:7]

# Calculate and insert AoM as a new column 
times_sum = times.sum(axis=1)
times_min = times.min(axis=1)
times_max = times.max(axis=1)

df['AoM'] = (times_sum - times_min - times_max) / 3

# Sort 
df.sort_values('AoM')
Name Country Round1 Round2 Round3 Round4 Round5 AoM
0 Yiheng Wang (王艺衡) China 5.45 3.80 4.31 3.94 4.43 4.226667
1 Xuanyi Geng (耿暄一) China 4.94 4.30 4.24 3.95 5.10 4.493333
2 Tymon Kolasiński Poland 4.61 4.89 5.31 4.74 5.80 4.980000
6 Qixian Cao (曹岂娴) China 7.00 5.03 5.72 4.41 5.55 5.433333
12 Matty Hiroto Inaba USA 5.11 5.34 5.72 5.35 8.37 5.470000
8 Teodor Zajder Poland 5.57 5.85 6.47 5.09 5.07 5.503333
7 Timofei Tarasenko Russia 6.61 4.28 5.58 5.82 5.41 5.603333
10 Dylan Miller USA 5.22 5.46 6.47 7.57 5.61 5.846667
9 Neo Cuares Philippines 6.27 5.23 5.16 6.24 6.44 5.913333
3 Max Park USA 6.07 5.21 5.90 6.05 5.81 5.920000
4 Luke Garrett USA 10.36 4.68 5.83 6.04 6.27 6.046667
5 Luke Griesser USA 6.47 5.67 5.94 5.73 6.70 6.046667
15 Olaf Kuźmiński Poland 6.62 7.19 5.88 4.76 5.74 6.080000
13 Seung Hyuk Nahm (남승혁) South Korea 5.65 6.58 5.80 7.89 6.60 6.326667
11 Alexey Tsvetkov Russia 7.17 6.34 5.15 8.05 6.27 6.593333
14 Kai-Wen Wang (王楷文) Chinese Taipei 6.84 6.17 8.67 9.66 8.20 7.903333

Exercises

  1. Given the data from the 2025 Rubik’s WCA World Championship,

    1. Create a box plot of the times for rounds 1 through 5.
    2. Create a box plot of the five times for each player. These should all be in one figure.

    Your first figure should have five box plots (one for each round). Your second figure should have 15 boxplots (one for each player).

  2. How would the standings of the Rubik’s WCA World Championship 2025 differ if results were based on:

    1. The average of all five times?
    2. The average of the three best times?
    3. The best single time?
  3. Create a dataframe called dice where the first column is 500 random values from 1 to 6, the second column is 500 random values from 1 to 6, and the third column is the sum of the first two columns. The column names should be “Roll1”, “Roll2”, and “Sum”. Create a histogram of the “Sum” column.

  4. Netflix

  5. Trending YouTube Statistics

  6. FEV Data