Iceberg Schema Design for Real-Time Game Analytics¶
Learning Objectives¶
By the end of this chapter, you will:
- Understand how to design Iceberg tables for streaming game analytics data
- Learn effective partitioning strategies for different query patterns
- Gain insights into optimizing sort orders for performance
- Master the configuration of write properties for efficient updates
- See how to structure tables for various analytical perspectives (player performance, time-based analysis, and color-based trends)
Context and Overview¶
Our balloon pop game generates a stream of events when players interact with the game. These events are sent to Kafka and need to be stored in a way that enables both real-time dashboards and historical analysis.
The balloon_pops
database is designed with five Iceberg tables, each focusing on different analytical dimensions:
- Leaderboard: Tracks overall player performance
- Balloon Color Stats: Records statistics by balloon color for each player
- Realtime Scores: Captures time-windowed scoring data
- Balloon Colored Pops: Tracks detailed pop statistics in time windows
- Color Performance Trends: Analyzes effectiveness of different balloon colors
All tables use Iceberg Format Version 2, enabling advanced features like merge-on-read operations for efficient updates without rewriting entire files—particularly important for frequently updated gaming data.
Important
Throughout this tutorial, we'll explore common design patterns and trade-offs in schema design. Pay special attention to how partitioning strategies differ based on expected query patterns.
Table Specifications¶
Leaderboard¶
Table Identifier: balloon_pops.leaderboard
Schema¶
Field ID | Field Name | Data Type | Required | Description |
---|---|---|---|---|
1 | player | STRING | Yes | Player identifier |
2 | total_score | LONG | Yes | Player's cumulative score |
3 | bonus_hits | LONG | Yes | Number of bonus hits achieved |
4 | event_ts | TIMESTAMPTZ | Yes | Timestamp when the event occurred |
Partitioning¶
The table is partitioned by:
- Day (from
event_ts
) player
This partitioning strategy optimizes queries filtering by date or specific players.
Sort Order¶
Data is sorted by:
total_score
(DESC)bonus_hits
(DESC)
This sorting enables efficient retrieval of top-scoring players.
Optimized For¶
- Querying leaderboard data by day
- Filtering operations by player
- Fast retrieval of top-scoring players
- Efficient upsert operations with merge-on-read configuration
Balloon Color Stats¶
Table Identifier: balloon_pops.balloon_color_stats
Schema¶
Field ID | Field Name | Data Type | Required | Description |
---|---|---|---|---|
1 | player | STRING | Yes | Player identifier |
2 | balloon_color | STRING | Yes | Color of the balloon |
3 | points_by_color | LONG | Yes | Points earned for specific balloon color |
4 | bonus_hits | LONG | Yes | Number of bonus hits achieved |
5 | event_ts | TIMESTAMPTZ | Yes | Timestamp when the event occurred |
Partitioning¶
The table uses a multi-level partitioning strategy:
- Day (from
event_ts
) balloon_color
player
This allows efficient queries when filtering by any combination of these dimensions.
Sort Order¶
Data is sorted by:
points_by_color
(DESC)bonus_hits
(DESC)
This sorting prioritizes high-scoring color statistics.
Optimized For¶
- Querying balloon color statistics by day
- Filtering by balloon color and/or player
- Fast retrieval of top-scoring color statistics
- Analysis of performance by color across players
- Efficient upsert operations
Time Series¶
Realtime Scores¶
Table Identifier: balloon_pops.realtime_scores
Schema¶
Field ID | Field Name | Data Type | Required | Description |
---|---|---|---|---|
1 | player | STRING | Yes | Player identifier |
2 | total_score | LONG | Yes | Player's cumulative score |
3 | window_start | TIMESTAMPTZ | Yes | Start timestamp of the scoring window |
4 | window_end | TIMESTAMPTZ | Yes | End timestamp of the scoring window |
Partitioning¶
The table is partitioned by:
- Hour (from
window_start
)
This enables efficient time-based queries and analysis.
Write Properties¶
Property | Value | Description |
---|---|---|
format-version | 2 | Enables advanced Iceberg features |
Optimized For¶
- Time-series analysis of player scores within defined windows
- Hourly aggregation and querying
- Real-time tracking of player performance
- Temporal filtering based on scoring window start time
Balloon Colored Pops¶
Table Identifier: balloon_pops.balloon_colored_pops
Schema¶
Field ID | Field Name | Data Type | Required | Description |
---|---|---|---|---|
1 | player | STRING | Yes | Player identifier |
2 | balloon_color | STRING | Yes | Color of the balloon |
3 | balloon_pops | LONG | Yes | Number of balloons popped |
4 | points_by_color | LONG | Yes | Points earned for specific balloon color |
5 | bonus_hits | LONG | Yes | Number of bonus hits achieved |
6 | window_start | TIMESTAMPTZ | Yes | Start timestamp of the scoring window |
7 | window_end | TIMESTAMPTZ | Yes | End timestamp of the scoring window |
Partitioning¶
The table is partitioned by:
- Hour (from
window_start
) player
This dual partitioning strategy optimizes for both time-based and player-specific queries.
Sort Order¶
Data is sorted by:
balloon_color
(ASC)
This sorting enhances performance for color-based filtering.
Write Properties¶
Property | Value | Description |
---|---|---|
format-version | 2 | Enables advanced Iceberg features |
Optimized For¶
- Tracking balloon pop statistics by color for each player
- Time-series analysis within defined windows
- Querying performance by specific time periods
- Filtering by player identifier
- Fast retrieval when filtering by balloon color
Color Performance Trends¶
Table Identifier: balloon_pops.color_performance_trends
Schema¶
Field ID | Field Name | Data Type | Required | Description |
---|---|---|---|---|
1 | balloon_color | STRING | Yes | Color of the balloon |
2 | avg_score_per_pop | DECIMAL(10,28) | Yes | Average score earned per balloon pop |
3 | total_pops | LONG | Yes | Total number of balloon pops |
4 | window_start | TIMESTAMPTZ | Yes | Start timestamp of the analysis window |
5 | window_end | TIMESTAMPTZ | Yes | End timestamp of the analysis window |
Partitioning¶
The table is partitioned by:
- Hour (from
window_start
) balloon_color
This enables efficient queries for analyzing color performance over time.
Sort Order¶
Data is sorted by:
balloon_color
(ASC)
This optimizes lookups for specific colors.
Write Properties¶
Property | Value | Description |
---|---|---|
format-version | 2 | Enables advanced Iceberg features |
Optimized For¶
- Analyzing performance trends by balloon color over time
- Time-series analysis with hourly granularity
- Comparing efficiency metrics across different balloon colors
- Tracking popularity and eLeaderboard
Create Tables¶
Open the notebook $PROJECT_HOME/notebooks/workbook.ipynb
and execute the following cells
- Imports
- Retrieve Principal Credentials
- Define Variables
- Working with Catalog
- Create Database(Namespace)
- Create Tables
- Leaderboard
- Balloon Color Stats
- Realtime Scores
- Balloon Colored Pops
- Color Performance Trends
Key Takeaways¶
When designing Iceberg tables for streaming analytics, remember these principles:
- Match partitioning to query patterns: Partition by time for time-series analysis, by entity (player/color) for entity-specific queries
- Set appropriate sort orders: Sort by metrics you'll often use for filtering or ranking
- Enable merge-on-read for frequently updated tables to improve write performance
- Consider multiple analytical dimensions: Design separate tables optimized for different query patterns
These principles apply not just to game analytics, but to any streaming analytics application where you need to balance write throughput with query performance.
In the next chapter, we'll explore how to implement this data model using sources and sinks in our streaming pipeline.