FIN7880 Take Home Exercise 1
Building a Lakehouse
Version: 0.3
Take Home Exercise 1: Building a Lakehouse
Prerequisite:
• Get a GitHub account, as we will do the project on GitHub Codespaces.
Preparation of the GitHub environment for this project.
Before starting your hands-on exercise, ensure you adjust the idle timeout setting and configure the
machine size appropriately.
• Adjust the idle timeout setting in GitHub so the codespace does not stop while you are away for too
long (the default is only 30 minutes, which is definitely not enough).
• Also choose a larger machine type (e.g., 16-core) to handle the project.
• To launch a Codespace: Go to the GitHub repo and click Code > Codespaces.
• Use this template to create the GitHub Codespaces project. This provides a virtual machine with
preinstalled tools for the project.
https://github.com/hkbu-samsontai/complete-guide-to-data-lakes-and-lakehouses-3865060
About Idle time-out: A codespace will stop running after a period of inactivity. By default this period is 30
minutes, but you can specify a longer or shorter default timeout period in your personal settings on
GitHub.
After clicking “Create codespace”, you will see something like this after a while.
Project Organization:
You will find the following folders in this project. The project is organized to handle data from ingestion
through transformation to orchestration, making it easier to manage and analyze data efficiently in a
lakehouse environment.
• Docker Compose YAML file: Defines configurations to launch components like Dremio, MinIO,
Nessie, Superset, and JupyterLab using Docker.
• Data Folder: Contains source data files for initial data ingestion.
• Ingestion Folder: Holds PySpark scripts for loading data into the bronze layer.
• Transformation Folder: Includes DBT projects for managing data transformations in the silver
and gold layers.
• Orchestration Folder: Contains Dagster project files to manage the data pipeline.
• Notebooks Folder: Stores Jupyter Notebooks for data analysis.
• RAG Folder: Holds scripts for the generative AI application.
Key steps in setting up Lakehouse infrastructure
• Starting Docker Containers from the terminal (bottom of the UI)
o Enter this command from the terminal: docker-compose up -d
§ This command pulls all necessary docker images and creates containers for the
data lakehouse components.
• After starting up all the docker containers, you can access the services provided in the container such
as MinIO, Nessie and Dremio.
• Set up MinIO.
o To access MinIO, use port 9000.
§ Check the Ports tab in your terminal to find the URL for MinIO (usually on Port
9000).
o Login: Log in to MinIO with username “minio” and password “minioadmin”.
o Creating Access Keys in MinIO: Create an access key for authentication by clicking on
"Access Keys" and then "Create Access Key”.
o Copy and paste the access key details into your environment file (.env.example).
o Setting Up the Lakehouse Bucket:
In MinIO, create a bucket named “lakehouse” to store your data.
• Steps: Go to Buckets > Create Bucket.
• Bucket Name: Use lakehouse as the name (important for code compatibility). An optional
programmatic alternative is sketched below.
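If you prefer to script this step instead of using the MinIO console, the bucket can also be created
programmatically. Below is a minimal sketch using boto3 against the MinIO S3 API; the endpoint URL and
credential placeholders are assumptions and should match the values in your .env file.

import boto3

# Assumed values: the MinIO S3 endpoint on port 9000 and the access key you created above.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)
s3.create_bucket(Bucket="lakehouse")  # the name must be "lakehouse" for code compatibility
print([b["Name"] for b in s3.list_buckets()["Buckets"]])  # confirm the bucket exists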
After setting up MinIO, you have to set up Dremio.
• Setting up Dremio
• Access Dremio via the URL for Port 9047.
• Create an admin account at the login page and store the details safely.
• Adding a Source: Add a new source in Dremio pointing to the MinIO bucket using the Nessie
Catalog. Ensure the root path is “lakehouse” and enter your AWS credentials.
• Detail Steps:
o Go to Sources > Add New Source.
o Select Nessie Catalog and configure it to point to your MinIO bucket.
o Use the same AWS credentials and endpoint details from the environment file.
• Rename .env.example (in the project folder on the left of the UI) to .env to finalize the
setup.
• With everything set up, you can now load your raw data into the lakehouse.
Here is a summary of what you have done.
• Docker: Starts and runs the containers for your data lakehouse components.
• MinIO: Manages your data storage.
• Dremio: Acts as your query engine.
You are now ready to ingest the raw data into the Iceberg table.
Raw Data Ingestion:
In this section, we will learn how to use the provided scripts to load the data into the data lake. You can
skip the key concepts if you would just like to get the loading done first.
Key Concepts:
• Configuration Setup:
• config_loader script: This script sets up your environment by loading configuration
settings from an .env file. It pulls in important environment variables
like aws_s3_endpoint, secret_key_id, secret_access_key, and lakehouse_s3_path.
• DataIngestor class: This class is responsible for the core data ingestion process. It
initializes a Spark session and creates or updates Iceberg tables based on the data
schema. It ensures data is correctly formatted and logs any errors during the process.
• Ingesting Data:
• Ingest file to bronze: This method reads data from CSV or JSON files and writes it to the
Iceberg table. It verifies the table exists before loading the data and logs the success or
errors.
• DocumentDataIngestor class: This class handles the ingestion of document data into the
MinIO bucket without creating Iceberg tables. It uploads files directly to the specified S3
or MinIO bucket.
• Managing File Paths:
• FilePathManager class: This utility manages file paths in the project, generating paths
dynamically based on the base directories provided.
• Spark Session Initialization:
• spark_utils script: Initializes a Spark session with configurations tailored for Iceberg and
Nessie integrations (a sketch of this setup appears at the end of this Key Concepts list).
• Running Ingestion Scripts:
o ecoride_ingest.py script: This script sets up the environment, initializes the DataIngestor
and FilePathManager, and ingests various datasets into the bronze layer. It processes
and loads files into the appropriate Iceberg tables.
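To make the key concepts above concrete, here is a minimal sketch of how the pieces fit together: loading
the .env settings, initializing a Spark session for Iceberg and Nessie, and writing one file into a bronze
table. This is not the project's actual config_loader/spark_utils/DataIngestor code; the variable names,
catalog settings, file path, and table name are illustrative and may differ from the scripts in the
ingestion folder.

import os
from dotenv import load_dotenv
from pyspark.sql import SparkSession

# Load the variables defined in the .env file (renamed from .env.example).
load_dotenv()
aws_s3_endpoint = os.getenv("AWS_S3_ENDPOINT")   # illustrative variable names;
access_key = os.getenv("AWS_ACCESS_KEY_ID")      # match them to your own .env file
secret_key = os.getenv("AWS_SECRET_ACCESS_KEY")
warehouse = os.getenv("LAKEHOUSE_S3_PATH")       # e.g. s3a://lakehouse/

# Build a Spark session wired to the Nessie catalog and the Iceberg table format.
# (The real spark_utils script also handles the required jars and S3 credentials.)
spark = (
    SparkSession.builder.appName("bronze_ingest")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.warehouse", warehouse)
    .config("spark.sql.catalog.nessie.s3.endpoint", aws_s3_endpoint)
    .getOrCreate()
)

# Read one CSV file and create/replace a bronze Iceberg table (path and table name are illustrative).
df = spark.read.option("header", True).csv("data/ecoride/customers.csv")
df.writeTo("nessie.bronze.customers").createOrReplace()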
Ingesting the Data
All the ingestion scripts are located in the ingestion folder.
Steps to Execute the ingestion scripts:
Before running the ingestion scripts, make sure the containers are up; you can check the Ports tab.
If not, enter “docker-compose up -d” to restart all the containers.
• Run the ingestion scripts in the bronze folder.
• Use the commands below to execute the ingestion scripts and load the raw datasets into the
bronze layer.
• python -m ingestion.bronze.ecoride_ingest
• python -m ingestion.bronze.chargenet_ingest
• python -m ingestion.bronze.vehicle_health_ingest
• python -m ingestion.bronze.documents_ingest
You can verify if the ingestion has been completed properly by following the steps below:
• Verify Ingestion from MinIO
• Check the MinIO Console and Nessie UI to ensure data is ingested correctly and is part of the
catalog.
MinIO
• Verify from Nessie Catalog:
Nessie(Port 19120)
By now the data have been loaded properly into the lakehouse. You can query the data using Dremio.
Querying the Data
By following the above steps, you can transform raw data files (CSV, JSON) into a queryable format
(Iceberg tables) in your data lakehouse, enabling you to run SQL queries on the data.
• Use Dremio UI to query the ingested data.
Transformation Model Overview
Before releasing the raw data for use by data consumers, we need to transform it. DBT handles the
transformations in our Lakehouse.
DBT, or Data Build Tool, is an open-source command-line tool that empowers data analysts and engineers
to transform and model data within their lakehouse environment. It allows users to write SQL queries to
convert raw data into a more structured format, which is crucial in a lakehouse setting where data is stored
in its raw form and requires processing for analytical purposes.
All the transformation scripts have been prepared for you in the "transformation" folder of the project.
You can skip the key concepts for now and revisit them after completing the lab exercise titled "Exercise:
Create a Sales Insights Gold Model in DBT and Execute Models in Silver and Gold Layers."
Key Concepts:
• Transformation Layers: In this project, we have defined two layers in the Lakehouse - the Silver layer
and the Gold layer.
• Silver Layer: Focuses on transforming raw data formats and filtering out unnecessary columns
to enhance data quality and privacy. For example, last names and other sensitive details are
not propagated to this layer.
• Gold Layer: Contains more complex models that perform aggregations and joins across
different datasets from the silver layer to create refined datasets.
• Configuration Files:
• profiles.yml: Configures connection settings, such as host details and user credentials.
• dbt_project.yml: Defines project settings, including versioning, model configurations, and how
models are materialized (as tables in Silver and as views in Gold).
• DBT Projects: DBT (Data Build Tool) is used to manage these transformations. Each layer (Silver
and Gold) has its own DBT project with configuration files that define how data should be
processed and stored. The DBT projects are in the “transformation” folder.
• Models Folder: There are two separate model folders, Silver and Gold. The Silver layer focuses on
transforming raw data and improving data quality by filtering out unnecessary columns. The Gold
layer further refines this data, creating more complex datasets.
• Contains folders for different business units. Each folder includes SQL scripts for
transforming data specific to that unit.
• Example: The customers.sql script in the Ecoride folder transforms raw customer data,
enhancing data quality and privacy.
• Source and Schema Files:
• sources.yml: Defines the source data, including database schema and tables (where the
source data comes from).
• schema.yml: Adds details about models, such as descriptions and tests, which are important
for documentation and model testing.
• Materialization:
• In the silver layer, models are materialized as tables.
• In the gold layer, models are materialized as views, which are more complex and include
aggregations and joins.
• Process Overview:
o Silver Layer: Transforms raw data, filters out unnecessary columns, and enhances data
quality.
o Gold Layer: Aggregates and joins data from the silver layer to create refined datasets.
This layered approach helps manage data privacy and quality, ensuring that only necessary information is
accessible to different business units.
customer_segmentation.sql from the models folder in the Gold layer:
Transformation Exercise: Create a Sales Insights Gold Model and execute the silver/gold
transformation models using dbt
In this exercise, we will run all the SQL files that have already been defined for you in the
transformation/silver and transformation/gold folders. You will use the DBT command to execute these
files. If you're interested in learning how to create the transformation file, please refer to the description
in step 1. By the end of step 1, you will have created a SQL file named "sales_insights.sql."
To save you time, I've already created the sales_insights.sql file for you. You can skip step 1 and go directly
to step 2.
Step 1: Create a new transformation SQL file (if you want to learn how to create one; otherwise, you
can skip to step 2)
The goal of this exercise is to develop an SQL file that joins the sales data with the vehicles and customers
tables to enrich the sales data with customer names and vehicle model names.
The model should use tables from the silver layer as its source.
Instructions: Follow these steps to create a unified gold model that provides enriched sales insights.
• Create/Review a new model
o Create a new model by making a new SQL file named sales_insights.sql in the ecoride folder
of the gold dbt project. <= Note: I have created this file for you, as some students might not
be familiar with SQL.
o However, I still suggest you take a look at the sales_insights.sql file to understand its syntax.
• If you would like to build this SQL file from scratch, follow the steps below:
1. Create a SQL File: Start by creating a new file named sales_insights.sql inside the ecoride folder.
2. Set a Macro: Define a macro to handle the nessie_branch. This sets the branch to main.
3. Write the Query: Begin with a basic SELECT statement to choose the columns you need from
the sales table. For example:
SELECT sales_id, sale_date, sale_price, payment_method FROM silver.sales;
4. Join Tables: Add joins to include columns from the customer and vehicle tables. This helps in
combining relevant data from different tables:
JOIN silver.customers ON sales.customer_id = customers.customer_id
JOIN silver.vehicles ON sales.vehicle_id = vehicles.vehicle_id;
5. Add Columns: Include additional columns as needed for your analysis.
By following these steps, you can create a structured and refined data model using SQL, which
can then be used for further analysis and insights.
6. Execute and Validate: After creating the model, execute your DBT (Data Build Tool) project to
validate and materialize the models. <= I will show you how to do it in the next section.
In step 2, you will learn how to do the data transformation based on the models in the silver and gold layer
subfolders in the transformation folder.
Step 2: Executing Data Transformation
The models located in the gold/silver subfolders of the transformation folder will be utilized by the DBT
tool. You can run the DBT command directly from the terminal.
However, before executing the DBT command, you need to set up two environment variables. These
variables are essential for the DBT (Data Build Tool) to connect to Dremio and execute the SQL
commands.
• Setting up the Environment Variables:
o Set up the two environment variables by entering the following commands directly from the
terminal.
export DREMIO_USER=your_dremio_userid
export DREMIO_PASSWORD=your_dremio_password
• Running dbt Commands:
o You can use the “dbt debug” command to verify that dbt can connect to Dremio and that the
setup is correct.
dbt debug
o Executing the silver model
• Silver models represent intermediate transformations, whereas gold models are the
final, refined datasets.
• Ensure that you have navigated to the silver model folder before executing the “dbt
run” command.
• Use the command: `cd transformation/silver`
• Then, run `dbt run` to execute your silver models.
• Verifying Results
o You should see the name “silver” show up in the Dremio catalog. You can also click into it and
find the new datasets.
From Dremio:
You can also check the existence of the silver layer in Nessie.
From Nessie:
• Execute the gold model.
• Now move to the gold folder and run the dbt command again for your gold models. If there
are errors, debug the SQL queries and rerun the command.
• cd ../gold/
• dbt run
• Verifying Results for Silver and Gold
o Check the different platforms (MinIO, Nessie, and Dremio) to ensure the models have been
created and materialized correctly
o MinIO: Check the lakehouse bucket for the silver layer.
o Nessie: Verify the new silver folder and models.
o Dremio: Ensure the silver models are materialized as tables and the gold layer is available as
a virtual dataset.
By following the above steps, you have executed data transformations and ensured that your data
lakehouse has been populated with raw and transformed data.
Data Orchestration Overview
Data orchestration involves the automated management and scheduling of data workflows. In a production
environment, once the lakehouse is operational, we would not want to execute these models manually.
You can skip the key concepts for now and revisit them after completing the lab exercise titled "Data
Orchestration Execution Steps."
Key Concepts:
Dagster is an orchestrator used to automate the execution of data models. Instead of running models
manually, Dagster schedules and manages these tasks.
• Code Components:
o assets.py: Defines silver and gold DBT assets, mapping them to DBT models (a sketch of this
pattern appears below).
o constants.py: Locates directories for DBT projects and initializes DBT CLI resource objects. It
also sets up environment variables to generate DBT manifests, which are metadata
descriptions for the projects.
o definitions.py: Configures the Dagster project, including DBT assets, schedules, and
resources.
o schedules.py: Placeholder for defining any scheduling you want for your project.
Dagster helps automate and manage the execution of data models, ensuring that dependencies are
respected and tasks are scheduled efficiently.
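For orientation, the sketch below shows the standard dagster-dbt pattern that assets.py, constants.py,
and definitions.py follow. It is a minimal illustration only; the paths, asset names, and resources used in
the actual orchestration folder may differ.

from pathlib import Path
from dagster import Definitions
from dagster_dbt import DbtCliResource, dbt_assets

# Illustrative path to the silver dbt project (constants.py resolves the real locations).
SILVER_PROJECT_DIR = Path("transformation/silver")

# dbt produces a manifest.json describing the project; Dagster maps its models to assets.
@dbt_assets(manifest=SILVER_PROJECT_DIR / "target" / "manifest.json")
def silver_models(context, dbt: DbtCliResource):
    # Invoke `dbt run` for the silver project and stream events back to Dagster.
    yield from dbt.cli(["run"], context=context).stream()

# definitions.py wires the assets and resources into one Dagster project.
defs = Definitions(
    assets=[silver_models],
    resources={"dbt": DbtCliResource(project_dir=str(SILVER_PROJECT_DIR))},
)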
All the Dagster artifacts (.py files) have been created for you in the orchestration folder.
Orchestration folder:
Data Orchestration Execution Steps
This lab exercise guides you through the setup and execution process, emphasizing the significance of
environment variables, DBT assets, and the DAG structure.
• Switch to the orchestration directory,
o cd ../../orchestration
• Set up environment variables to generate manifest files.
o export DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=1
• Launch Dagster using the command dagster dev.
o dagster dev
• Open the Dagster UI in your browser to view the asset lineage, also known as a directed acyclic graph
(DAG). This shows the dependencies between bronze, silver, and gold models.
o Open the Dagster UI at port 3000 to see the flow.
• Click on "Materialize all" to start materializing the models. Dagster ensures that silver models are
executed first, followed by gold models, respecting their dependencies.
• Verify that all models are materialized correctly.
Take Home Exercise 2: BI, Advanced Analytics and ML in Lakehouse
In this exercise, you will execute queries, create virtual datasets (SQL View), and develop an interactive
dashboard using Dremio and Apache Superset. At the end of this exercise, you will learn how to:
• Execute Queries and Create Virtual Datasets:
o You'll use Dremio to run SQL queries and create complex virtual datasets.
• Connect Dremio to Apache Superset:
o You'll link Dremio with Apache Superset to create interactive dashboards.
• Develop a Marketing Dashboard:
o You'll build a marketing dashboard to visualize data insights.
• Use Jupyter Notebooks:
o You'll connect Dremio to Jupyter Notebook for advanced analytics, including tasks like
product reviews analytics and vehicle health analytics (a connection sketch follows this
overview).
These activities will help you gain practical experience in using business intelligence tools and performing
advanced data analytics.
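As a preview of the Jupyter Notebook part, one common way to pull Dremio query results into a notebook
is over Arrow Flight with pyarrow. The snippet below is a hedged sketch only: the endpoint, credentials,
and the queried dataset are assumptions, and the notebooks provided in the project may use a different
client.

from pyarrow import flight

# Assumed endpoint: the same Dremio Flight port (32010) used in the Superset SQLAlchemy URL later on.
client = flight.FlightClient("grpc+tcp://dremio:32010")
token = client.authenticate_basic_token("your_dremio_userid", "your_dremio_password")
options = flight.FlightCallOptions(headers=[token])

# Illustrative query against a virtual dataset you will create later in this exercise.
query = 'SELECT * FROM lakehouse.marketing."customers_vip" LIMIT 10'
info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)
reader = client.do_get(info.endpoints[0].ticket, options)
df = reader.read_all().to_pandas()  # Arrow table -> pandas DataFrame
print(df.head())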
Key Concepts:
Terminologies used in this exercise:
• Sources: These are connections to storage systems like S3 or databases, containing datasets
known as physical datasets.
• Physical Datasets: These immutable datasets, marked by a purple-blue icon, are the original data
stored in sources like S3 or databases. Due to their sensitive nature, access is restricted to a
select group of data engineers or administrators. You can view details such as the source, last
update, columns, and data types, and perform actions like adding columns, grouping, joining,
and changing data types directly from the dataset overview screen. They can be downloaded in
formats such as JSON, CSV, or Parquet.
• Virtual Datasets: These are built on top of physical datasets and function like enhanced table
views, allowing for flexible data manipulation and business logic development. Data analysts can
create virtual datasets by executing queries and saving the results. They can be stored in either a
private home space or a public shared space. Adding descriptions and labels, such as "customers
with more than three purchases" or tags like "marketing" and "segmentation," provides context
and improves searchability.
• Spaces: These are locations that organize virtual datasets shared with other users. They can be
grouped by project, business unit, or region and have configurable sharing and access privileges.
• Home Space: A private area for your physical and virtual datasets where you can curate data
until it's ready to be shared.
Create a simple virtual dataset in Dremio
Follow the steps below to create a new virtual dataset named “customers_vip” by adding a new
column to the existing table “customer_segmentation”.
Steps:
• Open the Dremio UI
• Select the physical dataset “customer_segmentation”
(catalog.gold.customer_segmentation).
• Execute a simple SQL select statement to view the dataset.
• Add a New Column with the column name “is_vip”
• Click Add Column.
• Name this new field “is_vip”.
• This is a new calculated column to identify VIP customers.
• VIP customers are defined as those with more than three purchases.
• Use the “total_purchases” field to create this new “is_vip” column, setting it to true for
customers with more than three purchases.
• Apply and Save this to a virtual dataset (i.e. view): The new column is applied and
saved as a view named customers_vip.
• Organize the Dataset: Save this new virtual dataset to a new folder named "marketing"
under lakehouse for easy access and organization.
Creating a Complex Virtual Dataset using Dremio
Now that you know how to create a simple virtual dataset, let’s create a more complex virtual dataset
using SQL to make the data more useful for analysis. We will split a column (preferred_models) in the
virtual dataset “customers_vip” into multiple columns that indicate whether a customer bought a specific
model.
Follow the steps below to create this new virtual dataset with the name “customers_vip_split_models”.
Again, we will use Dremio UI to do it.
Steps to Create the complex Virtual Dataset:
• Query the virtual dataset “customers_vip” that you just created in the previous step.
• Write a SQL query to split the “preferred_models” column into five new columns, each
showing true (1) or false (0) if a customer bought a specific model. Below is the SQL query for
your reference.
SELECT
  customer_id,
  is_vip,
  first_name,
  email,
  city,
  state,
  country,
  total_purchases,
  average_purchase_value,
  CASE WHEN POSITION('PowerRide' IN preferred_models) > 0 THEN 1 ELSE 0 END AS powerride,
  CASE WHEN POSITION('UrbanGlide' IN preferred_models) > 0 THEN 1 ELSE 0 END AS urbanglide,
  CASE WHEN POSITION('EcoSprint' IN preferred_models) > 0 THEN 1 ELSE 0 END AS ecosprint,
  CASE WHEN POSITION('AeroFlow' IN preferred_models) > 0 THEN 1 ELSE 0 END AS aeroflow,
  CASE WHEN POSITION('TerraCross' IN preferred_models) > 0 THEN 1 ELSE 0 END AS terracross
FROM
  lakehouse.marketing."customers_vip";
• Save this query as a view in the marketing folder with the name customers_vip_split_models.
Congratulations! Now, let’s explore some of the features in Dremio.
• Jobs Tab:
o You can check the jobs tab to see details about the query execution, such as who ran it
and the query syntax.
• SQL Tab:
o You can view a graph of all datasets involved in the query, both physical and virtual.
In the next exercise, we will learn how to use Apache Superset to create a dashboard for the lakehouse.
Apache Superset
Apache Superset is a modern data exploration and data visualization platform. Here are its key uses:
• No-Code Interface: Allows you to build charts quickly without needing to write code.
• SQL Editor: Provides a powerful web-based SQL editor for advanced querying.
• Custom Dimensions and Metrics: Offers a lightweight semantic layer for defining custom
dimensions and metrics.
• Wide Database Support: Supports nearly any SQL database or data engine.
• Beautiful Visualizations: Includes a wide array of visualizations to showcase your data
effectively.
In this exercise, you will learn how to connect Dremio to Apache Superset. Run SQL queries to retrieve
data and use the retrieved data to build an interactive marketing dashboard in Superset. At the end of the
exercise, you will have an interactive marketing dashboard that visualizes data insights, demonstrating
your ability to use BI tools for data-driven decision-making.
Key steps for this exercise
• Connect Superset to Dremio
• Open your browser and go to the URL running on port 8088 (Superset).
• Log in to Superset using the default credentials: username:admin and password:admin.
• Connect Superset to Dremio:
o After logging in to Superset, go to Settings and select Database Connections.
o Add a new database and select Dremio from the supported databases.
o Enter the SQLAlchemy URL, replacing the username and password with your Dremio
credentials. (i.e. your username/password to access Dremio)
dremio+flight://{username}:{password}@dremio:32010/?UseEncryption=false
Using SQLAlchemy: SQLAlchemy is a toolkit that helps manage database connections in a
Pythonic way.
• Add a Dataset at Superset
• Go to Datasets and select Create New Dataset.
• Choose the Dremio database and select the schema (e.g., lakehouse.marketing).
• Select the table (e.g., customers_vip) and click Create Dataset and Create Chart.
• Creating a marketing dashboard
The following steps describe how to create and customize visualizations and dashboards using
Apache Superset. Try to do your own exploration on data visualization, dashboard building and
customizing the final layouts for your dashboard.
• Visualize Data: Learn to create different types of charts (e.g., pie charts, bar charts) to represent
customer data.
• Build Dashboards: Combine multiple charts into a cohesive dashboard to present marketing
insights.
• Customize Layouts: Adjust the layout and design of the dashboard for better readability and
presentation.
Utilize the tables `customers_vip` and `customers_vip_split_models`, which you previously created and
stored in the `lakehouse.marketing` schema, to generate various charts for the marketing dashboard.
• Create a Pie Chart:
o Use the is_vip column in the customers_vip dataset to show the breakdown of VIP vs.
non-VIP customers.
o Set the metric as COUNT and customize the labels to show percentages.
• Create a Bar Chart:
o Use the same dataset to show VIP customers by state.
o Use the state column for the X-Axis and filter by is_vip to show only VIP customers.
• Create a Stacked Bar Chart:
o Use the customers_vip_split_models dataset (make sure that you have added it in
Superset; refer to “Add a Dataset at Superset”).
o Show customer model preferences by state.
o Use the state column for the X-Axis and sum the new columns created for each model
preference.
o Enable Stacking: On the "Customize" tab, find "Stacked Style" and select "Stacked" to
enable the Stacked Histogram pattern.
• Customize the Dashboard:
o Arrange the charts for better presentation.
o Add headers and adjust the layout to make the dashboard more informative.
You should have something like this in the dashboard after completing the above steps.
Capture this screen and make sure to include the URL in the screenshot.
*** You may want to add additional charts to get your bonus points.***
END