FIN7880 Take Home Exercise 1
Building a Lakehouse
Version: 0.3
Take Home Exercise 1: Building a Lakehouse
Prerequisite:
• Get a GitHub account, as we will do the project on GitHub Codespaces.
Preparation of the GitHub environment for this project.
Before starting your hands-on exercise, ensure you adjust the idle timeout setting and configure the
machine size appropriately.
• Adjust the idle timeout setting in GitHub so the codespace does not stop while you are away for too
long (the default is only 30 minutes, which is definitely not enough).
• Also choose a larger machine type (e.g., 16-core) to handle the project.
• To launch a Codespace: Go to the GitHub repo and click Code > Codespaces.
• Use this template to create the GitHub Codespaces project. This provides a virtual machine with
preinstalled tools for the project.
https://github.com/hkbu-samsontai/complete-guide-to-data-lakes-and-lakehouses-3865060
About Idle time-out: A codespace will stop running after a period of inactivity. By default this period is 30
minutes, but you can specify a longer or shorter default timeout period in your personal settings on
GitHub.
After clicking “Create codespace”, you will see something like this after a while.
Project Organization:
You will find the following folders in this project. The project is organized to handle data from ingestion
through transformation to orchestration, making it easier to manage and analyze data efficiently in a
lakehouse environment.
• Docker Compose YAML file: Defines configurations to launch components like Dremio, MinIO,
Nessie, Superset, and JupyterLab using Docker.
• Data Folder: Contains source data files for initial data ingestion.
• Ingestion Folder: Holds PySpark scripts for loading data into the bronze layer.
• Transformation Folder: Includes DBT projects for managing data transformations in the silver
and gold layers.
• Orchestration Folder: Contains Dagster project files to manage the data pipeline.
• Notebooks Folder: Stores Jupyter Notebooks for data analysis.
• RAG Folder: Holds scripts for the generative AI application.
Key steps in setting up Lakehouse infrastructure
• Starting Docker Containers from the terminal (bottom of the UI)
o Enter this command from the terminal: docker-compose up -d
§ This command pulls all necessary docker images and creates containers for the
data lakehouse components.
• After starting up all the docker containers, you can access the services provided in the container such
as MinIO, Nessie and Dremio.
• Set up MinIO.
o To access MinIO, use port 9000.
§ Check the Ports tab in your terminal to find the URL for MinIO (usually on Port
9000).
o Login: Log in to MinIO with username “minio” and password “minioadmin”.
o Creating Access Keys in MinIO: Create an access key for authentication by clicking on
"Access Keys" and then "Create Access Key”.
o Copy and paste the access key details into your environment file (.env.example).
o Setting Up the Lakehouse Bucket:
In MinIO, create a bucket named “lakehouse” to store your data.
• Steps: Go to Buckets > Create Bucket.
• Bucket Name: Use lakehouse as the name (important for code compatibility). An optional
programmatic alternative is sketched below.
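If you prefer to script this step instead of using the MinIO console, the bucket can also be created
programmatically. Below is a minimal sketch using boto3 against the MinIO S3 API; the endpoint URL and
credential placeholders are assumptions and should match the values in your .env file.

import boto3

# Assumed values: the MinIO S3 endpoint on port 9000 and the access key you created above.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)
s3.create_bucket(Bucket="lakehouse")  # the name must be "lakehouse" for code compatibility
print([b["Name"] for b in s3.list_buckets()["Buckets"]])  # confirm the bucket exists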
After setting up MinIO, you have to set up Dremio.
• Setting up Dremio
• Access Dremio via the URL for Port 9047.
• Create an admin account at the login page and store the details safely.
• Adding a Source: Add a new source in Dremio pointing to the MinIO bucket using the Nessie
Catalog. Ensure the root path is “lakehouse” and enter your AWS credentials.
• Detail Steps:
o Go to Sources > Add New Source.
o Select Nessie Catalog and configure it to point to your MinIO bucket.
o Use the same AWS credentials and endpoint details from the environment file.
• Rename .env.example (in the project folder on the left of the UI) to .env to finalize the
setup.
• With everything set up, you can now load your raw data into the lakehouse.
Here is a summary of what you have done.
• Docker: Starts and runs the containers for your data lakehouse components.
• MinIO: Manages your data storage.
• Dremio: Acts as your query engine.
You are now ready to ingest the raw data into the Iceberg table.
Raw Data Ingestion:
In this section, we will learn how to use the provided scripts to load the data into the data lake. You can
skip the key concepts if you would just like to get the loading done first.
Key Concepts:
• Configuration Setup:
• config_loader script: This script sets up your environment by loading configuration
settings from an .env file. It pulls in important environment variables
like aws_s3_endpoint, secret_key_id, secret_access_key, and lakehouse_s3_path.
• DataIngestor class: This class is responsible for the core data ingestion process. It
initializes a Spark session and creates or updates Iceberg tables based on the data
schema. It ensures data is correctly formatted and logs any errors during the process.
• Ingesting Data:
• Ingest file to bronze: This method reads data from CSV or JSON files and writes it to the
Iceberg table. It verifies the table exists before loading the data and logs the success or
errors.
• DocumentDataIngestor class: This class handles the ingestion of document data into the
MinIO bucket without creating Iceberg tables. It uploads files directly to the specified S3
or MinIO bucket.
• Managing File Paths:
• FilePathManager class: This utility manages file paths in the project, generating paths
dynamically based on the base directories provided.
• Spark Session Initialization:
• spark_utils script: Initializes a Spark session with configurations tailored for Iceberg and
Nessie integrations (a sketch of this setup appears at the end of this Key Concepts list).
• Running Ingestion Scripts:
o ecoride_ingest.py script: This script sets up the environment, initializes the DataIngestor
and FilePathManager, and ingests various datasets into the bronze layer. It processes
and loads files into the appropriate Iceberg tables.
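To make the key concepts above concrete, here is a minimal sketch of how the pieces fit together: loading
the .env settings, initializing a Spark session for Iceberg and Nessie, and writing one file into a bronze
table. This is not the project's actual config_loader/spark_utils/DataIngestor code; the variable names,
catalog settings, file path, and table name are illustrative and may differ from the scripts in the
ingestion folder.

import os
from dotenv import load_dotenv
from pyspark.sql import SparkSession

# Load the variables defined in the .env file (renamed from .env.example).
load_dotenv()
aws_s3_endpoint = os.getenv("AWS_S3_ENDPOINT")   # illustrative variable names;
access_key = os.getenv("AWS_ACCESS_KEY_ID")      # match them to your own .env file
secret_key = os.getenv("AWS_SECRET_ACCESS_KEY")
warehouse = os.getenv("LAKEHOUSE_S3_PATH")       # e.g. s3a://lakehouse/

# Build a Spark session wired to the Nessie catalog and the Iceberg table format.
# (The real spark_utils script also handles the required jars and S3 credentials.)
spark = (
    SparkSession.builder.appName("bronze_ingest")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.warehouse", warehouse)
    .config("spark.sql.catalog.nessie.s3.endpoint", aws_s3_endpoint)
    .getOrCreate()
)

# Read one CSV file and create/replace a bronze Iceberg table (path and table name are illustrative).
df = spark.read.option("header", True).csv("data/ecoride/customers.csv")
df.writeTo("nessie.bronze.customers").createOrReplace()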
Ingesting the Data
All the ingestion scripts are located in the ingestion folder.
Steps to Execute the ingestion scripts:
Before running the ingestion scripts, make sure the containers are up; you can check the Ports tab.
If not, enter “docker-compose up -d” to restart all the containers.
• Run the ingestion scripts in the bronze folder.
• Use the commands below to execute the ingestion scripts and load the raw datasets into the
bronze layer.
• python -m ingestion.bronze.ecoride_ingest
• python -m ingestion.bronze.chargenet_ingest
• python -m ingestion.bronze.vehicle_health_ingest
• python -m ingestion.bronze.documents_ingest
You can verify if the ingestion has been completed properly by following the steps below:
• Verify Ingestion from MinIO
• Check the MinIO Console and Nessie UI to ensure data is ingested correctly and is part of the
catalog.
MinIO
• Verify from Nessie Catalog:
Nessie(Port 19120)
By now the data have been loaded properly into the lakehouse. You can query the data using Dremio.
Querying the Data
By following the above steps, you can transform raw data files (CSV, JSON) into a queryable format
(Iceberg tables) in your data lakehouse, enabling you to run SQL queries on the data.
• Use Dremio UI to query the ingested data.
Transformation Model Overview
Before releasing the raw data for use by data consumers, we need to transform it. DBT handles the
transformations in our Lakehouse.
DBT, or Data Build Tool, is an open-source command-line tool that empowers data analysts and engineers
to transform and model data within their lakehouse environment. It allows users to write SQL queries to
convert raw data into a more structured format, which is crucial in a lakehouse setting where data is stored
in its raw form and requires processing for analytical purposes.
All the transformation scripts have been prepared for you in the "transformation" folder of the project.
You can skip the key concepts for now and revisit them after completing the lab exercise titled "Exercise:
Create a Sales Insights Gold Model in DBT and Execute Models in Silver and Gold Layers."
Key Concepts:
• Transformation Layers: In this project, we have defined two layers in the Lakehouse - the Silver layer
and the Gold layer.
• Silver Layer: Focuses on transforming raw data formats and filtering out unnecessary columns
to enhance data quality and privacy. For example, last names and other sensitive details are
not propagated to this layer.
• Gold Layer: Contains more complex models that perform aggregations and joins across
different datasets from the silver layer to create refined datasets.
• Configuration Files:
• profiles.yml: Configures connection settings, such as host details and user credentials.
• dbt_project.yml: Defines project settings, including versioning, model configurations, and how
models are materialized (as tables in Silver and as views in Gold).
• DBT Projects: DBT (Data Build Tool) is used to manage these transformations. Each layer (Silver
and Gold) has its own DBT project with configuration files that define how data should be
processed and stored. The DBT projects are in the “transformation” folder.
• Models Folder: There are two separate model folders, Silver and Gold. The Silver layer focuses on
transforming raw data and improving data quality by filtering out unnecessary columns. The Gold
layer further refines this data, creating more complex datasets.
• Contains folders for different business units. Each folder includes SQL scripts for
transforming data specific to that unit.
• Example: The customers.sql script in the Ecoride folder transforms raw customer data,
enhancing data quality and privacy.
• Source and Schema Files:
• sources.yml: Defines the source data, including database schema and tables (where the
source data comes from).
• schema.yml: Adds details about models, such as descriptions and tests, which are important
for documentation and model testing.
• Materialization:
• In the silver layer, models are materialized as tables.
• In the gold layer, models are materialized as views, which are more complex and include
aggregations and joins.
• Process Overview:
o Silver Layer: Transforms raw data, filters out unnecessary columns, and enhances data
quality.
o Gold Layer: Aggregates and joins data from the silver layer to create refined datasets.
This layered approach helps manage data privacy and quality, ensuring that only necessary information is
accessible to different business units.
customer_segmentation.sql from the models folder in the Gold layer:
Transformation Exercise: Create a Sales Insights Gold Model and execute the silver/gold
transformation models using dbt
In this exercise, we will run all the SQL files that have already been defined for you in the
transformation/silver and transformation/gold folders. You will use the DBT command to execute these
files. If you're interested in learning how to create the transformation file, please refer to the description
in step 1. By the end of step 1, you will have created a SQL file named "sales_insights.sql."
To save you time, I've already created the sales_insights.sql file for you. You can skip step 1 and go directly
to step 2.
Step 1: Create a new transformation SQL file (if you want to learn how to create one; otherwise, you
can skip to step 2)
The goal of this exercise is to develop an SQL file that joins the sales data with the vehicles and customers
tables to enrich the sales data with customer names and vehicle model names.
The model should use tables from the silver layer as its source.
Instructions: Follow these steps to create a unified gold model that provides enriched sales insights.
• Create/Review a new model
o Create a new model by making a new SQL file named sales_insights.sql in the ecoride folder
of the gold dbt project. <= Note: I have created this file for you, as some students might not
be familiar with SQL.
o However, I still suggest you take a look at the sales_insights.sql file to understand its syntax.
• If you would like to build this SQL file from scratch, follow the steps below:
1. Create a SQL File: Start by creating a new file named sales_insights.sql inside the ecoride folder.
2. Set a Macro: Define a macro to handle the nessie_branch. This sets the branch to main.
3. Write the Query: Begin with a basic SELECT statement to choose the columns you need from
the sales table. For example:
SELECT sales_id, sale_date, sale_price, payment_method FROM silver.sales;
4. Join Tables: Add joins to include columns from the customer and vehicle tables. This helps in
combining relevant data from different tables:
JOIN silver.customers ON sales.customer_id = customers.customer_id
JOIN silver.vehicles ON sales.vehicle_id = vehicles.vehicle_id;
5. Add Columns: Include additional columns as needed for your analysis.
By following these steps, you can create a structured and refined data model using SQL, which
can then be used for further analysis and insights.
6. Execute and Validate: After creating the model, execute your DBT (Data Build Tool) project to
validate and materialize the models. <= I will show you how to do it in the next section.
In step 2, you will learn how to do the data transformation based on the models in the silver and gold layer
subfolders in the transformation folder.
Step 2: Executing Data Transformation
The models located in the gold/silver subfolders of the transformation folder will be utilized by the DBT
tool. You can run the DBT command directly from the terminal.
However, before executing the DBT command, you need to set up two environment variables. These
variables are essential for the DBT (Data Build Tool) to connect to Dremio and execute the SQL
commands.
• Setting up the Environment Variables:
o Set up the two environment variables by entering the following commands directly from the
terminal.
export DREMIO_USER=your_dremio_userid
export DREMIO_PASSWORD=your_dremio_password
• Running dbt Commands:
o You can use the “dbt debug” command to verify that dbt can connect to Dremio and that the
setup is correct.
dbt debug
o Executing the silver model
• Silver models represent intermediate transformations, whereas gold models are the
final, refined datasets.
• Ensure that you have navigated to the silver model folder before executing the “dbt
run” command.
• Use the command: `cd transformation/silver`
• Then, run `dbt run` to execute your silver models.
• Verifying Results
o You should see the name “silver” show up in the Dremio catalog. You can also click into it and
find the new datasets.
From Dremio:
You can also check the existence of the silver layer in Nessie.
From Nessie:
• Execute the gold model.
• Now move to the gold folder and run the dbt command again for your gold models. If there
are errors, debug the SQL queries and rerun the command.
• cd ../gold/
• dbt run
• Verifying Results for Silver and Gold
o Check the different platforms (MinIO, Nessie, and Dremio) to ensure the models have been
created and materialized correctly
o MinIO: Check the lakehouse bucket for the silver layer.
o Nessie: Verify the new silver folder and models.
o Dremio: Ensure the silver models are materialized as tables and the gold layer is available as
a virtual dataset.
By following the above steps, you have executed data transformations and ensured that your data
lakehouse has been populated with raw and transformed data.
Data Orchestration Overview
Data orchestration involves the automated management and scheduling of data workflows. In a production
environment, once the lakehouse is operational, we would not want to execute these models manually.
You can skip the key concepts for now and revisit them after completing the lab exercise titled "Data
Orchestration Execution Steps."
Key Concepts:
Dagster is an orchestrator used to automate the execution of data models. Instead of running models
manually, Dagster schedules and manages these tasks.
• Code Components:
o assets.py: Defines silver and gold DBT assets, mapping them to DBT models (a sketch of this
pattern appears below).
o constants.py: Locates directories for DBT projects and initializes DBT CLI resource objects. It
also sets up environment variables to generate DBT manifests, which are metadata
descriptions for the projects.
o definitions.py: Configures the Dagster project, including DBT assets, schedules, and
resources.
o schedules.py: Placeholder for defining any scheduling you want for your project.
Dagster helps automate and manage the execution of data models, ensuring that dependencies are
respected and tasks are scheduled efficiently.
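For orientation, the sketch below shows the standard dagster-dbt pattern that assets.py, constants.py,
and definitions.py follow. It is a minimal illustration only; the paths, asset names, and resources used in
the actual orchestration folder may differ.

from pathlib import Path
from dagster import Definitions
from dagster_dbt import DbtCliResource, dbt_assets

# Illustrative path to the silver dbt project (constants.py resolves the real locations).
SILVER_PROJECT_DIR = Path("transformation/silver")

# dbt produces a manifest.json describing the project; Dagster maps its models to assets.
@dbt_assets(manifest=SILVER_PROJECT_DIR / "target" / "manifest.json")
def silver_models(context, dbt: DbtCliResource):
    # Invoke `dbt run` for the silver project and stream events back to Dagster.
    yield from dbt.cli(["run"], context=context).stream()

# definitions.py wires the assets and resources into one Dagster project.
defs = Definitions(
    assets=[silver_models],
    resources={"dbt": DbtCliResource(project_dir=str(SILVER_PROJECT_DIR))},
)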
All the Dagster artifacts (.py files) have been created for you in the orchestration folder.
Orchestration folder:
Data Orchestration Execution Steps
This lab exercise guides you through the setup and execution process, emphasizing the significance of
environment variables, DBT assets, and the DAG structure.
• Switch to the orchestration directory,
o cd ../../orchestration
• Set up environment variables to generate manifest files.
o export DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=1
• Launch Dagster using the command dagster dev.
o dagster dev
• Open the Dagster UI in your browser to view the asset lineage, also known as a directed acyclic graph
(DAG). This shows the dependencies between bronze, silver, and gold models.
o Open the Dagster UI at port 3000 to see the flow.
• Click on "Materialize all" to start materializing the models. Dagster ensures that silver models are
executed first, followed by gold models, respecting their dependencies.
• Verify that all models are materialized correctly.
Take Home Exercise 2: BI, Advanced Analytics and ML in Lakehouse
In this exercise, you will execute queries, create virtual datasets (SQL View), and develop an interactive
dashboard using Dremio and Apache Superset. At the end of this exercise, you will learn how to:
• Execute Queries and Create Virtual Datasets:
o You'll use Dremio to run SQL queries and create complex virtual datasets.
• Connect Dremio to Apache Superset:
o You'll link Dremio with Apache Superset to create interactive dashboards.
• Develop a Marketing Dashboard:
o You'll build a marketing dashboard to visualize data insights.
• Use Jupyter Notebooks:
o You'll connect Dremio to Jupyter Notebook for advanced analytics, including tasks like
product reviews analytics and vehicle health analytics (a connection sketch follows this
overview).
These activities will help you gain practical experience in using business intelligence tools and performing
advanced data analytics.
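As a preview of the Jupyter Notebook part, one common way to pull Dremio query results into a notebook
is over Arrow Flight with pyarrow. The snippet below is a hedged sketch only: the endpoint, credentials,
and the queried dataset are assumptions, and the notebooks provided in the project may use a different
client.

from pyarrow import flight

# Assumed endpoint: the same Dremio Flight port (32010) used in the Superset SQLAlchemy URL later on.
client = flight.FlightClient("grpc+tcp://dremio:32010")
token = client.authenticate_basic_token("your_dremio_userid", "your_dremio_password")
options = flight.FlightCallOptions(headers=[token])

# Illustrative query against a virtual dataset you will create later in this exercise.
query = 'SELECT * FROM lakehouse.marketing."customers_vip" LIMIT 10'
info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)
reader = client.do_get(info.endpoints[0].ticket, options)
df = reader.read_all().to_pandas()  # Arrow table -> pandas DataFrame
print(df.head())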
Key Concepts:
Terminologies used in this exercise:
• Sources: These are connections to storage systems like S3 or databases, containing datasets
known as physical datasets.
• Physical Datasets: These immutable datasets, marked by a purple-blue icon, are the original data
stored in sources like S3 or databases. Due to their sensitive nature, access is restricted to a
select group of data engineers or administrators. You can view details such as the source, last
update, columns, and data types, and perform actions like adding columns, grouping, joining,
and changing data types directly from the dataset overview screen. They can be downloaded in
formats such as JSON, CSV, or Parquet.
• Virtual Datasets: These are built on top of physical datasets and function like enhanced table
views, allowing for flexible data manipulation and business logic development. Data analysts can
create virtual datasets by executing queries and saving the results. They can be stored in either a
private home space or a public shared space. Adding descriptions and labels, such as "customers
with more than three purchases" or tags like "marketing" and "segmentation," provides context
and improves searchability.
• Spaces: These are locations that organize virtual datasets shared with other users. They can be
grouped by project, business unit, or region and have configurable sharing and access privileges.
• Home Space: A private area for your physical and virtual datasets where you can curate data
until it's ready to be shared.
Create a simple virtual dataset in Dremio
Follow the steps below to create a new virtual dataset named “customers_vip” by adding a new
column to the existing table “customer_segmentation”.
Steps:
• Open the Dremio UI
• Select the physical dataset “customer_segmentation”
(catalog.gold.customer_segmentation).
• Execute a simple SQL select statement to view the dataset.
• Add a New Column with the column name “is_vip”
• Click Add Column.
• Name this new field “is_vip”.
• This is a new calculated column to identify VIP customers.
• VIP customers are defined as those with more than three purchases.
• Use the “total_purchases” field to create this new “is_vip” column, setting it to true for
customers with more than three purchases.
• Apply and Save this to a virtual dataset (i.e. view): The new column is applied and
saved as a view named customers_vip.
• Organize the Dataset: Save this new virtual dataset to a new folder named "marketing"
under lakehouse for easy access and organization.
Creating a Complex Virtual Dataset using Dremio
Now that you know how to create a simple virtual dataset, let’s create a more complex virtual dataset
using SQL to make the data more useful for analysis. We will split a column (preferred_models) in the
virtual dataset “customers_vip” into multiple columns that indicate whether a customer bought a specific
model.
Follow the steps below to create this new virtual dataset with the name “customers_vip_split_models”.
Again, we will use Dremio UI to do it.
Steps to Create the complex Virtual Dataset:
• Query the virtual dataset “customers_vip” that you just created in the previous step.
• Write a SQL query to split the “preferred_models” column into five new columns, each
showing true (1) or false (0) if a customer bought a specific model. Below is the SQL query for
your reference.
SELECT
  customer_id,
  is_vip,
  first_name,
  email,
  city,
  state,
  country,
  total_purchases,
  average_purchase_value,
  CASE WHEN POSITION('PowerRide' IN preferred_models) > 0 THEN 1 ELSE 0 END AS powerride,
  CASE WHEN POSITION('UrbanGlide' IN preferred_models) > 0 THEN 1 ELSE 0 END AS urbanglide,
  CASE WHEN POSITION('EcoSprint' IN preferred_models) > 0 THEN 1 ELSE 0 END AS ecosprint,
  CASE WHEN POSITION('AeroFlow' IN preferred_models) > 0 THEN 1 ELSE 0 END AS aeroflow,
  CASE WHEN POSITION('TerraCross' IN preferred_models) > 0 THEN 1 ELSE 0 END AS terracross
FROM
  lakehouse.marketing."customers_vip";
• Save this query as a view in the marketing folder with the name customers_vip_split_models.
Congratulations! Now, let’s explore some of the features in Dremio.
• Jobs Tab:
o You can check the jobs tab to see details about the query execution, such as who ran it
and the query syntax.
• SQL Tab:
o You can view a graph of all datasets involved in the query, both physical and virtual.
In the next exercise, we will learn how to use Apache Superset to create a dashboard for the lakehouse.
Apache Superset
Apache Superset is a modern data exploration and data visualization platform. Here are its key uses:
• No-Code Interface: Allows you to build charts quickly without needing to write code.
• SQL Editor: Provides a powerful web-based SQL editor for advanced querying.
• Custom Dimensions and Metrics: Offers a lightweight semantic layer for defining custom
dimensions and metrics.
• Wide Database Support: Supports nearly any SQL database or data engine.
• Beautiful Visualizations: Includes a wide array of visualizations to showcase your data
effectively.
In this exercise, you will learn how to connect Dremio to Apache Superset. Run SQL queries to retrieve
data and use the retrieved data to build an interactive marketing dashboard in Superset. At the end of the
exercise, you will have an interactive marketing dashboard that visualizes data insights, demonstrating
your ability to use BI tools for data-driven decision-making.
Key steps for this exercise
• Connect Superset to Dremio
• Open your browser and go to the URL running on port 8088 (Superset).
• Log in to Superset using the default credentials: username:admin and password:admin.
• Connect Superset to Dremio:
o After logging in to Superset, go to Settings and select Database Connections.
o Add a new database and select Dremio from the supported databases.
o Enter the SQLAlchemy URL, replacing the username and password with your Dremio
credentials. (i.e. your username/password to access Dremio)
dremio+flight://{username}:{password}@dremio:32010/?UseEncryption=false
Using SQLAlchemy: SQLAlchemy is a toolkit that helps manage database connections in a
Pythonic way.
• Add a Dataset at Superset
• Go to Datasets and select Create New Dataset.
• Choose the Dremio database and select the schema (e.g., lakehouse.marketing).
• Select the table (e.g., customers_vip) and click Create Dataset and Create Chart.
• Creating a marketing dashboard
The following steps describe how to create and customize visualizations and dashboards using
Apache Superset. Try to do your own exploration on data visualization, dashboard building and
customizing the final layouts for your dashboard.
• Visualize Data: Learn to create different types of charts (e.g., pie charts, bar charts) to represent
customer data.
• Build Dashboards: Combine multiple charts into a cohesive dashboard to present marketing
insights.
• Customize Layouts: Adjust the layout and design of the dashboard for better readability and
presentation.
Utilize the tables `customers_vip` and `customers_vip_split_models`, which you previously created and
stored in the `lakehouse.marketing` schema, to generate various charts for the marketing dashboard.
• Create a Pie Chart:
o Use the is_vip column in the customers_vip dataset to show the breakdown of VIP vs.
non-VIP customers.
o Set the metric as COUNT and customize the labels to show percentages.
• Create a Bar Chart:
o Use the same dataset to show VIP customers by state.
o Use the state column for the X-Axis and filter by is_vip to show only VIP customers.
• Create a Stacked Bar Chart:
o Use the customers_vip_split_models dataset (make sure that you have added it in
Superset; refer to “Add a Dataset at Superset”).
o Show customer model preferences by state.
o Use the state column for the X-Axis and sum the new columns created for each model
preference.
o Enable Stacking: On the "Customize" tab, find "Stacked Style" and select "Stacked" to
enable the Stacked Histogram pattern.
• Customize the Dashboard:
o Arrange the charts for better presentation.
o Add headers and adjust the layout to make the dashboard more informative.
You should have something like this in the dashboard after completing the above steps.
Capture this screen and make sure to include the URL in the screenshot.
*** You may want to add additional charts to get your bonus points.***
END