MET CS 689 Designing and Implementing a Data Warehouse
Assignment 3A: ETL - Python
Overview of the Assignment:
ETL takes most of the data warehouse developers’ and administrators’ time. This assignment will go through some of the more common ETL processes using Python.
Part 1 –Extract, Staging
Download the two csv files Ships and CLIWOC15.csv. You just completed the “Extract” phase, it does get more complicated, but we keep it simple here.
We will begin by “staging” the data. The first step before “Transformation” is to get the data prepared or what’s referred to as “staged” before we can load it, and that initial step is to load the data into “staging tables” in this case these are going to be staging data frames. The idea of staging tables is to keep the data as close to the source format as possible. We will begin by using Python, to load the two files into two data frames called Ship_df and Trip_df, respectively. Review the data and note that the two files share three columns: ShipName, ShipType, and Nationality. These will be used as “natural composite key” columns to join the two data frames (tables).
Pandas Reference/Hints/Notes
· Pandas - DataFrame. Reference (w3schools.com) is a great reference to Pandas methods
· Review run LoadTitanic.py from assignment 1 on how to create a data frame. from a CSV file. Note that you don’t have to have a separate Load.py, you can paste the load command directly in the notebook, however having separate load.py source files allow you to build modularity.
· display(df) method shows all the data in the data frame. – this is very helpful in understanding the data
· df.head()method allows you to inspect the header and the first 10 rows – this is very helpful in understanding the data with performance in mind.
· df.count() method will show count of records, it’s good to know what was loaded or not.
· List(df.columns.values) method shows the attributes
· Loading Trips_df might give you an error, understand why you are getting the error, and research how to solve it (you can use the “lazy option”, and not convert the data types- especially for the initial staging table)
1. Once you load the two data frames: How many rows and columns are in each data frame?
|
Rows |
Columns |
Ship_df |
|
|
Trip_df |
|
|
Show the record count of both Ship_df and Trip_df data frames as well as the count.
Python command:
Screenshots of the executed command:
Part 2 –Creating SCD1 Dimension and Key maintenance
2. Our end goal for this section of the assignment is to create a Ship dimension table ShipDim. In this case we are going to keep it simple and use SCD type1, meaning overwrite if there are any changes, or add if it’s a new record. Recall that SCD type 1 needs to have a unique instance of each record, so let’s check if there are any duplicates in the Ship_df
Hints:
· There are null values in some attributes, we want all combinations with the null (NaN) values, not all of these techniques below will work correctly- look to figure out which ones will give you the best results
o value_counts() function – this might only help in seeing the count for a single attribute.
o groupby() function – make sure to include all three attributes as we want to find a distinct combination of all three.
o concatenate attributes for example df.a +"-"+ df.b
o df.fillna('') which will help transform. empty values into strings
o df.describe(include='all') which will give you statistics of the data frame.
· You can apply multiple functions together on a data frame. For example, you can apply both groupby first and value_counts after to get a count of distinct values, however groupby may have issues with the nulls in one of the columns.
Show if there are any duplicate combination of all three attributes in the Ship_df, this can just show the counts of combinations.
Python command:
Screenshots of the executed command:
3. You will notice that Ship_df has some duplicates that need to be removed before that data can be used to populate a ShipDim dimension table for it to be in SCD1 format. Use pandas to drop the duplicates and store the result into another data frame. labelled ShipDistinct_df.
· Hint: Review the drop_duplicates() function.
Show the above operation to create a new ShipDistinct_df data frame. by removing the duplicate records.
Python command:
Screenshots of the executed command:
4. Show new ShipDistinct_df data frame. record count which has no duplicates, use the commands you used in question 1 and 3 on this new data frame.
Python command:
Screenshots of the executed command:
How many rows are in ShipDistinct_df after duplicates have been dropped? ______
Note: the new count should make sense in reviewing groupby count results between the two data frames.
5. Now let’s focus on the Trip_df. Trip has some additional ships (ShipName, ShipType, Nationality) that do not currently appear in ShipDistinct_df. These new dimension rows need to be pulled and added to the dimension table. Inspect the column names of the Trips_df and provide screenshot of the columns.
Hint: Look into columns.tolist() function, if not all columns display, modify settings to display all columns by using pd.set_option('display.max_columns', None)
Python command:
Screenshots of the executed command:
6. Our goal is to identify any Ships in Trips_df that is not in ShipDistinct. We can use a LEFT JOIN on these dataframes to determine which values of (ShipName, ShipType, Nationality) are in Trip_df but not in ShipDistinct_df.
First create a joined data frame. called ShipTrips_df by using the merge() function on the Trips_df and the ShipDistinct_df; use a left join on (ShipName, ShipType, Nationality) and set the indicator to True.
Show the merge command to create the ShipTrips_df data frame.
Python command:
Screenshots of the executed command:
7. Inspect the resulting ShipsTrips_df data frame. (use the head() function), specifically scroll all the way to the right and note the _merge column that has been added. Let’s determine the unique combinations of _merge column by using the value_counts() function.
Show the value_counts() of the _merge column.
Python command:
Screenshots of the executed command:
For each of the resulting values (you should see three) of the results above, very briefly explain what it means – short, bulleted list.
· Your explanation goes here….
·
8. Now we can filter out the new records we will need to bring into our Ships from the joined data frame, decide the filter condition based on the results from question 7.
Hint: Look into query and filter functions to use on the ShipsTrips_df. The filter function will show the attributes that we want to look at, while the query function will help us filter the results
Provide the function call for the ShipsTrips_df showing only the columns that we need (ShipName, ShipType, Nationality), as well as the _merge column. The function call should filter (query) _merge column as outlined in the directions above.
Python command:
Screenshots of the executed command:
9. Your rows count should match the count in question 7. How many new records were found? ____
10. Now let’s find the distinct instances of the ShipName, ShipType, and Nationality. Perform. the same operations as outlined in steps 2 through 4 to create a ShipsTrips_Distinct_df (this new data frame. should not have any duplicates)
Python command:
Screenshots of the executed command:
11. Show the resulting data frame. – list all the data.
Python command:
Screenshots of the executed command:
How many new records are in the ShipsTrips_Distinct_df? _____
12. Combine the two distinct data frames (ShipsDistinct_df and ShipsTrips_Distinct_df ) into DimShip data frame. Hint: Use the pandas append() or the pd.concat() function, look to ignore the existing index as we will create a new surrogate primary key in the next step.
Show the command combining the data frames into a single DimShip.
Python command:
Screenshots of the executed command:
How many records are in DimShip now? _______
13. Now we need to create a surrogate primary key for the DimShip data frame. Use the reset_index() function to add a column to the DimShip data frame. and call the column “Id”, start the index at 1. Hint: investigate how to add a new column to the existing data frame.
Show the command creating the surrogate key for DimShip and a separate command showing the new index column. Use display() method.
Python command:
Screenshots of the executed command:
Congratulations, you have now created a clean data frame. called DimShip which includes distinct record combinations as well as a primary key.
Part 3 –Creating Fact data frame
Now that we have a DimShip dimension, lets focus on creating a FactTrip dat aframe. There are many columns in the Trip_df – if you recall fact tables contain measures. Some of the columns are good candidates for additional dimensions, we are going to keep it simple for now and focus on creating a Fact table. Select three or four numeric columns from the original Trip_df dataframe. which you will use.
14. List the three measures columns you will use:
· Measure 1:
· Measure 2:
· Measure 3:
15. Create a new dataframe. called FactTrip that includes the following attributes:
a. ShipName, ShipType, Nationality – these are our natural key to connect the fact table to the ShipDim
b. RecID, Year, Month, Day (we will work with dates in a later question)
c. The DimShipId foreign key from the DimShip dataframe. – you will need to join to the DimShip dataFrame. to get this attribute. Make sure to call this attribute DimShipId
d. Three numeric values which you selected in question 15.
Provide the command(s) creating the FactTrip data frame. and the results of displaying some sample data from the FactTrip. Provide a third screenshot verifying counts between the original Trip_df and FactTrip.
Python commands:
Screenshots of the executed command:
Which counts specifically from the two data frames gives you confidence in the FactTrip data frame? (Short answer- single sentence)
Your answer goes here:
16. Add a surrogate primary key to the FactTrip data frame. like we did in step 13 for the DimShip data frame.
Provide the command and results of creating the surrogate key for FactTrip and a second command and results showing the new index column including several rows of the FactTrip data frame. Use display() method.
Python commands:
Screenshots of the executed commands:
Part 4 – Transformation: Dates
There are a few columns in the FactTrip data frame. that, together, indicate a specific date: Year, Month, Day. We will use the next step to transform. those columns into a date column.
17. Add two new columns to the data frame. and populate one with a string for the date and the second with the date calculated from the string. Hint, look into date formatting and string conversion as well as the to_datetime() function and pay attention to how to handle errors (errors='coerce') and format.
Python command:
Screenshots of the executed command:
18. Take a screen shot showing the first 10-20 rows of the updated FactTrip data frame.
Python command:
Screenshots of the executed command:
Transforming date is a single example of transformation, usually this is one of the more complicated steps. In your project, you will want to focus transforming strings and aggregating data to create measures as part of transformation instead of just extracting measures from the source file.
Part 5 – Load the data frames into the database tables
Create a new database in your system and create a table called Dim_Ship and load it with the DimShip data frame. The Dim_Ship table should have all the attributes from the DimShip including the primary key. You can choose to use SQL or Python to create the Dim_Ship table. Using Python, populate the new Dim_Ship table.
19. Take a screen shot of your command to create the Dim_Ship table in your database
Python/or SQL command:
Screenshots of the executed command:
20. Show your command of loading the DimShip into the database using Python.
Python command:
Screenshots of the executed command:
21. Display the loaded data from the Dim_Ship table in your Database using SQL
SQL command:
Screenshots of the executed command:
Next, we will create a table called Fact_Trip and load it with the FactTrip data frame. data. Fact_Trip table will contain the following attributes from the FactTrip data frame. The primary key, for foreign key to the Dim_Ship, the Date you converted, and the three measures. You will want to skip loading ShipName, ShipType and Nationality fields as you now have a foreign key to it and these are no longer needed. You can choose to use SQL or Python to create the table. Use Python to populate the FactTrip table.
22. Take a screen shot of your command to create the Fact_Trip table in your database
Python/SQL command:
Screenshots of the executed command:
23. Take a screen shot of your command loading the FactTrip into the database using Python
Python command:
Screenshots of the executed command:
24. Take a screen shot selecting the loaded data from the Fact_Trip table in your Database using SQL
Python command:
Screenshots of the executed command:
Part 6 –Extract, Staging
We will start by working with the original source files Ships.csv and CLIWOC15.csv.
The first step is “staging” the data within our SQL database itself by creating the staging tables directly within the relational database itself. The idea of staging tables is to keep the data as close to the source format as possible.
1. Load the Ships.csv file into the staging table Ships_Staging on your DBMS
· SQL Server data load tutorial: https://learn.microsoft.com/en-us/sql/integration-services/import-export-data/start-the-sql-server-import-and-export-wizard?view=sql-server-ver15#sql-server-management-studio-ssmsb.
i. Hint for SQL Server, on the Modify Columns screen, make sure to select allow NULLs for all three columns.
· PostgreSQL data load tutorial: https://www.postgresqltutorial.com/postgresql-tutorial/import-csv-file-into-posgresql-table/
ii. use the Import CSV file into a table using pgAdmin
iii. You will need to create the table first before using the import wizard
Screenshots of the data load of Ships_Staging:
· For SQL Server: provide the Summary and Results screenshots of the Import Flat File Wizard. You do not need to create the table first as the import wizard will do it for you.
· For PostgreSQL: provide your DDL SQL Code to create Ships_Staging table, as well as a screenshot of the successfully completed confirmation screen similar to the one at the end of the tutorial above.
25. Verify that there is data Ships_Staging table by selecting data from it.
SQL command:
Screenshots of the executed command:
26. Load the Trips csv data from the CLIWOC15.csv file into Trips_Staging table. You are free to pick the method of loading this staging table including Python, SQL and or a combination of both.
a. Hints: Both SQL Server and PostgreSQL will have trouble loading the full file using the method in the previous question where you loaded the Ships_Staging table. This is mostly due to the variation of the data in some of the columns, we would need to know the appropriate lengths.
b. Suggested approach: Use any method you like, however what might be the easiest is to load the Trips data into a data frame. as python is more forgiving, filter out the columns that you need (recall you will need ShipName, ShipType, and Nationality, RecID, Year, Month, Day, and the three measures you selected in the previous assignment. Create the Trips_Staging and load using the same method that you loaded the TripFact table at the end of the last assignment, however here you are loading the original data from the CLIWOC15.csv file, beyond the selection of the columns and the load there are no other steps on the Python side.
SQL/Python commands:
Screenshots showing the data load:
27. Verify that there is data in the Trips_staging table by selecting data from it.
SQL command:
Screenshots of the executed command:
28. Once you load the two tables: How many rows are in each staging table? Did your row count match to the loaded data frames in the previous assignment?
|
Rows |
Ships_staging |
|
Trips_staging |
|
Part 7 –Creating SCD1 Ship Dimension, Loading and Key maintenance
29. Our end goal for this section of the assignment is to create a Ship dimension table DimShipSQL which will be SCD type1, we will need to determine unique records from the Ships_Staging table first.
Select unique records Ships_Staging using SQL
SQL command:
Screenshots of the executed command:
How many unique rows are there? ______
30. Recall that Trips_Staging has a few additional dimensional records. Write a query which gets additional unique ships (ShipName, ShipType, Nationality) which are not in Ships_Staging.
a. Hint, this can be done via a sub-query – this is probably preferred, or using a left join filtering out the nulls.
SQL command:
Screenshots of the executed command:
How many additional unique ship rows are there in the Trips_Staging? ______
31. Combine the two queries from question 5 and 6 using the UNION command. You know have your dimensional records.
SQL command:
Screenshots of the executed command:
How many unique rows are there? ______
32. Create the DimShipSQL table – it will be the same structure as ShipDim from Assignment 3A – however look to implement key management for DimShipID surrogate key using Identity (SQL Server) or serial (PostgreSQL)
SQL command:
Screenshots of the executed command:
33. Load the data from the results of the previous question #8 into the DimShipSQL
a. Hint, use the insert into command.
SQL command:
Screenshots of the executed command:
Display the loaded data from the DimShipSQL table and check the record count to make sure it matches
SQL command:
Screenshots of the executed command:
How many rows are there in the resulting DimShipSQL? Your rows count should match the count in question 7 _____
Congratulations, you have now created a clean SCD1 table called DimShipSQL which includes distinct record combinations as well as a primary key.
Part 8 –Creating Fact table by joining and transforming
Now that we have a DimShipSQL dimension, lets again focus on creating a FactTripSQL table. Note the three or four numeric columns from the original Trip_Staging table that you are using.
34. List the three measures columns you will use:
· Measure 1:
· Measure 2:
· Measure 3:
35. Create the TripFactSQL table – it will be the same structure as TripDim from Assignment 3A – however look to implement key management using Identity (SQL Server) or serial (PostgreSQL) for the primary key. Note that in some fact table designs, there is no surrogate PK used, we will use one here.
Your TripFactSQL should include the following attributes
a. TripFactSQL PK using identity/serial for key management
b. DimShipID FK which should not be NULL
c. TripRecID (the source column for this will be RecID)
d. TripDate (date column)
e. The three/four measures you have selected.
SQL command:
Screenshots of the executed command:
36. We will begin to shape the data which will be inserted into the TripFactSQL table by joining Trip_Staging to the DimShipSQL – recall using the three columns that define the ship dimension as the join condition. Your resulting query will include the following fields from the Trip_Staging table: TripRecID, RecID, Year, Month, Day, the three/four measures of your choice, and DimShipID from the DimShip table.
a. Hint: it might be helpful at first to see the Ship data fields from both tables first to make sure the join is working, however you won’t need them for the final result.
SQL command:
Screenshots of the executed command:
How many rows are there? _____ Compare this number of rows to how many rows there are in the Trip_Staging table, you should have the same amount of rows.
37. We are almost ready to load this data. Let’s finish the prep work by transforming the month day year fields into a date. Select the RecID, Year, Month, Day as well as your calculated transformed date field called TripDate. Your screenshot can include the first 15-20 rows.
a. Hints: It might be helpful to limit your results as you are building your formula
b. Look at SQL String to date functions, you might need to concatenate the data first, there are a few approaches here.
c. If there are any null values or errors, you can ignore these. For example, you can use a case statement to check each date part for nulls, and only put the date together if all three parts are not null.
SQL command:
Screenshots of the executed command:
Transforming date is a single example of transformation, usually this is one of the more complicated steps. In your project, you will want to focus transforming strings and aggregating data to create measures as part of transformation instead of just extracting measures from the source file.
38. Now we are ready to load our FactTripSQL table. Take your query from question 13, and instead of the separate date fields, use the calculated/transformed Trip date, and insert the results of this query into the FactTripSQL table.
SQL command:
Screenshots of the executed command:
39. Display the loaded data from the FactTripSQL table and check the record count to make sure it matches to results of question 13.
SQL command:
Screenshots of the executed command:
Congratulations, you have now performed ETL in both SQL and Python
40. Within no more than two-three sentences compare and contrast the process between SQL and Python, what was easier in one process then the other? There is no right answer here, we just want to see you reflect on the process!
Part 9 Data flow
41. Create a physical data flow diagram depicting the ETL process in this assignment.
42. Expand on one physical data flow – the one where your DBMS was involved and create a logical data flow depicting the process. Focus and limit this to 3 logical steps within your design (not the entire transform. and load process)
Extra Credit - Extending the dimension
(Up-to 5 extra credit points)
Some of the fields in the CLIWOC.csv could be a new dimension, or part of the Dim_Ship table. Outline the new dimension you want to create, extract, transform. and load both the dimension and the fact appropriately. A suggestion is to focus on some complexity within transformation. This will give you practice and prepare you for the term project.
Show the commends and appropriate screenshots demonstrating your work and that the data has been loaded into the database.
Python commands:
Screenshots of the executed commands:
Extra Credit - Extending the dimension
(Up-to 5 extra credit points)
Similar to the extra credit in 3A, perform. the same steps utilizing SQL as much as possible. Some of the fields in the CLIWOC.csv could be a new dimension, or part of the DimShipSQL table. Outline the new dimension you want to create, extract, transform. and load both the dimension and the fact appropriately. A suggestion is to focus on some complexity within transformation. This will give you practice and prepare you for the term project.
Show the commends and appropriate screenshots demonstrating your work and that the data has been loaded into the database.
SQL/Python commands:
Screenshots of the executed commands:
Use the Ask the Teaching Team Discussion Forum if you have any questions regarding the how to approach this assignment.
Save your assignment as lastnameFirstname_assign3_A.docx and submit it in the Assignments section of the course.
For help uploading files please refer to the Technical Support page in the syllabus.
Criterion |
A |
B |
C |
D |
F |
Letter Grade |
Correctness and Completeness of Results (70%) |
All steps' results are entirely complete and correct |
About ¾ of the steps' results are correct and complete |
About half of the steps' results are correct and complete |
About ¼ of the steps' results are correct and complete |
Virtually none of the step's results are correct and complete |
|
Constitution of SQL/Python and Explanations (30%) |
Excellent use and integration of appropriate SQL/Python constructs and supporting explanations |
Good use and integration of appropriate SQL/Python constructs and supporting explanations |
Mediocre use and integration of appropriate SQL/Python constructs and supporting explanations |
Substandard use and integration of appropriate SQL/Python constructs and supporting explanations |
Virtually all SQL/Python constructs and supporting explanations are unsuitable or improperly integrated |
|
|
|
|
|
|
Assignment Grade: |
|
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。