Architecture
Prepare
- AWS console
- Region : Oregon
1. Session 1
- Download workshop01_Kinesis_Firehose.pdf
1.1 Set up the CloudFormation environment
- Download sdl-workshop.json (important)
- Open CloudFormation
- Create stack > select sdl-workshop.json
- Specify stack details :
    - Stack name : scott-workshop-0912
    - yourInitials : s0912 (must be 5 characters or fewer)
- Configure stack options :
    - Tags : (empty)
    - Permissions : (empty)
- Review :
    - ✅ I acknowledge that AWS CloudFormation might create IAM resources.
    - Create stack
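If you prefer the SDK to the console, the same stack can be created with a few lines of boto3. This is only an illustrative sketch, not part of the official lab: it assumes sdl-workshop.json sits in the current directory, reuses the stack name and yourInitials value from above, and passes CAPABILITY_IAM to mirror the IAM acknowledgement checkbox.

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-west-2")

# Read the workshop template downloaded earlier (assumed to be in the working directory)
with open("sdl-workshop.json") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="scott-workshop-0912",
    TemplateBody=template_body,
    Parameters=[{"ParameterKey": "yourInitials", "ParameterValue": "s0912"}],
    Capabilities=["CAPABILITY_IAM"],  # same acknowledgement as the console checkbox
)
```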
1.2 Create the S3 data store
- Open S3
- Create bucket :
    - Name : s0912-tame-bda-immersion (< your initials >-tame-bda-immersion)
    - Region : US West (Oregon)
    - Create bucket
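The equivalent SDK call for this step, as a hedged sketch (bucket names are global, so substitute your own initials):

```python
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# Create the workshop bucket in Oregon; LocationConstraint is required outside us-east-1
s3.create_bucket(
    Bucket="s0912-tame-bda-immersion",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)
```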
1.3 Kinesis Firehose
- Open Kinesis
- Create > Kinesis Firehose delivery streams > New delivery stream
    - Delivery stream name : tamebda-rta-kinesisfh-prodcat
    - Source : Direct PUT or other sources
- Process records
    - Transform source records with AWS Lambda – Record transformation : Disabled
    - Convert record format – Record format conversion : Disabled
- Select a destination
    - S3 destination : s0912-tame-bda-immersion
    - S3 prefix : raw/
- Configure settings
    - Buffer size : 1 MB
    - Buffer interval : 60 seconds
- S3 compression and encryption
    - S3 compression : GZIP
    - S3 encryption : Disabled
    - Error logging : Enabled
- Permissions
    - IAM role > Create new or choose
        - IAM role : select the role created by the CloudFormation template :
          scott-workshop-0912-tameFHoseRoleSlessDataLakeImme-< UNIQUE ID >
        - Policy Name : FirehosePolicyServerlessDataLakeImmersion
- Review > Create delivery stream
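For reference, a boto3 sketch of roughly the same delivery stream configuration. The role ARN is a placeholder: use the Firehose role created by the CloudFormation stack (its exact name contains a unique ID, so copy it from the IAM console).

```python
import boto3

firehose = boto3.client("firehose", region_name="us-west-2")

firehose.create_delivery_stream(
    DeliveryStreamName="tamebda-rta-kinesisfh-prodcat",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        # Placeholder: the Firehose role created by the CloudFormation stack
        "RoleARN": "arn:aws:iam::<ACCOUNT_ID>:role/<firehose-role-from-stack>",
        "BucketARN": "arn:aws:s3:::s0912-tame-bda-immersion",
        "Prefix": "raw/",
        "BufferingHints": {"SizeInMBs": 1, "IntervalInSeconds": 60},
        "CompressionFormat": "GZIP",
    },
)
```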
1.4 Amazon Kinesis Data Generator (KDG)
- KDG makes it easy to send data to Kinesis Streams or Kinesis Firehose. Learn how to use the tool and create templates for your records.
- Github: awslabs/amazon-kinesis-data-generator
- Documents Overview
- Click Create a Cognito User with CloudFormation
- Create stack
    - (the template source path is pre-filled when you land on the page)
- Specify stack details
    - Stack name : Kinesis-Data-Generator-Cognito-User-0912
    - Parameters > Cognito User for Kinesis Data Generator
        - Username : __your_username__
        - Password : __your_password__
- Configure stack options :
    - Tags : (empty)
    - Permissions : (empty)
- Review > Create stack
- After the stack is created, you can see the KinesisDataGeneratorUrl value on the Outputs tab. Click the URL in the Value column.
- Amazon Kinesis Data Generator
    - Enter your username and password
    - Region : us-west-2
    - Stream/delivery stream : tamebda-rta-kinesisfh-prodcat
    - Records per second : 100
    - Record template :

        {
          "productName": "{{commerce.productName}}",
          "color": "{{commerce.color}}",
          "department": "{{commerce.department}}",
          "product": "{{commerce.product}}",
          "imageUrl": "{{image.imageUrl}}",
          "dateSoldSince": "{{date.past}}",
          "dateSoldUntil": "{{date.future}}",
          "price": {{random.number( { "min": 10, "max": 150 } )}},
          "campaign": "{{random.arrayElement( ["BlackFriday","10Percent","NONE"] )}}"
        }

    - Press Test Template to generate a few records and check the output, then press Send data.
- After a few minutes, go to Kinesis Data Firehose > select tamebda-rta-kinesisfh-prodcat > Monitoring tab
- Go to S3 > select s0912-tame-bda-immersion > you will see lots of fake data under /raw
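If you want to push a few test records without the KDG UI, a small boto3 sketch works too. The field names mirror the record template above; the sample values are made up purely for illustration.

```python
import json
import random
import boto3

firehose = boto3.client("firehose", region_name="us-west-2")

for i in range(10):
    # Fields match the KDG record template; values here are fake placeholders
    record = {
        "productName": f"Fake Product {i}",
        "color": "red",
        "department": "Garden",
        "product": "Chair",
        "imageUrl": "http://example.com/img.png",
        "dateSoldSince": "2020-01-01T00:00:00Z",
        "dateSoldUntil": "2021-01-01T00:00:00Z",
        "price": random.randint(10, 150),
        "campaign": random.choice(["BlackFriday", "10Percent", "NONE"]),
    }
    firehose.put_record(
        DeliveryStreamName="tamebda-rta-kinesisfh-prodcat",
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )
```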
1.5 Summary
Congratulations. You have successfully created a serverless ingestion pipeline that is ready to process huge amounts of data.
So, what's next? In the next lab, you will build the processing pipeline that converts and transforms the data from the ingestion layer.
1.6 Deleting Resources
If you are not proceeding to the next lab, make sure you delete the resources below to avoid unnecessary charges.
- S3 :
    - Delete the buckets and the data inside.
- Firehose :
    - Delete the delivery stream.
- IAM :
    - Security Warning: For the sake of simplicity, some of the permissions in the examples are a bit too permissive (for example, the IAM role defined for the SageMaker notebook has s3FullAccess permissions; this is a configuration issue that will be fixed in the next version of the CloudFormation template). The permissions should be removed from the user (or the role deleted) after the workshop.
2. Session 2 : Glue
- Download workshop02_Glue.pdf
- Open the AWS Glue service
2.1 Cataloging a Data Source with AWS Glue
- Data catalog > Crawlers > Add crawler
    - Crawler name : s0912-tame-bda-immersion-gc (__your_initials__-tame-bda-immersion-gc)
- Specify crawler source type
    - Crawler source type : Data stores
- Add a data store
    - Choose a data store : S3
    - Crawl data in : Specified path in my account
    - Include path : s3://s0912-tame-bda-immersion/raw (make sure you select the /raw folder)
- Add another data store : No
- Choose an IAM role
    - ✅ Choose an existing IAM role : scott-workshop-0912-tameGlueRoleSlessDataLakeImmer-< UNIQUE ID >
- Create a schedule for this crawler
    - Frequency : Run on demand
- Configure the crawler’s output
    - Add database : s0912-tame-bda-immersion-gdb (< your initials >-tame-bda-immersion-gdb)
- Create crawler > Finish
- Select the crawler you just created > Run crawler
- After a few minutes, the crawler finishes. Go to Database > Tables, and you will see the raw table.
- Review Schema :
    - Edit the schema by renaming the partitions to correct values :
        - Partition_0 -> year
        - Partition_1 -> month
        - Partition_2 -> day
        - Partition_3 -> hour
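The crawler configuration above can also be created from the SDK. A minimal boto3 sketch: the role name is the Glue role from the CloudFormation stack (copy the exact value from IAM), and the partition renaming still has to be done afterwards in the console.

```python
import boto3

glue = boto3.client("glue", region_name="us-west-2")

# Crawl the raw/ prefix into the workshop Glue database
glue.create_crawler(
    Name="s0912-tame-bda-immersion-gc",
    Role="scott-workshop-0912-tameGlueRoleSlessDataLakeImmer-<UNIQUE ID>",
    DatabaseName="s0912-tame-bda-immersion-gdb",
    Targets={"S3Targets": [{"Path": "s3://s0912-tame-bda-immersion/raw"}]},
)

glue.start_crawler(Name="s0912-tame-bda-immersion-gc")
```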
2.2 Transforming a Data Source with Glue
- Glue concepts used in the lab :
    - ETL Operations : Using the metadata in the Data Catalog, AWS Glue can autogenerate Scala or PySpark (the Python API for Apache Spark) scripts with AWS Glue extensions that you can use and modify to perform various ETL operations. For example, you can extract, clean, and transform raw data, and then store the result in a different repository, where it can be queried and analyzed. Such a script might convert a CSV file into a relational form and save it in Amazon Redshift.
    - Jobs : The AWS Glue Jobs system provides managed infrastructure to orchestrate your ETL workflow. You can create jobs in AWS Glue that automate the scripts you use to extract, transform, and transfer data to different locations. Jobs can be scheduled and chained, or they can be triggered by events such as the arrival of new data.
        - AWS Glue runs your ETL jobs in an Apache Spark serverless environment.
        - AWS Glue can generate a script to transform your data, or you can provide the script in the AWS Glue console or API.
        - You can run your job on demand, or you can set it up to start when a specified trigger occurs. The trigger can be a time-based schedule or an event.
        - When your job runs, a script extracts data from your data source, transforms the data, and loads it to your data target. The script runs in an Apache Spark serverless environment in AWS Glue.
- Open the S3 service > select the bucket you created : s0912-tame-bda-immersion
    - Create a new folder : scripts-etl
- Open the Glue service > ETL > Jobs > Add job
- Configure the job properties
    - Name : s0912-gj-tame-bda-kdg-raw2parquet (< your-initials >-gj-tame-bda-kdg-raw2parquet)
    - IAM role : scott-workshop-0912-tameGlueRoleSlessDataLakeImmer-__UNIQUE_ID__
    - Type : Spark
    - Glue version : Spark 2.4, Python 3 (Glue version 1.0)
    - This job runs : A proposed script generated by AWS Glue
    - Script name : s0912-tame-bda-kdg-raw2parquet (< your-initials >-tame-bda-kdg-raw2parquet)
    - S3 path where the script is stored : s3://s0912-tame-bda-immersion/scripts-etl
    - Temporary directory : don't change (Glue provides a default value)
    - Advanced properties :
        - Job bookmark : Enable
    - Monitoring options :
        - Job metrics : Enable
- Choose a data source
    - Select the raw table from our Glue DB
- Choose a transform type
    - ✅ Change schema
- Choose a data target
    - ✅ Create tables in your data target
    - Data store : S3
    - Format : Parquet
    - Target path : s3://s0912-tame-bda-immersion/compressed-parquet
- Map the source columns to target columns
    - Remove color from the target.
    - Rename datesoldsince to date_start.
    - Rename datesolduntil to date_until.
- Save the script and close the edit script window.
- Select job : s0912-gj-tame-bda-kdg-raw2parquet > Action > Run Job
- Wait a few minutes while the job goes through its stages :
    - Starting
    - Running
    - Stopped
- After the ETL job finishes, you will see the results in S3.
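For reference, the script Glue proposes for this job will look roughly like the sketch below. It is not the exact generated code: the column list and types are guesses based on the KDG record template (partition columns are omitted for brevity), and color is dropped simply by leaving it out of the mapping.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the raw table cataloged by the crawler in section 2.1
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="s0912-tame-bda-immersion-gdb", table_name="raw"
)

# Rename the date columns and drop 'color' (unmapped fields are dropped)
mapped = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("productname", "string", "productname", "string"),
        ("department", "string", "department", "string"),
        ("product", "string", "product", "string"),
        ("imageurl", "string", "imageurl", "string"),
        ("datesoldsince", "string", "date_start", "string"),
        ("datesolduntil", "string", "date_until", "string"),
        ("price", "int", "price", "int"),
        ("campaign", "string", "campaign", "string"),
    ],
)

# Write the result as Parquet to the compressed-parquet prefix
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://s0912-tame-bda-immersion/compressed-parquet"},
    format="parquet",
)

job.commit()
```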
2.3 Summary & Next Steps
Congratulations. You have successfully created a processing pipeline without clusters, which is ready to process huge amounts of data.
So, what's next? Since the amount of data you processed in this lab was tiny, in the next lab you will work with an open data set by first cataloging it with Glue and then querying it with Athena.
3. Athena
- Download workshop03_Athena.pdf
- Open the AWS Athena service
- We can use Presto SQL to query the data
3.1 Query your sample data from Glue
- Select Database (Glue) : s0912-tame-bda-immersion-gdb
- New query :

        select * from raw order by productname limit 20
- s0912-tame-bda-immersion-gdb is the database that was created when we first ran the Glue crawler.
- Later we also used the ETL job to convert /raw into Parquet format, stored under compressed-parquet.
- We can reuse the method from section 2.1 to create a new crawler, s0912-gc-parquet, using the same s0912-tame-bda-immersion-gdb database.
- The raw table holds the original JSON-format data; the compressed-parquet table holds the Parquet-format data.
- Comparison : run the SQL query against the raw table.
- Comparison : run the SQL query against the compressed-parquet table (a small sketch of both queries follows below).
- Reportedly, the Parquet format only really shows off its query speed once you query larger volumes of data (> 50 GB?) (…unverified, to be tested 🤔).
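The raw-vs-Parquet comparison can also be scripted. A sketch using boto3 and Athena: the athena-results/ output prefix is my own choice, and the Parquet table name should be whatever the new crawler actually registered (check the Glue console; hyphenated names must be double-quoted in Athena).

```python
import boto3

athena = boto3.client("athena", region_name="us-west-2")


def run(sql):
    # Any writable S3 path works as the query result location
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "s0912-tame-bda-immersion-gdb"},
        ResultConfiguration={"OutputLocation": "s3://s0912-tame-bda-immersion/athena-results/"},
    )
    return resp["QueryExecutionId"]


# Same aggregation against the JSON-backed and the Parquet-backed tables;
# compare run time and "Data scanned" for the two query IDs in the console.
raw_qid = run("SELECT campaign, count(*) AS cnt FROM raw GROUP BY campaign")
parquet_qid = run('SELECT campaign, count(*) AS cnt FROM "compressed-parquet" GROUP BY campaign')
print(raw_qid, parquet_qid)
```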
3.2 Query 200 GB of Sample Data on Athena
Since the fake data we generated in this workshop is only about 10 MB, it is hard to see any difference in query performance. So here we borrow an existing public S3 sample data set to test Athena's query performance:
- The public S3 dataset provided by GDELT : s3://gdelt-open-data/events/
Notice the amount of data scanned? The results are returned in less than 50 seconds by scanning 190 GB of data from thousands of uncompressed CSV files on S3. That's the power of Hive, Presto and other Hadoop technologies, simplified by the Athena service.
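To read those numbers programmatically rather than from the console, Athena exposes per-query statistics. A small sketch (the query execution ID placeholder comes from start_query_execution or from the Athena console history):

```python
import boto3

athena = boto3.client("athena", region_name="us-west-2")

# Look up how much data a finished query scanned and how long the engine ran
execution = athena.get_query_execution(QueryExecutionId="<your-query-execution-id>")
stats = execution["QueryExecution"]["Statistics"]
print("Data scanned (bytes):", stats["DataScannedInBytes"])
print("Engine execution time (ms):", stats["EngineExecutionTimeInMillis"])
```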
- Following steps 5–17 of workshop03_Athena.pdf, you can create the gdelt.eventcodes, gdelt.types, gdelt.groups and gdelt.countries tables and then work through the SQL exercises in steps 22, 23 and 24 (I'm feeling a bit lazy, so I won't implement that part here…).
- Step 25 notes that with the uncompressed CSV format the query performance is actually only so-so; switching to the Parquet format greatly improves query performance (a conversion sketch follows at the end of this section).
- Since our S3 bucket does not hold all of the gdelt raw CSV data, I have collected some reference materials for implementing this.
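A sketch of the Parquet conversion mentioned above, using an Athena CTAS statement. Everything here is assumed rather than taken from the lab: it presumes you already created a gdelt events table from the workshop PDF (the name gdelt.events is a guess) and that gdelt-parquet/ is an empty, writable prefix in your own bucket.

```python
import boto3

athena = boto3.client("athena", region_name="us-west-2")

# CTAS: rewrite the CSV-backed table as Parquet for cheaper, faster queries
ctas = """
CREATE TABLE gdelt.events_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://s0912-tame-bda-immersion/gdelt-parquet/'
) AS
SELECT * FROM gdelt.events
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "gdelt"},
    ResultConfiguration={"OutputLocation": "s3://s0912-tame-bda-immersion/athena-results/"},
)
```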
4. Quicksight
- Download workshop04_Quicksight.pdf
- Open the AWS QuickSight service
4.1 Switch to the Virginia region for permission settings
- Following point 3 of workshop04_Quicksight.pdf, reconfigure the Athena and S3 permissions.
4.2 Create a QuickSight Data Set
- Click New Analysis (top left) > New Data Set > Athena and create a data source : < pick your own name, e.g. s0912-demo >
- QuickSight looks up the databases in the Glue Data Catalog to discover the data's metadata, including database and table information.
- Select : s0912-tame-bda-immersion-gdb
- Since the goal of this workshop is to learn how the tools work and the data volume is small, we skip QuickSight SPICE query acceleration, choose Directly query your data, and click the Visualize button.
4.3 Start playing around with QuickSight!
Use the Add button at the top left to add a visual. In the lower left, pick the chart type you like, then drag fields from the data set onto the matching wells of the chart, such as the X and Y axes of a bar chart.