Architecture

Prepare

  • AWS console
  • Region : Oregon

1. Session 1

1.1 Set up the CloudFormation environment

  • Download sdl-workshop.json (important)
  • Open Cloud Formation
  • create stack > select sdl-workshop.json
  • Specify stack details :
    • Stack name : scott-workshop-0912
    • yourInitials : s0912 ===> must be 5 characters or fewer
  • Configure stack options :
    • Tags : (empty)
    • Permissions : (empty)
  • Review :
    • ✅ I acknowledge that AWS CloudFormation might create IAM resources.
    • create stack

1.2 Create the S3 data store

  • Open S3
  • create bucket :
    • name: s0912-tame-bda-immersion (< your initials >-tame-bda-immersion)
    • region : US West (Oregon)
    • create bucket

1.3 Kinesis Data Firehose

  • open Kinesis
  • create Kinesis Firehose delivery streams
  • New delivery stream
    • Delivery stream name : tamebda-rta-kinesisfh-prodcat
    • Source: Direct Put or other sources
  • Process records
    • Transform source records with AWS Lambda – Record transformation: disabled
    • Convert record format – Record format conversion Conversion: disabled
  • Select a destination
    • S3 destination : s0912-tame-bda-immersion
    • S3 prefix: raw/
  • Configure settings
    • Buffer size: 1 MB
    • Buffer interval: 60 seconds
  • S3 compression and encryption
    • S3 compression: gzip
    • S3 encryption: Disabled
    • Error logging: Enabled
  • Permissions
    • IAM role > Create new or choose
    • IAM role :
      • Select the role created by the CloudFormation template : scott-workshop-0912-tameFHoseRoleSlessDataLakeImme-< UNIQUE ID >
    • Policy Name :
      • Select the role created by the CloudFormation template : FirehosePolicyServerlessDataLakeImmersion
  • Review > Create delivery stream
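
The buffer settings above mean Firehose flushes a batch to S3 when either threshold is hit first: 1 MB of buffered data or 60 seconds. A minimal Python sketch of that flush rule (the class and its thresholds are hypothetical, not the Firehose implementation):

```python
import time

class BufferingSink:
    """Toy model of Firehose buffering: flush when either the size
    or the time threshold is reached, whichever comes first."""

    def __init__(self, max_bytes=1 * 1024 * 1024, max_seconds=60):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.buffer = []
        self.buffered_bytes = 0
        self.opened_at = time.monotonic()
        self.flushed_batches = []  # each entry models one S3 object

    def put_record(self, record: bytes, now=None):
        now = time.monotonic() if now is None else now
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        if (self.buffered_bytes >= self.max_bytes
                or now - self.opened_at >= self.max_seconds):
            self.flush(now)

    def flush(self, now):
        self.flushed_batches.append(b"".join(self.buffer))
        self.buffer, self.buffered_bytes = [], 0
        self.opened_at = now

# Tiny thresholds just to show the size-based flush.
sink = BufferingSink(max_bytes=10, max_seconds=60)
sink.put_record(b"aaaa", now=0)     # 4 bytes, below both thresholds
sink.put_record(b"bbbbbbb", now=1)  # total 11 bytes -> size flush
```

With the lab's real settings, a quiet stream still produces an object roughly every 60 seconds, which is why data appears in S3 within a minute or two of pressing Send data.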

1.4 Amazon Kinesis Data Generator (KDG)

  • KDG makes it easy to send data to Kinesis Streams or Kinesis Firehose. Learn how to use the tool and create templates for your records.
  • Github: awslabs/amazon-kinesis-data-generator
  • Documents Overview
    • click Create a Cognito User with CloudFormation
  • Create stack
    • (the template source path is pre-filled when the page opens)
    • Specify stack details
    • Stack name : Kinesis-Data-Generator-Cognito-User-0912
    • Parameters > Cognito User for Kinesis Data Generator
      • Username : __your_username__
      • Password : __your_password__
    • Configure stack options :
    • Tags : (empty)
    • Permissions : (empty)
    • Review
    • create stack
  • After creating the stack, you can find the KinesisDataGeneratorUrl value on the Outputs tab
    • click url on Value
  • Amazon Kinesis Data Generator

    • enter your username and password
    • Region : us-west-2
    • Stream/delivery stream : tamebda-rta-kinesisfh-prodcat
    • Records per second : 100
    • Record template :

      {
      "productName": "{{commerce.productName}}",
      "color": "{{commerce.color}}",
      "department": "{{commerce.department}}",
      "product": "{{commerce.product}}",
      "imageUrl": "{{image.imageUrl}}",
      "dateSoldSince": "{{date.past}}",
      "dateSoldUntil": "{{date.future}}",
      "price": {{random.number(
          {
              "min": 10,
              "max": 150
          }
      )}},
      "campaign": "{{random.arrayElement(
      ["BlackFriday","10Percent","NONE"]
      )}}"
      }
    • Press Test Template to generate a few records and verify the output looks right.

    • Press Send data
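
KDG fills the record template with faker.js helpers (commerce.productName, random.number, and so on) before each PUT. A rough Python stand-in for that templating, using hypothetical value pools in place of faker.js:

```python
import json
import random
from datetime import datetime, timedelta

# Hypothetical value pools standing in for faker.js's commerce/image helpers.
PRODUCTS = ["Chair", "Table", "Shoes"]
COLORS = ["red", "green", "blue"]
DEPARTMENTS = ["Garden", "Sports", "Toys"]

def fake_record() -> dict:
    """Build one record with the same shape as the KDG template."""
    now = datetime.utcnow()
    return {
        "productName": random.choice(PRODUCTS),
        "color": random.choice(COLORS),
        "department": random.choice(DEPARTMENTS),
        "product": random.choice(PRODUCTS),
        "imageUrl": "http://example.com/img.png",
        "dateSoldSince": (now - timedelta(days=random.randint(1, 365))).isoformat(),
        "dateSoldUntil": (now + timedelta(days=random.randint(1, 365))).isoformat(),
        "price": random.randint(10, 150),           # random.number min/max
        "campaign": random.choice(["BlackFriday", "10Percent", "NONE"]),
    }

record = fake_record()
payload = json.dumps(record)  # what KDG would PUT to the delivery stream
```

At 100 records per second, roughly 6,000 such JSON payloads land on the delivery stream per minute.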

  • After a few minutes, go to Kinesis Data Firehose > select tamebda-rta-kinesisfh-prodcat > Monitoring tab

  • go to S3 > select s0912-tame-bda-immersion > you will see lots of fake data under /raw

1.5 Summary

Congratulations! You have successfully created a serverless ingestion pipeline that is ready to process huge amounts of data.

So, what’s next? In the next lab, you will create the processing pipeline to convert and transform the data from the ingestion layer.

1.6 Deleting Resources

If you are not proceeding to the next lab, make sure you terminate the resources below to avoid unnecessary charges.

  • S3:
    • Delete the buckets and the data inside.
  • Firehose:
    • Delete the delivery stream
  • IAM
    • Security Warning: For the sake of simplicity, some of the permissions in the examples are overly permissive (for example, the IAM role defined for the SageMaker notebook has s3FullAccess permissions; this is a configuration issue that will be fixed in the next version of the CloudFormation template). The permissions shall either be removed for the user after the

2. Session 2 : Glue

2.1 Cataloging a Data Source with AWS Glue

  • Data catalog > select Crawlers > add crawler
  • Crawler name : s0912-tame-bda-immersion-gc (__your_initials__-tame-bda-immersion-gc)
  • Specify crawler source type
    • Crawler source type : Data stores
  • Add a data store
    • Choose a data store : S3
    • Crawl data in : Specified path in my account
    • Include path : s3://s0912-tame-bda-immersion/raw (make sure you select /raw folder)
  • Add another data store : No
  • Choose an IAM role
    • Choose an existing IAM role : scott-workshop-0912-tameGlueRoleSlessDataLakeImmer-< UNIQUE ID >
  • Create a schedule for this crawler
    • Frequency : Run on demand
  • Configure the crawler’s output
    • Add database : s0912-tame-bda-immersion-gdb (< your initials >-tame-bda-immersion-gdb)
  • create crawler > finish
  • select crawler you just created > Run crawler
  • After a few minutes the crawler finishes; go to Database > Tables, and you will see the raw table
  • Review Schema :

  • Edit the schema by renaming the partitions to correct values:
    • Partition_0 -> year
    • Partition_1 -> month
    • Partition_2 -> day
    • Partition_3 -> hour
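
These partitions exist because Firehose writes objects under a UTC-dated key layout, <prefix>YYYY/MM/DD/HH/, which the crawler surfaces as partition_0 through partition_3. A small sketch of that layout and the renaming above (the function name is made up for illustration):

```python
from datetime import datetime, timezone

# The friendly names we give the crawler's partition_0..partition_3.
PARTITION_NAMES = ["year", "month", "day", "hour"]

def firehose_key_prefix(prefix: str, ts: datetime) -> str:
    """Firehose's default S3 key layout: <prefix>YYYY/MM/DD/HH/ (UTC)."""
    return f"{prefix}{ts:%Y/%m/%d/%H}/"

ts = datetime(2019, 9, 12, 7, tzinfo=timezone.utc)
key = firehose_key_prefix("raw/", ts)

# Map each path segment to its renamed partition column.
parts = key.rstrip("/").split("/")[1:]          # date segments only
partitions = dict(zip(PARTITION_NAMES, parts))
```

Renaming the partitions now means later Athena queries can filter on year/month/day/hour instead of the opaque partition_N names.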

2.2 Transforming a Data Source with Glue

  • Glue concepts used in the lab:

    • ETL Operations: Using the metadata in the Data Catalog, AWS Glue can autogenerate Scala or PySpark (the Python API for Apache Spark) scripts with AWS Glue extensions that you can use and modify to perform various ETL operations. For example, you can extract, clean, and transform raw data, and then store the result in a different repository, where it can be queried and analyzed. Such a script might convert a CSV file into a relational form and save it in Amazon Redshift.

    • Jobs- The AWS Glue Jobs system provides managed infrastructure to orchestrate your ETL workflow. You can create jobs in AWS Glue that automate the scripts you use to extract, transform, and transfer data to different locations. Jobs can be scheduled and chained, or they can be triggered by events such as the arrival of new data.

    • AWS Glue runs your ETL jobs in an Apache Spark serverless environment.

    • AWS Glue can generate a script to transform your data. Or, you can provide the script in the AWS Glue console or API.

    • You can run your job on demand, or you can set it up to start when a specified trigger occurs. The trigger can be a time-based schedule or an event.

    • When your job runs, a script extracts data from your data source, transforms the data, and loads it to your data target. The script runs in an Apache Spark serverless environment in AWS Glue.

  • Open S3 service > select bucket you created : s0912-tame-bda-immersion

  • create new folder : scripts-etl

  • Open Glue service > ETL > Jobs > Add job
  • Configure the job properties
    • Name : s0912-gj-tame-bda-kdg-raw2parquet (< your-initials >-gj-tame-bda-kdg-raw2parquet)
    • IAM role : scott-workshop-0912-tameGlueRoleSlessDataLakeImmer-__UNIQUE_ID__
    • Type : Spark
    • Glue version : Spark 2.4, Python3 (Glue version 1.0)
    • This job runs : Select A proposed script generated by AWS Glue
    • Script name : s0912-tame-bda-kdg-raw2parquet (< your-initials >-tame-bda-kdg-raw2parquet)
    • S3 path where the script is stored : s3://s0912-tame-bda-immersion/scripts-etl
    • Temporary directory : Don't change (Glue provides a default value)
    • Advanced properties :
      • Job bookmarks : Enable
    • Monitoring options :
      • Job metrics : Enable
  • Choose a data source
    • select raw table from our Glue DB

  • Choose a transform type
    • ✅ Change schema
  • Choose a data target
    • ✅ Create tables in your data target
    • Data store : S3
    • Format : Parquet
    • Target path : s3://s0912-tame-bda-immersion/compressed-parquet

  • Map the source columns to target columns
    • Remove color from target.
    • Rename datesoldsince to date_start
    • Rename datesolduntil to date_until
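
The column mapping above can be pictured as a plain dictionary transform: any source column absent from the mapping (here, color) is dropped, and the two date columns get new names. A hypothetical pure-Python stand-in for Glue's ApplyMapping step:

```python
# Source-column -> target-column mapping; "color" is intentionally
# absent, so it is removed from the target.
COLUMN_MAPPING = {
    "productname": "productname",
    "department": "department",
    "product": "product",
    "imageurl": "imageurl",
    "datesoldsince": "date_start",
    "datesolduntil": "date_until",
    "price": "price",
    "campaign": "campaign",
}

def apply_mapping(record: dict) -> dict:
    """Keep only mapped columns, writing each under its target name."""
    return {dst: record[src] for src, dst in COLUMN_MAPPING.items() if src in record}

raw = {"productname": "Chair", "color": "red",
       "datesoldsince": "2019-01-01", "datesolduntil": "2019-12-31", "price": 42}
out = apply_mapping(raw)
```

The generated Glue script performs the same projection and renaming, just on DynamicFrames across the whole data set rather than one dict at a time.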

  • Save the script and close the edit script window.
  • Select job: s0912-gj-tame-bda-kdg-raw2parquet > Action > Run Job
  • Wait a few minutes while the job goes through its stages :
    • Starting
    • Running
    • Stopped

  • After the ETL job finishes, you will see the results in S3

2.3 Summary & Next Steps

Congratulations! You have successfully created a processing pipeline without clusters, which is ready to process huge amounts of data.

So, what’s next? Since the amount of data you processed in this lab was tiny, in the next lab, you will work with an open data set by first cataloging it using Glue, and querying using Athena.

3. Athena

3.1 Query your sample data from Glue

  • select Database (Glue) : s0912-tame-bda-immersion-gdb
  • New query :

    select * from raw order by productname limit 20

s0912-tame-bda-immersion-gdb is the database that was created when we first ran the Glue crawler.
We later also ran the ETL job that converted /raw into Parquet format stored under compressed-parquet.
Using the method from section 2.1, we can create a new crawler, s0912-gc-parquet, that uses the same s0912-tame-bda-immersion-gdb database.
The raw table holds the original JSON data; the compressed-parquet table holds the Parquet data.

Comparison : querying the raw table with SQL

Comparison : querying the compressed-parquet table with SQL

Reportedly, the Parquet format shows off its fast-query appeal when scanning large data volumes (> 50 GB?) (…unverified, to be tested 🤔)
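
One intuition behind that claim: columnar formats such as Parquet let a query read only the columns it touches, while row-oriented formats (CSV, JSON) force a scan over every field of every row. A toy illustration in Python (in-memory lists, not real Parquet):

```python
# Three rows of the product data in a row-oriented layout.
rows = [
    {"productname": "Chair", "price": 20, "campaign": "NONE"},
    {"productname": "Table", "price": 120, "campaign": "BlackFriday"},
    {"productname": "Shoes", "price": 45, "campaign": "10Percent"},
]

# Row-oriented: summing one column still touches every field of every row.
fields_touched_row = sum(len(r) for r in rows)

# Column-oriented: the same table stored column-by-column.
columns = {k: [r[k] for r in rows] for k in rows[0]}
total_price = sum(columns["price"])        # touches only the price column
fields_touched_col = len(columns["price"])
```

On disk the effect is larger still, since Parquet also compresses each column and stores statistics that let engines skip whole blocks.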

3.2 Query 200 GB of sample data on Athena

The fake data we generated in this workshop is only about 10 MB, which is too small to reveal any difference in query performance.

So here we borrow someone else's ready-made S3 sample data to test Athena's query performance.

GDELT's public S3 dataset : s3://gdelt-open-data/events/

/* Create the database */
CREATE DATABASE gdelt;
/* Create the events table */
/* using the source data provided by GDELT */
CREATE EXTERNAL TABLE IF NOT EXISTS gdelt.events (
    `globaleventid` INT,
    `day` INT,
    `monthyear` INT,
    `year` INT,
    `fractiondate` FLOAT,
    `actor1code` string,
    `actor1name` string,
    `actor1countrycode` string,
    `actor1knowngroupcode` string,
    `actor1ethniccode` string,
    `actor1religion1code` string,
    `actor1religion2code` string,
    `actor1type1code` string,
    `actor1type2code` string,
    `actor1type3code` string,
    `actor2code` string,
    `actor2name` string,
    `actor2countrycode` string,
    `actor2knowngroupcode` string,
    `actor2ethniccode` string,
    `actor2religion1code` string,
    `actor2religion2code` string,
    `actor2type1code` string,
    `actor2type2code` string,
    `actor2type3code` string,
    `isrootevent` BOOLEAN,
    `eventcode` string,
    `eventbasecode` string,
    `eventrootcode` string,
    `quadclass` INT,
    `goldsteinscale` FLOAT,
    `nummentions` INT,
    `numsources` INT,
    `numarticles` INT,
    `avgtone` FLOAT,
    `actor1geo_type` INT,
    `actor1geo_fullname` string,
    `actor1geo_countrycode` string,
    `actor1geo_adm1code` string,
    `actor1geo_lat` FLOAT,
    `actor1geo_long` FLOAT,
    `actor1geo_featureid` INT,
    `actor2geo_type` INT,
    `actor2geo_fullname` string,
    `actor2geo_countrycode` string,
    `actor2geo_adm1code` string,
    `actor2geo_lat` FLOAT,
    `actor2geo_long` FLOAT,
    `actor2geo_featureid` INT,
    `actiongeo_type` INT,
    `actiongeo_fullname` string,
    `actiongeo_countrycode` string,
    `actiongeo_adm1code` string,
    `actiongeo_lat` FLOAT,
    `actiongeo_long` FLOAT,
    `actiongeo_featureid` INT,
    `dateadded` INT,
    `sourceurl` string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
    'serialization.format' = '\t',
    'field.delim' = '\t'
)
LOCATION 's3://gdelt-open-data/events/';
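
The SerDe above simply splits each raw line on tab characters and maps the fields positionally onto the declared columns. A minimal sketch of that parsing (the function and the sample line are made up; only the first few columns of gdelt.events are shown):

```python
# Toy model of how a tab-delimited SerDe maps a raw line to named, typed columns.
# The split is purely positional: field N goes to column N.
SCHEMA = [("globaleventid", int), ("day", int), ("monthyear", int),
          ("year", int), ("fractiondate", float), ("actor1code", str)]

def parse_line(line: str) -> dict:
    values = line.rstrip("\n").split("\t")  # field.delim = '\t'
    return {name: cast(v) for (name, cast), v in zip(SCHEMA, values)}

row = parse_line("498495468\t20190912\t201909\t2019\t2019.6904\tUSA")
```

Because no parsing happens until query time, creating the table is instant; the cost is that every query must re-read and re-split the raw text, which is exactly what a Parquet copy of the data would avoid.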
-- Find the number of events per year
SELECT year,
COUNT(globaleventid) AS nb_events
FROM gdelt.events
GROUP BY year
ORDER BY year ASC;

Notice the amount of data scanned? The results come back in under 50 seconds while scanning about 190 GB of data across thousands of uncompressed CSV files on S3. That’s the power of Hive, Presto, and other Hadoop technologies, simplified by the Athena service.

  • Following steps 5 ~ 17 of workshop03_Athena.pdf, create the gdelt.eventcodes, gdelt.types, gdelt.groups, and gdelt.countries tables; you can then practice the SQL exercises in steps 22, 23, and 24. (Feeling a bit lazy, I skipped this part here…)

  • Step 25 notes that the query performance you get from uncompressed CSV files is only so-so; switching to the Parquet format improves query performance significantly

  • Since our S3 bucket does not hold all of the GDELT raw CSV data, here is a collection of useful reference material for implementing this

4. Quicksight

4.1 Switch to the Virginia region for permission settings

  • Following point 3 of workshop04_Quicksight.pdf, reconfigure the Athena and S3 permissions

4.2 Create a QuickSight Data Set

  • Click New Analysis > New Data Set > Athena in the upper left, and create a Data source : < any name you like: s0912-demo >
  • QuickSight reads the Database information in the Glue Data Catalog to find the data's metadata, including Database and Table information
    • Select : s0912-tame-bda-immersion-gdb

Since the goal of this workshop is to learn how things work and the data volume is small, we do not use QuickSight SPICE for query acceleration; simply choose Directly query your data and click the Visualize button.

4.3 Start playing around with QuickSight!

Add a visual with the Add button in the upper left. Choose a chart type in the lower left, then drag fields from the Data set onto the corresponding wells of the chart, such as the X and Y axes of a bar chart.