Table
一. 子查詢 Subquery
子查詢就是在一個 SELECT 內再放入一個 SELECT,所以叫做子查詢。可以放於 WHERE 或 SELECT 之後,但必須用小括號包起來。要留意的是啟動順序是會先處理子查詢條件,然後再依子查詢條件處理主查詢。
1.範例:子查詢放於 WHERE 後
1 2 3 4 5 |
SELECT * FROM salaries WHERE salary > (SELECT avg(salary) from salaries) |
2.範例:子查詢放於 SELECT 後
1 2 3 4 5 |
SELECT *, (SELECT avg(salary) from salaries) AS avg_salary FROM salaries |
3.範例:多層子查詢
1 2 3 4 5 6 7 8 9 10 11 12 13 |
SELECT * FROM salaries WHERE depname in( SELECT depname FROM salaries WHERE salary > ( SELECT avg(salary) FROM salaries)) |
二. 通用表達式 Common Table Expressions
CTE 最主要有二個用途:簡化子查詢與遞迴查詢
1.簡化子查詢
如果使用多層子查詢,過幾天之後回來看會發現非常難讀懂,而 CTE 可以有助於程式的易讀性,但要留意的是 CTE 的生命週期短,只有第一個 WITH 後面的 SELECT 有效,再一次查詢則會報錯,所以 CTE 只適合用於一次性的場合
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
WITH cte AS ( SELECT * FROM rfm ) SELECT * FROM cte -- 第二次查詢則會報錯 SELECT * FROM cte |
2.遞迴查詢
關於 CTE 的遞迴查詢,以下這篇寫了兩個蠻淺顯易懂的範例 code,推薦給大家閱讀:Fun with SQL: Recursive CTEs in Postgres
三. 窗函式 Window Function
通常會把 Window Function 分為三類 :
- Ranking Functions:
- ROW_NUMBER – 編號
- RANK – 排序 (會跳號)
- DENSE_RANK – 排序 (不會跳號)
- NTILE – (平均分配)
- Aggregate Functions:
- Max
- Min
- Sum
- Count
- Avg
- Analytic Functions:
- FIRST_VALUE – 取第一個
- LAST_VALUE – 取最後一個
- LAG – 取後一個
- LEAD – 取前一個
然後用以上 function 搭配 OVER() 使用,OVER 裡面可以使用 PARTITION BY 、ORDER BY 和 ROWS or RANGE。
使用 PARTITION BY 、ORDER BY 的範例如下:
1 2 3 4 5 |
SELECT *, rank() OVER (PARTITION BY depname ORDER BY salary) as rank FROM salaries |
而關於 ROWS 的使用如下:
取得加總目前行 (CURRENT ROW) 和前 2 兩行 (2 PRECEDING)
1 2 3 4 5 |
SELECT *, SUM(salary) OVER (ORDER BY salary ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as sum_salary FROM salaries; |
取得加總目前行 (CURRENT ROW) 和後 2 兩行 (2 FOLLOWING)
1 2 3 4 5 |
SELECT *, SUM(salary) OVER (ORDER BY salary ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) as sum_salary FROM salaries; |
取得加總前 2 兩行 (2 PRECEDING) 和後 2 兩行 (2 FOLLOWING)
1 2 3 4 5 |
SELECT *, SUM(salary) OVER (ORDER BY salary ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) as sum_salary FROM salaries; |
取得加總目前行 (CURRENT ROW) 和之後的每一行 (UNBOUNDED FOLLOWING)
1 2 3 4 5 |
SELECT *, CUME_DIST() OVER (PARTITION BY depname ORDER BY salary ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) FROM salaries |
取得加總目前行 (CURRENT ROW) 和之前的每一行 (UNBOUNDED FOLLOWING)
1 2 3 4 5 |
SELECT *, SUM(salary) OVER (PARTITION BY depname ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM salaries; |
關於 ROWS 和 RANGES 的更詳盡使用方法,推薦以下兩篇:
- Metrics Maven: Window Frames in PostgreSQL – Compose Articles
- SQLT-SQL 的 Window Function 中使用與 OVER 的搭配 | 五餅二魚工作室 – 點部落
四. 實戰操作
▍本次實際操作目錄如下:
- 用戶活躍度指數 MAU / WAU / DAU
- 會員分類 RFM 模型
- 會員分類 NES 模型
- 用戶留存率 Customer Retention Rate
- 數值區間分類貼標籤 Label
- 計算七日平均 Rolling Averages
- 計算累積加總 Cumulative Sum
1. 實戰:用戶活躍度指數 MAU / WAU / DAU
▍名詞解釋:
- Daily Active Users(單日活躍使用者)
- Weekly Active Users(單週活躍使用者)
- Monthly Active Users(單月活躍使用者)
Active Users 的定義不同,比方說:
1. Twitter 的計算方式是:有 follow 30 人,並且有三分之一的人回 follow,這些人之中每個月有「登入」的就算 MAU。
2. Facebook 的計算方式是計算「每個月有互動的人」,這邊的「互動」包含按讚、留言、分享、私訊、點連結等等。
關於更詳細的 MAU 資訊可參考此篇:MAU 指標的真實面貌
建立 Table
1 2 3 4 5 |
CREATE TABLE users ( id SERIAL primary key, user_id int, login_time timestamp ); |
注入資料
1 2 3 4 5 6 7 8 |
INSERT INTO users(user_id, login_time) VALUES (3, '2020-05-02 06:38:35.162709'), (1, '2020-05-02 06:38:35.162709'), (2, '2020-05-09 06:38:35.162709'), (1, '2020-04-02 06:38:35.162709'), (2, '2020-04-09 06:38:35.162709'), (1, '2020-03-02 06:38:35.162709') |
方法一
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
WITH mau AS ( SELECT DATE_TRUNC('month', login_time) month_timestamp, COUNT(DISTINCT user_id) mau FROM users GROUP BY DATE_TRUNC('month', login_time) ) SELECT a.month_timestamp previous_month, a.mau previous_mau, b.month_timestamp current_month, b.mau current_mau, ROUND(100.0 *(b.mau - a.mau) / a.mau, 2) AS percent_change FROM mau a JOIN mau b ON a.month_timestamp = b.month_timestamp - interval '1 month'; |
方法二
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
WITH mau AS ( SELECT DATE_TRUNC('month', login_time) month_timestamp, COUNT(DISTINCT user_id) mau FROM users GROUP BY DATE_TRUNC('month', login_time) ) SELECT *, LAG(mau) over() as lead, ROUND(100.0 * (mau - LAG(mau) over()) / LAG(mau) over(),2) FROM mau |
2.實戰:用戶留存率 Customer Retention Rate
▍名詞解釋:
公式:每月回客率 = 當月舊客戶/當月總客戶
關於更詳細的 用戶留存率 資訊可參考此篇:
如何實際計算 Retention、LTV ,並產出可執行的商業計畫 – Lucas Hsieh – Medium
建立 Table
1 2 3 4 5 |
CREATE TABLE users ( id SERIAL primary key, user_id int, login_time timestamp ); |
注入資料
1 2 3 4 5 6 7 8 |
INSERT INTO users(user_id, login_time) VALUES (3, '2020-05-02 06:38:35.162709'), (1, '2020-05-02 06:38:35.162709'), (2, '2020-05-09 06:38:35.162709'), (1, '2020-04-02 06:38:35.162709'), (2, '2020-04-09 06:38:35.162709'), (1, '2020-03-02 06:38:35.162709') |
query
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
with base as ( /* 計算每月不重複 user_id */ SELECT DISTINCT DATE_trunc('month', login_time) as month, user_id FROM users ), base_total_count as ( /* 計算每月不重複訪客 */ SELECT DISTINCT DATE_trunc('month', login_time) as month, COUNT(DISTINCT user_id) as total_user FROM users GROUP BY month ) SELECT b.month, count(DISTINCT b.user_id) AS retention_user, MAX(c.total_user) AS month_user, round( count(DISTINCT b.user_id) :: numeric / MAX(c.total_user) :: numeric, 2 ) AS retention_rate FROM base a JOIN base b ON a.user_id = b.user_id and a.month = b.month - INTERVAL '1 month' JOIN base_total_count c on c.month = b.month GROUP BY b.month |
3.實戰:會員分類 RFM 模型
在眾多CRM客戶資料分析中,RFM模型是被最常被廣泛提到的,其中的三大要素為R(Recency)、F(Frequency)、M(Monetary):
1. 最近一次消費 (Recency)
2. 消費頻率 (Frequency)
3. 消費金額 (Monetary)
什麼是需要準備的資料?
1. cid:訂單編號
2. date:訂單日期
3. money:訂單花費
關於更詳細資訊或 Python 版本可以參考此篇:數據分析 Python快速計算RFM模型 | Max行銷誌
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 |
WITH base AS ( SELECT cid, max(date), date_part('day', cast(now() AS TIMESTAMP) - cast(max(date) AS TIMESTAMP)) AS rencency, sum(money) AS money, count(cid) AS frequency FROM rfm_all GROUP BY cid ), data_m AS ( SELECT cid, ntile(4) OVER (ORDER BY money) AS rfm_m FROM base ), data_r AS ( SELECT cid, ntile(4) OVER (ORDER BY rencency DESC) AS rfm_r FROM base ), data_f AS ( SELECT cid, ntile(4) OVER (ORDER BY frequency) AS rfm_f FROM base ) SELECT base.*, rfm_r, rfm_f, rfm_m FROM base LEFT JOIN data_m ON base.cid = data_m.cid LEFT JOIN data_r ON base.cid = data_r.cid LEFT JOIN data_f ON base.cid = data_f.cid |
4.實戰:會員分類 NES 模型
NES模型是為了掌握顧客的消費行為變動性而設計,將客戶分群成為三大類,共五種標籤:
* N 新顧客(New Customer)
* E 既有顧客(Existing Customer)
* E0 主力顧客:購買週期 1 倍時間內回購的人
* E1 瞌睡顧客:超過購買週期 1~2 倍未回購的人
* E2 半睡顧客:超過購買週期 2~3 倍未回購的人
* S1 沉睡顧客(Sleeping Customer)購買頻率超過購買週期3倍未回購的人
什麼是需要準備的資料?
1. cid:訂單編號
2. date:訂單日期
3. money:訂單花費
關於更詳細資訊或 Python 版本可以參考此篇:數據分析 Python快速計算NES模型 | Max行銷誌
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
WITH base AS ( SELECT cid, max(date), date_part('day', cast(now() AS TIMESTAMP) - cast(max(date) AS TIMESTAMP)) AS rencency, sum(money) AS money, count(cid) AS frequency FROM rfm_all GROUP BY cid ) SELECT *, CASE WHEN rencency < (select AVG(rencency) from base)*1 then 'E0' WHEN rencency >= (select AVG(rencency) from base)*1 and rencency < (select AVG(rencency) from base)*2 then 'S1' WHEN rencency >= (select AVG(rencency) from base)*2 and rencency < (select AVG(rencency) from base)*3 then 'S2' WHEN rencency >= (select AVG(rencency) from base)*3 then 'S3' END FROM base |
5.實戰:數值區間分類貼標籤 Label
目標:將 length_seconds 數值切成每 5 為一區間 (20-25、25-30 …) 並統計此區隔內的 length_seconds 次數是多少
1 2 3 4 5 6 7 8 9 10 11 12 |
| session_id | length_seconds | |------------|----------------| | 1 | 23 | | 2 | 43 | | 3 | 67 | | .. | .. | | bucket | count | |---------|-------| | 20-25 | 2 | | 25-30 | 1 | |
建立 Table
1 |
CREATE TABLE Histograms (sid INT, len_seconds INT) |
注入資料
1 2 3 4 5 6 7 8 9 10 11 |
INSERT INTO Histograms (sid, len_seconds) VALUES (1, 23), (2, 43), (3, 63), (4, 73), (5, 13), (6, 83), (7, 84), (8, 82), (9, 63), (10, 65) |
方法一
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
WITH base AS ( SELECT *, floor(len_seconds / 5) AS label FROM Histograms ) SELECT CONCAT(label * 5, '-', label * 5 + 5) bucket, COUNT(DISTINCT sid) FROM base GROUP BY label ORDER BY label ASC |
方法二
1 2 3 4 5 6 7 |
SELECT count(len_seconds), concat(FLOOR(len_seconds / 5) * 5, '-', FLOOR(len_seconds / 5) * 5 + 5) FROM Histograms GROUP BY concat(FLOOR(len_seconds / 5) * 5, '-', FLOOR(len_seconds / 5) * 5 + 5) |
6.實戰:計算七日平均 Rolling Averages
建立 Table
1 |
CREATE TABLE rolling_avg (date TIMESTAMP,sign_up int) |
注入資料
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
INSERT INTO rolling_avg(date,sign_up) VALUES ('2020-05-02 06:38:35.162709', 100), ('2020-05-03 06:38:35.162709', 200), ('2020-05-04 06:38:35.162709', 300), ('2020-05-05 06:38:35.162709', 400), ('2020-05-06 06:38:35.162709', 500), ('2020-05-07 06:38:35.162709', 600), ('2020-05-08 06:38:35.162709', 700), ('2020-05-09 06:38:35.162709', 800), ('2020-05-10 06:38:35.162709', 900), ('2020-05-11 06:38:35.162709', 1000), ('2020-05-12 06:38:35.162709', 1100), ('2020-05-13 06:38:35.162709', 1200), ('2020-05-14 06:38:35.162709', 1300), ('2020-05-15 06:38:35.162709', 1400) |
query
1 2 3 4 5 6 7 8 9 10 |
SELECT a.date, MAX(a.sign_up) as sign_up, avg(b.sign_up) FROM rolling_avg as a join rolling_avg as b on a.date <= b.date + INTERVAL '6 days' AND a.date >= b.date GROUP BY a.date |
7.實戰:計算累積加總 Cumulative Sum
目標:將 cash_flow 採用累加方式計算成新的欄位
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
| date | cash_flow | |------------|-----------| | 2020-05-02 | 100 | | 2020-05-03 | 200 | | 2020-05-04 | 300 | | ... | ... | | date | cumulative_cf | |------------|---------------| | 2020-05-02 | 100 | | 2020-05-03 | 300 | | 2020-05-04 | 600 | | ... | ... | |
建立 Table
1 |
CREATE TABLE cumulative_sum (date TIMESTAMP, cash_flow int); |
注入資料
1 2 3 4 5 6 |
INSERT INTO cumulative_sum(date, cash_flow) VALUES ('2020-05-02 06:38:35.162709', 100), ('2020-05-03 06:38:35.162709', 200), ('2020-05-04 06:38:35.162709', 300), ('2020-05-05 06:38:35.162709', 400), ('2020-05-06 06:38:35.162709', 500) |
方法 A
1 2 3 4 5 6 7 8 9 10 |
SELECT a.date, max(a.cash_flow), sum(b.cash_flow) as sum_cash_flow from cumulative_sum as a JOIN cumulative_sum as b on a.date >= b.date GROUP BY a.date |
方法 B – window function
1 2 3 4 5 6 7 8 |
SELECT date, cash_flow, sum(cash_flow) over (ORDER BY date ASC) as cumulative_cf FROM cumulative_sum ORDER BY date ASC |
最後~
▍回顧本篇我們介紹了的 PostgreSQL 內容:
- PostgreSQL 介紹
- 子查詢 Subquery
- 通用表達式 Common Table Expressions
- 窗函式 Window Function
- 實戰操作
- 用戶活躍度指數 MAU / WAU / DAU
- 會員分類 RFM 模型
- 會員分類 NES 模型
- 用戶留存率 Customer Retention Rate
- 數值區間分類貼標籤 Label
- 計算七日平均 Rolling Averages
- 計算累積加總 Cumulative Sum
那 【資料庫筆記】 PostgreSQL 基礎教學和練習題操作 的介紹就到這邊告一個段落囉!有任何問題可以在以下留言~
有關 Max行銷誌的最新文章,都會發佈在 Max 的 Facebook 粉絲專頁,如果想看最新更新,還請您按讚或是追蹤唷!