文章目錄
一. 子查詢 Subquery
子查詢就是在一個 SELECT 內再放入一個 SELECT,所以叫做子查詢。可以放於 WHERE 或 SELECT 之後,但必須用小括號包起來。要留意的是啟動順序是會先處理子查詢條件,然後再依子查詢條件處理主查詢。
1.範例:子查詢放於 WHERE 後
SELECT
*
FROM
salaries
WHERE salary > (SELECT avg(salary) from salaries)
2.範例:子查詢放於 SELECT 後
SELECT
*,
(SELECT avg(salary) from salaries) AS avg_salary
FROM
salaries
3.範例:多層子查詢
SELECT
*
FROM
salaries
WHERE
depname in(
SELECT
depname FROM salaries
WHERE
salary > (
SELECT
avg(salary)
FROM salaries))
二. 通用表達式 Common Table Expressions
CTE 最主要有二個用途:簡化子查詢與遞迴查詢
1.簡化子查詢
如果使用多層子查詢,過幾天之後回來看會發現非常難讀懂,而 CTE 可以有助於程式的易讀性,但要留意的是 CTE 的生命週期短,只有第一個 WITH 後面的 SELECT 有效,再一次查詢則會報錯,所以 CTE 只適合用於一次性的場合
WITH cte AS (
SELECT
*
FROM
rfm
)
SELECT
*
FROM
cte
-- 第二次查詢則會報錯
SELECT
*
FROM
cte
2.遞迴查詢
關於 CTE 的遞迴查詢,以下這篇寫了兩個蠻淺顯易懂的範例 code,推薦給大家閱讀:Fun with SQL: Recursive CTEs in Postgres
三. 窗函式 Window Function
通常會把 Window Function 分為三類 :
- Ranking Functions:
- ROW_NUMBER – 編號
- RANK – 排序 (會跳號)
- DENSE_RANK – 排序 (不會跳號)
- NTILE – (平均分配)
- Aggregate Functions:
- Max
- Min
- Sum
- Count
- Avg
- Analytic Functions:
- FIRST_VALUE – 取第一個
- LAST_VALUE – 取最後一個
- LAG – 取後一個
- LEAD – 取前一個
然後用以上 function 搭配 OVER() 使用,OVER 裡面可以使用 PARTITION BY 、ORDER BY 和 ROWS or RANGE。
使用 PARTITION BY 、ORDER BY 的範例如下:
SELECT
*,
rank() OVER (PARTITION BY depname ORDER BY salary) as rank
FROM
salaries
而關於 ROWS 的使用如下:
取得加總目前行 (CURRENT ROW) 和前 2 兩行 (2 PRECEDING)
SELECT
*,
SUM(salary) OVER (ORDER BY salary ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as sum_salary
FROM
salaries;
取得加總目前行 (CURRENT ROW) 和後 2 兩行 (2 FOLLOWING)
SELECT
*,
SUM(salary) OVER (ORDER BY salary ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) as sum_salary
FROM
salaries;
取得加總前 2 兩行 (2 PRECEDING) 和後 2 兩行 (2 FOLLOWING)
SELECT
*,
SUM(salary) OVER (ORDER BY salary ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) as sum_salary
FROM
salaries;
取得加總目前行 (CURRENT ROW) 和之後的每一行 (UNBOUNDED FOLLOWING)
SELECT
*,
CUME_DIST() OVER (PARTITION BY depname ORDER BY salary ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM
salaries
取得加總目前行 (CURRENT ROW) 和之前的每一行 (UNBOUNDED FOLLOWING)
SELECT
*,
SUM(salary) OVER (PARTITION BY depname ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM
salaries;
關於 ROWS 和 RANGES 的更詳盡使用方法,推薦以下兩篇:
- Metrics Maven: Window Frames in PostgreSQL – Compose Articles
- SQLT-SQL 的 Window Function 中使用與 OVER 的搭配 | 五餅二魚工作室 – 點部落
四. 實戰操作
▍本次實際操作目錄如下:
- 用戶活躍度指數 MAU / WAU / DAU
- 會員分類 RFM 模型
- 會員分類 NES 模型
- 用戶留存率 Customer Retention Rate
- 數值區間分類貼標籤 Label
- 計算七日平均 Rolling Averages
- 計算累積加總 Cumulative Sum
1. 實戰:用戶活躍度指數 MAU / WAU / DAU
▍名詞解釋:
- Daily Active Users(單日活躍使用者)
- Weekly Active Users(單週活躍使用者)
- Monthly Active Users(單月活躍使用者)
Active Users 的定義不同,比方說:
1. Twitter 的計算方式是:有 follow 30 人,並且有三分之一的人回 follow,這些人之中每個月有「登入」的就算 MAU。
2. Facebook 的計算方式是計算「每個月有互動的人」,這邊的「互動」包含按讚、留言、分享、私訊、點連結等等。
關於更詳細的 MAU 資訊可參考此篇:MAU 指標的真實面貌
建立 Table
CREATE TABLE users (
id SERIAL primary key,
user_id int,
login_time timestamp
);
注入資料
INSERT INTO users(user_id, login_time)
VALUES
(3, '2020-05-02 06:38:35.162709'),
(1, '2020-05-02 06:38:35.162709'),
(2, '2020-05-09 06:38:35.162709'),
(1, '2020-04-02 06:38:35.162709'),
(2, '2020-04-09 06:38:35.162709'),
(1, '2020-03-02 06:38:35.162709')
方法一
WITH mau AS (
SELECT
DATE_TRUNC('month', login_time) month_timestamp,
COUNT(DISTINCT user_id) mau
FROM users
GROUP BY
DATE_TRUNC('month', login_time)
)
SELECT
a.month_timestamp previous_month,
a.mau previous_mau,
b.month_timestamp current_month,
b.mau current_mau,
ROUND(100.0 *(b.mau - a.mau) / a.mau, 2) AS percent_change
FROM mau a
JOIN mau b ON a.month_timestamp = b.month_timestamp - interval '1 month';
方法二
WITH mau AS (
SELECT
DATE_TRUNC('month', login_time) month_timestamp,
COUNT(DISTINCT user_id) mau
FROM users
GROUP BY
DATE_TRUNC('month', login_time)
)
SELECT
*,
LAG(mau) over() as lead,
ROUND(100.0 * (mau - LAG(mau) over()) / LAG(mau) over(),2)
FROM
mau
2.實戰:用戶留存率 Customer Retention Rate
▍名詞解釋:
公式:每月回客率 = 當月舊客戶/當月總客戶
關於更詳細的 用戶留存率 資訊可參考此篇:
如何實際計算 Retention、LTV ,並產出可執行的商業計畫 – Lucas Hsieh – Medium
建立 Table
CREATE TABLE users (
id SERIAL primary key,
user_id int,
login_time timestamp
);
注入資料
INSERT INTO users(user_id, login_time)
VALUES
(3, '2020-05-02 06:38:35.162709'),
(1, '2020-05-02 06:38:35.162709'),
(2, '2020-05-09 06:38:35.162709'),
(1, '2020-04-02 06:38:35.162709'),
(2, '2020-04-09 06:38:35.162709'),
(1, '2020-03-02 06:38:35.162709')
query
with base as (
/*
計算每月不重複 user_id
*/
SELECT DISTINCT DATE_trunc('month', login_time) as month,
user_id
FROM users
),
base_total_count as (
/*
計算每月不重複訪客
*/
SELECT DISTINCT DATE_trunc('month', login_time) as month,
COUNT(DISTINCT user_id) as total_user
FROM users
GROUP BY month
)
SELECT b.month,
count(DISTINCT b.user_id) AS retention_user,
MAX(c.total_user) AS month_user,
round(
count(DISTINCT b.user_id) :: numeric / MAX(c.total_user) :: numeric,
2
) AS retention_rate
FROM base a
JOIN base b ON a.user_id = b.user_id
and a.month = b.month - INTERVAL '1 month'
JOIN base_total_count c on c.month = b.month
GROUP BY b.month
3.實戰:會員分類 RFM 模型
在眾多CRM客戶資料分析中,RFM模型是被最常被廣泛提到的,其中的三大要素為R(Recency)、F(Frequency)、M(Monetary):
1. 最近一次消費 (Recency)
2. 消費頻率 (Frequency)
3. 消費金額 (Monetary)
什麼是需要準備的資料?
1. cid:訂單編號
2. date:訂單日期
3. money:訂單花費
關於更詳細資訊或 Python 版本可以參考此篇:數據分析 Python快速計算RFM模型 | Max行銷誌
WITH base AS (
SELECT
cid,
max(date),
date_part('day',
cast(now() AS TIMESTAMP) - cast(max(date) AS TIMESTAMP)) AS rencency,
sum(money) AS money,
count(cid) AS frequency
FROM
rfm_all
GROUP BY
cid
),
data_m AS (
SELECT
cid,
ntile(4) OVER (ORDER BY money) AS rfm_m
FROM
base
),
data_r AS (
SELECT
cid,
ntile(4) OVER (ORDER BY rencency DESC) AS rfm_r
FROM
base
),
data_f AS (
SELECT
cid,
ntile(4) OVER (ORDER BY frequency) AS rfm_f
FROM
base
)
SELECT
base.*,
rfm_r,
rfm_f,
rfm_m
FROM
base
LEFT JOIN data_m ON base.cid = data_m.cid
LEFT JOIN data_r ON base.cid = data_r.cid
LEFT JOIN data_f ON base.cid = data_f.cid
4.實戰:會員分類 NES 模型
NES模型是為了掌握顧客的消費行為變動性而設計,將客戶分群成為三大類,共五種標籤:
* N 新顧客(New Customer)
* E 既有顧客(Existing Customer)
* E0 主力顧客:購買週期 1 倍時間內回購的人
* E1 瞌睡顧客:超過購買週期 1~2 倍未回購的人
* E2 半睡顧客:超過購買週期 2~3 倍未回購的人
* S1 沉睡顧客(Sleeping Customer)購買頻率超過購買週期3倍未回購的人
什麼是需要準備的資料?
1. cid:訂單編號
2. date:訂單日期
3. money:訂單花費
關於更詳細資訊或 Python 版本可以參考此篇:數據分析 Python快速計算NES模型 | Max行銷誌
WITH base AS (
SELECT
cid,
max(date),
date_part('day',
cast(now() AS TIMESTAMP) - cast(max(date) AS TIMESTAMP)) AS rencency,
sum(money) AS money,
count(cid) AS frequency
FROM
rfm_all
GROUP BY
cid
)
SELECT
*,
CASE
WHEN rencency < (select AVG(rencency) from base)*1 then 'E0'
WHEN rencency >= (select AVG(rencency) from base)*1 and rencency < (select AVG(rencency) from base)*2 then 'S1'
WHEN rencency >= (select AVG(rencency) from base)*2 and rencency < (select AVG(rencency) from base)*3 then 'S2'
WHEN rencency >= (select AVG(rencency) from base)*3 then 'S3'
END
FROM
base
5.實戰:數值區間分類貼標籤 Label
目標:將 length_seconds 數值切成每 5 為一區間 (20-25、25-30 …) 並統計此區隔內的 length_seconds 次數是多少
| session_id | length_seconds |
|------------|----------------|
| 1 | 23 |
| 2 | 43 |
| 3 | 67 |
| .. | .. |
| bucket | count |
|---------|-------|
| 20-25 | 2 |
| 25-30 | 1 |
建立 Table
CREATE TABLE Histograms (sid INT, len_seconds INT)
注入資料
INSERT INTO Histograms (sid, len_seconds) VALUES
(1, 23),
(2, 43),
(3, 63),
(4, 73),
(5, 13),
(6, 83),
(7, 84),
(8, 82),
(9, 63),
(10, 65)
方法一
WITH base AS (
SELECT
*,
floor(len_seconds / 5) AS label
FROM
Histograms
)
SELECT
CONCAT(label * 5, '-', label * 5 + 5) bucket,
COUNT(DISTINCT sid)
FROM
base
GROUP BY
label
ORDER BY
label ASC
方法二
SELECT
count(len_seconds),
concat(FLOOR(len_seconds / 5) * 5, '-', FLOOR(len_seconds / 5) * 5 + 5)
FROM
Histograms
GROUP BY
concat(FLOOR(len_seconds / 5) * 5, '-', FLOOR(len_seconds / 5) * 5 + 5)
6.實戰:計算七日平均 Rolling Averages
建立 Table
CREATE TABLE rolling_avg (date TIMESTAMP,sign_up int)
注入資料
INSERT INTO rolling_avg(date,sign_up) VALUES
('2020-05-02 06:38:35.162709', 100),
('2020-05-03 06:38:35.162709', 200),
('2020-05-04 06:38:35.162709', 300),
('2020-05-05 06:38:35.162709', 400),
('2020-05-06 06:38:35.162709', 500),
('2020-05-07 06:38:35.162709', 600),
('2020-05-08 06:38:35.162709', 700),
('2020-05-09 06:38:35.162709', 800),
('2020-05-10 06:38:35.162709', 900),
('2020-05-11 06:38:35.162709', 1000),
('2020-05-12 06:38:35.162709', 1100),
('2020-05-13 06:38:35.162709', 1200),
('2020-05-14 06:38:35.162709', 1300),
('2020-05-15 06:38:35.162709', 1400)
query
SELECT
a.date,
MAX(a.sign_up) as sign_up,
avg(b.sign_up)
FROM
rolling_avg as a
join
rolling_avg as b on a.date <= b.date + INTERVAL '6 days' AND a.date >= b.date
GROUP BY
a.date
7.實戰:計算累積加總 Cumulative Sum
目標:將 cash_flow 採用累加方式計算成新的欄位
| date | cash_flow |
|------------|-----------|
| 2020-05-02 | 100 |
| 2020-05-03 | 200 |
| 2020-05-04 | 300 |
| ... | ... |
| date | cumulative_cf |
|------------|---------------|
| 2020-05-02 | 100 |
| 2020-05-03 | 300 |
| 2020-05-04 | 600 |
| ... | ... |
建立 Table
CREATE TABLE cumulative_sum (date TIMESTAMP, cash_flow int);
注入資料
INSERT INTO cumulative_sum(date, cash_flow) VALUES
('2020-05-02 06:38:35.162709', 100),
('2020-05-03 06:38:35.162709', 200),
('2020-05-04 06:38:35.162709', 300),
('2020-05-05 06:38:35.162709', 400),
('2020-05-06 06:38:35.162709', 500)
方法 A
SELECT
a.date,
max(a.cash_flow),
sum(b.cash_flow) as sum_cash_flow
from
cumulative_sum as a
JOIN
cumulative_sum as b on a.date >= b.date
GROUP BY
a.date
方法 B – window function
SELECT
date,
cash_flow,
sum(cash_flow) over (ORDER BY date ASC) as cumulative_cf
FROM
cumulative_sum
ORDER BY
date ASC
最後~
▍回顧本篇我們介紹了的 PostgreSQL 內容:
- PostgreSQL 介紹
- 子查詢 Subquery
- 通用表達式 Common Table Expressions
- 窗函式 Window Function
- 實戰操作
- 用戶活躍度指數 MAU / WAU / DAU
- 會員分類 RFM 模型
- 會員分類 NES 模型
- 用戶留存率 Customer Retention Rate
- 數值區間分類貼標籤 Label
- 計算七日平均 Rolling Averages
- 計算累積加總 Cumulative Sum
那 【資料庫筆記】 PostgreSQL 基礎教學和練習題操作 的介紹就到這邊告一個段落囉!有任何問題可以在以下留言~
有關 Max行銷誌的最新文章,都會發佈在 Max 的 Facebook 粉絲專頁,如果想看最新更新,還請您按讚或是追蹤唷!