
[Database Notes] PostgreSQL Basics Tutorial and Practice Exercises


I. Subquery

A subquery is simply a SELECT placed inside another SELECT, hence the name. It can appear after WHERE or in the SELECT list, but it must be wrapped in parentheses. Note the order of execution: the subquery is evaluated first, and its result is then used by the outer (main) query.

1. Example: subquery in the WHERE clause

SELECT
    *
FROM
    salaries
WHERE salary > (SELECT avg(salary) from salaries)

2. Example: subquery in the SELECT list

SELECT
    *,
    (SELECT avg(salary) from salaries) AS avg_salary
FROM
    salaries

3. Example: nested subqueries

SELECT
    *
FROM
    salaries
WHERE
    depname IN (
        SELECT
            depname
        FROM
            salaries
        WHERE
            salary > (
                SELECT
                    avg(salary)
                FROM
                    salaries))

II. Common Table Expressions (CTE)

A CTE has two main uses: simplifying subqueries and writing recursive queries.

1. Simplifying subqueries

Deeply nested subqueries are hard to read when you come back to them a few days later; a CTE makes the query much easier to follow. Note that a CTE's lifetime is short: it is only visible to the single statement that immediately follows the WITH clause, and referencing it again in a later statement raises an error, so a CTE is only suited to one-off use.

WITH cte AS (
    SELECT
        *
    FROM
        rfm
)
SELECT
    *
FROM
    cte

-- Querying the CTE again in a second statement raises an error
SELECT
    *
FROM
    cte

2. Recursive queries

For recursive CTEs, the article below walks through two clear, easy-to-follow example queries and is well worth reading: Fun with SQL: Recursive CTEs in Postgres
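If you want a quick taste before reading that article, here is a minimal sketch (not from the original post) of a recursive CTE that generates the integers 1 through 10:

WITH RECURSIVE numbers AS (
    SELECT 1 AS n          -- non-recursive (anchor) term: seeds the result set
    UNION ALL
    SELECT n + 1           -- recursive term: references the CTE itself
    FROM numbers
    WHERE n < 10           -- recursion stops once this condition no longer matches
)
SELECT n FROM numbers;

The RECURSIVE keyword is required whenever the CTE refers to itself.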

III. Window Functions

Window functions are usually grouped into three categories:

  1. Ranking Functions:
    • ROW_NUMBER – assigns a sequential row number
    • RANK – rank (ties share a rank, with gaps after them)
    • DENSE_RANK – rank (ties share a rank, no gaps)
    • NTILE – splits rows evenly into a given number of buckets
  2. Aggregate Functions:
    • MAX
    • MIN
    • SUM
    • COUNT
    • AVG
  3. Analytic Functions:
    • FIRST_VALUE – first value in the frame
    • LAST_VALUE – last value in the frame
    • LAG – value from the previous row
    • LEAD – value from the next row

These functions are combined with OVER(); inside OVER() you can specify PARTITION BY, ORDER BY, and a ROWS or RANGE frame clause.

An example using PARTITION BY and ORDER BY:

SELECT
    *,
    rank() OVER (PARTITION BY depname ORDER BY salary) as rank
FROM
    salaries
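The analytic functions from the third category follow the same pattern. As a minimal sketch (assuming the same salaries table with depname and salary columns used throughout this section), this compares each row's salary with its neighbours inside the department:

SELECT
    *,
    -- previous row's salary within the department (NULL on the first row)
    LAG(salary) OVER (PARTITION BY depname ORDER BY salary) AS prev_salary,
    -- next row's salary within the department (NULL on the last row)
    LEAD(salary) OVER (PARTITION BY depname ORDER BY salary) AS next_salary,
    -- lowest salary in the department (first value in salary order)
    FIRST_VALUE(salary) OVER (PARTITION BY depname ORDER BY salary) AS lowest_salary
FROM
    salaries;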

The ROWS frame clause is used as follows:

Sum the current row (CURRENT ROW) and the 2 rows before it (2 PRECEDING):

SELECT
    *,
    SUM(salary) OVER (ORDER BY salary ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as sum_salary
FROM
    salaries;

Sum the current row (CURRENT ROW) and the 2 rows after it (2 FOLLOWING):

SELECT
    *,
    SUM(salary) OVER (ORDER BY salary ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) as sum_salary
FROM
    salaries;

Sum the frame from 2 rows before (2 PRECEDING) to 2 rows after (2 FOLLOWING) the current row, current row included:

SELECT
    *,
    SUM(salary) OVER (ORDER BY salary ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) as sum_salary
FROM
    salaries;

Sum the current row (CURRENT ROW) and every row after it (UNBOUNDED FOLLOWING):

SELECT
    *,
    SUM(salary) OVER (PARTITION BY depname ORDER BY salary ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS sum_salary
FROM
    salaries

Sum every row from the start of the frame (UNBOUNDED PRECEDING) up to the current row (CURRENT ROW):

SELECT
    *,
    SUM(salary) OVER (PARTITION BY depname ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM
    salaries;

For a more detailed treatment of ROWS and RANGE, two further articles are recommended.
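As a quick illustration of the difference, here is a minimal sketch (again assuming the salaries table from the examples above): with ROWS the frame is counted in physical rows, while with RANGE every peer row that shares the current ORDER BY value joins the frame together.

SELECT
    *,
    -- ROWS: the running total advances one physical row at a time, even across ties
    SUM(salary) OVER (ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum_rows,
    -- RANGE: all rows tied on salary enter the frame together,
    -- so tied rows show the same running total
    SUM(salary) OVER (ORDER BY salary RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum_range
FROM
    salaries;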

IV. Hands-on Practice

▍The hands-on exercises covered in this post:

  • User activity metrics MAU / WAU / DAU
  • Customer segmentation: the RFM model
  • Customer segmentation: the NES model
  • Customer Retention Rate
  • Labeling numeric value ranges (bucketing)
  • 7-day Rolling Averages
  • Cumulative Sum

1. Hands-on: User activity metrics MAU / WAU / DAU

▍Terminology:

  • Daily Active Users (DAU): users active within a single day
  • Weekly Active Users (WAU): users active within a single week
  • Monthly Active Users (MAU): users active within a single month

The definition of an "active user" varies by product. For example:
1. Twitter counts a user toward MAU if they follow at least 30 people, at least one third of those people follow them back, and they have logged in during the month.
2. Facebook counts "people who interacted during the month", where interaction includes liking, commenting, sharing, private messaging, clicking links, and so on.

For more details on MAU, see: MAU 指標的真實面貌

Create the table

CREATE TABLE users (
  id SERIAL primary key,
  user_id int,
  login_time timestamp
);

Insert data

INSERT INTO users(user_id, login_time)
VALUES
  (3, '2020-05-02 06:38:35.162709'),
  (1, '2020-05-02 06:38:35.162709'),
  (2, '2020-05-09 06:38:35.162709'),
  (1, '2020-04-02 06:38:35.162709'),
  (2, '2020-04-09 06:38:35.162709'),
  (1, '2020-03-02 06:38:35.162709')

Method 1

WITH mau AS (
    SELECT
      DATE_TRUNC('month', login_time) month_timestamp,
      COUNT(DISTINCT user_id) mau
    FROM users
    GROUP BY
      DATE_TRUNC('month', login_time)
  )
SELECT
  a.month_timestamp previous_month,
  a.mau previous_mau,
  b.month_timestamp current_month,
  b.mau current_mau,
  ROUND(100.0 *(b.mau - a.mau) / a.mau, 2) AS percent_change
FROM mau a
JOIN mau b ON a.month_timestamp = b.month_timestamp - interval '1 month';

Method 2

WITH mau AS (
    SELECT
      DATE_TRUNC('month', login_time) month_timestamp,
      COUNT(DISTINCT user_id) mau
    FROM users
    GROUP BY
      DATE_TRUNC('month', login_time)
  )
SELECT
  *,
  LAG(mau) OVER (ORDER BY month_timestamp) AS previous_mau,
  ROUND(100.0 * (mau - LAG(mau) OVER (ORDER BY month_timestamp)) / LAG(mau) OVER (ORDER BY month_timestamp), 2) AS percent_change
FROM 
  mau

2. Hands-on: Customer Retention Rate

▍Terminology:

Formula: monthly retention rate = returning customers this month / total customers this month. For example, if 2 of this month's 3 distinct customers also made a purchase last month, the retention rate is 2 / 3 ≈ 0.67.

For a more detailed discussion of retention, see:
如何實際計算 Retention、LTV ,並產出可執行的商業計畫 – Lucas Hsieh – Medium

Create the table

CREATE TABLE users (
  id SERIAL primary key,
  user_id int,
  login_time timestamp
);

Insert data

INSERT INTO users(user_id, login_time)
VALUES
  (3, '2020-05-02 06:38:35.162709'),
  (1, '2020-05-02 06:38:35.162709'),
  (2, '2020-05-09 06:38:35.162709'),
  (1, '2020-04-02 06:38:35.162709'),
  (2, '2020-04-09 06:38:35.162709'),
  (1, '2020-03-02 06:38:35.162709')

Query

with base as (
    /*
     distinct user_id values for each month
     */
    SELECT DISTINCT DATE_TRUNC('month', login_time) as month,
        user_id
    FROM users
),
base_total_count as (
    /*
     number of distinct users for each month
     */
    SELECT DATE_TRUNC('month', login_time) as month,
        COUNT(DISTINCT user_id) as total_user
    FROM users
    GROUP BY month
)
SELECT b.month,
    count(DISTINCT b.user_id) AS retention_user,
    MAX(c.total_user) AS month_user,
    round(
        count(DISTINCT b.user_id) :: numeric / MAX(c.total_user) :: numeric,
        2
    ) AS retention_rate
FROM base a
    JOIN base b ON a.user_id = b.user_id
    and a.month = b.month - INTERVAL '1 month'
    JOIN base_total_count c on c.month = b.month
GROUP BY b.month

3. Hands-on: Customer segmentation with the RFM model

Among the many approaches to CRM customer analysis, the RFM model is the one most widely cited. Its three elements are R (Recency), F (Frequency), and M (Monetary):
1. Recency – how recently the customer last made a purchase
2. Frequency – how often the customer purchases
3. Monetary – how much the customer spends

What data do you need to prepare?
1. cid: customer ID
2. date: order date
3. money: order amount

For more details, or for a Python version, see: 數據分析 Python快速計算RFM模型 | Max行銷誌

WITH base AS (
    SELECT
        cid,
        max(date) AS last_date,
        date_part('day',
            cast(now() AS TIMESTAMP) - cast(max(date) AS TIMESTAMP)) AS recency,
        sum(money) AS money,
        count(cid) AS frequency
    FROM
        rfm_all
    GROUP BY
        cid
),
data_m AS (
    SELECT
        cid,
        ntile(4) OVER (ORDER BY money) AS rfm_m
    FROM
        base
),
data_r AS (
SELECT
    cid,
    ntile(4) OVER (ORDER BY recency DESC) AS rfm_r
FROM
    base
),
data_f AS (
SELECT
    cid,
    ntile(4) OVER (ORDER BY frequency) AS rfm_f
FROM
    base
)
SELECT
    base.*,
    rfm_r,
    rfm_f,
    rfm_m
FROM
    base
    LEFT JOIN data_m ON base.cid = data_m.cid
    LEFT JOIN data_r ON base.cid = data_r.cid
    LEFT JOIN data_f ON base.cid = data_f.cid

4. Hands-on: Customer segmentation with the NES model

The NES model is designed to track how customers' purchasing behaviour shifts over time. It splits customers into three broad groups (N, E, S) with five labels in total:
* N New Customers (New Customer)
* E Existing Customers (Existing Customer)
* E0 main customers: repurchased within 1× their purchase cycle
* S1 drowsy customers: no repurchase for 1–2× the purchase cycle
* S2 half-asleep customers: no repurchase for 2–3× the purchase cycle
* S3 sleeping customers (Sleeping Customer): no repurchase for more than 3× the purchase cycle

What data do you need to prepare?
1. cid: customer ID
2. date: order date
3. money: order amount

For more details, or for a Python version, see: 數據分析 Python快速計算NES模型 | Max行銷誌

WITH base AS (
    SELECT
        cid,
        max(date) AS last_date,
        date_part('day',
            cast(now() AS TIMESTAMP) - cast(max(date) AS TIMESTAMP)) AS recency,
        sum(money) AS money,
        count(cid) AS frequency
    FROM
        rfm_all
    GROUP BY
        cid
)
SELECT
    *,
    CASE
        WHEN recency < (SELECT avg(recency) FROM base) * 1 THEN 'E0'
        WHEN recency >= (SELECT avg(recency) FROM base) * 1 AND recency < (SELECT avg(recency) FROM base) * 2 THEN 'S1'
        WHEN recency >= (SELECT avg(recency) FROM base) * 2 AND recency < (SELECT avg(recency) FROM base) * 3 THEN 'S2'
        WHEN recency >= (SELECT avg(recency) FROM base) * 3 THEN 'S3'
    END AS nes_label
FROM
    base

5. Hands-on: Bucketing numeric values into labeled ranges

Goal: split the length_seconds values (the len_seconds column in the table created below) into buckets of 5 (20-25, 25-30, ...) and count how many rows fall into each bucket.

| session_id | length_seconds |
|------------|----------------|
| 1 | 23 |
| 2 | 43 |
| 3 | 67 |
| .. | .. |


| bucket | count |
|---------|-------|
| 20-25 | 2 |
| 25-30 | 1 |

Create the table

CREATE TABLE Histograms (sid INT, len_seconds INT)

Insert data

INSERT INTO Histograms (sid, len_seconds) VALUES
    (1, 23), 
    (2, 43), 
    (3, 63), 
    (4, 73), 
    (5, 13), 
    (6, 83), 
    (7, 84), 
    (8, 82), 
    (9, 63), 
    (10, 65)

Method 1

WITH base AS (
    SELECT
        *,
        floor(len_seconds / 5) AS label
    FROM
        Histograms
)
SELECT
    CONCAT(label * 5, '-', label * 5 + 5) bucket,
    COUNT(DISTINCT sid)
FROM
    base
GROUP BY
    label
ORDER BY
    label ASC

Method 2

SELECT
    count(len_seconds),
    concat(FLOOR(len_seconds / 5) * 5, '-', FLOOR(len_seconds / 5) * 5 + 5)
FROM
    Histograms
GROUP BY
    concat(FLOOR(len_seconds / 5) * 5, '-', FLOOR(len_seconds / 5) * 5 + 5)

6. Hands-on: 7-day Rolling Averages

Create the table

CREATE TABLE rolling_avg (date TIMESTAMP, sign_up int)

Insert data

INSERT INTO rolling_avg(date,sign_up) VALUES
    ('2020-05-02 06:38:35.162709', 100),
    ('2020-05-03 06:38:35.162709', 200),
    ('2020-05-04 06:38:35.162709', 300),
    ('2020-05-05 06:38:35.162709', 400),
    ('2020-05-06 06:38:35.162709', 500),
    ('2020-05-07 06:38:35.162709', 600),
    ('2020-05-08 06:38:35.162709', 700),
    ('2020-05-09 06:38:35.162709', 800),
    ('2020-05-10 06:38:35.162709', 900),
    ('2020-05-11 06:38:35.162709', 1000),
    ('2020-05-12 06:38:35.162709', 1100),
    ('2020-05-13 06:38:35.162709', 1200),
    ('2020-05-14 06:38:35.162709', 1300),
    ('2020-05-15 06:38:35.162709', 1400)

Query

SELECT
    a.date,
    MAX(a.sign_up) as sign_up,
    avg(b.sign_up) as rolling_avg_7d
FROM
    rolling_avg as a
join 
    rolling_avg as b on a.date <= b.date + INTERVAL '6 days' AND a.date >= b.date
GROUP BY
    a.date
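A window function gives the same trailing seven-day average more directly. A minimal sketch, assuming the table has exactly one row per day (ROWS counts rows, not calendar days):

SELECT
    date,
    sign_up,
    -- average of the current row and the 6 preceding rows:
    -- a trailing 7-day average when there is one row per day
    AVG(sign_up) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_avg_7d
FROM
    rolling_avg
ORDER BY
    date;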

7. Hands-on: Cumulative Sum

Goal: compute a running (cumulative) total of cash_flow as a new column.

| date | cash_flow |
|------------|-----------|
| 2020-05-02 | 100 |
| 2020-05-03 | 200 |
| 2020-05-04 | 300 |
| ... | ... |


| date | cumulative_cf |
|------------|---------------|
| 2020-05-02 | 100 |
| 2020-05-03 | 300 |
| 2020-05-04 | 600 |
| ... | ... |

Create the table

CREATE TABLE cumulative_sum (date TIMESTAMP, cash_flow int);

Insert data

INSERT INTO cumulative_sum(date, cash_flow) VALUES
    ('2020-05-02 06:38:35.162709', 100),
    ('2020-05-03 06:38:35.162709', 200),
    ('2020-05-04 06:38:35.162709', 300),
    ('2020-05-05 06:38:35.162709', 400),
    ('2020-05-06 06:38:35.162709', 500)

Method A

SELECT 
    a.date,
    max(a.cash_flow) as cash_flow,
    sum(b.cash_flow) as cumulative_cf
FROM
    cumulative_sum as a
JOIN 
    cumulative_sum as b on a.date >= b.date
GROUP BY
    a.date

Method B – window function

SELECT
    date,
    cash_flow,
    sum(cash_flow) over (ORDER BY date ASC) as cumulative_cf 
FROM
    cumulative_sum
ORDER BY
    date ASC

Wrapping up

▍A recap of the PostgreSQL topics covered in this post:

  1. PostgreSQL basics
    1. Subquery
    2. Common Table Expressions
    3. Window Functions
  2. Hands-on practice
    • User activity metrics MAU / WAU / DAU
    • Customer segmentation: the RFM model
    • Customer segmentation: the NES model
    • Customer Retention Rate
    • Labeling numeric value ranges (bucketing)
    • 7-day Rolling Averages
    • Cumulative Sum

That wraps up [Database Notes] PostgreSQL Basics Tutorial and Practice Exercises! If you have any questions, feel free to leave a comment below.

The latest articles from Max行銷誌 are published on Max's Facebook fan page; if you want to catch new posts, please give the page a like or a follow!
