[Python Tutorial] Installing and Using BigQuery with Python

I. How do you install the packages for connecting to BigQuery?

Run the following commands in a terminal:

pip install google-cloud-bigquery

pip install pyarrow

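Package notes:

google-cloud-bigquery: the Python library for BigQuery from the Google Cloud Client Libraries
pyarrow: the package the client library uses to convert a pandas DataFrame when loading it into a BigQuery table (the DataFrame examples in this post also need pandas installed)

References:

google-cloud-bigquery: Python Client for Google BigQuery - google-cloud-bigquery 0.1.0 documentation
pyarrow: Using BigQuery with Pandas - google-cloud-bigquery 0.1.0 documentation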
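
To confirm that both installs succeeded, a small check like the following can help. It is only a sketch that uses the standard library; the helper name installed_version is made up for this post.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the two packages installed in this section.
for name in ("google-cloud-bigquery", "pyarrow"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If either line prints "not installed", rerun the matching pip command in the same Python environment that runs your script.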

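II. How do you set up the BigQuery key and import the package?

1. Get the BigQuery key:

Go to GCP and create a service account key > Google Cloud Platform
Select a service account, or create a new one (create service account > set the role to Owner)
Keys > Create new key > choose the JSON format
After you choose Create, the JSON key file downloads immediately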
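
Before wiring the downloaded file into your code, it can be worth checking that it really is a service account key. The sketch below assumes the usual fields of such a file (type, project_id, private_key, client_email); the function names are hypothetical helpers, not part of any Google library.

```python
import json

# Fields a service account key file is expected to contain (assumed minimum set).
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def key_problems(key):
    """Return a list of human-readable problems found in a parsed key dict."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in key]
    if "type" in key and key["type"] != "service_account":
        problems.append("type is not 'service_account'")
    return problems


def check_key_file(path):
    """Parse the JSON key file at `path` and return its problems (empty if fine)."""
    with open(path, encoding="utf-8") as handle:
        return key_problems(json.load(handle))
```

For example, check_key_file("config/Crawler-Bigquery.json") returning an empty list suggests the file has the expected shape; it does not prove the key is still valid on the GCP side.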

2. How to import the BigQuery package:

Load the BigQuery key

import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "config/Crawler-Bigquery.json"

Load BigQuery

from google.cloud import bigquery as bq

client = bq.Client()

References:

BigQuery Python上傳DataFrame至BigQuery – YL-Tsai – Medium
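
A relative path such as config/Crawler-Bigquery.json only resolves when the script is started from the right working directory. One way to fail early is sketched below; use_key_file is a hypothetical helper that checks the file exists and stores its absolute path in the environment variable.

```python
import os
from pathlib import Path


def use_key_file(path):
    """Point GOOGLE_APPLICATION_CREDENTIALS at `path`, failing early if it is missing."""
    key_path = Path(path).resolve()
    if not key_path.is_file():
        raise FileNotFoundError(f"Key file not found: {key_path}")
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(key_path)
    return key_path
```

Call use_key_file("config/Crawler-Bigquery.json") before creating the client, so a wrong path shows up as a clear FileNotFoundError instead of a credentials error later on.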

III. Understanding the BigQuery structure

GCP Projects > BigQuery Datasets > BigQuery Tables and Views

References:
BigQuery的結構 – Google Cloud Platform In Practice
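
The same hierarchy shows up in the table IDs used later in this post, which take the form project.dataset.table. The sketch below splits such an ID into its levels; TableId and parse_table_id are illustrative names, and it assumes the project ID itself contains no dots.

```python
from typing import NamedTuple


class TableId(NamedTuple):
    project: str
    dataset: str
    table: str


def parse_table_id(full_id):
    """Split 'project.dataset.table' into its three levels."""
    parts = full_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'project.dataset.table', got {full_id!r}")
    return TableId(*parts)


ref = parse_table_id("crawler-bigquery.creat_new_dataset.create_table")
print(ref.project, ref.dataset, ref.table)
```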

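1. Create a GCP Project

The service account key created above belongs to a GCP project, so the project already exists. Next, we use that key to create datasets inside the project.

2. Create BigQuery Datasets

A dataset lets us configure the following settings:

Data storage location
Data expiration time

▍Creating a BigQuery dataset with Python: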

import os

from google.cloud import bigquery as bq

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "config/Crawler-Bigquery.json"

client = bq.Client()

# Create a new dataset
# dataset_id = "your-project.your_dataset"
dataset_id = "crawler-bigquery.creat_new_dataset"  # dataset name; change as needed

dataset = bq.Dataset(dataset_id)
dataset.location = "asia-east1"  # data location; defaults to US if not set
dataset.default_table_expiration_ms = 30 * 24 * 60 * 60 * 1000  # default table expiration: 30 days
dataset.description = "creat_new_dataset & expiration in 30 days & location at asia-east1"  # dataset description

dataset = client.create_dataset(dataset)  # Make an API request.

References:
Google official docs: Creating datasets | BigQuery | Google Cloud
Google official docs: BigQuery - google-cloud 0.28.1.dev1 documentation
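
The expiration settings in this post are given in milliseconds, which are easy to get wrong by a factor of 1000. A small conversion helper, sketched here with the standard library only (to_milliseconds is a made-up name), keeps the arithmetic readable.

```python
from datetime import timedelta


def to_milliseconds(duration):
    """Convert a timedelta into a whole number of milliseconds."""
    return duration // timedelta(milliseconds=1)


# 30 days, the value used for default_table_expiration_ms above.
print(to_milliseconds(timedelta(days=30)))  # 2592000000
```

The result can be assigned directly, for example dataset.default_table_expiration_ms = to_milliseconds(timedelta(days=30)).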

▍Listing the existing BigQuery datasets

import os

from google.cloud import bigquery as bq

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "config/Crawler-Bigquery.json"

client = bq.Client()

datasets = list(client.list_datasets())  # Make an API request.

for dataset in datasets:
    print(dataset.dataset_id)

Output:

creat_new_dataset

3. Create BigQuery Tables

A table is where BigQuery actually stores the data, and it is what users run queries against.

▍BigQuery Tables example 1

Create a table in the dataset we just built
Set the table schema, table expiration, table description, and partitioning
Load a pandas DataFrame into the BigQuery table

import datetime
import os

import pandas as pd
from google.cloud import bigquery as bq

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "config/Crawler-Bigquery.json"

client = bq.Client()

# Table name
table_id = "crawler-bigquery.creat_new_dataset.create_table"

# Table schema
schema = [
    bq.SchemaField("name", "STRING"),
    bq.SchemaField("post", "STRING"),
    bq.SchemaField("timestamp", "TIMESTAMP"),
]

table = bq.Table(table_id, schema=schema)

# Table expiration: 6 days from now (timezone-aware, so it is not misread as UTC)
table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(days=6)

# Table description
table.description = "create a new table and write the description."

# Table partitioning
table.time_partitioning = bq.TimePartitioning(
    type_=bq.TimePartitioningType.DAY,
    field="timestamp",  # name of column to use for partitioning
    expiration_ms=7776000000,  # partitions expire after 90 days
)

# Create the table
table = client.create_table(table)  # Make an API request.

Load the DataFrame into BigQuery

# Load the DataFrame into BigQuery
df = pd.DataFrame({
    "name": ["Max"],
    "post": ["1"],
    "timestamp": [datetime.datetime.now(datetime.timezone.utc)],
})

# Pass the full table ID as a string (client.dataset() is deprecated)
job = client.load_table_from_dataframe(df, "crawler-bigquery.creat_new_dataset.create_table")

job.result()  # wait for the load job to finish

▍BigQuery Tables example 2

Create a nested table schema
Load JSON data into the BigQuery table

import datetime
import os

import pandas as pd
from google.cloud import bigquery as bq

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "config/Crawler-Bigquery.json"

client = bq.Client()

table_id = "crawler-bigquery.creat_new_dataset.create_nested_table"

# Nested table schema
schema = [
    bq.SchemaField("post", "STRING", mode="NULLABLE"),
    bq.SchemaField(
        "account",
        "RECORD",
        mode="REPEATED",
        fields=[
            bq.SchemaField("name", "STRING", mode="NULLABLE"),
            bq.SchemaField("address", "STRING", mode="NULLABLE"),
            bq.SchemaField("number", "INTEGER", mode="NULLABLE"),
        ],
    ),
]

table = bq.Table(table_id, schema=schema)

# Table expiration: 6 days from now (timezone-aware, so it is not misread as UTC)
table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(days=6)

# Table description
table.description = "create a new table and write the description."

# Create the table
table = client.create_table(table)  # Make an API request.

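# Load JSON data into BigQuery
# (load_table_from_json takes Python dicts, so the json module is not needed)

now_stamp = datetime.datetime.now()

print(now_stamp)

json_data = [{
    "post": "post01",
    "account": [{
        "name": "Max",
        "address": "忠孝東路走九遍",  # key must match the schema field name "address"
        "number": 900000000,  # INTEGER column, so pass an int; keeping a leading zero would need a STRING column
    }],
}]

# Pass the full table ID as a string (client.dataset() is deprecated)
job = client.load_table_from_json(json_data, "crawler-bigquery.creat_new_dataset.create_nested_table")

job.result()  # wait for the load job to finish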
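
With nested JSON it is easy to misspell a key (for instance Address instead of address) and only find out when the load job runs. The sketch below checks rows locally against a plain-dict copy of the nested schema from this example. SCHEMA_FIELDS and unknown_keys are hypothetical helpers, and the comparison is an exact string match, which may be stricter than BigQuery itself.

```python
# Plain-Python mirror of the nested schema: a name maps to None for a leaf
# column, or to a dict of sub-fields for a RECORD column.
SCHEMA_FIELDS = {
    "post": None,
    "account": {"name": None, "address": None, "number": None},
}


def unknown_keys(value, fields, prefix=""):
    """Return dotted paths of keys in `value` that `fields` does not define."""
    problems = []
    if isinstance(value, list):  # REPEATED field: check every element
        for item in value:
            problems.extend(unknown_keys(item, fields, prefix))
        return problems
    if not isinstance(value, dict):  # nothing to inspect on scalar values
        return problems
    for key, child in value.items():
        path = f"{prefix}{key}"
        if key not in fields:
            problems.append(path)
        elif fields[key] is not None:  # RECORD column: descend into sub-fields
            problems.extend(unknown_keys(child, fields[key], path + "."))
    return problems


row = {"post": "post01", "account": [{"name": "Max", "Address": "somewhere"}]}
print(unknown_keys(row, SCHEMA_FIELDS))  # ['account.Address']
```

Running this over every row before calling load_table_from_json gives a quick local sanity check; it does not validate value types.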

▍About partitioned tables

A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query.

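In short, following the official description, a partitioned table can be split in one of the following three ways, which makes the table easier to manage, improves query performance, and lowers BigQuery costs.

Ingestion time: partitions by the time the data was inserted into BigQuery; this value is recorded automatically in the _PARTITIONTIME column
Date/timestamp: if the table has a DATE or TIMESTAMP column, we can partition on that column
Integer range: if the table has an INTEGER column, we can specify an integer range (start, end, interval), and rows are partitioned by that integer value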
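
To make the partition types more concrete, the sketch below computes which partition a value would fall into. It runs locally and does not call BigQuery. It assumes daily partitions are identified by the UTC date in YYYYMMDD form, and that integer ranges include start and exclude end; both function names are invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def daily_partition_id(moment):
    """Return the YYYYMMDD partition ID (UTC day) for a timezone-aware datetime."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y%m%d")


def integer_range_partition(value, start, end, interval):
    """Return the lower bound of the bucket holding `value`, or None outside [start, end)."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    if not start <= value < end:
        return None
    return start + ((value - start) // interval) * interval


# 23:30 on Jan 1 at UTC-8 is already Jan 2 in UTC.
late_evening = datetime(2024, 1, 1, 23, 30, tzinfo=timezone(timedelta(hours=-8)))
print(daily_partition_id(late_evening))          # 20240102
print(integer_range_partition(25, 0, 100, 10))   # 20
```

The first example shows why local time can be misleading when reasoning about daily partitions.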

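IV. BigQuery pricing

Check the official pricing for the region you use: BigQuery pricing | Google Cloud

The "Always Free" usage limits are:

Active storage: the first 10 GB per month is free
Queries: the first 1 TB per month is free

Partitioning and clustering your tables helps reduce the amount of data that queries process. As a best practice, use partitioning and clustering wherever possible.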
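
The free query allowance above can be turned into a rough monthly estimate. This is only a sketch: it treats 1 TB as 2**40 bytes, and price_per_tb is a placeholder you must fill in from the official pricing page for your region.

```python
TB = 2 ** 40                 # assumption: 1 TB counted as 2**40 bytes
FREE_QUERY_BYTES = 1 * TB    # free query allowance per month, from the list above


def billable_query_bytes(bytes_scanned_this_month):
    """Return the bytes left to pay for after the free monthly allowance."""
    return max(0, bytes_scanned_this_month - FREE_QUERY_BYTES)


def estimate_query_cost(bytes_scanned_this_month, price_per_tb):
    """Estimate the monthly query cost; price_per_tb comes from the pricing page."""
    return billable_query_bytes(bytes_scanned_this_month) / TB * price_per_tb


# Scanning 3 TB in a month leaves 2 TB billable.
print(billable_query_bytes(3 * TB) // TB)  # 2
```

Because partitioning and clustering lower the bytes a query scans, they reduce the first argument to these functions directly.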

V. Further reading on BigQuery

BigQuery create a table from a query result, write the results to a destination table.
Creating and using tables | BigQuery | Google Cloud
BigQuery Copying a single source table
Managing tables | BigQuery | Google Cloud
BigQuery Specifying Schema
Specifying a schema | BigQuery | Google Cloud
Standard SQL queries
Datetime Functions in Standard SQL | BigQuery | Google Cloud

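Finally

▍A recap of what this post covered:

How to install the packages for connecting to BigQuery
How to set up the BigQuery key and import the package
Understanding the BigQuery structure
- Create a GCP Project
- Create BigQuery Datasets
- Create BigQuery Tables
BigQuery pricing
Further reading on BigQuery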

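That wraps up this introduction to installing and using BigQuery with Python! If you have any questions, leave a comment below.

The latest posts from Max行銷誌 are published on Max's Facebook fan page. To catch new updates, please like or follow the page!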