This is 片岩 from G-gen. This article shows how to use a custom container with Vertex AI Custom Training to train a LightGBM model, which is not provided out of the box, and to output feature contributions (SHAP).

Contents
- Introduction
- Choosing between prebuilt and custom containers
- Benefits of custom containers
- Architecture
- Initial setup
- Preparing and splitting the data
- Preparing the custom container
- Preparing the directory and repository
- Creating the training script
- Creating the Dockerfile
- Building and pushing the container
- Running the training job
- Inference and evaluation metrics
- Interpreting the analysis report

Introduction

Choosing between prebuilt and custom containers

Vertex AI Custom Training offers two broad choices for the container image that runs a training job.

Container type     | Characteristics                          | Suitable cases
Prebuilt container | Image provided by Google Cloud           | You want to use a standard framework such as XGBoost or TensorFlow right away
Custom container   | Image you build from your own Dockerfile | You want to use a library that is not provided, such as LightGBM, or build in your own processing

Benefits of custom containers

Even with a prebuilt container, you can add libraries by passing an argument such as requirements=["lightgbm", "shap"] when you run the job. For details on prebuilt containers, see the following article.

blog.g-gen.co.jp

In production operation, however, installing libraries dynamically at run time has the following drawbacks.

First, the environment becomes less reproducible. Every job run fetches the latest packages from the internet, so a job can suddenly fail, or training results can change, simply because a dependency released a new version. These are risks you want to avoid in production.

Second, every run incurs overhead. The libraries are downloaded and installed each time, which adds unnecessary wait time.
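One way to reduce the reproducibility risk described above is to pin the exact version of every dependency, so that each build resolves the same packages. The following is a minimal sketch and not part of the original procedure: `pinned_requirements` is a hypothetical helper that reads the versions installed in the current environment through the standard-library `importlib.metadata` and emits `name==version` lines.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(packages, get_version=version):
    """Return 'name==version' lines for the given package names.

    get_version is injectable so the function can be exercised without
    the packages being installed; by default it queries the current
    environment through importlib.metadata.
    """
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={get_version(name)}")
        except PackageNotFoundError:
            # Not installed here: leave it out rather than guess a version.
            continue
    return lines


if __name__ == "__main__":
    # Package names taken from the article's Dockerfile.
    for line in pinned_requirements(
        ["pandas", "scikit-learn", "lightgbm", "shap", "matplotlib", "google-cloud-storage"]
    ):
        print(line)
```

The printed lines could be saved as a requirements file that the Dockerfile installs, so that a rebuilt image would pick up the same versions.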
Using a custom container avoids both of these drawbacks.

Reference: Custom containers overview

Architecture

[Figure: architecture diagram of the procedure covered in this article]

To keep environment setup light, Colab Enterprise is used both for writing the source code and as the Python runtime.

Initial setup

First, install the libraries and set the environment variables. Libraries for visualization and interpretation (seaborn and shap) are added here.
    # Install the required libraries
    !pip install google-cloud-aiplatform lightgbm shap scikit-learn pandas seaborn matplotlib -q

    # Project and region settings
    # Note: replace these with the values for your own environment
    PROJECT_ID = "your-project-id"
    LOCATION = "asia-northeast1"

    # Bucket and folder definitions
    ROOT_BUCKET = "gs://your-bucket"
    EXPERIMENT_NAME = "diamonds-lgbm-v1"
    WORK_DIR = f"{ROOT_BUCKET}/{EXPERIMENT_NAME}"

    # Initialize the Vertex AI SDK
    from google.cloud import aiplatform
    aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=WORK_DIR)

    # Create the bucket (needed only if it does not exist yet)
    !gsutil mb -l {LOCATION} {ROOT_BUCKET}

Preparing and splitting the data

The data is the diamond price dataset often used in machine learning demos. It contains numeric features such as carat and categorical variables such as cut and color.

The data is split into training data and inference data and saved to Cloud Storage.

    import seaborn as sns
    from sklearn.model_selection import train_test_split
    import pandas as pd

    # Load the data (~54,000 rows)
    df = sns.load_dataset('diamonds')

    # Convert the string columns to the 'category' dtype
    cat_cols = ['cut', 'color', 'clarity']
    for col in cat_cols:
        df[col] = df[col].astype('category')

    # Split into training data and inference data at a 90:10 ratio
    train_full_df, test_df = train_test_split(df, test_size=0.1, random_state=42)

    # Save the data
    train_filename = "train.csv"
    train_full_df.to_csv(train_filename, index=False)
    test_filename = "test.csv"
    test_df.to_csv(test_filename, index=False)

    # Upload to GCS
    !gsutil cp {train_filename} {WORK_DIR}/data/{train_filename}
    !gsutil cp {test_filename} {WORK_DIR}/data/{test_filename}

    print(f"Training data: {WORK_DIR}/data/{train_filename}")
    print(f"Inference data: {WORK_DIR}/data/{test_filename}")

Preparing the custom container

Preparing the directory and repository

Create a working directory on Colab Enterprise, and create an Artifact Registry repository on Google Cloud to store the finished container.

    # Create the working directory
    !mkdir -p custom_container

    # Create the repository in Artifact Registry (first time only)
    !gcloud artifacts repositories create custom-training-repo \
        --repository-format=docker \
        --location={LOCATION} \
        --description="Custom Training Repository" || true

Creating the training script
Create task.py, which runs inside the container.

In addition to training the model, the script generates a learning curve for checking overfitting and a contribution plot for explaining the basis of the predictions, and uploads both images to Cloud Storage together with the model.

    %%writefile custom_container/task.py
    import argparse
    import os
    import pandas as pd
    import lightgbm as lgb
    import shap
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from google.cloud import storage
    from urllib.parse import urlparse
    import warnings

    warnings.filterwarnings('ignore')

    parser = argparse.ArgumentParser()
    parser.add_argument('--train-data-uri', dest='train_data_uri', type=str, required=True)
    args = parser.parse_args()

    # --- Helper functions for GCS download / upload ---
    def download_from_gcs(gcs_uri, local_file):
        parsed_url = urlparse(gcs_uri)
        client = storage.Client()
        bucket = client.bucket(parsed_url.netloc)
        blob = bucket.blob(parsed_url.path.lstrip("/"))
        blob.download_to_filename(local_file)

    def upload_to_gcs(local_file, gcs_dir):
        parsed_url = urlparse(gcs_dir)
        client = storage.Client()
        bucket = client.bucket(parsed_url.netloc)
        blob_path = f"{parsed_url.path.lstrip('/').rstrip('/')}/{local_file}"
        bucket.blob(blob_path).upload_from_filename(local_file)

    # --- 1. Prepare the data ---
    print(f"Downloading data from {args.train_data_uri}...", flush=True)
    local_train_file = "train.csv"
    download_from_gcs(args.train_data_uri, local_train_file)

    df = pd.read_csv(local_train_file)
    # CSV does not keep the dtype, so cast the categorical columns again
    cat_cols = ['cut', 'color', 'clarity']
    for col in cat_cols:
        df[col] = df[col].astype('category')

    X = df.drop(columns=["price"])
    y = df["price"]

    # Split into training and validation sets inside the script (prevents data leakage)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=42)

    # --- 2. Train the model ---
    print("Training LightGBM model...", flush=True)
    model = lgb.LGBMRegressor(n_estimators=100, random_state=42)

    # Pass eval_set so that the training history is recorded
    model.fit(
        X_train, y_train,
        eval_set=[(X_train, y_train), (X_val, y_val)],
        eval_names=['train', 'valid']
    )

    # --- 3. Generate and save the analysis images ---
    # (1) Plot the learning curve
    lgb.plot_metric(model, metric='l2')
    plt.title('Learning Curve (MSE)')
    plt.tight_layout()
    plt.savefig("learning_curve.png")
    plt.close()

    # (2) Plot the SHAP values (contributions)
    print("Calculating SHAP values...", flush=True)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer(X_val.sample(min(1000, len(X_val)), random_state=42))

    plt.figure()
    shap.plots.beeswarm(shap_values, show=False)
    plt.title("SHAP Feature Importance")
    plt.tight_layout()
    plt.savefig("shap_importance.png")
    plt.close()

    # --- 4. Upload the artifacts ---
    aip_model_dir = os.getenv("AIP_MODEL_DIR")
    if aip_model_dir:
        print(f"Uploading artifacts to: {aip_model_dir}", flush=True)
        model.booster_.save_model("model.txt")
        upload_to_gcs("model.txt", aip_model_dir)
        upload_to_gcs("learning_curve.png", aip_model_dir)
        upload_to_gcs("shap_importance.png", aip_model_dir)
        print("Upload completed.", flush=True)

Creating the Dockerfile

Write the Dockerfile. The base image is Python 3.12, and libgomp1, which LightGBM requires, is installed.
    %%writefile custom_container/Dockerfile
    FROM python:3.12-slim

    # Install the OS library that LightGBM requires
    RUN apt-get update && apt-get install -y --no-install-recommends \
        libgomp1 \
        && rm -rf /var/lib/apt/lists/*

    # Install the required Python libraries
    RUN pip install --no-cache-dir \
        pandas scikit-learn lightgbm shap matplotlib google-cloud-storage

    WORKDIR /app
    COPY task.py /app/task.py

    ENTRYPOINT ["python", "task.py"]

Building and pushing the container

Use Cloud Build to build the container and push it.

    # Build and push with Cloud Build
    REPO_NAME = "custom-training-repo"
    IMAGE_URI = f"{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{REPO_NAME}/lgbm-shap-trainer:latest"

    !gcloud builds submit --tag {IMAGE_URI} ./custom_container

Running the training job

Submit the training job, specifying the custom container that was just built (IMAGE_URI). Setting the base_output_dir argument lets the model and images be saved under the specified Cloud Storage path.
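When base_output_dir is set, Vertex AI passes the output location to the container through the AIP_MODEL_DIR environment variable, which task.py reads. That variable points at the model subdirectory of base_output_dir, which is why the artifacts are later downloaded from model_output/model. The path relationship can be sketched as follows, assuming the default layout; `expected_model_dir` is a hypothetical helper, not an SDK function.

```python
def expected_model_dir(base_output_dir: str) -> str:
    """Return the directory Vertex AI is assumed to expose as AIP_MODEL_DIR.

    Assumption: the default layout, where the model directory is the
    'model' subdirectory of base_output_dir.
    """
    return f"{base_output_dir.rstrip('/')}/model"


if __name__ == "__main__":
    # Placeholder values matching the article's initial setup.
    work_dir = "gs://your-bucket/diamonds-lgbm-v1"
    print(expected_model_dir(f"{work_dir}/model_output"))
```

Because task.py uploads only when AIP_MODEL_DIR is set, running the script without base_output_dir would train the model but skip the upload step.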
    # Define the job
    job = aiplatform.CustomContainerTrainingJob(
        display_name="diamonds-lgbm-shap-job",
        container_uri=IMAGE_URI,
    )

    # Run the job
    print("Submitting the job. Please wait until it completes...")
    job.run(
        machine_type="n1-standard-4",
        replica_count=1,
        args=[
            f"--train-data-uri={WORK_DIR}/data/train.csv"
        ],
        # Folder where the artifacts are saved
        base_output_dir=f"{WORK_DIR}/model_output"
    )

Inference and evaluation metrics

After the job completes, download the trained model from Cloud Storage and evaluate its accuracy on the test data in Colab Enterprise.

    import numpy as np
    import lightgbm as lgb
    from sklearn.metrics import r2_score, mean_squared_error
    import pandas as pd

    # 1. Download the training artifacts
    MODEL_DIR = f"{WORK_DIR}/model_output/model"
    print("Downloading the trained model and analysis images...")
    !gsutil cp {MODEL_DIR}/model.txt .
    !gsutil cp {MODEL_DIR}/learning_curve.png .
    !gsutil cp {MODEL_DIR}/shap_importance.png .

    # 2. Load the test data
    # Reading a gs:// path with pandas requires gcsfs to be installed
    df_test = pd.read_csv(f"{WORK_DIR}/data/test.csv")
    cat_cols = ['cut', 'color', 'clarity']
    for col in cat_cols:
        df_test[col] = df_test[col].astype('category')

    X_test = df_test.drop(columns=["price"])
    y_true = df_test["price"]

    # 3. Run inference locally
    local_model = lgb.Booster(model_file="model.txt")
    predictions = local_model.predict(X_test)

    # 4. Compute and display the evaluation metrics
    r2 = r2_score(y_true, predictions)
    rmse = np.sqrt(mean_squared_error(y_true, predictions))

    print("-" * 30)
    print(f"Evaluation result (rows: {len(y_true)})")
    print(f"R2 Score (coefficient of determination): {r2:.4f}")
    print(f"RMSE (size of the error): {rmse:.4f}")
    print("-" * 30)
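For reference, the two metrics printed by the evaluation code can be computed directly from their definitions. The following is a dependency-free sketch; the function names are illustrative and the sample numbers are made up, not taken from the diamonds data. RMSE is the square root of the mean squared error, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up prices purely for illustration.
    actual = [1000.0, 2000.0, 3000.0, 4000.0]
    predicted = [1100.0, 1900.0, 3200.0, 3800.0]
    print(f"RMSE: {rmse(actual, predicted):.4f}")
    print(f"R2:   {r2(actual, predicted):.4f}")
```

Because RMSE is expressed in the same unit as the target, it can be read directly as a typical size of the price error.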
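The following is the result from the author's environment. The R2 score exceeds 0.98, so a highly accurate model was obtained.

    ------------------------------
    Evaluation result (rows: 5394)
    R2 Score (coefficient of determination): 0.9817
    RMSE (size of the error): 543.6218
    ------------------------------

Interpreting the analysis report

In practice it is important not only to report prediction accuracy but also to interpret why the AI made a given prediction. This section reviews the learning curve image generated inside the container and a SHAP-based analysis of an individual row.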
    import shap
    from IPython.display import Image, display

    print("=== Learning curve (overfitting check) ===")
    display(Image("learning_curve.png"))

    print("\n=== Overall contributions (SHAP beeswarm) ===")
    display(Image("shap_importance.png"))

    # --- SHAP for an individual row (table format) ---
    print("\n=== Basis for the prediction on one specific row (the first row) ===")
    explainer = shap.TreeExplainer(local_model)
    single_instance = X_test.iloc[[0]]
    shap_values_single = explainer(single_instance)

    shap_df = pd.DataFrame({
        "Feature": single_instance.columns,
        "Value": single_instance.values[0],
        "Impact on price (SHAP value)": shap_values_single.values[0]
    })
    # Sort by absolute SHAP value, largest first
    shap_df = shap_df.reindex(shap_df["Impact on price (SHAP value)"].abs().sort_values(ascending=False).index)

    base_value = explainer.expected_value
    predicted_price = predictions[0]

    print(f"[Baseline price (average)]: {base_value:.2f}")
    display(shap_df.style.format({"Impact on price (SHAP value)": "{:+.2f}"}).hide(axis="index"))
    print(f"[Final predicted price]: {predicted_price:.2f}")

The learning curve shows the error (MSE) on both the training data and the validation data falling steadily and converging. Because the validation data was not used for fitting, this suggests that the model was trained without overfitting.
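The same check can be made numerically rather than by eye. After fit is called with eval_set, LightGBM's scikit-learn API keeps the per-iteration metric values in model.evals_result_; with the eval_names used in task.py, the two histories would be model.evals_result_['train']['l2'] and model.evals_result_['valid']['l2']. The sketch below takes two such lists and reports where the validation loss bottoms out. `summarize_learning_curve` is a hypothetical helper, and the loss values are invented for illustration.

```python
def summarize_learning_curve(train_loss, valid_loss):
    """Summarize per-iteration losses from a training run.

    Returns the 0-based iteration with the lowest validation loss, the
    final valid-minus-train gap, and whether validation loss rose after
    its minimum (a common sign of overfitting).
    """
    if len(train_loss) != len(valid_loss) or not valid_loss:
        raise ValueError("loss histories must be non-empty and equal length")
    best_iter = min(range(len(valid_loss)), key=valid_loss.__getitem__)
    return {
        "best_iteration": best_iter,
        "final_gap": valid_loss[-1] - train_loss[-1],
        "valid_rose_after_best": valid_loss[-1] > valid_loss[best_iter],
    }


if __name__ == "__main__":
    # Invented values: both curves fall and flatten together.
    train = [9.0, 5.0, 3.0, 2.0, 1.5]
    valid = [9.5, 5.6, 3.7, 2.8, 2.4]
    print(summarize_learning_curve(train, valid))
```

A validation loss that keeps rising after its minimum while the training loss keeps falling would be the pattern to watch for; in that case fewer boosting rounds or early stopping may be worth considering.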
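In the overall contribution plot, features nearer the top have a larger influence on the predictions. Taking 0 on the horizontal axis as the reference, points on the right are factors that raise the price and points on the left are factors that lower it.

Red points are rows where the feature value is high, and blue points are rows where it is low. For example, carat has its red points plotted on the right, which shows that the larger the carat, the higher the price.

Finally, the basis for the prediction on one specific row was output as a table.

Starting from the overall average price (the baseline), the table makes it possible to explain to business teams the internal calculation that leads to the final predicted price, for example "a negative contribution because the weight is small at 0.24 carats" or "a positive contribution because the clarity is high quality at VVS1".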
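The table can be read this way because SHAP values are additive: for a single row, the baseline plus the sum of the per-feature SHAP values reproduces the model's prediction. A minimal sketch of that arithmetic follows. The helper names and all numbers are hypothetical, chosen only to mirror the carat and clarity example, and are not results from the article's model.

```python
def reconstruct_prediction(base_value, contributions):
    """Add per-feature SHAP contributions to the baseline.

    contributions maps feature name -> SHAP value for one row.
    """
    return base_value + sum(contributions.values())


def rank_by_impact(contributions):
    """Sort features by absolute contribution, largest first."""
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)


if __name__ == "__main__":
    # Hypothetical numbers, not results from the article's model.
    base = 4000.0
    shap_row = {"carat": -3200.0, "clarity": 150.0, "color": -40.0}
    for feature, value in rank_by_impact(shap_row):
        print(f"{feature:>8}: {value:+.2f}")
    print(f"prediction: {reconstruct_prediction(base, shap_row):.2f}")
```

Comparing the reconstructed value with the model's own prediction for the same row is a quick sanity check that the table and the model agree.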
片岩 裕貴
Cloud Solutions Department, Cloud Developer Section
An engineer based in Wakayama Prefecture. Areas of interest: AI/ML. Selected as a Google Cloud Partner Top Engineer (2025 / 2026).