Machine Learning with Python — A Complete Guide from Basics to Advanced
Introduction: The Era of Data-Driven Transformation with Machine Learning
Machine learning has changed how we build modern applications, from predictive analytics to computer vision and natural language processing. Python has become the lingua franca of machine learning development thanks to its rich ecosystem of libraries and its large community. Whether you are a data scientist, a software engineer, or an entrepreneur, machine learning with Python is a valuable and marketable skill.
In this comprehensive guide, we will walk through the complete machine learning workflow: from data collection and preprocessing, through model training and evaluation, to deployment and monitoring. You will learn the fundamental algorithms, advanced techniques, and best practices that leading tech companies use to solve real-world problems.
Table of Contents
- ML Libraries and Ecosystem
- Data Preprocessing and Feature Engineering
- Supervised Learning: Regression
- Supervised Learning: Classification
- Unsupervised Learning: Clustering
- Dimensionality Reduction
- Deep Learning with TensorFlow/Keras
- Model Evaluation and Validation
- Hyperparameter Tuning
- Model Deployment and Serving
- Real-World Projects
- Conclusion
ML Libraries and Ecosystem
Python's rich ecosystem provides powerful libraries for every step of the ML workflow. Understanding these libraries is crucial for building effective pipelines.
Popular Libraries
# Installation
pip install numpy pandas scikit-learn tensorflow keras matplotlib seaborn
# NumPy: Numerical computations
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr.mean(), arr.std()) # Mean and standard deviation
# Pandas: Data manipulation and analysis
import pandas as pd
df = pd.read_csv('data.csv')
print(df.describe()) # Summary statistics
print(df[df['age'] > 30]) # Filter
# Scikit-learn: Traditional ML algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Matplotlib & Seaborn: Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.show()
# TensorFlow & Keras: Deep learning
import tensorflow as tf
from tensorflow import keras
Data Preprocessing and Feature Engineering
Data preprocessing often takes up the bulk of an ML pipeline; a common rule of thumb puts it at around 80% of the effort. Quality data preparation directly impacts model performance and reliability.
Data Loading dan Cleaning
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Load data
df = pd.read_csv('customer_data.csv')
# Check for missing values
print(df.isnull().sum())
# Handle missing values (assign the result; inplace fillna is deprecated in recent pandas)
df['age'] = df['age'].fillna(df['age'].median())
df['city'] = df['city'].fillna('Unknown')
# Remove duplicates
df = df.drop_duplicates()
# Remove outliers (using the IQR method)
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['salary'] >= Q1 - 1.5*IQR) & (df['salary'] <= Q3 + 1.5*IQR)]
print(df.head())
Feature Engineering
Creating meaningful features from raw data significantly affects model performance. Good feature engineering requires domain knowledge and experimentation.
# Create new features
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 50, 100],
                         labels=['Child', 'Young', 'Adult', 'Senior'])
# One-hot encoding for categorical variables
df = pd.get_dummies(df, columns=['city', 'occupation'])
# Scaling numerical features
scaler = StandardScaler()
df[['age', 'salary']] = scaler.fit_transform(df[['age', 'salary']])
# Feature normalization (0-1)
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
df[['age']] = minmax_scaler.fit_transform(df[['age']])
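One caveat about the scaling snippets above: the scaler is fitted on the full dataset before any train-test split, which leaks test-set statistics into training. A minimal leak-free sketch (the toy values are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the customer data (hypothetical values)
df = pd.DataFrame({'age': [22, 35, 47, 51, 29, 63],
                   'salary': [30, 52, 70, 80, 41, 95]})
train, test = train_test_split(df, test_size=0.33, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # fit ONLY on training rows
test_scaled = scaler.transform(test)        # reuse the training statistics
print(train_scaled.mean(axis=0))            # ~0 for each column
```

Fitting on the training split alone keeps the evaluation honest; the same pattern applies to MinMaxScaler and any other fitted transformer.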
Train-Test Split
Splitting the data ensures a fair evaluation of model performance on unseen data, helps detect overfitting, and enables realistic performance estimates.
from sklearn.model_selection import train_test_split
X = df.drop('target', axis=1) # Features
y = df['target'] # Target variable
# 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
Supervised Learning: Regression
Regression predicts continuous values. Regression models are used for prediction tasks such as price prediction, temperature forecasting, and more.
Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse}")
print(f"R² Score: {r2}")
# Coefficients
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef}")
Polynomial Regression
Polynomial regression fits a polynomial function to the data, capturing non-linear relationships that a linear model cannot express.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
# Create polynomial features
poly_model = Pipeline([
    ('poly_features', PolynomialFeatures(degree=2)),
    ('linear_regression', LinearRegression())
])
poly_model.fit(X_train, y_train)
y_pred = poly_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Polynomial R² Score: {r2}")
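The polynomial degree is a hyperparameter worth validating: degree 1 underfits curved data, while a high degree overfits noise. A self-contained sketch on synthetic quadratic data (the generating function is hypothetical), comparing held-out R² across degrees:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)  # quadratic signal + noise
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

scores = {}
for degree in (1, 2, 5):
    model = Pipeline([('poly', PolynomialFeatures(degree=degree)),
                      ('lr', LinearRegression())])
    model.fit(X_tr, y_tr)
    scores[degree] = model.score(X_val, y_val)  # R² on held-out data
    print(f"degree={degree}: R2={scores[degree]:.3f}")
```

Degree 2 should recover the quadratic signal here; degree 1 cannot, and very high degrees start chasing the noise.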
Supervised Learning: Classification
Classification predicts categorical outcomes. It is one of the most common ML tasks in production systems.
Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
Decision Trees and Random Forests
Tree-based models handle non-linear relationships and provide feature importance scores. Random Forests improve on single trees through ensemble methods.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# Decision Tree
dt_model = DecisionTreeClassifier(max_depth=10, random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, dt_pred)}")
# Random Forest
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, rf_pred)}")
# Gradient Boosting
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
gb_model.fit(X_train, y_train)
gb_pred = gb_model.predict(X_test)
print(f"Gradient Boosting Accuracy: {accuracy_score(y_test, gb_pred)}")
# Feature importance
for feature, importance in zip(X.columns, rf_model.feature_importances_):
    print(f"{feature}: {importance}")
Support Vector Machines (SVM)
SVMs are powerful algorithms that work well for binary and multiclass classification with non-linear decision boundaries.
from sklearn.svm import SVC
svm_model = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_model.fit(X_train, y_train)
svm_pred = svm_model.predict(X_test)
print(f"SVM Accuracy: {accuracy_score(y_test, svm_pred)}")
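Because SVMs are distance-based, unscaled features can dominate the kernel; wrapping a scaler and the classifier in one pipeline avoids that (and avoids leakage in cross-validation). A sketch using scikit-learn's bundled breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# The scaler is fitted inside the pipeline, on training data only
svm_pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
svm_pipe.fit(X_tr, y_tr)
print(f"Scaled SVM accuracy: {svm_pipe.score(X_te, y_te):.3f}")
```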
Unsupervised Learning: Clustering
Clustering groups similar data points without labeled data. It is useful for pattern discovery, customer segmentation, and similar tasks.
K-Means Clustering
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Determine optimal K using elbow method
inertias = []
silhouette_scores = []
K_range = range(2, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_train)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_train, kmeans.labels_))
# Plot elbow curve
import matplotlib.pyplot as plt
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.show()
# Train with the optimal K
optimal_k = 3
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_train)
print(f"Cluster labels: {clusters}")
print(f"Cluster centers:\n{kmeans.cluster_centers_}")
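The silhouette_scores collected in the loop above can select K directly instead of eyeballing the elbow: pick the K with the highest score. A self-contained sketch on synthetic blobs (the cluster centers are hypothetical):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic clusters
X, _ = make_blobs(n_samples=300,
                  centers=[[-6, -6], [-6, 6], [6, -6], [6, 6]],
                  cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"Best K by silhouette: {best_k}")  # 4 for these well-separated blobs
```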
Hierarchical Clustering
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
# Perform hierarchical clustering
linkage_matrix = linkage(X_train, method='ward')
# Plot dendrogram
dendrogram(linkage_matrix)
plt.show()
# Agglomerative clustering
hierarchical_model = AgglomerativeClustering(n_clusters=3, linkage='ward')
clusters = hierarchical_model.fit_predict(X_train)
DBSCAN
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# Standardize features for DBSCAN
X_scaled = StandardScaler().fit_transform(X_train)
# DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(X_scaled)
# Exclude the noise label (-1) when counting clusters
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
print(f"Number of clusters: {n_clusters}")
print(f"Number of noise points: {sum(clusters == -1)}")
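eps is the hard parameter in DBSCAN. A common heuristic is the k-distance plot: sort every point's distance to its min_samples-th closest point and read eps off the elbow of the curve. A sketch on the make_moons toy dataset:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# Distance to the 5th closest point (self included), matching min_samples=5
neighbors = NearestNeighbors(n_neighbors=5).fit(X)
distances, _ = neighbors.kneighbors(X)
k_dist = np.sort(distances[:, -1])

plt.plot(k_dist)
plt.xlabel('Points sorted by 5-NN distance')
plt.ylabel('5-NN distance')
plt.show()  # the elbow of this curve suggests a reasonable eps
```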
Dimensionality Reduction
Principal Component Analysis (PCA)
from sklearn.decomposition import PCA
# Reduce to 2 dimensions for visualization
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_train)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Cumulative explained variance: {sum(pca.explained_variance_ratio_)}")
# Visualize
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_train, cmap='viridis')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%})')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%})')
plt.show()
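Beyond 2-D visualization, the cumulative explained-variance curve tells you how many components to keep for a target variance level. A sketch on scikit-learn's digits dataset:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 64 pixel features
pca = PCA().fit(X)                   # keep all components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components explaining at least 95% of the variance
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print(f"Components for 95% variance: {n_components}")
```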
Deep Learning with TensorFlow/Keras
Neural Network Basics
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Sequential model
model = keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid')  # Binary classification
])
# Compile
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
# Print summary
model.summary()
# Train
history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=50,
    batch_size=32,
    verbose=1
)
# Evaluate
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {test_accuracy}")
# Predict
predictions = model.predict(X_test)
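Note that predict() on a sigmoid output returns probabilities, not class labels, so a threshold is needed for hard decisions. A minimal sketch with hypothetical probability values standing in for model.predict(X_test):

```python
import numpy as np

# Hypothetical sigmoid outputs, one probability per test sample
probabilities = np.array([[0.08], [0.61], [0.97], [0.42]])
labels = (probabilities > 0.5).astype(int).ravel()
print(labels)  # → [0 1 1 0]
```

The 0.5 cutoff is only a default; shifting it trades precision against recall.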
Convolutional Neural Networks (CNN)
# CNN for image classification
model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')  # 10 classes
])
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
Recurrent Neural Networks (RNN/LSTM)
# LSTM for sequence data (time series, text)
model = keras.Sequential([
    layers.LSTM(64, activation='relu', input_shape=(100, 1), return_sequences=True),
    layers.Dropout(0.2),
    layers.LSTM(32, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(16, activation='relu'),
    layers.Dense(1)  # Regression output
])
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
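LSTMs expect input shaped (samples, timesteps, features), so a raw series must first be sliced into sliding windows with next-step targets. A minimal NumPy sketch (the window size is arbitrary):

```python
import numpy as np

def make_windows(series, window):
    """Slice a 1-D series into (samples, window, 1) inputs and next-step targets."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X[..., np.newaxis], y

series = np.arange(10, dtype=float)  # toy series 0..9
X_seq, y_seq = make_windows(series, window=3)
print(X_seq.shape, y_seq.shape)      # → (7, 3, 1) (7,)
```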
Model Evaluation and Validation
Cross-Validation
from sklearn.model_selection import cross_val_score, cross_validate
# K-Fold Cross Validation
scores = cross_val_score(
    model, X_train, y_train, cv=5, scoring='accuracy'
)
print(f"Cross-validation scores: {scores}")
print(f"Mean CV Score: {scores.mean():.3f} (+/- {scores.std():.3f})")
# Detailed cross-validation
cv_results = cross_validate(
    model, X_train, y_train, cv=5,
    scoring=['accuracy', 'precision', 'recall', 'f1'],
    return_train_score=True
)
Confusion Matrix and Classification Report
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Classification Report
print(classification_report(y_test, y_pred))
# ROC Curve (requires class probabilities, not hard labels)
y_pred_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
Hyperparameter Tuning
Grid Search
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_}")
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
Random Search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 20),
    'min_samples_split': randint(2, 20),
    'learning_rate': uniform(0.01, 0.3)
}
random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_dist,
    n_iter=20,
    cv=5,
    n_jobs=-1
)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_}")
Model Deployment and Serving
Save and Load Models
import joblib
# Save scikit-learn model
joblib.dump(model, 'model.pkl')
loaded_model = joblib.load('model.pkl')
# Save Keras model (the native .keras format is preferred over legacy HDF5)
keras_model.save('keras_model.keras')
loaded_keras_model = keras.models.load_model('keras_model.keras')
# Save metadata alongside the model for versioning
import json
model_metadata = {
    'accuracy': 0.95,
    'features': X.columns.tolist(),
    'timestamp': '2025-01-10'
}
with open('model_metadata.json', 'w') as f:
    json.dump(model_metadata, f)
Flask API for Model Serving
from flask import Flask, request, jsonify
import joblib
app = Flask(__name__)
model = joblib.load('model.pkl')
@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    features = [data['age'], data['income'], data['credit_score']]
    prediction = model.predict([features])
    return jsonify({
        'prediction': int(prediction[0]),
        'probability': float(model.predict_proba([features])[0].max())
    })

if __name__ == '__main__':
    app.run(debug=True, port=5000)
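The route can be exercised without starting a server via Flask's test client; in this sketch a stub model stands in for the joblib-loaded one, and the field values are hypothetical:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

class StubModel:
    """Stands in for the joblib-loaded model so the route can be tested offline."""
    def predict(self, rows):
        return [1 for _ in rows]
    def predict_proba(self, rows):
        return [[0.2, 0.8] for _ in rows]

model = StubModel()

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    features = [data['age'], data['income'], data['credit_score']]
    prediction = model.predict([features])
    return jsonify({
        'prediction': int(prediction[0]),
        'probability': float(max(model.predict_proba([features])[0]))
    })

# Exercise the endpoint in-process, no server needed
client = app.test_client()
resp = client.post('/predict', json={'age': 35, 'income': 72000, 'credit_score': 710})
print(resp.get_json())  # → {'prediction': 1, 'probability': 0.8}
```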
Real-World Projects
House Price Prediction
# Load housing data
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train an ensemble model
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Evaluate
from sklearn.metrics import mean_absolute_error, r2_score
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE: ${mae*100000:.2f}")  # the target is in units of $100,000
print(f"R² Score: {r2}")
Conclusion
Machine learning with Python is a powerful skill that opens career opportunities in data science, AI research, and the tech industry. By mastering the fundamental algorithms, data preprocessing, model evaluation, and deployment techniques covered in this guide, you are ready to tackle real-world machine learning problems.
Checklist for Production ML Projects:
- ✓ Proper data cleaning and preprocessing
- ✓ Feature engineering and feature selection
- ✓ Model training with cross-validation
- ✓ Hyperparameter tuning with grid/random search
- ✓ Comprehensive evaluation metrics
- ✓ Model versioning and tracking
- ✓ API deployment for model serving
- ✓ Monitoring and retraining strategy
- ✓ Model documentation and reproducibility
- ✓ Ethics and fairness considerations
Start your ML journey today!