Machine Learning with Python — A Complete Guide from Basics to Advanced

16/10/2025 Python By Tech Writers
Machine Learning · Python · Data Science · AI · Deep Learning

Introduction: The Era of Data-Driven Transformation with Machine Learning

Machine Learning has transformed how we build modern applications, from predictive analytics to computer vision and natural language processing. Python has become the lingua franca of machine learning development thanks to its rich ecosystem of libraries and its large community. Whether you are a data scientist, software engineer, or entrepreneur, machine learning with Python is a valuable and marketable skill.

In this comprehensive guide, we will walk through the complete machine learning workflow: from data collection and preprocessing, through model training and evaluation, to deployment and monitoring. You will learn the fundamental algorithms, advanced techniques, and best practices that leading tech companies use to solve real-world problems.

ML Libraries and Ecosystem

Python's rich ecosystem provides powerful libraries for every step of the ML workflow. Understanding these libraries is crucial.

# Installation
pip install numpy pandas scikit-learn tensorflow keras matplotlib seaborn

# NumPy: Numerical computations
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr.mean(), arr.std())  # Mean and standard deviation

# Pandas: Data manipulation and analysis
import pandas as pd
df = pd.read_csv('data.csv')
print(df.describe())  # Summary statistics
print(df[df['age'] > 30])  # Filter

# Scikit-learn: Traditional ML algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Matplotlib & Seaborn: Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.show()

# TensorFlow & Keras: Deep learning
import tensorflow as tf
from tensorflow import keras

Data Preprocessing and Feature Engineering

Data preprocessing typically takes up the bulk of an ML pipeline, often cited as around 80% of the effort. The quality of data preparation directly impacts model performance and reliability.

Data Loading and Cleaning

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load data
df = pd.read_csv('customer_data.csv')

# Check for missing values
print(df.isnull().sum())

# Handle missing values (use assignment; chained inplace=True is
# deprecated in recent pandas)
df['age'] = df['age'].fillna(df['age'].median())
df['city'] = df['city'].fillna('Unknown')

# Remove duplicates
df = df.drop_duplicates()

# Remove outliers (using the IQR method)
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['salary'] >= Q1 - 1.5*IQR) & (df['salary'] <= Q3 + 1.5*IQR)]

print(df.head())

Feature Engineering

Creating meaningful features from raw data significantly affects model performance. Good feature engineering requires domain knowledge and experimentation.

# Create new features
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 50, 100], 
                         labels=['Child', 'Young', 'Adult', 'Senior'])

# One-hot encoding for categorical variables
df = pd.get_dummies(df, columns=['city', 'occupation'])

# Scaling numerical features
scaler = StandardScaler()
df[['age', 'salary']] = scaler.fit_transform(df[['age', 'salary']])

# Feature normalization (0-1)
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
df[['age']] = minmax_scaler.fit_transform(df[['age']])
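After engineering features, it often pays to keep only the most informative ones. Below is a minimal sketch of feature selection using scikit-learn's SelectKBest; the synthetic frame and column names (`f0` to `f5`) are purely illustrative, not from the dataset above:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic frame: only f0 and f1 actually drive the target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 6)),
                      columns=[f'f{i}' for i in range(6)])
y_demo = (X_demo['f0'] + X_demo['f1'] > 0).astype(int)

# Keep the k features with the highest ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(X_demo, y_demo)
selected = X_demo.columns[selector.get_support()].tolist()
print(selected)  # typically ['f0', 'f1']
```

Filter methods like this are cheap and model-agnostic; for stronger results you can also try model-based selection (e.g. feature importances from a tree ensemble).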

Train-Test Split

Splitting the data ensures a fair evaluation of model performance on unseen data, keeps overfitting from going unnoticed, and yields realistic performance estimates.

from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)  # Features
y = df['target']  # Target variable

# 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

Supervised Learning: Regression

Regression predicts continuous values. Regression models are used for prediction tasks such as price prediction, temperature forecasting, and more.

Linear Regression

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse}")
print(f"R² Score: {r2}")

# Coefficients
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef}")

Polynomial Regression

Polynomial regression fits a polynomial function to the data, capturing non-linear relationships that a linear model cannot express.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Create polynomial features
poly_model = Pipeline([
    ('poly_features', PolynomialFeatures(degree=2)),
    ('linear_regression', LinearRegression())
])

poly_model.fit(X_train, y_train)
y_pred = poly_model.predict(X_test)

r2 = r2_score(y_test, y_pred)
print(f"Polynomial R² Score: {r2}")

Supervised Learning: Classification

Classification predicts categorical outcomes. It is one of the most common ML tasks in production systems.

Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

Decision Trees and Random Forests

Tree-based models handle non-linear relationships and provide feature importance scores. Random Forests improve on single trees through ensemble methods.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Decision Tree
dt_model = DecisionTreeClassifier(max_depth=10, random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, dt_pred)}")

# Random Forest
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, rf_pred)}")

# Gradient Boosting
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
gb_model.fit(X_train, y_train)
gb_pred = gb_model.predict(X_test)
print(f"Gradient Boosting Accuracy: {accuracy_score(y_test, gb_pred)}")

# Feature importance
for feature, importance in zip(X.columns, rf_model.feature_importances_):
    print(f"{feature}: {importance}")

Support Vector Machines (SVM)

SVM is a powerful algorithm that works well for binary and multiclass classification with non-linear decision boundaries.

from sklearn.svm import SVC

svm_model = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_model.fit(X_train, y_train)

svm_pred = svm_model.predict(X_test)
print(f"SVM Accuracy: {accuracy_score(y_test, svm_pred)}")

Unsupervised Learning: Clustering

Clustering groups similar data points without labeled data. It is useful for pattern discovery, customer segmentation, and the like.

K-Means Clustering

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Determine optimal K using elbow method
inertias = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_train)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_train, kmeans.labels_))

# Plot elbow curve
import matplotlib.pyplot as plt
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.show()

# Train with the optimal K
optimal_k = 3
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_train)

print(f"Cluster labels: {clusters}")
print(f"Cluster centers:\n{kmeans.cluster_centers_}")

Hierarchical Clustering

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Perform hierarchical clustering
linkage_matrix = linkage(X_train, method='ward')

# Plot dendrogram
dendrogram(linkage_matrix)
plt.show()

# Agglomerative clustering
hierarchical_model = AgglomerativeClustering(n_clusters=3, linkage='ward')
clusters = hierarchical_model.fit_predict(X_train)

DBSCAN

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Standardize features for DBSCAN
X_scaled = StandardScaler().fit_transform(X_train)

# DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(X_scaled)

print(f"Number of clusters: {len(set(clusters))}")
print(f"Number of noise points: {sum(clusters == -1)}")

Dimensionality Reduction

Principal Component Analysis (PCA)

from sklearn.decomposition import PCA

# Reduce dimensions to 2 for visualization
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_train)

print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Cumulative explained variance: {sum(pca.explained_variance_ratio_)}")

# Visualize
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_train, cmap='viridis')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%})')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%})')
plt.show()
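Instead of fixing n_components at 2, PCA can be asked to keep however many components are needed to retain a target share of the variance. A minimal sketch on synthetic correlated data (the data here is illustrative, not the customer dataset above):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: the last five columns are noisy copies of the first five,
# so far fewer than 10 components carry almost all of the variance
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))
X_demo[:, 5:] = X_demo[:, :5] + rng.normal(scale=0.1, size=(200, 5))

# A float n_components means "keep enough components for 95% of variance"
pca = PCA(n_components=0.95)
X_red = pca.fit_transform(X_demo)

print(f"{X_red.shape[1]} components retained")
print(f"Cumulative explained variance: {pca.explained_variance_ratio_.sum():.3f}")
```

This is a convenient way to pick dimensionality for downstream models rather than for plotting.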

Deep Learning with TensorFlow/Keras

Neural Network Basics

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Sequential model
model = keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid')  # Binary classification
])

# Compile
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Print summary
model.summary()

# Train
history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=50,
    batch_size=32,
    verbose=1
)

# Evaluate
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {test_accuracy}")

# Predict
predictions = model.predict(X_test)

Convolutional Neural Networks (CNN)

# CNN for image classification
model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')  # 10 classes
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

Recurrent Neural Networks (RNN/LSTM)

# LSTM for sequence data (time series, text)
model = keras.Sequential([
    layers.LSTM(64, activation='relu', input_shape=(100, 1), return_sequences=True),
    layers.Dropout(0.2),
    layers.LSTM(32, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(16, activation='relu'),
    layers.Dense(1)  # Regression output
])

model.compile(optimizer='adam', loss='mse', metrics=['mae'])

Model Evaluation and Validation

Cross-Validation

from sklearn.model_selection import cross_val_score, cross_validate

# K-Fold cross-validation (on a scikit-learn estimator, e.g. the earlier
# logistic regression model)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(
    model, X_train, y_train, cv=5, scoring='accuracy'
)
print(f"Cross-validation scores: {scores}")
print(f"Mean CV Score: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Detailed cross-validation
cv_results = cross_validate(
    model, X_train, y_train, cv=5,
    scoring=['accuracy', 'precision', 'recall', 'f1'],
    return_train_score=True
)

Confusion Matrix and Classification Report

from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Classification Report
print(classification_report(y_test, y_pred))

# ROC Curve (needs probability scores, not hard class predictions)
y_pred_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = roc_auc_score(y_test, y_pred_proba)

plt.plot(fpr, tpr, label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_}")

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

Randomized Search

RandomizedSearchCV samples parameter combinations instead of exhaustively trying every one, which scales much better to large search spaces.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 20),
    'min_samples_split': randint(2, 20),
    'learning_rate': uniform(0.01, 0.3)
}

random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_dist,
    n_iter=20,
    cv=5,
    n_jobs=-1
)

random_search.fit(X_train, y_train)

Model Deployment and Serving

Save and Load Models

import joblib

# Save scikit-learn model
joblib.dump(model, 'model.pkl')
loaded_model = joblib.load('model.pkl')

# Save Keras model (native .keras format; HDF5 .h5 is the legacy format)
keras_model.save('keras_model.keras')
loaded_keras_model = keras.models.load_model('keras_model.keras')

# Save versioning metadata alongside the model
import json
model_metadata = {
    'accuracy': 0.95,
    'features': X.columns.tolist(),
    'timestamp': '2025-01-10'
}
with open('model_metadata.json', 'w') as f:
    json.dump(model_metadata, f)

Flask API for Model Serving

from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    features = [data['age'], data['income'], data['credit_score']]
    prediction = model.predict([features])
    
    return jsonify({
        'prediction': int(prediction[0]),
        'probability': float(model.predict_proba([features])[0].max())
    })

if __name__ == '__main__':
    app.run(debug=True, port=5000)

Real-World Projects

House Price Prediction

# Load housing data
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train ensemble model
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(n_estimators=100)
model.fit(X_train, y_train)

# Evaluate
from sklearn.metrics import mean_absolute_error, r2_score
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MAE: ${mae*100000:,.2f}")  # target is in units of $100,000
print(f"R² Score: {r2}")

Conclusion

Machine Learning with Python is a powerful skill that opens up career opportunities in data science, AI research, and the tech industry. By mastering the fundamental algorithms, data preprocessing, model evaluation, and deployment techniques covered in this guide, you are ready to tackle real-world machine learning problems.

Checklist for a Production ML Project:

  • ✓ Proper data cleaning and preprocessing
  • ✓ Feature engineering and feature selection
  • ✓ Model training with cross-validation
  • ✓ Hyperparameter tuning with grid/random search
  • ✓ Comprehensive evaluation metrics
  • ✓ Model versioning and tracking
  • ✓ API deployment for model serving
  • ✓ Monitoring and retraining strategy
  • ✓ Model documentation and reproducibility
  • ✓ Ethics and fairness considerations
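As an illustration of the monitoring item, a simple population stability index (PSI) check can flag when a live feature distribution drifts away from the training data. The data and the 0.2 threshold below are illustrative assumptions (0.2 is a common rule of thumb, not a universal standard):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a live sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Illustrative data: the live ages have drifted upward
rng = np.random.default_rng(0)
train_ages = rng.normal(40, 10, 5000)
live_ages = rng.normal(45, 12, 5000)

score = psi(train_ages, live_ages)
print(f"PSI: {score:.3f}")
if score > 0.2:  # rule of thumb: above 0.2 suggests significant shift
    print("Feature drift detected: consider retraining")
```

In production, a check like this would run per feature on each batch of incoming requests and feed an alerting or retraining pipeline.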

Start your ML journey today!