Machine learning library for the Python programming language.
Scikit-learn is a popular Python library for machine learning. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction via a consistent interface.
Scikit-learn is built on the SciPy (Scientific Python) stack, which must be installed before you can use Scikit-learn. This stack includes NumPy and SciPy, among other libraries.
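Scikit-learn comes with standard datasets, for instance the iris and digits datasets for classification. Older releases also shipped the Boston house prices dataset for regression, but it was removed in version 1.2; the diabetes dataset is a bundled alternative for regression.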
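The following is a minimal sketch of loading the bundled datasets. It uses load_diabetes for the regression example, on the assumption that a recent Scikit-learn release is installed (load_boston is not available from version 1.2 onward).

```python
from sklearn.datasets import load_diabetes, load_digits, load_iris

iris = load_iris()          # classification: 150 samples, 4 features
digits = load_digits()      # classification: 1797 samples, 64 pixel features
diabetes = load_diabetes()  # regression: 442 samples, 10 features

# Each loader returns a Bunch object exposing `data` (the feature matrix)
# and `target` (class labels or regression values).
for name, dataset in [("iris", iris), ("digits", digits), ("diabetes", diabetes)]:
    print(name, dataset.data.shape, dataset.target.shape)
```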
Data preprocessing is a crucial step in the machine learning pipeline. Scikit-learn provides several utilities for data preprocessing:
Handling Missing Values: Scikit-learn provides the SimpleImputer class, which supports basic strategies for imputing missing values: the mean, the median, or the most frequent value of the column in which the missing value is located, or a constant fill value.
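As an illustration of imputation, here is a small sketch that fills missing entries with the column mean. The array X is made-up example data, not something from the unit.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with two missing entries (np.nan).
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan]])

# strategy can also be "median", "most_frequent", or "constant".
imputer = SimpleImputer(strategy="mean")

# fit() learns the per-column means; transform() fills the gaps with them.
X_filled = imputer.fit_transform(X)
print(X_filled)
```

The learned fill values are stored in imputer.statistics_, so the same values can be applied later to new data with imputer.transform().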
Encoding Categorical Variables: Most machine learning models require numeric input. Scikit-learn provides utilities like LabelEncoder (intended for target labels) and OneHotEncoder (for input features) to convert categorical data into numeric form.
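A brief sketch of both encoders follows. The colors and labels values are invented for illustration; LabelEncoder is applied to the target and OneHotEncoder to a feature column.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical data: one categorical feature column and a list of class labels.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
labels = ["cat", "dog", "cat", "bird"]

# LabelEncoder maps each distinct label to an integer (classes are sorted).
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)
print(label_encoder.classes_, y)

# OneHotEncoder produces one binary column per category. The default output
# is a sparse matrix, so it is converted to a dense array for printing.
one_hot = OneHotEncoder()
X = one_hot.fit_transform(colors).toarray()
print(one_hot.categories_[0], X, sep="\n")
```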
Feature Scaling: Many machine learning algorithms perform better when numerical input variables are on a comparable scale. Scikit-learn provides utilities like StandardScaler (for standardization to zero mean and unit variance) and MinMaxScaler (for normalization to a fixed range, [0, 1] by default).
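To make the difference between the two scalers concrete, here is a small sketch on a made-up matrix whose columns have very different magnitudes.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: the second column is ten times larger than the first.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Standardization: each column is shifted to mean 0 and scaled to variance 1.
X_std = StandardScaler().fit_transform(X)

# Min-max normalization: each column is rescaled to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_minmax)
```

In practice the scaler is usually fitted on the training data only and then applied to the test data with transform(), so that no information from the test set leaks into preprocessing.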
Scikit-learn follows a consistent API where you first instantiate a model class, then fit the model to the data using the fit() method, and finally use the model to make predictions using the predict() method.
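The instantiate, fit, predict pattern can be sketched as follows. The two classifiers are arbitrary choices made for this example; the point is only that different estimators are driven through the same calls.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Two unrelated algorithms, used through the identical interface.
models = (KNeighborsClassifier(n_neighbors=3),
          DecisionTreeClassifier(random_state=0))

for model in models:
    model.fit(X, y)                      # learn from the data
    predictions = model.predict(X[:5])   # predict classes for five samples
    print(type(model).__name__, predictions)
```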
Splitting Data into Training and Test Sets: Scikit-learn provides the train_test_split function to randomly partition the data into a training set and a test set.
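A minimal sketch of splitting a dataset, assuming an 80/20 split. The test_size, random_state, and stratify values are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples for testing. random_state makes the shuffle
# reproducible; stratify=y keeps class proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```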
Training Models: After instantiating the model (for example, model = LinearRegression()), you can fit the model to the data using the fit() method (for example, model.fit(X_train, y_train)).
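Putting the two calls from the text into a runnable sketch: the data below is synthetic and noise-free (y = 3x + 2), an assumption made so that the fitted coefficients are easy to check by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, noise-free data following y = 3x + 2 (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)        # estimate slope and intercept
y_pred = model.predict(X_test)     # predict on unseen samples

print(model.coef_, model.intercept_)
```

After fitting, the learned parameters are available as model.coef_ and model.intercept_.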
Scikit-learn provides utilities to evaluate the performance of models:
Accuracy: The accuracy_score function computes the accuracy, returning the fraction of correct predictions by default or their count when normalize=False is passed.
Precision, Recall, F1 Score: The classification_report function builds a text report showing the main classification metrics (precision, recall, F1 score, and support) for each class.
Confusion Matrix: The confusion_matrix function computes the confusion matrix, which counts the predictions for each pair of true and predicted class, to evaluate the accuracy of a classification.
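The three evaluation utilities above can be sketched on a tiny hand-written example. The y_true and y_pred lists are invented so the results can be verified by counting.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical ground truth and predictions for a binary problem.
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

# Fraction of correct predictions (5 of 6 here).
accuracy = accuracy_score(y_true, y_pred)

# Count of correct predictions instead of the fraction.
n_correct = accuracy_score(y_true, y_pred, normalize=False)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Text table with precision, recall, F1 score, and support per class.
report = classification_report(y_true, y_pred)

print(accuracy, n_correct)
print(cm)
print(report)
```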
Understanding the bias-variance tradeoff is critical to understanding model performance. Scikit-learn provides utilities to help with this:
cross_val_score and cross_validate to perform cross-validation and assess the model's performance more robustly than a single train/test split allows.
By the end of this unit, you should have a solid understanding of Scikit-learn's basic functionalities and be able to use it to preprocess data, train models, and evaluate their performance.
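As a closing illustration of the cross-validation utilities mentioned above, here is a minimal sketch using 5-fold cross-validation. The choice of LogisticRegression, the iris data, and the scoring names are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_validate

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cross_val_score returns one score per fold for a single metric.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())

# cross_validate accepts several metrics and also reports fit/score times.
results = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1_macro"])
print(sorted(results.keys()))
```

A large spread between fold scores suggests that the performance estimate is sensitive to which samples end up in the training set.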