Building your first machine learning model

This post will teach you how to create your first machine learning model in Python. Along the way, we’ll compare several widely used machine learning algorithms on the same dataset and see which performs best.

So, what kind of machine learning model are we building? In this post, we will develop a classification model using the random forest approach on the Iris dataset. After building the model, we’ll use it to make predictions, evaluate its performance, and visualize the findings.

Every machine learning project starts with a thorough understanding of the data and a clear goal. From there, you explore and prepare the data, apply machine learning algorithms to your dataset, and analyze the results.

Building a machine learning model using the Iris dataset

The steps for developing a well-defined ML project are as follows:

  • Recognize and define the problem
  • Analyze and prepare the data
  • Apply the algorithms
  • Reduce the errors
  • Predict the outcome

First, let’s look at the Iris data set, one of the most well-known datasets available, to learn about several machine learning algorithms.

The dataset

So, which dataset will we be using? The default answer in most tutorials is a toy dataset, such as the Iris dataset (classification) or the Boston housing dataset (regression).

Both are excellent examples to begin with. Most tutorials import the data directly from a Python library, such as the datasets sub-module of scikit-learn, rather than from an external source (such as a CSV file). For example, to load the Iris dataset, use the following code block:

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

Toy datasets have the advantage of being extremely easy to use: the library returns the data already in a format suitable for model development. The disadvantage of this convenience is that newcomers may not be able to tell which functions import data, which perform actual pre-processing, which build the model, and so on.

This article takes a hands-on approach, focusing on a workflow you can readily recreate. Here we load the Iris data from scikit-learn, but you can just as easily read the input data from a CSV file, substitute your own data, and repurpose the procedure outlined here for your own projects.
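
As a minimal sketch of that substitution, assuming a hypothetical file named my_data.csv with the same layout (four feature columns followed by a class column), you could load it with pandas instead:

import pandas as pd

# hypothetical CSV with four feature columns and a 'class' column
my_data = pd.read_csv('my_data.csv')
X = my_data.iloc[:, 0:4].values  # feature columns
y = my_data['class'].values      # target column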

Problem statement

This dataset contains the physical attributes of three flower species: Setosa, Versicolor, and Virginica. The numeric features are sepal length, sepal width, petal length, and petal width. We will predict the flower class based on these features.

The features are continuous numeric values that describe the dimensions of the flowers, and these are what the model is trained on.

Let’s get started on our machine learning project. To explore the data and train our model, we will use Python along with several popular libraries: NumPy, pandas, and scikit-learn.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

The Iris dataset is already included in scikit-learn, and we can use the following code to load it, as shown earlier.

from sklearn import datasets

iris = datasets.load_iris()

The properties of the iris flowers can be arranged in a dataframe, as shown below, with the column ‘class’ indicating the category to which each sample belongs.

iris_data = iris.data
iris_data = pd.DataFrame(iris_data, columns=iris.feature_names)
iris_data['class'] = iris.target
iris_data.head()

Our dataset contains three varieties of flowers, as previously stated. Let’s take a peek at the target names of the flowers.

print(iris.target_names)

Data Comprehension

With only 150 samples, this is a small dataset. The dataframe comprises four features (sepal length, sepal width, petal length, and petal width) plus the class column, with 150 samples in total spread across the three target classes. We can confirm this by checking its shape:

print(iris_data.shape)

Let us now look at the dataset’s summary statistics to see the mean, standard deviation, minimum and maximum values, and the quartile percentiles.

iris_data.describe()

Because the dataset is pre-defined, each class has an equal number of samples: 50 per class.
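
As a quick check, we can count the samples per class; a minimal sketch, assuming the iris_data dataframe built above:

# count the number of samples in each class (should be 50 each)
print(iris_data['class'].value_counts())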

Visually analyzing the data

Let’s look at the dataset’s box plot, which gives a visual depiction of how each feature’s values are distributed.

The box plot is a percentile-based graph in which the data is divided into quartiles of 25% each. In statistical analysis, it is used to understand measures such as the median, spread, and outliers.

import seaborn as sns

sns.set(rc={'figure.figsize': (2, 5)})  # set the figure size before plotting
sns.boxplot(data=iris_data, width=0.5, fliersize=5)
plt.show()

To see how each feature contributes to separating the classes, we can create a scatter plot matrix that shows how each feature relates to the others. This helps identify the features that matter most for classification in our model, as shown in the sketch below.
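
One convenient way to produce such a plot is seaborn’s pairplot; a minimal sketch, assuming the iris_data dataframe and the seaborn import from above:

# scatter plot matrix of all feature pairs, colored by class
sns.pairplot(iris_data, hue='class')
plt.show()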

Dividing the data into training and test sets

Now that we understand the dataset better, we can begin training a model. We’ll implement some of the most commonly used machine learning techniques here. Let’s start by setting aside some of the samples to train our model. We’ll use scikit-learn’s train_test_split function, which divides our dataset into a 70:30 ratio of training to test data. The following code accomplishes this:

# split the dataset into training and test sets (70:30)
from sklearn.model_selection import train_test_split

X = iris_data.values[:, 0:4]
Y = iris_data.values[:, 4]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

Training the model

We’ll train our model with some of the most commonly used algorithms to see how accurate each one is. The following algorithms are implemented for comparison:

  • Support Vector Machine (SVM)
  • Random Forest
  • Logistic Regression
  • K-Nearest Neighbors (KNN)

Let’s get started on our models and estimate the accuracy of each algorithm to see which method produces the best results. First, we’ll use KNN with the number of neighbors set to 5 (the default). We can construct the model as follows:

model = KNeighborsClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)
print(accuracy_score(y_test, predictions))

Next, the Support Vector Machine. By default, scikit-learn’s SVC uses the Radial Basis Function (RBF) kernel, so we’ll check the accuracy with the default settings.

model = SVC()
model.fit(x_train, y_train)
predictions = model.predict(x_test)
print(accuracy_score(y_test, predictions))

Random Forest is an accurate nonlinear method that builds an ensemble of decision tree classifiers. Let’s check its accuracy:

model = RandomForestClassifier(n_estimators=5)
model.fit(x_train, y_train)
predictions = model.predict(x_test)
print(accuracy_score(y_test, predictions))

Finally, Logistic Regression. For a binary classification problem it fits a single classifier directly, and for a multiclass problem such as this one, scikit-learn handles the classes with a one-vs-rest (or multinomial) scheme.

model = LogisticRegression()
model.fit(x_train, y_train)
predictions = model.predict(x_test)
print(accuracy_score(y_test, predictions))
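
To make the comparison easier to read, the same four models can also be trained in a single loop; a minimal sketch, assuming the train/test split and imports defined above:

# train each model and print its test accuracy for comparison
models = {
    'KNN': KNeighborsClassifier(),
    'SVM (RBF)': SVC(),
    'Random Forest': RandomForestClassifier(n_estimators=5),
    'Logistic Regression': LogisticRegression(),
}
for name, model in models.items():
    model.fit(x_train, y_train)
    predictions = model.predict(x_test)
    print(name, accuracy_score(y_test, predictions))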

Selecting a model and fine-tuning its parameters

We can see from the above models that Random Forest has the highest accuracy, at 97.59 percent. So, let’s fine-tune its parameters to try to reach 100% accuracy. To see if the model performs better, let’s increase the number of trees to 500.

model = RandomForestClassifier(n_estimators=500)
model.fit(x_train, y_train)
predictions = model.predict(x_test)
print(accuracy_score(y_test, predictions))
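
Rather than picking the number of trees by hand, the hyperparameters can also be tuned systematically with cross-validation; a minimal sketch using scikit-learn’s GridSearchCV, with an illustrative parameter grid chosen for demonstration:

from sklearn.model_selection import GridSearchCV

# search over a small, illustrative grid of Random Forest hyperparameters
param_grid = {
    'n_estimators': [5, 100, 500],
    'max_depth': [None, 3, 5],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(x_train, y_train)

print(grid.best_params_)
print(accuracy_score(y_test, grid.predict(x_test)))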

Conclusion

No single model or technique can guarantee a perfect result for every dataset in machine learning. The starting point is always analyzing the data before applying any method or building a model around the anticipated outcome. This dataset lets us reach 100% accuracy, which is rarely achievable on real-world data.

Random Forest, compared to the other algorithms, delivers the best accuracy here because it handles continuous data well and captures nonlinear relationships between the features. Averaging over many trees reduces the likelihood of overfitting and the variance of the predictions, resulting in improved accuracy.
