Python Machine Learning
Machine Learning is a branch of artificial intelligence.
With Machine Learning, computers can learn from data.
Python's simple, readable syntax makes it ideal for developing machine learning applications.
Why Study Machine Learning?
- Have computers recognize patterns
- Automate analytical models
- Have systems learn from data
- Optimize processes
- Predict outcomes
Machine Learning Applications
- Image and speech recognition
- Fraud detection
- Credit scores
- Medical analysis
- Predicting equipment failure
- Advertisement targeting
- Driverless cars
- And much, much more!
Why Use Python for Machine Learning?
Machine Learning is easy to implement with Python's open-source libraries. Python lets you spend less time debugging and more time solving machine learning problems.
In addition, Python can use several of its core libraries to handle large amounts of data and save you time when you create and test your algorithms.
Machine Learning Types
There are three main types of machine learning.
Supervised Learning
This is the oldest form of machine learning. The algorithm is given labeled data with a target outcome, and a model is trained on that data until it predicts the outcome reliably. Great for classification.
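Example
A minimal sketch of supervised learning (an illustration added here, not one of this tutorial's datasets): scikit-learn's built-in iris data and a k-nearest neighbors classifier, which learns from labeled examples and is then scored on unseen data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Labeled data: measurements (X) and the known classes (y)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=1)
# Fit the classifier to the labeled training examples
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
# Score the trained model on examples it has never seen
print(model.score(X_test, y_test))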
Unsupervised Learning
The algorithm is not given an outcome. Unsupervised algorithms are used to segment data into different groups. Useful when data cannot be easily classified. Great for finding trends.
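Example
A minimal sketch of unsupervised learning (again an added illustration): KMeans from scikit-learn segments the iris measurements into three groups without ever seeing the class labels.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
# Only the measurements are used; the class labels are never shown to the algorithm
X = load_iris().data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1)
kmeans.fit(X)
# Each sample is assigned to one of the discovered groups
print(kmeans.labels_[0:10])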
Reinforcement Learning
The machine is programmed to make decisions. Feedback loops train the algorithm by trial and error. The machine learns to make better estimates based on past data.
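Example
A toy sketch of the trial-and-error feedback loop in plain Python (a hypothetical illustration, not a full reinforcement learning setup): an agent repeatedly picks one of three slot machines and refines its payout estimates from the rewards it receives.
import random
random.seed(1)
true_payouts = [0.3, 0.5, 0.8]  # hidden win probability of each machine
estimates = [0.0, 0.0, 0.0]     # the agent's running payout estimates
counts = [0, 0, 0]
for step in range(1000):
    # Explore a random machine 10% of the time, otherwise exploit the best estimate
    if random.random() < 0.1:
        arm = random.randrange(3)
    else:
        arm = estimates.index(max(estimates))
    reward = 1 if random.random() < true_payouts[arm] else 0
    counts[arm] += 1
    # Feedback loop: nudge the estimate toward the observed reward
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
# The estimates approach the true payouts, so the agent learns to favor machine 2
print(estimates)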
What You Should Already Know
Before taking on machine learning, you should be familiar with:
- Python Coding
- Python Libraries
- Algorithms
This tutorial will not cover the basics of Python. For that, see our Python Tutorial.
Machine Learning Environment Set Up
A working environment must be set up for machine learning in Python.
Python Version
Be sure that Python 3.6 or later is installed. The version can easily be checked from the command line:
C:\Users\Your Name>python --version
If Python is not installed, a copy can be downloaded from python.org.
Libraries
A number of open-source libraries are commonly used for machine learning in Python. Be sure to install these before continuing with the rest of the tutorial:
NumPy - Supports large scale arrays and matrices.
SciPy - A collection of mathematical algorithms built on NumPy.
Scikit-learn - Contains machine learning algorithms used for data analysis. Its preprocessing package helps standardize data for a better model fit.
Pandas - A data analysis library. Includes structures for manipulating tables.
Matplotlib - A 2D plotting library that produces publication-quality images. Visualize machine learning data sets with Matplotlib.
How to Install
Libraries can be installed using pip.
Navigate to Python's Scripts folder and run each of these commands to install the libraries:
C:\Users\Your Name\AppData\Local\Programs\Python\Python36-32\Scripts>pip install numpy
C:\Users\Your Name\AppData\Local\Programs\Python\Python36-32\Scripts>pip install scipy
C:\Users\Your Name\AppData\Local\Programs\Python\Python36-32\Scripts>pip install scikit-learn
C:\Users\Your Name\AppData\Local\Programs\Python\Python36-32\Scripts>pip install pandas
C:\Users\Your Name\AppData\Local\Programs\Python\Python36-32\Scripts>pip install matplotlib
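To confirm the installations succeeded, a quick sanity check is to import each library and print its version (each of these packages exposes a __version__ attribute):
import numpy
import scipy
import sklearn
import pandas
import matplotlib
# Print the installed version of each library
print(numpy.__version__, scipy.__version__, sklearn.__version__, pandas.__version__, matplotlib.__version__)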
Viewing Data with Python
Most machine learning algorithms start with raw (untouched) data. Looking at the data first may reveal insights that help with preprocessing and with guiding the machine learning algorithm.
Example
Here we view a sample dataset taken from the UCI machine learning repository:
import pandas
fileurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
print(data.head(10))
Data Dimensions
It's important to know how the data is shaped. It may take a long time to read many rows and columns. With the shape property, we can know the size of the data.
Example
Print the data shape; the result should show that the iris data has 150 rows, each with 5 columns:
shape = data.shape
print(shape)
Data Types
The dtypes property will be important. Strings might need to be transformed into integers to represent categories. The data types can be pulled from the raw data.
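Example
Continuing from the dataset loaded above, print the type of each column with the dtypes property. The four measurements should be floats, while the class column holds strings (shown by Pandas as object):
# Print the data type of each column
print(data.dtypes)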
Statistics with Python
More information about each attribute's distribution can be revealed with statistics.
The Pandas describe() function will display 8 statistics for each attribute:
- Count
- Mean - the average
- Standard Deviation - the spread of the values from the mean
- Minimum Value - the lowest value
- 25th Percentile - 25% of all data falls below
- 50th Percentile - the median
- 75th Percentile - 75% of all data falls below
- Maximum Value
Example
At this point, it's worth noting any NA values and any interesting distributions:
import pandas
fileurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
# Print the summary statistics
pandas.set_option('display.width', 100)
pandas.set_option('display.precision', 2)
description = data.describe()
print(description)
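To check for NA values directly, Pandas can count the missing entries in each column (a small addition to the example above; the iris data should show all zeros):
# Count missing (NA) values per column
print(data.isnull().sum())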
Class Distribution
Class distribution applies only to classification problems. Imbalanced problems, where one category has far more observations than the others, may need special handling in the data set. Class distribution can easily be pulled with Pandas.
Example
In this case, the data is evenly balanced:
import pandas
fileurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
# Print the class distribution
class_counts = data.groupby('class').size()
print(class_counts)
Correlation and Skew with Python
Correlation expresses the relationship between two variables, with values ranging from -1 (perfect negative) to 1 (perfect positive). Pearson's correlation coefficient is the most commonly used and assumes normally distributed data.
Example
import pandas
fileurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
# Print the pairwise correlations between the numeric columns
pandas.set_option('display.width', 100)
pandas.set_option('display.precision', 2)
correlations = data.drop(columns='class').corr(method='pearson')
print(correlations)
Skewed Distributions
With skewed distributions, it's assumed that the Gaussian (normal) distribution is shifted left or right. Knowing the skew allows you to perform data preparation (explained in the next chapter) to improve the model's accuracy. Skew can be calculated with the Pandas skew() function.
Example
import pandas
fileurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
# Print the skew of each numeric column
skew = data.drop(columns='class').skew()
print(skew)
Data Preprocessing and Rescale with Python
Algorithms make assumptions about the data, so some form of preprocessing may be needed to deliver better results. A popular method involves creating many different data views and transforms, then testing several algorithms on each; this helps select the best transformations.
Four preprocessing techniques are shown in the next sections (although many more exist). All preprocessing techniques follow a similar structure:
- A dataset is loaded
- A dataset is split into input and output variables
- A preprocessing transform is applied
- Data is summarized to show a change
Data Rescale
Varied ranges can affect the prediction accuracy of machine learning models. Data rescaling (also called normalization) scales attributes of different sizes onto a single scale. The rescaled range is typically 0 to 1, but it can be any range.
Rescaling is useful for certain machine learning algorithms such as gradient descent, regression, neural networks, and K-Nearest Neighbors.
Example
The MinMaxScaler class from scikit-learn is used to rescale the iris dataset:
import pandas
import numpy
from sklearn.preprocessing import MinMaxScaler
fileurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
array = data.values
# Separate the input (X) and output (Y) components
X = array[:,0:4]
Y = array[:,4]
# Dataset is scaled to the range 0 to 3
scaler = MinMaxScaler(feature_range=(0, 3))
rescaledX = scaler.fit_transform(X)
# Summarize the transformed data
numpy.set_printoptions(precision=2)
print(rescaledX[0:5,:])
Normalize Data with Python
Unlike rescaling, normalization in scikit-learn scales each observation (row) to a length of 1 (known as the unit norm). Normalization is common with clustering and classification problems. Additionally, normalized data is useful in neural networks and distance algorithms such as K-Nearest Neighbors.
Example
Data can be normalized with the Normalizer class:
import pandas
import numpy
from sklearn.preprocessing import Normalizer
fileurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
array = data.values
# Separate the input (X) and output (Y) components
X = array[:,0:4]
Y = array[:,4]
# Scale each row to unit length
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# Summarize the transformed data
numpy.set_printoptions(precision=2)
print(normalizedX[0:10,:])
Standardized Data with Python
Standardized data (also known as Z-score normalization) is useful when attributes have Gaussian distributions. Data is transformed to a mean of 0 and a standard deviation of 1. It is suitable for algorithms such as logistic regression, linear regression, and linear discriminant analysis.
Example
The Iris data is transformed to a standard deviation of 1 and a mean of 0.
import pandas
import numpy
from sklearn.preprocessing import StandardScaler
fileurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
array = data.values
# Separate the input (X) and output (Y) components
X = array[:,0:4]
Y = array[:,4]
# Transform to a mean of 0 and a standard deviation of 1
scaler = StandardScaler().fit(X)
standardX = scaler.transform(X)
# Summarize the transformed data
numpy.set_printoptions(precision=2)
print(standardX[0:20,:])
Binarized Data with Python
Raw data can be transformed into binary data, a process known as binarization. A predefined threshold determines the data's on and off state: data below the threshold is marked 0, while data above the threshold is marked 1. Binarization is popular in image processing, where a one can represent an object with a certain intensity and a zero can represent the background. Binarization can be used to recognize shapes, objects, and characters.
Example
The iris data is binarized with a threshold of 1.4:
import pandas
import numpy
from sklearn.preprocessing import Binarizer
fileurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
array = data.values
# Separate the input (X) and output (Y) components
X = array[:,0:4]
Y = array[:,4]
# Threshold is set to 1.4: values above it become 1, values below become 0
binarizer = Binarizer(threshold=1.4).fit(X)
binaryX = binarizer.transform(X)
# Summarize the transformed data
numpy.set_printoptions(precision=2)
print(binaryX[0:20,:])