Python Machine Learning
Machine Learning is a branch of artificial intelligence.
With Machine Learning, computers can learn from data.
Python's simple, readable syntax makes it ideal for developing machine learning applications.
Why Study Machine Learning?
- Have computers recognize patterns
- Automate analytical models
- Have systems learn from data
- Optimize processes
- Predict outcomes
Machine Learning Applications
- Image and speech recognition
- Fraud detection
- Credit scores
- Medical analysis
- Predicting equipment failure
- Advertisement targeting
- Driverless cars
- And much, much more!
Why Use Python for Machine Learning?
Machine Learning is easy to implement with Python's open-source libraries. Python lets you spend less time debugging and more time solving machine learning problems.
In addition, Python can use several of its core libraries to handle large amounts of data and save you time when you create and test your algorithms.
Machine Learning Types
There are three main types of machine learning.
Supervised Learning
This is the oldest form of machine learning. The algorithm is given labeled data with a target outcome, and a model is trained on that data until it predicts the outcome reliably. Great for classification.
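Example
A minimal sketch of supervised learning (an illustration added here, not one of this tutorial's datasets): scikit-learn's built-in iris data and a k-nearest neighbors classifier, which learns from labeled examples and is then scored on unseen data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Labeled data: measurements (X) and the known classes (y)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=1)
# Fit the classifier to the labeled training examples
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
# Score the trained model on examples it has never seen
print(model.score(X_test, y_test))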
Unsupervised Learning
The algorithm is not given an outcome. Unsupervised algorithms are used to segment data into different groups. Useful when data cannot be easily classified. Great for finding trends.
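Example
A minimal sketch of unsupervised learning (again an added illustration): KMeans from scikit-learn segments the iris measurements into three groups without ever seeing the class labels.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
# Only the measurements are used; the class labels are never shown to the algorithm
X = load_iris().data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1)
kmeans.fit(X)
# Each sample is assigned to one of the discovered groups
print(kmeans.labels_[0:10])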
Reinforcement Learning
The machine is programmed to make decisions. Feedback loops train the algorithm by trial and error. The machine learns to make better estimates based on past data.
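Example
A toy sketch of the trial-and-error feedback loop in plain Python (a hypothetical illustration, not a full reinforcement learning setup): an agent repeatedly picks one of three slot machines and refines its payout estimates from the rewards it receives.
import random
random.seed(1)
true_payouts = [0.3, 0.5, 0.8]  # hidden win probability of each machine
estimates = [0.0, 0.0, 0.0]     # the agent's running payout estimates
counts = [0, 0, 0]
for step in range(1000):
    # Explore a random machine 10% of the time, otherwise exploit the best estimate
    if random.random() < 0.1:
        arm = random.randrange(3)
    else:
        arm = estimates.index(max(estimates))
    reward = 1 if random.random() < true_payouts[arm] else 0
    counts[arm] += 1
    # Feedback loop: nudge the estimate toward the observed reward
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
# The estimates approach the true payouts, so the agent learns to favor machine 2
print(estimates)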
What You Should Already Know
Before taking on machine learning, you should be familiar with:
- Python Coding
- Python Libraries
- Algorithms
This tutorial will not cover the basics of Python. For that, see our Python Tutorial.
Machine Learning Environment Set Up
A working environment must be set up for machine learning in Python.
Python Version
Be sure that Python 3.6 or later is installed. The version can easily be checked from the command line:
C:\Users\Your Name>python --version
If Python is not installed, a copy can be downloaded from python.org.
Libraries
A number of open-source libraries are commonly used for machine learning in Python. Be sure to install these before continuing with the rest of the tutorial:
NumPy - Supports large scale arrays and matrices.
SciPy - A collection of mathematical algorithms built on NumPy.
Scikit-learn - Contains machine learning algorithms used for data analysis. Its preprocessing package helps standardize data for a better model fit.
Pandas - A data analysis library. Includes structures for manipulating tables.
Matplotlib - A 2D plotting library that produces publication-quality images. Visualize machine learning data sets with Matplotlib.
How to Install
Libraries can be installed using pip.
Navigate to Python's Scripts folder and run each of these commands to install the libraries:
C:\Users\Your Name\AppData\Local\Programs\Python\Python36-32\Scripts>pip install numpy
C:\Users\Your Name\AppData\Local\Programs\Python\Python36-32\Scripts>pip install scipy
C:\Users\Your Name\AppData\Local\Programs\Python\Python36-32\Scripts>pip install scikit-learn
C:\Users\Your Name\AppData\Local\Programs\Python\Python36-32\Scripts>pip install pandas
C:\Users\Your Name\AppData\Local\Programs\Python\Python36-32\Scripts>pip install matplotlib
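To confirm the installations succeeded, a quick sanity check is to import each library and print its version (each of these packages exposes a __version__ attribute):
import numpy
import scipy
import sklearn
import pandas
import matplotlib
# Print the installed version of each library
print(numpy.__version__, scipy.__version__, sklearn.__version__, pandas.__version__, matplotlib.__version__)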
Viewing Data with Python
Most machine learning algorithms start with raw (untouched) data. Looking at the data first may reveal insights that help with preprocessing and with guiding the machine learning algorithm.
Example
Here we view a sample dataset taken from the UCI machine learning repository:
import pandas
fileurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
print(data.head(10))
Data Dimensions
It's important to know how the data is shaped. It may take a long time to read many rows and columns. With the shape property, we can know the size of the data.
Example
Print the data shape; the result should show that the iris data has 150 rows, each with 5 columns:
shape = data.shape
print(shape)
Data Types
The dtypes property will be important. Strings might need to be transformed into integers to represent categories. The data types can be pulled from the raw data.
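Example
Continuing from the dataset loaded above, print the type of each column with the dtypes property. The four measurements should be floats, while the class column holds strings (shown by Pandas as object):
# Print the data type of each column
print(data.dtypes)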
Statistics with Python
More information about each attribute's distribution can be revealed with statistics.
The Pandas describe() function will display 8 statistics for each attribute:
- Count
- Mean - the average
- Standard Deviation - the spread of the values from the mean
- Minimum Value - the lowest value
- 25th Percentile - 25% of all data falls below
- 50th Percentile - the median
- 75th Percentile - 75% of all data falls below
- Maximum Value
Example
At this point, it's worth noting any NA values and any interesting distributions:
import pandas
fileurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
# Print the summary statistics
pandas.set_option('display.width', 100)
pandas.set_option('display.precision', 2)
description = data.describe()
print(description)
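To check for NA values directly, Pandas can count the missing entries in each column (a small addition to the example above; the iris data should show all zeros):
# Count missing (NA) values per column
print(data.isnull().sum())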
Class Distribution
Class distribution applies only to classification problems. Imbalanced problems, where one category has far more observations than the others, may need special handling in the data set. Class distribution can easily be pulled with Pandas.
Example
In this case, the data is evenly balanced:
import pandas
fileurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
# Print the class distribution
class_counts = data.groupby('class').size()
print(class_counts)
Correlation and Skew with Python
Correlation expresses the relationship between two variables, with values ranging from -1 (perfect negative) to 1 (perfect positive). Pearson's correlation coefficient is the most commonly used and assumes normally distributed data.
Example
import pandas
fileurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
# Print the pairwise correlations between the numeric columns
pandas.set_option('display.width', 100)
pandas.set_option('display.precision', 2)
correlations = data.drop(columns='class').corr(method='pearson')
print(correlations)
Skewed Distributions
With skewed distributions, it's assumed that the Gaussian (normal) distribution is shifted left or right. Knowing the skew allows you to perform data preparation (explained in the next chapter) to improve the model's accuracy. Skew can be calculated with the Pandas skew() function.
Example
import pandas
fileurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
# Print the skew of each numeric column
skew = data.drop(columns='class').skew()
print(skew)
Data Preprocessing and Rescale with Python
Algorithms make assumptions about the data, so some form of preprocessing may be needed to deliver better results. A popular method involves creating many different data views and transforms, then testing several algorithms on each; this helps select the best transformations.
Four preprocessing techniques are shown in the next sections (although many more exist). All preprocessing techniques follow a similar structure:
- A dataset is loaded
- A dataset is split into input and output variables
- A preprocessing transform is applied
- Data is summarized to show a change
Data Rescale
Varied ranges can affect the prediction accuracy of machine learning models. Data rescaling (also called normalization) scales attributes of different sizes onto a single scale. The rescaled range is typically 0 to 1, but it can be any range.
Rescaling is useful for certain machine learning algorithms such as gradient descent, regression, neural networks, and K-Nearest Neighbors.
Example
The MinMaxScaler class from scikit-learn is used to rescale the iris dataset:
import pandas
import numpy
from sklearn.preprocessing import MinMaxScaler
fileurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
array = data.values
# Separate the input (X) and output (Y) components
X = array[:,0:4]
Y = array[:,4]
# Dataset is scaled to the range 0 to 3
scaler = MinMaxScaler(feature_range=(0, 3))
rescaledX = scaler.fit_transform(X)
# Summarize the transformed data
numpy.set_printoptions(precision=2)
print(rescaledX[0:5,:])
Normalize Data with Python
Unlike rescaling, normalization in scikit-learn scales each observation (row) to a length of 1 (known as the unit norm). Normalization is common with clustering and classification problems. Additionally, normalized data is useful in neural networks and distance algorithms such as K-Nearest Neighbors.
Example
Data can be normalized with the Normalizer class:
import pandas
import numpy
from sklearn.preprocessing import Normalizer
fileurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
array = data.values
# Separate the input (X) and output (Y) components
X = array[:,0:4]
Y = array[:,4]
# Scale each row to unit length
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# Summarize the transformed data
numpy.set_printoptions(precision=2)
print(normalizedX[0:10,:])
Standardized Data with Python
Standardized data (also known as Z-score normalization) is useful when attributes have Gaussian distributions. Data is transformed to a mean of 0 and a standard deviation of 1. It is suitable for algorithms such as logistic regression, linear regression, and linear discriminant analysis.
Example
The Iris data is transformed to a standard deviation of 1 and a mean of 0.
import pandas
import numpy
from sklearn.preprocessing import StandardScaler
fileurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
array = data.values
# Separate the input (X) and output (Y) components
X = array[:,0:4]
Y = array[:,4]
# Transform to a mean of 0 and a standard deviation of 1
scaler = StandardScaler().fit(X)
standardX = scaler.transform(X)
# Summarize the transformed data
numpy.set_printoptions(precision=2)
print(standardX[0:20,:])
Binarized Data with Python
Raw data can be transformed into binary data, a process known as binarization. A predefined threshold determines the data's on and off state: data below the threshold is marked 0, while data above the threshold is marked 1. Binarization is popular in image processing, where a one can represent an object with a certain intensity and a zero can represent the background. Binarization can be used to recognize shapes, objects, and characters.
Example
The iris data is binarized with a threshold of 1.4:
import pandas
import numpy
from sklearn.preprocessing import Binarizer
fileurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
array = data.values
# Separate the input (X) and output (Y) components
X = array[:,0:4]
Y = array[:,4]
# Threshold is set to 1.4: values above it become 1, values below become 0
binarizer = Binarizer(threshold=1.4).fit(X)
binaryX = binarizer.transform(X)
# Summarize the transformed data
numpy.set_printoptions(precision=2)
print(binaryX[0:20,:])