
Python Machine Learning

Machine Learning is a branch of artificial intelligence.

With Machine Learning, computers can learn from data.

Python's syntax structure makes it ideal for developing machine learning applications.

Why Study Machine Learning?

  • Have computers recognize patterns
  • Automate analytical models
  • Have systems learn from data
  • Optimize processes
  • Predict outcomes

Machine Learning Applications

  • Image and speech recognition
  • Fraud detection
  • Credit scores
  • Medical analysis
  • Predicting equipment failure
  • Advertisement targeting
  • Driverless cars
  • And much, much more!

Why Use Python for Machine Learning?

Machine Learning is easy to implement with Python's open-source libraries. Python lets you spend less time debugging and more time solving machine learning problems.

In addition, Python can use several of its core libraries to handle large amounts of data and save you time when you create and test your algorithms.

Machine Learning Types

Three types of machine learning are described here.

Supervised Learning

This is the oldest form of machine learning. The algorithm is given labeled examples with a known target outcome, and the model is trained until its predictions are close enough to that outcome. It is well suited to classification.
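As a minimal sketch of the idea, the example below labels a new point with the class of its closest labeled example (a 1-nearest-neighbor classifier). The data, the labels, and the function name predict are invented for illustration.

```python
import math

# Labeled training data: (feature vector, target label). Values are invented.
training = [
    ((1.0, 1.1), "small"),
    ((1.2, 0.9), "small"),
    ((5.0, 5.2), "large"),
    ((4.8, 5.1), "large"),
]

def predict(point):
    """Return the label of the training example closest to `point`."""
    nearest = min(training, key=lambda item: math.dist(item[0], point))
    return nearest[1]

print(predict((1.1, 1.0)))
print(predict((5.1, 5.0)))
```

The known labels are what make this supervised: the prediction is always one of the outcomes seen during training.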

Unsupervised Learning

The algorithm is not given an outcome. Unsupervised algorithms are used to segment data into different groups. This is useful when data cannot be easily classified, and it is well suited to finding trends.
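A rough sketch of grouping without labels: the function below splits one-dimensional values into two groups by repeatedly moving two centers to the mean of their group (a simplified form of k-means). The function name two_means and the sample values are assumptions made for this example.

```python
def two_means(values, iterations=10):
    """Split 1-D values into two groups by repeatedly updating two centers."""
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Assign each value to the nearer center.
            index = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[index].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups

groups = two_means([1.0, 1.2, 0.8, 5.0, 5.3, 4.9])
print(groups)
```

No target outcome is supplied; the two groups emerge from the values alone.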

Reinforcement Learning

The machine is programmed to make decisions. Feedback loops train the algorithm by trial and error. The machine learns to make better estimates based on past data.

What You Should Already Know

Before taking on machine learning, you should be familiar with:

  • Python Coding
  • Python Libraries
  • Algorithms

This tutorial will not cover the basics of Python. For that, see our Python Tutorial.

Machine Learning Environment Set Up

A machine learning working environment must be set up for Python.

Python Version

Be sure that Python 3 is installed; current releases of the libraries used below no longer support Python 2.7. The version can easily be checked in the command line:

C:\Users\Your Name>python --version

If Python 3 is not installed, a copy can be downloaded from the official Python website.
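The version can also be checked from inside Python. This is a small sketch; the MINIMUM value and the function name version_ok are placeholders chosen for illustration, so adjust the minimum to whatever your libraries require.

```python
import sys

# The minimum version is an assumption for illustration; adjust it as needed.
MINIMUM = (3, 0)

def version_ok(version_info=sys.version_info, minimum=MINIMUM):
    """Return True if the given interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum

print(sys.version.split()[0], "OK" if version_ok() else "too old")
```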


A number of open-source libraries are commonly used for machine learning with Python. Be sure to install these before continuing with the rest of the tutorial:

NumPy - Supports large scale arrays and matrices.

SciPy - A collection of mathematical algorithms built on NumPy.

Scikit-learn - Contains machine learning algorithms used for data analysis. Its preprocessing package helps standardize data for closer data fits.

Pandas - A data analysis library. Includes structures for manipulating tables.

Matplotlib - A 2D plotting library that produces publication-quality images. Visualize machine learning data sets with Matplotlib.

How to Install

Libraries can be installed using pip.

Navigate to the Python Scripts folder and type these individually into the command line to install the libraries:

C:\Users\Your Name\AppData\Local\Programs\Python\Python36-32\Scripts>pip install numpy
C:\Users\Your Name\AppData\Local\Programs\Python\Python36-32\Scripts>pip install scipy
C:\Users\Your Name\AppData\Local\Programs\Python\Python36-32\Scripts>pip install scikit-learn
C:\Users\Your Name\AppData\Local\Programs\Python\Python36-32\Scripts>pip install pandas
C:\Users\Your Name\AppData\Local\Programs\Python\Python36-32\Scripts>pip install matplotlib
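After installing, one way to confirm that the libraries can be found is to ask Python's import system for each one. This is a sketch; the function name missing_modules is made up for this example. Note that scikit-learn is imported under the name sklearn.

```python
import importlib.util

# Import names of the libraries listed above (scikit-learn is imported as "sklearn").
REQUIRED = ["numpy", "scipy", "sklearn", "pandas", "matplotlib"]

def missing_modules(names):
    """Return the names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Missing:", missing_modules(REQUIRED) or "none")
```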

Viewing Data with Python

Most machine learning algorithms start with raw (untouched) data. Looking at the data may reveal insights that help us preprocess it and choose how to apply the machine learning algorithm.


Here we view a sample dataset taken from the UCI machine learning repository:

import pandas

fileurl = ''  # set to the path or URL of the iris data file (CSV, no header row)
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
print(data.head())

Data Dimensions

It's important to know how the data is shaped. It may take a long time to read many rows and columns. With the shape property, we can know the size of the data.


Print the data shape. The result should show that the iris data has 150 rows and that each row has 5 columns:

shape = data.shape
print(shape)

Data Types

The dtypes property is important. Strings might need to be transformed into integers to represent categories. The data types can be read from the raw data.


Print the data type of each column:

types = data.dtypes
print(types)
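Turning strings into integer categories can be sketched with the pandas category dtype. The labels below are made up for illustration and are not read from the data file.

```python
import pandas

# Made-up class labels standing in for a real column.
labels = pandas.Series(["setosa", "virginica", "setosa", "versicolor"])

# Converting to the category dtype assigns each distinct string an integer code.
# Categories are sorted, so setosa=0, versicolor=1, virginica=2.
codes = labels.astype("category").cat.codes
print(codes.tolist())
```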

Statistics with Python

More information about an attribute's shape can be revealed with statistics.

The pandas describe() method displays 8 statistics for each numeric attribute:

  • Count
  • Mean - the average
  • Standard Deviation - the spread of the values from the mean
  • Minimum Value - the lowest value
  • 25th Percentile - 25% of all data falls below
  • 50th Percentile - the median
  • 75th Percentile - 75% of all data falls below
  • Maximum value
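To show what these eight numbers mean, here is a sketch that computes them for a small made-up sample using only Python's statistics module. The "inclusive" quantile method interpolates linearly between data points, which should agree with the pandas default for percentiles.

```python
import statistics

values = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0]  # made-up sample

# Quartiles: the 25th, 50th and 75th percentiles.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "std": statistics.stdev(values),  # sample standard deviation
    "min": min(values),
    "25%": q1,
    "50%": median,
    "75%": q3,
    "max": max(values),
}

for name, value in summary.items():
    print(name, value)
```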


At this point, it is worth noting any NA values or any interesting distributions:

import pandas

fileurl = ''
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)

# Print the statistics
pandas.set_option('display.width', 100)
pandas.set_option('display.precision', 2)
description = data.describe()
print(description)

Class Distribution

Class distribution applies only to classification problems. Imbalanced problems, where one category has many more observations than the others, may need special handling in the data set. Class distribution data can easily be obtained with pandas.


In this case, the classes are evenly balanced:

import pandas

fileurl = ''
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)

# Print the class distribution
class_counts = data.groupby('class').size()
print(class_counts)
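The same kind of count can be sketched without pandas. The labels here are made up, and the imbalance ratio is just one simple way to judge how uneven the classes are.

```python
from collections import Counter

labels = ["a", "b", "a", "c", "b", "a"]  # made-up class labels
counts = Counter(labels)

# Ratio between the largest and smallest class; 1.0 means perfectly balanced.
imbalance = max(counts.values()) / min(counts.values())

print(dict(counts))
print(imbalance)
```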

Correlation and Skew with Python

Correlation expresses the relationship between two variables. Pearson's correlation coefficient, which ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), is most commonly used with normally distributed data.


import pandas

fileurl = ''
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)

# Print the correlation (numeric columns only; 'class' holds strings)
pandas.set_option('display.width', 100)
pandas.set_option('display.precision', 2)
correlations = data.corr(method='pearson', numeric_only=True)
print(correlations)
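For reference, Pearson's coefficient is the covariance of two variables divided by the product of their standard deviations. The sketch below computes it directly with NumPy; the function name pearson and the sample values are assumptions for this example.

```python
import numpy

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    x = numpy.asarray(x, dtype=float)
    y = numpy.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / numpy.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = [1.0, 2.0, 3.0, 4.0]
r_up = pearson(x, [2.0, 4.0, 6.0, 8.0])    # y rises with x
r_down = pearson(x, [8.0, 6.0, 4.0, 2.0])  # y falls as x rises

print(r_up, r_down)
```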

Skew Distributions

A skewed distribution is one in which the Gaussian (normal) distribution is assumed to be shifted to the left or right. Knowing the skew lets you choose data preparation steps (explained in the next section) that can improve model accuracy. Skew can be calculated with the pandas skew() method.


import pandas

fileurl = ''
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)

# Print the skew (numeric columns only; 'class' holds strings)
skew = data.skew(numeric_only=True)
print(skew)
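A simple way to see what skew measures is to compute it by hand as the third central moment divided by the standard deviation cubed. This sketch uses the plain (population) formula; pandas applies an additional sample-size correction, so its values differ slightly. The function name skewness and the data are made up.

```python
import numpy

def skewness(values):
    """Population skewness: third central moment over the std cubed."""
    v = numpy.asarray(values, dtype=float)
    centered = v - v.mean()
    m2 = (centered ** 2).mean()  # variance
    m3 = (centered ** 3).mean()  # third central moment
    return float(m3 / m2 ** 1.5)

symmetric = skewness([1, 2, 3, 4, 5])       # balanced around the mean
right_tailed = skewness([1, 1, 1, 2, 10])   # long tail to the right

print(symmetric, right_tailed)
```

A value near 0 suggests a symmetric distribution, while a positive value indicates a longer tail on the right.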

Data Preprocessing and Rescale with Python

Assumptions made by algorithms may require some form of data processing to help deliver better results. A popular method involves creating many different data views and transforms, then testing many algorithms. This method helps select the best transformations.

Four preprocessing techniques are shown in the next sections (although many more exist). All preprocessing techniques follow a similar structure:

  • A dataset is loaded
  • A dataset is split into input and output variables
  • A preprocessing transform is applied
  • Data is summarized to show a change

Data Rescale

Varied ranges can affect the prediction accuracy of our machine learning models. Data rescale (also called normalization) is used to bring attributes of different sizes onto a single scale. The rescaled range is typically between 0 and 1 but can be any range.

Data rescale is useful for algorithms trained with gradient descent, such as regression and neural networks, and for distance-based algorithms such as K-Nearest Neighbors.


The MinMaxScaler class from scikit-learn is used to rescale the iris dataset:

import pandas
import scipy
import numpy
from sklearn.preprocessing import MinMaxScaler

fileurl = ''
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
array = data.values

# input/output component separation
X = array[:,0:4]
Y = array[:,4]

# Dataset is rescaled to the range 0 to 3
scaler = MinMaxScaler(feature_range=(0, 3))
rescaledX = scaler.fit_transform(X)

# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])
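The arithmetic behind min-max rescaling can be sketched in a few lines of NumPy: each column is shifted by its minimum, divided by its range, and then stretched to the target range. The function name min_max_rescale and the sample values are assumptions for illustration.

```python
import numpy

def min_max_rescale(X, low=0.0, high=1.0):
    """Rescale each column of X linearly so it spans [low, high]."""
    X = numpy.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Assumes every column has at least two distinct values (col_max > col_min).
    unit = (X - col_min) / (col_max - col_min)
    return unit * (high - low) + low

X = [[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]]
rescaled = min_max_rescale(X, 0, 3)
print(rescaled)
```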

Normalize Data with Python

Unlike rescale, normalization in scikit-learn scales each observation (row) to a length of 1 (algebraically referred to as the unit norm). Normalization is common with clustering and classification problems. Additionally, normalized data is useful in neural networks and distance algorithms such as K-Nearest Neighbors.


Data can be normalized with the Normalizer class.

import pandas
import scipy
import numpy
from sklearn.preprocessing import Normalizer

fileurl = ''
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
array = data.values

# input/output component separation
X = array[:,0:4]
Y = array[:,4]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

# summarize transformed data
numpy.set_printoptions(precision=3)
print(normalizedX[0:5,:])
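Scaling a row to unit length means dividing it by its Euclidean (L2) norm, which is the default norm used by Normalizer. A minimal NumPy sketch, with a made-up function name normalize_rows and invented values:

```python
import numpy

def normalize_rows(X):
    """Divide each row of X by its Euclidean length."""
    X = numpy.asarray(X, dtype=float)
    norms = numpy.linalg.norm(X, axis=1, keepdims=True)
    # Assumes no row is all zeros, which would divide by zero.
    return X / norms

normalized = normalize_rows([[3.0, 4.0], [1.0, 0.0]])
print(normalized)
```

The row (3, 4) has length 5, so it becomes (0.6, 0.8).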

Standardized Data with Python

Standardization (also known as Z-score normalization) is useful when attributes have Gaussian distributions. Data is transformed to have a mean of 0 and a standard deviation of 1. The technique works best when the data has been rescaled. It is suitable for algorithms such as logistic regression, linear regression, and linear discriminant analysis.


The Iris data is transformed to a standard deviation of 1 and a mean of 0.

import pandas
import scipy
import numpy
from sklearn.preprocessing import StandardScaler

fileurl = ''
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
array = data.values

# input/output component separation
X = array[:,0:4]
Y = array[:,4]
scaler = StandardScaler().fit(X)
standardX = scaler.transform(X)

# summarize transformed data
numpy.set_printoptions(precision=3)
print(standardX[0:5,:])
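The Z-score itself is simple arithmetic: subtract the column mean and divide by the column standard deviation. This sketch uses the population standard deviation, as StandardScaler does; the function name standardize and the values are assumptions for this example.

```python
import numpy

def standardize(X):
    """Transform each column of X to mean 0 and standard deviation 1."""
    X = numpy.asarray(X, dtype=float)
    # Population standard deviation (ddof=0); assumes no column is constant.
    return (X - X.mean(axis=0)) / X.std(axis=0)

standard = standardize([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
print(standard)
```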

Binarized Data with Python

Raw data can be transformed into binary data through a process known as binarization. A predefined threshold determines the data's on and off state. Values at or below the threshold are marked 0, while values above the threshold are marked 1.
Binarization is popular in image processing. A one could be used to represent an object with a certain intensity and a zero could represent the background. Binarization can be used to recognize shapes, objects, and characters.


The iris data is binarized with a threshold of 1.4:

import pandas
import scipy
import numpy
from sklearn.preprocessing import Binarizer

fileurl = ''
names = ['sepal_length','sepal_width','petal_length','petal_width','class']
data = pandas.read_csv(fileurl, names=names)
array = data.values

# input/output component separation
X = array[:,0:4]
Y = array[:,4]

# threshold is set to 1.4
binarizer = Binarizer(threshold=1.4).fit(X)
binaryX = binarizer.transform(X)

# summarize transformed data
numpy.set_printoptions(precision=3)
print(binaryX[0:5,:])
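The thresholding rule can be sketched directly with a NumPy comparison. The function name binarize and the two sample rows are used only for illustration; note that a value exactly equal to the threshold maps to 0.

```python
import numpy

def binarize(X, threshold):
    """Return 1.0 where X is greater than threshold, otherwise 0.0."""
    X = numpy.asarray(X, dtype=float)
    return (X > threshold).astype(float)

binary = binarize([[5.1, 3.5, 1.4, 0.2], [7.0, 3.2, 4.7, 1.4]], threshold=1.4)
print(binary)
```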

