Pandas - Data Correlations
Finding Relationships
A great aspect of the Pandas module is the corr() method.
The corr() method calculates the relationship (correlation) between each pair of columns in your data set.
The examples on this page use a CSV file called 'data.csv'.
Example
Show the relationship between the columns:
df.corr()
Result
          Duration     Pulse  Maxpulse  Calories
Duration  1.000000 -0.155408  0.009403  0.922721
Pulse    -0.155408  1.000000  0.786535  0.025120
Maxpulse  0.009403  0.786535  1.000000  0.203814
Calories  0.922721  0.025120  0.203814  1.000000
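The call above assumes a DataFrame named df already exists. As a minimal sketch of the full sequence, the block below builds a small DataFrame with the same column names and calls corr() on it. The values are invented for illustration and are not the contents of data.csv, so the numbers it prints will differ from the result shown above.

```python
import pandas as pd

# Hypothetical stand-in for data.csv (values invented for illustration).
df = pd.DataFrame({
    "Duration": [60, 60, 45, 30, 90, 20],
    "Pulse": [110, 117, 109, 105, 100, 120],
    "Maxpulse": [130, 145, 175, 140, 127, 150],
    "Calories": [409.1, 479.0, 282.4, 195.1, 604.1, 110.4],
})

# With the real file you would load it instead:
# df = pd.read_csv("data.csv")

# corr() returns a square table: one row and one column per numeric column.
corr_matrix = df.corr()
print(corr_matrix)
```

The returned table is symmetric, so the value in row "Duration", column "Calories" is the same as the value in row "Calories", column "Duration".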
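Note:
In pandas versions before 2.0, the corr() method ignores non-numeric columns by default. In pandas 2.0 and later, non-numeric columns raise an error unless you pass numeric_only=True.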
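How non-numeric columns are handled can depend on the pandas version you have installed. One way to avoid relying on that behavior is to select the numeric columns yourself before calling corr(). The sketch below assumes a hypothetical text column named "Activity" that is not part of data.csv.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, 30, 90],
    "Calories": [409.1, 282.4, 195.1, 604.1],
    "Activity": ["run", "walk", "walk", "run"],  # hypothetical text column
})

# Keep only the numeric columns, then correlate those.
numeric_df = df.select_dtypes(include="number")
corr_matrix = numeric_df.corr()
print(corr_matrix)
```

Only "Duration" and "Calories" appear in the result, because the text column was dropped before the calculation.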
Result Explained
The result of the corr() method is a table of numbers, each describing how strong the relationship is between two columns.
The number varies from -1 to 1.
1 means that there is a 1 to 1 relationship (a perfect correlation): for this data set, each time a value went up in the first column, the other one went up as well.
0.9 is also a good relationship: if you increase one value, the other will probably increase as well.
-0.9 is just as good a relationship as 0.9, but if you increase one value, the other will probably go down.
0.2 means NOT a good relationship: if one value goes up, that does not mean the other will.
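To make the ends of the scale concrete, here is a small sketch with made-up columns: one that rises in step with x, one that falls in step with x, and one with no straight-line trend against x. The column names and values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "up": [3, 5, 7, 9, 11],         # always 2*x + 1
    "down": [-1, -2, -3, -4, -5],   # always -x
    "flat": [2, 1, 4, 1, 2],        # no straight-line trend against x
})

# Series.corr() gives the correlation between two individual columns.
r_up = df["x"].corr(df["up"])       # close to 1
r_down = df["x"].corr(df["down"])   # close to -1
r_flat = df["x"].corr(df["flat"])   # close to 0
print(r_up, r_down, r_flat)
```

Because of floating-point rounding, the printed values may differ from exactly 1, -1 and 0 in the last decimal places.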
What is a good correlation?
It depends on the use, but I think it is safe to say you need at least 0.6 (or -0.6) to call it a good correlation.
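The 0.6 rule of thumb can be applied mechanically. The sketch below re-creates the result table from this page by hand and lists every pair of different columns whose correlation is at least 0.6 in absolute value. The THRESHOLD name and the loop are illustrative choices, not something pandas provides.

```python
import pandas as pd

cols = ["Duration", "Pulse", "Maxpulse", "Calories"]

# The correlation table shown in the Result section, typed in by hand.
corr = pd.DataFrame(
    [
        [1.000000, -0.155408, 0.009403, 0.922721],
        [-0.155408, 1.000000, 0.786535, 0.025120],
        [0.009403, 0.786535, 1.000000, 0.203814],
        [0.922721, 0.025120, 0.203814, 1.000000],
    ],
    index=cols,
    columns=cols,
)

THRESHOLD = 0.6  # rule-of-thumb cutoff from the text above

# Visit each pair of different columns once and keep the strong ones.
strong_pairs = []
for i, first in enumerate(cols):
    for second in cols[i + 1:]:
        r = corr.loc[first, second]
        if abs(r) >= THRESHOLD:
            strong_pairs.append((first, second, r))

for first, second, r in strong_pairs:
    print(first, second, r)
```

For this table, two pairs pass the cutoff: "Duration" with "Calories" and "Pulse" with "Maxpulse".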
Perfect Correlation:
We can see that "Duration" and "Duration" got the number 1.000000, which makes sense: each column always has a perfect relationship with itself.
Good Correlation:
"Duration" and "Calories" got a 0.922721 correlation, which is a very good correlation. We can predict that the longer you work out, the more calories you burn, and the other way around: if you burned a lot of calories, you probably had a long workout.
Bad Correlation:
"Duration" and "Maxpulse" got a 0.009403 correlation, which is a very bad correlation, meaning that we cannot predict the max pulse just by looking at the duration of the workout, and vice versa.
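To compare good and bad correlations for one column at a glance, you can take that column out of the table and sort it by strength. The sketch below does this for "Duration", using the result table from this page typed in by hand. The variable names are illustrative.

```python
import pandas as pd

cols = ["Duration", "Pulse", "Maxpulse", "Calories"]

# The correlation table shown in the Result section, typed in by hand.
corr = pd.DataFrame(
    [
        [1.000000, -0.155408, 0.009403, 0.922721],
        [-0.155408, 1.000000, 0.786535, 0.025120],
        [0.009403, 0.786535, 1.000000, 0.203814],
        [0.922721, 0.025120, 0.203814, 1.000000],
    ],
    index=cols,
    columns=cols,
)

# Take the "Duration" column, drop its correlation with itself,
# and rank the remaining columns by absolute strength.
ranked = corr["Duration"].drop("Duration").abs().sort_values(ascending=False)
print(ranked)
```

"Calories" comes first and "Maxpulse" last, matching the good and bad correlations described above.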