2. Basic Data Exploration
# 2.1 Using Pandas to Get Familiar With Your Data
The first step in any machine learning project is to familiarize yourself with the data. You'll use the Pandas library for this. Pandas is the primary tool data scientists use for exploring and manipulating data. Most people abbreviate pandas in their code as `pd`. We do this with the command:

```python
import pandas as pd
```
The most important part of the Pandas library is the DataFrame. A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database.
Pandas has powerful methods for most things you'll want to do with this type of data.
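As a minimal sketch of what a DataFrame looks like, you can build one by hand from a dictionary of columns (the column names and values here are invented for illustration):

```python
import pandas as pd

# Build a small table from a dictionary mapping column names to lists of values
# (these columns and numbers are made up for illustration)
homes = pd.DataFrame({
    'Price': [480000, 652000, 510000],
    'Rooms': [2, 3, 3],
})

print(homes)
print(homes.shape)  # (3, 2): 3 rows, 2 columns
```

Each key becomes a column and each list supplies that column's values, row by row.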
As an example, we'll look at data about home prices in Melbourne, Australia. In the hands-on exercises, you will apply the same processes to a new dataset, which has home prices in Iowa.
The example (Melbourne) data is at the file path `../input/melbourne-housing-snapshot/melb_data.csv`.
We load and explore the data with the following commands:
```python
# save filepath to variable for easier access
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
# read the data and store it in a DataFrame called melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path)
# print a summary of the data in melbourne_data
melbourne_data.describe()
```
# 2.2 Interpreting Data Description
The results show 8 numbers for each column in your original dataset. The first number, the count, shows how many rows have non-missing values.
Missing values arise for many reasons. For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house. We'll come back to the topic of missing data.
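A tiny example of how a missing value shows up in the count. The column names and values below are made up for illustration; note that `count()` ignores `NaN` entries:

```python
import pandas as pd
import numpy as np

# A column with a missing value (NaN): the 2nd-bedroom size
# wasn't recorded for the one-bedroom house
df = pd.DataFrame({
    'Bedrooms': [1, 2, 2],
    'Bedroom2Size': [np.nan, 12.0, 10.0],
})

# count reports non-missing values only
print(df['Bedroom2Size'].count())  # 2, even though there are 3 rows
```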
The second value is the mean, which is the average. Under that, std is the standard deviation, which measures how numerically spread out the values are.
To interpret the min, 25%, 50%, 75% and max values, imagine sorting each column from lowest to highest value. The first (smallest) value is the min. If you go a quarter way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced "25th percentile"). The 50th and 75th percentiles are defined analogously, and the max is the largest number.
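The same percentiles can be computed directly with `quantile`, which is a quick way to check your intuition on a toy column (the numbers below are made up):

```python
import pandas as pd

# Five made-up prices, already sorted from lowest to highest
prices = pd.Series([100, 200, 300, 400, 500])

# quantile(0.25) is the 25th percentile: a quarter of the way
# through the sorted values
print(prices.min())            # 100
print(prices.quantile(0.25))   # 200.0
print(prices.quantile(0.50))   # 300.0 (the median)
print(prices.max())            # 500
```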
# 2.3 Exercise: Explore Your Data
Get started with your first coding exercise.
# Step 1: Loading Data
Read the Iowa data file into a Pandas DataFrame called `home_data`.
```python
import pandas as pd

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

# Fill in the line below to read the file into a variable home_data
home_data = pd.read_csv(iowa_file_path)

# Call the line below with no argument to check that you've loaded the data correctly
step_1.check()
```
# Step 2: Review The Data
Use the command you learned to view summary statistics of the data. Then fill in variables to answer the following questions.
```python
# Print summary statistics in the next line
home_data.describe()
```
```python
# What is the average lot size (rounded to nearest integer)?
avg_lot_size = round(home_data.LotArea.mean())

# How old is the newest home (current year minus the year it was built)?
# The exercise treats 2019 as the current year
newest_home_age = 2019 - home_data.YearBuilt.max()

# Checks your answers
step_2.check()
```
# Think About Your Data
The newest house in your data isn't that new. A few potential explanations for this:
- They haven't built new houses where this data was collected.
- The data was collected a long time ago. Houses built after the data publication wouldn't show up.
If the reason is explanation #1 above, does that affect your trust in the model you build with this data? What about if it is reason #2?
How could you dig into the data to see which explanation is more plausible?
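One way to dig in is to count how many homes were built in each of the most recent years. A sketch of that check, using a toy `YearBuilt` column standing in for `home_data.YearBuilt` (the real values come from the Iowa data):

```python
import pandas as pd

# Toy construction years standing in for home_data.YearBuilt
year_built = pd.Series([1995, 1998, 2001, 2004, 2006, 2007, 2008, 2008, 2009, 2009])

# Count homes built per year, in chronological order
per_year = year_built.value_counts().sort_index()
print(per_year.tail())

# A steady trickle of homes right up to the newest year suggests the
# data is simply old (explanation #2); a long gap before the newest
# year suggests new construction stopped (explanation #1)
```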
Check out this discussion thread to see what others think or to add your ideas.