Describe
p35 Real-world datasets can have millions of rows and columns.
""" Describe DataFrame
Real-world datasets can have millions of rows and columns,
so we rely on pulling samples and summary statistics.
describe() does not always tell the full story:
Survived is categorical, but pandas treats it as numerical.
Both iloc and loc are very useful during data cleaning.
To print tables outside Jupyter, use DataFrame.to_markdown(),
which requires the tabulate package:
pip install tabulate
"""
import pandas as pd
import pathlib
DIR = pathlib.Path(__file__).resolve().parent / '../_data/'
df = pd.read_csv(DIR / 'titanic.csv')
print("Show dimensions | shape:")
print(df.shape) # (1313, 6)
print("First two rows | head(2): ")
print(df.head(2).to_markdown())
# | | Name | PClass | Age | Sex | Survived | SexCode |
# |---:|:---------------|:---------|------:|:-------|-----------:|----------:|
# | 0 | Allen, Miss Eli| 1st | 29 | female | 1 | 1 |
# | 1 | Allison, Miss H| 1st | 2 | female | 0 | 1 |
print("Show statistics | describe():")
print(df.describe().to_markdown())
# | | Age | Survived | SexCode |
# |:------|--------:|------------:|------------:|
# | count | 756 | 1313 | 1313 |
# | mean | 30.398 | 0.342727 | 0.351866 |
# | std | 14.259 | 0.474802 | 0.477734 |
# | min | 0.17 | 0 | 0 |
# | 25% | 21 | 0 | 0 |
# | 50% | 28 | 0 | 0 |
# | 75% | 39 | 1 | 1 |
# | max | 71 | 1 | 1 |
print("Select first row by index | iloc[0]:")
print(df.iloc[0].to_markdown()) # first
# | | 0 |
# |:---------|:-----------------------------|
# | Name | Allen, Miss Elisabeth Walton |
# | PClass | 1st |
# | Age | 29.0 |
# | Sex | female |
# | Survived | 1 |
# | SexCode | 1 |
print("Second, third and fourth | iloc[1:4]:")
print(df.iloc[1:4].to_markdown()) # second, third and fourth
print("Select up to and including fourth | iloc[:4]")
print(df.iloc[:4].to_markdown()) # up to, and including fourth
# Use the Name column as a non-numerical index (the column itself is kept)
df = df.set_index('Name', drop=False)
print("Select by Name:")
print(df.loc['Allen, Miss Elisabeth Walton'].to_markdown())
# | | Allen, Miss Elisabeth Walton |
# |:---------|:-------------------------------|
# | Name | Allen, Miss Elisabeth Walton |
# | PClass | 1st |
# | Age | 29.0 |
# | Sex | female |
# | Survived | 1 |
# | SexCode | 1 |
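The remark that describe() can mislead for a column such as Survived can be illustrated with a minimal sketch. The four-row frame below is made-up data, not the Titanic file, and the variable names are hypothetical; it only assumes the standard pandas calls astype('category') and value_counts().

```python
import pandas as pd

# Hypothetical toy data standing in for titanic.csv
toy = pd.DataFrame({
    "Age": [29.0, 2.0, 30.0, 25.0],
    "Survived": [1, 0, 0, 0],
})

# describe() treats Survived as a number and reports a mean
numeric_summary = toy.describe()
survived_mean = numeric_summary.loc["mean", "Survived"]

# Casting to category makes pandas summarize the column as labels
as_category = toy["Survived"].astype("category")
category_summary = as_category.describe()  # count, unique, top, freq

# value_counts() shows how many rows fall in each class
counts = toy["Survived"].value_counts()

print(survived_mean)
print(category_summary)
print(counts)
```

For a 0/1 column the mean happens to equal the share of survivors, but quartiles and standard deviation say little; the class counts are usually easier to read.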
Condition
p38 Conditionally selecting and filtering data are common tasks.
""" Condition and filtering
Conditionally selecting and filtering data are common tasks.
Sometimes you are interested in only a subset of the dataset.
"""
import pandas as pd
import pathlib
DIR = pathlib.Path(__file__).resolve().parent / '../_data/'
df = pd.read_csv(DIR / 'titanic.csv')
print("Condition, Females only:")
print(df[df['Sex'] == 'female'].head(2).to_markdown())
# | | Name | PClass | Age | Sex | Survived | SexCode |
# |---:|:---------------|:---------|------:|:-------|-----------:|----------:|
# | 0 | Allen, Miss Eli| 1st | 29 | female | 1 | 1 |
# | 1 | Allison, Miss H| 1st | 2 | female | 0 | 1 |
print("Filter | Males aged 60 or older:")
print(df[(df['Sex'] == 'male') & (df['Age'] >= 60)].head(2).to_markdown())
# | | Name | PClass | Age | Sex | Survived | SexCode |
# |---:|:-----------------|:---------|------:|:------|-----------:|----------:|
# | 9 | Artagaveytia, Mr | 1st | 71 | male | 0 | 0 |
# | 72 | Crosby, Captain E| 1st | 70 | male | 0 | 0 |
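How the boolean masks combine may be easier to see on a tiny frame. This is a sketch on invented rows that reuse the column names of titanic.csv; the names toy, older_men, not_first_class and older_men_q are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with the same column names as titanic.csv
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "PClass": ["1st", "2nd", "3rd", "1st"],
    "Age": [29.0, 65.0, 41.0, 71.0],
    "Sex": ["female", "male", "male", "male"],
})

# Each comparison yields a boolean Series; & means "and", | means "or".
# Parentheses are needed because & and | bind tighter than == and >=.
older_men = toy[(toy["Sex"] == "male") & (toy["Age"] >= 60)]

# ~ negates a mask; isin() tests membership in a list of values
not_first_class = toy[~toy["PClass"].isin(["1st"])]

# query() expresses the same filter as a string expression
older_men_q = toy.query("Sex == 'male' and Age >= 60")

print(older_men)
print(not_first_class)
```

Whether to use bracket masks or query() is mostly a matter of taste; masks compose well in code, while query() tends to read better for long conditions.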
Replace
p39 Replace accepts regular expressions.
""" Replace in DataFrame
Pandas replace is an easy way to find and replace values.
Replace accepts regular expressions.
"""
import pandas as pd
import pathlib
DIR = pathlib.Path(__file__).resolve().parent / '../_data/'
df = pd.read_csv(DIR / 'titanic.csv')
# Replace one value
R = df['Sex'].replace("female", "Woman")
print("Replace in one column:")
print(R.head(2).to_markdown())
# | | Sex |
# |---:|:------|
# | 0 | Woman |
# | 1 | Woman |
# Replace multiple values
R = df['Sex'].replace(['female', 'male'], ['Woman', 'Man'])
print("Replace multiple values:")
print(R.head(5).to_markdown())
# | | Sex |
# |---:|:------|
# | 0 | Woman |
# | 1 | Woman |
# | 2 | Man |
# | 3 | Woman |
# | 4 | Man |
# Replace a value wherever it occurs, in every column
R = df.replace(1, 'one')
print("Replace all:")
print(R.head(2).to_markdown())
# | | Name | PClass | Age | Sex | Survived | SexCode |
# |---:|:---------------|:---------|------:|:-------|:-----------|:----------|
# | 0 | Allen, Miss Eli| 1st | 29 | female | one | one |
# | 1 | Allison, Miss H| 1st | 2 | female | 0 | one |
# Regex
R = df.replace(r'1st', 'First', regex=True)
print("Regex replace:")
print(R.head(2).to_markdown())
# | | Name | PClass | Age | Sex | Survived | SexCode |
# |---:|:---------------|:---------|------:|:-------|-----------:|----------:|
# | 0 | Allen, Miss Eli| First | 29 | female | 1 | 1 |
# | 1 | Allison, Miss H| First | 2 | female | 0 | 1 |
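Two further forms of replace may be worth a sketch: a dict that maps old values to new ones, and a regular expression with a capture group. The frame is invented toy data and the pattern is only an illustration, not something taken from the original script.

```python
import pandas as pd

# Hypothetical toy data with two of the titanic.csv columns
toy = pd.DataFrame({
    "PClass": ["1st", "2nd", "3rd"],
    "Sex": ["female", "male", "male"],
})

# A dict maps old values to new ones in a single call.
# Without regex=True only whole values match, so "female" is not
# affected by the "male" entry.
sex_named = toy["Sex"].replace({"female": "Woman", "male": "Man"})

# With regex=True the pattern may use groups; \1 is the captured digit
class_named = toy["PClass"].replace(
    r"^(\d)(st|nd|rd)$", r"class \1", regex=True
)

print(sex_named.tolist())
print(class_named.tolist())
```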
Apply function
p53 It is common to write a function to perform some useful operation.
""" Apply a function over all elements
Despite the temptation to fall back on for loops,
a more Pythonic solution uses pandas' apply method.
It is common to write a function that performs some useful operation,
such as separating first and last names or converting strings to floats.
"""
import pandas as pd
import pathlib
DIR = pathlib.Path(__file__).resolve().parent / '../_data/'
df = pd.read_csv(DIR / 'titanic.csv')
print("First two names uppercased:")
for name in df['Name'][:2]:
    print(name.upper())
# ALLEN, MISS ELISABETH WALTON
# ALLISON, MISS HELEN LORAINE
print("Use list comprehension:")
print([name.upper() for name in df['Name'][:2]])
# ['ALLEN, MISS ELISABETH WALTON',
# 'ALLISON, MISS HELEN LORAINE']
print("Better, using pandas' apply:")
def uppercase(x):
    return x.upper()
print(df['Name'].apply(uppercase)[:2].to_markdown())
# | | Name |
# |---:|:-----------------------------|
# | 0 | ALLEN, MISS ELISABETH WALTON |
# | 1 | ALLISON, MISS HELEN LORAINE |
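The docstring mentions separating first and last names as a typical use of apply. A minimal sketch, assuming the surname is whatever precedes the first comma (true for the two names shown above, not checked against the whole file); the helper last_name is hypothetical.

```python
import pandas as pd

# The two names printed in the example above
names = pd.Series([
    "Allen, Miss Elisabeth Walton",
    "Allison, Miss Helen Loraine",
], name="Name")

def last_name(full_name):
    """Return the part before the first comma (assumed to be the surname)."""
    return full_name.split(",")[0].strip()

# apply() calls last_name once per element and returns a new Series
last_names = names.apply(last_name)

# For plain string operations the .str accessor avoids a custom function
upper_names = names.str.upper()

print(last_names.tolist())
print(upper_names.tolist())
```

For simple string methods the .str accessor is usually the shorter choice; apply is more useful once the logic no longer fits a single built-in method.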