Programming

  minte9
LearnRemember



Pandas / Data Cleaning


Describe

Real world datasets could have millions of rows and columns.
 
""" Data clearing / Describe DataFrame

Real world cases could have millions of rows and columns.
Describe do not always tell the full story.

For example in `titanic.csv`, `survived` is categorical, 
but pandas treats it as numerical. 

Both iloc and loc are very useful during data cleaning.
For output data (outside Jupyter) use DataFrame' to_markdown()
    pip install tabulate
"""

import pandas as pd
import pathlib

DIR = pathlib.Path(__file__).resolve().parent
df = pd.read_csv(DIR / '../titanic.csv')

# Set index to non-numerical
df2 = df.set_index(df['Name'])

first_row        = df.iloc[0]
second_to_fourth = df.iloc[1:4]
up_to_fourth     = df.iloc[:4]
by_name          = df2.loc['Allen, Miss Elisabeth Walton']

print("Shape:", df.shape, "\n") # (1313, 6)
print("First two rows | head(2): "); print(df.head(2).to_markdown(), "\n")
print("Show statistics | describe(): "); print(df.describe().to_markdown(), "\n")
print("Select first row by index | iloc[0]: "); print(first_row.to_markdown(), "\n")
print("Second, third and fourth | iloc[1:4]: "); print(second_to_fourth.to_markdown(), "\n")
print("Up to and including fourth | iloc[:4]: "); print(up_to_fourth.to_markdown(), "\n")
print("Select by Name: "); print(by_name.to_markdown(), "\n")

"""
Shape: (1313, 6)

First two rows | head(2): 
|    | Name                         | PClass   |   Age | Sex    |   Survived |   SexCode |
|---:|:-----------------------------|:---------|------:|:-------|-----------:|----------:|
|  0 | Allen, Miss Elisabeth Walton | 1st      |    29 | female |          1 |         1 |
|  1 | Allison, Miss Helen Loraine  | 1st      |     2 | female |          0 |         1 |

Show statistics | describe():
|       |     Age |    Survived |     SexCode |
|:------|--------:|------------:|------------:|
| count | 756     | 1313        | 1313        |
| mean  |  30.398 |    0.342727 |    0.351866 |
| std   |  14.259 |    0.474802 |    0.477734 |
| min   |   0.17  |    0        |    0        |
| 25%   |  21     |    0        |    0        |
| 50%   |  28     |    0        |    0        |
| 75%   |  39     |    1        |    1        |
| max   |  71     |    1        |    1        |

Select first row by index | iloc[0]:
|          | 0                            |
|:---------|:-----------------------------|
| Name     | Allen, Miss Elisabeth Walton |
| PClass   | 1st                          |
| Age      | 29.0                         |
| Sex      | female                       |
| Survived | 1                            |
| SexCode  | 1                            |

Second, third and fourth | iloc[1:4]:
|    | Name                                          | PClass   |   Age | Sex    |   Survived |   SexCode |
|---:|:----------------------------------------------|:---------|------:|:-------|-----------:|----------:|
|  1 | Allison, Miss Helen Loraine                   | 1st      |     2 | female |          0 |         1 |
|  2 | Allison, Mr Hudson Joshua Creighton           | 1st      |    30 | male   |          0 |         0 |
|  3 | Allison, Mrs Hudson JC (Bessie Waldo Daniels) | 1st      |    25 | female |          0 |         1 |

Select up to and including fourth | iloc[:4]
|    | Name                                          | PClass   |   Age | Sex    |   Survived |   SexCode |
|---:|:----------------------------------------------|:---------|------:|:-------|-----------:|----------:|
|  0 | Allen, Miss Elisabeth Walton                  | 1st      |    29 | female |          1 |         1 |
|  1 | Allison, Miss Helen Loraine                   | 1st      |     2 | female |          0 |         1 |
|  2 | Allison, Mr Hudson Joshua Creighton           | 1st      |    30 | male   |          0 |         0 |
|  3 | Allison, Mrs Hudson JC (Bessie Waldo Daniels) | 1st      |    25 | female |          0 |         1 |

Select by Name:
|          | Allen, Miss Elisabeth Walton   |
|:---------|:-------------------------------|
| Name     | Allen, Miss Elisabeth Walton   |
| PClass   | 1st                            |
| Age      | 29.0                           |
| Sex      | female                         |
| Survived | 1                              |
| SexCode  | 1                              |
"""

Condition

Conditional selecting and filtering data are common tasks.
 
""" Data clearing / Condition and filtering

Conditional selecting and filtering data are common tasks.
Sometinmes you are interseted only of some subset of dataset.
"""

import pandas as pd
import pathlib

DIR = pathlib.Path(__file__).resolve().parent
df = pd.read_csv(DIR / '../titanic.csv')

females = df[df['Sex'] == 'female']
males_60 = df[(df['Sex'] == 'male') & (df['Age'] >= 60)]

print("Condition, Females only:") ; print(females.head(2).to_markdown(), "\n")
print("Filter | Males age 60:") ; print(males_60.head(2).to_markdown(), "\n")

"""
Condition, Females only:
|    | Name                         | PClass   |   Age | Sex    |   Survived |   SexCode |
|---:|:-----------------------------|:---------|------:|:-------|-----------:|----------:|
|  0 | Allen, Miss Elisabeth Walton | 1st      |    29 | female |          1 |         1 |
|  1 | Allison, Miss Helen Loraine  | 1st      |     2 | female |          0 |         1 | 

Filter | Males age 60:
|    | Name                           | PClass   |   Age | Sex   |   Survived |   SexCode |
|---:|:-------------------------------|:---------|------:|:------|-----------:|----------:|
|  9 | Artagaveytia, Mr Ramon         | 1st      |    71 | male  |          0 |         0 |
| 72 | Crosby, Captain Edward Gifford | 1st      |    70 | male  |          0 |         0 | 
"""

Replace

Replace accepts regex expressions.
 
""" Replace in DataFrame

Pandas replace is an easy way to find and replace values.
Replace accepts regular expressions.
"""

import pandas as pd
import pathlib

DIR = pathlib.Path(__file__).resolve().parent
df = pd.read_csv(DIR / '../titanic.csv')

R1 = df['Sex'].replace("female", "Woman")
R2 = df['Sex'].replace(['female', 'male'], ['Woman', 'Man'])
R3 = df.replace(1, 'one')
R4 = df.replace(r'1st', 'First', regex=True)

print("Replace in one column:"); print(R1.head(2).to_markdown(), "\n")
print("Replace multiple values:"); print(R2.head(5).to_markdown(), "\n")
print("Replace all:"); print(R3.head(2).to_markdown(), "\n")
print("Regex replace:"); print(R4.head(2).to_markdown(), "\n")

"""
Replace in one column:
|    | Sex   |
|---:|:------|
|  0 | Woman |
|  1 | Woman | 

Replace multiple values:
|    | Sex   |
|---:|:------|
|  0 | Woman |
|  1 | Woman |
|  2 | Man   |
|  3 | Woman |
|  4 | Man   | 

Replace all:
|    | Name                         | PClass   |   Age | Sex    | Survived   | SexCode   |
|---:|:-----------------------------|:---------|------:|:-------|:-----------|:----------|
|  0 | Allen, Miss Elisabeth Walton | 1st      |    29 | female | one        | one       |
|  1 | Allison, Miss Helen Loraine  | 1st      |     2 | female | 0          | one       | 

Regex replace:
|    | Name                         | PClass   |   Age | Sex    |   Survived |   SexCode |
|---:|:-----------------------------|:---------|------:|:-------|-----------:|----------:|
|  0 | Allen, Miss Elisabeth Walton | First    |    29 | female |          1 |         1 |
|  1 | Allison, Miss Helen Loraine  | First    |     2 | female |          0 |         1 | 
"""

Apply function

It is common to write a function to perform some useful operation.
 
""" Apply a function over all elements

Despite the temptation to fall back on for loops,
a more Pythonic solution uses pandas' apply method.

It is common to write a function to perform some useful operation, 
like separating first and last names, converting strings to floats.
"""

import pandas as pd
import pathlib

DIR = pathlib.Path(__file__).resolve().parent
df = pd.read_csv(DIR / '../titanic.csv')

def loop(data):
    for x in data:
        print(x.upper())

def list_comprehension(data):
        print([x.upper() for x in data])

def apply(data):
    def uppercase(x):
        return x.upper()
    print(df['Name'].apply(uppercase)[:2].to_markdown()) # Look Here

print("First two names uppercased:")
print("Loop:"); loop(df['Name'][:2])
print("Use list comprehension:"); list_comprehension(df['Name'][:2])
print("Use pandas' apply() - better!"); apply(df['Name'][:2])


"""
First two names uppercased: 

Loop:
 ALLEN, MISS ELISABETH WALTON
 ALLISON, MISS HELEN LORAINE

Use list comprehension:
 ['ALLEN, MISS ELISABETH WALTON', 'ALLISON, MISS HELEN LORAINE']

Use pandas' apply() - better!
|    | Name                         |
|---:|:-----------------------------|
|  0 | ALLEN, MISS ELISABETH WALTON |
|  1 | ALLISON, MISS HELEN LORAINE  |
"""





References