PROGRAMMING

  minte9
learningjourney




Describe

p35 Real world datasets could have millions of rows and columns.
 
""" Describe DataFrame

Real-world datasets can have millions of rows and columns.
We rely on pulling samples and summary statistics.

describe() does not always tell the full story.
Survived is categorical, but pandas treats it as numerical.
Both iloc and loc are very useful during data cleaning.

To print tables outside Jupyter, use DataFrame's to_markdown(),
which requires the tabulate package:
pip install tabulate
"""

import pandas as pd
import pathlib

DIR = pathlib.Path(__file__).resolve().parent / '../_data/'
df = pd.read_csv(DIR / 'titanic.csv')

print("Show dimensions | shape:")
print(df.shape) # (1313, 6)


print("First two rows | head(2): ")
print(df.head(2).to_markdown())

# |    | Name           | PClass   |   Age | Sex    |   Survived |   SexCode |
# |---:|:---------------|:---------|------:|:-------|-----------:|----------:|
# |  0 | Allen, Miss Eli| 1st      |    29 | female |          1 |         1 |
# |  1 | Allison, Miss H| 1st      |     2 | female |          0 |         1 |


print("Show statistics | describe():")
print(df.describe().to_markdown())

# |       |     Age |    Survived |     SexCode |
# |:------|--------:|------------:|------------:|
# | count | 756     | 1313        | 1313        |
# | mean  |  30.398 |    0.342727 |    0.351866 |
# | std   |  14.259 |    0.474802 |    0.477734 |
# | min   |   0.17  |    0        |    0        |
# | 25%   |  21     |    0        |    0        |
# | 50%   |  28     |    0        |    0        |
# | 75%   |  39     |    1        |    1        |
# | max   |  71     |    1        |    1        |



print("Select first row by index | iloc[0]:")
print(df.iloc[0].to_markdown()) # first

# |          | 0                            |
# |:---------|:-----------------------------|
# | Name     | Allen, Miss Elisabeth Walton |
# | PClass   | 1st                          |
# | Age      | 29.0                         |
# | Sex      | female                       |
# | Survived | 1                            |
# | SexCode  | 1                            |


print("Second, third and fourth | iloc[1:4]:")
print(df.iloc[1:4].to_markdown()) # second, third and fourth

print("Select up to and including fourth | iloc[:4]")
print(df.iloc[:4].to_markdown())  # up to, and including fourth


# Set index to non-numerical
df = df.set_index(df['Name'])
print("Select by Name:") 
print(df.loc['Allen, Miss Elisabeth Walton'].to_markdown())

# |          | Allen, Miss Elisabeth Walton   |
# |:---------|:-------------------------------|
# | Name     | Allen, Miss Elisabeth Walton   |
# | PClass   | 1st                            |
# | Age      | 29.0                           |
# | Sex      | female                         |
# | Survived | 1                              |
# | SexCode  | 1                              |
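
The note that Survived is categorical but summarized as a number can be illustrated with a small sketch. The `sample` frame below is a hypothetical miniature, not the real titanic.csv, and its values are only illustrative. Casting the column to the `category` dtype is one possible way to get a more meaningful summary, not necessarily the one the original author had in mind.

```python
import pandas as pd

# Hypothetical miniature of the Titanic data (illustrative values only)
sample = pd.DataFrame({
    "Age": [29.0, 2.0, 30.0, 25.0],
    "Survived": [1, 0, 0, 0],
})

# As integers, Survived gets a mean and quartiles, which are hard to interpret
numeric_summary = sample["Survived"].describe()

# As a category, describe() reports count, unique, top and freq instead
categorical_summary = sample["Survived"].astype("category").describe()

# value_counts() shows how many rows fall into each category
counts = sample["Survived"].value_counts()

print(numeric_summary)
print(categorical_summary)
print(counts)
```

For a flag such as Survived, the category counts are usually easier to read than a mean of 0.25.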

Condition

p38 Conditional selecting and filtering data are common tasks.
 
""" Condition and filtering

Conditional selecting and filtering data are common tasks.
Sometimes you are interested in only a subset of the dataset.
"""

import pandas as pd
import pathlib

DIR = pathlib.Path(__file__).resolve().parent / '../_data/'
df = pd.read_csv(DIR / 'titanic.csv')


print("Condition, Females only:")
print(df[df['Sex'] == 'female'].head(2).to_markdown())

# |    | Name           | PClass   |   Age | Sex    |   Survived |   SexCode |
# |---:|:---------------|:---------|------:|:-------|-----------:|----------:|
# |  0 | Allen, Miss Eli| 1st      |    29 | female |          1 |         1 |
# |  1 | Allison, Miss H| 1st      |     2 | female |          0 |         1 |


print("Filter | Males aged 60 or older:")
print(df[(df['Sex'] == 'male') & (df['Age'] >= 60)].head(2).to_markdown())

# |    | Name             | PClass   |   Age | Sex   |   Survived |   SexCode |
# |---:|:-----------------|:---------|------:|:------|-----------:|----------:|
# |  9 | Artagaveytia, Mr | 1st      |    71 | male  |          0 |         0 |
# | 72 | Crosby, Captain E| 1st      |    70 | male  |          0 |         0 |
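
The filter above combines two conditions with `&`. As a rough sketch of how the other mask operators behave, the example below uses a hypothetical four-row `sample` frame (illustrative values, not the real dataset). It shows negation with `~`, membership tests with `isin()`, and what happens to rows with a missing Age.

```python
import pandas as pd

# Hypothetical rows in the shape of the Titanic data (illustrative values only)
sample = pd.DataFrame({
    "Sex": ["female", "male", "male", "female"],
    "Age": [29.0, 71.0, 30.0, None],
    "PClass": ["1st", "1st", "2nd", "3rd"],
})

# Each comparison needs its own parentheses, because & binds tighter than == and >=
older_men = sample[(sample["Sex"] == "male") & (sample["Age"] >= 60)]

# ~ negates a boolean mask
not_first_class = sample[~(sample["PClass"] == "1st")]

# isin() keeps rows whose value appears in the given list
upper_classes = sample[sample["PClass"].isin(["1st", "2nd"])]

# A comparison against a missing Age is False, so that row is filtered out
adults = sample[sample["Age"] >= 18]

print(older_men.index.tolist())        # [1]
print(not_first_class.index.tolist())  # [2, 3]
print(upper_classes.index.tolist())    # [0, 1, 2]
print(adults.index.tolist())           # [0, 1, 2]
```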

Replace

p39 replace() accepts regular expressions.
 
""" Replace in DataFrame

Pandas replace is an easy way to find and replace values.
Replace accepts regular expressions.
"""

import pandas as pd
import pathlib

DIR = pathlib.Path(__file__).resolve().parent / '../_data/'
df = pd.read_csv(DIR / 'titanic.csv')


# Replace one value
R = df['Sex'].replace("female", "Woman")
print("Replace in one column:")
print(R.head(2).to_markdown())

# |    | Sex   |
# |---:|:------|
# |  0 | Woman |
# |  1 | Woman |


# Replace multiple values
R = df['Sex'].replace(['female', 'male'], ['Woman', 'Man'])
print("Replace multiple values:")
print(R.head(5).to_markdown())

# |    | Sex   |
# |---:|:------|
# |  0 | Woman |
# |  1 | Woman |
# |  2 | Man   |
# |  3 | Woman |
# |  4 | Man   |


# Replace all
R = df.replace(1, 'one')
print("Replace all:")
print(R.head(2).to_markdown())

# |    | Name           | PClass   |   Age | Sex    | Survived   | SexCode   |
# |---:|:---------------|:---------|------:|:-------|:-----------|:----------|
# |  0 | Allen, Miss Eli| 1st      |    29 | female | one        | one       |
# |  1 | Allison, Miss H| 1st      |     2 | female | 0          | one       |


# Regex
R = df.replace(r'1st', 'First', regex=True)
print("Regex replace:") 
print(R.head(2).to_markdown())

# |    | Name           | PClass   |   Age | Sex    |   Survived |   SexCode |
# |---:|:---------------|:---------|------:|:-------|-----------:|----------:|
# |  0 | Allen, Miss Eli| First    |    29 | female |          1 |         1 |
# |  1 | Allison, Miss H| First    |     2 | female |          0 |         1 |
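
Two details of `replace` are easy to miss, and the sketch below tries to show them on small hypothetical Series (the value "21st" is invented purely to demonstrate the effect). First, a dict can map several values in one call. Second, with `regex=True` the pattern is substituted wherever it matches inside a string, so anchoring it with `^` and `$` restricts the replacement to whole values.

```python
import pandas as pd

# Values in the style of the PClass column
pclass = pd.Series(["1st", "2nd", "3rd", "1st"], name="PClass")

# A dict maps each old value to its replacement in a single call
mapped = pclass.replace({"1st": "First", "2nd": "Second", "3rd": "Third"})

# replace() returns a new Series, the original is left unchanged
print(pclass.tolist())  # ['1st', '2nd', '3rd', '1st']
print(mapped.tolist())  # ['First', 'Second', 'Third', 'First']

# Hypothetical values chosen to show partial matches
labels = pd.Series(["1st", "21st"])

# Unanchored pattern: also rewrites the "1st" inside "21st"
loose = labels.replace(r"1st", "First", regex=True)

# Anchored pattern: only values that are exactly "1st" are replaced
strict = labels.replace(r"^1st$", "First", regex=True)

print(loose.tolist())   # ['First', '2First']
print(strict.tolist())  # ['First', '21st']
```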

Apply function

p53 It is common to write a function to perform some useful operation.
 
""" Apply a function over all elements

Despite the temptation to fall back on for loops,
a more Pythonic solution uses pandas' apply method.

It is common to write a function to perform some useful operation,
like separating first and last names or converting strings to floats.
"""

import pandas as pd
import pathlib

DIR = pathlib.Path(__file__).resolve().parent / '../_data/'
df = pd.read_csv(DIR / 'titanic.csv')


print("First two names uppercased:")
for name in df['Name'][:2]:
    print(name.upper())

# ALLEN, MISS ELISABETH WALTON
# ALLISON, MISS HELEN LORAINE


print("Use list comprehension:")
print([name.upper() for name in df['Name'][:2]])

# ['ALLEN, MISS ELISABETH WALTON', 
#  'ALLISON, MISS HELEN LORAINE']


print("Better, using pandas' apply:")
def uppercase(x):
    return x.upper()
print(df['Name'].apply(uppercase)[:2].to_markdown())

# |    | Name                         |
# |---:|:-----------------------------|
# |  0 | ALLEN, MISS ELISABETH WALTON |
# |  1 | ALLISON, MISS HELEN LORAINE  |
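
The docstring mentions separating first and last names as a typical use of `apply`. A minimal sketch of that idea follows, assuming names are formatted as "Last, Title First" like the two rows shown above. The helper `last_name` is hypothetical and only written for this illustration. For plain string operations such as uppercasing, pandas also provides the vectorized `.str` accessor, shown here as an alternative.

```python
import pandas as pd

# The two names shown in the output above
names = pd.Series([
    "Allen, Miss Elisabeth Walton",
    "Allison, Miss Helen Loraine",
], name="Name")


def last_name(full_name):
    """Return the text before the first comma, without surrounding spaces."""
    return full_name.split(",", 1)[0].strip()


# apply() calls the function once per element and returns a new Series
last_names = names.apply(last_name)
print(last_names.tolist())  # ['Allen', 'Allison']

# For built-in string operations, the .str accessor avoids a custom function
upper_names = names.str.upper()
print(upper_names.tolist())
# ['ALLEN, MISS ELISABETH WALTON', 'ALLISON, MISS HELEN LORAINE']
```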
