
Chapter 2: Python Libraries and Data Manipulation

In the previous chapter, we learned about the basics of AI and Python programming. Now, let's explore some powerful Python libraries that are essential for data manipulation and analysis in AI projects.

Introduction to NumPy:

NumPy is a fundamental library for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a wide range of mathematical functions to operate on these arrays efficiently. NumPy forms the foundation for many other Python libraries used in data science and machine learning.

Installing NumPy: To install NumPy, open a terminal or command prompt and run the following command:

pip install numpy

Creating NumPy Arrays: NumPy arrays are the core data structure in NumPy. They are similar to Python lists but offer more efficient storage and computation, especially for large datasets.

import numpy as np

# Create a 1-dimensional array from a list
arr1 = np.array([1, 2, 3, 4, 5])
print(arr1)

# Create a 2-dimensional array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2)

# Create an array with a specific data type
arr3 = np.array([1, 2, 3], dtype=np.float64)
print(arr3)

# Create an array filled with zeros
zeros_arr = np.zeros((3, 4))
print(zeros_arr)

# Create an array filled with ones
ones_arr = np.ones((2, 3))
print(ones_arr)

# Create an array with a range of values
range_arr = np.arange(0, 10, 2)
print(range_arr)

# Create an array with evenly spaced values
linspace_arr = np.linspace(0, 1, 5)
print(linspace_arr)

Output:

[1 2 3 4 5]
[[1 2 3]
 [4 5 6]]
[1. 2. 3.]
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
[[1. 1. 1.]
 [1. 1. 1.]]
[0 2 4 6 8]
[0.   0.25 0.5  0.75 1.  ]
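
NumPy arrays store their elements in a single typed block of memory and expose metadata describing that layout. As a quick sketch, reusing the 2-dimensional array from above, you can inspect these attributes:

import numpy as np

arr2 = np.array([[1, 2, 3], [4, 5, 6]])

# Inspect the array's layout and storage
print(arr2.shape)   # (2, 3) -- two rows, three columns
print(arr2.ndim)    # 2 -- number of dimensions
print(arr2.dtype)   # platform default integer type (e.g. int64)
print(arr2.size)    # 6 -- total number of elements
print(arr2.nbytes)  # total bytes consumed by the elements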

Array Operations and Broadcasting: NumPy provides a wide range of operations that can be performed on arrays. These operations can be mathematical, statistical, or logical in nature. NumPy also supports broadcasting, which allows arrays with different shapes to be used in arithmetic operations.

import numpy as np

arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[7, 8, 9], [10, 11, 12]])

# Element-wise addition
print(arr1 + arr2)

# Element-wise subtraction
print(arr2 - arr1)

# Element-wise multiplication
print(arr1 * arr2)

# Element-wise division
print(arr2 / arr1)

# Matrix multiplication
print(np.dot(arr1, arr2.T))

# Broadcasting example
print(arr1 + 5)

# Statistical operations
print(np.mean(arr1))
print(np.median(arr1))
print(np.std(arr1))

Output:

[[ 8 10 12]
 [14 16 18]]
[[6 6 6]
 [6 6 6]]
[[ 7 16 27]
 [40 55 72]]
[[7.  4.  3. ]
 [2.5 2.2 2. ]]
[[ 50  68]
 [122 167]]
[[ 6  7  8]
 [ 9 10 11]]
3.5
3.5
1.707825127659933
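
Broadcasting goes further than adding a scalar: when two arrays have compatible trailing dimensions, NumPy stretches the smaller array across the larger one. Here is a small sketch adding a 1-dimensional array to each row of a 2-dimensional array:

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
row = np.array([10, 20, 30])

# The shape (3,) array is broadcast across both rows of the shape (2, 3) array
print(arr + row)
# [[11 22 33]
#  [14 25 36]]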

Array Indexing and Slicing: NumPy arrays support indexing and slicing operations similar to Python lists. You can access individual elements, rows, columns, or subsets of an array using indices and slices.

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Accessing elements
print(arr[0, 0])  # First element
print(arr[1, 2])  # Element at row 1, column 2

# Slicing arrays
print(arr[:2, :2])  # First two rows and columns
print(arr[1:, :])   # From the second row to the end
print(arr[:, 1])    # Second column

# Conditional indexing
print(arr[arr > 5])  # Elements greater than 5

Output:

1
6
[[1 2]
 [4 5]]
[[4 5 6]
 [7 8 9]]
[2 5 8]
[6 7 8 9]

Introduction to Pandas:

Pandas is a powerful library for data manipulation and analysis in Python. It provides data structures like Series and DataFrame that allow you to work with structured data efficiently. Pandas is built on top of NumPy and integrates well with other libraries in the data science ecosystem.

Installing Pandas: To install Pandas, run the following command:

pip install pandas

Creating Series and DataFrames: A Series is a one-dimensional labeled array capable of holding any data type, while a DataFrame is a two-dimensional labeled data structure with columns of potentially different types.

import pandas as pd

# Creating a Series
series = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
print(series)

# Creating a DataFrame
data = {'name': ['Alice', 'Bob', 'Claire', 'David'],
        'age': [25, 30, 27, 35],
        'city': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)
print(df)

# Creating a DataFrame from a CSV file
# (assumes a small data.csv file exists in the working directory)
csv_df = pd.read_csv('data.csv')
print(csv_df.head())  # Display the first five rows

Output:

a    1
b    2
c    3
d    4
dtype: int64
     name  age      city
0   Alice   25  New York
1     Bob   30    London
2  Claire   27     Paris
3   David   35     Tokyo
     name  age      city
0   Alice   25  New York
1     Bob   30    London
2  Claire   27     Paris
3   David   35     Tokyo
4    John   25       NYC

Data Selection and Filtering: Pandas provides various methods to select and filter data in Series and DataFrames. You can access data using labels, positions, or boolean conditions.

import pandas as pd

data = {'name': ['Alice', 'Bob', 'Claire', 'David'],
        'age': [25, 30, 27, 35],
        'city': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)

# Selecting a single column
print(df['name'])

# Selecting multiple columns
print(df[['name', 'age']])

# Selecting rows based on a condition
print(df[df['age'] > 30])

# Selecting rows based on multiple conditions
print(df[(df['age'] > 25) & (df['city'] == 'London')])

# Selecting rows and columns using loc and iloc
print(df.loc[0, 'name'])  # Select by labels
print(df.iloc[1, 2])      # Select by positions

Output:

0     Alice
1       Bob
2    Claire
3     David
Name: name, dtype: object

     name  age
0   Alice   25
1     Bob   30
2  Claire   27
3   David   35

    name  age   city
3  David   35  Tokyo

   name  age    city
1   Bob   30  London

Alice
London

Data Manipulation: Pandas provides a wide range of functions and methods for data manipulation, including merging, grouping, reshaping, and transforming data.

import pandas as pd

data1 = {'name': ['Alice', 'Bob', 'Claire'],
         'age': [25, 30, 27]}
df1 = pd.DataFrame(data1, index=['A', 'B', 'C'])

data2 = {'name': ['David', 'Emma', 'Frank'],
         'city': ['Tokyo', 'Paris', 'London']}
df2 = pd.DataFrame(data2, index=['D', 'E', 'F'])

# Concatenating DataFrames row-wise
concat_df = pd.concat([df1, df2])
print(concat_df)

# Merging DataFrames on 'name'; an outer join keeps rows from both sides,
# even though no names match here
merged_df = pd.merge(df1, df2, on='name', how='outer')
print(merged_df)

# Grouping data and counting the rows in each group
grouped_df = df1.groupby('age').size()
print(grouped_df)

# Reshaping data by pivoting the 'age' values into columns
reshaped_df = df1.pivot(index='name', columns='age', values='age')
print(reshaped_df)

Output:

     name   age    city
A   Alice  25.0     NaN
B     Bob  30.0     NaN
C  Claire  27.0     NaN
D   David   NaN   Tokyo
E    Emma   NaN   Paris
F   Frank   NaN  London

     name   age    city
0   Alice  25.0     NaN
1     Bob  30.0     NaN
2  Claire  27.0     NaN
3   David   NaN   Tokyo
4    Emma   NaN   Paris
5   Frank   NaN  London

age
25    1
27    1
30    1
dtype: int64

age       25    27    30
name
Alice   25.0   NaN   NaN
Bob      NaN   NaN  30.0
Claire   NaN  27.0   NaN
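
Grouping becomes more useful when several rows share the same key. The following sketch uses a small made-up sales table to show aggregation per group:

import pandas as pd

# Made-up example data: several rows per city
sales = pd.DataFrame({'city': ['Tokyo', 'Paris', 'Tokyo', 'Paris', 'Tokyo'],
                      'amount': [100, 80, 150, 120, 90]})

# Average amount per city
print(sales.groupby('city')['amount'].mean())

# Total and row count per city
print(sales.groupby('city')['amount'].agg(['sum', 'count']))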

Handling Missing Data: Real-world datasets often contain missing or null values. Pandas provides methods to detect, drop, and fill missing values. Concatenating df1 and df2 from the previous example introduces NaN wherever the columns do not overlap, so we can reuse those DataFrames to demonstrate.

import pandas as pd

data1 = {'name': ['Alice', 'Bob', 'Claire'],
         'age': [25, 30, 27]}
df1 = pd.DataFrame(data1, index=['A', 'B', 'C'])

data2 = {'name': ['David', 'Emma', 'Frank'],
         'city': ['Tokyo', 'Paris', 'London']}
df2 = pd.DataFrame(data2, index=['D', 'E', 'F'])

# Concatenating introduces NaN where the columns do not overlap
df = pd.concat([df1, df2])
print(df)

# Counting missing values per column
print(df.isna().sum())

# Dropping rows that are missing a value in the 'age' column
print(df.dropna(subset=['age']))

# Filling missing values with default values
print(df.fillna({'age': 0, 'city': 'Unknown'}))

Output:

     name   age    city
A   Alice  25.0     NaN
B     Bob  30.0     NaN
C  Claire  27.0     NaN
D   David   NaN   Tokyo
E    Emma   NaN   Paris
F   Frank   NaN  London

name    0
age     3
city    3
dtype: int64

     name   age  city
A   Alice  25.0   NaN
B     Bob  30.0   NaN
C  Claire  27.0   NaN

     name   age     city
A   Alice  25.0  Unknown
B     Bob  30.0  Unknown
C  Claire  27.0  Unknown
D   David   0.0    Tokyo
E    Emma   0.0    Paris
F   Frank   0.0   London

Feature Engineering and Data Transformation: Feature engineering involves creating new features from existing ones to improve the performance of machine learning models, while data transformation includes tasks like scaling, encoding, and normalization. Feature engineering is covered in more depth in later chapters; for now, the example below uses scikit-learn (install it with pip install scikit-learn) for scaling and encoding.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

data = {'name': ['Alice', 'Bob', 'Claire', 'David'],
        'age': [25, 30, 27, 35],
        'city': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)

# Creating new features
df['age_squared'] = df['age'] ** 2
df['is_adult'] = df['age'] >= 18

# Scaling numerical features
scaler = MinMaxScaler()
df[['age', 'age_squared']] = scaler.fit_transform(df[['age', 'age_squared']])

# Encoding categorical features
encoder = LabelEncoder()
df['city'] = encoder.fit_transform(df['city'])

print(df)

Output:

     name  age  city  age_squared  is_adult
0   Alice  0.0     1     0.000000      True
1     Bob  0.5     0     0.458333      True
2  Claire  0.2     2     0.173333      True
3   David  1.0     3     1.000000      True
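
Note that LabelEncoder assigns an arbitrary integer to each category, which some models may wrongly interpret as an ordering. A common alternative, sketched here with the same example data, is one-hot encoding with pandas' get_dummies:

import pandas as pd

data = {'name': ['Alice', 'Bob', 'Claire', 'David'],
        'city': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)

# One-hot encode the 'city' column: one indicator column per city
encoded_df = pd.get_dummies(df, columns=['city'])
print(encoded_df)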

These are just a few examples of the powerful functionalities provided by NumPy and Pandas. In the upcoming chapters, we'll leverage these libraries to preprocess and analyze data for various AI tasks.

Data manipulation and preprocessing are crucial steps in building effective AI models. By mastering these techniques, you'll be well-prepared to tackle real-world AI problems and build robust solutions.

Let's continue our journey and explore the fascinating world of machine learning in the next chapter! We'll dive into the fundamentals of machine learning, different types of algorithms, and how to build and evaluate machine learning models using Python libraries like scikit-learn.