In the previous chapter, we learned about the basics of AI and Python programming. Now, let's explore some powerful Python libraries that are essential for data manipulation and analysis in AI projects.
Introduction to NumPy:
NumPy is a fundamental library for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a wide range of mathematical functions to operate on these arrays efficiently. NumPy forms the foundation for many other Python libraries used in data science and machine learning.
Installing NumPy: To install NumPy, open a terminal or command prompt and run the following command:
pip install numpy
Creating NumPy Arrays: NumPy arrays are the core data structure in NumPy. They resemble Python lists, but every element of an array shares one data type, which makes storage and computation far more efficient, especially for large datasets.
import numpy as np
# Create a 1-dimensional array from a list
arr1 = np.array([1, 2, 3, 4, 5])
print(arr1)
# Create a 2-dimensional array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2)
# Create an array with a specific data type
arr3 = np.array([1, 2, 3], dtype=np.float64)
print(arr3)
# Create an array filled with zeros
zeros_arr = np.zeros((3, 4))
print(zeros_arr)
# Create an array filled with ones
ones_arr = np.ones((2, 3))
print(ones_arr)
# Create an array with a range of values
range_arr = np.arange(0, 10, 2)
print(range_arr)
# Create an array with evenly spaced values
linspace_arr = np.linspace(0, 1, 5)
print(linspace_arr)
Output:
[1 2 3 4 5]
[[1 2 3]
 [4 5 6]]
[1. 2. 3.]
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
[[1. 1. 1.]
 [1. 1. 1.]]
[0 2 4 6 8]
[0.   0.25 0.5  0.75 1.  ]
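To make the comparison with Python lists concrete, here is a minimal sketch that applies the same operator to a list and to an array, then inspects a few array attributes. The variable names are illustrative only.

```python
import numpy as np

py_list = [1, 2, 3]
arr = np.array(py_list)

# Multiplying a list repeats it; multiplying an array scales each element
doubled_list = py_list * 2
doubled_arr = arr * 2
print(doubled_list)   # [1, 2, 3, 1, 2, 3]
print(doubled_arr)    # [2 4 6]

# Every array records its shape, number of dimensions, and element type
matrix = np.zeros((3, 4))
print(matrix.shape, matrix.ndim, matrix.dtype)   # (3, 4) 2 float64
```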
Array Operations and Broadcasting: NumPy provides a wide range of mathematical, statistical, and logical operations on arrays. Arithmetic operators such as + and * work element by element. NumPy also supports broadcasting, which lets arrays of different but compatible shapes (for example, an array and a single number) be combined in arithmetic operations.
import numpy as np
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[7, 8, 9], [10, 11, 12]])
# Element-wise addition
print(arr1 + arr2)
# Element-wise subtraction
print(arr2 - arr1)
# Element-wise multiplication
print(arr1 * arr2)
# Element-wise division
print(arr2 / arr1)
# Matrix multiplication
print(np.dot(arr1, arr2.T))
# Broadcasting example
print(arr1 + 5)
# Statistical operations
print(np.mean(arr1))
print(np.median(arr1))
print(np.std(arr1))
Output:
[[ 8 10 12]
 [14 16 18]]
[[6 6 6]
 [6 6 6]]
[[ 7 16 27]
 [40 55 72]]
[[7.  4.  3. ]
 [2.5 2.2 2. ]]
[[ 50  68]
 [122 167]]
[[ 6  7  8]
 [ 9 10 11]]
3.5
3.5
1.707825127659933
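The broadcasting example above only adds a single number to an array. As a further sketch, the snippet below combines arrays of different shapes; the names matrix, row, and col are illustrative, not part of the original example.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])                # shape (3,)
col = np.array([[100], [200]])              # shape (2, 1)

# The row is stretched across both rows of the matrix
row_sum = matrix + row
print(row_sum)

# The column is stretched across all three columns
col_sum = matrix + col
print(col_sum)

# Shapes that cannot be aligned raise a ValueError
try:
    matrix + np.array([1, 2])               # shape (2,) against (2, 3)
    shapes_compatible = True
except ValueError as exc:
    shapes_compatible = False
    print("Incompatible shapes:", exc)
```

NumPy compares shapes from the last dimension backwards; two dimensions are compatible when they are equal or when one of them is 1.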
Array Indexing and Slicing: NumPy arrays support indexing and slicing operations similar to Python lists. You can access individual elements, rows, columns, or subsets of an array using indices and slices.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Accessing elements
print(arr[0, 0]) # First element
print(arr[1, 2]) # Element at row 1, column 2
# Slicing arrays
print(arr[:2, :2]) # First two rows and columns
print(arr[1:, :]) # From the second row to the end
print(arr[:, 1]) # Second column
# Conditional indexing
print(arr[arr > 5]) # Elements greater than 5
Output:
1
6
[[1 2]
 [4 5]]
[[4 5 6]
 [7 8 9]]
[2 5 8]
[6 7 8 9]
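One difference from Python lists is worth a short sketch: a basic slice of a NumPy array is a view onto the same data, while conditional indexing returns a copy. The example below illustrates this behaviour with illustrative variable names.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# A basic slice is a view: writing to it changes the original array
first_row = arr[0, :]
first_row[0] = 100
print(arr[0, 0])        # 100

# Conditional (boolean) indexing returns a copy instead
big = arr[arr > 5]
big[0] = -1
print(arr[0, 0])        # still 100

# Use .copy() when an independent slice is needed
safe_row = arr[1, :].copy()
safe_row[0] = 0
print(arr[1, 0])        # 4
```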
Introduction to Pandas:
Pandas is a powerful library for data manipulation and analysis in Python. It provides data structures like Series and DataFrame that allow you to work with structured data efficiently. Pandas is built on top of NumPy and integrates well with other libraries in the data science ecosystem.
Installing Pandas: To install Pandas, run the following command:
pip install pandas
Creating Series and DataFrames: Series is a one-dimensional labeled array capable of holding any data type. DataFrame is a two-dimensional labeled data structure with columns of potentially different types.
import pandas as pd
# Creating a Series
series = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
print(series)
# Creating a DataFrame
data = {'name': ['Alice', 'Bob', 'Claire', 'David'],
'age': [25, 30, 27, 35],
'city': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)
print(df)
# Creating a DataFrame from a CSV file (assumes data.csv exists in the working directory)
csv_df = pd.read_csv('data.csv')
print(csv_df.head()) # Display the first few rows
Output:
a    1
b    2
c    3
d    4
dtype: int64
     name  age      city
0   Alice   25  New York
1     Bob   30    London
2  Claire   27     Paris
3   David   35     Tokyo
     name  age      city
0   Alice   25  New York
1     Bob   30    London
2  Claire   27     Paris
3   David   35     Tokyo
4    John   25       NYC
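The CSV example depends on a data.csv file that is not listed in this chapter. As a self-contained sketch, the same rows shown in the output can be read from an in-memory string with io.StringIO; the file contents here are an assumption reconstructed from that output, and a real data.csv may differ.

```python
import io
import pandas as pd

# Assumed file contents, reconstructed from the output shown above
csv_text = """name,age,city
Alice,25,New York
Bob,30,London
Claire,27,Paris
David,35,Tokyo
John,25,NYC
"""

# read_csv accepts a file path or any file-like object
csv_df = pd.read_csv(io.StringIO(csv_text))
print(csv_df.shape)     # (5, 3)
print(csv_df.head())
```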
Data Selection and Filtering: Pandas provides various methods to select and filter data in Series and DataFrames. You can access data using labels, positions, or boolean conditions.
import pandas as pd
data = {'name': ['Alice', 'Bob', 'Claire', 'David'],
'age': [25, 30, 27, 35],
'city': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)
# Selecting a single column
print(df['name'])
# Selecting multiple columns
print(df[['name', 'age']])
# Selecting rows based on a condition
print(df[df['age'] > 30])
# Selecting rows based on multiple conditions
print(df[(df['age'] > 25) & (df['city'] == 'London')])
# Selecting rows and columns using loc and iloc
print(df.loc[0, 'name']) # Select by labels
print(df.iloc[1, 2]) # Select by positions
Output:
0     Alice
1       Bob
2    Claire
3     David
Name: name, dtype: object
     name  age
0   Alice   25
1     Bob   30
2  Claire   27
3   David   35
    name  age   city
3  David   35  Tokyo
   name  age    city
1   Bob   30  London
Alice
London
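The selection techniques above can also be combined. The following sketch, using the same DataFrame, filters rows and picks columns in one step with loc, matches several values with isin, and joins conditions with | (or). The result variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Claire', 'David'],
                   'age': [25, 30, 27, 35],
                   'city': ['New York', 'London', 'Paris', 'Tokyo']})

# Filter rows and select columns in a single loc call
older = df.loc[df['age'] > 26, ['name', 'city']]
print(older)

# Match a column against several allowed values
europe = df[df['city'].isin(['London', 'Paris'])]
print(europe['name'].tolist())    # ['Bob', 'Claire']

# Combine conditions with | (or); each condition needs its own parentheses
edge = df[(df['age'] < 26) | (df['age'] > 34)]
print(edge['name'].tolist())      # ['Alice', 'David']
```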
Data Manipulation: Pandas provides a wide range of functions and methods for data manipulation, including merging, grouping, reshaping, and transforming data.
import pandas as pd
data1 = {'name': ['Alice', 'Bob', 'Claire'],
'age': [25, 30, 27]}
df1 = pd.DataFrame(data1, index=['A', 'B', 'C'])
data2 = {'name': ['David', 'Emma', 'Frank'],
'city': ['Tokyo', 'Paris', 'London']}
df2 = pd.DataFrame(data2, index=['D', 'E', 'F'])
# Concatenating DataFrames
concat_df = pd.concat([df1, df2])
print(concat_df)
# Merging DataFrames (df1 and df2 share no names, so the outer merge keeps every row and fills the gaps with NaN)
merged_df = pd.merge(df1, df2, on='name', how='outer')
print(merged_df)
# Grouping data: count the rows in each age group
# (mean() is not meaningful here because the only other column, 'name', is not numeric)
grouped_df = df1.groupby('age').size()
print(grouped_df)
# Reshaping data
reshaped_df = df1.pivot(index='name', columns='age', values='age')
print(reshaped_df)
Output:
     name   age    city
A   Alice  25.0     NaN
B     Bob  30.0     NaN
C  Claire  27.0     NaN
D   David   NaN   Tokyo
E    Emma   NaN   Paris
F   Frank   NaN  London
     name   age    city
0   Alice  25.0     NaN
1     Bob  30.0     NaN
2  Claire  27.0     NaN
3   David   NaN   Tokyo
4    Emma   NaN   Paris
5   Frank   NaN  London
age
25    1
27    1
30    1
dtype: int64
age       25    27    30
name
Alice   25.0   NaN   NaN
Bob      NaN   NaN  30.0
Claire   NaN  27.0   NaN
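Grouping is easier to see when several rows share a key. The sketch below uses a small hypothetical dataset (not from the examples above) in which cities repeat, and computes per-city statistics.

```python
import pandas as pd

# Hypothetical data in which each city appears more than once
people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Claire', 'David', 'Emma'],
    'age': [25, 30, 27, 35, 31],
    'city': ['Paris', 'London', 'Paris', 'London', 'Paris'],
})

# Average age per city; selecting 'age' first keeps the non-numeric column out
mean_age = people.groupby('city')['age'].mean()
print(mean_age)

# Several aggregates at once
summary = people.groupby('city')['age'].agg(['count', 'min', 'max'])
print(summary)
```

Selecting the numeric column before aggregating avoids asking Pandas to average the name strings.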
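Handling Missing Data: Real-world datasets often contain missing or null values, which Pandas represents as NaN. Pandas provides methods to handle them, such as dropna() to drop and fillna() to fill missing values. The example below revisits the DataFrames from the previous section and shows how concatenation, an outer merge, and pivoting each introduce NaN wherever a row has no value for a column.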
import pandas as pd
data1 = {'name': ['Alice', 'Bob', 'Claire'],
'age': [25, 30, 27]}
df1 = pd.DataFrame(data1, index=['A', 'B', 'C'])
data2 = {'name': ['David', 'Emma', 'Frank'],
'city': ['Tokyo', 'Paris', 'London']}
df2 = pd.DataFrame(data2, index=['D', 'E', 'F'])
# Concatenating DataFrames row-wise (axis=0 is the default; it is written out here for clarity)
concat_df = pd.concat([df1, df2], axis=0)
print("Concatenated DataFrame:")
print(concat_df)
# Merging DataFrames: the df2 defined above shares no names with df1, so an outer merge would match no rows
# Redefine df2 with some overlapping names to show how matching rows are combined
data2 = {'name': ['Alice', 'Claire', 'David'], # Changed names for demonstration
'city': ['Tokyo', 'Paris', 'London']}
df2 = pd.DataFrame(data2, index=['A', 'B', 'C'])
merged_df = pd.merge(df1, df2, on='name', how='outer')
print("\nMerged DataFrame:")
print(merged_df)
# Grouping data: count the rows in each age group
grouped_df = df1.groupby('age').size()
print("\nGrouped DataFrame:")
print(grouped_df)
# Reshaping data (pivoting)
reshaped_df = df1.pivot(index='name', columns='age', values='age')
print("\nReshaped DataFrame:")
print(reshaped_df)
Output:
Concatenated DataFrame:
     name   age    city
A   Alice  25.0     NaN
B     Bob  30.0     NaN
C  Claire  27.0     NaN
D   David   NaN   Tokyo
E    Emma   NaN   Paris
F   Frank   NaN  London
Merged DataFrame:
     name   age    city
0   Alice  25.0   Tokyo
1     Bob  30.0     NaN
2  Claire  27.0   Paris
3   David   NaN  London
Grouped DataFrame:
age
25    1
27    1
30    1
dtype: int64
Reshaped DataFrame:
age       25    27    30
name
Alice   25.0   NaN   NaN
Bob      NaN   NaN  30.0
Claire   NaN  27.0   NaN
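The dropping and filling of missing values mentioned in this section can be sketched as follows. The DataFrame mirrors the merged result above; the fill values (the mean age and the placeholder 'Unknown') are illustrative choices, not a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Claire', 'David'],
    'age': [25.0, 30.0, 27.0, np.nan],
    'city': ['Tokyo', np.nan, 'Paris', 'London'],
})

# Count the missing values in each column
missing_counts = df.isna().sum()
print(missing_counts)

# Drop every row that has at least one missing value
complete_rows = df.dropna()
print(complete_rows)

# Fill the gaps with a per-column default instead
filled = df.fillna({'age': df['age'].mean(), 'city': 'Unknown'})
print(filled)
```

Dropping is the simplest option but discards data; filling keeps every row at the cost of introducing values that were never observed.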
Feature Engineering and Data Transformation: Feature engineering involves creating new features from existing ones to improve the performance of machine learning models. Data transformation includes tasks like scaling, encoding, and normalization. Both topics are covered in more detail in subsequent chapters. The example below uses scikit-learn for scaling and encoding; if it is not installed, run pip install scikit-learn.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
data = {'name': ['Alice', 'Bob', 'Claire', 'David'],
'age': [25, 30, 27, 35],
'city': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)
# Creating new features
df['age_squared'] = df['age'] ** 2
df['is_adult'] = df['age'] >= 18
# Scaling numerical features
scaler = MinMaxScaler()
df[['age', 'age_squared']] = scaler.fit_transform(df[['age', 'age_squared']])
# Encoding categorical features
encoder = LabelEncoder()
df['city'] = encoder.fit_transform(df['city'])
print(df)
Output:
     name  age  city  age_squared  is_adult
0   Alice  0.0     1     0.000000      True
1     Bob  0.5     0     0.458333      True
2  Claire  0.2     2     0.173333      True
3   David  1.0     3     1.000000      True
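To show what the scaler computes, here is a minimal sketch that reproduces min-max scaling by hand with the formula (x - min) / (max - min). It also shows one-hot encoding with pd.get_dummies as an alternative to LabelEncoder, whose integer codes follow alphabetical order and can suggest a ranking among cities that does not exist. The column name age_scaled is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 27, 35],
                   'city': ['New York', 'London', 'Paris', 'Tokyo']})

# Min-max scaling by hand: (x - min) / (max - min) maps values into [0, 1]
age = df['age']
df['age_scaled'] = (age - age.min()) / (age.max() - age.min())
print(df['age_scaled'].tolist())

# One-hot encoding: one indicator column per city, with no implied order
one_hot = pd.get_dummies(df['city'], prefix='city')
print(one_hot.columns.tolist())
```

The scaled ages match the age column in the output above, which suggests MinMaxScaler applies the same formula with its default settings.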
These are just a few examples of the powerful functionalities provided by NumPy and Pandas. In the upcoming chapters, we'll leverage these libraries to preprocess and analyze data for various AI tasks.
Data manipulation and preprocessing are crucial steps in building effective AI models. By mastering these techniques, you'll be well-prepared to tackle real-world AI problems and build robust solutions.
Let's continue our journey and explore the fascinating world of machine learning in the next chapter! We'll dive into the fundamentals of machine learning, different types of algorithms, and how to build and evaluate machine learning models using Python libraries like scikit-learn.