Python Libraries for Data Science: NumPy and Pandas

Python is a general-purpose programming language that is widely used in data science. Two of its most important data science libraries are NumPy and Pandas, which provide functions and data structures that make it easier to work with data.

NumPy

NumPy, which stands for 'Numerical Python', is a library that provides support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

Understanding and Creating NumPy Arrays

A NumPy array is a grid of values, all of the same type, indexed by a tuple of non-negative integers. The number of dimensions is called the rank of the array and is exposed as the ndim attribute; the shape of an array is a tuple of integers giving the size of the array along each dimension.

To create a NumPy array, you can use the numpy.array() function. For example:

import numpy as np

# Create a 1-dimensional array
a = np.array([1, 2, 3])
print(a)
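
The rank, shape, and tuple indexing described above can be checked directly on an array. The following is a minimal sketch; the variable name m and the sample values are chosen here only for illustration.

```python
import numpy as np

# A 2-dimensional array: 2 rows, 3 columns
m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.ndim)   # number of dimensions (the rank)
print(m.shape)  # size along each dimension
print(m.dtype)  # common element type; the integer width depends on the platform

# Index with a tuple of non-negative integers: row 1, column 2
print(m[1, 2])
```

Because every element shares one type, NumPy reports a single dtype for the whole array rather than one per value.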

Basic Operations with NumPy Arrays

NumPy arrays support a variety of operations. For example, you can perform arithmetic operations on arrays of the same shape, and NumPy will apply the operation element-wise:

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Add the arrays element-wise
print(a + b)
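
Other arithmetic operators behave the same way as addition. The sketch below reuses the arrays from the example above and adds a third array, c, which is introduced here only to show what happens when shapes do not match.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

product = a * b   # element-wise multiplication, not a dot product
scaled = a * 10   # the scalar is applied to every element
print(product)
print(scaled)

# Arrays whose shapes are incompatible raise a ValueError
c = np.array([1, 2])
try:
    a + c
except ValueError as exc:
    print("shape mismatch:", exc)
```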

Pandas

Pandas is a library that provides labeled data structures and data analysis tools for data science. Its two main data structures are Series, a one-dimensional labeled array, and DataFrame.

Introduction to Pandas DataFrames

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects. It is generally the most commonly used pandas object.

To create a DataFrame, you can use the pandas.DataFrame() function. For example:

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c']
})

print(df)
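
To make the "columns of different types" and "dictionary of Series" descriptions concrete, here is a short sketch built on the same DataFrame. The names col_a and df2 are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c']
})

# Each column carries its own type
print(df.dtypes)

# Selecting a single column returns a Series
col_a = df['A']
print(type(col_a))

# The same DataFrame built from a dictionary of Series
df2 = pd.DataFrame({
    'A': pd.Series([1, 2, 3]),
    'B': pd.Series(['a', 'b', 'c'])
})
print(df.equals(df2))
```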

Data Manipulation with Pandas

Pandas provides a variety of methods for manipulating data. For example, you can use the head() method to get the first rows of a DataFrame (five by default), or the describe() method to get a statistical summary of its numeric columns:

# Get up to the first 5 rows of the DataFrame (this one has only 3)
print(df.head())

# Get a statistical summary of the numeric columns
print(df.describe())
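
Both methods accept optional arguments. The sketch below rebuilds the small DataFrame from earlier so that it runs on its own; the variable names for the results are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c']
})

# head() accepts the number of rows to return
first_two = df.head(2)
print(first_two)

# By default, describe() summarizes only the numeric columns (here, 'A')
numeric_summary = df.describe()
print(numeric_summary)

# include='all' also reports on non-numeric columns such as 'B'
full_summary = df.describe(include='all')
print(full_summary)
```

With include='all', statistics that do not apply to a column, such as the mean of a text column, appear as NaN.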

By understanding and using these two libraries, you can manipulate and analyze data effectively, both of which are crucial skills in data science.
