Data Visualisation with Python
Description
This project dives into working with pandas, numpy, and Matplotlib for data analysis and visualization. Participants will create a sample DataFrame using pandas, generate random data using numpy, and produce a scatter plot. Along the way, they will learn to handle, manipulate, and interpret complex datasets using Python libraries, a crucial skill set for anyone looking to venture into data science or other data-focused roles.
Part #1: Basics of Python for Data Science
Python is a powerful tool for handling large amounts of data. The following sections go through the basic concepts of Python for Data Science, with practical examples using pandas and numpy.
Python Libraries Setup:
We will need the NumPy and pandas libraries. Here is how to install them:
pip install numpy pandas
Data Importing:
Pandas can import data from various formats such as CSV, JSON, SQL, and Excel. We will focus on CSV as it is commonly used.
import pandas as pd
# Read data from CSV file
df = pd.read_csv('data.csv')
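Pandas has analogous readers for the other formats mentioned above. A minimal sketch, assuming hypothetical files data.json and data.xlsx exist (reading Excel also requires an engine such as openpyxl to be installed):
# Read data from a JSON file (placeholder file name)
df_json = pd.read_json('data.json')
# Read data from an Excel file (placeholder file name; needs an Excel engine installed)
df_excel = pd.read_excel('data.xlsx')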
Data Exploration:
After importing, we need to explore and understand the data.
# Display first 5 rows
df.head()
# Display last 5 rows
df.tail()
# Display the shape (number of rows, number of columns)
df.shape
# Display column names
df.columns
# Info about dataframe (columns names, types, non-null values, and memory usage)
df.info()
# Summary statistics for numerical columns
df.describe()
Data Cleaning:
Cleaning data is crucial for getting accurate results.
# Drop NA values
df = df.dropna()
# Or fill NA values with a replacement (e.g. 0)
df = df.fillna(0)
# Drop a column
df = df.drop('column_name', axis=1)
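Two other common cleaning steps are removing duplicate rows and renaming columns. A short sketch, where 'old_name' and 'new_name' are placeholder column names:
# Drop duplicate rows
df = df.drop_duplicates()
# Rename a column (placeholder names)
df = df.rename(columns={'old_name': 'new_name'})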
Data Manipulation:
# Select a column
df['column_name']
# Select multiple columns
df[['column1', 'column2']]
# Select rows that meet a condition
df[df['column_name'] > value]
# Apply a function to a column
df['column_name'].apply(lambda x: function(x))
# Sort values
df.sort_values('column_name', ascending=False)
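To make the placeholders above concrete, here is a small self-contained sketch using a hypothetical DataFrame; the column names and values are invented purely for illustration:
import pandas as pd
# Hypothetical data for illustration only
df_demo = pd.DataFrame({'price': [10.0, 25.5, 7.2], 'quantity': [3, 1, 10]})
# Select rows where price is above 9
expensive = df_demo[df_demo['price'] > 9]
# Apply a function to a column: add 10% tax to each price
df_demo['price_with_tax'] = df_demo['price'].apply(lambda x: x * 1.1)
# Sort by quantity, largest first
df_demo = df_demo.sort_values('quantity', ascending=False)
print(df_demo)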
Data Visualization:
To visualize data, we will use matplotlib, which can be installed with:
pip install matplotlib
Here is a basic example:
import matplotlib.pyplot as plt
# Bar chart
df['column_name'].value_counts().plot(kind='bar')
plt.show()
NumPy Arrays:
NumPy provides an efficient interface to store and operate on dense data buffers. The following shows some of the basic operations:
import numpy as np
# Create a numpy array
arr = np.array([1, 2, 3, 4, 5])
# Get the shape (dimensions) of the array
arr.shape
# Indexing
arr[0]
# Slicing
arr[1:3]
# Assigning
arr[0] = 8
# Basic operations
arr + 10
arr * 2
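Operations also work element-wise between two arrays of the same shape, and arrays have built-in aggregation methods. Continuing with the array above:
# Element-wise operations between two arrays
other = np.array([10, 20, 30, 40, 50])
arr + other
# Aggregations
arr.sum()
arr.mean()
arr.max()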
These were the basics of Python for Data Science. From here, you can progress to more advanced data manipulation and visualization, machine learning, and statistical modeling.
Let's dive into practical data manipulation using pandas. We will start by importing the necessary libraries and loading data from a CSV file, then perform various manipulations such as sorting, filtering, aggregating, and merging, along with statistical operations and data visualization. We assume Python is already installed along with the required data science libraries, pandas and numpy.
1. Importing Necessary Libraries
Let's import the libraries needed in this solution.
import pandas as pd
import numpy as np
2. Data Loading
We will use pandas' read_csv method to read a CSV file and convert it into a DataFrame. Suppose we have a sales.csv file containing sales data.
df = pd.read_csv('sales.csv')
3. Data Exploration
Some basic functions to understand our data.
The head function gives the first 5 rows of the DataFrame.
df.head()
The info function provides a concise summary of the DataFrame, with column types, non-null entries, and memory usage.
df.info()
4. Data Manipulation
4.1 Sorting
We can sort our data based on a column. Let's sort the sales data on Profit.
df.sort_values('Profit', ascending=False)
4.2 Filtering
We can filter data based on some criteria.
#Get all the rows where Sales > 500
df_filtered = df[df['Sales']>500]
4.3 Aggregation
We can apply aggregate operations to our data as well.
#Get total sales
total_sales = df['Sales'].sum()
#Get average profit
avg_profit = df['Profit'].mean()
#Group by 'Region' and get sum of 'Sales' in each 'Region'
region_wise_sales = df.groupby('Region')['Sales'].sum().reset_index()
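If you need several aggregates at once, groupby can be combined with agg. A brief sketch, still assuming the 'Region', 'Sales', and 'Profit' columns from sales.csv:
#Group by 'Region' and compute several aggregates at once
region_summary = df.groupby('Region').agg(
    total_sales=('Sales', 'sum'),
    avg_profit=('Profit', 'mean')
).reset_index()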
5. Merging Datasets
Suppose we have another DataFrame, df_targets, which contains the sales target for each region. We can merge this data with our sales data.
df_merged = pd.merge(df, df_targets, on='Region', how='left')
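For completeness, here is a sketch of what such a df_targets DataFrame might look like; the region names and target values are invented placeholders:
#Hypothetical sales targets per region (names and values are placeholders)
df_targets = pd.DataFrame({
    'Region': ['North', 'South', 'East', 'West'],
    'Target': [1000, 1500, 1200, 900]
})
Because of the left join, every row of df_merged keeps its original sales data and gains the Target for its Region; rows whose Region has no matching target get NaN.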
6. Handling Missing Values
We can check for missing or NaN values in our data and fill or drop them according to our needs.
#Check for missing values in each column
df.isnull().sum()
#Fill missing values with a specified value, here 0.
df_filled = df.fillna(0)
#Drop rows with missing values
df_dropped = df.dropna()
7. Applying Functions
We can also apply custom functions to our data.
#Function to categorize sales
def categorize_sales(x):
    if x < 300:
        return 'Low'
    elif x < 700:
        return 'Medium'
    else:
        return 'High'
df['Sales Category'] = df['Sales'].apply(categorize_sales)
8. Data Visualization
Let's create a simple bar plot of the region-wise sales.
import matplotlib.pyplot as plt
plt.bar(region_wise_sales['Region'], region_wise_sales['Sales'])
plt.xlabel('Region')
plt.ylabel('Sales')
plt.title('Region Wise Sales')
plt.show()
Conclusion
This is just a small introduction, but pandas provides many more methods for different requirements; explore the pandas documentation as your needs grow. Always remember that the aim is to convert raw data into an understandable format and extract useful information or insights from it. Stick to this goal and you are good to go.
Generating and Manipulating Data with numpy
Part 1: Generating Data using numpy
numpy is a powerful Python library that supports large, multi-dimensional arrays and matrices. It also provides a large collection of high-level mathematical functions to operate on these arrays.
Let's start with the basics - creating arrays.
import numpy as np
# one-dimensional array
one_d_array = np.array([1, 2, 3, 4, 5])
# two-dimensional array
two_d_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(one_d_array)
print(two_d_array)
We can also create arrays with specific values, like all zeros or all ones, in a given shape.
# all zeros
zeros_array = np.zeros((3,3))
# all ones
ones_array = np.ones((3,3))
print(zeros_array)
print(ones_array)
numpy provides the arange function to generate evenly spaced values from a specified start value up to (but not including) a stop value, with a given step.
range_array = np.arange(0, 100, 10)
print(range_array)
numpy also allows us to generate random data for a specified shape.
random_array = np.random.rand(3, 3)
print(random_array)
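Random output changes on every run; if you need reproducible results, seed the generator first. A small sketch (the seed value 42 is arbitrary), also showing random integers:
np.random.seed(42)
# Random integers from 0 (inclusive) to 10 (exclusive) in a 2x3 array
int_array = np.random.randint(0, 10, size=(2, 3))
print(int_array)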
Part 2: Manipulating Data using numpy
We've got our data in numpy arrays. Now let's tweak them with some basic manipulations like reshaping, indexing, slicing, iterating and joining.
Reshaping
Reshaping changes the arrangement of items so that the shape of the array is modified while keeping the total number of elements the same.
original = np.array([1, 2, 3, 4, 5, 6])
reshaped_array = original.reshape(2, 3)
print(reshaped_array)
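You can also let numpy infer one dimension by passing -1, which is handy when only the other dimension matters:
# numpy infers the second dimension (here 2) from the array's size
inferred = original.reshape(3, -1)
print(inferred)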
Indexing and Slicing
In numpy, we can select elements or groups of elements from the array much like we do with regular Python lists.
# Indexing
one_d_array = np.array([1, 2, 3, 4, 5])
print(one_d_array[2]) # prints '3'
# Slicing
two_d_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(two_d_array[:,1]) # prints '[2 5 8]'
Iterating
Iterating over numpy arrays is similar to iterating over lists in Python.
for row in two_d_array:
    for item in row:
        print(item)
Joining
numpy provides multiple functions to combine arrays.
# vertically
vertical = np.vstack((two_d_array, two_d_array))
print(vertical)
# horizontally
horizontal = np.hstack((two_d_array, two_d_array))
print(horizontal)
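vstack and hstack are special cases of the more general np.concatenate, which joins arrays along any axis you choose:
# axis=0 stacks rows (like vstack), axis=1 stacks columns (like hstack)
along_rows = np.concatenate((two_d_array, two_d_array), axis=0)
along_cols = np.concatenate((two_d_array, two_d_array), axis=1)
print(along_rows.shape)  # (6, 3)
print(along_cols.shape)  # (3, 6)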
These are just simple examples of what you can do with numpy. With these basic functions, you can create far more complex data manipulation operations to suit any data science project's needs.
Part 4: Creating Data Visualizations using Matplotlib
Once you are comfortable with Python, pandas, and numpy, it's time to move on to data visualization using Matplotlib.
Here, we will walk through an example from scratch to visualize some data using line plots, scatter plots, and histograms.
# Import necessary libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
Line Plot
Let's create a simple graph using matplotlib. We'll plot a sine wave using numpy and matplotlib.
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y)
plt.show()
In the above example, np.linspace(0, 10, 100) generates 100 uniformly spaced values between 0 and 10, np.sin(x) calculates the sine of each of these points, plt.plot(x, y) plots them, and plt.show() displays the plot.
Scatter Plot
Scatter plots are used to represent a 2-dimensional set of values. Let's utilize numpy to generate two arrays of random values and then plot them using the scatter function.
x = np.random.rand(100)
y = np.random.rand(100)
plt.scatter(x, y)
plt.show()
In the above code, np.random.rand(100) generates 100 random values uniformly distributed between 0 and 1, and plt.scatter(x, y) creates a scatter plot of these points.
Histogram
Histograms are used to represent the frequency distribution of numerical data. Let's create a histogram of randomly generated data.
data = np.random.randn(1000)
plt.hist(data, bins=30)
plt.show()
In the above code, np.random.randn(1000) generates 1000 normally distributed random values, and plt.hist(data, bins=30) creates a histogram of this data with 30 equally spaced bins.
Customized Plot
Let's customize our graph to add labels, title, legend and also change the color of our plot. We will plot both a sine and cosine wave for the same range of values.
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label='sine wave', color='blue')
plt.plot(x, np.cos(x), label='cosine wave', color='red')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Sine and Cosine waves')
plt.legend()
plt.show()
In this example, plt.plot(x, np.sin(x), label='sine wave', color='blue') and plt.plot(x, np.cos(x), label='cosine wave', color='red') plot the sine and cosine waves and assign labels to them. plt.xlabel('X-axis') and plt.ylabel('Y-axis') label the respective axes, plt.title('Sine and Cosine waves') assigns a title to the graph, and plt.legend() adds a legend.
You can further customize your graphs as per your requirements using the different functionalities provided by matplotlib.
Remember to always import matplotlib at the beginning of your code using import matplotlib.pyplot as plt.
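As one more illustration, the sketch below reuses the sine and cosine data to set a figure size, change line styles, add a grid, and save the figure to a file; the file name sine_cosine.png is just a placeholder:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
plt.figure(figsize=(8, 4))  # wider figure
plt.plot(x, np.sin(x), label='sine wave', linestyle='--')
plt.plot(x, np.cos(x), label='cosine wave', linewidth=2)
plt.grid(True)  # add a background grid
plt.legend()
plt.savefig('sine_cosine.png')  # save the figure to a file (placeholder name)
plt.show()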
1. Data Creation
Firstly, let's generate some data using numpy.
import numpy as np
# Number of data points
num_data = 500
# Generating random data
feature1 = np.random.normal(size=num_data)
feature2 = np.random.uniform(size=num_data)
# Creating labels (target variable) based on the features
labels = 2*feature1 - 0.5*feature2 + 0.3*np.random.normal(size=num_data)
# Combining features into 2D array
data = np.column_stack((feature1, feature2))
2. Loading and Handling Data using pandas
We'll use pandas to load the data and perform some basic exploratory data analysis.
import pandas as pd
# Load data into pandas DataFrame
df = pd.DataFrame(data, columns=["Feature1", "Feature2"])
df['Label'] = labels
# Print the first 5 rows of the DataFrame
print(df.head())
3. Data Manipulation using pandas
Let's perform some basic operations to manipulate this data.
# Create a new feature as sum of Feature1 and Feature2
df['FeatureSum'] = df['Feature1'] + df['Feature2']
# Create a new feature as difference of Feature1 and Feature2
df['FeatureDiff'] = df['Feature1'] - df['Feature2']
# Display the DataFrame
print(df.head())
4. Data Analysis with numpy and pandas
We can use numpy and pandas together for exploratory data analysis.
# Calculate the mean and variance of labels
mean = df['Label'].mean()
variance = df['Label'].var()
print("Mean: ", mean)
print("Variance: ", variance)
# Generate descriptive statistics that summarize the central tendency,
# dispersion and shape of dataset's distribution
print(df.describe())
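Because Label was constructed from Feature1 and Feature2 above, a correlation matrix is a natural next check; pandas computes it directly:
# Pairwise correlations between all numerical columns
# (Feature1 should correlate strongly with Label, given how Label was generated)
print(df.corr())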
5. Data Visualization
Lastly, let's visualize this data using matplotlib.
import matplotlib.pyplot as plt
# Plot histogram of Feature1
plt.hist(df['Feature1'], bins=30, alpha=0.5, label='Feature1')
# Plot histogram of Feature2
plt.hist(df['Feature2'], bins=30, alpha=0.5, label='Feature2')
plt.legend(loc='upper right')
plt.show()
# Create a scatter plot of Feature1 vs Label
plt.scatter(df['Feature1'], df['Label'])
plt.xlabel('Feature1')
plt.ylabel('Label')
plt.title('Scatter plot of Feature1 vs Label')
plt.show()
Be sure to replace the dummy data and manipulation steps with your own, as this is just a generalized example.