Supermarket Store Branches' Sales Analysis

Project Information

  • Category: Data Analysis / Exploratory Data Analysis (EDA)
  • Client/Context: Quantum Analytics (Internship Project)
  • Project Date: Nov 2023
  • Tools Used: Python (Pandas, NumPy, Matplotlib, Seaborn)
  • Data Source: Supermarket Store Branches Sales Dataset
  • Project URL: View Code on GitHub

Introduction: My Role in the Supermarket Sales Analysis Project

As a Data Analyst at Quantum Analytics, I was responsible for conducting an Exploratory Data Analysis (EDA) on a provided dataset of supermarket store sales. The objective was to characterize the performance of the store branches, identify the factors associated with their sales, and extract actionable insights to inform business decisions. This report documents the process, from data preparation through to the main findings.

Project Goals

The central question of this project was what drives sales performance across the supermarket branches. Specifically, the analysis examined how physical store size (`Store_Area`) and average daily customer traffic (`Daily_Customer_Count`) relate to a store's total sales revenue (`Store_Sales`).

My key project goals encompassed:

  • To load and meticulously clean the raw sales data, ensuring its integrity and readiness for precise analysis.
  • To explore the fundamental statistical characteristics of the dataset, including average sales figures, store dimensions, and customer traffic metrics.
  • To visually represent the relationships between salient store attributes (such as area and customer footfall) and their corresponding sales performance.
  • To systematically identify top-performing stores and analyze their common attributes to establish performance benchmarks.
  • To derive substantiated and actionable insights capable of informing effective strategies for enhancing sales across the entire network of branches.

Data Description

The dataset covers 896 supermarket store branches. Each record identifies one store and contains the following attributes, which describe its physical properties and operational performance:

  • Store ID: A unique identification number assigned to each supermarket branch.
  • Store_Area: The physical expanse of the store, quantified in square yards, providing an indication of its spatial capacity.
  • Items_Available: The total count of distinct product items stocked within the corresponding store, reflecting the breadth of its product offerings.
  • Daily_Customer_Count: The average number of customers who frequented the store per day over a designated monthly period.
  • Store_Sales: The cumulative sales revenue generated by the store, denominated in US dollars.

Methodology & Execution

My methodology for this project involved a structured sequence of data loading, cleaning, and exploratory analysis, primarily utilizing Python. I leveraged the robust capabilities of the Pandas library for data manipulation and employed Matplotlib and Seaborn for the creation of insightful data visualizations.

1. Environment Setup

The initial step involved configuring the Python environment by importing the requisite libraries:

Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
                

2. Data Loading & Initial Inspection

The `stores.csv` dataset was loaded, and a copy was created to safeguard the original data during the analytical process.

Python

stores = pd.read_csv('stores.csv')
df = stores.copy()
print(df.head()) # Display first few rows to confirm loading
                
   Store ID  Store_Area  Items_Available  Daily_Customer_Count  Store_Sales
0         1        1659             1961                   530        66490
1         2        1461             1752                   210        39820
2         3        1340             1609                   720        54010
3         4        1451             1748                   620        53730
4         5        1770             2111                   450        46620

3. Data Cleaning & Preprocessing

To enhance data usability and consistency, the column names were systematically renamed:

Python

columns_names = ("Store_id","Store_area","Item_Available","Daily_customer_count","Store_sales")
                

Renamed Columns:

Python

df.columns = columns_names
df.columns
                

Index(['Store_id', 'Store_area', 'Item_Available', 'Daily_customer_count', 'Store_sales'], dtype='object')
                

DataFrame after Renaming (Head):

Python

df.head()
                
   Store_id  Store_area  Item_Available  Daily_customer_count  Store_sales
0         1        1659            1961                   530        66490
1         2        1461            1752                   210        39820
2         3        1340            1609                   720        54010
3         4        1451            1748                   620        53730
4         5        1770            2111                   450        46620
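
Assigning a tuple to `df.columns` relies on the columns arriving in a fixed order. As a hedged alternative, the sketch below renames by explicit mapping instead. It rebuilds the five rows printed above as an in-memory frame so it can run without `stores.csv`; the `rename_map` name and the header-stripping step are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# First five rows as printed in the initial inspection (original column names).
raw = pd.DataFrame({
    "Store ID": [1, 2, 3, 4, 5],
    "Store_Area": [1659, 1461, 1340, 1451, 1770],
    "Items_Available": [1961, 1752, 1609, 1748, 2111],
    "Daily_Customer_Count": [530, 210, 720, 620, 450],
    "Store_Sales": [66490, 39820, 54010, 53730, 46620],
})

# Assumption: CSV headers may carry stray spaces, so strip them before mapping.
raw.columns = raw.columns.str.strip()

# Hypothetical mapping from original names to the names used in this report.
rename_map = {
    "Store ID": "Store_id",
    "Store_Area": "Store_area",
    "Items_Available": "Item_Available",
    "Daily_Customer_Count": "Daily_customer_count",
    "Store_Sales": "Store_sales",
}
renamed = raw.rename(columns=rename_map)

# Fail loudly if any expected column is missing after the rename.
missing = set(rename_map.values()) - set(renamed.columns)
assert not missing, f"Columns not found after rename: {missing}"
print(renamed.columns.tolist())
```

With a mapping, a reordered or extended CSV would raise an error at the check rather than silently mislabeling a column.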

DataFrame Shape:

Python

df.shape
                

(896, 5)
                

Missing Value Assessment:

Python

df.isnull().sum()
                

Store_id                  0
Store_area                0
Item_Available            0
Daily_customer_count      0
Store_sales               0
dtype: int64
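
Beyond missing values, a few further integrity checks and the descriptive statistics mentioned in the project goals could be produced along the following lines. This is a minimal sketch that runs on the five rows shown earlier rather than the full 896-row file, so its numbers will not match the full-dataset figures; the `quality` dictionary and its keys are illustrative names.

```python
import pandas as pd

# Five sample rows from the renamed DataFrame shown above (not the full dataset).
df = pd.DataFrame({
    "Store_id": [1, 2, 3, 4, 5],
    "Store_area": [1659, 1461, 1340, 1451, 1770],
    "Item_Available": [1961, 1752, 1609, 1748, 2111],
    "Daily_customer_count": [530, 210, 720, 620, 450],
    "Store_sales": [66490, 39820, 54010, 53730, 46620],
})

numeric_cols = ["Store_area", "Item_Available", "Daily_customer_count", "Store_sales"]

# Simple integrity checks: duplicated rows or IDs, and non-positive measurements.
quality = {
    "duplicate_rows": int(df.duplicated().sum()),
    "duplicate_store_ids": int(df["Store_id"].duplicated().sum()),
    "non_positive_values": int((df[numeric_cols] <= 0).sum().sum()),
}
print(quality)
print(df.dtypes)

# Descriptive statistics (count, mean, std, min, quartiles, max) per column.
summary = df[numeric_cols].describe()
print(summary.round(1))
```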
                

4. Specific Analysis: Identifying Top Performers

Initially, I confirmed the total count of unique store identifiers present within the dataset:

Python

# Count of unique Store IDs (nunique() counts distinct values; count() only counts non-null rows)
count_of_store_id = df["Store_id"].nunique()
print(f"\nThe total count of unique store IDs is: {count_of_store_id}.")
                

Output:


The total count of unique store IDs is: 896.
                

Subsequently, I identified and extracted the top 10 stores distinguished by their highest sales performance:

Python

# Top 10 Stores with the Most Store Sales
print("\n--- Top 10 Stores by Sales ---")
top_store_sales = df.sort_values('Store_sales', ascending=False)
top_10_stores = top_store_sales.head(10)
print(top_10_stores)
                

Top 10 Stores by Sales:

     Store_id  Store_area  Item_Available  Daily_customer_count  Store_sales
704       705        1766            2106                  1460       116320
243       244        2229            2667                   890       105150
123       124        1800            2154                  1380       102820
649       650        2093            2505                  1040       102710
491       492        1537            1851                  1560       102310
176       177        1806            2159                  1390       102140
187       188        1763            2106                  1400       101880
 20        21        1955            2334                  1090       101650
887       888        1842            2200                  1430       101180
848       849        1748            2089                  1340       100860

A bar chart was then generated to facilitate a direct visual comparison of the sales performance among these top 10 stores:

Python

# Plotting Top 10 Stores with the Most Store Sales
# DataFrame.plot creates its own figure, so a separate plt.figure() call
# would only leave an extra empty figure behind.
top_10_stores.plot(kind='bar', x='Store_id', y='Store_sales',
                   title='Top 10 Stores with the Most Store Sales',
                   color='darkred', edgecolor='black', figsize=(12, 7))
plt.xlabel('Store ID')
plt.ylabel('Store Sales (US $)')
plt.xticks(rotation=45) # Rotate x-axis labels for enhanced readability
plt.tight_layout()
plt.show()
                
Top 10 Stores by Sales Bar Chart
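
One way to turn the top 10 list into a benchmark is to compare the group's averages with the network-wide averages. The sketch below rebuilds the top 10 table shown above and uses the overall means quoted in this report's conclusions (about $59,351 in sales and 786 daily customers) as given inputs; the `benchmark` frame and its column names are illustrative.

```python
import pandas as pd

# Top 10 rows copied from the table above.
top_10 = pd.DataFrame(
    {
        "Store_id": [705, 244, 124, 650, 492, 177, 188, 21, 888, 849],
        "Store_area": [1766, 2229, 1800, 2093, 1537, 1806, 1763, 1955, 1842, 1748],
        "Item_Available": [2106, 2667, 2154, 2505, 1851, 2159, 2106, 2334, 2200, 2089],
        "Daily_customer_count": [1460, 890, 1380, 1040, 1560, 1390, 1400, 1090, 1430, 1340],
        "Store_sales": [116320, 105150, 102820, 102710, 102310,
                        102140, 101880, 101650, 101180, 100860],
    },
    index=[704, 243, 123, 649, 491, 176, 187, 20, 887, 848],
)

# Network-wide means as quoted in the report's conclusions (taken as given).
network_avg = pd.Series({"Daily_customer_count": 786, "Store_sales": 59351})

# Mean of the same metrics within the top 10, and the ratio to the network mean.
top_avg = top_10[network_avg.index].mean()
benchmark = pd.DataFrame({"top_10_mean": top_avg, "network_mean": network_avg})
benchmark["ratio"] = benchmark["top_10_mean"] / benchmark["network_mean"]
print(benchmark.round(2))
```

On the full DataFrame, `df.nlargest(10, 'Store_sales')` would select the same group without sorting the whole table first.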

Results & Key Insights

The visualizations generated provided critical insights into the underlying patterns and relationships within the sales data. Each chart contributed distinct findings:

Distribution of Store Sales (Histogram)

This histogram illustrates the frequency distribution of `Store_sales` values, allowing for an understanding of typical sales ranges and the identification of any distributional skewness or multiple peaks.

Python

plt.figure(figsize=(10, 6))
sns.histplot(df['Store_sales'], kde=True, bins=30, color='skyblue')
plt.title('Distribution of Store Sales')
plt.xlabel('Store Sales (US $)')
plt.ylabel('Number of Stores')
plt.grid(axis='y', alpha=0.75)
plt.tight_layout()
plt.show()
                
Distribution of Store Sales

Analysis: The majority of stores exhibit sales figures concentrated within a central range, indicating a relatively consistent performance across the bulk of the branches. A smaller proportion of stores register exceptionally high or notably low sales. This observation suggests that while most stores operate within expected parameters, there is potential to gain valuable insights from high-performing outliers to replicate success, and from underperforming ones to address underlying challenges and improve overall sales efficacy.
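
The visual impression of skewness and outliers can be backed by numbers. The following sketch computes sample skewness and the conventional 1.5 × IQR fences; the helper name `describe_spread` and the toy sales values are hypothetical and serve only to show the mechanics, so they are not figures from the dataset.

```python
import pandas as pd

def describe_spread(values: pd.Series) -> dict:
    """Return skewness and 1.5 * IQR outlier fences for a numeric Series."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "skew": values.skew(),
        "lower_fence": lower,
        "upper_fence": upper,
        "low_outliers": values[values < lower],
        "high_outliers": values[values > upper],
    }

# Hypothetical toy values, not taken from the dataset.
toy_sales = pd.Series([40000, 50000, 55000, 60000, 65000, 70000, 120000])
spread = describe_spread(toy_sales)
print(spread["lower_fence"], spread["upper_fence"])
print(spread["high_outliers"].tolist())
```

Applied to the real data, the call would be `describe_spread(df['Store_sales'])`.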

Store Area vs. Sales (Scatter Plot)

This scatter plot visually examines the correlation between `Store_area` and `Store_sales`. Each data point represents a store, and the plot's pattern reveals whether larger store sizes generally correspond to higher sales.

Python

plt.figure(figsize=(10, 6))
sns.scatterplot(x='Store_area', y='Store_sales', data=df, alpha=0.7, color='coral')
plt.title('Store Area vs. Store Sales')
plt.xlabel('Store Area (sq. yards)')
plt.ylabel('Store Sales (US $)')
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()
                
Store Area vs. Sales Scatter Plot

Analysis: A moderately positive correlation is observed: as store area increases, there is a general tendency for store sales to increase. However, this relationship is not perfectly linear, indicating that store size is one contributing factor, but not the sole determinant of sales performance. The presence of smaller stores achieving high sales, and larger stores not maximizing their potential, implies that other variables, such as efficient space utilization or strategic product placement, significantly influence a store's success.
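
Space utilization can be made measurable with a derived metric such as sales per square yard. The sketch below computes it on the five sample rows shown earlier; the column name `sales_per_sq_yard` is an illustrative choice and was not part of the original analysis.

```python
import pandas as pd

# Five sample rows from the renamed DataFrame (not the full dataset).
df = pd.DataFrame({
    "Store_id": [1, 2, 3, 4, 5],
    "Store_area": [1659, 1461, 1340, 1451, 1770],
    "Store_sales": [66490, 39820, 54010, 53730, 46620],
})

# Revenue generated per square yard of floor space.
df["sales_per_sq_yard"] = df["Store_sales"] / df["Store_area"]

# Rank stores by how efficiently they convert floor space into sales.
ranked = df.sort_values("sales_per_sq_yard", ascending=False)
print(ranked[["Store_id", "Store_area", "sales_per_sq_yard"]].round(2))
```

In this small sample, the store with the smallest area ranks first on the metric, which is the kind of case the analysis above points to.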

Daily Customer Count vs. Sales (Scatter Plot)

This scatter plot illustrates the relationship between `Daily_customer_count` and `Store_sales`. It shows whether a higher volume of daily customer visits is associated with higher sales revenue.

Python

plt.figure(figsize=(10, 6))
sns.scatterplot(x='Daily_customer_count', y='Store_sales', data=df, alpha=0.7, color='lightgreen')
plt.title('Daily Customer Count vs. Store Sales')
plt.xlabel('Daily Customer Count')
plt.ylabel('Store Sales (US $)')
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()
                
Daily Customer Count vs. Sales Scatter Plot

Analysis: A strong positive correlation is clearly evident: an increase in daily customer count is consistently associated with higher store sales. This intuitive finding underscores the critical impact of customer footfall on revenue. Consequently, strategic initiatives aimed at attracting and retaining customers, such as targeted marketing campaigns or promotional activities, are highly likely to yield substantial improvements in store revenue.
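
To put a number on a relationship seen in a scatter plot, Pearson's r and a least-squares line are a common starting point. The helper below, `linear_trend`, is a hypothetical name, and the toy data is constructed to lie exactly on a line so the result is easy to check; it is not drawn from the dataset.

```python
import numpy as np
import pandas as pd

def linear_trend(df: pd.DataFrame, x: str, y: str) -> dict:
    """Return Pearson r plus the slope and intercept of a least-squares line."""
    r = df[x].corr(df[y])  # Pearson correlation by default
    slope, intercept = np.polyfit(df[x], df[y], deg=1)
    return {"pearson_r": r, "slope": slope, "intercept": intercept}

# Hypothetical toy data lying exactly on sales = 50 * customers + 20000.
toy = pd.DataFrame({"Daily_customer_count": [200, 400, 600, 800, 1000]})
toy["Store_sales"] = 50 * toy["Daily_customer_count"] + 20000

trend = linear_trend(toy, "Daily_customer_count", "Store_sales")
print(trend)
```

On the real data, the slope from `linear_trend(df, 'Daily_customer_count', 'Store_sales')` would estimate the change in sales associated with one additional daily customer.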

Correlation Heatmap

This heatmap offers a concise visual representation of the correlation matrix among the numerical variables in the dataset. It enables rapid identification of the strength and direction (positive or negative) of linear relationships between variable pairs. Correlation coefficients closer to +1 signify a strong positive correlation, while values near -1 indicate a strong negative correlation, and values approaching 0 suggest a weak or negligible linear relationship.

Python

plt.figure(figsize=(10, 8))
# Select only numerical columns for correlation calculation
numerical_cols = ['Store_area', 'Item_Available', 'Daily_customer_count', 'Store_sales']
correlation_matrix = df[numerical_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5, cbar_kws={'label': 'Correlation Coefficient'})
plt.title('Correlation Heatmap of Store Data')
plt.tight_layout()
plt.show()
                
Correlation Heatmap of Store Data

Analysis: The heatmap reinforces a strong positive correlation between `Daily_customer_count` and `Store_sales`, consistent with observations from the scatter plot. A moderate positive correlation between `Store_area` and `Store_sales` was also confirmed. Furthermore, `Item_Available` demonstrated a positive correlation with sales (approximately 0.5), suggesting that a broader product variety may also contribute to increased sales. These correlation patterns collectively indicate that optimizing customer traffic is the most impactful immediate strategy for sales enhancement, followed by considerations of store size and merchandise diversity.
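
The ordering of sales drivers can also be read directly from the correlation matrix by ranking each variable's coefficient against `Store_sales`. The sketch below does this on the five sample rows shown earlier, so its coefficients are not those of the full dataset; `rank_sales_correlations` is a hypothetical helper name.

```python
import pandas as pd

def rank_sales_correlations(df: pd.DataFrame, target: str = "Store_sales") -> pd.Series:
    """Rank numeric columns by absolute Pearson correlation with the target."""
    # Store_id is an identifier, not a measurement, so it is excluded.
    numeric = df.select_dtypes("number").drop(columns=["Store_id"], errors="ignore")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Five sample rows from the renamed DataFrame (not the full dataset).
sample = pd.DataFrame({
    "Store_id": [1, 2, 3, 4, 5],
    "Store_area": [1659, 1461, 1340, 1451, 1770],
    "Item_Available": [1961, 1752, 1609, 1748, 2111],
    "Daily_customer_count": [530, 210, 720, 620, 450],
    "Store_sales": [66490, 39820, 54010, 53730, 46620],
})

ranking = rank_sales_correlations(sample)
print(ranking.round(2))
```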

Overall Outcomes & Conclusion

  • Data Preparedness: The raw sales dataset was successfully loaded, meticulously cleaned, and logically organized with standardized column naming. This foundational step ensured data reliability and facilitated subsequent analytical procedures.
  • Key Metric Understanding: Through the generation of descriptive statistics, essential quantitative insights were immediately derived. For instance, the analysis revealed an average store sales figure of approximately $59,351 and an average daily customer count of approximately 786.
  • Identification of High-Performing Branches: The top 10 stores based on sales performance were precisely identified, serving as potential benchmarks for best practices and further comparative analysis.
  • Sales Drivers Ascertained: The comprehensive analysis strongly indicates that customer traffic is the predominant factor influencing store sales. Store size and the breadth of item availability were also identified as significant, albeit secondary, contributing factors.

This "Supermarket Store Branches' Sales Analysis" project effectively demonstrates my proficiency in employing Python for Exploratory Data Analysis. By systematically executing data inspection, cleaning, and visualization, I was able to glean valuable insights into sales dynamics, ascertain data quality, and identify initial relationships between store attributes and sales performance. This project underscores my competence in fundamental data science tools and methodologies, establishing a robust groundwork for more advanced analytical modeling and facilitating data-driven strategic recommendations for businesses.