Crafting Real-World Like Data For E-Commerce Domain Databases

In the vast expanse of the e-commerce realm, the efficiency and robustness of databases are paramount. Creating databases that faithfully replicate real-world situations is a complex task, yet it is essential for evaluating the robustness and functionality of e-commerce platforms. In this post, we will be delving into the process of generating realistic, real-world-like data for the e-commerce domain.

We present a comprehensive approach to synthesizing data that mimics the intricacies of actual e-commerce transactions. We will identify the schemas and structures crucial to e-commerce operations, populate them with data drawn from common distributions, and explain how this synthetic data can serve many purposes, including testing and teaching.

We'll start by running a Postgres image using Docker, followed by a discussion of how the entire e-commerce schema is designed. Next, we'll write the code that fills the tables as per the schema, and we'll end the blog with suggestions from models like GPT-4 and Claude on how the data creation and distribution can be made more efficient.

Running Postgres image using Docker

Step 1: Pulling the postgres image from Docker Hub:

Image source: Docker Hub

We shall start by visiting the postgres image page on Docker Hub. Now, let us pull the postgres image, which at the time of writing ships the latest version (16.2), with the following command in your command prompt/terminal:

docker pull postgres

Once it's done, you can see the postgres image listed in the Images module of Docker Desktop, which confirms that the pull was successful.


Step 2: Starting a postgres instance:

Let us now run the following command on terminal/command prompt:

docker run --name some-postgres -p 5432:5432 -e POSTGRES_PASSWORD=mysecretpassword -d postgres
  • In this command, some-postgres is the name of your container, which you can change as you like. The -p flag maps port 5432 inside the Docker container to port 5432 on our main operating system.
  • The -e flag sets environment variables. In our command, we have only one, POSTGRES_PASSWORD, which is mysecretpassword here and can also be changed before running the command.
  • -d refers to detached mode, which allows us to run the container in the background and leaves the terminal free for other commands.
  • Lastly, the final argument, postgres, is the name of the image to run (the one we pulled in Step 1).
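To quickly verify that the database inside the container is reachable, you can also open a psql shell in it (psql ships with the official image):

docker exec -it some-postgres psql -U postgres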

To check the status of the image and confirm if the docker image is running, run the following command:

docker ps

This lists the running containers and confirms whether the Postgres container is up. The expected output looks like this:
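In text form, the output resembles the following (the container ID and timings below are illustrative; yours will differ):

CONTAINER ID   IMAGE      COMMAND                  CREATED          STATUS         PORTS                    NAMES
1a2b3c4d5e6f   postgres   "docker-entrypoint.s…"   10 seconds ago   Up 9 seconds   0.0.0.0:5432->5432/tcp   some-postgres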


Step 3: Using DBeaver/other IDEs to create connections and run SQL queries:

DBeaver is a free and open-source SQL client software application and a database administration tool.

💡
For a comprehensive guide on the installation process and setting up the connection, we've got you covered! Head over to another insightful blog post on our website where you'll find detailed explanations and step-by-step instructions. Dive into the process seamlessly by checking out the resource below!
DBeaver vs Python: Two Methods To Run Your Queries
This is a step-by-step guide on how to run your SQL queries using DBeaver and Python.

Once installed, open the application and click on New Database Connection, which will display the following dialog box:

Image source: DBeaver

After selecting PostgreSQL, click on Next.

Image source: DBeaver

Here, fill in the details as per the docker run command we ran earlier: the port number, username, host, and password. To test whether the connection is established, select Test Connection.

Image source: DBeaver

If the connection test shows the status as Connected, great job! We are now ready to dive into the schema section!

Follow the below video if you get stuck on any of the steps!


An alternative way to get PostgreSQL is neon.tech, which provides serverless Postgres with a free tier!

Neon — Serverless, Fault-Tolerant, Branchable Postgres
Postgres made for developers. Easy to Use, Scalable, Cost efficient solution for your next project.

Different tables in the E-commerce database schema:

Table 1: Users Table

CREATE TABLE Users (
 UserID VARCHAR(8) PRIMARY KEY,
 FirstName VARCHAR(255),
 LastName VARCHAR(255),
 Email VARCHAR(255) UNIQUE,
 PasswordHash VARCHAR(255),
 CreateDate TIMESTAMP,
 LastLogin TIMESTAMP
);

We create a table named Users to store information about each user, with the following columns:

  1. UserID: Designated as the primary key for the table to ensure that each UserID value is unique within the table.
  2. FirstName and LastName: Intended to store the first and last names of users, respectively.
  3. Email: Marked as UNIQUE, a constraint that ensures no two users can have the same email address. PasswordHash: Stores the hashed password of each user, enhancing security by protecting user passwords in case of a data breach.
  4. CreateDate and LastLogin: TIMESTAMP columns that store the date and time when each user account was created and when the user last logged in, respectively.
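For illustration, a single row could be inserted as follows (all values below are made up):

INSERT INTO Users (UserID, FirstName, LastName, Email, PasswordHash, CreateDate, LastLogin)
VALUES ('U0000001', 'Jane', 'Doe', 'jane.doe@example.com',
        'e99a18c428cb38d5f260853678922e03', '2023-01-15 10:30:00', '2024-02-01 08:05:00');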

Table 2: Orders Table

CREATE TABLE Orders (
 OrderID VARCHAR(8) PRIMARY KEY,
 UserID VARCHAR(8),
 OrderDate TIMESTAMP,
 ShippingAddress TEXT,
 Status VARCHAR(50),
 FOREIGN KEY (UserID) REFERENCES Users(UserID)
);

The Orders table manages order-related data, which includes:

  • OrderID: Unique identifier for each order, serving as the primary key
  • UserID: Identifies the user associated with the order; it acts as a foreign key referencing UserID in the "Users" table.
  • OrderDate: Records the date and time when the order was placed.
  • ShippingAddress: Stores the shipping address for the order.
  • Status: Indicates the current status of the order.

Table 3: ProductCategories Table

CREATE TABLE ProductCategories (
    CategoryID VARCHAR(8) PRIMARY KEY,
    CategoryName VARCHAR(255),
    Description TEXT
);

We establish a table called ProductCategories to organize product-category information, which features:

  • CategoryID: A unique identifier for each product category, functioning as the primary key.
  • CategoryName: Records the name of each product category.
  • Description: Accommodates detailed descriptions of the product categories.

Table 4: Products Table

CREATE TABLE Products (
    ProductID VARCHAR(10) PRIMARY KEY,
    CategoryID VARCHAR(10),
    Name VARCHAR(511),
    Category VARCHAR(511),
    Image VARCHAR(511),
    Link VARCHAR(511),
    Ratings NUMERIC(4, 1),
    No_of_ratings INT,
    Discount_Price DECIMAL(30, 3),
    Actual_Price DECIMAL(30, 3),
    FOREIGN KEY (CategoryID) REFERENCES ProductCategories(CategoryID)
);

We define a table named Products to catalog information about various products. It serves as a structured repository for managing product data to facilitate efficient organization and retrieval of product information. It contains:

  • ProductID: A unique identifier for each product, serving as the primary key.
  • CategoryID: Associates each product with a specific category, referencing the CategoryID in the "ProductCategories" table.
  • Name: Captures the name of each product.
  • Category: Provides a descriptive label for the product's category.
  • Image: Stores the file path or URL of the product's image.
  • Link: Contains the URL or link to the product page.
  • Ratings: Represents the average rating of the product, allowing one decimal place.
  • No_of_ratings: Tracks the total number of ratings received by the product.
  • Discount_Price: Indicates the discounted price of the product, allowing up to three decimal places.
  • Actual_Price: Specifies the original price of the product before any discounts, allowing up to three decimal places.

Table 5: OrderProducts Table

CREATE TABLE OrderProducts (
    OrderID VARCHAR(10) NOT NULL,
    ProductID VARCHAR(10) NOT NULL,
    Quantity INT NOT NULL,
    PriceAtPurchase DECIMAL(30, 3) NOT NULL,
    TotalPrice DECIMAL(30, 3) NOT NULL,
    PRIMARY KEY (OrderID, ProductID),
    FOREIGN KEY (OrderID) REFERENCES Orders(OrderID),
    FOREIGN KEY (ProductID) REFERENCES Products(ProductID)
);

The OrderProducts table manages the relationship between orders and products, with the following columns:

  • OrderID: Identifies the order associated with the product.
  • ProductID: Identifies the product included in the order.
  • Quantity: Specifies the quantity of the product ordered.
  • PriceAtPurchase: Records the price of the product at the time of purchase, allowing up to three decimal places.
  • TotalPrice: Represents the total price of the product based on the quantity ordered, allowing up to three decimal places.

The combination of OrderID and ProductID forms the primary key, ensuring each pair uniquely identifies an entry. Additionally, foreign key constraints link the OrderID to the "Orders" table and the ProductID to the "Products" table, maintaining data integrity between related entities.
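Once populated, this table lets us reconstruct an order's basket with a simple join, for example (using the order ID format generated later in this post):

SELECT op.OrderID, p.Name, op.Quantity, op.PriceAtPurchase, op.TotalPrice
FROM OrderProducts op
JOIN Products p ON p.ProductID = op.ProductID
WHERE op.OrderID = 'O0000001';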


Table 6: OrderDetails table

CREATE TABLE OrderDetails (
    OrderID VARCHAR(10) NOT NULL,
    TotalOrderPrice DECIMAL(30, 3),
    NumberOfItems INT,
    PRIMARY KEY (OrderID),
    FOREIGN KEY (OrderID) REFERENCES Orders(OrderID)
);

The OrderDetails table consolidates information about each order, with the following columns:

  • OrderID: Identifies the order and serves as the primary key.
  • TotalOrderPrice: Represents the total price of the order, allowing up to three decimal places.
  • NumberOfItems: Indicates the total number of items included in the order.

The OrderID column serves as the primary key, uniquely identifying each order. Furthermore, a foreign key constraint links the OrderID to the "Orders" table, ensuring data integrity and maintaining consistency.


Table 7: ProductReviews table

CREATE TABLE ProductReviews (
 ReviewID VARCHAR(10) NOT NULL,
 ProductID VARCHAR(10) NOT NULL,
 UserID VARCHAR(10) NOT NULL,
 Name VARCHAR(511),
 Category VARCHAR(511),
 Ratings NUMERIC(4, 1),
 ReviewText TEXT, 
 ReviewDate TIMESTAMP,
 FOREIGN KEY (ProductID) REFERENCES Products(ProductID),
 FOREIGN KEY (UserID) REFERENCES Users(UserID)
);

The ProductReviews table manages product review data, featuring the following columns:

  • ReviewID: Serves as a unique identifier for each product review.
  • ProductID: Identifies the product associated with each review.
  • UserID: Represents the user who submitted the review, establishing a link to the user profile.
  • Name: Stores the name of the product being reviewed.
  • Category: Specifies the category to which the reviewed product belongs, aiding in the categorization and analysis of review data.
  • Ratings: Captures the rating assigned to the product in each review, allowing users to evaluate product quality and facilitating data-driven decision-making.
  • ReviewText: Accommodates the textual content of each review, enabling users to express their opinions and provide detailed feedback on products.
  • ReviewDate: Records the date and time when each review was submitted, allowing for chronological analysis and tracking of review activity over time.

The table includes foreign key constraints that link the ProductID column to the Products table and the UserID column to the Users table.

Populating the tables using Python

Initiation:

from faker import Faker
import psycopg2 # or pymysql for MySQL
import random

fake = Faker()

# Establish your database connection here
conn = psycopg2.connect(
 dbname='postgres', user='postgres',
 password='mysecretpassword', host='localhost'
)
conn.autocommit = True
cursor = conn.cursor()

We begin by importing the necessary libraries, including Faker for generating fake data and psycopg2 for PostgreSQL database interaction. We then establish a connection to a locally hosted PostgreSQL database, providing the appropriate credentials. Autocommit mode is enabled so that each statement is committed automatically.

Then, we create a cursor object to execute SQL commands. This setup prepares the environment for generating fake data and inserting it into the PostgreSQL database.
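As an optional sanity check before generating any data, you can confirm the connection works:

cursor.execute("SELECT version();")
print(cursor.fetchone()[0])  # e.g. "PostgreSQL 16.2 ..."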


Filling the Users table:

def generate_users(n=1000000):
    generated_emails = set()  # Set to keep track of generated emails
    user_counter = 1  # Start counter for sequential user IDs
    for _ in range(n):
        # Generate unique email address
        while True:
            email = fake.email()
            if email not in generated_emails:
                generated_emails.add(email)
                break

        # Generate user attributes with more variability
        first_name = fake.first_name()
        last_name = fake.last_name()
        password_hash = fake.md5()
        
        # Simulate that user creation becomes less frequent over time by using
        # a weighted choice: ~30% of accounts from the last two years, the rest
        # from anywhere in the last five years
        if random.choices([True, False], weights=(30, 70), k=1)[0]:
            create_date = fake.date_time_between(start_date='-2y', end_date='now', tzinfo=None)
        else:
            create_date = fake.date_time_between(start_date='-5y', end_date='now', tzinfo=None)

        # Simulate different login behaviors
        if random.random() < 0.7:  # 70% chance of a recent login
            last_login = fake.date_time_between(start_date=create_date, end_date='now')
        else:
            # 30% chance of a dormant user: the account is at least 6 months old
            # and the last login happened before the 6-month mark
            create_date = fake.date_time_between(start_date='-2y', end_date='-6m', tzinfo=None)
            last_login = fake.date_time_between(start_date=create_date, end_date='-6m')

        # Generate a sequential user ID
        user_id = f'U{user_counter:07d}'
        user_counter += 1  # Increment counter for next user

        # Execute SQL query to insert user data
        cursor.execute(
            """
            INSERT INTO Users (UserID, FirstName, LastName, Email, PasswordHash, CreateDate, LastLogin)
            VALUES (%s, %s, %s, %s, %s, %s, %s)
            """,
            (user_id, first_name, last_name, email, password_hash, create_date, last_login)
        )

generate_users()


The function generate_users() is designed to simulate the creation of large-scale user data; here we are creating a database of 1 million users.

  • It uses the Faker library to generate realistic user attributes such as first names, last names, and email addresses.
  • To mimic real-life scenarios, it includes variations such as account creation becoming less frequent over time and different user behaviors, like a larger share of users having logged in recently.
  • Each user is assigned a unique sequential user ID (user_id) to ensure data integrity and easy identification. The function inserts the generated data into an SQL table named "Users" using an SQL INSERT query executed through a cursor object.
  • The SQL INSERT query includes placeholders (%s) for user attributes, which are filled with the generated values using tuple substitution.
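One caveat at this scale: executing a separate INSERT for each of the 1 million users is slow. A possible optimization, sketched below (not part of the original script), is to collect rows in a list and write them in batches using psycopg2's execute_values helper:

from psycopg2.extras import execute_values

def insert_users_batch(rows):
    # rows: a list of (user_id, first_name, last_name, email,
    #                  password_hash, create_date, last_login) tuples
    execute_values(
        cursor,
        """
        INSERT INTO Users (UserID, FirstName, LastName, Email,
                           PasswordHash, CreateDate, LastLogin)
        VALUES %s
        """,
        rows,
        page_size=1000,  # rows sent per round trip
    )

The same pattern applies to the Orders and OrderProducts loops that follow.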

Output:


Populating the Orders table:

def generate_orders(users_count=1000000, orders_count=3000000):
    user_ids = ['U' + f'{i:07d}' for i in range(1, users_count + 1)]
    significant_dates = ["12-25", "07-04", "11-27", "05-12"]  # Additional significant dates
    monthly_seasonal_factors = {"12": 1.2, "07": 1.1, "09": 1.05}  # Monthly seasonal factors for increasing order frequency
    inactive_user_probability = 0.05  # Probability of a user being inactive (never making a purchase)
    order_counter = 1
    
    for _ in range(orders_count):
        # Decide whether the user will place an order or not
        if random.random() < inactive_user_probability:
            continue  # Skip this iteration if the user is inactive

        user_id = random.choice(user_ids)
        order_date = fake.date_time_between(start_date="-5y", end_date="now")

        # Apply monthly seasonal factors for increasing order frequency during certain months
        for month, factor in monthly_seasonal_factors.items():
            if order_date.strftime("%m") == month:
                if random.random() < factor:
                    user_id = random.choice(user_ids)
                    order_date = fake.date_time_between(start_date="-5y", end_date="now")
                    break  # Exit loop if monthly seasonal factor applied

        # Increase likelihood of an order being placed around significant dates
        if order_date.strftime("%m-%d") in significant_dates:
            if random.random() < 0.3:  # 30% chance to make another order
                user_id = random.choice(user_ids)
                order_date = fake.date_time_between(start_date="-5y", end_date="now")

        # Generate a random order status based on the specified distribution
        status = random.choices(
            ["Delivered", "Cancelled", "Shipped", "Processing"],
            weights=[90, 5, 3, 2], k=1,
        )[0]

        # Generate a unique order ID
        order_id = f'O{order_counter:07d}'
        order_counter += 1  # Increment counter for next order
        
        cursor.execute(
            """
            INSERT INTO Orders (OrderID, UserID, OrderDate, ShippingAddress, Status)
            VALUES (%s, %s, %s, %s, %s)
            """,
            (order_id, user_id, order_date, fake.address(), status)
        )

generate_orders(users_count=1000000, orders_count=3000000)  # Generate orders

We build the function generate_orders(), which creates order data placed by different users; in our scenario, we simulate 3 million orders.

  • Significant Dates and Seasonal Factors: Defines additional significant dates and monthly seasonal factors to influence order frequency. For instance, certain months such as December and July are assigned factors to increase the likelihood of orders being placed, simulating seasonal trends in consumer behavior.
  • Inactive User Probability: Sets a probability for users to be inactive (never making a purchase). This probability determines whether a user will skip placing an order during the iteration.
  • Order Generation Loop: Iterates orders_count times to generate orders. Within each iteration:
    • Determines whether the user will place an order based on the inactive user probability.
    • Randomly selects a user ID and order date within a specified timeframe.
    • Applies monthly seasonal factors to increase the likelihood of order placement during certain months.
    • Increases the likelihood of order placement around significant dates.
    • Generates a random order status based on specified distribution.
  • Order Status Distribution: Defines a distribution for order statuses, including "Delivered," "Cancelled," "Shipped," and "Processing," with different probabilities. This distribution determines the status of each generated order.
  • We execute SQL INSERT queries to insert generated order data into the "Orders" table. This includes order ID, user ID, order date, shipping address, and status. Parameterized queries are used for safe insertion of data into the database.
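As an aside, a more direct way to encode seasonality (a sketch using our own assumed weights, not the exact method above) is to pick the order month with weighted probabilities and then a uniform timestamp within that month:

import random
from datetime import datetime, timedelta

# Assumed monthly weights; December and July are busier
month_weights = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0, 6: 1.0,
                 7: 1.1, 8: 1.0, 9: 1.05, 10: 1.0, 11: 1.0, 12: 1.2}

def seasonal_order_date(year=2023):
    month = random.choices(list(month_weights),
                           weights=list(month_weights.values()), k=1)[0]
    start = datetime(year, month, 1)
    end = datetime(year + (month == 12), month % 12 + 1, 1)  # first day of next month
    return start + timedelta(seconds=random.uniform(0, (end - start).total_seconds()))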

Output:


Populating the ProductCategories table:

def generate_product_categories():
    categories = [
        ("Electronics", "Products related to electronics, including gadgets, devices, and accessories."),
        ("Home and Kitchen", "Items for home and kitchen use, such as appliances, utensils, and decor."),
        ("Toys and Games", "Toys, games, and entertainment products for children and adults."),
        ("Personal Care", "Products for personal hygiene and grooming."),
        ("Sports and Outdoors", "Equipment, gear, and accessories for sports and outdoor activities."),
        ("Baby and Child Care", "Products for babies and young children, including diapers and childcare essentials."),
        ("Ethnic Wear", "Ethnic and Festive Clothing, accessories and fashion items for women."),
        ("Fitness Accessories", "Products promoting fitness, including supplements and wellness accessories."),
        ("Home Improvement", "Supplies and tools for home improvement and DIY projects."),
        ("Jewellery", "Jewellery products and luxury items."),
        ("Travel Accessories", "Accessories and gear for travel and outdoor adventures.")
        # Add more categories as needed
    ]
    
        
    # Generate and execute an SQL insert statement for each category
    category_counter = 1
    for category_name, description in categories:
        category_id = f'CAT{category_counter:03d}01'
        cursor.execute(
                    """
                    INSERT INTO ProductCategories (CategoryID, CategoryName, Description)
                    VALUES (%s, %s, %s)
                    """,
                    (category_id, category_name, description)
                )
        category_counter += 1
generate_product_categories()

Next, we move on to generate_product_categories(), where we define the 11 most popular categories in the e-commerce sector.

  • We define a list of tuples (categories), where each tuple contains a category name and its corresponding description. These categories represent various product categories available for sale.
  • Secondly, we iterate through each category in the categories list and execute SQL INSERT queries to insert category data into the ProductCategories table. Each query includes the category ID, name, and description as values.

Output:


Filling in the Products table:

We have used the following dataset, which contains around 1600 products belonging to different categories defined in the earlier table.

The dataset below was compiled from an existing dataset generously shared by a contributor (lokeshparab) on Kaggle. While comprehensive, it contained a few errors that had to be corrected manually (for example, a small percentage of products were listed under the wrong category).

Additionally, the product categories were listed separately within the original dataset. To enhance usability and convenience, these categories were consolidated into a single cohesive file. Moreover, the missing values present in some rows of the original dataset were meticulously addressed, ensuring the integrity and completeness of the new dataset.

Please download the dataset by clicking below!

E-Commerce Dataset With Most Popular Categories
Kaggle is the world’s largest data science community with powerful tools and resources to help you achieve your data science goals.
💡
Our dataset was made using the Kaggle dataset provided below, do check it out!
https://www.kaggle.com/datasets/lokeshparab/amazon-products-dataset

Code:

import csv
import psycopg2

def insert_products_data(csv_file_path):
    # Open the CSV file
    with open(csv_file_path, 'r',encoding='utf-8') as file:
        reader = csv.DictReader(file)
        
        # Iterate over each row in the CSV file
        for row in reader:
            # Extract values from the current row
            product_id = row['ProductID']
            category_id = row['CategoryID']
            name = row['name']
            category = row['category']
            image = row['image']
            link = row['link']
            ratings = float(row['ratings'])
            no_of_ratings = int(row['no_of_ratings'].replace(',', ''))
            discount_price = float(row['discount_price'].replace(',', ''))
            actual_price = float(row['actual_price'].replace(',', ''))

            # Execute SQL insert statement
            cursor.execute(
                """
                INSERT INTO Products (ProductID, CategoryID, Name, Category, Image, Link, Ratings, No_of_ratings, Discount_Price, Actual_Price)
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                """,
                (product_id, category_id, name, category, image, link, ratings, no_of_ratings, discount_price, actual_price)
            )

    # With autocommit enabled, each insert is already persisted; we keep the
    # cursor and connection open because the later sections reuse them
    conn.commit()

insert_products_data('Full_dataset.csv')

The function insert_products_data() is executed to load the product data from the given dataset.

  • CSV File Processing: Opens the specified CSV file using the csv.DictReader object, which allows reading the file as a dictionary where column headers are keys and row values are dictionary values.
  • Iteration Over CSV Rows: Iterates over each row in the CSV file, extracting values for each column (e.g., ProductID, CategoryID, Name, etc.) from the current row.
  • Type Conversion: Converts the extracted values to the appropriate data types (e.g., converting string representations of numerical values to floats or integers after stripping thousands separators).
  • Data Insertion: Executes an SQL INSERT statement for each row extracted from the CSV file, inserting the values into the "Products" table of the database.
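If the CSV is already clean (headers in table-column order, thousands separators removed from the numeric fields), a faster alternative is PostgreSQL's COPY via psycopg2's copy_expert, sketched here. Note that COPY performs no Python-side type conversion, so the cleaning above must happen beforehand:

with open('Full_dataset.csv', 'r', encoding='utf-8') as f:
    cursor.copy_expert(
        """
        COPY Products (ProductID, CategoryID, Name, Category, Image, Link,
                       Ratings, No_of_ratings, Discount_Price, Actual_Price)
        FROM STDIN WITH (FORMAT csv, HEADER true)
        """,
        f,
    )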

Output:


Filling the OrderProducts table:


# Fetch product details
cursor.execute("SELECT CategoryID, ProductID, Discount_Price FROM Products")
product_details = cursor.fetchall()

# Example category popularity (adjust as per your usage case)
category_popularity = {
    "CAT00101": 0.2, "CAT00201": 0.15, "CAT00301": 0.15, "CAT00401": 0.1,
    "CAT00501": 0.1, "CAT00601": 0.07, "CAT00701": 0.05, "CAT00801": 0.05,
    "CAT00901": 0.05, "CAT01001": 0.04, "CAT01101": 0.04,
}  # Assuming these category IDs; probabilities sum to 1.0

def select_product_based_on_category():
    dice_roll = random.random()
    cumulative = 0
    for category, probability in category_popularity.items():
        cumulative += probability
        if dice_roll <= cumulative:
            # Filter product_details to only include products in the selected category
            products_in_category = [(product[1], product[2]) for product in product_details if product[0].startswith(category)]
            return products_in_category
    return []

def generate_order_products(orders_count=3000000):
    cursor.execute("SELECT OrderID FROM Orders")
    orders = [row[0] for row in cursor.fetchall()]
    ec = 0  # Counts the OrderProducts rows inserted so far
    for order_id in orders:
        if ec >= orders_count:
            break
        # Determine the number of products in this order
        if random.random() <= 0.7:  # 70% of orders have 1-5 products
            num_products = random.randint(1, 5)
        else:
            num_products = random.randint(6, 10)  # 30% of orders have 6-10 products
        
        for _ in range(num_products):
            products_in_category = select_product_based_on_category()
            if not products_in_category:  # Handle empty list case
                continue
            product_id, discount_price = random.choice(products_in_category)
            
            # Check if the combination already exists in the OrderProducts table
            cursor.execute("""
                SELECT COUNT(*)
                FROM OrderProducts
                WHERE OrderID = %s AND ProductID = %s
            """, (order_id, product_id))
            count = cursor.fetchone()[0]
            
            # If the combination already exists, skip insertion
            if count > 0:
                continue
            
            # Select quantity (more realistic distribution)
            quantity = random.randint(1, 15)
            # Calculate total price
            total_price = discount_price * quantity

            cursor.execute("""
                INSERT INTO OrderProducts (OrderID, ProductID, Quantity, PriceAtPurchase, TotalPrice)
                VALUES (%s, %s, %s, %s, %s)
            """, (order_id, product_id, quantity, discount_price, total_price))

            ec += 1

generate_order_products()


Moving on, we define generate_order_products(), which records the multiple products assigned to the different orders placed by users.

  • Product Details Retrieval: Executes an SQL query to fetch product details including CategoryID, ProductID, and Discount_Price from the "Products" table and stores the results in product_details.
  • We now define a dictionary category_popularity mapping category IDs to their respective popularity probabilities. This data is assumed or derived from usage analytics, indicating the likelihood of a category being chosen for purchase.
  • Order Product Generation:
    • generate_order_products() function generates order-product pairs for a specified number of orders.
    • It randomly determines the number of products in each order, with a 70% chance of having 1-5 products and a 30% chance of having 6-10 products.
    • For each order, it selects products based on category probabilities and inserts them into the "OrderProducts" table, ensuring no duplicate combinations of OrderID and ProductID.
    • The quantity of each product is randomly selected, and the total price is calculated based on the discount price and quantity.
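Two optional speed-ups, sketched under our own assumptions rather than taken from the original script: build the category-to-products lookup once instead of re-filtering product_details on every call, and let Postgres skip duplicate pairs with ON CONFLICT instead of issuing a SELECT per candidate row:

# Build the lookup once
products_by_category = {}
for category_id, product_id, discount_price in product_details:
    products_by_category.setdefault(category_id, []).append((product_id, discount_price))

def insert_order_product(order_id, product_id, quantity, price):
    # The (OrderID, ProductID) primary key makes ON CONFLICT safe here
    cursor.execute("""
        INSERT INTO OrderProducts (OrderID, ProductID, Quantity, PriceAtPurchase, TotalPrice)
        VALUES (%s, %s, %s, %s, %s)
        ON CONFLICT (OrderID, ProductID) DO NOTHING
    """, (order_id, product_id, quantity, price, price * quantity))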

Output:


Populating the OrderDetails table:

def populate_order_details():
    cursor.execute("""
        SELECT OrderID, SUM(TotalPrice) AS TotalOrderPrice, COUNT(ProductID) AS NumberOfItems
        FROM OrderProducts
        GROUP BY OrderID
    """)
    
    for row in cursor.fetchall():
        order_id, total_order_price, number_of_items = row
        
        cursor.execute("""
            INSERT INTO OrderDetails (OrderID, TotalOrderPrice, NumberOfItems)
            VALUES (%s, %s, %s)
        """, (order_id, total_order_price, number_of_items))

populate_order_details()

The populate_order_details() function is executed to obtain, for each order, the final total amount the user has to pay.

  • Order Details Population: Executes an SQL query to calculate aggregate values for each order from the "OrderProducts" table.
    • SELECT statement retrieves OrderID, the sum of TotalPrice (representing the total order price), and the count of ProductID (representing the number of items) for each order.
    • The SUM() and COUNT() functions are used to aggregate data, and the results are grouped by OrderID.
  • Iteration Over Results: Iterates over the results obtained from the SQL query, where each row represents an order along with its total order price and the number of items.
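Because this step is pure aggregation, the same result can be achieved with a single SQL statement executed entirely inside Postgres, an equivalent alternative to the Python loop above:

INSERT INTO OrderDetails (OrderID, TotalOrderPrice, NumberOfItems)
SELECT OrderID, SUM(TotalPrice), COUNT(ProductID)
FROM OrderProducts
GROUP BY OrderID;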

Output:


Populating the ProductReviews table:

The reviews dataset can be downloaded using the Kaggle link below; it contains reviews for all the categories in the ProductCategories table.

Mixtral-8X7B-Generated-Product-Reviews-Ecommerce
Kaggle is the world’s largest data science community with powerful tools and resources to help you achieve your data science goals.

The dataset was generated using the Products table mentioned earlier: we prompted the Mixtral 8x7B model (using the APIs provided by together.ai) to generate human-like reviews with the few-shot prompting technique!

In prompt engineering, this technique involves sending a prompt to a Large Language Model together with examples of the kind of responses expected, so that the generated response is more accurate and better fulfils the requirement.

For example, when we generated the reviews for Electronics category, the prompt looked something like this:

prompt = f"Generate a concise, real-life human-like review of the {category.lower()} {name}. Make it informal and varied, with a few sentences max. Reflect a rating of {rating}/5, focusing on key features of Electronics such as battery life, cameras, durability etc. For example: 1. " Looks durable, charging is fine too, till now satisfied with the quality,this is a good product." 2. " I am using it for 14 days now. The experience is pretty good as of now. Picture quality is also not bad.", Adjust tone according to the rating."

Similarly, for the remaining categories, the examples in the prompt were modified accordingly to produce the best results! The complete code for generating the reviews is given below:

import pandas as pd
import requests

# Function to prompt the API and generate reviews
def generate_review(category, name, rating):
    api_key = "ENTER-YOUR-API-KEY"
    endpoint = "https://api.together.xyz/v1/completions"
    headers = {"Authorization": f"Bearer {api_key}"}
    prompt = f"Generate a concise, real-life human-like review of the {category.lower()} {name}. Make it informal and varied, with 1-2 sentences max. Reflect a rating of {rating}/5, focusing on key features of the {category} etc. Adjust tone according to the rating."
 
    data = {
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "prompt": prompt,
        "max_tokens": 80,
        "stop": ["</s>", "[/INST]"],
        "temperature": 0.9,
        "top_p": 0.7,
        "top_k": 40,
        "repetition_penalty": 1.2,
        "n": 1
    }
    try:
        response = requests.post(endpoint, headers=headers, json=data)
        response.raise_for_status()  # This will raise an exception for HTTP errors.
        review_data = response.json()
        review = review_data.get('choices', [{}])[0].get('text', '').strip()
        return review
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return "Review generation failed"

# Function to read the dataset, generate reviews, and append them as a new column
def process_dataset(file_path):
    # Loading the dataset
    df = pd.read_csv(file_path)
    
    # Generating reviews for each product
    reviews = [generate_review(row['category'], row['name'], row['ratings']) for index, row in df.iterrows()]
    
    # Appending the reviews as a new column
    df['reviews'] = reviews
    
    # Saving the updated dataset
    new_file_path = file_path.replace('.csv', '_newFile.csv')
    df.to_csv(new_file_path, index=False)
    print(f"Dataset with reviews saved to {new_file_path}")


process_dataset('Full_dataset.csv')

Code for the table:

def product_reviews():
    # Load the CSV file into a DataFrame
    df = pd.read_csv('Reviews_dataset.csv')

    # Initialize Faker for generating fake data
    fake = Faker()

    # Define the start and end dates for the review date range
    start_date = '-5y'  # 5 years ago
    end_date = 'now'    # Now

    # Iterate over each row in the DataFrame and insert data into the table
    for index, row in df.iterrows():
        # Generate a review date between the specified range
        review_date = fake.date_time_between(start_date=start_date, end_date=end_date)

        # Insert data into the table using cursor.execute
        cursor.execute("""
            INSERT INTO ProductReviews (ReviewID, ProductID, UserID, Name, Category, Ratings, ReviewText, ReviewDate)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
        """, (row['ReviewID'], row['ProductID'], row['UserID'], row['name'], row['category'], row['ratings'], row['reviews'], review_date))

product_reviews()


The product_reviews() function is used to populate the table with product review data extracted from the CSV file in the dataset provided.

  • Firstly, we load the contents of the CSV file 'Reviews_dataset.csv' into a pandas DataFrame (df) using the pd.read_csv() function.
  • Review Date Generation: Generates a random review date between the specified start and end dates using the fake.date_time_between() function from the Faker library.
  • Data Insertion: Inserts data from each row of the DataFrame into the ProductReviews table of the database using an SQL INSERT statement executed through cursor.execute().

Output:

To simulate real-world distribution of data:
GPT-4 and Claude suggestions

Schema suggestions:

1. Users Table

Addition of a new column, UserType, to distinguish between regular customers and sellers/vendors. Since PostgreSQL implements ENUMs as standalone types, this would look like:

CREATE TYPE user_type AS ENUM ('customer', 'vendor');
ALTER TABLE Users ADD COLUMN UserType user_type;

2. Orders Table

Creating an additional PaymentMethod column to account for diverse payment options, again via a PostgreSQL ENUM type:

CREATE TYPE payment_method AS ENUM ('credit_card', 'paypal', 'bank_transfer', 'voucher');
ALTER TABLE Orders ADD COLUMN PaymentMethod payment_method;

3. Returns Table

The addition of a Returns table to cater to the significant percentage of users who return products due to dissatisfaction or other reasons (referencing Orders, since our OrderDetails table is keyed by OrderID):

CREATE TABLE Returns (
    ReturnID SERIAL PRIMARY KEY,
    OrderID VARCHAR(8),
    ReturnDate TIMESTAMP,
    Reason VARCHAR(255),
    FOREIGN KEY (OrderID) REFERENCES Orders(OrderID)
);


Distribution suggestions:

Generate User Data with Activity Levels

  1. Determine the weights for different user groups. For example:
  • Frequent buyers: 10% of users
  • Occasional buyers: 30% of users
  • Rare buyers: 60% of users

Modify the user generation script to assign each user to a group based on these weights.

user_groups = ['frequent', 'occasional', 'rare']
weights = [0.1, 0.3, 0.6] # Corresponding to the distribution of user groups
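Each user can then be assigned to a group with random.choices, for example:

user_group = random.choices(user_groups, weights=weights, k=1)[0]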

  2. Generate a mix of 'customer' and 'vendor' user types, roughly in a 10:1 ratio, to mimic the greater number of customers than vendors in real life.

Assign Product Popularity Tiers

When generating products, assign each product to a popularity tier based on predefined weights:

product_popularity = ['high', 'medium', 'low']
popularity_weights = [0.1, 0.3, 0.6]

popularity_level = random.choices(product_popularity, popularity_weights, k=1)[0]

Generating and Inserting Orders

  • Order Generation Logic: Leverage the Pareto Principle (80/20 rule), where 20% of customers are responsible for 80% of orders (see the sketch after this list).
  • Multiple Products per Order: Ensure that around 60-70% of orders contain more than one product to reflect realistic shopping baskets.
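A minimal sketch of the 80/20 idea, assuming the user ID scheme used earlier: mark a random 20% of users as heavy buyers and draw 80% of orders from that segment.

import random

user_ids = [f'U{i:07d}' for i in range(1, 1000001)]
random.shuffle(user_ids)
cutoff = len(user_ids) // 5
heavy_buyers, other_buyers = user_ids[:cutoff], user_ids[cutoff:]

def pick_buyer():
    # 80% of orders come from the 20% heavy-buyer segment
    return random.choice(heavy_buyers if random.random() < 0.8 else other_buyers)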

ProductReviews

  • Rating Distribution: Use a skewed distribution favoring higher ratings, as people tend to post reviews either when very satisfied or very dissatisfied (a sketch follows the Returns item below).

Returns

  • Return Rate and Reasons: Assign a return rate of about 5-10%, consistent with e-commerce averages, and generate a mix of reasons (wrong item, poor quality, etc.).
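A sketch covering both suggestions; the weights and the 7% rate below are illustrative choices of our own:

import random

# J-shaped rating distribution: mostly 4-5 stars, with a bump at 1 star
rating = random.choices([1, 2, 3, 4, 5],
                        weights=[0.12, 0.05, 0.08, 0.25, 0.50], k=1)[0]

# ~7% of orders get returned, within the 5-10% band
if random.random() < 0.07:
    reason = random.choice(["Wrong item", "Poor quality",
                            "Damaged in transit", "Changed mind"])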

Summary

In this post, we explored how to make realistic data for e-commerce databases. We started by figuring out the most common types of tables these databases use. We then created data that looks and acts like what one would see in the real world, using typical patterns and distributions.

This synthetic data offers a lot of value. It's super useful for testing e-commerce systems to make sure they work properly. Plus, it's a great learning tool, giving insights into how these databases function in the real world.

Please subscribe to our newsletter! It contains the latest updates and insights happening in the world of AI!

Image source: Datascience.fm

Stay tuned!