Crafting Real-World Like Data For E-Commerce Domain Databases
In the vast expanse of the e-commerce realm, the efficiency and robustness of databases are paramount. Creating databases that faithfully replicate real-world situations is a complex task, yet it is essential for evaluating the robustness and functionality of e-commerce platforms. In this post, we delve into the process of generating realistic synthetic data for the world of e-commerce.
We present a comprehensive approach to synthesizing data that mimics the intricacies of actual e-commerce transactions. We will identify the schemas and table structures crucial to e-commerce operations, populate them with data drawn from common distributions, and explain how this synthetic data can serve many purposes, including testing and teaching.
We'll start with how to run a Postgres image using Docker, followed by a discussion of how the entire e-commerce schema is designed. Next, we'll write the code that fills the tables as per the schema, ending the blog with suggestions from models like GPT-4 and Claude on how the data creation and distribution can be made more realistic and efficient.
Running Postgres image using Docker
Step 1: Pulling the postgres image from Docker Hub:
We shall start by visiting the postgres page on Docker Hub, from where we can pull the image. Let us pull the postgres image, which at the time of writing ships with the latest version (16.2), with the following command in your command prompt/terminal:
docker pull postgres
Once it's done, you can see the postgres image displayed in the Images module of Docker Desktop, which confirms that the pull was successful.
Step 2: Starting a postgres instance:
Let us now run the following command on terminal/command prompt:
docker run --name some-postgres -p 5432:5432 -e POSTGRES_PASSWORD=mysecretpassword -d postgres
In this command:
- some-postgres is the name of your container, which can be changed as per your liking.
- The -p parameter maps the port, exposing port 5432 from the Docker container to our host operating system.
- The -e flag sets environment variables. In our command, we have only one environment variable, POSTGRES_PASSWORD, set here to mysecretpassword, which can also be changed before running the command.
- -d refers to detached mode, which runs the container in the background so we can use the terminal for other commands.
- Lastly, the final part is the name of the image to run, i.e. the word postgres.
To check that the container is up and running, run the following command:
docker ps
The expected output looks like this:
Step 3: Using DBeaver/other IDEs to create connections and run SQL queries:
DBeaver is a free and open-source SQL client software application and a database administration tool.
Once installed, open the application and click on New Database Connection, which will display the following dialog box:
After selecting PostgreSQL, click on Next.
Here, fill in the details as per the docker run command we ran earlier: the port number, username, host, and password. To verify that the connection can be established, select Test Connection.
If the connection test shows the status as Connected, great job! We are now ready to dive into the schema section!
An alternative way to get PostgreSQL is to use neon.tech, which provides serverless Postgres with a free tier!
Different tables in the E-commerce database schema:
Table 1: Users Table
CREATE TABLE Users (
UserID VARCHAR(8) PRIMARY KEY,
FirstName VARCHAR(255),
LastName VARCHAR(255),
Email VARCHAR(255) UNIQUE,
PasswordHash VARCHAR(255),
CreateDate TIMESTAMP,
LastLogin TIMESTAMP
);
We create a table named Users to store information about each user, with the following columns:
- UserID: Designated as the primary key for the table to ensure that each UserID value is unique within the table.
- FirstName and LastName: Intended to store the first and last names of users, respectively.
- Email: Marked as UNIQUE; this constraint ensures that no two users can have the same email address. The PasswordHash column stores the hashed password of each user, protecting user passwords in case of a data breach.
- CreateDate and LastLogin: TIMESTAMP columns that store the date and time when each user account was created and the date and time of the user's last login, respectively.
Table 2: Orders Table
CREATE TABLE Orders (
OrderID VARCHAR(8) PRIMARY KEY,
UserID VARCHAR(8),
OrderDate TIMESTAMP,
ShippingAddress TEXT,
Status VARCHAR(50),
FOREIGN KEY (UserID) REFERENCES Users(UserID)
);
The Orders table is used to manage order-related data, which includes:
- OrderID: Unique identifier for each order, serving as the primary key
- UserID: Identifies the user associated with the order; acts as a foreign key referencing UserID in the Users table.
- OrderDate: Records the date and time when the order was placed.
- ShippingAddress: Stores the shipping address for the order.
- Status: Indicates the current status of the order.
Table 3: ProductCategories Table
CREATE TABLE ProductCategories (
CategoryID VARCHAR(8) PRIMARY KEY,
CategoryName VARCHAR(255),
Description TEXT
);
We establish a table called ProductCategories to organize product-category information, which features:
- CategoryID: A unique identifier for each product category, functioning as the primary key. CategoryName records the name of each product category, and Description accommodates detailed descriptions of the product categories.
Table 4: Products Table
CREATE TABLE Products (
ProductID VARCHAR(10) PRIMARY KEY,
CategoryID VARCHAR(10),
Name VARCHAR(511),
Category VARCHAR(511),
Image VARCHAR(511),
Link VARCHAR(511),
Ratings NUMERIC(4, 1),
No_of_ratings INT,
Discount_Price DECIMAL(30, 3),
Actual_Price DECIMAL(30, 3),
FOREIGN KEY (CategoryID) REFERENCES ProductCategories(CategoryID)
);
We define a table named Products to catalog information about various products. It serves as a structured repository for product data, facilitating efficient organization and retrieval. It contains:
- ProductID: A unique identifier for each product, serving as the primary key.
- CategoryID: Associates each product with a specific category, referencing the CategoryID in the "ProductCategories" table.
- Name: Captures the name of each product.
- Category: Provides a descriptive label for the product's category.
- Image: Stores the file path or URL of the product's image.
- Link: Contains the URL or link to the product page.
- Ratings: Represents the average rating of the product, allowing one decimal place.
- No_of_ratings: Tracks the total number of ratings received by the product.
- Discount_Price: Indicates the discounted price of the product, allowing up to three decimal places.
- Actual_Price: Specifies the original price of the product before any discounts, also allowing up to three decimal places.
Table 5: OrderProducts Table
CREATE TABLE OrderProducts (
OrderID VARCHAR(10) NOT NULL,
ProductID VARCHAR(10) NOT NULL,
Quantity INT NOT NULL,
PriceAtPurchase DECIMAL(30, 3) NOT NULL,
TotalPrice DECIMAL(30, 3) NOT NULL,
PRIMARY KEY (OrderID, ProductID),
FOREIGN KEY (OrderID) REFERENCES Orders(OrderID),
FOREIGN KEY (ProductID) REFERENCES Products(ProductID)
);
The OrderProducts table manages the relationship between orders and products, with the following columns:
- OrderID: Identifies the order associated with the product.
- ProductID: Identifies the product included in the order.
- Quantity: Specifies the quantity of the product ordered.
- PriceAtPurchase: Records the price of the product at the time of purchase, allowing up to three decimal places.
- TotalPrice: Represents the total price of the product based on the quantity ordered, allowing up to three decimal places.
The combination of OrderID and ProductID forms the primary key, ensuring each pair uniquely identifies an entry. Additionally, foreign key constraints link the OrderID to the "Orders" table and the ProductID to the "Products" table, maintaining data integrity between related entities.
Table 6: OrderDetails Table
CREATE TABLE OrderDetails (
OrderID VARCHAR(10) NOT NULL,
TotalOrderPrice DECIMAL(30, 3),
NumberOfItems INT,
PRIMARY KEY (OrderID),
FOREIGN KEY (OrderID) REFERENCES Orders(OrderID)
);
The OrderDetails table consolidates information about each order, with the following columns:
- OrderID: Identifies the order and serves as the primary key.
- TotalOrderPrice: Represents the total price of the order, allowing up to three decimal places.
- NumberOfItems: Indicates the total number of items included in the order.
The OrderID column serves as the primary key, uniquely identifying each order. Furthermore, a foreign key constraint links the OrderID to the "Orders" table, ensuring data integrity and maintaining consistency.
Table 7: ProductReviews Table
CREATE TABLE ProductReviews (
    ReviewID VARCHAR(10) NOT NULL,
    ProductID VARCHAR(10) NOT NULL,
    UserID VARCHAR(10) NOT NULL,
    Name VARCHAR(511),
    Category VARCHAR(511),
    Ratings NUMERIC(4, 1),
    ReviewText TEXT,
    ReviewDate TIMESTAMP,
    FOREIGN KEY (ProductID) REFERENCES Products(ProductID),
    FOREIGN KEY (UserID) REFERENCES Users(UserID)
);
The ProductReviews table manages product review data, featuring the following columns:
- ReviewID: Serves as a unique identifier for each product review.
- ProductID: Identifies the product associated with each review.
- UserID: Represents the user who submitted the review, establishing a link to the user profile.
- Name: Stores the name of the product being reviewed.
- Category: Specifies the category to which the reviewed product belongs, aiding in the categorization and analysis of review data.
- Ratings: Captures the rating assigned to the product in each review, allowing users to evaluate product quality and facilitating data-driven decision-making.
- ReviewText: Accommodates the textual content of each review, enabling users to express their opinions and provide detailed feedback on products.
- ReviewDate: Records the date and time when each review was submitted, allowing for chronological analysis and tracking of review activity over time.
The table includes foreign key constraints that link the ProductID column to the Products table and the UserID column to the Users table.
Populating the tables using Python
Initial setup:
from faker import Faker
import psycopg2 # or pymysql for MySQL
import random
fake = Faker()
# Establish your database connection here
conn = psycopg2.connect(
dbname='postgres', user='postgres',
password='mysecretpassword', host='localhost'
)
conn.autocommit = True
cursor = conn.cursor()
Firstly, we import the necessary libraries, including Faker for generating fake data and psycopg2 for PostgreSQL database interaction. We then establish a connection to a locally hosted PostgreSQL database, providing the appropriate credentials, and enable autocommit mode so changes are committed automatically.
Finally, we create a cursor object to execute SQL commands. This setup prepares the environment for generating fake data and inserting it into the PostgreSQL database.
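With the connection and cursor in place, the CREATE TABLE statements from the schema section can be executed directly through this cursor. A minimal sketch for the Users table (the remaining six statements follow the same pattern):
# Create the Users table through the same cursor; with autocommit
# enabled, the table is created immediately.
cursor.execute("""
    CREATE TABLE Users (
        UserID VARCHAR(8) PRIMARY KEY,
        FirstName VARCHAR(255),
        LastName VARCHAR(255),
        Email VARCHAR(255) UNIQUE,
        PasswordHash VARCHAR(255),
        CreateDate TIMESTAMP,
        LastLogin TIMESTAMP
    )
""")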
Filling the Users table:
def generate_users(n=1000000):
generated_emails = set() # Set to keep track of generated emails
user_counter = 1 # Start counter for sequential user IDs
for _ in range(n):
# Generate unique email address
while True:
email = fake.email()
if email not in generated_emails:
generated_emails.add(email)
break
# Generate user attributes with more variability
first_name = fake.first_name()
last_name = fake.last_name()
password_hash = fake.md5()
# Simulating that user creation becomes less frequent over time by using a weighted choice
if random.choices([True, False], weights=(30, 70), k=1)[0]:
create_date = fake.date_time_between(start_date='-2y', end_date='now', tzinfo=None)
else:
create_date = fake.date_time_between(start_date='-5y', end_date='now', tzinfo=None)
        # Simulate different login behaviors; note we keep the create_date
        # drawn above for recent-login users instead of overwriting it
        if random.random() < 0.7:  # 70% chance of having a recent login
            last_login = fake.date_time_between(start_date=create_date, end_date='now')
        else:  # 30% chance of having an old login (none in the last 6 months)
            create_date = fake.date_time_between(start_date='-2y', end_date='-6m', tzinfo=None)
            last_login = fake.date_time_between(start_date=create_date, end_date='-6m')
# Generate a sequential user ID
user_id = f'U{user_counter:07d}'
user_counter += 1 # Increment counter for next user
# Execute SQL query to insert user data
cursor.execute(
"""
INSERT INTO Users (UserID, FirstName, LastName, Email, PasswordHash, CreateDate, LastLogin)
VALUES (%s, %s, %s, %s, %s, %s, %s)
""",
(user_id, first_name, last_name, email, password_hash, create_date, last_login)
)
generate_users()
The function generate_users() is designed to simulate the creation of large-scale user data; here we are creating a database of 1 million users.
- It uses the Faker library to generate realistic user attributes such as first names, last names, and email addresses.
- To mimic real-life scenarios, it includes variations such as account creation becoming less frequent over time, and different user behaviors, like a larger share of users having logged in recently.
- Each user is assigned a unique sequential user ID (user_id) to ensure data integrity and easy identification.
- The function inserts the generated data into the SQL table named Users using an SQL INSERT query executed through a cursor object. (A batched alternative is sketched below.)
- The SQL INSERT query includes placeholders (%s) for user attributes, which are filled with the generated values using tuple substitution.
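As an aside, inserting one million rows with individual INSERT statements is slow. One hedged optimization uses psycopg2's execute_values helper; this sketch assumes the generated tuples are first accumulated in a list called rows instead of being inserted one by one:
from psycopg2.extras import execute_values

# rows holds (user_id, first_name, last_name, email, password_hash, create_date, last_login)
# tuples accumulated inside the generation loop
execute_values(
    cursor,
    """
    INSERT INTO Users (UserID, FirstName, LastName, Email, PasswordHash, CreateDate, LastLogin)
    VALUES %s
    """,
    rows,
    page_size=10000  # number of rows bundled into each INSERT statement
)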
Output:
Populating the Orders table:
def generate_orders(users_count=1000000, orders_count=3000000):
user_ids = ['U' + f'{i:07d}' for i in range(1, users_count + 1)]
significant_dates = ["12-25", "07-04", "11-27", "05-12"] # Additional significant dates
monthly_seasonal_factors = {"12": 1.2, "07": 1.1, "09": 1.05} # Monthly seasonal factors for increasing order frequency
inactive_user_probability = 0.05 # Probability of a user being inactive (never making a purchase)
order_counter = 1
for _ in range(orders_count):
# Decide whether the user will place an order or not
if random.random() < inactive_user_probability:
continue # Skip this iteration if the user is inactive
user_id = random.choice(user_ids)
order_date = fake.date_time_between(start_date="-5y", end_date="now")
# Apply monthly seasonal factors for increasing order frequency during certain months
for month, factor in monthly_seasonal_factors.items():
if order_date.strftime("%m") == month:
if random.random() < factor:
user_id = random.choice(user_ids)
order_date = fake.date_time_between(start_date="-5y", end_date="now")
break # Exit loop if monthly seasonal factor applied
# Increase likelihood of an order being placed around significant dates
if order_date.strftime("%m-%d") in significant_dates:
if random.random() < 0.3: # 30% chance to make another order
user_id = random.choice(user_ids)
order_date = fake.date_time_between(start_date="-5y", end_date="now")
# Generate random order status based on specified distribution
status_choices = ["Delivered"] * 90 + ["Cancelled"] * 5 + ["Shipped"] * 3 + ["Processing"] * 2
status = random.choice(status_choices)
# Generate a unique order ID
order_id = f'O{order_counter:07d}'
order_counter += 1 # Increment counter for next order
cursor.execute(
"""
INSERT INTO Orders (OrderID, UserID, OrderDate, ShippingAddress, Status)
VALUES (%s, %s, %s, %s, %s)
""",
(order_id,user_id, order_date, fake.address(), status)
)
generate_orders(users_count=1000000, orders_count=3000000) # Generate orders
We build the function generate_orders(), which creates the data for orders placed by different users; in our scenario, we simulate 3 million orders.
- Significant Dates and Seasonal Factors: Defines additional significant dates and monthly seasonal factors to influence order frequency. For instance, certain months such as December and July are assigned factors to increase the likelihood of orders being placed, simulating seasonal trends in consumer behavior.
- Inactive User Probability: Sets a probability for users to be inactive (never making a purchase). This probability determines whether a user will skip placing an order during the iteration.
- Order Generation Loop: Iterates orders_count times to generate orders. Within each iteration, it:
- Determines whether the user will place an order based on the inactive user probability.
- Randomly selects a user ID and order date within a specified timeframe.
- Applies monthly seasonal factors to increase the likelihood of order placement during certain months.
- Increases the likelihood of order placement around significant dates.
- Generates a random order status based on specified distribution.
- Order Status Distribution: Defines a distribution for order statuses, including "Delivered," "Cancelled," "Shipped," and "Processing," with different probabilities. This distribution determines the status of each generated order.
- We execute SQL INSERT queries to insert the generated order data into the Orders table, including the order ID, user ID, order date, shipping address, and status. Parameterized queries are used for safe insertion of data into the database. (A more direct weighted-sampling alternative for the status draw is shown below.)
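A small design note: the 100-element status_choices list encodes the 90/5/3/2 split. The same distribution can be drawn more directly with weighted sampling, for example:
import random

# Equivalent to picking from the 100-element list:
# 90% Delivered, 5% Cancelled, 3% Shipped, 2% Processing
status = random.choices(
    ["Delivered", "Cancelled", "Shipped", "Processing"],
    weights=[90, 5, 3, 2],
    k=1
)[0]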
Output:
Populating the ProductCategories table:
def generate_product_categories():
categories = [
("Electronics", "Products related to electronics, including gadgets, devices, and accessories."),
("Home and Kitchen", "Items for home and kitchen use, such as appliances, utensils, and decor."),
("Toys and Games", "Toys, games, and entertainment products for children and adults."),
("Personal Care", "Products for personal hygiene and grooming."),
("Sports and Outdoors", "Equipment, gear, and accessories for sports and outdoor activities."),
("Baby and Child Care", "Products for babies and young children, including diapers and childcare essentials."),
("Ethnic Wear", "Ethnic and Festive Clothing, accessories and fashion items for women."),
("Fitness Accessories", "Products promoting fitness, including supplements and wellness accessories."),
("Home Improvement", "Supplies and tools for home improvement and DIY projects."),
("Jewellery", "Jewellery products and luxury items."),
("Travel Accessories", "Accessories and gear for travel and outdoor adventures.")
# Add more categories as needed
]
    # Generate an SQL insert statement for each category
    category_counter = 1
    for category_name, description in categories:
category_id = f'CAT{category_counter:03d}01'
cursor.execute(
"""
INSERT INTO ProductCategories (CategoryID, CategoryName, Description)
VALUES (%s, %s, %s)
""",
(category_id, category_name, description)
)
category_counter += 1
generate_product_categories()
Next, we move on to generate_product_categories(), where we define the 11 most popular categories in the e-commerce sector.
- We define a list of tuples (categories), where each tuple contains a category name and its corresponding description. These categories represent the various product categories available for sale.
- We then iterate through each category in the categories list and execute SQL INSERT queries to insert category data into the ProductCategories table. Each query includes the category ID, name, and description as values.
Output:
Filling in the Products table:
We have used the following dataset, which contains around 1600 products belonging to different categories defined in the earlier table.
The dataset below was compiled from an existing dataset generously shared by a contributor (lokeshparab) on Kaggle. While comprehensive, it contained a few errors that had to be corrected manually (for example, a small percentage of products were placed in the wrong category).
Additionally, the product categories were listed separately within the original dataset; to enhance usability and convenience, these were consolidated into a single cohesive file. Moreover, the missing values present in some rows of the original dataset were addressed, ensuring the integrity and completeness of the new dataset.
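The exact cleaning steps were done by hand, but a rough pandas sketch of the kind of consolidation described above might look like this (the folder layout and column names here are assumptions, not the published pipeline):
import glob
import pandas as pd

# Merge the per-category CSVs from the original Kaggle dataset into one file
frames = [pd.read_csv(path) for path in glob.glob("amazon_categories/*.csv")]  # assumed layout
df = pd.concat(frames, ignore_index=True)

# Drop rows with missing numeric fields, as described above
df = df.dropna(subset=["ratings", "no_of_ratings", "discount_price", "actual_price"])
df.to_csv("Full_dataset.csv", index=False)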
Please download the dataset by clicking below!
https://www.kaggle.com/datasets/lokeshparab/amazon-products-dataset
Code:
import csv
import psycopg2
def insert_products_data(csv_file_path):
# Open the CSV file
with open(csv_file_path, 'r',encoding='utf-8') as file:
reader = csv.DictReader(file)
# Iterate over each row in the CSV file
for row in reader:
# Extract values from the current row
product_id = row['ProductID']
category_id = row['CategoryID']
name = row['name']
category = row['category']
image = row['image']
link = row['link']
ratings = float(row['ratings'])
no_of_ratings = int(row['no_of_ratings'].replace(',', ''))
discount_price = float(row['discount_price'].replace(',', ''))
actual_price = float(row['actual_price'].replace(',', ''))
# Execute SQL insert statement
cursor.execute(
"""
INSERT INTO Products (ProductID, CategoryID, Name, Category, Image, Link, Ratings, No_of_ratings, Discount_Price, Actual_Price)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
""",
(product_id, category_id, name, category, image, link, ratings, no_of_ratings, discount_price, actual_price)
)
    # Autocommit is enabled, and the same cursor and connection are reused
    # by the later population steps, so we deliberately leave them open here
insert_products_data('Full_dataset.csv')
The function insert_products_data() is executed to fill in product data from the given dataset.
- CSV File Processing: Opens the specified CSV file with a csv.DictReader object, which reads each row as a dictionary whose keys are the column headers.
- Iteration Over CSV Rows: Iterates over each row in the CSV file, extracting the value of each column (e.g., ProductID, CategoryID, Name, etc.).
- We convert the extracted values to the appropriate data types (e.g., converting string representations of numerical values to floats or integers).
- We execute an SQL INSERT statement for each row, inserting the extracted values into the Products table of the database. (For bulk loads, see the COPY sketch below.)
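Row-by-row INSERTs work, but for a one-off bulk CSV load, Postgres's COPY is usually much faster. A sketch using psycopg2's copy_expert, assuming the numeric columns have already been cleaned of thousands separators so the file can be ingested as-is:
# Bulk-load the CSV straight into Products via COPY
with open('Full_dataset.csv', 'r', encoding='utf-8') as f:
    cursor.copy_expert(
        """
        COPY Products (ProductID, CategoryID, Name, Category, Image, Link,
                       Ratings, No_of_ratings, Discount_Price, Actual_Price)
        FROM STDIN WITH (FORMAT csv, HEADER true)
        """,
        f
    )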
Output:
Filling the OrderProducts table:
# Fetch product details
cursor.execute("SELECT CategoryID, ProductID, Discount_Price FROM Products")
product_details = cursor.fetchall()
# Example category popularity (adjust as per your usage case)
category_popularity = {"CAT00101": 0.2, "CAT00201": 0.15, "CAT00301": 0.15,"CAT00401": 0.1,"CAT00501": 0.1,"CAT00601": 0.07,"CAT00701": 0.05,"CAT00801": 0.05,"CAT00901": 0.05,"CAT01001": 0.04,"CAT01101": 0.04 } # Assuming category IDs
def select_product_based_on_category():
dice_roll = random.random()
cumulative = 0
for category, probability in category_popularity.items():
cumulative += probability
if dice_roll <= cumulative:
# Filter product_details to only include products in the selected category
products_in_category = [(product[1], product[2]) for product in product_details if product[0].startswith(category)]
return products_in_category
return []
def generate_order_products(orders_count=3000000):
cursor.execute("SELECT OrderID FROM Orders")
orders = [row[0] for row in cursor.fetchall()]
ec = 0
for order_id in orders:
if ec >= orders_count:
break
# Determine the number of products in this order
if random.random() <= 0.7: # 70% of orders have 1-5 products
num_products = random.randint(1, 5)
else:
num_products = random.randint(6, 10) # 30% of orders have 6-10 products
for _ in range(num_products):
products_in_category = select_product_based_on_category()
if not products_in_category: # Handle empty list case
continue
product_id, discount_price = random.choice(products_in_category)
# Check if the combination already exists in the OrderProducts table
cursor.execute("""
SELECT COUNT(*)
FROM OrderProducts
WHERE OrderID = %s AND ProductID = %s
""", (order_id, product_id))
count = cursor.fetchone()[0]
# If the combination already exists, skip insertion
if count > 0:
continue
# Select quantity (more realistic distribution)
quantity = random.randint(1, 15)
# Calculate total price
total_price = discount_price * quantity
cursor.execute("""
INSERT INTO OrderProducts (OrderID, ProductID, Quantity, PriceAtPurchase, TotalPrice)
VALUES (%s, %s, %s, %s, %s)
""", (order_id, product_id, quantity, discount_price, total_price))
ec += 1
generate_order_products()
Moving on, we define generate_order_products(), which keeps track of the multiple products assigned to the different orders made by users.
- Product Details Retrieval: Executes an SQL query to fetch product details, including CategoryID, ProductID, and Discount_Price, from the Products table, and stores the results in product_details.
- We define a dictionary category_popularity mapping category IDs to their respective popularity probabilities. This data is assumed or derived from usage analytics, indicating the likelihood of a category being chosen for purchase.
- Order Product Generation: The generate_order_products() function generates order-product pairs for a specified number of orders.
- It randomly determines the number of products in each order, with a 70% chance of having 1-5 products and a 30% chance of having 6-10 products.
- For each order, it selects products based on category probabilities and inserts them into the OrderProducts table, ensuring no duplicate combinations of OrderID and ProductID. (An in-memory alternative to the duplicate check is sketched below.)
- The quantity of each product is randomly selected, and the total price is calculated based on the discount price and quantity.
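A note on efficiency: the SELECT COUNT(*) duplicate check issues an extra query for every candidate row. Since this script is the only writer, the same guarantee can be kept in memory with a set, as a sketch:
seen_pairs = set()  # (order_id, product_id) combinations already inserted

def is_duplicate(order_id, product_id):
    # In-memory replacement for the per-row SELECT COUNT(*) check
    if (order_id, product_id) in seen_pairs:
        return True
    seen_pairs.add((order_id, product_id))
    return False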
Output:
Populating the OrderDetails table:
def populate_order_details():
cursor.execute("""
SELECT OrderID, SUM(TotalPrice) AS TotalOrderPrice, COUNT(ProductID) AS NumberOfItems
FROM OrderProducts
GROUP BY OrderID
""")
for row in cursor.fetchall():
order_id, total_order_price, number_of_items = row
cursor.execute("""
INSERT INTO OrderDetails (OrderID, TotalOrderPrice, NumberOfItems)
VALUES (%s, %s, %s)
""", (order_id, total_order_price, number_of_items))
populate_order_details()
The populate_order_details() function is executed to compute the final total amount to be paid by the user for each order.
- Order Details Population: Executes an SQL query to calculate aggregate values for each order from the OrderProducts table. The SELECT statement retrieves OrderID, the sum of TotalPrice (the total order price), and the count of ProductID (the number of items) for each order. The SUM() and COUNT() functions aggregate the data, and the results are grouped by OrderID.
- Iteration Over Results: Iterates over the results of the SQL query, where each row represents an order along with its total order price and number of items, and inserts each into the OrderDetails table.
Output:
Populating the ProductReviews table:
The reviews dataset can be downloaded using the Kaggle link mentioned below; it contains reviews for all the categories in the ProductCategories table.
The dataset was generated using the Products table mentioned earlier, where we prompted the Mixtral 8x7B model (using the APIs provided by together.ai) to generate human-like reviews using the few-shot prompting technique!
This prompt engineering technique involves sending a prompt to a Large Language Model along with examples of the kind of responses expected, so that the generated response is more accurate and fulfils the requirement more closely.
For example, when we generated the reviews for Electronics category, the prompt looked something like this:
prompt = f"Generate a concise, real-life human-like review of the {category.lower()} {name}. Make it informal and varied, with a few sentences max. Reflect a rating of {rating}/5, focusing on key features of Electronics such as battery life, cameras, durability etc. For example: 1. " Looks durable, charging is fine too, till now satisfied with the quality,this is a good product." 2. " I am using it for 14 days now. The experience is pretty good as of now. Picture quality is also not bad.", Adjust tone according to the rating."
Similarly, for the remaining categories, the examples in the prompt were modified accordingly so that the model provides the best results! The complete code for generating the reviews is given below:
import pandas as pd
import requests
# Function to prompt the API and generate reviews
def generate_review(category, name, rating):
api_key = "ENTER-YOUR-API-KEY"
endpoint = "https://api.together.xyz/v1/completions"
headers = {"Authorization": f"Bearer {api_key}"}
prompt = f"Generate a concise, real-life human-like review of the {category.lower()} {name}. Make it informal and varied, with 1-2 sentences max. Reflect a rating of {rating}/5, focusing on key features of the {category} etc. Adjust tone according to the rating."
data = {
"model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"prompt": prompt,
"max_tokens": 80,
"stop": ["</s>", "[/INST]"],
"temperature": 0.9,
"top_p": 0.7,
"top_k": 40,
"repetition_penalty": 1.2,
"n": 1
}
try:
response = requests.post(endpoint, headers=headers, json=data)
response.raise_for_status() # This will raise an exception for HTTP errors.
review_data = response.json()
review = review_data.get('choices', [{}])[0].get('text', '').strip()
return review
except requests.exceptions.RequestException as e:
print(f"Error: {e}")
return "Review generation failed"
# Function to read the dataset, generate reviews, and append them as a new column
def process_dataset(file_path):
# Loading the dataset
df = pd.read_csv(file_path)
# Generating reviews for each product
reviews = [generate_review(row['category'], row['name'], row['ratings']) for index, row in df.iterrows()]
# Appending the reviews as a new column
df['reviews'] = reviews
# Saving the updated dataset
new_file_path = file_path.replace('.csv', '_newFile.csv')
df.to_csv(new_file_path, index=False)
print(f"Dataset with reviews saved to {new_file_path}")
process_dataset('Full_dataset.csv')
Code for the table:
def product_reviews():
# Load the CSV file into a DataFrame
df = pd.read_csv('Reviews_dataset.csv')
# Initialize Faker for generating fake data
fake = Faker()
# Define the start and end dates for the review date range
start_date = '-5y' # 5 years ago
end_date = 'now' # Now
# Iterate over each row in the DataFrame and insert data into the table
for index, row in df.iterrows():
# Generate a review date between the specified range
review_date = fake.date_time_between(start_date=start_date, end_date=end_date)
# Insert data into the table using cursor.execute
cursor.execute("""
INSERT INTO ProductReviews (ReviewID, ProductID, UserID, Name, Category, Ratings, ReviewText, ReviewDate)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
""", (row['ReviewID'], row['ProductID'], row['UserID'], row['name'], row['category'], row['ratings'], row['reviews'], review_date))
product_reviews()
The product_reviews() function populates the table with product review data extracted from the CSV file in the dataset provided.
- Firstly, we load the contents of the CSV file Reviews_dataset.csv into a pandas DataFrame (df) using the pd.read_csv() function.
function. - Review Date Generation: Generates a random review date between the specified start and end dates using the
fake.date_time_between()
function from the Faker library. - Data Insertion: Inserts data from each row of the DataFrame into the
ProductReviews
table of the database using an SQLINSERT
statement executed throughcursor.execute()
.
Output:
To simulate real-world distribution of data:
GPT-4 and Claude suggestions
Schema suggestions:
1. Users Table
Addition of a new column, UserType, to distinguish between regular customers and sellers/vendors:
UserType ENUM('customer', 'vendor')
2. Orders Table
Creating an additional PaymentMethod column to account for diverse payment options:
PaymentMethod ENUM('credit_card', 'paypal', 'bank_transfer', 'voucher')
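One caveat on the ENUM syntax above: it is MySQL-style. In PostgreSQL, an enumerated type must first be created with CREATE TYPE and then used as a column type; a minimal sketch using the cursor from earlier:
# PostgreSQL has no inline ENUM; define named types, then add the columns
cursor.execute("CREATE TYPE user_type AS ENUM ('customer', 'vendor')")
cursor.execute("CREATE TYPE payment_method AS ENUM ('credit_card', 'paypal', 'bank_transfer', 'voucher')")
cursor.execute("ALTER TABLE Users ADD COLUMN UserType user_type DEFAULT 'customer'")
cursor.execute("ALTER TABLE Orders ADD COLUMN PaymentMethod payment_method")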
3. Returns Table
Adding a Returns table to cater to the significant percentage of users who return products due to dissatisfaction or other reasons:
CREATE TABLE Returns (
ReturnID SERIAL PRIMARY KEY,
OrderDetailID INT,
ReturnDate TIMESTAMP,
Reason VARCHAR(255),
FOREIGN KEY (OrderDetailID) REFERENCES OrderDetails(OrderDetailID)
);
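Note that in the schema we defined earlier, OrderDetails is keyed by OrderID and has no OrderDetailID column, so to adopt this suggestion as-is you would either add such a column or point the foreign key at OrderDetails(OrderID) instead.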
Distribution suggestions:
Generate User Data with Activity Levels
- Determine the weights for different user groups. For example:
- Frequent buyers: 10% of users
- Occasional buyers: 30% of users
- Rare buyers: 60% of users
- Modify the user generation script to assign each user to a group based on these weights (see the sketch after this list):
user_groups = ['frequent', 'occasional', 'rare']
weights = [0.1, 0.3, 0.6]  # Corresponding to the distribution of user groups
- Generate a mix of 'customer' and 'vendor' types, roughly in the ratio of 10:1, to mimic the greater number of customers over vendors in real life.
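A minimal sketch of how such an assignment could plug into the user generation loop, using the weights and the roughly 10:1 customer-to-vendor split suggested above:
import random

user_groups = ['frequent', 'occasional', 'rare']
weights = [0.1, 0.3, 0.6]  # 10% frequent, 30% occasional, 60% rare

# Draw an activity group and a user type for each generated user
activity_group = random.choices(user_groups, weights=weights, k=1)[0]
user_type = random.choices(['customer', 'vendor'], weights=[10, 1], k=1)[0]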
Assign Product Popularity Tiers
When generating products, assign each product to a popularity tier based on predefined weights:
product_popularity = ['high', 'medium', 'low']
popularity_weights = [0.1, 0.3, 0.6]
popularity_level = random.choices(product_popularity, popularity_weights, k=1)[0]
Generating and Inserting Orders
- Order Generation Logic: Leverage the Pareto Principle (80/20 rule), where 20% of customers are responsible for 80% of orders.
- Multiple Products per Order: Ensure that around 60-70% of orders contain more than one product to reflect realistic shopping baskets.
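One hedged way to realize the 80/20 rule in generate_orders(): treat the first 20% of user IDs as heavy buyers and draw 80% of orders from that pool. A sketch:
import random

# Same ID scheme as generate_users(); the first 20% of users are 'heavy' buyers
user_ids = ['U' + f'{i:07d}' for i in range(1, 1000001)]
heavy_users = user_ids[:len(user_ids) // 5]
light_users = user_ids[len(user_ids) // 5:]

def pick_user():
    # 80% of orders go to the heavy 20%, the rest to everyone else
    pool = heavy_users if random.random() < 0.8 else light_users
    return random.choice(pool)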
ProductReviews
- Rating Distribution: Use a skewed distribution favoring higher ratings, as people tend to post reviews either when very satisfied or dissatisfied.
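For instance, a J-shaped distribution (many 5s and 4s, with a secondary bump at 1) could be sampled like this; the weights are illustrative, not measured:
import random

ratings = [5, 4, 3, 2, 1]
weights = [0.45, 0.25, 0.10, 0.05, 0.15]  # illustrative J-shaped review distribution
rating = random.choices(ratings, weights=weights, k=1)[0]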
Returns
- Return Rate and Reasons: Assign a return rate of about 5-10%, consistent with e-commerce averages, and generate a mix of reasons (wrong item, poor quality, etc.); a minimal sketch follows below.
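A minimal sketch of generating returns, assuming the Returns table's foreign key has been pointed at OrderDetails(OrderID) as noted earlier, and reusing the fake and cursor objects from the population scripts:
import random

return_reasons = ["wrong item", "poor quality", "damaged in transit", "changed mind"]

def generate_returns(order_ids, return_rate=0.07):
    # Roughly 5-10% of orders come back; a fuller version would sample
    # ReturnDate after the corresponding OrderDate rather than over the whole window
    for order_id in order_ids:
        if random.random() < return_rate:
            cursor.execute(
                """
                INSERT INTO Returns (OrderID, ReturnDate, Reason)
                VALUES (%s, %s, %s)
                """,
                (order_id, fake.date_time_between(start_date='-5y', end_date='now'),
                 random.choice(return_reasons))
            )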
Summary
In this post, we explored how to craft realistic data for e-commerce databases. We started by laying out the most common types of tables these databases use, then generated data that looks and behaves like what one would see in the real world, using typical patterns and distributions.
This synthetic data offers a lot of value. It's super useful for testing e-commerce systems to make sure they work properly, and it's a great learning tool, giving insights into how these databases function in the real world.
Please subscribe to our newsletter! It contains the latest updates and insights happening in the world of AI!
Stay tuned!