The Subscription Scoop: Analyzing Subscription Box Reviews using GridDB and AI¶

Subscription Boxes are a rapidly growing segment in today's e-commerce landscape. They have become increasingly popular among both businesses and customers. For businesses, they help create recurring revenue models whereas for customers, they offer highly customized and often personalized experiences. Several e-commerce platforms enable niche players and established brands to sell subscription boxes. In this case study, we specifically explore the subscription boxes sold by various independent stores on Amazon. We will store the data in containers, specialized structures within GridDB that allow for high-performance data ingestion. The highly scalable nature of containers makes them suitable for scenarios where data is continuously generated, such as customer reviews of subscription box services.

The data we use for this analysis is sourced from the official website of the University of California San Diego. There are two ways to download the dataset 1. Manually 2. Programmatically using Python. To download it manually, use the URL - https://amazon-reviews-2023.github.io/ and access the section 'Grouped by Category'. Then, select 'Subscription Boxes'. Download the 'Reviews' and the 'Metadata' (highlighted in yellow in the image below). In this article, we use option 2 to download the datasets programmatically using the 'requests' library. The reviews and metadata datasets are then joined to get a combined dataframe which in turn, is loaded to the GridDB cloud (using the Web API). We then use GridDB SQL to query the data and gather insights. We also leverage the power of AI- the popular Hugging Face Library to analyze customer reviews and determine stores where the sentiments are either overwhelmingly positive or negative. We use pre-trained sentiment analysis models from Hugging Face to efficiently assess customer feedback and gain valuable insights into which subscription boxes are highly rated or need improvement.

Dataset Citation¶

Rappaz, J., McAuley, J., & Aberer, K. (2021). Recommendation on live-streaming platforms: Dynamic availability and repeat consumption. Proceedings of the 15th ACM Conference on Recommender Systems (RecSys). https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews

In [11]:

Image(filename='GridDB Dataset Screenshot.png',width=800, height=200)

Out[11]:

No description has been provided for this image

Importing Libraries¶

In [64]:

import os
import pandas as pd
import requests
from IPython.display import Image
import base64 # for utf-8 encoding
import gzip # to handle .gz file extensions
import json
import http.client
import textwrap
import logging # To suppress the logging for requests to avoid seeing the send/reply details
import plotly.express as px
import plotly.graph_objects as go
from transformers import AutoTokenizer, pipeline
from tqdm import tqdm  # Import tqdm for progress bar

# Enable debugging output
http.client.HTTPConnection.debuglevel = 1

Configuring Database Users in GridDB for querying¶

In [80]:

Image(filename='GridDBCloud4.png',width=500, height=200)

Out[80]:

Creating the Database User Credentials¶

GridDB has two types of passwords - Portal User Credentials and Database User Credentials. When you sign up with GridDB cloud, you will be assigned a set of credentials to be able to access the GridDB portal. You can change the provided password. These are the portal user credentials. Once you are in the portal, you can add yourself as a database user and set your credentials after assigning permissions to the database. Note that this also involves whitelisting your IP Address.

In [77]:

Image(filename='GridDBCloud-1-db_user.png',width=500, height=200)

Out[77]:

Once the database user is created, click on the 'Database' icon to access the cluster and corresponding databases within the cluster.

Once this is done, click on the Database User name as shown below.

In [78]:

Image(filename='GridDBCloud2.png',width=500, height=200)

Out[78]:

Give the necessary permissions to the database resource as shown below.

In [79]:

Image(filename='GridDBCloud3.png',width=500, height=200)

Out[79]:

Loading the Datasets as Python DataFrames¶

As mentioned in the Introduction, we use the programmatic way to download the files JSON datasets into a Python Dataframe. For this, we use the 'requests' and the 'gzip' library.

Downloading the reviews¶

In [44]:

# URL of the .jsonl.gz file
url = "https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_2023/raw/review_categories/Subscription_Boxes.jsonl.gz"

# Download the file
response = requests.get(url, stream=True)
if response.status_code == 200:
    # Save the file as a .gz file
    with open("dataset1.jsonl.gz", "wb") as file:
        for chunk in response.iter_content(chunk_size=1024):
            file.write(chunk)
    print("File downloaded and saved as dataset1.jsonl.gz")
else:
    print(f"Failed to download file. Status code: {response.status_code}")

# Read and process the .jsonl.gz file
data = []
with gzip.open("dataset1.jsonl.gz", "rt", encoding="utf-8") as gz_file:
    for line in gz_file:
        # Parse each line as JSON
        data.append(json.loads(line))

send: b'GET /mcauley_group/data/amazon_2023/raw/review_categories/Subscription_Boxes.jsonl.gz HTTP/1.1\r\nHost: datarepo.eng.ucsd.edu\r\nUser-Agent: python-requests/2.32.2\r\nAccept-Encoding: gzip, deflate, br, zstd\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Sat, 18 Jan 2025 18:45:02 GMT
header: Server: Apache/2.4.29 (Ubuntu)
header: Last-Modified: Mon, 26 Feb 2024 20:33:58 GMT
header: ETag: "29c073-6124ed44c058d"
header: Accept-Ranges: bytes
header: Content-Length: 2736243
header: Keep-Alive: timeout=5, max=100
header: Connection: Keep-Alive
header: Content-Type: application/x-gzip
File downloaded and saved as dataset1.jsonl.gz

In [45]:

subscription_boxes_reviews_df = pd.DataFrame(data)

In [46]:

subscription_boxes_reviews_df.columns

Out[46]:

Index(['rating', 'title', 'text', 'images', 'asin', 'parent_asin', 'user_id',
       'timestamp', 'helpful_vote', 'verified_purchase'],
      dtype='object')

Downloading the Metadata¶

In [47]:

# URL of the .jsonl.gz file
url = "https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_2023/raw/meta_categories/meta_Subscription_Boxes.jsonl.gz"

# Download the file
response = requests.get(url, stream=True)
if response.status_code == 200:
    # Save the file as a .gz file
    with open("dataset2.jsonl.gz", "wb") as file:
        for chunk in response.iter_content(chunk_size=1024):
            file.write(chunk)
    print("File downloaded and saved as dataset2.jsonl.gz")
else:
    print(f"Failed to download file. Status code: {response.status_code}")

# Read and process the .jsonl.gz file
metadata = []
with gzip.open("dataset2.jsonl.gz", "rt", encoding="utf-8") as gz_file:
    for line in gz_file:
        # Parse each line as JSON
        metadata.append(json.loads(line))

send: b'GET /mcauley_group/data/amazon_2023/raw/meta_categories/meta_Subscription_Boxes.jsonl.gz HTTP/1.1\r\nHost: datarepo.eng.ucsd.edu\r\nUser-Agent: python-requests/2.32.2\r\nAccept-Encoding: gzip, deflate, br, zstd\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Sat, 18 Jan 2025 18:45:33 GMT
header: Server: Apache/2.4.29 (Ubuntu)
header: Last-Modified: Mon, 26 Feb 2024 19:00:46 GMT
header: ETag: "45b79-6124d86ffdb2f"
header: Accept-Ranges: bytes
header: Content-Length: 285561
header: Keep-Alive: timeout=5, max=100
header: Connection: Keep-Alive
header: Content-Type: application/x-gzip
File downloaded and saved as dataset2.jsonl.gz

In [48]:

subscription_boxes_metadata_df = pd.DataFrame(metadata)
subscription_boxes_combined_df = subscription_boxes_reviews_df.merge(subscription_boxes_metadata_df, on="parent_asin", how="inner")
#theData = "{\"rows\":" + request_body_subscription_boxes_json + "}"
# Removing unnecessary columns
subscription_boxes_combined_df = subscription_boxes_combined_df.drop(columns=['images_x','images_y'])

Encoding Credentials for GridDB¶

While connecting to GridDB cloud, the Request that is constructed should have the username and password encoded to base64. To determine the base64 encoded credentials for your username and password, you can use a tool such as https://www.base64encode.org/ or the base64 library in python. GridDB prefers the 'basic authentication' format - "Basic username:password". Refer to the GridDB resource to learn more.

In [ ]:

username = "XXXX" # Replace with your username
password =   "YYYYY" # Replace with your password
credentials = f"{username}:{password}"
encoded_credentials = base64.b64encode(credentials.encode()).decode()
print(f"Encoded credentials: Basic {encoded_credentials}")

In [5]:

#Construct an object to hold the request headers (ensure that you replace the XXX placeholder with the correct value that matches the credentials for your GridDB instance)
#header_obj = {"Authorization":"Basic dXNlck0wMWt3VzE0Y3Q6Q1BYZ2liNzg4","Content-Type":"application/json; charset=UTF-8","User-Agent":"PostmanRuntime/7.29.0"}

header_obj = {
    "Authorization": f"Basic {encoded_credentials}",  # Add encoded credentials here
    "Content-Type": "application/json",  # Optional; depends on API requirements
    "charset": "UTF-8",
    "User-Agent":"PostmanRuntime/7.29.0"
}

#Construct the base URL based on your GridDB cluster you'd like to connect to (ensure that you replace the placeholders in the URL below with the correct values that correspond to your GridDB instance)
base_url = 'https://[host]:[port]/griddb/v2/[clustername]/dbs/[database_name]/'

In [61]:

data_obj = {
    "container_name": "Amazon_subscription_boxes",
    "container_type": "COLLECTION",
    "rowkey": False,
    "columns": [
        {"name": "rating", "type": "DOUBLE"}, 
        {"name": "title_x", "type": "STRING"},
        {"name": "text", "type": "STRING"},
        {"name": "asin", "type": "STRING"},
        {"name": "parent_asin", "type": "STRING"},
        {"name": "user_id", "type": "STRING"},
        {"name": "timestamp", "type": "LONG"}, 
        {"name": "helpful_vote", "type": "INTEGER"},  
        {"name": "verified_purchase", "type": "BOOL"},
        {"name": "main_category", "type": "STRING"},
        {"name": "title_y", "type": "STRING"},
        {"name": "average_rating", "type": "DOUBLE"},
        {"name": "rating_number", "type": "INTEGER"},
        {"name": "features", "type": "STRING"},  
        {"name": "description", "type": "STRING"},
        {"name": "price", "type": "DOUBLE"},  
        {"name": "videos", "type": "STRING"}, 
        {"name": "store", "type": "STRING"},
        {"name": "categories", "type": "STRING"}, 
        {"name": "details", "type": "STRING"},  
        {"name": "bought_together", "type": "STRING"} 
    ]
}

#Set up the GridDB WebAPI URL
url = base_url + 'containers'

#Invoke the GridDB WebAPI with the headers and the request body
x = requests.post(url, json = data_obj, headers = header_obj)

Loading the Container (Row Registration)¶

Row Registration is the process of inserting rows into the newly created container. It is similar to the INSERT SQL DDL task in standard relational databases. In GridDB, row registration is performed by passing rows as a JSON object while using the WebAPI. Rows are denoted in the form of JSON objects whereas column names and data are denoted as key-value pairs. Row Registration can be done as single rows or in batches.

In [23]:

# Replace empty lists with empty strings or None
subscription_boxes_combined_df = subscription_boxes_combined_df.applymap(
    lambda x: "" if isinstance(x, list) and not x else x
)

# Check for problematic values in DOUBLE columns
double_columns = ["average_rating", "price"]

# Replace invalid or missing values with 0.0
for col in double_columns:
    subscription_boxes_combined_df[col] = pd.to_numeric(subscription_boxes_combined_df[col], errors='coerce').fillna(0.0)

C:\Users\mg_su\AppData\Local\Temp\ipykernel_15956\3612489214.py:2: FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.
  subscription_boxes_combined_df = subscription_boxes_combined_df.applymap(

In [25]:

# Ensure the DataFrame columns match the schema order for GridDB
column_order = [
    "rating", "title_x", "text", "asin", "parent_asin", "user_id", 
    "timestamp", "helpful_vote", "verified_purchase",
    "main_category", "title_y", "average_rating", "rating_number", 
    "features", "description", "price", "videos", "store", 
    "categories", "details", "bought_together"
]

# Reorder columns in the DataFrame
subscription_boxes_combined_df = subscription_boxes_combined_df[column_order]

# Convert any non-string fields to JSON strings if required
json_fields = ["features", "videos", "categories", "details"]

for col in json_fields:
    subscription_boxes_combined_df[col] = subscription_boxes_combined_df[col].apply(
        lambda x: json.dumps(x) if isinstance(x, (list, dict)) else x
    )

In [26]:

subscription_boxes_combined_df['average_rating'] = subscription_boxes_combined_df['average_rating'].astype(float)
subscription_boxes_combined_df['rating'] = subscription_boxes_combined_df['rating'].astype(float)
subscription_boxes_combined_df['price'] = subscription_boxes_combined_df['price'].astype(float)

# Parse the JSON string into a Python list of dictionaries
request_body_subscription_boxes = subscription_boxes_combined_df.to_json(orient='values')

In [ ]:

#Setup the URL to be used to invoke the GridDB WebAPI to register rows in the container created previously
url = base_url + 'containers/Amazon_subscription_boxes/rows'

#Invoke the GridDB WebAPI using the request constructed
x = requests.put(url, data=request_body_subscription_boxes, headers=header_obj)

Analysis using GridDB SQL¶

Top 10 subscription boxes by Average Ratings and Popularity¶

Though not a standard relational database, GridDB supports subqueries and aggregations similar to relational databases. This makes it a hybrid of relational databases and NoSQL databases. Refer to this resource to learn more on the SELECT clause in GridDB. To learn about subqueries, refer to this resource.

In [11]:

sql_query1 = (f"""
                SELECT *
                FROM (
                    SELECT 
                        store,
                        MAX(average_rating) AS Avg_User_Rating, 
                        SUM(rating_number) AS Total_Ratings
                    FROM 
                        Amazon_subscription_boxes
                    GROUP BY 
                        store
                    ORDER BY 
                        Avg_User_Rating DESC,
                        Total_Ratings DESC
                    LIMIT 10
                ) AS Top_Stores
                ORDER BY Total_Ratings DESC;
                    """)

In [15]:

#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'

# Construct the request body
request_body = json.dumps([
    {
        "type": "sql-select",
        "stmt": sql_query1
    }
])

# Disable HTTP debugging
http.client.HTTPConnection.debuglevel = 0

#Invoke the GridDB WebAPI
data_req1 = requests.post(url, data=request_body, headers=header_obj)

#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req1.json()
top_rated_subscription_boxes = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"],myJson[0]["columns"][2]["name"]])
#top_rated_subscription_boxes.style.hide(axis="index")

In [16]:

# Create a bar chart
fig = px.bar(
    top_rated_subscription_boxes,
    x='store',
    y='Total_Ratings',
    title='Top Rated Subscription Boxes by Total Ratings',
    text='Total_Ratings',
    labels={'store': 'Store', 'Total_Ratings': 'Total Ratings'}
)

# Customize layout
fig.update_traces(textposition='outside')
fig.update_layout(
    xaxis_title='Store',
    yaxis_title='Total Ratings',
    xaxis_tickangle=-45,
    title_x=0.5
)

# Display the chart
fig.show()

Insights: Little Dreamers Club and Annie's have been the most rated subscription boxes on Amazon as of 2023.

Stores with Unverified Purchases: Analyzing Trustworthiness¶

GridDB supports standard aggregations like SUM, COUNT, ROUND, AVERAGE, and so on. The list of Aggregation functions supported in GridDB can be found in this resource.

In [36]:

sql_query2 = (f"""
				SELECT 
                    store,
					COUNT(*) AS total_reviews,
					SUM(CASE WHEN verified_purchase THEN 1 ELSE 0 END) AS verified_reviews,
					SUM(CASE WHEN NOT verified_purchase THEN 1 ELSE 0 END) AS unverified_reviews,
					ROUND(SUM(CASE WHEN verified_purchase THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS verified_percentage,
					ROUND(SUM(CASE WHEN NOT verified_purchase THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS unverified_percentage
				FROM 
					Amazon_subscription_boxes
				GROUP BY 
                    store
                HAVING 
                    ROUND(SUM(CASE WHEN NOT verified_purchase THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) > 
					ROUND(SUM(CASE WHEN verified_purchase THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2)
				ORDER BY 
					unverified_reviews DESC
				LIMIT 10;
                """)


#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'

# Construct the request body
request_body = json.dumps([
    {
        "type": "sql-select",
        "stmt": sql_query2
    }
])

# Validate the constructed request body
#print(request_body)

# Disable HTTP debugging
http.client.HTTPConnection.debuglevel = 0

#Invoke the GridDB WebAPI
data_req2 = requests.post(url, data=request_body, headers=header_obj)

#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req2.json()
verified_unverified_reviews = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"],myJson[0]["columns"][2]["name"],myJson[0]["columns"][3]["name"],myJson[0]["columns"][4]["name"],myJson[0]["columns"][5]["name"]])

In [39]:

import plotly.graph_objects as go

# Data from the DataFrame
stores = verified_unverified_reviews["store"]
verified_reviews = verified_unverified_reviews["verified_reviews"]
unverified_reviews = verified_unverified_reviews["unverified_reviews"]

# Create the figure
fig = go.Figure()

# Add verified reviews as the first trace with numbers
fig.add_trace(go.Bar(
    y=stores,
    x=verified_reviews,
    name="Verified Reviews",
    orientation="h",  # Horizontal orientation
    marker=dict(color="green"),
    text=verified_reviews,  # Display numbers
    textposition="auto"  # Automatically adjust position for better visibility
))

# Add unverified reviews as the second trace with numbers
fig.add_trace(go.Bar(
    y=stores,
    x=unverified_reviews,
    name="Unverified Reviews",
    orientation="h",  # Horizontal orientation
    marker=dict(color="orange"),
    text=unverified_reviews,  # Display numbers
    textposition="auto"  # Automatically adjust position for better visibility
))

# Update layout for better appearance
fig.update_layout(
    title="Distribution of Verified and Unverified Reviews by Store",
    barmode="stack",  # Stacked bar mode
    xaxis_title="Number of Reviews",
    yaxis_title="Store",
    legend_title="Review Type",
    template="plotly_white",  # Optional: A clean white background
    uniformtext_minsize=8,  # Ensure text is readable
    uniformtext_mode="show",  # Force display of text, even in tight spaces
    xaxis=dict(tickangle=0)  # Ensure x-axis ticks are easy to read
)

# Show the chart
fig.show()

Insight(s): There are a few stores where unverified reviews are more than verified reviews. While this raises potential credibility concerns, it may also denote organic popularity wherein users who bought the product outside the platform end up posting a review.

Top 10 Verified Reviews that were upvoted¶

In [153]:

sql_query3 = (f"""
				SELECT 
					store,
                    text,
					COUNT(helpful_vote) AS helpful_votes
				FROM 
					Amazon_subscription_boxes
                WHERE verified_purchase = True
				GROUP BY 
                    store
				ORDER BY 
					3 DESC
				LIMIT 10;
                """)


#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'

# Construct the request body
request_body = json.dumps([
    {
        "type": "sql-select",
        "stmt": sql_query3
    }
])

# Validate the constructed request body
#print(request_body)

# Disable HTTP debugging
http.client.HTTPConnection.debuglevel = 0

#Invoke the GridDB WebAPI
data_req3 = requests.post(url, data=request_body, headers=header_obj)

#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req3.json()
helpful_votes = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"],myJson[0]["columns"][2]["name"]])

In [154]:

# Define a function to wrap text
def wrap_text(text, width=50):
    return "<br>".join(textwrap.wrap(text, width=width))

# Apply wrapping to the 'text' column
helpful_votes['text_wrapped'] = helpful_votes['text'].apply(lambda x: wrap_text(x, width=40))

# Select specific columns for display
subset_df = helpful_votes[['store', 'text_wrapped', 'helpful_votes']]

# Use Styler to display with gridlines and wrapped text
styled_df = subset_df.style.set_table_styles([
    {'selector': 'table', 'props': [('border-collapse', 'collapse'), ('width', '100%')]},
    {'selector': 'th, td', 'props': [('border', '1px solid black'), ('padding', '8px')]}
])

# Render the DataFrame with gridlines and no index
styled_df = styled_df.hide(axis='index')

# Display the styled DataFrame
styled_df

Out[154]:

store	text_wrapped	helpful_votes
Allure Beauty Box	Avid fan of Allure Mag, been reading for at Least 10 yrs so readily excitedly joined PROS:::: cheap convenient way to try new Beauty/skin products Fast delivery />CONS::: None of products I received were What was depicted in the graphic. I contacted seller, asking if what was sent monthly was same as shown in graphic and received a puzzling reply ((smh)) scented sample was strong & harsh No lipstick Was hoping for at least tube of lipstick, gloss, etc />Received a notice that price increases with Ictober shipment. MAIN reason I subscribed was because of low cost to try new Skin products. Am going to wait for next shipment before I decide to continue or unsubscribe	1686
Funko	Always good products. Very exciting experience each time.	1576
STEM Club	I got 5 to 6 subscriptions so far and most are good. The only thing is sometimes the boxes are not age appropriate.	1377
KitNipBox	Absolutely useless nonsense and a complete waste of money. Kitty didn't like any of the items	732
Barkbox	Two SMALL stuffed animals and 2 little bags of treats for $35- nope nope nope. I’ll just be shopping a la cart on Amazon from now on	391
TheraBox	Some of the items are nice, but I am not interested in about half of them. This might be a nice idea as a gift but would be even better if you had a little bit of control over what you receive.	337
COCOTIQUE	I loved every thing and could use it all. Things were in it that I would never order Coctique	325
Simple Loose Leaf Tea Company	I was going to get the first box, which is a great value, and then cancel. But I have changed my mind. The flavor choices were Great and it is a fair amount of tea for the price.	258
FACETORY	Ok	249
Bokksu	I DID ENJOY THE ADVENTURE OF THE BOX AND BEING INTRODUCED TO NEW THINGS. BUT THE SNACKS THEMSELVES WERE JUST OK. THERE WERE TOO MAY THINGS THAT WERE SIMILAR FOR THE MONTH OF JULY, LIKE TOO MANY TYPES OF WHORCHEWTISHIRE RICE CRACKERS. A COUPLE OF THOSE WOULD HAVE SUFFICED. THE HAND MADE FIZZING CANDIES WERE INTERESTING AND THE STRAWBERRY MOUSSE LAYERED WAFFLE COOKIES IN THE SHAPE OF A FISH WERE EXCELLENT. I WANTED MORE DESCRIPTION ON THE INFORMATIONAL POSTCARD, LIKE WHY WERE THESE ITEMS PICKED?? I DID NOT RECOGNIZE A THEME.	226

Total Product Variations¶

In the below query, we look at the number of variations (child ASINs) of a parent ASIN. When a product has variations, it means that the product is available in different versions, options, or configurations under the same overarching parent. This concept is commonly used in e-commerce platforms, including Amazon, to organize and display related products efficiently.

In [156]:

sql_query4 = (f"""
                SELECT 
                  parent_asin, 
                  store, 
                  COUNT(DISTINCT asin) AS total_variations
                FROM Amazon_subscription_boxes
                GROUP BY 
                    parent_asin,store
                ORDER BY 
                    total_variations DESC
                    LIMIT 10;
                """)

#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'

# Construct the request body
request_body = json.dumps([
    {
        "type": "sql-select",
        "stmt": sql_query4
    }
])

# Validate the constructed request body
#print(request_body)

# Disable debug output after running the query
http.client.HTTPConnection.debuglevel = 0

#Invoke the GridDB WebAPI
data_req4 = requests.post(url, data=request_body, headers=header_obj)

#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req4.json()
Product_Variations = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"],myJson[0]["columns"][2]["name"]])
Product_Variations.style.hide(axis="index")

Out[156]:

parent_asin	store	total_variations
B084ZSS5H1	Prime Accessories HQ	24
B07QWTTPK7	STEM Discovery Boxes	20
B0992LL6TG	JARMABOX	18
B08SMCV47C	ERIKA'S TEA ROOM MEMORIES MADE WITH EVERY CUP!	17
B07L9GL97H	Geek Fuel EXP	16
B08W23KDFR	KIDS READ DAILY	16
B08KWLQD46	Annie's	15
B09R2MXZHC	Funko	14
B0BCGQ8SJR	The Dally Grind	13
B0BM57HJXV	Funko	12

Insight(s): We see that Prime Accessories HQ and STEM Discovery Boxes have the highest number of variations (Child ASINS). This indicates that these products have a vast selection of products with features that can be customized or personalized.

Top 10 Subscription boxes with the Highest Rating¶

In [48]:

sql_query5 = (f"""
                SELECT 
                  store, 
                  COUNT(rating) AS total_ratings
                FROM Amazon_subscription_boxes
                WHERE rating >= 4
                GROUP BY 
                    store
                ORDER BY 
                    total_ratings DESC
                    LIMIT 10;
                """)

#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'

# Construct the request body
request_body = json.dumps([
    {
        "type": "sql-select",
        "stmt": sql_query5
    }
])

# Validate the constructed request body
#print(request_body)

# Disable debug output after running the query
http.client.HTTPConnection.debuglevel = 0

#Invoke the GridDB WebAPI
data_req5 = requests.post(url, data=request_body, headers=header_obj)

In [49]:

#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req5.json()
top_rated_products = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"]])
#top_rated_products.style.hide(axis="index")
#fig = ff.create_table(top_rated_products)
fig = px.bar(top_rated_products, x="store", y="total_ratings", orientation='v',text="total_ratings",title='Top 10 Subscription boxes with the Highest Rating')
fig.update_traces(textposition='inside')
fig.show()

Analyzing Sentiments based on Customer Reviews¶

In [52]:

sql_query6 = (f"""
                SELECT 
                    store,
                    text
                FROM Amazon_subscription_boxes
                WHERE (rating = 5 OR rating =1)
                AND verified_purchase = True;
                """)

#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'

# Construct the request body
request_body = json.dumps([
    {
        "type": "sql-select",
        "stmt": sql_query6
    }
])

# Validate the constructed request body
#print(request_body)

# Disable debug output after running the query
http.client.HTTPConnection.debuglevel = 0

#Invoke the GridDB WebAPI
data_req6 = requests.post(url, data=request_body, headers=header_obj)

In [53]:

#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req6.json()
product_features = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"]])
#product_features
def truncate_review(review, max_length=256):
    return review[:max_length]

product_features['text'] = product_features['text'].apply(truncate_review)

In [56]:

# Load the tokenizer and sentiment-analysis pipeline (with GPU support)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
sentiment_analyzer = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device=-1)  # Use CPU

# Define a function to truncate text to 512 tokens and tokenize in batches
def batch_tokenize(text_list, tokenizer, batch_size=64):
    tokenized_batches = []
    for i in range(0, len(text_list), batch_size):
        batch = text_list[i:i + batch_size]
        tokenized_batch = tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="pt")
        tokenized_batches.append(tokenized_batch)
    return tokenized_batches

# Truncate and tokenize the 'features' column in batches
tokenized_batches = batch_tokenize(product_features['text'].tolist(), tokenizer)

# Perform sentiment analysis in batches (faster with GPU) with progress bar
results = []
for i, batch in tqdm(enumerate(tokenized_batches), total=len(tokenized_batches), desc="Processing batches"):
    # Perform inference using the model
    outputs = sentiment_analyzer.model(**batch)
    predicted_classes = outputs.logits.argmax(dim=1)  # Get predicted class indices (0 or 1 for the batch)
    sentiments = ['POSITIVE' if cls == 1 else 'NEGATIVE' for cls in predicted_classes]  # Map indices to sentiments
    results.extend(sentiments)  # Append all sentiments for the batch

# Map results back to the DataFrame
product_features['sentiment'] = results

# Summarize sentiments by store
sentiment_summary_df = product_features.groupby('store')['sentiment'].value_counts().unstack(fill_value=0)
sentiment_summary_df = sentiment_summary_df.rename(columns={'POSITIVE': 'positive_reviews', 'NEGATIVE': 'negative_reviews'}).reset_index()

# Show the final sentiment summary
#print(sentiment_summary_df)

Processing batches: 100%|████████████████████████████████████████████████████████████| 154/154 [07:21<00:00,  2.87s/it]

In [62]:

# Filter stores with more negative reviews than positive reviews
negative_dominant = sentiment_summary_df[sentiment_summary_df["negative_reviews"] > sentiment_summary_df["positive_reviews"]]
negative_dominant = negative_dominant.fillna(0)  # Ensure no missing values
top_negative_dominant = negative_dominant.sort_values(by="negative_reviews", ascending=False).head(20)

# Filter stores with more positive reviews than negative reviews
positive_dominant = sentiment_summary_df[sentiment_summary_df["positive_reviews"] > sentiment_summary_df["negative_reviews"]]
positive_dominant = positive_dominant.fillna(0)  # Ensure no missing values
top_positive_dominant = positive_dominant.sort_values(by="positive_reviews", ascending=False).head(20)

# Create stacked bar chart for stores with more negative reviews
fig_negative_dominant = go.Figure()

fig_negative_dominant.add_trace(go.Bar(
    x=top_negative_dominant['store'],
    y=top_negative_dominant['negative_reviews'],
    name='Negative Reviews',
    marker_color='crimson'  # Red for negative sentiment
))

fig_negative_dominant.add_trace(go.Bar(
    x=top_negative_dominant['store'],
    y=top_negative_dominant['positive_reviews'],
    name='Positive Reviews',
    marker_color='green'  # Green for positive sentiment
))

fig_negative_dominant.update_layout(
    barmode='stack',
    title='Top 20 Stores with More Negative Reviews than Positive Reviews',
    xaxis_title='Store',
    yaxis_title='Number of Reviews',
    title_font_size=18,
    xaxis_tickangle=-45,
    legend_title_text='Review Type'
)

# Create stacked bar chart for stores with more positive reviews
fig_positive_dominant = go.Figure()

fig_positive_dominant.add_trace(go.Bar(
    x=top_positive_dominant['store'],
    y=top_positive_dominant['positive_reviews'],
    name='Positive Reviews',
    marker_color='green'  # Green for positive sentiment
))

fig_positive_dominant.add_trace(go.Bar(
    x=top_positive_dominant['store'],
    y=top_positive_dominant['negative_reviews'],
    name='Negative Reviews',
    marker_color='crimson'  # Red for negative sentiment
))

fig_positive_dominant.update_layout(
    barmode='stack',
    title='Top 20 Stores with More Positive Reviews than Negative Reviews',
    xaxis_title='Store',
    yaxis_title='Number of Reviews',
    title_font_size=18,
    xaxis_tickangle=-45,
    legend_title_text='Review Type'
)

# Show both charts
fig_negative_dominant.show()

In [63]:

fig_positive_dominant.show()

Concluding Remarks¶

In this case study, we analyzed subscription box reviews and metadata using GridDB's powerful query engine. The ability to write SQL-like queries combined with faster retrieval times than traditional databases makes GridDB the perfect choice for use cases involving real-time processing. With the advent of AI, tools like Hugging Face can help analyze large volumes of data stored in GridDB to derive meaningful insights. Adoption of GridDB and AI technologies can greatly enhance value creation in businesses and unlock opportunities for innovation and growth in today's digital age.

In [ ]:

Also available on Github Last updated: 19/01/2025 06:10:57

If you have any questions about the blog, please create a Stack Overflow post here https://stackoverflow.com/questions/ask?tags=griddb .
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.

GridDB と AI を活用したサブスクリプションボックスのレビュー分析

The Subscription Scoop: Analyzing Subscription Box Reviews using GridDB and AI¶

Dataset Citation¶

Importing Libraries¶

Configuring Database Users in GridDB for querying¶

Creating the Database User Credentials¶

Loading the Datasets as Python DataFrames¶

Downloading the reviews¶

Downloading the Metadata¶

Encoding Credentials for GridDB¶

Loading the Container (Row Registration)¶

Analysis using GridDB SQL¶

Top 10 subscription boxes by Average Ratings and Popularity¶

Stores with Unverified Purchases: Analyzing Trustworthiness¶

Top 10 Verified Reviews that were upvoted¶

Total Product Variations¶

Top 10 Subscription boxes with the Highest Rating¶

Analyzing Sentiments based on Customer Reviews¶

Concluding Remarks¶

コメントを残すコメントをキャンセル

The Subscription Scoop: Analyzing Subscription Box Reviews using GridDB and AI¶

Dataset Citation¶

Importing Libraries¶

Configuring Database Users in GridDB for querying¶

Creating the Database User Credentials¶

Loading the Datasets as Python DataFrames¶

Downloading the reviews¶

Downloading the Metadata¶

Encoding Credentials for GridDB¶

Loading the Container (Row Registration)¶

Analysis using GridDB SQL¶

Top 10 subscription boxes by Average Ratings and Popularity¶

Stores with Unverified Purchases: Analyzing Trustworthiness¶

Top 10 Verified Reviews that were upvoted¶

Total Product Variations¶

Top 10 Subscription boxes with the Highest Rating¶

Analyzing Sentiments based on Customer Reviews¶

Concluding Remarks¶

コメントを残す コメントをキャンセル

コメントを残すコメントをキャンセル