import os
os.chdir('PLEASE PUT YOUR WORKING DIRECTORY HERE')
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.offline as pyo
pyo.init_notebook_mode()
import altair as alt
import folium
import seaborn as sns
import requests
import http
http.client.HTTPConnection.debuglevel = 1 #enable verbose HTTP request/response logging for the requests made below
import json
import re
from IPython.display import Image
$ git clone https://github.com/griddbnet/Blogs.git --branch electric_vehicles
About the Analysis and the Dataset - A synopsis¶
As per official sources, Washington State's governor Jay Inslee recently set a goal of phasing out most gasoline-powered car sales by 2035 and having 100% electric-powered vehicles by 2040. Here is the executive order from the official website of the Washington State government on Battery Electric Vehicles (BEVs). Officials of the State of Washington believe that continuous vehicle emissions and pollutants from gasoline-powered vehicles are driving the sudden changes in weather and the increase in catastrophes. Another state following Washington's route is New Jersey, which has banned the sale of gasoline vehicles after 2035, as per PCMag.
Let's analyze the electric vehicles registered under Washington State's Department of Licensing. To do this, we use the open dataset 'Electric Vehicle Population Data' from the data.gov website. Here is a direct URL to the dataset. From the 'Downloads & Resources' section, click on the 'Download' button next to 'Comma Separated Values File', as shown below -
Below is a snapshot of the file that needs to be downloaded -
Image(filename='Electric_Vehicles_Dataset_Snapshot.jpg',width=500, height=200)
The resulting file has the file name 'Electric_Vehicle_Population_Data.csv'. Save this to your working directory.
Let's load this data from the csv file into a dataframe for use later.
Electric_Vehicle_Data = pd.read_csv('Electric_Vehicle_Population_Data.csv',sep=',')
# Removing columns #DOL Vehicle ID, 2020 Census Tract
Electric_Vehicle_Data = Electric_Vehicle_Data.drop(['DOL Vehicle ID', '2020 Census Tract'], axis=1)
Since the dataset is large, we split it into subsets using np.array_split(). We will use these subsets to load data in batches into the GridDB container.
#creating 12 subsets of the data
Electric_Vehicle_Subsets = np.array_split(Electric_Vehicle_Data, 12)
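To see how np.array_split() behaves when the row count does not divide evenly, here is a minimal sketch on a toy dataframe (the toy data is illustrative, not from the EV dataset):

```python
import numpy as np
import pandas as pd

# Toy frame of 10 rows split into 3 subsets; sizes differ by at most one row
df = pd.DataFrame({"id": range(10)})
subsets = np.array_split(df, 3)
sizes = [len(s) for s in subsets]
print(sizes)  # [4, 3, 3]
```

Unlike np.split(), array_split() does not require the number of rows to be an exact multiple of the number of subsets, which is why it is convenient for batching an arbitrary-sized dataset.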
Creating the Request and Containers¶
Authorization¶
While connecting to GridDB Cloud, the request that is constructed should include the username and password encoded in base64. To determine the base64-encoded credentials for your username and password, you can use a tool such as https://www.base64encode.org/. Refer to the GridDB resource listed here to learn more about the different entities involved in creating a request - https://griddb.net/en/blog/using-griddb-cloud-version-1-2/.
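If you would rather not paste credentials into a website, the same base64 value can be computed locally with Python's standard library. The username and password below are hypothetical placeholders:

```python
import base64

# Hypothetical credentials -- replace with your own GridDB username/password
username, password = "my_user", "my_password"

# HTTP Basic auth encodes "username:password" in base64
token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
auth_header = f"Basic {token}"
print(auth_header)
```

The resulting string is what goes into the "Authorization" header in place of "Basic XXX" below.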
Creating the Request¶
#Construct an object to hold the request headers (ensure that you replace the XXX placeholder with the correct value that matches the credentials for your GridDB instance)
header_obj = {"Authorization":"Basic XXX","Content-Type":"application/json; charset=UTF-8","User-Agent":"PostmanRuntime/7.29.0"}
#Construct the base URL based on your GridDB cluster you'd like to connect to (ensure that you replace the placeholders in the URL below with the correct values that correspond to your GridDB instance)
base_url = 'https://[host]:[port]/griddb/v2/[clustername]/dbs/[database_name]/'
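Assembling the URL from its parts makes the placeholders explicit. All the connection values below are hypothetical examples, not a real instance:

```python
# Hypothetical connection details -- substitute the values for your own GridDB instance
host, port = "cloud5114.griddb.com", 443
cluster, database = "gs_clustermfc", "public"

# Same shape as the template URL above
base_url = f"https://{host}:{port}/griddb/v2/{cluster}/dbs/{database}/"
print(base_url)
```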
Creating the container Electric_Vehicles¶
#Construct an object to hold the request body (i.e., the container that needs to be created)
data_obj = {
"container_name": "Electric_Vehicles",
"container_type": "COLLECTION",
"rowkey": False,
"columns": [
{
"name": "VIN",
"type": "STRING"
},
{
"name": "County",
"type": "STRING"
},
{
"name": "City",
"type": "STRING"
},
{
"name": "State",
"type": "STRING"
},
{
"name": "PostalCode",
"type": "STRING"
},
{
"name": "Model_Year",
"type": "INTEGER"
},
{
"name": "Make",
"type": "STRING"
},
{
"name": "Model",
"type": "STRING"
},
{
"name": "Electric_Vehicle_Type",
"type": "STRING"
},
{
"name": "CAFV",
"type": "STRING"
},
{
"name": "Electric_Range",
"type": "INTEGER"
},
{
"name": "Base_MSRP",
"type": "FLOAT"
},
{
"name": "Legislative_District",
"type": "STRING"
},
{
"name": "Vehicle_Location",
"type": "STRING"
},
{
"name": "Electric_Utility",
"type": "STRING"
}
]
}
#Set up the GridDB WebAPI URL
url = base_url + 'containers'
#Invoke the GridDB WebAPI with the headers and the request body
x = requests.post(url, json = data_obj, headers = header_obj)
Loading the Container with Data (Row-Registration)¶
The process of loading rows into a container is called 'Row Registration' in GridDB terminology. Here is a resource to learn more about the GridDB WebAPI and the process of 'Row Registration' - https://griddb.net/en/blog/griddb-webapi/
To load data into the container, we create a JSON structure of the data in the dataframe (Refer to the section 'About the Analysis and the Dataset - A synopsis').
GridDB supports batch loading. If you are using the SDK, here is a useful resource - https://griddb.net/en/blog/griddb-optimization-with-multi-put-and-query/. If you use the WebAPI, then you can just use a simple for loop to load your subsets of data into your GridDB container.
#Setup the URL to be used to invoke the GridDB WebAPI to register rows in the container created previously
url = base_url + 'containers/Electric_Vehicles/rows'
#Invoke the GridDB WebAPI using the request constructed
for subset in Electric_Vehicle_Subsets:
    #Convert the data in the dataframe to the JSON format
    Electric_Vehicle_Subset_json = subset.to_json(orient='values')
    request_body_subset = Electric_Vehicle_Subset_json
    x = requests.put(url, data=request_body_subset, headers=header_obj)
    print(x.text)
    if x.status_code > 299: #2xx codes (200, 201, etc.) indicate success; anything above is an error
        print(x.status_code) #an error has occurred, so stop the process here
        break
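The orient='values' conversion used in the loop produces exactly the array-of-rows shape that the row-registration endpoint expects. A toy example (with a made-up VIN) shows the output format:

```python
import pandas as pd

# Illustrative two-column frame with a fabricated VIN value
toy = pd.DataFrame({"VIN": ["5YJ3E1EA"], "County": ["King"]})
payload = toy.to_json(orient="values")
print(payload)  # [["5YJ3E1EA","King"]]
```

Each inner array is one row, with values in column order and no column names - which is why the container's column definitions must match the dataframe's column order.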
Data Analysis and Visualization¶
What are the types of electric vehicles available today?¶
sql_query1 = (f"""SELECT DISTINCT Electric_Vehicle_Type, COUNT(*) AS cnt_vehicles FROM Electric_Vehicles GROUP BY 1""")
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
#Construct the request body
request_body = '[{"type":"sql-select", "stmt":"'+sql_query1+'"}]'
#Validate the constructed request body
request_body
'[{"type":"sql-select", "stmt":"SELECT DISTINCT Electric_Vehicle_Type, COUNT(*) AS cnt_vehicles FROM Electric_Vehicles GROUP BY 1"}]'
#Invoke the GridDB WebAPI
data_req1 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req1.json()
Electric_Vehicle_Types = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"],myJson[0]["columns"][1]["name"]])
Electric_Vehicle_Types.rename(columns = {'Electric_Vehicle_Type':'Electric Vehicle Type'}, inplace = True)
fig = ff.create_table(Electric_Vehicle_Types, height_constant=20)
fig.show()
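This response-to-dataframe pattern repeats for every query in this analysis, so it is worth factoring into a small helper. This is a sketch that assumes the response shape shown above: a list whose first element carries 'columns' and 'results':

```python
import pandas as pd

def sql_response_to_df(response_json):
    """Build a DataFrame from a GridDB WebAPI sql-select response.

    Assumes the shape seen in the responses above: a list whose first
    element has 'columns' (a list of {'name', 'type'} objects) and
    'results' (a list of rows).
    """
    first = response_json[0]
    col_names = [c["name"] for c in first["columns"]]
    return pd.DataFrame(first["results"], columns=col_names)

# Mocked response with the same shape as the WebAPI output above
mock = [{
    "columns": [{"name": "Electric_Vehicle_Type", "type": "STRING"},
                {"name": "cnt_vehicles", "type": "LONG"}],
    "results": [["Battery Electric Vehicle (BEV)", 3],
                ["Plug-in Hybrid Electric Vehicle (PHEV)", 2]],
}]
df = sql_response_to_df(mock)
print(df.shape)  # (2, 2)
```

With this helper, each of the per-query dataframe constructions below reduces to a single call, regardless of how many columns the query returns.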
Insight(s): From the above, we see that there are 2 types of electric vehicles - Plug-in Hybrid Electric Vehicles (PHEVs) and Battery Electric Vehicles (BEVs). Of these, Battery Electric Vehicles are currently the more numerous.
What are some manufacturers making PHEVs and/or BEVs?¶
sql_query2 = (f"""SELECT DISTINCT Make, Electric_Vehicle_Type, COUNT(*) AS cnt_vehicles FROM Electric_Vehicles GROUP BY 1,2""")
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
#Construct the request body
request_body = '[{"type":"sql-select", "stmt":"'+sql_query2+'"}]'
#Validate the constructed request body
request_body
'[{"type":"sql-select", "stmt":"SELECT DISTINCT Make, Electric_Vehicle_Type, COUNT(*) AS cnt_vehicles FROM Electric_Vehicles GROUP BY 1,2"}]'
#Invoke the GridDB WebAPI
data_req2 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req2.json()
Electric_Vehicles_and_Makes = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"],myJson[0]["columns"][1]["name"],myJson[0]["columns"][2]["name"]])
fig = px.bar(Electric_Vehicles_and_Makes, y="Electric_Vehicle_Type", x="cnt_vehicles",color='Make', orientation='h')
fig.update_layout(
title={
'text': "Manufacturers by Electric Vehicle Count (Hover over to access tooltips)",
'y':1.0,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'})
fig.show()
Insight(s):
- From the above visual, we see that Tesla makes the highest number of BEVs (around 55,000 so far).
- Though it is well known that Tesla is a leading player in electric vehicles, it's interesting to see that there is a plethora of manufacturers in the electric vehicle market, each producing anywhere between 400 and 5,000 vehicles.
- The next biggest player is Nissan, which has manufactured around 12,887 BEVs so far.
- While Tesla and Nissan seem to focus solely on BEVs, manufacturers like BMW, Chevrolet and Ford manufacture both BEVs and PHEVs.
How do different manufacturers fare in terms of CAFV eligibility?¶
sql_query3 = (f"""SELECT DISTINCT Make, CAFV, COUNT(*) AS cnt_vehicles FROM Electric_Vehicles GROUP BY 1,2""")
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
#Construct the request body
request_body = '[{"type":"sql-select", "stmt":"'+sql_query3+'"}]'
#Validate the constructed request body
request_body
'[{"type":"sql-select", "stmt":"SELECT DISTINCT Make, CAFV, COUNT(*) AS cnt_vehicles FROM Electric_Vehicles GROUP BY 1,2"}]'
#Invoke the GridDB WebAPI
data_req3 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req3.json()
Makes_and_CAFV = pd.DataFrame(myJson[0]["results"],columns=[myJson[0]["columns"][0]["name"],myJson[0]["columns"][1]["name"],myJson[0]["columns"][2]["name"]])
fig = px.bar(Makes_and_CAFV, x="Make", y="cnt_vehicles", color="CAFV", title="How do Manufacturers fare in terms of CAFV eligibility?",width=1000, height=1000)
fig.show()
Insight(s):
- Among the roughly 55,000 electric vehicles manufactured by Tesla, around half (about 25,000) are CAFV eligible, whereas the status of around 29,000 vehicles is unknown because their battery range has not yet been researched.
- It is interesting to note that one fifth of the vehicles manufactured by Nissan are CAFV eligible.
- Ford, BMW, Toyota and Volvo have some vehicles that are not CAFV eligible due to low battery range.
Who are the most and least expensive manufacturers in terms of the Avg Base MSRP?¶
GridDB supports standard null-handling using IFNULL, ISNULL and COALESCE. Refer to this URL to know more - https://docs.griddb.net/sqlreference/sql-commands-supported/#data-definition-language-ddl.
sql_query4 = (f"""SELECT Make, COALESCE(Model,'Unknown') AS Model, AVG(Base_MSRP) AS AVG_BASE_MSRP FROM Electric_Vehicles GROUP BY 1,2 HAVING AVG(Base_MSRP) > 10000 ORDER BY 3 DESC""")
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
#Construct the request body
request_body = '[{"type":"sql-select", "stmt":"'+sql_query4+'"}]'
#Validate the constructed request body
request_body
'[{"type":"sql-select", "stmt":"SELECT Make, COALESCE(Model,\'Unknown\') AS Model, AVG(Base_MSRP) AS AVG_BASE_MSRP FROM Electric_Vehicles GROUP BY 1,2 HAVING AVG(Base_MSRP) > 10000 ORDER BY 3 DESC"}]'
#Invoke the GridDB WebAPI
data_req4 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req4.json()
Models_MSRP = pd.DataFrame(myJson[0]["results"],columns=[myJson[0]["columns"][0]["name"],myJson[0]["columns"][1]["name"],myJson[0]["columns"][2]["name"]])
fig = px.pie(Models_MSRP, values='AVG_BASE_MSRP', names='Make', color_discrete_sequence=px.colors.sequential.RdBu,title='Manufacturers by Average Base MSRP')
fig.show()
Insight(s):
- Porsche appears to have the highest average Base MSRP, accounting for 60.5% of the pie chart's total.
- It is followed by BMW and Tesla.
- Some less expensive electric vehicle options appear to be Mini, Volvo, Subaru and Cadillac.
What is the geographical distribution of these electric vehicles within Washington?¶
GridDB supports standard cast operations. Refer to this resource to know more - https://docs.griddb.net/sqlreference/sql-commands-supported/#convert-to-string-type
sql_query5 = (f"""SELECT city, state, county, CAST(COALESCE(PostalCode,'0')AS FLOAT) AS Postal_Code, Vehicle_Location, count(*) as cnt_vehicles FROM Electric_Vehicles GROUP BY 1,2,3,4,5""")
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
#Construct the request body
request_body = '[{"type":"sql-select", "stmt":"'+sql_query5+'"}]'
#Validate the constructed request body
request_body
'[{"type":"sql-select", "stmt":"SELECT city, state, county, CAST(COALESCE(PostalCode,\'0\')AS FLOAT) AS Postal_Code, Vehicle_Location, count(*) as cnt_vehicles FROM Electric_Vehicles GROUP BY 1,2,3,4,5"}]'
#Invoke the GridDB WebAPI
data_req5 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req5.json()
Geo_dist = pd.DataFrame(myJson[0]["results"],columns=[myJson[0]["columns"][0]["name"],myJson[0]["columns"][1]["name"],myJson[0]["columns"][2]["name"],myJson[0]["columns"][3]["name"],myJson[0]["columns"][4]["name"],myJson[0]["columns"][5]["name"]])
Geo_dist['Postal_Code'] = Geo_dist['Postal_Code'].astype(int)
Geo_dist.columns
Geo_dist = Geo_dist[Geo_dist['State']== 'WA']
Geo_dist['Postal_Code'] = Geo_dist['Postal_Code'].astype(str) #Convert back to string, since the GeoJSON keys are strings and matching is done on string values
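One caveat with round-tripping postal codes through FLOAT and int: any leading zeros are lost. Washington ZIP codes all start with 9, so it is harmless here, but a defensive version of the conversion (a sketch, using illustrative values) pads back to five digits:

```python
import pandas as pd

# Postal codes that arrived as floats; 00501 is a real ZIP elsewhere in the US
codes = pd.Series([98115.0, 98034.0, 501.0])

# int strips the decimal, str converts, zfill restores any leading zeros
as_str = codes.astype(int).astype(str).str.zfill(5)
print(as_str.tolist())  # ['98115', '98034', '00501']
```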
A short note on the GeoJSON file required for visualizing the Geographical Distribution¶
For plotting the Choropleth map, we use the Folium Library in Python. To be able to plot the locations of the vehicles, the lat long values of their corresponding Postal Codes need to be gathered. These can be obtained from the GeoJSON file available in the https://raw.githubusercontent.com/OpenDataDE/State-zip-code-GeoJSON/master/wa_washington_zip_codes_geo.min.json
Note that you are not required to manually download the file from this URL - the code snippet below downloads the GeoJSON file and saves it as 'wa_washington_zip_codes_geo.min.json' in your local working directory. This file is then used for visualizing the choropleth map.
#Download the Washington State GeoJSON file from the internet
url='https://raw.githubusercontent.com/OpenDataDE/State-zip-code-GeoJSON/master/wa_washington_zip_codes_geo.min.json'
filename = url.split('/')[-1] #extract the file name from the url
file_path = os.path.join(os.getcwd(), filename) #construct the download file path
r = requests.get(url, stream=True) #make the request to fetch the file
if r.ok:
    with open(file_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024 * 8):
            if chunk:
                f.write(chunk)
                f.flush()
                os.fsync(f.fileno())
else:
    print("Download failed: {}".format(r.text))
Below is a snippet of the GeoJSON file -
Note that we use the 'ZCTA5CE10' property of each GeoJSON feature while constructing the choropleth map. This value is passed to the 'key_on' parameter as 'feature.properties.ZCTA5CE10' in the folium.Choropleth function, and each feature's geometry supplies the boundary coordinates to draw.
'ZCTA5CE10' is the postal code in the JSON file and is used as a lookup key against the dataframe Geo_dist's Postal_Code column. Because 'ZCTA5CE10' is a string in the JSON file, the Postal_Code column in the dataframe was converted to the string data type so that the values can be matched easily.
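Before wiring up 'key_on', it helps to confirm the property name on an actual feature. The sketch below uses a minimal inline feature with the same property layout as the downloaded Washington ZIP file, rather than reading the full file:

```python
import json

# Minimal GeoJSON with the same property layout as the Washington ZIP file
sample = json.loads("""
{"type": "FeatureCollection",
 "features": [{"type": "Feature",
               "properties": {"ZCTA5CE10": "98115"},
               "geometry": {"type": "Polygon", "coordinates": []}}]}
""")
first = sample["features"][0]

# The value folium will match against Geo_dist['Postal_Code']
print(first["properties"]["ZCTA5CE10"])  # 98115
```

In the real file, the same lookup on features[0] reveals the available property names, which is a quick way to catch a mismatched 'key_on' path before folium silently renders an empty map.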
m = folium.Map(location=(47.751076, -120.740135), zoom_start=7, tiles="cartodb positron")
zip_codes_url = (
"wa_washington_zip_codes_geo.min.json" # Refer to 'A short note on the GeoJSON file required for visualizing the Geographical Distribution'
)
cp = folium.Choropleth(
geo_data=zip_codes_url,
fill_opacity=0.75,
line_opacity =0.2,
data=Geo_dist,
columns=['Postal_Code', 'cnt_vehicles'],
fill_color = 'YlGn',
key_on="feature.properties.ZCTA5CE10", #the ZIP code property in the GeoJSON, joined against the dataframe's Postal_Code column
legend_name='Total Number of Electric Vehicles in Washington state',
nan_fill_color='white'
).add_to(m)
folium.GeoJsonTooltip(['ZCTA5CE10']).add_to(cp.geojson) #add the Zip Code as a tooltip
m
Insight(s):
- The above map is interactive. Hover over the locations to access the Zip Codes.
- The Zip Codes 98115, 98034, 98040 and 98006 are some of the postal codes in the greater Seattle area with the highest numbers of electric vehicles. These correspond to northeast Seattle (Bryant), Kirkland, Mercer Island and Bellevue.
- It is pretty clear that the Seattle and Everett areas have the most electric vehicle buyers.
- The rest of Washington still has potential, as many areas only seem to have between 1 and around 330 electric vehicles so far.
Concluding Remarks¶
Thus, we see that while Washington has ample scope to become a state with 100% electric vehicles, thereby reducing carbon fuel emissions, there is also a lot of work to be done. Charging stations and other amenities are a key factor in ensuring early adoption of electric vehicles. Though there seems to be enough demand for electric vehicles in Washington, supply often fails to keep up, as manufacturers struggle to produce enough vehicles. Will this situation improve, and will Washington achieve its dream of being an 'All Electric Vehicle' state? Let's wait and see.
The most important tool that aided in the analysis was not just Python and its amazing libraries, but the database GridDB. Without a robust database, handling such a large dataset wouldn't have been possible. GridDB's batch loading and the speed with which massive numbers of records are written to the database engine were important catalysts for this analysis. Query times were also fast, so results could be fetched very quickly.
If you have any questions about the blog, please create a Stack Overflow post here https://stackoverflow.com/questions/ask?tags=griddb .
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.
