Part 3: Trustbit Logistics Hackathon - Add speed model to logistic simulation
So far in the series we have built a trivial logistics simulation runtime. At this point it is only capable of finding the fastest route between two locations. This is implemented as a form of the A* algorithm that uses predefined travel times.
Let’s extend the implementation and demonstrate how we can “plug” different models into the simulation runtime.
Within this article we’ll mine historical data to build a naive speed model. This model will predict the average speed for a road segment, given the time of the day. The logic will live in the train function.
The trained model will then be passed to a modified route function that is very similar to the logic from the previous article.
When wired together, we should be able to do something like this:
logisim3 --origin Steamdrift --dest Leverstorm --test_ratio 0.2
# prints:
# Mean squared error is 46.5593
# 0.00h DEPART Steamdrift
# 15.02h ARRIVE Cogburg
# 28.58h ARRIVE Copperhold
# 36.28h ARRIVE Leverstorm
Source code for this article is in the Trustbit/logisim repository, within src/logisim3.
Mine the data
We’ll use synthetic data from Logistic Kata 2.3, which is stored in history.csv. The data is a historical event log of various transports driving between the locations: each record tracks one trip between two connected locations (a segment) and contains the arrival Time, the Orig and Dest locations and the average Speed.
We know that there are traffic jams during the day, so we want to build a model that predicts travel time between two adjacent locations given the departure time of the day.
For that, we need to:

1. Compute the departure time for each record in history.csv.
2. Group all records for each (Origin -> Destination) pair.
3. For each group, train a model that predicts the speed, using departure time as a feature.
This time, instead of writing the basic logic from scratch, we are going to use popular Python libraries:
- numpy - mathematical functions
- pandas - data analysis library
- fire - a Google library that generates a CLI interface from any function
Our imports will follow standard conventions and look like this:
from pathlib import Path

import pandas as pd
import numpy as np
Data is bundled with the source code, so we can load it into data frames:
DIR = Path(__file__).parent

history_df = pd.read_csv(DIR / "history.csv", parse_dates=["Time"])
map_df = pd.read_csv(DIR / "map.csv")
For each record in history_df we can compute the departure time by:

1. Looking up the distance between the two locations in map_df.
2. Computing the departure datetime as Departure = Time - (Km / Speed).
3. Converting that datetime (with date) to a time-of-the-day variable via departure.hour + departure.minute / 60.0.
Let’s do that.
We start by defining two helper functions: one maps location pairs to a segment name, the other computes the departure time of the day:
def segment(r) -> str:
    """
    Consistently maps location pairs to a segment name:
    A,B to A-B and B,A to A-B
    """
    if r.Orig < r.Dest:
        return f"{r.Orig}-{r.Dest}"
    else:
        return f"{r.Dest}-{r.Orig}"


def departure(row) -> float:
    """
    Compute departure time-of-the-day, given arrival time, speed and distance.
    Departure at 1859-11-26 13:30:00 will map to 13.5
    """
    arrival = row.Time
    travel = row.Km / row.Speed
    departure = (arrival - pd.Timedelta(hours=travel)).time()
    hour = departure.hour + departure.minute / 60.0
    return hour
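As a quick sanity check, here is how the two helpers behave on a hypothetical record (the field values are illustrative, not taken from history.csv; Km is assumed to be already filled in, as we do in the merge step below):

row = pd.Series({
    "Orig": "Leverstorm",
    "Dest": "Cogburg",
    "Time": pd.Timestamp("1859-11-26 15:30:00"),
    "Speed": 50.0,
    "Km": 100.0,
})

print(segment(row))    # "Cogburg-Leverstorm" - both directions map to the same name
print(departure(row))  # 13.5 - arrival at 15:30 minus 100 km / 50 km/h = 2h of travel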
Next, we join both data frames by the computed Segment column and then fill in our Depart value:
history_df["Segment"] = history_df.apply(segment, axis=1) map_df["Segment"] = map_df.apply(segment, axis=1) history_df = history_df.merge( map_df[["Segment", "Km"]], left_on=["Segment"], right_on=["Segment"], how="left" ) history_df["Depart"] = history_df.apply(departure, axis=1)
Training
It is always a good idea to split the dataset into two partitions: train and test. We train the model on the former and measure its quality on the latter.
# get random sample
test_df = history_df.sample(frac=test_ratio, axis=0, random_state=1)
# get everything but the test sample
train_df = history_df.drop(index=test_df.index)
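Fixing random_state pins the sample, so repeated runs produce the same split and the reported MSE stays comparable between experiments.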
Next, we group all records by (Orig, Dest). For each edge, we use numpy to fit a polynomial of degree 3 to the available recorded points, where x is the departure time and y is the speed. This is a naive way to capture a model.
# we are going to have a model per road edge
models = {}
# group data for each segment together
for k, grp in train_df.groupby(["Orig", "Dest"]):
    # fit polynomial regression
    model = np.poly1d(np.polyfit(grp.Depart, grp.Speed, 3))
    models[k] = model
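np.poly1d wraps the fitted coefficients into a plain callable, so every value in models can be evaluated like a function. A toy example with made-up points (not from our dataset):

# illustrative only: five made-up (hour, speed) observations
hours = np.array([0.0, 6.0, 12.0, 18.0, 23.0])
speeds = np.array([80.0, 45.0, 60.0, 40.0, 75.0])

toy_model = np.poly1d(np.polyfit(hours, speeds, 3))
print(toy_model(9.0))  # predicted speed when departing at 09:00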
At this point our model is a dictionary, where keys are tuples and values are functions. This can be hard to keep in mind, especially in a dynamically typed language like Python.
Let’s wrap that data structure with a class that makes prediction more explicit:
class SpeedModel:
    def __init__(self, segments: dict):
        self.segments = segments

    def predict(self, orig: str, dest: str, hour: float) -> float:
        model = self.segments[(orig, dest)]
        prediction = model(hour)
        return prediction


model = SpeedModel(models)
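Prediction now reads as a single, explicit call (assuming the (orig, dest) pair was observed in the training data):

# predicted speed when departing Steamdrift towards Cogburg at 08:30
speed = model.predict("Steamdrift", "Cogburg", 8.5)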
Now we can use this model to compute Mean Squared Error (or MSE) on a test dataset:
test_df["Predict"] = test_df.apply( lambda x: model.predict(x.Orig, x.Dest, x.Depart), axis=1 ) MSE = ((test_df.Predict - test_df.Speed) ** 2).mean() print(f"Mean squared error is {MSE:.4f}")
Given a test ratio of 0.2, the current implementation prints:
Mean squared error is 46.5593
This is not bad, although still worse than the lowest MSE of 44.29 achieved by Daniel Weller in Transport Tycoon Kata 2.3.
Plug Model into Routing
This SpeedModel hides away the complexity of the speed calculation, so it is fairly simple to plug it into the simulation code from the previous article.
We are going to strip the comments here; they are still available in the linked source code.
We start by loading the map. This time we ignore the speed and keep track of the distance between adjacent locations:
from collections import defaultdict

MAP = defaultdict(list)  # port to roads

DIR = Path(__file__).parent
df = pd.read_csv(DIR / "map.csv")
for _, r in df.iterrows():
    MAP[r.Orig].append((r.Dest, r.Km))
    MAP[r.Dest].append((r.Orig, r.Km))
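The result is a symmetric adjacency list. A minimal sketch with made-up rows (location names and distances are illustrative, not from the real map.csv) shows the shape of the data:

# illustrative only: two made-up map rows
toy = pd.DataFrame({"Orig": ["A", "B"], "Dest": ["B", "C"], "Km": [100, 50]})

toy_map = defaultdict(list)
for _, r in toy.iterrows():
    toy_map[r.Orig].append((r.Dest, r.Km))
    toy_map[r.Dest].append((r.Orig, r.Km))

print(toy_map["B"])  # B is connected to both A (100 km) and C (50 km)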
The start of the simulation loop stays exactly the same as before, so we are going to skip it here.
We are going to change the part of the simulation loop that determines the time interval between the current clock at location and the truck's arrival at destination:
# compute time of the day
time_of_day = clock % 24
# ask model to predict speed to destination
speed = model.predict(location, destination, time_of_day)
# compute our arrival time
time_to_travel = distance / speed
arrival_time = clock + time_to_travel
# schedule this trip to continue when simulation forwards to that time
travels.put((arrival_time, destination, trip))
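The clock % 24 trick maps the monotonically growing simulation clock back onto the time-of-day scale the model was trained on:

# e.g. a truck that is 28.58 hours into the simulation departs at ~04:35 in the morning
time_of_day = 28.58 % 24  # -> 4.58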
Having that out of the way, we can now wire model training and routing into a single function:
import fire

from .route import route
from .train_speed import train


def train_and_route(
    origin: str = "Steamdrift",
    dest: str = "Leverstorm",
    test_ratio=0.2,
):
    """
    train speed model and then route truck.
    """
    model = train(test_ratio)
    route(model, origin, dest)


def main():
    fire.Fire(train_and_route)
Thanks to the fire library, we can execute train_and_route like this:
logisim3 --origin Steamdrift --dest Leverstorm --test_ratio 0.2
# prints:
# Mean squared error is 46.5593
# 0.00h DEPART Steamdrift
# 15.02h ARRIVE Cogburg
# 28.58h ARRIVE Copperhold
# 36.28h ARRIVE Leverstorm
Summary
Within this article we have explored how to mine historical data for insights, capture those insights in a model, and plug that model into the deterministic simulation.
We covered the speed model, but the approach could be applied in a similar fashion to any other simulation parameter: location profiles, incident probabilities, fuel consumption or intermodal transfer times.
Source code for this article is in the Trustbit/logisim repository, within src/logisim3.