Part 3: Trustbit Logistics Hackathon - Add speed model to logistic simulation
So far in the series we have built a trivial logistics simulation runtime. At this point it is only capable of finding the fastest route between two locations. This is implemented as a form of the A* algorithm that uses predefined travel times.
Let’s extend the implementation and demonstrate how we can “plug” different models into the simulation runtime.
Within this article we’ll mine historical data to build a naive speed model. This model will predict the average speed for a road segment, given the time of the day. The logic will live in the train function.
The trained model will then be passed to a modified route function that is very similar to the logic from the previous article.
When wired together, we should be able to do something like this:
logisim3 --origin Steamdrift --dest Leverstorm --test_ratio 0.2
# prints:
# Mean squared error is 46.5593
# 0.00h DEPART Steamdrift
# 15.02h ARRIVE Cogburg
# 28.58h ARRIVE Copperhold
# 36.28h ARRIVE Leverstorm
Source code for this article is in the Trustbit/logisim repository, within src/logisim3.
Mine the data
We’ll use synthetic data from Logistic Kata 2.3, which is stored in history.csv. The data is a historical event log of various transports driving between the locations: each record tracks one trip between two connected locations (a segment) and contains the arrival Time, the Orig and Dest locations and the average Speed.
We know that there are traffic jams during the day, so we want to build a model that predicts travel time between two adjacent locations given the departure time of the day.
For that, we need to:

1. Compute the departure time for each record in history.csv.
2. Group all records for each (Origin -> Destination) pair.
3. For each group, train a model that predicts the speed, using departure time as a feature.
This time, instead of writing the basic logic from scratch, we are going to use popular Python libraries:
- numpy - mathematical functions
- pandas - data analysis library
- fire - a Google library that generates a CLI interface from any function
Our imports will follow standard conventions and look like this:
from pathlib import Path

import pandas as pd
import numpy as np
Data is bundled with the source code, so we can load it into data frames:
DIR = Path(__file__).parent

history_df = pd.read_csv(DIR / "history.csv", parse_dates=["Time"])
map_df = pd.read_csv(DIR / "map.csv")
For each record in history_df we can compute the departure time by:

1. Looking up the distance between the two locations in map_df.
2. Computing the departure datetime as Departure = Time - (Km / Speed).
3. Converting that datetime (with date) to a time-of-the-day variable via departure.hour + departure.minute / 60.0.
Let’s do that.
We start by defining two helper functions: one maps location pairs to a segment name, the other computes the departure time of the day:
def segment(r) -> str:
    """
    Consistently maps location pairs to a segment name:
    A,B to A-B and B,A to A-B
    """
    if r.Orig < r.Dest:
        return f"{r.Orig}-{r.Dest}"
    else:
        return f"{r.Dest}-{r.Orig}"


def departure(row) -> float:
    """
    Compute departure time-of-the-day, given arrival time, speed and distance.
    Departure at 1859-11-26 13:30:00 will map to 13.5
    """
    arrival = row.Time
    travel = row.Km / row.Speed
    departure = (arrival - pd.Timedelta(hours=travel)).time()
    hour = departure.hour + departure.minute / 60.0
    return hour
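As a quick sanity check, here is how the two helpers behave on a hypothetical record (the field values are illustrative, not taken from history.csv; Km is assumed to be already filled in, as we do in the merge step below):

row = pd.Series({
    "Orig": "Leverstorm",
    "Dest": "Cogburg",
    "Time": pd.Timestamp("1859-11-26 15:30:00"),
    "Speed": 50.0,
    "Km": 100.0,
})

print(segment(row))    # "Cogburg-Leverstorm" - both directions map to the same name
print(departure(row))  # 13.5 - arrival at 15:30 minus 100 km / 50 km/h = 2h of travel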
Next, we join both data frames by the computed Segment column and then fill in our Depart value:
history_df["Segment"] = history_df.apply(segment, axis=1) map_df["Segment"] = map_df.apply(segment, axis=1) history_df = history_df.merge( map_df[["Segment", "Km"]], left_on=["Segment"], right_on=["Segment"], how="left" ) history_df["Depart"] = history_df.apply(departure, axis=1)
Training
It is always a good idea to split the dataset into two partitions: train and test. We train the model on the former and measure its quality on the latter.
# get random sample
test_df = history_df.sample(frac=test_ratio, axis=0, random_state=1)
# get everything but the test sample
train_df = history_df.drop(index=test_df.index)
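Fixing random_state pins the sample, so repeated runs produce the same split and the reported MSE stays comparable between experiments.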
Next, we group all records by (Orig, Dest). For each edge, we use numpy to fit a polynomial of degree 3 to the available recorded points, where x is the departure time and y is the speed. This is a naive way to capture a model.
# we are going to have a model per road edge
models = {}
# group data for each segment together
for k, grp in train_df.groupby(["Orig", "Dest"]):
    # fit polynomial regression
    model = np.poly1d(np.polyfit(grp.Depart, grp.Speed, 3))
    models[k] = model
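np.poly1d wraps the fitted coefficients into a plain callable, so every value in models can be evaluated like a function. A toy example with made-up points (not from our dataset):

# illustrative only: five made-up (hour, speed) observations
hours = np.array([0.0, 6.0, 12.0, 18.0, 23.0])
speeds = np.array([80.0, 45.0, 60.0, 40.0, 75.0])

toy_model = np.poly1d(np.polyfit(hours, speeds, 3))
print(toy_model(9.0))  # predicted speed when departing at 09:00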
At this point our model is a dictionary, where keys are tuples and values are functions. This can be hard to keep in mind, especially in a dynamically typed language like Python.
Let’s wrap that data structure with a class that makes prediction more explicit:
class SpeedModel:
    def __init__(self, segments: dict):
        self.segments = segments

    def predict(self, orig: str, dest: str, hour: float) -> float:
        model = self.segments[(orig, dest)]
        prediction = model(hour)
        return prediction


model = SpeedModel(models)
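Prediction now reads as a single, explicit call (assuming the (orig, dest) pair was observed in the training data):

# predicted speed when departing Steamdrift towards Cogburg at 08:30
speed = model.predict("Steamdrift", "Cogburg", 8.5)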
Now we can use this model to compute Mean Squared Error (or MSE) on a test dataset:
test_df["Predict"] = test_df.apply( lambda x: model.predict(x.Orig, x.Dest, x.Depart), axis=1 ) MSE = ((test_df.Predict - test_df.Speed) ** 2).mean() print(f"Mean squared error is {MSE:.4f}")
Given a test ratio of 0.2, the current implementation prints:
Mean squared error is 46.5593
This is not bad, although still worse than the lowest MSE of 44.29 achieved by Daniel Weller in Transport Tycoon Kata 2.3.
Plug Model into Routing
This SpeedModel hides away the complexity of the speed calculation, so it is fairly simple to plug it into the simulation code from the previous article.
We are going to strip the comments here; they are still available in the linked source code.
We start by loading the map. This time we ignore the speed and keep track of the distance between adjacent locations:
from collections import defaultdict

MAP = defaultdict(list)  # port to roads

DIR = Path(__file__).parent
df = pd.read_csv(DIR / "map.csv")
for _, r in df.iterrows():
    MAP[r.Orig].append((r.Dest, r.Km))
    MAP[r.Dest].append((r.Orig, r.Km))
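The result is a symmetric adjacency list. A minimal sketch with made-up rows (location names and distances are illustrative, not from the real map.csv) shows the shape of the data:

# illustrative only: two made-up map rows
toy = pd.DataFrame({"Orig": ["A", "B"], "Dest": ["B", "C"], "Km": [100, 50]})

toy_map = defaultdict(list)
for _, r in toy.iterrows():
    toy_map[r.Orig].append((r.Dest, r.Km))
    toy_map[r.Dest].append((r.Orig, r.Km))

print(toy_map["B"])  # B is connected to both A (100 km) and C (50 km)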
The start of the simulation loop stays exactly the same as before, so we are going to skip it here.
We are going to change the part of the simulation loop that determines the time interval between the current clock at location and the truck's arrival at destination:
# compute time of the day
time_of_day = clock % 24
# ask model to predict speed to destination
speed = model.predict(location, destination, time_of_day)
# compute our arrival time
time_to_travel = distance / speed
arrival_time = clock + time_to_travel
# schedule this trip to continue when simulation forwards to that time
travels.put((arrival_time, destination, trip))
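The clock % 24 trick maps the monotonically growing simulation clock back onto the time-of-day scale the model was trained on:

# e.g. a truck that is 28.58 hours into the simulation departs at ~04:35 in the morning
time_of_day = 28.58 % 24  # -> 4.58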
Having that out of the way, we can now wire model training and routing into a single function:
import fire

from .route import route
from .train_speed import train


def train_and_route(
    origin: str = "Steamdrift",
    dest: str = "Leverstorm",
    test_ratio=0.2,
):
    """
    train speed model and then route truck.
    """
    model = train(test_ratio)
    route(model, origin, dest)


def main():
    fire.Fire(train_and_route)
Thanks to the fire library, we can execute train_and_route like this:
logisim3 --origin Steamdrift --dest Leverstorm --test_ratio 0.2
# prints:
# Mean squared error is 46.5593
# 0.00h DEPART Steamdrift
# 15.02h ARRIVE Cogburg
# 28.58h ARRIVE Copperhold
# 36.28h ARRIVE Leverstorm
Summary
Within this article we have explored how to mine historical data for insights, capture those insights in a model, and plug that model into the deterministic simulation.
We covered the speed model, but the approach could be applied in a similar fashion to any other simulation parameter: location profiles, incident probabilities, fuel consumption or intermodal transfer times.
Source code for this article is in the Trustbit/logisim repository, within src/logisim3.