
The Modular vs. End-to-End Debate: What the AV Industry's Biggest Architectural Decision Means for Mining


In my previous piece, I argued that classical motion planning works exceptionally well in off-road domains like mining. But that naturally raises a question: if the on-road AV industry is rapidly moving toward ML-heavy and end-to-end architectures, is mining making a mistake by not following?

The short answer is no. The longer answer requires understanding what these architectural choices actually are, what they trade off, and why mining's constraints point toward a different answer than what works on San Francisco streets.


Three Architectures, One Decision

The autonomous vehicle industry is currently split across three broad architectural approaches. Understanding the differences matters because the choice affects not just performance, but how you debug failures, certify safety, and integrate components from multiple vendors.

The modular pipeline is the traditional approach. Separate modules handle perception, tracking, prediction, planning, and control, each with defined inputs and outputs. Open-source frameworks like Autoware follow this pattern, and it is the dominant architecture in mining today, underpinning systems like Caterpillar's Command for hauling and Komatsu's FrontRunner. The strength is clarity: when something goes wrong, you can trace the failure to a specific module. The weakness is that information gets lost at module boundaries, and small errors compound through the pipeline.
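The "defined inputs and outputs" point is easiest to see in code. The sketch below is illustrative only: the types, stage logic, and constant-velocity prediction are stand-ins, not any vendor's actual API. What matters is that every stage produces a human-readable artifact you can log and inspect.

```python
from dataclasses import dataclass

# Hypothetical types marking the module boundaries of a modular stack.
# Each stage's output is inspectable, which is the debuggability claim.

@dataclass
class DetectedObject:
    position: tuple[float, float]  # metres, site frame
    velocity: tuple[float, float]  # m/s
    label: str

@dataclass
class PredictedTrajectory:
    obj: DetectedObject
    waypoints: list[tuple[float, float]]

def perceive(sensor_frame: dict) -> list[DetectedObject]:
    """Perception: raw sensor frame in, object list out."""
    return [DetectedObject(o["pos"], o["vel"], o["label"])
            for o in sensor_frame.get("objects", [])]

def predict(objects: list[DetectedObject],
            horizon: int = 3) -> list[PredictedTrajectory]:
    """Prediction: constant-velocity extrapolation as a stand-in."""
    return [PredictedTrajectory(
                o,
                [(o.position[0] + o.velocity[0] * t,
                  o.position[1] + o.velocity[1] * t)
                 for t in range(1, horizon + 1)])
            for o in objects]

def plan(predictions: list[PredictedTrajectory]) -> list[tuple[float, float]]:
    """Planning: a fixed dispatch route here; a real planner would search."""
    return [(0.0, float(s)) for s in range(5)]
```

If a truck misbehaves, you replay the frame and check each intermediate output in turn: was the object detected, was its motion predicted, did the planner respond? That trace is exactly what an end-to-end model does not give you.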

The end-to-end approach replaces the entire pipeline with a single neural network. Raw sensor data goes in, steering and throttle commands come out. Tesla's FSD is the most prominent example. The appeal is obvious: no information loss between modules, and the ability to learn complex behaviors that are difficult to code explicitly. The challenge is just as real: when the system makes a mistake, there is no module boundary to inspect. The failure exists somewhere inside millions of learned parameters, and diagnosing why it happened is fundamentally harder than tracing a bug through structured code.

The hybrid or unified approach sits between the two. A shared learned backbone processes sensor data into rich scene representations, but the downstream prediction and planning stages remain partially structured. Waymo's production system has evolved significantly in this direction, incorporating transformer-based models and joint prediction-planning across its stack. Companies like Wayve in the UK are exploring similar territory. The idea is to capture the performance benefits of joint learning while preserving interpretability in the planning output. The on-road industry is broadly converging here.

It is worth stating clearly: architecture and methodology are independent choices. A modular system can use deep learning for perception, learned heuristics for planning, and neural cost functions for behavior shaping while still maintaining the structured interfaces that make it debuggable, certifiable, and open to multiple vendors. Choosing modular does not mean choosing against ML. It means choosing where ML sits in the stack and what gets the final say on safety. This distinction matters for what follows.


The Trade-Off Triangle

These three architectures force a trade-off between three properties that are hard to maximize simultaneously.

[Chart: "The Architecture Decision for Mining Autonomy" — a matrix comparing the Modular, End-to-End, and Hybrid approaches for mining.]

Interpretability is the ability to understand why the system made a specific decision. Modular stacks score high here because each module produces a human-readable intermediate output: a list of detected objects, a set of predicted trajectories, a planned path. End-to-end systems score low because their internal representations are learned and opaque.

Performance in complex interactions favors end-to-end and hybrid approaches. Joint learning captures the coupling between prediction and planning that modular systems miss. When a robotaxi needs to negotiate an unprotected left turn, it needs to reason about how its own behavior will influence oncoming traffic. In a modular stack, the prediction module doesn't know what the planner intends to do. Joint models solve this by reasoning about ego and other agents simultaneously.
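The coupling a modular split misses can be shown with a deliberately toy example. In the modular version, the prediction function has no access to ego intent; in the joint version, the predicted behavior of the oncoming driver is conditioned on the ego maneuver. The numbers are made up purely for illustration.

```python
def predict_modular(oncoming_speed: float) -> float:
    """Modular prediction: no knowledge of what the planner intends.
    Assumes the oncoming vehicle simply holds its speed."""
    return oncoming_speed

def predict_joint(oncoming_speed: float, ego_commits_to_turn: bool) -> float:
    """Joint prediction: the oncoming driver reacts to a committed ego
    maneuver (e.g. yields and slows). The 0.6 factor is an arbitrary
    illustrative value, not a calibrated model."""
    return oncoming_speed * (0.6 if ego_commits_to_turn else 1.0)
```

When the ego vehicle's own action changes the scene, only the joint formulation can represent that feedback loop, which is why unprotected left turns reward joint models.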

Data and compute requirements separate the approaches sharply. End-to-end systems need massive datasets to learn reliably. Tesla trains on data from millions of vehicles on the road. Hybrid approaches need less but still require significant investment. Classical modular systems can be developed and tuned with far less data, which matters in industries where training data is scarce and expensive to collect.


Why Mining Lands Differently

Here is where the on-road conclusions stop transferring cleanly.

The reason the AV industry is converging on hybrid and end-to-end architectures is that urban driving demands it. The interaction problem in dense city traffic is so complex that hand-coded planning rules cannot cover every scenario. Learned models earn their place by handling the long tail of human behavior: the jaywalker who doesn't look up, the delivery driver who reverses into a bike lane, the construction worker who waves you through a red light.

Mining generally doesn't have a long tail of unpredictable human interactions. Haul trucks follow dispatch routes. Traffic is managed by fleet management systems. The agents on site are known, their behavior is constrained, and the interaction patterns are repetitive. As I described in the previous piece, a classical planning stack can cover 99% or more of the operational scenarios in a mining environment.

That changes the calculus entirely. The performance advantage of end-to-end learning, which is substantial in urban driving, is marginal in mining. But the costs of end-to-end are the same regardless of domain: opaque failure modes, difficulty building safety cases, massive data requirements, and the inability to isolate and fix individual components without retraining the entire model.

Mining also has a practical constraint that the passenger car industry doesn't: fleet sizes measured in dozens or hundreds, not millions. There is no mining equivalent of Tesla's data flywheel. The data-hungry architectures that work at consumer automotive scale aren't realistic for mining fleets.


The Interoperability Argument

A modular architecture has clean interfaces between components. That means a mine operator can source perception from one vendor, planning from another, and fleet management from a third. They can upgrade individual modules without replacing the entire stack. They can evaluate competing solutions at each layer independently.
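The vendor-swappable property follows directly from specifying each layer as a contract. The sketch below uses Python's `typing.Protocol` to make the idea concrete; the interface, vendor classes, and scene format are hypothetical, not any real procurement standard.

```python
from typing import Protocol

# Hypothetical planning contract: the rest of the stack depends only
# on this interface, never on a vendor's internals.

class Planner(Protocol):
    def plan(self, scene: dict) -> list[tuple[float, float]]: ...

class VendorAPlanner:
    def plan(self, scene: dict) -> list[tuple[float, float]]:
        # Stand-in: evenly spaced waypoints toward the goal.
        gx, gy = scene["goal"]
        return [(gx * t / 4, gy * t / 4) for t in range(5)]

class VendorBPlanner:
    def plan(self, scene: dict) -> list[tuple[float, float]]:
        # Same contract, different internals: a two-point route.
        gx, gy = scene["goal"]
        return [(0.0, 0.0), (gx, gy)]

def run_stack(planner: Planner, scene: dict) -> list[tuple[float, float]]:
    """Swapping vendors means swapping one object, not the whole stack."""
    return planner.plan(scene)
```

Either vendor satisfies the contract, so an operator can benchmark both against the same scenes and replace one without touching perception or fleet management.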

An end-to-end system, by definition, has no such interfaces. The entire pipeline is a single trained model. Switching vendors means switching everything. Improving one capability means retraining the whole system. For an industry that already struggles with vendor lock-in at the equipment level, adopting an opaque, monolithic autonomy architecture would deepen that dependency at the software level.

The modular approach doesn't mean ignoring ML. Perception should absolutely use learned models. But there is also a strong case for learned components deeper in the stack, and this is where the hybrid framing becomes useful even in mining.

Foundation world models and vision-language-action models can significantly improve a system's semantic understanding of its environment: what a scene means, what other agents are likely to do, how an ego vehicle's actions might influence the situation. These models are good at capturing context that is hard to encode by hand. They can also learn cost functions and behavioral preferences that better align with how skilled human operators actually drive, rather than relying on hand-tuned parameters that approximate it.

The key is what sits at the end of the pipeline. A physics-based verifier checks every planned trajectory against hard constraints: kinematic feasibility, collision boundaries, stability limits. It provides a safety guarantee that no learned model can offer on its own. The learned components propose and shape behavior. The verifier ensures nothing physically unsafe gets through. This separation means you get the semantic richness of modern ML where it helps, without giving up the deterministic safety properties where they matter most.
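The propose-then-verify split can be sketched in a few lines. The limits, trajectory format, and obstacle model below are illustrative placeholders, not real haul truck specifications; the point is the shape of the check, not the numbers.

```python
import math

# Assumed limits for illustration only; real values come from the
# vehicle's kinematic model and site safety rules.
MAX_CURVATURE = 0.05   # 1/m, kinematic feasibility limit
MAX_SPEED = 15.0       # m/s, stability limit
MIN_CLEARANCE = 5.0    # m, collision buffer around obstacles

def verify(trajectory, obstacles) -> bool:
    """Check every point of a proposed trajectory against hard constraints.
    trajectory: list of (x, y, speed, curvature); obstacles: list of (x, y).
    Deterministic pass/fail: no learned component can override it."""
    for x, y, speed, curvature in trajectory:
        if abs(curvature) > MAX_CURVATURE:        # kinematic feasibility
            return False
        if speed > MAX_SPEED:                     # stability limit
            return False
        for ox, oy in obstacles:                  # collision boundary
            if math.hypot(x - ox, y - oy) < MIN_CLEARANCE:
                return False
    return True

def select(proposals, obstacles):
    """Learned components propose; only verified trajectories survive."""
    return [t for t in proposals if verify(t, obstacles)]
```

The learned stack can rank and shape the proposals however it likes, but nothing it emits reaches the actuators without passing this deterministic gate, which is what makes the safety case arguable.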

The architecture I'm advocating is not anti-ML. It is about being deliberate in where ML sits in the stack, what role it plays, and what gets the final say.


The Window Is Now

The mining industry is early enough in its autonomy adoption that these architectural decisions are still open. That isn't true for the passenger car industry, where billions of dollars of infrastructure are already committed to specific approaches.

Mining has the advantage of choosing with full visibility into what the on-road industry has learned over the past decade. The lesson is not that modular stacks are perfect. They have real limitations. The lesson is that for domains with structured interactions, manageable data scales, and high safety requirements, a modular architecture that uses learned models where they add value and physics-based verification where safety demands it delivers the best balance of capability, debuggability, and vendor independence.

The next piece in this series on OpenAutonomy.com will examine what happens after you choose your architecture: how the safety validation frameworks developed through billions of autonomous miles on public roads can inform mining's approach to certifying autonomous operations.
