
Autonomy 2.0: The Quest for Economies of Scale

Emphasizing self-improvement, simulation-driven refinement, and reduced human oversight in the development of autonomous machines.


The past decade has witnessed remarkable advancements in robotics and AI technologies, ushering in the era of autonomous machines. In this new age, service robots, autonomous drones, delivery robots, self-driving vehicles, and other autonomous machines are poised to replace humans in providing various services [5]. While the rise of autonomous machines promises to revolutionize our economy, the reality has fallen short of expectations despite over a decade of intensive R&D investments.

The current development paradigm, dubbed Autonomy 1.0, scales mainly with the size of the engineering team rather than with the amount of relevant data or computational resources. This limitation prevents the autonomy industry from fully leveraging economies of scale, particularly the exponentially decreasing cost of computing power and the explosion of available data.

After evaluating the key scalability blockers, we argue that Autonomy 2.0, which is driven by an ever-evolving software stack that improves with more data and computational power, alongside a novel digital twin paradigm that enables algorithmic exploration through realistic, large-scale simulations prior to deployment, will address these challenges to significantly accelerate progress in the autonomy industry.

Scalability of the Digital Economy

The digital economy, characterized by the use of information technology to create, market, distribute, and consume goods and services, has been the primary driver of global economic growth over the past two decades. The Internet industry, a key component of the digital economy, exemplifies this impact. From 2005 to 2010, it accounted for 21% of GDP growth in mature economies. In 2019, the Internet industry contributed $2.1 trillion to the U.S. economy—approximately 10% of its annual GDP—making it the fourth largest industry behind only real estate, government, and manufacturing. Moreover, the industry directly provides nearly six million jobs, representing 4% of U.S. employment.

Technological scalability is the cornerstone fueling the continuous growth of the digital economy. The most successful companies in this sector have developed core technology stacks that scale with available computational resources and data, rather than with the size of their engineering teams. WhatsApp serves as a remarkable example: When Facebook acquired it for $19 billion, the company had only 32 engineers serving over 450 million users.

In stark contrast, today’s autonomous machine technologies, dubbed Autonomy 1.0, embody practices that a scalable industry should avoid. We compare the R&D expenditures and revenue per employee of two leading public digital economy companies—Microsoft (representing the software industry) and Alphabet (representing the Internet industry)—against two public autonomous driving companies: TuSimple (representing the robot truck industry) and Aurora (representing the robotaxi industry). These autonomous driving companies were selected for the accessibility of their financial data.

Both Alphabet and Microsoft demonstrate remarkable efficiency, allocating less than 20% of their total operating expenditures to R&D. Alphabet, in particular, serves over 4.3 billion users with fewer than 30,000 engineers. Their scalability is primarily constrained by available computational resources and data, rather than by their workforce size.

In comparison, TuSimple and Aurora allocate more than 70% of their operating expenditures to R&D. Autonomous driving companies often face the challenge of recalibrating their existing technology stacks to adapt to new environments when expanding their user base or deploying services to new regions. Consequently, their scalability is limited by R&D investment or, more directly, by the number of engineers they employ.

This disparity in operational efficiency translates directly to financial performance. Alphabet and Microsoft generate $1.5 million and $0.8 million in revenue per employee respectively, while maintaining high growth rates. Conversely, TuSimple and Aurora generate negligible revenue per employee and struggle with growth. For the autonomy industry to achieve economies of scale comparable to the digital giants, a revolution in the R&D paradigm is imperative.

Autonomy 1.0: An Aging Paradigm

The accompanying figure illustrates the scalability challenges of Autonomy 1.0 using autonomous driving operation data from California between 2018 and 2022. Despite massive investments in autonomous driving over this five-year period, the growth in the operational vehicle fleet has been modest, increasing from just 400 vehicles in 2018 to 1,500 in 2022. Similarly, the annual operation mileage saw only a limited increase from two million miles to five million miles. Most critically, the industry continues to grapple with over 2,000 disengagement incidents per year. Given these trends in Autonomy 1.0, the prospect of large-scale commercial operations of autonomous vehicles remains distant, likely years away from realization.

Autonomy 1.0 is characterized by a modular structure, comprising functional components such as sensing, perception, localization, high-definition maps, prediction, planning, and control [9]. Each of these modules further consists of several functional sub-modules integrated through explicit and hand-crafted logic. Core decision-making tasks, such as planning (responsible for generating optimal and drivable paths), are typically solved using constraint optimization under a large set of manually tuned rules.
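To make the flavor of this rule-based planning concrete, here is a minimal, hypothetical Python sketch (the rules, weights, and path fields are all invented for illustration): each hand-tuned rule contributes a weighted penalty to a path's cost, and handling a new corner case in Autonomy 1.0 typically means appending yet another rule.

    def plan_path(candidates, rules):
        """Autonomy 1.0-style planning: pick the lowest-cost drivable path
        under a set of hand-crafted, manually weighted rules."""
        def cost(path):
            return sum(weight * penalty(path) for weight, penalty in rules)
        feasible = [p for p in candidates if p["drivable"]]
        return min(feasible, key=cost)

    # Hand-tuned rules as (weight, penalty function) pairs, invented here.
    rules = [
        (10.0, lambda p: p["lateral_jerk"]),         # comfort rule
        (100.0, lambda p: p["obstacle_proximity"]),  # safety rule
    ]
    candidates = [
        {"drivable": True, "lateral_jerk": 0.2, "obstacle_proximity": 0.1},
        {"drivable": True, "lateral_jerk": 0.1, "obstacle_proximity": 0.5},
    ]
    print(plan_path(candidates, rules))  # cost 12.0 beats cost 51.0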

When a disengagement incident occurs, engineers must undertake a lengthy debugging process to identify which specific module or rule may have triggered the issue. They then optimize that module or develop logic changes to address the specific problem. However, due to the intricate dependencies and coupling among modules and rules, new software versions often lead to unforeseen issues, significantly slowing down the development process.

Over time, the Autonomy 1.0 software stack has evolved into a complex collection of ad-hoc rules and interdependent modules designed to handle various long-tail events. This complexity has made the system increasingly difficult to debug, maintain, and improve.

The open-source project Apollo serves as a typical example of this complexity. Its perception module alone consists of multiple individual learning-based sub-modules to accomplish tasks such as object detection in 2D images, LiDAR point cloud segmentation, traffic light detection, and lane detection. The planning module makes decisions and plans routes based on data from the perception, prediction, localization, and map modules. These modules often have strong interdependencies, meaning that changes to one module affect the algorithmic performance of downstream modules due to distributional shifts in data, and hence also impact overall system performance. Consequently, the entire system has become complicated and potentially brittle, demanding enormous engineering resources merely to maintain, let alone to advance at a rapid pace.

Figure.  California: The number of self-driving vehicles, mileage, and disengagements from 2018 to 2022.

There are three major scalability bottlenecks of Autonomy 1.0:

  • Complexity Bottleneck: Autonomy 1.0 systems are characterized by intricate designs that demand extensive engineering efforts. These efforts span software and algorithm design, system development, and ongoing maintenance, creating a significant burden on resources and time.

  • Human-Data Bottleneck: A critical limitation of Autonomy 1.0 systems is their dependence on fleets of physical vehicles operated by humans for data collection and testing purposes. Additionally, these systems require large-scale data labeling for system evaluation and model training. This approach is not only costly but also presents significant challenges in scaling operations effectively.

  • Generalization Bottleneck: The architecture of Autonomy 1.0 systems heavily relies on rule-based processing logic and handcrafted interfaces. This design philosophy inherently limits the systems’ ability to adapt and generalize to new environments, creating a substantial barrier to widespread deployment and adoption.

Autonomy 2.0: Scalability Is Everything

While Autonomy 1.0 has been the predominant R&D paradigm for autonomous machines for over a decade, recent breakthroughs in artificial intelligence have sparked new ideas in system architecture, data and model infrastructure, and engineering practices. These advancements, including Transformers [8], offline reinforcement learning [4], large language models (LLMs) [1], neural rendering [7], and vision-language-action models [6], have paved the way for a new development paradigm we call Autonomy 2.0. In this section, we will outline the main characteristics of Autonomy 2.0, followed by a discussion of key open questions in the subsequent section.

Table.  Summary of Autonomy 1.0 vs. Autonomy 2.0
Complexity Bottleneck
  Autonomy 1.0: numerous functional modules; complicated dependencies among modules; complex runtime constraints
  Autonomy 2.0: two main neural network modules; clear software interfaces; simple resource requirements

Human-Data Bottleneck
  Autonomy 1.0: data from road tests and labeling scales with human labor
  Autonomy 2.0: data from synthesis and simulation scales with computing resources

Generalization Bottleneck
  Autonomy 1.0: task-specific handcrafted logic; complicated software stack
  Autonomy 2.0: task-agnostic learning-based updates; simple and stable software stack

The cornerstone of Autonomy 2.0 is scalability, which is achieved through two crucial ingredients:

  • A software stack that continuously improves with increasing scale of data and computational resources.

  • A new paradigm based on digital twins for algorithmic exploration using realistic, controllable, and large-scale simulation before deployment.

The accompanying table summarizes how Autonomy 2.0 addresses the three bottlenecks present in Autonomy 1.0.

Learning-native software stack.  Autonomous machines fundamentally perform two main tasks: perception and action, reflecting the natural dichotomy between past and future. The perception task involves observing the environment and inferring its current state based on accumulated observations. The action task, building upon the inferred environmental state, involves selecting an appropriate sequence of actions to achieve specified goals while anticipating various possible near-future environmental changes.
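Schematically, and with every name invented for illustration, the whole stack reduces to two calls:

    def autonomy_stack(observations, goal, perceive, act):
        """The two-task decomposition: perception summarizes the past into a
        state estimate; action plans the future from that estimate."""
        state = perceive(observations)  # past: infer the current world state
        return act(state, goal)         # future: choose an action sequence

    # Stand-in components, just to show the interface:
    plan = autonomy_stack(
        observations=["camera frames", "lidar sweeps"],
        goal="reach drop-off",
        perceive=lambda obs: {"ego": (0.0, 0.0), "agents": []},
        act=lambda state, goal: ["keep lane", "slow for crosswalk"],
    )
    print(plan)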

Benefits.  Autonomy 2.0 addresses the Complexity Bottleneck by significantly reducing the number of learning-based components and, consequently, the amount of non-learning code that requires maintenance. We compare the lines of code in the Apollo Perception module, representing the Autonomy 1.0 approach, with BEVFormer, an example of the perception module in Autonomy 2.0. The Apollo Perception module is 10 times larger than BEVFormer, yet BEVFormer has achieved state-of-the-art perception results.

This software architecture also tackles the Generalization Bottleneck present in Autonomy 1.0 by handling corner cases through data-driven model learning instead of handcrafted logic. An analysis of over 400 issues associated with Apollo planning modules reveals that 47% of the issues relate to Apollo failing to handle specific use cases, while 30% are linked to software engineering problems such as interfaces with other modules. In Autonomy 1.0, numerous handcrafted rules are implemented to handle specific use cases. As these rules accumulate, software quality inevitably becomes an issue.

Architectural Design.  The perception and action modules, while integral parts of autonomous systems, have distinct goals and traditionally require different algorithmic approaches. The perception module is primarily trained using supervised and self-supervised learning techniques. Its objective is to infer a unique ground truth of world states based on sensory inputs. This involves processing and interpreting data from various sensors to create an accurate representation of the environment. In contrast, the action module faces a more complex task. It must search through and choose from many acceptable action sequences while anticipating the diverse potential behaviors of other agents in the environment. This complexity necessitates the use of more diverse methods, including reinforcement learning, imitation learning, and model predictive control. These techniques allow the action module to make decisions that are not only optimal for the current state but also robust to potential future scenarios.

Perception.  Autonomy 1.0 approaches perception through a set of deep neural networks (DNNs), each trained independently to support specific tasks such as 2D/3D object detection, segmentation, and tracking. In contrast, Autonomy 2.0 employs a more unified approach. Its perception module utilizes a single transformer backbone to provide a cohesive representation of the ego-vehicle’s environment, such as 2D Bird’s Eye View (BEV) or 3D occupancy. This unified representation is then processed by several decoder “heads,” each fine-tuned for a specific task.
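As a hedged PyTorch sketch of this single-backbone, multi-head pattern (a small convolutional stack stands in for the transformer BEV encoder, and every name and dimension is invented; BEVFormer itself is far more elaborate):

    import torch
    import torch.nn as nn

    class UnifiedPerception(nn.Module):
        """One shared backbone produces a BEV-like feature map that several
        task-specific decoder heads consume."""
        def __init__(self, bev_dim=256, num_classes=10):
            super().__init__()
            # Stand-in for a transformer BEV encoder.
            self.backbone = nn.Sequential(
                nn.Conv2d(3, bev_dim, kernel_size=7, stride=4, padding=3),
                nn.ReLU(),
                nn.Conv2d(bev_dim, bev_dim, kernel_size=3, padding=1),
            )
            # Each head is fine-tuned for one task on the shared representation.
            self.detection_head = nn.Conv2d(bev_dim, num_classes, kernel_size=1)
            self.segmentation_head = nn.Conv2d(bev_dim, 2, kernel_size=1)

        def forward(self, images):
            bev = self.backbone(images)  # unified environment representation
            return {
                "detection": self.detection_head(bev),        # e.g., object heatmaps
                "segmentation": self.segmentation_head(bev),  # e.g., drivable area
            }

    outputs = UnifiedPerception()(torch.randn(1, 3, 256, 256))  # dummy input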

This single-transformer approach to perception has gained significant traction across the autonomous vehicle industry. Notable examples include Tesla’s description of their system during their “AI Day 2022” event, and the deployment of similar architectures by leading intelligent electric vehicle companies like XPENG. The growing adoption of this approach underscores its potential advantages in streamlining perception tasks and improving overall system efficiency.

Action.  The action module in autonomous systems faces the complex task of anticipating a combinatorially large number of possible “world trajectories.” It must hypothesize multiple action sequences and evaluate them to select the optimal one for actuator execution. In Autonomy 1.0, this process is broken down into separate sub-modules for prediction, planning, and control. Autonomy 2.0, however, takes a more integrated approach. It implements the action module as an end-to-end learned neural network, leveraging transformer-inspired architectures designed for sequential decision making [2]. This unified approach aims to streamline the decision-making process and potentially improve the system’s ability to handle complex, dynamic environments.
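In the same hedged spirit, a toy PyTorch sequence model inspired by the decision-transformer line of work [2] (dimensions and names invented) that maps a history of perceived states to the next control action:

    import torch
    import torch.nn as nn

    class ActionTransformer(nn.Module):
        """Toy end-to-end action module: consumes a history of perceived
        states and emits the next action for the actuators."""
        def __init__(self, state_dim=64, action_dim=3, d_model=128):
            super().__init__()
            self.embed_state = nn.Linear(state_dim, d_model)
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.action_head = nn.Linear(d_model, action_dim)  # e.g., steer/throttle/brake

        def forward(self, state_history):
            # state_history: (batch, time, state_dim) from the perception module
            h = self.encoder(self.embed_state(state_history))
            return self.action_head(h[:, -1])  # action for the latest timestep

    action = ActionTransformer()(torch.randn(1, 10, 64))  # 10-step history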

Simulation-based development and deployment.  Autonomy 1.0 heavily relies on human efforts for tasks such as manual data labeling and physical testing, creating a significant scalability bottleneck. In contrast, Autonomy 2.0 addresses this human-data bottleneck with two key insights:

  • An efficient and sophisticated data “engine” is essential for processing and mining rare and critical driving scenarios from large-scale real-world data. Such data is invaluable for improving system performance (a minimal mining sketch follows this list).

  • Neural simulators trained with real-world data can be the source of almost unlimited, realistic, scalable and highly controllable driving scenario data for extensive system evaluation and model training, both for environment perception and action planning.
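A minimal sketch of the first ingredient, the data engine (the clip fields and thresholds are hypothetical): the point is simply that routine fleet logs are filtered down to the rare, safety-relevant long tail worth labeling, simulating, and training against.

    from dataclasses import dataclass

    @dataclass
    class DrivingClip:
        clip_id: str
        min_time_to_collision_s: float  # from logged perception output
        hard_brake: bool                # deceleration beyond a threshold
        disengaged: bool                # safety driver took over

    def is_rare_and_critical(clip: DrivingClip) -> bool:
        """Hypothetical mining rule: keep clips that look safety-relevant."""
        return clip.disengaged or clip.hard_brake or clip.min_time_to_collision_s < 2.0

    def mine(fleet_log):
        # Reduce large volumes of routine driving to long-tail scenarios.
        return [c for c in fleet_log if is_rare_and_critical(c)]

    logs = [DrivingClip("a", 5.0, False, False), DrivingClip("b", 1.2, True, False)]
    print([c.clip_id for c in mine(logs)])  # -> ['b']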

Autonomy 2.0 specifically leverages an emerging technology trend known as digital twins, where a virtual representation serves as a realistic and customizable replica of the physical world. In this paradigm, the physical system is meticulously instrumented to collect real-world, real-time data, which is used to reproduce its digital “clone” and to continuously improve the neural simulators. Within this digital realm, driving scenarios can be synthesized with exceptional fidelity, accurately replicating both sensor data and driving behaviors that may be encountered in the real world. This approach offers several key advantages (a schematic of the resulting feedback loop follows the list):

  • Realism: The digital twin closely mirrors the complexities and nuances of real-world driving environments.

  • Scalability: Virtual scenarios can be generated and modified at a scale far beyond what’s feasible with physical testing.

  • Safety: Dangerous or rare scenarios can be explored without risk to human life or property.

  • Efficiency: The development and testing process can be significantly accelerated, reducing time and costs.

  • Iterative Improvement: The constant feedback loop between the physical and digital worlds allows for rapid refinement of the autonomous system.
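Under heavy simplification, and with entirely hypothetical interfaces, the feedback loop between the physical and digital worlds can be sketched as:

    class NeuralSimulator:
        """Stand-in neural simulator; every method here is hypothetical."""
        def calibrate(self, logs): ...                 # fit the digital clone to real data
        def synthesize(self, count): return [f"scenario-{i}" for i in range(count)]
        def evaluate(self, model): return 0.999        # e.g., fraction of scenarios passed

    class DrivingModel:
        def train(self, scenarios): ...                # perception + action updates

    def digital_twin_loop(real_logs, iterations=3, safety_bar=0.995):
        sim, model = NeuralSimulator(), DrivingModel()
        for _ in range(iterations):
            sim.calibrate(real_logs)                # physical -> digital: keep the clone faithful
            model.train(sim.synthesize(count=100))  # digital -> model: cheap, controllable data
            if sim.evaluate(model) >= safety_bar:   # gate before real-world deployment
                break
        return model

    digital_twin_loop(real_logs=["logged drive A", "logged drive B"])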

This simulation-centric approach has already been adopted by various autonomous driving companies such as Wayve and Waabi.

The development and testing of autonomous driving software using synthesized virtual scenarios offers significant advantages over the physical-only approach of Autonomy 1.0. This method accelerates the evaluation process by a factor of 10³ to 10⁵ [3] and reduces testing costs by two orders of magnitude [10].

We have compared the cost efficiency of Autonomy 2.0 with that of Autonomy 1.0: Autonomy 2.0 achieves a dramatic 90-fold reduction in cost and a 1,000-fold improvement in R&D efficiency, measured in kmiles/vehicle/year. The combined effect of these improvements (90 × 1,000 ≈ 10⁵) is a potential 10⁵-fold enhancement in overall efficiency for the same engineering investment. This paradigm shift effectively eliminates the human-data bottleneck, as scalability becomes primarily constrained by available computational resources rather than by the number of engineers.

Challenges

Autonomy 2.0 leverages a learning-native software stack that keeps improving overall system performance with more data, and a digital-twin platform that supplies such data with realism and controllability. This powerful combination allows Autonomy 2.0 to achieve economies of scale. However, it also faces several notable challenges.

  • Data Collection Requirements: While Autonomy 2.0 significantly enhances data efficiency, we believe it does not fully eliminate the need for real-world data collection. High-quality, diverse data is essential for training high-fidelity neural simulation models and narrowing the Sim2Real gap. Despite promising advancements in generative AI, such as multimodal large language models and neural radiance fields (NeRFs), these approaches still struggle to adequately capture the unpredictability and dynamic interactions of the real world. The variability in physical environments, the computational demands of high-fidelity models, and the continuous necessity for real-world data to refine simulations suggest that fully bridging the Sim2Real gap will be particularly challenging. In other words, real-world data is the ultimate source of rare, challenging scenarios, which also define the boundary of both our problem understanding and autonomous system capability. We cannot completely replace real-world data with simulation; a hybrid approach that combines large-scale simulation with selective real-world testing is critical.

  • System Inspection and Debuggability: Autonomy 2.0 offers impressive scalability, but to maintain engineering productivity and product quality, we believe there is an urgent need for appropriate data infrastructure, effective development tools, and comprehensive evaluation metrics. We advocate for an infrastructure for the design and deployment of autonomous machines that creates automated workflows and tools to streamline the development process. Achieving this will not only enhance productivity and reduce errors but also ensure reliable performance through rigorous testing and validation. Such R&D platforms have emerged in several domains with high engineering complexity and high business value, including Web search engines, ads ranking systems, recommendation systems, and EDA, but Autonomy 2.0 still lacks one.

  • Architectural Considerations: End-to-end architectures, such as vision-language-action models with multimodality, have shown significant progress in recent years. Specifically, vision-language-action models incorporate natural language and physical action as additional data modalities, enhancing the system’s ability to understand and interact with the physical world. However, there is no clear consensus yet on whether such end-to-end architectures need customized modules such as modality-specific encoders and decoders. We also believe that the fundamental differences between perception and action tasks, along with their distinct algorithmic requirements, are likely to persist even within a fully end-to-end framework. Perception can fully leverage supervised learning, since history is static and unique, whereas planning for future action involves decision making among multiple possibilities and hence requires reinforcement learning. These differences necessitate distinct model optimizations and simulation needs.

    References

    • 1. Brown, T. et al. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020).
    • 2. Chen, L. et al. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems 34 (2021).
    • 3. Feng, S. et al. Dense reinforcement learning for safety validation of autonomous vehicles. Nature 615, 7953 (2023).
    • 4. Levine, S. et al. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643 (2020).
    • 5. Liu, S. Shaping the outlook for the autonomy economy. Commun. ACM 67, 6 (June 2024).
    • 6. Ma, Y. et al. A survey on vision-language-action models for embodied AI. arXiv preprint arXiv:2405.14093 (2024).
    • 7. Mildenhall, B. et al. NeRF: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 65, 1 (2021).
    • 8. Vaswani, A. et al. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
    • 9. Yu, B. et al. Building the computing system for autonomous micromobility vehicles: Design constraints and architectural optimizations. In Proceedings of the 2020 53rd Annual IEEE/ACM Intern. Symp. on Microarchitecture (MICRO). IEEE (2020).
    • 10. Yu, B. et al. Autonomous vehicles digital twin: A practical paradigm for autonomous driving system development. Computer 55, 9 (2022).
