A 2021 paper in Nature by Mirhoseini, Goldie, et al.30 about the use of reinforcement learning (RL) in the physical design of silicon chips raised eyebrows, drew critical media coverage, and stirred up controversy due to poorly documented claims. The paper, authored by Google researchers, withheld critical methodological steps and most inputs needed to reproduce its results. Our meta-analysis shows how two separate evaluations filled in the gaps and demonstrated that Google RL lags behind human chip designers, a well-known algorithm (simulated annealing), and generally available commercial software, while also being slower. Crosschecked data indicate that the integrity of the Nature paper is substantially undermined, owing to errors in conduct, analysis, and reporting. Before publishing, Google rebuffed internal allegations of fraud, which still stand. We note policy implications.
Key Insights
A Nature paper from Google with revolutionary claims in AI-enabled chip design was heralded as a breakthrough in the popular press, but it was met with skepticism from domain experts for being too good to be true and for lacking reproducible evidence.
Now, crosschecked data indicate that the integrity of the Nature paper is substantially undermined owing to errors in conduct, analysis, and reporting. Independently, detailed allegations of fraud and research misconduct in the Google Nature paper have been filed under oath in California.
Nature has been slow to enforce its own policies. Delaying retractions of problematic publications is distorting the scientific process. Swift and decisive action is necessary to maintain the integrity and credibility of scientific research.
As AI applications demand greater compute power, efficiency may be improved via better chip design. The Nature paper was advertised as a chip-design breakthrough using machine learning (ML). It addressed a challenging problem to optimize locations of circuit components on a chip and described applications to five tensor processing unit (TPU) chip blocks, implying that no better methods were available at the time in academia or industry. The paper generalized the claims beyond chip design to suggest that RL outperforms the state of the art in combinatorial optimization. “Extraordinary claims require extraordinary evidence” (per Carl Sagan) but the paper lacked results on public test examples (benchmarks16) and did not share the proprietary TPU chip blocks used. Source code—released seven months after publication13 to support the paper’s findings after the initial controversy14,36,37,39,42—was missing key parts needed to reproduce the methods and results (as explained in Cheng et al.11 and Goth18). More than a dozen researchers14,18,36,42 from Google and academia questioned the claims of Mirhoseini, Goldie, et al.,30 performed experiments, and raised concerns5,11 about the reported research. Google engineers have updated their open source13 many times since, filling in some missing pieces but not all.11 The single open source chip-design example in the Google repository13 does not clearly show strong performance of Google’s RL code.11 Apparently, the only openly claimed independent (of Google) reproduction of techniques in Mirhoseini, Goldie, et al.30 was developed in Fall 2022 by UCSD researchers.11 They reverse-engineered key components missing from Google’s open source code13 and fully reimplemented the simulated annealing (SA) baseline11 absent in the code.13 Google released no proprietary TPU chip design blocks used in Mirhoseini, Goldie, et al. (nor sanitized equivalents), ruling out full external reproduction of results. 
The UCSD Team therefore shared27 their experiments on modern, public chip designs, in which both SA and commercial electronic design automation (EDA) tools outperformed the Google RL code.13
Reporters from The New York Times and Reuters covered this controversy in 202214,42 and found that, well before the Nature submission, several Google researchers (see Table 1) disputed the claims they had been tasked with checking. The paper’s two lead authors complained of persistent allegations of fraud in their research.39 In 2022, Google fired the internal whistleblower14,42 and denied publication approval for a paper written by Google researchers critical of Mirhoseini, Goldie, et al.5 The whistleblower sued Google for wrongful termination under California whistleblower-protection laws: Court documents,37 filed under penalty of perjury, detail allegations of fraud and scientific misconduct related to research in Mirhoseini, Goldie, et al.30 The 2021 Nature News & Views article introducing the paper in the same issue urged replication of the paper’s results. Given the obstacles to replication and the results of replication attempts,11 the author of the News & Views article retracted it. On Sept. 20, 2023, Nature added an online Editor’s Note20 to the paper:
“Editor’s Note: Readers are alerted that the performance claims in this article have been called into question. The Editors are investigating these concerns, and, if appropriate, editorial action will be taken once this investigation is complete.”
A year later (late September 2024), as this article goes to print, the Editor’s Note was removed from the Nature article, but an authors’ addendum appeared. This addendum largely repeats the arguments from an earlier statement17 discussed in the section on the authors’ responses to critiques. There is little for us to modify in this article: none of the major concerns about the Nature paper have been addressed. In particular, “results” on one additional proprietary TPU block with undisclosed statistics do not support any substantiated conclusions. This only aggravates concerns about cherry-picking and misreporting. The release of a pre-trained model without information about pre-training data aggravates concerns about data contamination, since any circuit could have been used in pre-training and then in testing. We do not comment on the recent Google blog post,a except that it repeats the demonstrably false claim of a full source-code release that allows one to reproduce the results in the Nature paper. Among other pieces, source code for SA is missing, and the Nature results cannot be reproduced without proprietary training data and test data.
This article first covers the background and the chip-design task solved in the Nature paper and then introduces secondary sources used.5,11,27,46 Next, the article lists initial suspicions about the paper and shows that many of them were later confirmed. The article then checks if Mirhoseini, Goldie, et al. improved the state of the art, outlines how the authors responded, and discusses possible uses of the work in practice. Finally, the article draws conclusions and notes policy implications.
Background
Components of integrated circuits (ICs) include small gates and standard cells, as well as memory arrays and reusable subcircuits. In physical design,23 they are represented by rectangles within the chip canvas (Figures 1 and 2). Connections between components are modeled by the circuit netlist before wire routes are known. A netlist is an unordered set of nets, each naming components that should be connected. The length of a net depends on components’ locations and on wire routes; long routes are undesirable. The macro placement problem addressed in the paper seeks (x, y) locations for large circuit components (macros) so that their rectangles do not overlap, and the remaining components can be well-placed to optimize chip layout.22,28,33
Circuit placement as an optimization task. After (x, y) locations of all components are known, wires that connect components’ I/O pins are routed. Routes impact chip metrics (for power, timing/speed, and so on). The optimization of (x, y) locations starts with simplified estimates of wirelength without wire routes. Pin locations (x1, y1) and (x2, y2) may be connected by horizontal and vertical wire segments in many ways, but the shortest route length is |x1 − x2 | + |y1 − y2 |.
For multiple pin locations {(xi, yi)}i, this estimate generalizes to
$$\mathrm{HPWL} = \left(\max_i x_i - \min_i x_i\right) + \left(\max_i y_i - \min_i y_i\right) \tag{1}$$
HPWL stands for half-perimeter wirelength, where the perimeter is taken of the bounding box of points {(xi, yi)}i.23,28,33 It is easy to compute and sum over many nets. This sum correlates with total routed wirelength reasonably well. When (x, y) locations are scaled by a factor γ > 0, HPWL also scales by γ, which makes HPWL optimization scale-invariant and appropriate for all semiconductor technology nodes.b Algorithms that optimize HPWL extend to more precisely optimize routed wirelength and technology-dependent chip metrics, so HPWL optimization is a precursor:4,10,22,28
To test new placement methods; once HPWL results are close to the best known, accurate metrics are used for evaluation; or
Followed by optimizations of advanced objectives that extend HPWL, for example, the RL proxy cost function in Mirhoseini, Goldie, et al.
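The HPWL estimate of Equation 1 and its scaling property can be illustrated with a short Python sketch. The component names, coordinates, and the two-net netlist are hypothetical, and each component is treated as a single pin at its center, which is a simplifying assumption.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def net_hpwl(pins: List[Point]) -> float:
    """Half-perimeter of the bounding box of one net's pin locations."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(netlist: List[List[str]], locations: Dict[str, Point]) -> float:
    """Sum of per-net HPWL; each net is a list of component names."""
    return sum(net_hpwl([locations[c] for c in net]) for net in netlist)


# Hypothetical toy layout: three components, two nets.
locations = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (1.0, 6.0)}
netlist = [["a", "b"], ["a", "b", "c"]]

base = total_hpwl(netlist, locations)

# Scaling every location by gamma > 0 scales HPWL by the same factor.
gamma = 2.5
scaled_locations = {k: (gamma * x, gamma * y) for k, (x, y) in locations.items()}
scaled = total_hpwl(netlist, scaled_locations)
print(base, scaled)
```

For a two-pin net, `net_hpwl` reduces to |x1 − x2| + |y1 − y2|, the shortest route length given earlier.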
Widely adopted optimization frameworks for placement do not use ML4,22,23,28,33 and can be classified as: simulated annealing, partitioning-driven, and analytical. Simulated annealing, developed in the 1980s24,25,38 and dominant through the mid-1990s,45 starts with an initial layout (for example, random) and alters it by a sequence of actions, such as component moves and swaps, of prescribed length. To improve the final result, some actions may sacrifice quality to escape local minima. SA excels on smaller layouts (up to 100K placeable components) but takes a long time for large layouts. Partitioning-driven methods3 view the circuit connectivity (the netlist) as a hypergraph and use established software packages to subdivide it into partitions with more connections within the partitions (not between). These methods run faster than SA, capture global netlist structures, and were dominant for some 10 years. Yet, the mismatch between partitioning and placement objectives (Equation 1) leaves room for improvement.3 Analytical methods approximate Equation 1 by closed-form functions amenable to established optimization methods. Force-directed placement12 from the 1980s models nets by springs and finds component locations to reconcile spring forces.23 In the 2000s, advanced analytical placement techniques attained superiority10,22,28,33 on all large, public benchmark sets, including those with macros and routing data.10 RePlAce10 from UCSD is much faster than SA and partitioning-based methods, but lags in quality on small netlists.
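The simulated-annealing loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the SA baseline from any of the cited works: components occupy slots of a hypothetical 3×3 grid, the only action is a swap, the objective is HPWL, and the geometric cooling schedule and its parameters are placeholders.

```python
import math
import random
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def hpwl(netlist: List[List[str]], loc: Dict[str, Point]) -> float:
    """Total half-perimeter wirelength over all nets."""
    total = 0.0
    for net in netlist:
        xs = [loc[c][0] for c in net]
        ys = [loc[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def anneal(netlist, loc, steps=5000, t_start=5.0, t_end=0.01, seed=0):
    """Swap-based annealing; returns the best cost and layout seen."""
    rng = random.Random(seed)
    loc = dict(loc)
    names = sorted(loc)
    cost = hpwl(netlist, loc)
    best_cost, best_loc = cost, dict(loc)
    alpha = (t_end / t_start) ** (1.0 / steps)  # geometric cooling factor
    t = t_start
    for _ in range(steps):
        a, b = rng.sample(names, 2)
        loc[a], loc[b] = loc[b], loc[a]          # propose a swap
        delta = hpwl(netlist, loc) - cost
        # Accept improvements always; accept some worsening moves
        # to escape local minima (Metropolis criterion).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cost += delta
            if cost < best_cost:
                best_cost, best_loc = cost, dict(loc)
        else:
            loc[a], loc[b] = loc[b], loc[a]      # reject: undo the swap
        t *= alpha
    return best_cost, best_loc


# Hypothetical instance: a chain of nine components on a 3x3 grid of slots.
names = [f"c{i}" for i in range(9)]
slots = [(float(x), float(y)) for y in range(3) for x in range(3)]
shuffled = slots[:]
random.Random(1).shuffle(shuffled)
start = dict(zip(names, shuffled))
netlist = [[names[i], names[i + 1]] for i in range(8)]

start_cost = hpwl(netlist, start)
best_cost, best_loc = anneal(netlist, start)
print(start_cost, best_cost)
```

Because swaps only permute slot assignments, the layout stays overlap-free throughout. Raising `steps` gives the search a larger runtime budget.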
The Nature paper focuses on large circuit components (macros) among numerous small components. The fixed-outline macro-placement problem, which was formulated in the early 2000s,1,21,44 places all components onto a fixed-size canvas (prior formulations could stretch the canvas). It is now viewed as part of mixed-size placement.3 A 2004 benchmark suite2 for testing mixed-size placement algorithms evaluates the HPWL objective (Equation 1) which, as noted above, is apt for all semiconductor technology nodes. The suite has enjoyed significant use in the literature, for example Cheng et al.,10 Kahng,22 and Markov et al.28
Commercial and academic software for placement is developed to run on modest hardware within reasonable runtime. The methods and software in Mirhoseini, Goldie, et al. consume significantly greater resources. In comparisons, SA at least makes it straightforward to obtain progressively better results with a larger runtime budget.
Circuit metrics for evaluating optimization results include circuit timing and dynamic power. Unlike power, timing metrics are sensitive to long/slow paths taken by signal transitions in a circuit and are difficult to predict before detailed placement and wire routing. Accurate early estimation of circuit metrics is a popular topic in the research literature but remains an unsolved challenge in physical design because metric values depend on the actual decisions by optimizers. For example, decisions on which wires take the shortest routes and which ones get detoured determine which pairs of wires experience crosstalk and which signal paths become slow.23 Because of this estimation difficulty, optimization methods with closed-form objectives are fundamentally limited in what they can achieve, and circuit implementation may need to be redone when routing cannot be completed or timing constraints cannot be satisfied.22
Key sources. To solve mixed-size placement, the Nature paper first places macros and then places small components with commercial software. It places numerous macros with an RL action policy that is iteratively improved (fine-tuned) at the same time. The RL policy can be pre-trained on prior circuits or initialized “from scratch.” The iterative process runs for a set time (or until no change) and optimizes a fixed (not learned) proxy cost function that blends HPWL, component density, and routing congestion. To evaluate this function, the small components are placed with force-directed placement. The paper claims that RL beats three baselines: (1) macro placement by human chip designers, (2) parallel SA, and (3) RePlAce software from UCSD, which uses no RL.
Among secondary sources discussed in the context of Mirhoseini, Goldie, et al., we prefer scholarly papers5,11,46 but also draw on open source repositories and include FAQs as needed.13,27,c Here, all benchmark sets have hundreds of macros per design, compared to only a handful in sets such as ISPD 2015. We crosscheck claims from three nonoverlapping groups of researchers: Google Team 1 (Mirhoseini, Goldie, et al. and CT), Google Team 2 (Bae et al.5), and the UCSD Team (Cheng et al.11 and the Macro Placement Repo; see Table 1). Consistent claims from different groups are even more trustworthy when backed by numerous benchmarks. Both Google Team 2 and the UCSD Team included highly cited experts on floor-planning and placement, with extensive publication records that include several key references cited in Mirhoseini, Goldie, et al. (such as Cheng et al.,10 Markov et al.,28 and others), as well as experience developing academic and commercial floor-planning and placement tools beyond Google.
Google Team 1 (Nature authors + coauthors) | Google Team 2 + external coauthors | UCSD Team |
---|---|---|
Circuit Training (CT) repo and FAQ13 ISPD 2022 paper46 | Stronger Baselines5 | MacroPlacement repo and FAQ27 ISPD 2023 paper11 |
Four proprietary TPU blocks30 Ariane (public)13—all with numerous macros | 20 proprietary TPU blocks 17 public IBM circuits2 all with numerous macros | All with numerous macros: 17 public IBM circuits2 |
Initial Doubts
While the Nature paper was sophisticated and impressive, its research plan had notable shortfalls. For one, proposed RL was presented as being capable of broader combinatorial optimization (a field that includes puzzle-like tasks such as the Traveling Salesperson Problem, Vertex Cover, and Bin Packing). But instead of illustrating this with key problem formulations and easy-to-configure test examples, it solved a specialty task (macro placement for chip design) for proprietary Google TPU circuit design blocks, providing results on five blocks out of many more available. The RL formulation did not track chip metrics and optimized a simplified proxy function that included HPWL, but it was not evaluated for pure HPWL optimization on open circuit examples, as is routine in the literature.3,4,10,16,22,28,33 New ideas in placement are usually evaluated in research contests on industry chip designs released as public benchmarks,22,33 but Mirhoseini, Goldie, et al. neglected these contest benchmarks.
Some aspects of Mirhoseini, Goldie, et al. looked suspicious: The paper did not substantiate several claims, withheld key aspects of experiments, claimed improvements in noisy metrics that the proposed technique did not optimize, relied on techniques with known handicaps that undermined performance in similar circumstances, and may have misconfigured and underreported its baselines. We spell these out and confirm many of them later in the article.
Unsubstantiated claims and insufficient reporting. Serious omissions are clear even without a background in chip design.
U1. With “fast chip design” in the title,30 the authors only described improvement in design-process time as “days or weeks to hours” without giving per-design time or breaking it down into stages. It was unclear if “days or weeks” for the baseline design process included the time for functional design changes, idle time, inferior EDA tools, and so on.
U2. The claim of RL runtimes per testcase being under six hours (for each of five TPU design blocks)30 excluded RL pre-training on 20 blocks (not amortized over many uses, as in some AI applications). Pausing the clock for pre-training (not used by prior methods) was misleading. Also, RL runtimes only cover macro placement, but RePlAce and industry tools place all circuit components.
U3. Mirhoseini, Goldie, et al. focused on placing macros but withheld the number, sizes, and shapes of macros in each TPU chip block, and other key design parameters such as area utilization.
U4. Mirhoseini, Goldie, et al. gave results on only five TPU blocks, with unclear statistical significance, even though high-variance metrics produce noisy results (Table 2). Using more examples is common (Table 1).
Chip Metrics → | Area | Routed Wirelength | Power | WNS | TNS |
---|---|---|---|---|---|
Rank correlation to RL proxy cost | 0.00 | 0.28 | 0.05 | 0.20 | 0.05 |
Mean μ | 247.1K | 834.8 | 4,978 | -100 | -65 |
Standard deviation σ | 1.652K | 4.1 | 272 | 28 | 36.9 |
σ/\|μ\| | 0.01 | 0.00 | 0.05 | 0.28 | 0.57 |
U5. Mirhoseini, Goldie, et al. was silent on the qualifications and level of effort of the human chip designer(s) outperformed by RL. Reproducibility aside, those results could be easily improved (as shown in Cheng et al.11 later).
U6. Mirhoseini, Goldie, et al. claimed improved “area”, but chip area, macro area, and standard-cell area did not change during placement (also see the 0.00 correlation in Table 2).
U7. For iterative algorithms that optimize results over time, fair comparisons show per testcase: better-quality metrics with equal runtime, better runtime with equal quality, or wins for both. Mirhoseini, Goldie, et al. offered no such evidence. In particular, if ML-based optimization is used with extraordinary compute resources, then so should be optimization by SA in its most competitive form.
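The per-testcase criterion in U7 amounts to a dominance check over quality and runtime. The sketch below is one possible encoding; the `Run` record, the verdict labels, and the example numbers are hypothetical and are not data from any cited work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    cost: float       # quality metric on one testcase, lower is better
    runtime_h: float  # runtime on the same testcase, lower is better


def compare(new: Run, base: Run) -> str:
    """Classify `new` against `base` on a single testcase.

    "win": no worse in either dimension and strictly better in at least one
    (better quality at equal runtime, better runtime at equal quality, or both).
    """
    better = new.cost < base.cost or new.runtime_h < base.runtime_h
    worse = new.cost > base.cost or new.runtime_h > base.runtime_h
    if better and not worse:
        return "win"
    if worse and not better:
        return "loss"
    if not better and not worse:
        return "tie"
    return "inconclusive"  # a trade-off: better in one dimension, worse in the other


# Hypothetical examples.
print(compare(Run(0.9, 2.0), Run(1.0, 2.0)))  # better quality, equal runtime
print(compare(Run(0.9, 6.0), Run(1.0, 2.0)))  # better quality, but slower
```

An "inconclusive" verdict is the case in which one method should be rerun with the other's runtime budget before any claim of improvement is made.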
A flawed optimization proxy. The chip design methodology in Mirhoseini, Goldie, et al. uses physical synthesis to generate circuits for further layout optimization (physical design). The proposed RL technique places macros of those circuits to optimize a simplified proxy cost function. Then, a commercial EDA tool is invoked to place the remaining components (standard cells). The remaining operations (including power-grid design, clock-tree synthesis, and timing closure4,23) are outsourced to an unknown third party.30,35 Results are evaluated with respect to routed wirelength, area, power, and two circuit-timing metrics: TNS and WNS.d Per Mirhoseini, Goldie, et al., the proxy cost function did not perform circuit-timing analysis23 needed to evaluate TNS and WNS.e Therefore, it was misleading to claim in Mirhoseini, Goldie, et al. that the proposed RL method led to TNS and WNS improvements on five TPU design blocks without performing variance-based statistical significance tests (TNS and WNS were optimized at later steps unrelated to RL30).
Use of limited techniques. To experts, the methodology in Mirhoseini, Goldie, et al.30 looked to have shortcomings: Using outdated methods made it harder to improve the state of the art (SOTA).
H1. Proposed RL used exorbitant CPU/GPU resources compared to SOTA. Hence, the “fast chip design” claim (presumably due to fewer unsuccessful design attempts) required careful substantiation.
H2. Placing macros one by one (a type of constructive floor-planning23) is one of the simplest approaches. SA can swap and shift macros and make other incremental changes. Analytical methods relocate many components at once. One-by-one placement looked handicapped even when driven by deep RL.
H3. Mirhoseini, Goldie, et al. used circuit-partitioning (clustering) methods similar to partitioning-based methods from 20+ years ago.3,4,23 Those techniques are known to diverge from interconnect optimization objectives.3,23 By placing macros using a clustered netlist without gradual layout refinement, RL runs into the same problem.
H4. Mirhoseini, Goldie, et al. limited macro locations to a coarse grid, whereas SOTA methods10 avoid such a constraint. In Figure 1 (left) macros are placed freely, but a coarse grid used by Google’s RL implementation tends to spread macros apart and disallow large regions for cells, such as in the center of Figure 1 (left). Figure 2 illustrates the difference. Even if RL can run without gridding, it might not scale to large enough circuits without coarse gridding.
H5. The use of force-directed placement from the 1980s12 in Mirhoseini, Goldie, et al. left much room for improvement.
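The effect of coarse gridding in H4 can be illustrated with a small sketch. The mapping used here (snap each macro center to the center of its enclosing grid cell) is an assumption made for illustration; it is not taken from Google's implementation, and the canvas size, grid dimensions, and macro coordinates are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def snap_to_grid(x: float, y: float, canvas_w: float, canvas_h: float,
                 cols: int, rows: int) -> Point:
    """Return the center of the grid cell that contains (x, y)."""
    cell_w = canvas_w / cols
    cell_h = canvas_h / rows
    col = min(int(x // cell_w), cols - 1)  # clamp points on the right edge
    row = min(int(y // cell_h), rows - 1)  # clamp points on the top edge
    return ((col + 0.5) * cell_w, (row + 0.5) * cell_h)


CANVAS_W, CANVAS_H = 100.0, 100.0
COLS, ROWS = 4, 4  # hypothetical coarse grid: 25 x 25 cells

# Two macros packed near a corner, one macro elsewhere (center coordinates).
macros: Dict[str, Point] = {"m0": (5.0, 5.0), "m1": (15.0, 5.0), "m2": (60.0, 99.0)}

snapped = {name: snap_to_grid(x, y, CANVAS_W, CANVAS_H, COLS, ROWS)
           for name, (x, y) in macros.items()}

# Macros that land in the same cell cannot both stay there.
by_cell: Dict[Point, List[str]] = defaultdict(list)
for name, cell in snapped.items():
    by_cell[cell].append(name)
collisions = {cell: names for cell, names in by_cell.items() if len(names) > 1}
print(snapped, collisions)
```

In this toy case the two tightly packed macros map to the same cell, so one of them has to move to another cell. This is the spreading effect that a free (ungridded) placement avoids.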
Questionable baselines. The Nature paper used several baselines to claim the superiority of proposed techniques. We already mentioned that the human baseline was undocumented and not reproducible.
B1. Key results in Mirhoseini, Goldie, et al. and in Table 1 give chip metrics for five TPU design blocks. But comparisons to SA do not report those chip metrics.
B2. Mirhoseini, Goldie, et al. mentions that RL results were post-processed by SA but lacks ablation studies to evaluate the impact of SA on chip metrics.
B3. RePlAce10 was used as a baseline in Mirhoseini, Goldie, et al. in a way inconsistent with its intended use. As previously explained, analytical methods do well on circuits with millions of movable components, but RePlAce was not intended for clustered netlists with a reduced number of components: It should be used directly sans clustering (for details, see Bae et al. and Cheng et al.10,11). Clustering can worsen results due to a mismatch between placement and partitioning objectives,3 and by unnecessarily creating large clusters that are hard to pack without overlaps.
B4. Mirhoseini, Goldie, et al. did not describe how macro locations in SA were initialized, suggesting that the authors used a naive approach that could be improved. Later, Bae et al. identified more handicaps in the SA baseline, and Cheng et al.11 confirmed them.
Additional Evidence
Months after the Nature publication, more data became available in Bae et al., Google’s documentation and open source code,13 Nature peer review,35 and in Yue et al.,46 followed by the first wave of controversial media coverage.14,39,42 Nature editors released the peer review file for Mirhoseini, Goldie, et al., including authors’ rebuttals. In the lengthy back-and-forth,35 the authors assured reviewers that macro locations were not modified after placement by RL, confirming coarse-grid placement of macros. Among several contributions, Bae et al.5 implemented the request of Nature Reviewer #335 and benchmarked Google’s technique on 17 public chip-design examples:2 Prior methods decisively outperformed Google RL. American and German professors publicly expressed doubts about the Nature paper.14,42 As researchers noted gaps in the Google open source release,13 such as the grouping (clustering) flow, Google engineers released more code (but not all), prompting more questions. Another year passed, and initial suspicions were expanded11,27 by showing that when macro placement is not limited to a grid, both human designers and commercial EDA tools (separately) outperform Google code.13 In Table 2 of Cheng et al.,11 the authors estimated rank correlation of the proxy cost function optimized by RL to chip metrics used in Table 1 of the Nature paper. Cheng et al.,11 in Table 3, estimated the mean and standard deviation for chip metrics after RL-based optimization. A summary is provided in this article (Table 2), where rank correlations are low for all chip metrics, while TNS and WNS are noisy. Hence, the optimization of TNS and WNS in Mirhoseini, Goldie, et al. relied on a flawed proxy and produced results of dubious statistical significance (see Table 1 in Mirhoseini, Goldie, et al.). We note that σ/|μ| > 0.5 for TNS on Ariane-NG45, as well as on BlackParrot-NG45 in Table 3 of Cheng et al. In additional critical media coverage, Mirhoseini, Goldie, et al. was questioned by three U.S. professors.18,36
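The noise ratio σ/|μ| quoted above can be recomputed from the means and standard deviations summarized in Table 2 of this article. The sketch below only redoes that arithmetic; the dictionary layout and function name are illustrative choices.

```python
# (mean, standard deviation) per chip metric, copied from Table 2 of this article.
table2 = {
    "Area": (247.1e3, 1.652e3),
    "Routed wirelength": (834.8, 4.1),
    "Power": (4978.0, 272.0),
    "WNS": (-100.0, 28.0),
    "TNS": (-65.0, 36.9),
}


def noise_ratio(mean: float, std: float) -> float:
    """Coefficient of variation sigma/|mu|; larger values mean noisier metrics."""
    return std / abs(mean)


ratios = {metric: noise_ratio(mu, sigma) for metric, (mu, sigma) in table2.items()}

# List metrics from noisiest to least noisy.
for metric, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{metric}: {ratio:.2f}")
```

The two timing metrics come out as the noisiest by a wide margin, which is why improvements claimed on a handful of testcases are hard to distinguish from run-to-run variation.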
↓ Designs / Tools → | Google CT/RL | Cadence CMP | UCSD SA |
---|---|---|---|
Ariane-NG45 | 32.31 | 0.05 | 12.50 |
BlackParrot-NG45 | 50.51 | 0.33 | 12.50 |
MemPool-NG45 | 81.23 | 1.97 | 12.50 |
Undisclosed use of (x, y) locations from commercial tools. The UCSD paper11 cites strong evidence, along with confirmation by Google engineers, that the authors withheld a critical detail. When clustering the input netlist, CT merge in Google code13 read in a placement to restructure clusters based on locations. To produce (x, y) locations of macros, the paper’s authors used initial (x, y) locations of all circuit components (including macros) produced by commercial EDA tools from Synopsys.13 The lead authors of Mirhoseini, Goldie, et al. confirmed using this step, claiming it was unimportant.17 But it improved key metrics by 7–10% in Cheng et al.11 So, the results in Mirhoseini, Goldie, et al. needed algorithmic steps that were not included in the paper, such as obtaining (x, y) data from commercial software.
More undocumented techniques were itemized in Cheng et al.,11 which mentioned discrepancies between the Nature paper, their source code,13 and the actual code used for chip design at Google. These discrepancies included specific weights of terms in the proxy cost function, a different construction of the adjacency matrix from the circuit, and several “blackbox” elements13 available as binaries with no source code or full description in Mirhoseini, Goldie, et al. Bae et al., Cheng et al.,11 and the Macro Placement Repo27 offer missing descriptions. Moreover, Mirhoseini, Goldie, et al.’s results did not match the methods used because key components were not mentioned in the paper. And neither results nor methods were reproducible from descriptions alone.
Data leakage between training and test data? Per Mirhoseini, Goldie, et al., “as we expose the policy network to a greater variety of chip designs, it becomes less prone to overfitting.” But Google Team 1 showed later in Yue et al.46 that pre-training on “diverse TPU blocks” did not improve quality of results. Pre-training on “previous netlist versions” improved quality somewhat. Pre-training RL and evaluating it on similar designs could be a serious flaw in methodology of Mirhoseini, Goldie, et al. As Google did not release proprietary TPU designs or per-design statistics, we cannot compare training and test data.
Likely limitations. Mirhoseini, Goldie, et al. did not disclose major limitations of its methods but promised success in broader combinatorial optimization. The Ariane design image in Mirhoseini, Goldie, et al. shows macro blocks of identical sizes: a potential limitation, given that commercial chip designs often use a variety of macro sizes. Yet, they do not report basic statistics per TPU block: the number of macros and their shapes, design area utilization, and the fraction of area taken by macros. Based on peer reviews35 and the guidance from Google engineers to the authors of Cheng et al.,11 it appears that TPU blocks had lower area utilization than in typical commercial chip designs. Poor performance of Google RL on challenging public benchmarks from Adya and Markov2 used in Bae et al. and Cheng et al.11 (illustrated in Figure 2) suggests undisclosed limitations. Another possible limitation is poor handling of preplaced (fixed) macros, common in industry layouts, but not discussed in Mirhoseini, Goldie, et al. By interfering with pre-placed macros, gridding (see H4) can impact usability in practice. Poor performance on public benchmarks may also be due to overfitting to proprietary TPU designs.
A middling simulated annealing baseline. The “Stronger Baselines paper”5 from Google Team 2 improved the parallel SA used by Google Team 1 in Mirhoseini, Goldie, et al. by adding “move” and “shuffle” actions to “swap,” “shift,” and “mirror” actions. This improved SA typically produces better results than RL in a shorter amount of time when optimizing the same objective function. Cheng et al.11 reproduced qualitative conclusions of Bae et al. with an independent implementation of SA and found that SA results had less variance than RL results. Additionally, Bae et al. suggested a simple and fast macro-initialization heuristic for SA and equalized compute times when comparing RL to SA. Given that SA was widely used in the 1980s and 1990s, comparing to a weak SA baseline contributed to overestimating the new RL technique.
Did the Nature Paper Improve State of the Art?
The Nature editorial15 discussing the paper speculated that “this is an important achievement and will be a huge help in speeding up the supply chain.” But today, after evaluations and reproduction attempts at multiple chip-design and EDA companies, it is safe to conclude that no important achievement occurred because prior chip-design software, particularly from Cadence Design Systems, produced better layouts faster.11,27 If this were known to the paper’s reviewers or to the public, the paper’s claims of improving TPU designs would be nonsensical. The Nature paper claimed that humans produced better results than commercial EDA tools but gave no substantiation. When license terms complicate publishing comparisons to commercial EDA tools,f one compares to academic software and to other prior methods, with the proviso that small improvements are not compelling. Google Team 2 and the UCSD Team took different approaches to comparing methods from the Mirhoseini paper to baselines,5,11,27 but cumulatively reported comparisons to commercial EDA tools, to human designers, prior university software, and to two independent custom implementations of SA.
Google Team 25 followed the descriptions in Mirhoseini and did not supply initial placement information. The UCSD Team11,27 sought to replicate what Google actually did to produce results (lacking details in Mirhoseini, Goldie, et al.).
Google Team 2 had access to TPU design blocks and demonstrated5 that the impact of pre-training was small at best.g
The UCSD Team11,27 lacked access to Google training data and code but followed Google instructions by Google Team 113 for obtaining results similar to those in Mirhoseini, Goldie, et al. without pre-training. They also reimplemented SA following instructions by Google Team 25 and introduced several new chip-design examples (Table 1).
Comparisons using chip metrics were made against a commercial EDA tool (Cadence CMP),11,27 which outperformed Google RL. When running RePlAce in this context, the UCSD Team11 used only macro locations produced by RePlAce and placed standard cells with the same commercial software used after Google CT/RL13,30 (more details below).
The UCSD Team repeated SA vs. RL comparisons for several configurations11,27 (those in Mirhoseini, Goldie, et al., those in the Github repo,13 and additional ones suggested by Google engineers). The results were consistent.
A chip designer from IBM outperformed Google RL,11,13,27 whereas Bae et al. did not use human baselines.
For comparisons that can be crosschecked, Bae et al., Cheng et al.,11 and the Macro Placement Repo27 report qualitatively similar conclusions. RePlAce was used in the Nature paper in a way inconsistent with its intended use.5 With proper use of RePlAce, Bae et al. and, independently, Cheng et al.11 produce strong results for RePlAce on well-known public ICCAD 2004 benchmarks. The implementation of simulated annealing used in the Nature paper was handicapped.5 Removing the handicaps (in the same source code base) improved results. When properly implemented, SA produces better solutions than Google CT/RL13 using less runtime, when both are given the same proxy cost function. This is shown consistently in Bae et al. and Cheng et al.11 on 17 widely used ICCAD 2004 benchmarks2 and on several modern design benchmarks.11 Compared to Google CT/RL,13 SA consistently improves wirelength and power metrics. For circuit-timing metrics TNS and WNS, SA produces less noisy results that are comparable to RL’s results.11 Recall that the proxy function optimized by SA and RL does not include timing metrics,30 making any claims of improvement in these metrics due to SA or RL dubious. Improving upon SOTA requires improving upon all prior baselines.
Google CT/RL failed to improve upon human baselines, commercial EDA tools, and SA in solution quality. It did not improve SOTA in runtime either (Table 3), and the authors did not disclose per-design data or design-process time. RePlAce and SA gave stronger baselines than described in the paper, when configured/implemented well.
Rebuttals to Critiques of the Nature Paper
Despite critical media coverage14,31,36,42 and technical questions raised, the authors failed to remove remaining obstacles to reproducibility18 of the methods and results in Mirhoseini, Goldie, et al. The UCSD Team’s engineering effort overcame those obstacles, and they followed up on the work of Google Team 25 that criticized the Nature paper, then analyzed many of the issues. Google Team 2 had access to Google TPU designs and the source code used in the paper before the CT GitHub repo appeared. The UCSD authors of Cheng et al.11 and the Macro Placement Repo27 had access to circuit training (CT)13 and benefited from a lengthy involvement of Google Team 1 engineers, but had no access to the SA code used in Bae et al. or Mirhoseini, Goldie, et al. or to other key pieces of code missing from the CT framework.13 Yet, the results in Bae et al., Cheng et al.,11 and the Macro Placement Repo27 corroborate each other, and their qualitative conclusions are consistent. UCSD results for Ariane-NG45 closely match those by Google Team 1 engineers, and Figure 4 of Cheng et al.11 shows that CT training curves of Ariane-NG45 generated at UCSD match those produced by Google Team 1 engineers. Google Team 1 engineers carefully reviewed the paper11 and the work in Fall 2022 and Winter 2023, raising no objections.27
The two lead authors of the Nature paper left Google in August 2022, but in March 2023 they objected to the results in Cheng et al.11 without remedying the original work’s deficiencies. Those objections were addressed promptly in the FAQ section of the Macro Placement Repo,27 for example, in #6, #11, #13, #15. One issue was the lack of pre-training in experiments in Cheng et al.11
Pre-training. Cheng et al.11 performed training using code and instructions in Google’s Circuit Training (CT) repo,13 which states (June 2023): “The results below are reported for training from scratch, since the pre-trained model cannot be shared at this time.”
Per the MacroPlacement FAQ in the Macro Placement Repo, Cheng et al.11 did not use pretraining because, per Google’s CT FAQ,13 pre-training was not needed to reproduce results of Mirhoseini, Goldie, et al. Also, Google did not release pre-training data.
Google Team 25 evaluated pre-training using Google-internal code and saw no impact on comparisons to SA or RePlAce.
Google Team 1 showed46 that pre-training on “diverse TPU blocks” did not improve results, only runtime. Pre-training on “previous netlist versions” gave small improvement. No such previous versions were discussed, disclosed or released in the CT documentation13 or in the paper itself.30
In other words, the lead authors of the Nature paper want others to use pre-training while they did not describe it in detail sufficient for reproduction, did not release code or data for it, and have shown that it does not improve results in the context of their claims. In September 2024 (years after the publication), the authors announced the release of a pre-trained model but not the pre-training data. Hence, we cannot ensure that a particular example used for testing was not used in pre-training.
Old benchmarks. Another objection31 is that public circuit benchmarks2 used in Bae et al.5 and Cheng et al.11 allegedly use outdated infrastructure. In fact, those benchmarks2 have been evaluated with the HPWL objective, which scales accurately under geometric 2D scaling of chip designs and remains appropriate for all technology nodes (Section 2). ICCAD benchmarks were requested35 by Peer Reviewer #3 of the paper. When Bae et al. and Cheng et al.11 implemented this ask, Google RL ran into trouble before routing became relevant: RL lost by 20% or so in HPWL optimization (HPWL is the simplest yet most important term of the proxy cost optimized by CT/RL13,30).
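The scaling argument can be checked directly. HPWL is a sum of bounding-box half-perimeters, so multiplying every coordinate by a factor s multiplies HPWL by s and leaves the relative comparison between two placements unchanged. The sketch below uses a made-up four-macro netlist and two made-up placements; none of these values come from the benchmarks discussed above.

```python
import math

def hpwl(nets, pos):
    """Half-perimeter wirelength: sum of net bounding-box half-perimeters."""
    total = 0.0
    for net in nets:
        xs = [pos[m][0] for m in net]
        ys = [pos[m][1] for m in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def scale(pos, s):
    """Uniform 2D geometric scaling of all coordinates by factor s."""
    return {name: (x * s, y * s) for name, (x, y) in pos.items()}

# Hypothetical netlist and two competing placements of the same macros.
nets = [("m1", "m2", "m3"), ("m2", "m4"), ("m1", "m4")]
placement_a = {"m1": (0, 0), "m2": (4, 0), "m3": (4, 3), "m4": (0, 3)}
placement_b = {"m1": (0, 0), "m2": (2, 0), "m3": (2, 1), "m4": (0, 1)}

s = 0.5  # e.g. shrinking the design to half its linear dimensions
ratio_before = hpwl(nets, placement_a) / hpwl(nets, placement_b)
ratio_after = hpwl(nets, scale(placement_a, s)) / hpwl(nets, scale(placement_b, s))

# HPWL scales linearly with s, so the A-to-B ratio is unchanged.
print(ratio_before, ratio_after)
```

A method that loses on HPWL at one scale therefore loses by the same ratio at any other scale, which is why the age of the benchmark geometry does not affect this particular comparison.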
Not training until convergence in experiments in Cheng et al.11 This concern was promptly addressed in FAQ #15 in the Macro Placement Repo: “ ‘training until convergence’ is not described in any of the guidelines provided by the CT GitHub repo for reproducing the results in the Nature paper.”27 Cheng et al. followed the guidelines provided by Google in the CT repo. Later, their additional experiments indicated that “training until convergence worsens some key chip metrics while improving others, highlighting the poor correlation between proxy cost and chip metrics. Overall, training until convergence does not qualitatively change comparisons to results of Simulated Annealing and human macro placements reported in the ISPD 2023 paper.” RL-vs-SA experiments in Bae et al. predated the CT framework, so RL was trained until convergence per the six-hour protocol from Mirhoseini, Goldie, et al.
Computational resources used in the Nature paper were exorbitant and difficult to replicate. Since both RL and SA algorithms produce valid solutions early and then gradually improve the proxy function, the best-effort comparisons in Cheng et al.11 used smaller computational resources than in Mirhoseini, Goldie, et al., with parity between RL and SA. The result: SA beat RL. Bae et al.5 compared RL to SA using the same computational resources as Mirhoseini, Goldie, et al. Results in Cheng et al.11 were consistent with Bae et al.5 If given greater resources, SA and RL are unlikely to further improve chip metrics due to the poor correlation between chip metrics and the proxy function from Mirhoseini, Goldie, et al.
The paper’s lead authors mention in Goldie and Mirhoseini17 that the paper is heavily cited, but they cite no positive reproductions outside Google that cleared all known obstacles. Bae et al. and Cheng et al. do not discuss other ways to use RL in IC design, so we avoid general conclusions.
Can the Work in the Nature Paper Be Used?
The Nature paper claimed applications to recent Google TPU chips, providing credence to the notion that those methods improved state of the art. But aside from vague general claims, no chip-metric improvements were reported for specific production chips.h This article (see section titled “Did the Nature Paper Improve State of the Art?”) shows that the methods in the paper, and in the framework, lag behind SOTA, for example, simulated annealing from the 1980s.24,25,38,45 Moreover, a strong Google internal implementation of SA from Bae et al. could serve as a drop-in replacement of RL in the framework and of the Nature paper. We try to reconcile the claimed use in TPUs with Google CT/RL lagging behind SOTA.5,11
Given the high variance of chip-timing metrics TNS and WNS in RL results (due to low correlation with the proxy metric), trying many independent randomized attempts with variant proxy cost functions and hyperparameter settings may improve best-seen results,37 with much greater runtimes. But SA can also be used this way.
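The best-of-many effect is easy to illustrate. In the sketch below, each "run" is a hypothetical stand-in that draws a lower-is-better metric from a Gaussian distribution; the distribution, its parameters, and the function names are assumptions chosen for illustration and are not measurements from any of the cited experiments.

```python
import random

def noisy_metric(mean, stddev):
    """Hypothetical stand-in for one randomized placement run (lower is better)."""
    return lambda rng: rng.gauss(mean, stddev)

def best_of_n(sample_metric, n, seed=0):
    """Best (minimum) metric seen over n independent randomized attempts."""
    rng = random.Random(seed)
    return min(sample_metric(rng) for _ in range(n))

high_variance = noisy_metric(mean=100.0, stddev=30.0)
low_variance = noisy_metric(mean=100.0, stddev=5.0)

for n in (1, 10, 100):
    print(n, best_of_n(high_variance, n), best_of_n(low_variance, n))
```

The best-seen value can only improve as more attempts are added, and it improves more when the variance is high, while the total runtime grows in proportion to the number of attempts. Nothing in this argument is specific to RL: any randomized optimizer, SA included, benefits in the same way.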
Using in-house methods, even if inferior, is a common methodology in industry practice called dogfooding (“eat your own dogfood”). In most chips, some blocks are not critical (do not affect chip speed) and are good dogfooding candidates. This can explain selective “production use” and reporting.
The results of RL were postprocessed by SA,30 but the CT FAQ13 disclaimed this postprocessing: it was used in the TPU design flow but not when comparing RL to SA. But since full-fledged SA consistently beats RL,5,11 SA could substitute for RL (initial locations can be accommodated using an adaptive temperature schedule in SA).
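One common way to adapt SA to given initial locations is to choose the starting temperature from a target acceptance probability for uphill moves. The source does not specify which adaptive schedule is meant, so the rule below (temperature set so that an average uphill move is accepted with probability p) and the function name are assumptions used only to illustrate the idea.

```python
import math

def initial_temperature(uphill_deltas, accept_prob):
    """Temperature at which an average uphill move is accepted with accept_prob.

    uphill_deltas: positive cost increases sampled from random moves
    applied to the starting placement.
    """
    if not 0.0 < accept_prob < 1.0:
        raise ValueError("accept_prob must be strictly between 0 and 1")
    if not uphill_deltas or min(uphill_deltas) <= 0:
        raise ValueError("need at least one positive cost increase")
    mean_delta = sum(uphill_deltas) / len(uphill_deltas)
    # Solve exp(-mean_delta / T) = accept_prob for T.
    return -mean_delta / math.log(accept_prob)

# Made-up cost increases sampled around a starting placement.
deltas = [2.0, 4.0, 6.0]
t_from_scratch = initial_temperature(deltas, accept_prob=0.8)  # explore widely
t_refine = initial_temperature(deltas, accept_prob=0.05)       # preserve a good start
print(t_from_scratch, t_refine)
```

A low target acceptance probability yields a low starting temperature, so SA refines a supplied placement instead of scrambling it, while a high value recovers the usual from-scratch behavior.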
Google Team 1’s follow-up46 shows (in Figure 7) that pre-training improves results only when pre-training on essentially the same design. Perhaps, Google is leveraging RL when performing multiple revisions to IC designs—a valid context, but not described in the Nature paper. Moreover, commercial EDA tools are orders of magnitude faster than RL when running from scratch,11 so pre-training RL does not close the gap.
Can Google CT/RL code13 be improved? RL and SA are orders of magnitude slower than SOTA (Table 3), but pre-training (missing in CT) speeds up RL46 by only several times. The CT repository now contains attempted improvements, but we have not seen serious improvements to chip metrics. Four major barriers to improving the CT repository and the paper remain:
The proxy cost optimized by RL does not reflect circuit timing,11 so improving RL may not help to improve TNS and WNS.
SA outperforms RL when optimizing a given proxy function.5,11 Hence, RL may lose even with a better proxy.
RL’s placement of macros on a coarse grid limits their locations (Figure 2). When a human ignored the coarse grid, they found better macro locations.11 Commercial EDA tools also avoid this limitation and outperform Google CT/RL.
Clustering as a preprocessing step creates mismatches between placement and netlist partitioning objectives.3,23
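The coarse-grid barrier can be illustrated with a toy example: when candidate locations are restricted to a coarse grid, the best reachable location can be substantially worse than the best location on a finer grid that contains the same points. The single movable macro, the fixed pin coordinates, the grid pitches, and the Manhattan-distance cost in this sketch are all illustrative assumptions and do not model the CT grid or proxy cost.

```python
def wirelength_to_pins(x, y, pins):
    """Sum of Manhattan distances from a macro at (x, y) to fixed pins."""
    return sum(abs(x - px) + abs(y - py) for px, py in pins)

def best_on_grid(pins, size, pitch):
    """Exhaustively pick the lowest-cost location among grid points."""
    candidates = [
        (x, y)
        for x in range(0, size + 1, pitch)
        for y in range(0, size + 1, pitch)
    ]
    return min((wirelength_to_pins(x, y, pins), (x, y)) for x, y in candidates)

# Hypothetical fixed pins that the movable macro connects to.
pins = [(5, 6), (6, 5), (7, 7)]

coarse_cost, coarse_xy = best_on_grid(pins, size=16, pitch=4)
fine_cost, fine_xy = best_on_grid(pins, size=16, pitch=1)
print("coarse grid:", coarse_cost, coarse_xy)
print("fine grid:  ", fine_cost, fine_xy)
```

Since every coarse grid point is also a fine grid point here, the fine-grid optimum can never be worse, and in this instance it is strictly better because the ideal location falls between coarse grid points.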
Conclusions
This meta-analysis discusses the reproduction and evaluation of results in the Nature paper by Mirhoseini, Goldie, et al.30, as well as the validity of methods, results, and claims. In the paper, we find a smorgasbord of questionable practices in ML,26 including irreproducible research practices, multiple variants of cherry-picking, misreporting, and likely data contamination (leakage). Based on crosschecked newer data, we draw conclusions with ample redundancy (resistant to isolated mistakes): the paper’s integrity is substantially undermined owing to errors in the conduct, analysis and reporting of its study. Omissions, inconsistencies, mistakes, and misrepresentations impacted their methods, data, results, and interpretation.
Conclusions about the Nature paper. Google Team 2 had access to Google internal code whereas Cheng et al.11 reverse-engineered and/or reimplemented missing components. Google Team 2 and the UCSD Team drew consistent conclusions from similar experiments, and each team made additional observations. We crosscheck the results reported in Google Team 2 and the UCSD Team and also account for the CT framework,13 Nature peer reviews,35 and Yue et al.46, and then summarize conclusions drawn from these works. This confirms many of the initial doubts about the claims and identifies additional deficiencies. As a result, it is clear that the Nature paper by Mirhoseini, Goldie, et al. is misleading in several ways, such that the readers can have no confidence in its top-line claims and conclusions. Mirhoseini, Goldie, et al. did not improve SOTA while the methods and results of the original paper were not reproducible from the descriptions provided, contrary to stated editorial policies at Nature (see below). The reliance on proprietary TPU designs for evaluation, along with insufficient reporting of experiments, continues to obstruct reproducibility of the methods and the results. Attempts by the authors of the Nature paper to invalidate critiques have been unsuccessful. Surprisingly, the authors of Mirhoseini, Goldie, et al. have not offered new compelling empirical results one-and-a-half years since the publication of Cheng et al.11
Implications for chip design. Our work highlights deficiencies only in the approach of the Nature paper. But a 2024 effort from China43 compared seven techniques for mixed-size placement using their new independent evaluation framework with 20 circuits (seven with macros). End-to-end results for chip metrics show that ML-based techniques lag behind RePlAce10 (embedded in OpenROAD) and other optimization-based techniques: DREAMPlace (a GPU-based variant of the RePlAce algorithm) and AutoDMP (a Bayesian Optimization wrapper around DREAMPlace). Despite the obvious need to replicate the methods of Mirhoseini, Goldie, et al., the authors of Wang et al.43 were unable to provide such results.
Policy implications. Theoretical arguments and empirical evidence suggest that numerous published papers across various fields cannot be replicated and may be incorrect.34,41 As a case in point, the Nature paper aggravated the reproducibility crisis that is undermining trust in published research.8,34 Retraction Watch tracks 5,000 retractions per year, including prominent cases of research misconduct.19,34 “Research misconduct is a serious problem and (probably) getting worse”,8 which makes it even more important to separate honest mistakes from deliberate exaggerations and misconduct.6,7,19,40 Institutional response is needed,40,41 including clarity in Nature retraction notices.32
Nature Portfolio editorial policies should be followed broadly and rigorously. Quoting from Nature Portfolio (https://go.nature.com/4dshcXv):
“An inherent principle of publication is that others should be able to replicate and build upon the authors’ published claims. A condition of publication in a Nature Portfolio journal is that authors are required to make materials, data, code, and associated protocols promptly available to readers without undue qualifications[…] After publication, readers who encounter refusal by the authors to comply with these policies should contact the chief editor of the journal.”
Specifically for Mirhoseini, Goldie, et al., the Nature editorial15 insisted that “the technical expertise must be shared widely.” But when manuscript authors neglect requests for public benchmarking and obstruct reproducibility, their technical claims should be viewed with suspicion (especially if they later disagree with comparisons to their work17). This point has already been made in a Communications news article.18 Per peer review file,35 the acceptance of the Nature paper30 was conditional on the release of code and data, but this did not happen when Mirhoseini, Goldie, et al.30 was published or later.11 The Nature paper was amended by the authors to claim that the code had been made available (see the “Data and Code Availability” disclaimer). But serious omissions remain in the released code. This is particularly concerning because the paper omitted key comparisons and details, and fraud was alleged under oath in a California court by a Google whistleblower tasked with evaluating the project.37 This makes reproducibility more critical.
It is in everyone’s interest to reach clear and unequivocal conclusions about published scientific claims, free of misrepresentations. Authors, Nature editors and reviewers, and the research community, share the burden of responsibility. Seeking the truth is a shared obligation.6,40
Acknowledgments
This meta-analysis would be impossible without the hard work and dedication to science of the authors of Bae et al. and Cheng et al.