Method for synthesis of fault-tolerant topologies with implicit clusters based on de Bruijn transformations in redundant numeral systems

Authors

DOI:

https://doi.org/10.18372/2073-4751.80.19784

Keywords:

topology, efficiency, fault tolerance, survivability, de Bruijn sequences

Abstract

This work develops a method for synthesizing topologies based on de Bruijn transformations in redundant numeral systems. The method produces fault-tolerant topologies of a given order, including topologies with implicit clusters. The work also develops a method for forming such clusters and a method for studying the survivability of topologies, in which failure probabilities are determined dynamically from betweenness centrality.

The proposed comprehensive approach yields graphs that contain less redundancy yet have higher survivability, because the available redundancy is used more effectively. As a result, fault tolerance can be increased at lower cost and with better efficiency.
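To make the survivability study more concrete, the following minimal sketch shows one possible reading of "dynamic determination of failure probability based on betweenness centrality". It is not the authors' implementation. It assumes a plain undirected binary de Bruijn graph instead of the redundant-numeral-system construction, and it assumes that each node's failure probability is proportional to its current betweenness, recomputed after every failure. All function names are hypothetical.

```python
from collections import deque
import random

def de_bruijn_graph(k, n):
    """Undirected de Bruijn graph B(k, n): nodes are base-k words of length n
    (encoded as integers); v is linked to its shifts (v*k + d) mod k**n."""
    size = k ** n
    adj = {v: set() for v in range(size)}
    for v in range(size):
        for d in range(k):
            w = (v * k + d) % size
            if w != v:                      # drop self-loops such as 00..0
                adj[v].add(w)
                adj[w].add(v)
    return adj

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalised)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        dist, order = {s: 0}, []
        sigma = dict.fromkeys(adj, 0)       # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):           # accumulate dependencies backwards
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # each pair was counted twice

def largest_component(adj):
    """Size of the largest connected component."""
    seen, best = set(), 0
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        queue, size = deque([s]), 0
        while queue:
            v = queue.popleft()
            size += 1
            for w in adj[v] - seen:
                seen.add(w)
                queue.append(w)
        best = max(best, size)
    return best

def simulate_failures(adj, failures, rng):
    """Fail nodes one at a time. ASSUMPTION: failure probability is
    proportional to betweenness, recomputed on the surviving graph each step."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    history = []
    for _ in range(failures):
        bc = betweenness(adj)
        nodes = list(bc)
        weights = [bc[v] + 1e-9 for v in nodes]   # zero-centrality nodes stay eligible
        victim = rng.choices(nodes, weights=weights, k=1)[0]
        for w in adj.pop(victim):
            adj[w].discard(victim)
        history.append((victim, largest_component(adj) / len(adj)))
    return history

graph = de_bruijn_graph(2, 4)
for node, share in simulate_failures(graph, failures=4, rng=random.Random(0)):
    print(f"failed node {node:04b}: largest component holds {share:.0%} of survivors")
```

The share of surviving nodes that remain in the largest connected component is used here as a simple survivability indicator. The paper may use a different metric.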

Published

2025-03-15

Issue

Section

Articles