I want to build next-generation computer systems. This aspiration originated the first time I managed a cluster to accelerate scientific applications, and was further fostered during my time at Microsoft Research Asia (MSRA) and AWS Shanghai AI Lab (ASAIL), where I learned how emerging real-world scenarios challenge existing computer systems. With my research skills and taste shaped by experience in both industrial research labs (MSRA and ASAIL) and academia (ShanghaiTech University and the Max Planck Institute for Software Systems, MPI-SWS), I find rethinking system abstractions fascinating, and empirical research through full-system design and implementation deeply fulfilling. For these reasons I am determined to pursue a Ph.D. in computer systems.
As advocated in my favorite paper, “More Is Different” by physicist Philip W. Anderson, at each level of complexity, entirely new properties appear. Following a similar principle, instead of building all levels of computer systems from the same foundation, generations of computer scientists have introduced numerous levels of abstraction to harness the increasing complexity. However, a number of driving forces are creating cracks in the abstractions of current computer systems, and three of these excite me the most. First, emerging data-intensive applications such as machine learning call for abstractions tailored to their unique computational patterns in order to achieve higher performance. Second, as Moore’s Law comes to an end, we sometimes need to break the boundary between software and hardware to gain better end-to-end application performance. Lastly, for networked systems in data centers, certain abstraction layers may be unnecessary and can even be detrimental to latency and resource utilization, inspiring techniques such as kernel bypassing and resource disaggregation. Over the past few years, I have done research in several projects spanning these themes, primarily focused on systems design for machine learning. For my Ph.D. studies I want to take a broader perspective, while still solving systems challenges related to these three themes.
Tailoring abstractions for emerging machine learning workloads. I started my first research project in the summer of 2018 when I interned at MSRA. The project’s initial goal was to improve the resource utilization of DNN serving systems in Azure. Through profiling workloads on Azure, I identified a gap between the achieved throughput of the serving system and the capabilities of the underlying hardware, and decided to investigate how to make better use of the parallelism in DNNs to saturate the hardware. Gradually my mentors and I determined that the problem lay in the widely used “two-layer approach” for DNN execution, which schedules inter- and intra-operator parallelism separately, thereby causing the performance problems we observed. In other words, the lack of awareness between schedulers deployed in different abstraction layers prevents the serving system from effectively utilizing hardware. To address this we proposed new abstractions for both workloads and hardware execution units to holistically schedule inter- and intra-operator parallelism, and implemented them in Rammer, a compiler for DNNs. A key idea of Rammer is to move the scheduling logic originally executed at run time to compile time, where it is pre-determined. I built the first prototype of Rammer during my first internship, and went back to MSRA in the winter of 2019 to further optimize it and integrate it into the open-source NNFusion, which is used internally at Microsoft and has also attracted substantial attention from the broader systems community. The work on Rammer ultimately led to my co-first-author publication at OSDI ’20.
I further investigated the theme of tailored abstractions during my internship in 2020 at ASAIL. This research project began in a similar way: Graph Neural Networks (GNNs) are an increasingly important workload, yet they exhibited poor performance in practice at AWS. Unlike conventional DNNs, GNNs have irregular memory access and data movement patterns, which degrade system performance and cannot be optimized by existing compilers for conventional DNNs. To capture these workload patterns, I proposed the Message Passing Data Flow Graph (MP-DFG) abstraction, which enriches the traditional DFG abstraction with graph-based message passing semantics. Based on the new abstraction, we built a compiler stack called Graphiler that accelerates GNN programs by up to two orders of magnitude. I presented the preliminary version of this work at the first MLSys workshop on GNN systems, and the full version is currently under review.
Accelerating sparse computing through hardware and software co-design. Software-only optimization eventually reaches a hardware limit, and over the course of the Rammer project I became increasingly interested in optimization opportunities beyond software. This prompted me to take a course on reconfigurable computing and to look for related research projects. The opportunity arose towards the end of my time at MSRA, when I learned of an ongoing project on accelerating sparse computation in DNNs by designing specialized hardware and was able to join it. Sparse encoding for DNN computations has potential performance benefits, but currently even the most carefully engineered software implementations perform poorly. To make effective use of the relatively low level of sparsity in DNNs, we revised Nvidia’s Tensor Cores and proposed new instructions and a novel bitmap-based sparse encoding, as well as new SPMM and im2col algorithms. While working on this project, I learned a lot about hardware simulation, and contributed primarily to the software-side algorithm implementation and the final evaluation. This work was later published at ISCA ’21.
Understanding system performance at scale. The previous projects focused on a single-machine setting. After I saw Prof. Jonathan Mace’s presentation at OSDI ’20, I wanted to gain a better understanding of system performance at scale, and was recently able to come to Germany to join his group at MPI-SWS as a research intern. Using ideas and tools from distributed tracing, we reexamined distributed DNN training pipelines and compared them with data analytics workloads. We were surprised to find that resource sharing by co-locating workloads, a common practice in the cloud, may not be suitable for distributed DNN training, and we started to envision a cluster design based on resource disaggregation. In addition, I was involved in Prof. Antoine Kaufmann’s project on full-stack system simulation and contributed to performance modeling by using ideas I learned from distributed tracing. These experiences opened my mind and stimulated my interest in networked systems.
Teaching and Sharing. My first teaching assistant (TA) experience was in my sophomore year. Since then, I have found that teaching and sharing both provide value to others and are a great learning experience, which has made me an enthusiastic sharer. A blog post I wrote during OSDI ’20 about Rammer can serve as an example: it was read over 16,000 times and was selected for both the English and Chinese MSRA newsletters. During my undergraduate and Master’s studies, in addition to serving as a TA for four years in a row, I founded a student HPC club and organized a series of workshops for junior students. The experience of founding a club also demonstrated to me that diversity and inclusion are essential for a community to flourish. We made a deliberate effort to hear the voices of under-represented students and adjusted the content of events to be as inclusive as possible. In my Ph.D. journey, I will strive to make our community even better through teaching and sharing in an inclusive way.