Neurons: real or artificial? Credit: IBM Research

Light and In-Memory Computing Help AI Achieve Ultra-Low Latency

By Abu Sebastian

Ever noticed that annoying lag that sometimes happens when you stream, say, your favorite football game online?

Called latency, this brief delay between a camera capturing an event and the event being shown to viewers is irritating during the decisive goal at a World Cup final. But it could be deadly for a passenger in a self-driving car that detects an object on the road ahead and sends images to the cloud for processing, or for a patient whose brain scans are being evaluated by a medical application after a hemorrhage.

Our team, which combines scientists from the universities of Oxford, Muenster and Exeter with scientists from IBM Research, has developed a way to dramatically reduce latency in AI systems. We’ve done it using photonic integrated circuits, which compute with light instead of electricity. In a recent Nature paper, we detail our combination of photonic processing with what’s known as the non-von Neumann, in-memory computing paradigm, demonstrating a photonic tensor core that can perform computations with unprecedented, ultra-low latency and compute density.

Our tensor core runs computations at a processing speed higher than ever before. It performs key computational primitives associated with AI models such as deep neural networks for computer vision in less than a microsecond, with remarkable areal and energy efficiency.

When light plays with memory

While scientists first started tinkering with photonic processors back in the 1950s, with the first laser built in 1960, in-memory computing (IMC) is a more recent kid on the block. IMC is an emerging non-von Neumann compute paradigm where memory devices, organized in a computational memory unit, are used for both processing and memory. This way, the physical attributes of the memory devices are used to compute in place.

By removing the need to shuttle data back and forth between memory and processing units, IMC could bring significant latency gains even with conventional electronic memory devices. The IBM AI Hardware Center, a collaborative research hub in Albany, NY, is actively working in this field.

Artistic depiction of the photonic tensor core with each element representing a photonic memory unit performing in-memory computing using light. Different colors show wavelength division multiplexing. Credit: IBM Research

However, the combination of photonics with IMC could further address the latency issue — so efficiently that photonic in-memory computing could soon play a key role in latency-critical AI applications. Together with in-memory computing, photonic processing overcomes the seemingly insurmountable barrier to the bandwidth of conventional AI computing systems based on electronic processors.

This barrier is due to the physical limits of electronic processors, as the number of GPUs one can pack into a computer or a self-driving car isn’t endless. This challenge has recently prompted researchers to turn to photonics for latency-critical applications. An integrated photonic processor has much higher data modulation speeds than an electronic one. It can also run parallel operations in a single physical core using what’s called ‘wavelength division multiplexing’ (WDM) — technology that multiplexes a number of optical carrier signals onto a single optical fiber by using different wavelengths of laser light. This way, it provides an additional scaling dimension through the use of the frequency space. Essentially, we can compute using different wavelengths, or colors, simultaneously.
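To make the idea of computing on several wavelengths at once more concrete, here is a minimal numerical analogy in Python. It is only a sketch: the function names, the 2×3 weight matrix and the wavelength labels are hypothetical, and the loop merely stands in for what the optical hardware would do simultaneously. Each wavelength carries its own input vector, and all of them pass through the same stored matrix in what we treat as one time step.

```python
# Toy numerical analogy only: real hardware does this with light, not loops.
# All names and values below are hypothetical and chosen for illustration.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def wdm_step(matrix, inputs_by_wavelength):
    """Apply one shared matrix to every wavelength channel in one 'time step'.

    inputs_by_wavelength maps a wavelength label (here in nm, hypothetical)
    to the input vector carried on that wavelength.
    """
    return {wl: matvec(matrix, vec) for wl, vec in inputs_by_wavelength.items()}


# Hypothetical 2x3 weight matrix held in the memory elements.
weights = [
    [1, 0, 2],
    [0, 1, 1],
]

# Three hypothetical wavelength channels, each carrying its own input vector.
channels = {
    1550: [1, 2, 3],
    1551: [0, 1, 0],
    1552: [2, 2, 2],
}

outputs = wdm_step(weights, channels)
for wavelength, result in outputs.items():
    print(wavelength, result)
```

The point of the sketch is that the weights are stored once and reused by every channel, so adding a wavelength adds throughput without adding another physical core.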

In 2015, researchers from Oxford University, the University of Muenster and the University of Exeter developed a photonic phase-change memory device that could be written to and read from optically. Then in 2018, Harish Bhaskaran at Oxford (who is also a former IBM researcher) and I found a way to perform in-memory computing using photonic phase-change memory devices.

Together with Bhaskaran, Prof. Wolfram Pernice of Muenster University and Prof. David Wright from Exeter, we initiated a research program that culminated in the current work. I fondly remember some of the initial discussions with Prof. Pernice on building a photonic tensor core for convolution operations while walking the streets of Florence in early 2019. Bhaskaran’s and Pernice’s teams made significant experimental progress over the following months.

But there was a challenge: the availability of light sources for WDM. These are required for feeding the input vectors into a tensor core. Luckily, the chip-scale frequency combs using nonlinear optics developed by Prof. Tobias Kippenberg from the Swiss Federal Institute of Technology (EPFL) in Lausanne provided the critical breakthrough that overcame this issue.

Leaping into the future

Armed with these tools, we demonstrated a photonic tensor core that can perform a so-called convolution operation in a single time step. Convolution is a mathematical operation on two functions that outputs a third function expressing how the shape of one is changed by the other. An operation for a neural network usually involves simple addition or multiplication. One neural network can require billions of such operations to process one piece of data, for example an image. We use a measure called TOPS, short for tera (trillion) operations per second, to express how many operations a chip can process each second.
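For readers who want to see why a convolution maps so naturally onto a matrix, here is a minimal sketch in plain Python. It rests on assumptions that are ours for illustration: the image, the four 3×3 kernels and the function names are hypothetical, and the kernel is not flipped, which is the convention commonly used in neural networks. Each 3×3 window of the image is flattened into a vector of nine values, and the four flattened kernels together form a 9×4 matrix, so every window needs just one matrix-vector product.

```python
# Sketch of convolution expressed as matrix-vector products.
# The image, kernels and function names are hypothetical examples.

def extract_patches(image, k):
    """Flatten every k x k window of a 2D image (list of rows) into a vector."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(rows - k + 1):
        for c in range(cols - k + 1):
            patches.append([image[r + i][c + j] for i in range(k) for j in range(k)])
    return patches


def convolve_as_matvec(image, kernels, k):
    """Apply several k x k kernels via one matrix-vector product per patch.

    kernels is a list of flattened kernels, each of length k*k. Stacked
    together they form a (k*k) x len(kernels) matrix, e.g. 9 x 4 here.
    No kernel flip is applied (the usual neural-network convention).
    """
    return [
        [sum(p * w for p, w in zip(patch, kernel)) for kernel in kernels]
        for patch in extract_patches(image, k)
    ]


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

kernels = [
    [0, 0, 0, 0, 1, 0, 0, 0, 0],      # identity: picks the centre pixel
    [1] * 9,                          # box filter: sums the window
    [-1, 0, 1, -1, 0, 1, -1, 0, 1],   # horizontal difference
    [-1, -1, -1, 0, 0, 0, 1, 1, 1],   # vertical difference
]

result = convolve_as_matvec(image, kernels, 3)
for row in result:
    print(row)
```

In this software version the patches are processed one after another. In a tensor core of the kind described here, the matrix stays in place and the input vectors are what stream through it.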

In our proof of concept, we obtained the experimental data with matrices of up to 9×4 in size, with a maximum of four input vectors per time step. We used non-volatile photonic memory devices based on phase-change memory to store the convolution kernels on a chip and used photonic chip-based frequency combs to feed in the input encoded in multiple frequencies. Even with the tiny 9×4 matrix, by employing four multiplexed input vectors and a modulation speed of 14 GHz, we obtained a processing speed of two trillion MAC (multiply-accumulate) operations per second, or 2 TOPS. The result is impressive precisely because the matrix is so tiny: each time step involves few operations, but the steps follow one another so quickly that the TOPS figure is still very large.
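The 2 TOPS figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes that each of the four multiplexed input vectors passes through the full 9×4 matrix once per modulation cycle, so that one cycle accounts for 9 × 4 × 4 MAC operations. The variable names are ours, chosen for illustration.

```python
# Back-of-the-envelope throughput estimate from the figures quoted above.
# Assumption: every multiplexed input vector sees the full matrix each cycle.

matrix_rows = 9
matrix_cols = 4
parallel_inputs = 4                 # input vectors multiplexed per time step
modulation_rate_hz = 14 * 10**9     # 14 GHz modulation speed

macs_per_step = matrix_rows * matrix_cols * parallel_inputs
macs_per_second = macs_per_step * modulation_rate_hz

print(macs_per_step)                # MAC operations in one time step
print(macs_per_second / 10**12)     # throughput in trillions of MACs per second
```

Under these assumptions a single time step holds only 144 MAC operations, and the total comes to about 2.016 trillion MAC operations per second, consistent with the rounded 2 TOPS quoted above.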

And this is just the beginning. We expect that with reasonable scaling assumptions, we can achieve an unprecedented PetaMAC (thousand trillion MAC operations) per second per mm². In comparison, the compute density associated with state-of-the-art AI processors is less than 1 TOPS/mm², meaning less than a trillion operations per second per mm².

Our work shows the enormous potential for photonic processing to accelerate certain types of computations such as convolutions. The challenge going forward is to string together these computational primitives and still achieve substantial end-to-end system-level performance. This is what we’re focused on now.

This research was partially funded by the European Union’s Horizon 2020 research and innovation program (Fun-COMP project, Grant Number 780848).

Feldmann, J., Youngblood, N., Karpov, M. et al. Parallel convolutional processing using an integrated photonic tensor core. Nature 589, 52–58 (2021).

This blog was first published on IBM Research
