Image Filtering in FPGAs

Image filters such as Gaussian blurring, median filtering, and morphological operations are indispensable tools in image pre-processing. With today's increasing bandwidth requirements, such pre-processing tasks can be offloaded to a parallel processing hardware monster known as an FPGA (Field Programmable Gate Array).

FPGAs are reconfigurable devices that contain small Look-Up Tables (LUTs), small memories (registers), larger block RAM (BRAM), and dedicated multipliers. The user essentially programs the connections and parameters of these blocks. FPGAs excel at highly parallel processing tasks and, although they typically operate at lower clock frequencies than CPUs, they can outperform a CPU substantially, sometimes by orders of magnitude, on tasks that parallelize well.

From a purely software perspective, images are 2D arrays that reside in memory. A 3×3 convolution in software pseudo-code would look as shown below. Each output pixel takes 9 multiplications and 8 additions, and these are executed more or less serially.

for (y = 1; y < ImageHeight-1; y++) {
    for (x = 1; x < ImageWidth-1; x++) {
        /* weighted sum of the 3×3 neighbourhood centred on (x, y) */
        O[x][y] = c0*I[x-1][y-1] + c1*I[x][y-1] + c2*I[x+1][y-1] +
                  c3*I[x-1][ y ] + c4*I[x][ y ] + c5*I[x+1][ y ] +
                  c6*I[x-1][y+1] + c7*I[x][y+1] + c8*I[x+1][y+1];
    }
}

From an FPGA's point of view, however, images are simply pixel streams that arrive in raster-scan order, from top-left to bottom-right, one pixel (or one packet of pixels) at a time. Since FPGAs are rather limited in internal memory, they typically store a few lines rather than whole frames; when a full frame must be stored, external memory is used.


A 3×3 convolution in an FPGA would look as follows:

The line buffers act as delay lines: each one delays the incoming stream by one line width, so wider images require more memory. A 3×3 window needs the current line plus two delayed lines. The effect is a 3×3 window sliding over the image in raster-scan order. The resulting 3×3 window stream is then passed to the filter core, which performs the actual calculations.
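To make the sliding window concrete, here is a rough software model of two line buffers feeding a 3×3 window register. It is only a sketch in C, not HDL: the names `WIDTH`, `window3x3`, `window_init`, and `window_push` are made up for illustration, the image width is an assumed constant, and border handling is left out. Each call to `window_push` stands for one clock cycle with one pixel arriving.

```c
#include <stdint.h>
#include <string.h>

#define WIDTH 8  /* assumed image width, for illustration only */

typedef struct {
    uint8_t line1[WIDTH];  /* line buffer: pixels of the previous row     */
    uint8_t line2[WIDTH];  /* line buffer: pixels of the row before that  */
    uint8_t win[3][3];     /* win[0] = oldest row, win[2] = newest row    */
    int     col;           /* column of the next incoming pixel           */
} window3x3;

void window_init(window3x3 *w)
{
    memset(w, 0, sizeof *w);
}

/* Accept one pixel of the raster stream and update the 3x3 window. */
void window_push(window3x3 *w, uint8_t px)
{
    /* Read the two delayed pixels that sit in the same column. */
    uint8_t up1 = w->line1[w->col];  /* one line ago  */
    uint8_t up2 = w->line2[w->col];  /* two lines ago */

    /* Shift the window one column to the left. */
    for (int r = 0; r < 3; r++) {
        w->win[r][0] = w->win[r][1];
        w->win[r][1] = w->win[r][2];
    }

    /* Insert the new column on the right. */
    w->win[0][2] = up2;
    w->win[1][2] = up1;
    w->win[2][2] = px;

    /* Chain the line buffers: current pixel -> line1 -> line2. */
    w->line2[w->col] = up1;
    w->line1[w->col] = px;

    w->col = (w->col + 1) % WIDTH;
}
```

After the pixel at column x of row y has been pushed (with x ≥ 2 and y ≥ 2), the window holds the 3×3 neighbourhood centred on (x-1, y-1). Note that the storage cost is two lines of `WIDTH` pixels, regardless of the image height.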

For the filtering core itself, a Finite Impulse Response (FIR) filter is implemented: 9 multiplier blocks operate in parallel and feed an adder chain (or an adder tree). This lets us process 1 pixel every clock cycle, as shown below (register pipelining is not shown):
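The structure of such a core can be sketched in C as nine independent products followed by an adder tree. This is an illustrative model, not a hardware description: `fir3x3` is a hypothetical name, and the 8-bit pixels and 16-bit coefficients are assumptions. In hardware the nine products would be computed simultaneously, and each level of the tree would typically be one pipeline stage.

```c
#include <stdint.h>

/* 3x3 FIR core model: px[0..8] are the window pixels in raster order,
 * c[0..8] the matching coefficients (c0..c8 in the pseudo-code above). */
int32_t fir3x3(const uint8_t px[9], const int16_t c[9])
{
    int32_t p[9];

    /* Nine multipliers, one per tap (parallel in hardware). */
    for (int i = 0; i < 9; i++)
        p[i] = (int32_t)c[i] * px[i];

    /* Adder tree: 8 adders arranged in 4 levels instead of a chain of 8. */
    int32_t s0 = p[0] + p[1];   /* level 1 */
    int32_t s1 = p[2] + p[3];
    int32_t s2 = p[4] + p[5];
    int32_t s3 = p[6] + p[7];

    int32_t t0 = s0 + s1;       /* level 2 */
    int32_t t1 = s2 + s3;

    int32_t u0 = t0 + t1;       /* level 3 */

    return u0 + p[8];           /* level 4: the ninth product joins last */
}
```

The tree uses the same number of adders as a chain, but the longest path is 4 additions rather than 8, which generally makes it easier to pipeline at a high clock rate.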

That same architecture can be extended to support more taps, or several pixels in parallel (e.g. 8 or 16), making it a powerful, high-throughput, pixel-crunching machine!

Food for thought:

  • Dedicated multipliers can operate at higher frequencies than the rest of the FPGA fabric; hence you can use fewer multipliers and accumulators by running them at a faster rate and sharing them across several taps.
  • If your filter is symmetric, you can use a pre-adder to sum the pixels that share a coefficient before multiplying, which saves precious multipliers.
  • If your [3×3] 2D filter kernel is separable, you might save resources by performing a [3×1] convolution followed by a [1×3] convolution.
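The last two points can be illustrated with a small C sketch. It assumes a kernel that is both symmetric and separable, namely the outer product of the 1D filter [a, b, a] with itself (a = 1, b = 2 gives the familiar 1-2-1 smoothing kernel). The function names `sym3`, `sep3x3`, and `direct3x3` are hypothetical, and the direct version is included only as a reference to compare against.

```c
#include <stdint.h>

/* Symmetric 3-tap filter [a, b, a] with a pre-adder:
 * the two outer samples are added first, so 2 multiplies replace 3. */
int32_t sym3(int32_t x0, int32_t x1, int32_t x2, int32_t a, int32_t b)
{
    return a * (x0 + x2) + b * x1;
}

/* Separable 3x3 filter: a 1D pass along each row, then a 1D pass
 * across the three row results. In a streaming design each row result
 * is computed once and reused; it is recomputed here only for clarity. */
int32_t sep3x3(const uint8_t px[9], int32_t a, int32_t b)
{
    int32_t r0 = sym3(px[0], px[1], px[2], a, b);
    int32_t r1 = sym3(px[3], px[4], px[5], a, b);
    int32_t r2 = sym3(px[6], px[7], px[8], a, b);
    return sym3(r0, r1, r2, a, b);
}

/* Reference: the same kernel applied directly with 9 multiplies. */
int32_t direct3x3(const uint8_t px[9], int32_t a, int32_t b)
{
    const int32_t k[3] = { a, b, a };
    int32_t sum = 0;
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++)
            sum += k[r] * k[c] * px[3 * r + c];  /* K[r][c] = k[r] * k[c] */
    return sum;
}
```

In a streaming implementation, one 1D stage runs per incoming pixel in each direction, so under these assumptions the multiplier count drops from 9 (direct) to 6 (separable) and to 4 when each stage also uses a pre-adder.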

FPGA image processing can be performed inside smart frame grabbers and cameras. Compared to software implementations, this gives users the freedom and flexibility to tailor a processing stage to their bandwidth requirements. Keep in mind, though, that in most cases this amounts to a trade-off between speed and resources.


About Anthony

Anthony is an FPGA designer in the image processing group. He has studied both computer and electrical engineering and is mainly interested in image processing applications and the acceleration of such modules in hardware. His hobbies include soccer, martial arts, camping, and rock-climbing.
Posted by Anthony in Frame grabbers, Image processing.
