Over the past six years we have helped various customers solve their software performance problems. While each project was very different, there were 8 recurring reasons to hire us as performance engineers. These can be categorised into three groups:

Reduce processing time
- Meeting timing requirements
- Increasing user efficiency
- Increasing the responsiveness
- Reducing latency
Do more in the same time
- Adding extra functionality
- Increasing simulation/data sizes
Reduce operational costs
- Reducing the server count
- Reducing power usage

Let’s go into each of these.

Reduce processing time

When people hear about Software Performance Engineering, the first thing they think of is reducing processing time. Most of our customers hired us for this task, but each had different timing requirements.

Meeting timing requirements

The most common type of project has very specific timing requirements. Some examples:

- When there are 10 minutes to respond, the computations cannot take a second longer.
- For cloud applications, the available time is often less than half a second.
- When processing data for a customer using the latest info, it should take no longer than saying “Let me get your file”.
- When a plane lands, there should be no “Just hold a minute”.

We not only used GPUs (OpenCL or CUDA) to make our customers’ code faster, but also redesigned algorithms to be more efficient. In a few cases we could meet the timing requirement by optimising the code alone, so there was no longer any need to port to the GPU.

Increasing user efficiency

Getting from hours to minutes, or from minutes to seconds.

When employees are waiting for the computer to finish, this reduces efficiency. Some examples:

- Handling customer data faster at the reception desk, so more customers can be helped.
- Reducing a daily batch job from 2 hours to minutes, to minimise overtime.
- Importing new data into the system in less time, so the user can focus more on quality.

We found that after a speedup users feel less powerless when pressure increases, as they have more control over their time and the software. Before the speedup they felt controlled by the software.

Increasing the responsiveness

From seconds or even minutes to milliseconds.

When a system does not react immediately, trust in the system goes down. Some examples:

- In creative software the user needs to stay in a flow and should not have to wait after each small step. Waiting reduces the number of items one can keep in short-term memory.
- For data analysts, getting a feeling for the data takes some “wandering around”. Responsive software reduces learning time.

Besides standard performance engineering, there are more options to make the user interaction feel immediate and snappy. For instance, a quickly computed “low resolution” result can let the user preview the outcome while the full result is still being calculated.
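The “low resolution” preview idea can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names, the moving-average filter standing in for an expensive computation, and the preview step of 8 are all hypothetical and not taken from a customer project.

```python
def downsample(values, step):
    """Keep every `step`-th sample: a cheap, coarse view of the data."""
    return values[::step]


def expensive_filter(values):
    """Stand-in for a costly computation (hypothetical): 3-point moving average."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out


def progressive_results(values, preview_step=8):
    """Yield a coarse preview first, then the full-resolution result."""
    # The preview touches only 1/preview_step of the data, so it arrives sooner.
    yield "preview", expensive_filter(downsample(values, preview_step))
    # The full result follows and can replace the preview on screen.
    yield "full", expensive_filter(values)


data = [float(i % 16) for i in range(64)]
for stage, result in progressive_results(data):
    print(stage, len(result))
```

The user interface can show the preview as soon as it arrives and swap in the full result later, so the user keeps working instead of waiting.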

Reducing latency

Getting from seconds to milliseconds, or from milliseconds to microseconds.

Where responsiveness deals with users, latency describes automated systems where microseconds matter. The requirements on maximum processing time are often very strict, and real-time operating systems may be used. Some examples:

- Real-time image processing for video streams.
- Feedback loops for machine control to reduce operational errors.
- High-speed networking applications, such as finance.

We choose FPGAs when the latency needs to be in the low-microsecond range, and GPUs when it has to be in the millisecond range. Work we did here included porting from CPU to GPU and from GPU to FPGA.

Increase functionality and data sizes

The goal in this category is the same as in the previous one, but the problem is often described from the perspective of features and data size. This is because processing time is not seen as a current problem, but as a future one.

Adding extra functionality

Same processing time, more functionality.

The processing time is described as a disadvantage, but not as a complaint, since customers understand that the computations are intensive. At the same time, they request extra features. Some examples:

- Applying extra image improvement algorithms to the video stream.
- Applying an alternative algorithm to the same data.

In cases where we also optimised the existing software, the total processing time went down even though more data was processed.

Increasing simulation/data sizes

Same processing time, ten times the data.

Each year data sizes grow faster than the performance of computers. Where a single server was enough 3 years ago, a small cluster now has to be rented.

To cope with this explosive growth we are asked to remodel the software so that it ports well to current and future high-performance processors. Some examples:

- Promoting prototypes to production software.
- Going from 1D data to 2D and 3D.
- Cross-analysing 10,000 shoppers instead of 100.
- Running a weather model for the whole of Europe instead of a single country.
- Improving a stochastic model.
- Using higher-resolution data.

This is the most common type of problem we solve in this category. In particular, models that are proven to work (including prototypes) are often asked to handle larger data sets than they were designed for.

Reduce operational costs

When the operation scales up, the operational costs can increase exponentially. Some of our customers identified the problem well in advance and had us reduce their operational costs before they got out of hand.

Reducing the server count

Same performance, less hardware.

Processing data can take tens to hundreds of servers, which increases power and maintenance costs and thus lowers the performance/€ or performance/$.

- If the computations are the limiting factor, ten dual-socket servers can be replaced by a single GPU or FPGA server. Doubling the capacity then becomes much easier.
- Computations can be moved to the user, by doing pre-processing on the mobile phone or desktop. Using the device’s GPU keeps the battery from being drained.

We’ve helped early scale-ups who recognised that operational costs sky-rocket when the code is not optimised.

Reducing power usage

Same performance, fewer Watts.

It’s all about performance/Watt. Some examples:

- A GPU can draw 200 to 300 Watt, but an algorithm can finish in a tenth of the time it takes on a 100 Watt CPU, so the total energy used is lower.
- On smartphones, porting code to the GPU reduces power usage.
- When an FPGA can be used (e.g. for networking), there are options to replace the full server by a single 20-30 Watt FPGA.
- Porting code to GPUs with HBM, which uses much less power than conventional graphics memory.
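The GPU example above comes down to simple arithmetic: energy is power multiplied by time. The sketch below reads the example as a tenfold reduction in runtime, which is our assumption. The 250 Watt figure is just the midpoint of the 200 to 300 Watt range, and the 10-minute CPU job is a made-up number for illustration.

```python
def energy_joules(power_watt: float, runtime_s: float) -> float:
    """Energy used by a device drawing `power_watt` for `runtime_s` seconds."""
    return power_watt * runtime_s


# Illustrative numbers only, not measurements.
cpu_power_w = 100.0      # CPU from the example above
gpu_power_w = 250.0      # midpoint of the 200-300 Watt range
cpu_runtime_s = 600.0    # hypothetical 10-minute job on the CPU
speedup = 10.0           # assumed: the GPU finishes in a tenth of the time

gpu_runtime_s = cpu_runtime_s / speedup
cpu_energy = energy_joules(cpu_power_w, cpu_runtime_s)
gpu_energy = energy_joules(gpu_power_w, gpu_runtime_s)

print(f"CPU: {cpu_energy:.0f} J, GPU: {gpu_energy:.0f} J, "
      f"ratio: {cpu_energy / gpu_energy:.1f}x")
```

With these numbers the GPU draws 2.5 times more power but uses 4 times less energy for the same job. Real savings depend on the measured speedup and on the power drawn by the rest of the system.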

Reducing power usage on portable devices has been the most common use case here.

Recognise one of these problems?

Call or email us to start discussing your problem. After a one-week effort of analysing your code and talking with the developers, we can often give a good indication of how much time the performance improvement will take.

StreamHPC communications

The 8 reasons why our customers had their code written or accelerated by us