Intel’s answer to AMD and NVIDIA: the XEON Phi 5110P

NOTE: there are many contradicting sources out there, so there may be mistakes in this article. Please give me feedback via Twitter, mail or the comments, so all the info can be completed.

Yes, another post in the answer-to series. At SC12 Intel tries to steal the show from the Tesla K20 and FirePro S10000.

After two years of waiting Intel finally comes with an accelerator card: the Xeon Phi. Compare it to NVIDIA having skipped the GTX 200 series and presenting the GTX 500 series now. Or maybe even the GTX 600 series – we cannot tell yet.

The Phi is not a compute card as we know it. Just as you cannot do a 1-to-1 comparison between AMD's GCN architecture and NVIDIA's Kepler, neither can easily be compared to the Phi. But this article should give an idea of where it is positioned.

Continue reading “Intel’s answer to AMD and NVIDIA: the XEON Phi 5110P”

OpenCL – the battle, part II

Part II: the software-companies

It is very clear what's at stake for the hardware companies; we've also discussed the operating systems. But what should the software companies do? For companies which make e.g. encoding software or databases it is very simple: support OpenCL or be years behind (something marketing can't fix). For most other software there is a dependency on the programming language, since OpenCL is a very specialised way of programming which is (most times) too different from in-house knowledge and can therefore be too expensive.

This article is somewhat brief, since most of the material will be discussed further in later-to-be-released articles.

Video-encoding and rendering

Why did we easily get 60 frames per second in games, while rendering an image of our own house took minutes? You had the feeling there was a gap between two worlds which needed to be closed. OpenGL/DirectX did a lot (also see our next article about OpenGL, OpenCL and DirectX), but could not help us outside of games. Apple did a lot for the desktop by integrating hardware acceleration (later copied by Linux and Windows), but somehow GPU-processed results were not regarded as professional and were maybe seen more as an intermediate result (to see how it would look).

Elemental Technologies was first with its H.264/AVC encoder; Nero and nVidia joined forces somewhat later. Both are based on CUDA and not OpenCL. Since rendering is close to what we already expected to come out of a GPU, we think this market will very soon follow with the same products based on OpenCL.

A few months ago nVidia released its GPU-based ray-tracing engine, OptiX. On YouTube you can find a demo of VRay's accelerated ray-tracing engine.

We expect a lot of news from the graphics world, since they already know how to program with shaders. A lot of artists will love the free speed-up, but it's not breaking news that this is possible.

Programming languages

C and C++ have official OpenCL bindings, and thereby Objective-C (used e.g. on the iPhone) has native support.

As we described last week, we think that Oracle/Sun is taking OpenCL more seriously now. Several wrappers exist for Java, but native support is missing; we would suggest writing the OpenCL part in C or C++ when using Java, even if this breaks the beauty of the multi-platform language.
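
A rough sketch of what such a C-side "OpenCL part" could look like is given below. The function name scale_array and the kernel are made up for illustration, and error handling is left out; a Java application would call it through a thin JNI wrapper, keeping all OpenCL knowledge on the C side.

/* Sketch only: a small OpenCL "worker" in C that a Java program could call
 * via JNI. Names are invented for illustration; error checking is omitted. */
#include <CL/cl.h>
#include <stddef.h>

static const char *src =
    "__kernel void scale(__global float *data, float factor) {"
    "    size_t i = get_global_id(0);"
    "    data[i] *= factor;"
    "}";

void scale_array(float *data, size_t n, float factor)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "scale", NULL);

    /* Copy the data in, run one work-item per element, copy the result back. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), data, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kernel, 1, sizeof(float), &factor);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float), data,
                        0, NULL, NULL);

    clReleaseMemObject(buf);
    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
}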

It is very clear Microsoft had a better view, being an early adopter with Visual Studio integration through profilers (created by AMD and nVidia). You already see higher-level implementations, such as the C# toolkit OpenTK, which includes support that goes beyond the default DLL bindings. Here too, programming those parts in native C would be best.

Python is famous for its endless wrappers around anything, so an OpenCL binding was to be expected. Python has always been a safe choice for scientific programming, because of its enthusiastic community.

Bindings for OpenCL in languages like PHP and Perl are completely absent. Most times this is not a problem, as C libraries can easily be called.

RapidMind had a product which provided higher-level programming on the GPU, but after its acquisition by Intel we don't see the product any more. So we can conclude we just have to wait for native support in languages other than C, C++ and Objective-C to get better support.

Databases

We will cover databases later, when projects are more mature. In short, it is currently being investigated how GPUs can do what SUN's UltraSparcs already did. Since the memory bandwidth is only great when using the onboard memory, this is not as promising as it looks. Index searches can be sped up, but these are not the real bottleneck in database performance. We think it is very important to invest in OpenCL research in this competitive market.

Operating Systems

Apple has had good GPU support in OSX for a long time and therefore a good understanding of graphics cards. Apple started the OpenCL project and has already updated several core libraries to use OpenCL.

Microsoft has built DirectCompute into DirectX 11. This is OpenCL technology put in an MS jacket, as we've seen the company do many times before. Coming up is an article which discusses the differences.

Linux (desktop and HPC) does not have great support for OpenCL, but it works well enough to have most large OpenCL/CUDA-upgraded clusters on its name. Due to the flexibility of the OS and the strong competition between nVidia and AMD, a lot of research is done on it. Nevertheless there are no core libraries in Linux which support OpenCL. We expect e.g. visualisation libraries to support OpenCL this year.

Mathematical software

Matlab, Octave, Mathematica, R and Maple would all gain a big advantage by using the GPU. Matlab has the most support through external libraries: CUDA, Jacket, gpuMat, etc. Mathematica will soon get a CUDA version. R is still under discussion, and Octave has a few partial/abandoned implementations of some libraries; since there is a lot of money to be made by selling these products, we can only expect full open-source implementations. Maple refers to its external call routines, so we still have to wait a while for GPU support.

Conclusion

This short overview gives an idea of where to expect to find OpenCL-powered solutions. When we find more markets in the coming weeks, we'll update this post.

Clear winners cannot be pinpointed yet, since the door has only just opened. Maybe Nero, since it will now sell more of its encoding products to owners of nVidia GPUs.

Guest-blog: Accelerating sequential machine vision algorithms with OpenMP and OpenCL

Guest-blogger Jaap van de Loosdrecht wants to share his thesis with you. He leads the Centre of Expertise in Computer Vision at NHL University of Applied Sciences, owns his own company, and still managed to study and write an MSc thesis. The thesis is interesting because it extensively compares OpenCL with OpenMP, especially in chapters 7 and 8.

For those who are interested, my thesis "Accelerating sequential machine vision algorithms using commodity parallel hardware" is available at www.vdlmv.nl/thesis.

Keywords: Computer Vision, Image processing, Parallel programming, Multi-core CPU, GPU, C++, OpenMP, OpenCL.

Many other related research projects have considered using one domain-specific algorithm to compare the best sequential implementation with the best parallel implementation on a specific hardware platform. This work was distinctive because it investigated how to speed up a whole library by parallelizing the algorithms in an economical way and executing them on multiple platforms. This work has:

  • Examined, compared and evaluated 22 programming languages and environments for parallel computing on multi-core CPUs and GPUs.
  • Chosen to use OpenMP as the standard for multi-core CPU programming and OpenCL for GPU programming.
  • Re-implemented a number of standard and well-known algorithms in Computer Vision using both standards.
  • Tested the performance of the implemented parallel algorithms and compared the performance to the sequential implementations of the commercially available software package VisionLab.
  • Evaluated the test results with a view to assessing:
    • Appropriateness of multi-core CPU and GPU architectures in Computer Vision.
    • Benefits and costs of parallel approaches to implementation of Computer Vision algorithms.

Using OpenMP it was demonstrated that many algorithms of a library could be parallelized in an economical way and that adequate speedups were achieved on two multi-core CPU platforms. With a considerable amount of extra effort, OpenCL was used to achieve much higher speedups for specific algorithms on dedicated GPUs.
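
To illustrate what such an "economical" parallelization looks like in practice (a generic sketch of my own, not code taken from the thesis): a simple point operator like a binary threshold only needs a single OpenMP pragma, because every pixel is independent.

/* Illustrative sketch, not from the thesis: a binary threshold over an image,
 * parallelized over the rows with one OpenMP pragma. */
#include <omp.h>
#include <stdint.h>

void threshold(const uint8_t *src, uint8_t *dst, int width, int height,
               uint8_t level)
{
    #pragma omp parallel for   /* rows are independent, so split them over the cores */
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            dst[y * width + x] = (src[y * width + x] >= level) ? 255 : 0;
}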

At the end of the project, the choice of standards was re-evaluated including newly emerged ones. Recommendations are given for using standards in the future, and for future research and development.

Algorithmic improvements are suggested for Convolution and Connect Component Labelling.

Your feedback and/or questions are welcome.

If you put comments here, I'll make sure Jaap van de Loosdrecht gets to see them and answers your questions on the subjects discussed in his thesis.

MPI in terms of OpenCL

OpenCL is a member of a family of host-kernel programming language extensions. Others are CUDA, IMPC and DirectCompute/AMP. It is defined by a separate function or set of functions, referred to as a kernel, which are prepared and launched by the host to run in parallel. Added to that are deeply integrated language extensions for vectors, which give an extra dimension to parallelism.

Apart from the vectors, there is much overlap between host-kernel languages and parallel standards like MPI and OpenMP. As MPI and OpenMP have focused for years on how to make software parallel, this could give you an image of how OpenCL (and the rest of the family) will evolve. And it answers how its main concept, message passing, could be done with OpenCL, and moreover how OpenCL could be integrated into MPI/OpenMP.

At the right you see bees doing different things, which is easy to parallelise with MPI, but currently doesn't have the focus of OpenCL (when targeting GPUs). But actually it is very easy to do this with OpenCL too, if the hardware supports it, such as CPUs.
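
To make the comparison a bit more concrete, here is a small sketch of my own (not an official mapping): where an MPI process asks for its rank to pick its share of the work, an OpenCL work-item asks for its global id.

/* Sketch: the same "who am I, what is my share?" question in MPI and OpenCL.
 * Assumes MPI_Init() has been called elsewhere. */
#include <mpi.h>

void mpi_worker(float *data, int n)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* my id among all processes */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (int i = rank; i < n; i += size)     /* round-robin work division */
        data[i] *= 2.0f;
}

/* The OpenCL counterpart lives in a separate .cl file: one work-item per element. */
__kernel void ocl_worker(__global float *data)
{
    size_t i = get_global_id(0);             /* my id among all work-items */
    data[i] *= 2.0f;
}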

Continue reading “MPI in terms of OpenCL”

Help write the book “Numerical Computations with GPUs”

There is an interesting book coming up: "Numerical Computations with GPUs" – a book explaining various numerical algorithms with code in CUDA or OpenCL.

edit: At the moment there are 21 articles to be included in the book.

edit 2: book should be out in July

edit 3: Order via Springer International or Amazon US.
TOC:

  • Accelerating Numerical Dense Linear Algebra Calculations with GPUs.
  • A Guide to Implement Tridiagonal Solvers on GPUs.
  • Batch Matrix Exponentiation.
  • Efficient Batch LU and QR Decomposition on GPU.
  • A Flexible CUDA LU-Based Solver for Small, Batched Linear Systems.
  • Sparse Matrix-Vector Product.
  • Solving Ordinary Differential Equations on GPUs.
  • GPU-based integration of large numbers of independent ODE systems.
  • Finite and spectral element methods on unstructured grids for flow and wave propagation problems.
  • A GPU implementation for solving the Convection Diffusion equation using the Local Modified SOR method.
  • Pseudorandom numbers generation for Monte Carlo simulations on GPUs: OpenCL approach.
  • Monte Carlo Automatic Integration with Dynamic Parallelism in CUDA.
  • GPU-Accelerated computation routines for quantum trajectories method.
  • Monte Carlo Simulation of Dynamic Systems on GPUs.
  • Fast Fourier Transform (FFT) on GPUs.
  • A Highly Efficient FFT Using Shared-Memory Multiplexing.
  • Increasing parallelism and reducing thread contentions in mapping localized N-body simulations to GPUs.


Continue reading “Help write the book “Numerical Computations with GPUs””

The knowns and unknowns of the PEZY-SC accelerator at RIKEN

The Green500 is out and one unknown processor takes the number one position with a huge improvement over last year. It is a new supercomputer installed at RIKEN with an incredible 7 GFLOPS/Watt. It is powered by the processor boards shown at the right: two Xeons, 4 PEZY-SC 1.4 accelerators and 128GB DRAM, which have a combined performance of about 6.2 TFLOPS. It has been designed for immersive cooling.

The second and third positions are also powered by the PEZY-SC, before we find the winner of last year, the AMD FirePro S9150, and a bit after that the rest (mostly NVidia Tesla). One constant is the CPUs used: Intel Xeon takes most positions. To my big surprise, no ARM64.

[Image: Green500 June 2015, top 5]

From the third to the first PEZY-SC installation there is an improvement of 13%. It seems the first two are the new type, called "bricks", while the third is the same as last year. Compared with that super from last year (4.945 GFLOPS/W) the improvements are 42% and 25%. The 13% improvement over the previous version is interesting enough, but the 25% improvement on exactly the same system raises questions. Probably it is due to compiler optimisations. As the November version of the Green500 is much more strict, it will become clear whether the rules were bent – let's hope it's for real!

It supports OpenCL!

When new accelerators support OpenCL, they get accepted more easily. So it is very interesting that the PEZY-SC runs OpenCL. I asked at ISC and was told it is a subset of OpenCL, but I could not put my finger on which subset, nor could I get access to test it. It does mean that code that runs well on this machine is relatively easy to port. And then I mean the same "easy" Intel uses when explaining the ease of porting OpenMP software to the Xeon Phi: PEZY-specific optimisations and writing around the missing functionality would still take effort – the typical stuff we do at StreamHPC.

RIKEN Shoubu

Some information on "Shoubu" ("Iris" in Japanese), the number 1 on the Green500. According to the Green500 it does 353.8 TFLOPS (based on 50kW, using an actual benchmark). On 25 June RIKEN announced that Shoubu is 2 PFLOPS (theoretical). If the full machine was used for the Green500 run, then the efficiency was only 353.8 / 2000 ≈ 18%!

Below are some images of the installation.


Source: http://www.exascaler.co.jp/wp-content/uploads/2015/06/20150625.pdf

An important part is Exascaler's immersion cooling technology; as I understood it, Exascaler is a spin-off of PEZY. I'm very curious what the AMD FirePro S9150 does with immersion cooling – I think we have to do some frying at the office to find out.

PEZY-SC1.4 and PEZY-SC2

PEZY started with a multi-core processor of 512 cores, the PEZY-1. The PEZY-SC has 1024 cores and has had a few gradual upgrades – currently the PEZY-SC 1.4 ("the brick") is installed.

PEZY-SC Specification:

  • Logic cores (PE): 1,024
  • Core frequency: 733 MHz
  • Peak performance, floating point: single 3.0 TFLOPS / double 1.5 TFLOPS
  • Host interface: PCI Express Gen3.0, x8 lane x 4 port (x16 bifurcation available), JESD204B protocol support
  • DRAM interface: DDR4/DDR3 combo, 64-bit x 8 port, max B/W 153.6 GB/s, plus Ultra Wide IO SDRAM (2,048-bit) x 2 port, max B/W 102.4 GB/s
  • Control CPU: ARM926, dual core
  • Process node: 28nm
  • Package: FCBGA 47.5mm x 47.5mm, ball pitch 1mm, 2,112 pins

Source: http://pezy.co.jp/en/products/pezy-sc.html

Development of the PEZY-SC2 is ongoing; it will have a staggering 4096 cores. Of course efficiency has to go up (if the 18% is correct) to make this a good upgrade.

There is no promise on when the PEZY-SC2 will be announced, but it will certainly surprise us again when it arrives.

Meet us in April

The coming month we're travelling every week. This creates a lot of opportunities where you can meet the StreamHPC team! For appointments, send an email to contact@streamhpc.com.

  • Meet us at ParallelCon (6 April 2016, Heidelberg, Germany). Besides the crash course (see below), we also have a talk on Vulkan.
  • Crash Course OpenCL @ ParallelCon (8 April 2016, Heidelberg, Germany). This is part of the conference – you can still buy tickets!
  • Meet us in Toronto (11 April 2016, Toronto, Canada). In Toronto for business, with time for appointments.
  • Meet us at IWOCL (19 April 2016, Vienna, Austria). The event of the year for everything OpenCL. So of course we're there.
  • Meet us in Grenoble (25 April 2016, Grenoble, France). For a training we’re there the whole week. On Thursday and Friday there is time for appointments.

We're happy to talk business and technology. Giving a presentation at your company is also an option.

This information was previously communicated via the newsletter and on LinkedIn.

Rapid Performance Assessment

You might have heard about the major speed-ups GPUs and FPGAs promise, but also about the fact that this speed-up depends a lot on the type of software/algorithm. Investing in OpenCL or CUDA can therefore feel risky, since going in costs time and money, while staying out can potentially give too much space to the competition. But if you want your customers to get the best experience without paying an unnecessarily high price, you'll need to know what the return on your investment could be. With this quick assessment we help you determine exactly that.

What we’ve done before

Most assessments were about answering the question "How much speed-up can I get using GPUs?". Other questions were:

  • Does this algorithm work on this specific mobile processor?
  • Can we better use CUDA, OpenCL or OpenGL shaders for this algorithm?
  • Does the HPC code run best on a Tesla K40 or FirePro S9150?
  • How many weeks/months would it take to port all code?
  • How many GPUs do I need for under 1 second responses?
  • Does this code port to an FPGA?
  • Which OpenCL device best suits my algorithm: CPU, GPU, APU, DSP, FPGA or something else?

Is your question in the list?

Program

Within a week we can fully analyse your code, or two weeks if the codebase is large or complex. During the assessment we write/port/optimise code, to be able to back our conclusions with numbers.

After the assessment you get an overview of the hotspots, an indication of total speed-up when using OpenCL (or comparable technology), and the answers to your questions.

Preparations

Send a mail to contact@streamhpc.com for more information, and we’ll call you back to talk about your requirements. Please provide times when you want to be called back.

Contact form: https://streamhpc.com/about-us/contact/

Win an OpenCL mug!

The first batch is in and you can win one from the second batch!

We're sending a mug to a random person who subscribes to our newsletter before the end of 17 April 2017 (Central European Time). Yes, that's a Monday.

Two winners

We'll pick two winners: one from academia and one from industry. If you select "other" as your background, then share which category you fall into in the last field.

Did you already subscribe and also want to win? I am not forgetting you – more details will follow in a newsletter next quarter.

More winners, by referring to a friend

If you refer a colleague, a friend or even a stranger and they subscribe, you can both win a mug. Just be sure he/she remembers to mention you when I ask. Before you ask: the maximum referral length is 5 (so a referral of a referral of a referral of a referral, etc.) plus the one who started it.

UPDATE: If you win a mug and were not referred by somebody, you can pick a co-winner yourself. Joy should be shared.

Newsletter
Sign up for our Newsletter

You can also use this link http://eepurl.com/bgmWaP.

Feedback & Privacy

This field is huge and ever-changing, which means that certain old posts might need updated information. Also, most of our team is made up of humans: a species famous for making all kinds of mistakes.

We care about what you say!

"Feedback is the breakfast of champions."
--Ken Blanchard
  • We are not native English speakers. Did we say something strange?
  • Is the site somewhat too technical, or is it missing some hard-core code examples?
  • Is some important information missing?
  • Is the publishing of the book or Eclipse-plugin taking too long?
  • Is the site too slow? (or broken in any other way?)
  • Do you have any compliments? We blush easily!

Tell us what you think by using the contact-page or sending an e-mail to feedback@streamhpc.com.

Privacy

We track the pages you visit with Piwik and Google Analytics, and we use this information to improve our website. For Google Analytics there is an opt-out tool to be excluded from any webpage that uses it. For Piwik you can opt out using the form below. If you opt out, please be kind enough to give us feedback on how we can improve our page.

Problem solving tactic: making black boxes smaller

We are a problem-solving company first, specialised in HPC – building software close to the processor. The more projects we finish, the clearer it becomes that without our problem-solving skills we could not tackle the complexity of GPUs and CPU clusters. While I normally shield off how we work and how we continuously improve ourselves, it would be good to share a bit more, so both new customers and new recruits know what to expect from the team.

https://twitter.com/StreamHPC/status/1235895955569938432

Black boxes will never be transparent

Assumption is the mother of all mistakes

Eugene Lewis Fordsworthe

A colleague put "Assumption is the mother of all fuckups" on the wall, because we should assume that we assume. The problem is that we want to have full control and make faster decisions, and assuming conveniently fills in all those scary unknowns.

Continue reading “Problem solving tactic: making black boxes smaller”

Rebranding the company name from StreamComputing to StreamHPC

The name StreamComputing has been used since 2010 and is widely known now in the GPU-computing industry. But the name has three problems: we don't have the .com domain, it does not directly show what we do, and the name is quite long.

Some background on the old domain name

While the initial focus was Europe, for years now 95% of our projects have been done for customers outside the Netherlands and over 50% outside Europe – with the .eu domain we don't show our current international focus.

But that's not all. The name sticks well in academics, as they're more used to having longer names – just try to read a book on chemistry. Names I tested as alternatives were not well received for various reasons. Just like "fast" is associated with fast food, "computing" is not directly associated with HPC, so "fast computing" simply gets weird. Since several customers referred to us as Stream, it made much sense to keep that part of the name.

Not a new beginning, but a more focused continuation

Stream HPC defines better what we are: we build HPC software. Stream HPC combines the well-known name with our diverse specialisation:

  • Programming GPUs with CUDA or OpenCL.
  • Scaling code to multiple CPUs and GPUs
  • Creating AI-based software
  • Speeding up code and optimizing for given architectures
  • Code improvement
  • Compiler tests and compiler building (LLVM)

With the HPC focus we were better able to improve ourselves. We have put a lot of time into professionalising our development environment and project management by implementing suggestions from current customers and our friends. We were already used to working fully independently and being self-managed, but now we were able to standardise more of it.

The rebranding process

Rebranding will take some time, as our logo and name are in many places. For the legal part we will take some more time, as we don't want to get into problems with e.g. all the NDAs. Email will keep working on both domains.

We will contact all organizations we’re a member of over the coming weeks. You can also contact us, if you read this.

StreamComputing will never really go away. The name was with us for 7 years and stands for growing up alongside the rise of GPU-computing.

Ask your question

Do you have a question? We are happy to answer all your questions on any subject discussed at this website.

Due to spam floods, we removed the form.

info@streamhpc.com

We try to answer your question within 24 hours.

Our Team

StreamHPC is one of the best-known companies in GPU-computing (OpenCL/CUDA/HIP/SYCL), but we are also very active in embedded development, algorithm design and technologies like graphics (OpenGL/Vulkan), Machine Learning, and HPC (OpenMP/MPI).

We are distributed mainly between Amsterdam, Budapest and Barcelona.

The developers, the heart of the company

The company consists of highly skilled developers and low-level performance engineers. We mostly manage ourselves, but always with the help of the group. This way everyone has influence by taking ownership.

Each employee regularly shares their experience and checks the work of colleagues, to keep the standards high. This results in faster deliveries and higher-quality code, for which we've often been complimented.

Want to work at StreamHPC too? Check our jobs-page.

The Leads

The senior team deals with new directions/markets/strategies, training the employees and making sure the project teams are enabled. We use Holacracy and EOS to lead our company.

  • HR: Berrak Bas
  • Consultancy + Projects: Adel Johar
  • Marketing + Sales: Vincent Hindriksen
  • General strategy: Vincent Hindriksen
  • IT: Robin Voetter and Balint Soproni
  • Operations: Maurizio Campese
  • Finance: shared
  • Legal: shared
  • Open standards: shared
  • Offices: shared
  • Products: shared

The shared roles do not need to be filled right now, as they are done together or done outside the company. When these become full-time roles, we will make them vacant and publish them on our jobs-page.

Hire the experts

On average our pipeline is full for 3-6 months ahead, but we always reserve time for shorter projects (of at most a month).

Call +31 854865760 or mail to info@streamhpc.com or fill in the contact form to have a chat on how we can solve your software performance problems or do your software development.

OpenCL on the CPU: AVX and SSE

When AMD came out with CPU support I was the last one to be enthusiastic about it, comparing it to feeding chicken feed to oxen. Now CUDA has CPU support too, so what was I missing?

This article is a quick overview of OpenCL on CPU extensions, but expect more when the hybrid X86 processors actually hit the market. Besides ARM, IBM also already has them; more about their POWER architecture in an upcoming article, to give it the attention it deserves.

CPU extensions

SSE/MMX started in the 90's, extending the IBM-compatible X86 instruction set to be able to do an add and a multiplication in one clock tick. I still remember the discussion in my student flat that the MP3s I could produce in only 4 minutes on my 166MHz PC just had to be of worse quality than the ones which were encoded in 15 minutes. No, the encoder I "found" on the internet simply made use of SSE capabilities. Currently we have reached SSE5 (by AMD) and Intel has introduced a new extension called AVX. That's a lot of abbreviations! MMX stands for "MultiMedia Extension", SSE for "Streaming SIMD Extensions" with SIMD being "Single Instruction Multiple Data", and AVX for "Advanced Vector Extensions". This actually sounds very interesting, since we saw SIMD and vectors on the GPU too. Let's go into SSE (1 to 4) and AVX – both fully supported on the new CPUs by AMD and Intel.
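
To give an idea how this connects to OpenCL (my own sketch, not taken from any vendor manual): the vector types of OpenCL C are the most direct way to hand work to these extensions. A float4 operation can be compiled down to a single SSE instruction and a float8 to a single AVX instruction, although whether that actually happens depends on the vendor's CPU runtime.

/* Sketch: an OpenCL C kernel written with explicit vector types. On a CPU
 * device the compiler may map each float4 addition to one SSE instruction
 * (or pack two of them into one AVX instruction), depending on the runtime. */
__kernel void vec_add(__global const float4 *a,
                      __global const float4 *b,
                      __global float4 *c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];    /* four single-precision additions at once */
}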

Continue reading “OpenCL on the CPU: AVX and SSE”

Start your GPU-career here

GPUs have been our mysterious friends and known enemies for years, as they let us run code in expected and unexpected ways. GPUs have solved problems for many of our customers. GPUs have such a high rate of evolution that they'll remain important for the years to come.

The problem is that programming GPUs is not an easy task. Where do you learn to program GPUs? We found these to be the main groups:

  • Universities
  • Research centers
  • GPU vendors (AMD, Nvidia, Intel, Qualcomm, ARM)
  • Self-study

This is far from enough. Add to that that only a very select group learns the craft at a company. We'd like to change that, and we think now is the time for us to deliver on this.

In January our internal training program will start with 4 to 8 developers. The focus is on fully understanding recent GPU architectures, CUDA and OpenCL. It will consist of lectures, workshops, discussions, paper reading and of course coding, for one month. The months after that will have guidance, paper presentations, code reviews and time for self-study. The exact form will differ per person.

The hard side

The current measurable requirements are:

  • EU citizen or already having a working permit
  • Great at C/C++
  • High interest in algorithmic optimisations
  • Any performance improvement focus (e.g. assembly, clean code) is a plus
  • Any GPU experience (e.g. OpenGL, DirectX, self-study) is a plus
  • High interest in performance
  • Willing to move to Amsterdam
  • Willing to work for Stream HPC for at least 2 years

The soft side

We're looking for people that fit our culture and whom we think we can train. This means that the selection is based for a large part on "the spark". Therefore the application starts with a speed date, and we're sorry for not finding a better word for it. This is a 20-minute discussion about what we like and what we don't. It can be done via phone, Skype or in person, during the evening, at the weekend or during your lunch break.

How to apply

Read about our company culture. Look at the jobs we have open; these describe the requirements after the training. Then write us a motivational letter: explain to us why this is exactly what you want, why you're capable and why you're a cultural fit. If you find it hard to write such a letter, then just start by answering the list of requirements. It's a big bonus to share code (GitHub, GitLab, zip-file). Send your email to jobs@streamhpc.com.

Other jobs

Feeling more senior? We have other jobs too.

Texas Instruments DSP

TI has a fully conformant OpenCL 1.1 implementation.

The table below is taken from http://downloads.ti.com/mctools/esd/docs/opencl/intro.html and shows which DSPs have OpenCL support.

SoC | System | Khronos Conformance | Installation Instructions
AM572 | AM572 EVM | OpenCL v1.1 Conformant | Processor SDK for AM57x
DRA75x | DRA75x EVM | OpenCL v1.1 Conformant | Processor SDK for DRA7x (Enabling OpenCL on DRA75x)
AM571 | AM572 EVM | OpenCL v1.1 Conformant | Processor SDK for AM57x
66AK2H | 66AK2H EVM | OpenCL v1.1 Conformant | Processor SDK for K2H
66AK2L | 66AK2L EVM | Not submitted for conformance | Processor SDK for K2L
66AK2E | 66AK2E EVM | Not submitted for conformance | Processor SDK for K2E
66AK2G | 66AK2G EVM | Not submitted for conformance | Processor SDK for K2G

Theoretical Performance of the C66x

  • Fixed point 16×16 MACs per cycle: 32
  • Fixed point 32×32 MACs per cycle: 8
  • Floating point single precision MACs per cycle: 8
  • Arithmetic floating point operations per cycle: 16 – 2-way SIMD on the .L and .S units (8 SP operations over the A and B sides) plus 4 SP multiplies per .M unit (8 SP operations over the A and B sides)
  • Load/store width: 2 x 64-bit
  • Vector size (SIMD capability): 128-bit (4 x 32-bit, 4 x 16-bit, 4 x 8-bit)

GFLOPs

2 FLOPs – 2-way SIMD on .L1 (A side) such as DADDSP or DSUBSP
2 FLOPs – 2-way SIMD on .L2 (B side) such as DADDSP or DSUBSP
2 FLOPs – 2-way SIMD on .S1 (A side) such as DADDSP or DSUBSP
2 FLOPs – 2-way SIMD on .S2 (B side) such as DADDSP or DSUBSP
4 FLOPs – 4-way SIMD on .M1 (A side) such as QMPYSP (or CMPYSP, maybe not 4-way SIMD)
4 FLOPs – 4-way SIMD on .M2 (B side) such as QMPYSP (or CMPYSP, maybe not 4-way SIMD)
========================
16 FLOPs total per cycle per C66x CorePac (source)
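
As a back-of-the-envelope check (the core count and clock below are example values for illustration, not the spec of one particular device), the theoretical peak follows directly from these 16 FLOPs per cycle:

/* Back-of-the-envelope peak performance from the 16 FLOPs/cycle above.
 * Core count and clock are example values, purely for illustration. */
#include <stdio.h>

int main(void)
{
    const double flops_per_cycle = 16.0;   /* per C66x CorePac, single precision */
    const double cores           = 8.0;    /* example: an 8-core DSP */
    const double clock_ghz       = 1.2;    /* example clock in GHz */

    printf("Theoretical peak: %.1f SP GFLOPS\n",
           flops_per_cycle * cores * clock_ghz);   /* prints 153.6 */
    return 0;
}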

Boards

A good starter board is the BeagleBoard X-15, which has OpenCL drivers. It has 2x C66x DSPs and 2x 1.5 GHz ARM Cortex-A15.

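Once the OpenCL package is installed, a small device query is the usual first test. Below is a minimal sketch; on TI's runtime the C66x DSPs should appear as OpenCL devices, though the exact device names differ per SDK version.

/* Minimal sketch: list all OpenCL platforms and devices, so you can check
 * that the C66x DSPs are visible to the runtime. */
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_platform_id platforms[4];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(4, platforms, &num_platforms);

    for (cl_uint p = 0; p < num_platforms; p++) {
        cl_device_id devices[8];
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

        for (cl_uint d = 0; d < num_devices; d++) {
            char name[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            printf("Platform %u, device %u: %s\n", p, d, name);
        }
    }
    return 0;
}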

StreamHPC flirts with ARM

With the launch of the Twitter channel @OpenCLonARM we now officially show a strong interest in ARM for compute. And we are not the only ones, as the channel already has 80 followers (60 in 1.5 days, and 12 retweets of the welcome message).

ARM has made tremendous progress in both technology and market share. With ARM64 and companies like NVidia (and maybe AMD) in the field, X86 seems to be getting a real competitor. This could happen because, for a few years now, computers have been fast enough; they are not being replaced by a faster one, but by a smaller one (tablet, phone) or an extra one. By the rules of the market, current technologies are replaced by the ones that serve those other needs. ARM is fast (enough), flexible in design, very cheap, low-power and passively cooled. The biggest obstacle seems to be getting a standard for a docking station to connect your mobile, tablet or watch to a keyboard, mouse and large screen.

OpenCL is perfect for ARM, as it gives computation power to the intensive computations not already covered by dedicated hardware. In the world of X86 this interests high-performance and big-data companies; on ARM it interests many more. Without OpenCL you can already watch HD video; with OpenCL you can also encode that video to MP4. This year you will certainly hear more about new possibilities of OpenCL on ARM.

What do you think? Why does Intel not sell IP to ARM companies, as many technologies could be reused? Could Intel be the next ARM as an IP seller, or will they stay the defender of X86 for many years to come?

streamhpc.com is not affiliated with ARM.

Code Review

Code reviews are one of the fastest ways to get the dev-team back on track and add performance to the code. We offer two types of code reviews, both safely under an NDA. This way you stay in control of the development, while getting expert knowledge in.

A quick scan gives you an overview of the main ways to speed up the code and how it can be done.
This quick scan can be delivered in one week, if necessary, to give you the direction you may require in times of pressure.

Also, an extensive code review can provide all the necessary information for a redesigned architecture.

GPU-code (OpenCL, CUDA, Aparapi, and more)

Writing GPU code and the accompanying host code can be tricky. The best method to learn CUDA or OpenCL is by doing, but sometimes you need feedback to be sure you're doing the right thing. We can check your code and give you a report with hands-on tricks to make it optimal.

CPU-code (Java, C, C++ and more)

Much CPU code – in Java, C, C++ or C# – is written with functionality in mind, but not performance. Adding performance (cache optimisation, memory-usage reduction, parallelisation of computations, adding OpenMP threads, etc.) is quite doable, but only when you know how. We can help you increase the performance of the software through feedback and clear steps.

Let us help you!

If you are interested in this service, request more information today and we will get back to you as soon as possible. Of course, you can also contact us via phone (+31 6 45400456), or e-mail (info@streamhpc.com).

Events & Talks

StreamHPC gives talks at public and in-company events to explain what GPU-programming is, while focusing on the day’s theme.

You are welcome to attend these days, or you can request a talk about OpenCL and GPU-programming to be given at your event.

Agenda Talks

At the events in the list below Vincent Hindriksen will give, or has given, a talk.

Date | Location - Language | Description | Link to program | Type
6 November 2013 | Nijkerk - English | Aparapi and Project Sumatra: using GPGPU in Java | NLJUG J-Fall 2013 | Registration and NLJUG-membership required
31+31 October 2013 | Cambridge (UK) - English | Using the GPU for Physics computations via OpenCL | Mosaic3DX | Registration required
4 October 2012 | Amsterdam - English | Introduction to OpenCL on mobile processors, to make hackers & funders think of new ideas for products which were never possible before. | Hackers and Founders (Amsterdam, NL) Meetup | Free
20 September 2012 | English | Ongoing work on the OpenCL plugin for Eclipse, presented remotely. I cancelled this, because my work has not advanced enough. | PTP User-Developer Workshop Sept 18-20, Chicago | Registration required.
28 June 2012 | Amsterdam - English | GPGPU-day organised by StreamHPC. No talks by us, but by many interesting speakers from many Dutch universities. | Platform Parallel NL | Registration required. Free for students and researchers in the Netherlands.
20 June 2012 | Delft - English | Industry session at HPDC'12 (session 4): "Parallel Programming for the Masses", about how to walk the road forward while on the road of legacy. | HPDC'12 | Registration required.
15 June 2012 | Ede - Dutch | Talk at SDN about how to use GPU-programming in .NET, including an introduction to GPU-programming. | SDN | Registration required. Free for SDN-members.
25 May 2012 | Amsterdam - English | In-company talk about parallel and GPU programming to decrease power usage. | - | Internal

Reservations

For reservations and requests, please mail to events@streamhpc.com.

Agenda Events

We are visiting or have visited the following events. This is perfect if you want to have a quick discussion with us.

Date | Location | Description | Link to program
20-22 January 2014 | Vienna, Austria | A premier forum for experts in computer architecture, programming models, compilers and operating systems for embedded and general-purpose systems. | HiPEAC
12+13 May 2014 | Bristol, UK | An annual meeting of OpenCL users, researchers, developers and suppliers to share OpenCL best practice, and to promote the evolution and advancement of the OpenCL standard. | IWOCL
20 June 2013 | Amsterdam, Netherlands | All about GPGPU in the Netherlands | Applied GPGPU-days
21+22 May 2013 | London, UK | All about HPC-techniques on low-power devices | LEAP-conference
26-28 February 2013 | Nürnberg, Germany | 873 exhibitors from 37 countries. Will focus on the processors with high-end compute-capabilities. | Embedded World'13 Exhibition
21-23 January 2013 | Berlin, Germany | International conference on high-performance and embedded architectures and compilers. | HiPEAC'13
18 December 2012 | Paris, France | Cancelled. Meetup by the Paris HPC group. This talk would have been about the most efficient (GFLOPS/Watt) processors existing today. | Very High Performance/Watt Processors: the Road to Exascale
17 December 2012 | Amsterdam, Netherlands | The e-BioGrid project is part of the BiG Grid project to establish an e-infrastructure for life sciences. | e-BioGrid and beyond
13 December 2012 | Brussels, Belgium | All around GPUs, FPGAs and upcoming architectures. | Symposium on Personal High-Performance Computing
5+6 November 2012 | Amsterdam, Netherlands | Big data event around Smart Systems, Cloud Computing, Mobile and Social Media. (I did a pitch talk here.) | Perfect Storm Europe
29 November 2012 | Eindhoven, Netherlands | Altera explains how FPGAs with an ARM-core can work for many types of problems. | Altera SoC FPGA event
24-26 October 2012 | Eindhoven, Netherlands | Talks around multiscale science. | Opening Symposium Eindhoven Multiscale Institute
26 September 2012 | Amsterdam, Netherlands | All about Grid-computing in the Netherlands. | BiG Grid and beyond