FP10: Generic Number Crunching Via Pixel Bender - Part 2
Many of you have expressed concern that the Flash implementation of Pixel Bender runs on the CPU rather than the GPU, unlike Photoshop and After Effects. A lot has changed since my first post on the topic of generic number crunching. You can still refer to that post for how to get started, but I have some additional points that I would like to make clear:
1.) Compared to pure AS3, a Pixel Bender kernel can be up to X times faster, where X is the number of CPU cores on the machine it is running on. The gain doesn’t always scale 1:1 with the core count though; I would say on average that a complex task will be around 3x faster than the optimized AS equivalent. Keep in mind that you are also working with 32-bit floating point values in Pixel Bender, whereas in AS something like BitmapData is limited to 8-bit uints per channel.
2.) Even if Flash had full GPU support, the CPU would still beat it in some cases. A good example of this is when you are using a kernel to handle some generic number crunching. Reading data back from the GPU, rather than just displaying it, is a very slow process. The idea that GPU support would be the answer to all of our problems in Flash is simply not true.
3.) Not every task will see a performance gain by moving calculations from AS to a PB kernel. In general, operations on a bitmap or bitmaps will almost always be faster, but in the case of generic number crunching, it depends on the complexity of the math taking place. Simply moving your cross product calculations from AS to a PB kernel is going to give you the same speed, or even a bit slower in some cases. The biggest advantage of moving your complex algorithms to a PB kernel is that you can avoid UI lag while the shader job is processing, because the calculations take place on a different thread than ActionScript. Finding the right balance between AS and PB is simply a matter of experimentation and testing.
4.) PB rendering tasks are split up by row and delegated to each core on a multi-core machine. If the height of an input is one, only one core will be used. In other words, be smart about how you configure the heights of your shader jobs; more CPUs crunching your numbers means faster performance.
5.) There are limitations in the PB Toolkit that prevent you from exporting byte code with certain output types (currently, these problems revolve around limitations with GPUs). These limitations force you to either work with collections that are a width of a certain multiple or do some sort of filtering process in AS after the shader job has completed. The PB command-line utility that Adobe will be releasing soon addresses these limitations by allowing you to export byte code for everything that FP10 supports, thus alleviating the need to do workarounds.
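To make points 4 and 5 a bit more concrete, here is a minimal sketch of the bookkeeping involved. It is written in Python only so the arithmetic can be run on its own; it is not Pixel Bender or AS3 code. The names plan_layout, pad and trim, the rows and multiple parameters, and the zero filler are all assumptions made for illustration. The idea is to spread a flat list of values over several rows (so more than one core has work to do), round the row width up to whatever multiple the exported byte code requires, and throw away the filler once the job is done.

```python
import math

def plan_layout(count, rows, multiple):
    """Return (width, height) of a grid that can hold `count` values.

    rows:     desired number of rows, e.g. the number of CPU cores (assumption)
    multiple: the row width is rounded up to a multiple of this (assumption)
    """
    if count <= 0 or rows <= 0 or multiple <= 0:
        raise ValueError("count, rows and multiple must be positive")
    height = min(rows, count)                  # never more rows than values
    per_row = math.ceil(count / height)        # values needed in each row
    width = math.ceil(per_row / multiple) * multiple
    return width, height

def pad(values, width, height, fill=0.0):
    """Append filler so the list exactly fills width * height slots."""
    return list(values) + [fill] * (width * height - len(values))

def trim(results, count):
    """Drop the filler slots once the job has finished."""
    return list(results[:count])

# Example: 1000 values, 4 rows, widths restricted to multiples of 4.
width, height = plan_layout(1000, rows=4, multiple=4)
padded = pad([1.0] * 1000, width, height)
print(width, height, len(padded))  # 252 4 1008
```

In a real project the resulting width and height are what you would hand to the shader input and the shader job, and trim stands in for the filtering step in AS after the job has completed. The right number of rows for a given machine is still something to find by testing.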
As I have mentioned in previous posts, I will be covering all of this information and a whole lot more in my session ‘Pixel Bender Unleashed’ at Adobe MAX this November. Seats are starting to fill up, so if you haven’t already set up your MAX schedule, you should definitely do that before you miss out on any of the big sessions that you wish to attend.
On a final note - I haven’t gotten around to doing the iPhone tutorial for Flash developers yet, but will be doing so this week. Look for a post on it soon!
1 Comment so far

This is certainly interesting for us. We develop a financial application in Flex:
http://www.softcapital.com/labs/demo/radarlite.html
And we could really use some more number crunching power than we presently have in pure Flash, especially for derivatives calculations: smoothing functions like cubic spline and 4th degree polynomial in real time. This is no problem in our C++ programs but impossible in Flex; maybe this is the solution.
Thanks for taking this subject up.