<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.19: https://docutils.sourceforge.io/" />
<title>Frequently Asked Questions About Intel® ISPC</title>
<link rel="stylesheet" href="css/style.css" type="text/css" />
</head>
<body>
<div class="document" id="frequently-asked-questions-about-intel-ispc">
<div id="wrap">
  <div id="wrap2">
    <div id="header">
      <h1 id="logo">Intel® Implicit SPMD Program Compiler</h1>
      <div id="slogan">An open-source compiler for high-performance SIMD programming on
      the CPU and GPU</div>
    </div>
    <div id="nav">
      <div id="nbar">
        <ul>
          <li><a href="index.html">Overview</a></li>
          <li><a href="features.html">Features</a></li>
          <li><a href="downloads.html">Downloads</a></li>
          <li id="selected"><a href="documentation.html">Documentation</a></li>
          <li><a href="perf.html">Performance</a></li>
          <li><a href="contrib.html">Contributors</a></li>
        </ul>
      </div>
    </div>
    <div id="content-wrap">
      <div id="sidebar">
          <div class="widgetspace">
            <h1>Resources</h1>
            <ul class="menu">
              <li><a href="http://github.com/ispc/ispc">GitHub page</a></li>
              <li><a href="https://github.com/ispc/ispc/discussions">Discussions on GitHub</a></li>
              <li><a href="http://github.com/ispc/ispc/issues">Issues on GitHub</a></li>
              <li><a href="https://github.com/orgs/ispc/projects/1">Release planning board</a></li>
              <li><a href="https://github.com/ispc/ispc/blob/main/CONTRIBUTING.md">Contributing guide</a></li>
              <li><a href="http://github.com/ispc/ispc/wiki">Wiki on GitHub</a></li>
            </ul>
        </div>
      </div>
<h1 class="title">Frequently Asked Questions About Intel® ISPC</h1>

<div id="content">
<p>This document includes a number of frequently (and not frequently) asked
questions about ispc, the Intel® Implicit SPMD Program Compiler (Intel® ISPC).
The source to this document is in the file <tt class="docutils literal">docs/faq.rst</tt> in the <tt class="docutils literal">ispc</tt> source
distribution.</p>
<ul class="simple">
<li>Understanding ispc's Output<ul>
<li><a class="reference internal" href="#how-can-i-see-the-assembly-language-generated-by-ispc">How can I see the assembly language generated by ispc?</a></li>
<li><a class="reference internal" href="#how-can-i-have-the-assembly-output-be-printed-using-intel-assembly-syntax">How can I have the assembly output be printed using Intel assembly syntax?</a></li>
<li><a class="reference internal" href="#why-are-there-multiple-versions-of-exported-ispc-functions-in-the-assembly-output">Why are there multiple versions of exported ispc functions in the assembly output?</a></li>
<li><a class="reference internal" href="#how-can-i-more-easily-see-gathers-and-scatters-in-generated-assembly">How can I more easily see gathers and scatters in generated assembly?</a></li>
</ul>
</li>
<li>Language Details<ul>
<li><a class="reference internal" href="#what-is-the-difference-between-int-foo-and-int-foo">What is the difference between &quot;int *foo&quot; and &quot;int foo[]&quot;?</a></li>
<li><a class="reference internal" href="#why-are-pointed-to-types-uniform-by-default">Why are pointed-to types &quot;uniform&quot; by default?</a></li>
<li><a class="reference internal" href="#what-am-i-getting-an-error-about-assigning-a-varying-lvalue-to-a-reference-type">Why am I getting an error about assigning a varying lvalue to a reference type?</a></li>
</ul>
</li>
<li>Interoperability<ul>
<li><a class="reference internal" href="#how-can-i-supply-an-initial-execution-mask-in-the-call-from-the-application">How can I supply an initial execution mask in the call from the application?</a></li>
<li><a class="reference internal" href="#how-can-i-generate-a-single-binary-executable-with-support-for-multiple-instruction-sets">How can I generate a single binary executable with support for multiple instruction sets?</a></li>
<li><a class="reference internal" href="#how-can-i-determine-at-run-time-which-vector-instruction-set-s-instructions-were-selected-to-execute">How can I determine at run-time which vector instruction set's instructions were selected to execute?</a></li>
<li><a class="reference internal" href="#is-it-possible-to-inline-ispc-functions-in-c-c-code">Is it possible to inline ispc functions in C/C++ code?</a></li>
<li><a class="reference internal" href="#why-is-it-illegal-to-pass-varying-values-from-c-c-to-ispc-functions">Why is it illegal to pass &quot;varying&quot; values from C/C++ to ispc functions?</a></li>
</ul>
</li>
<li>Programming Techniques<ul>
<li><a class="reference internal" href="#what-primitives-are-there-for-communicating-between-spmd-program-instances">What primitives are there for communicating between SPMD program instances?</a></li>
<li><a class="reference internal" href="#how-can-a-gang-of-program-instances-generate-variable-amounts-of-output-efficiently">How can a gang of program instances generate variable amounts of output efficiently?</a></li>
<li><a class="reference internal" href="#is-it-possible-to-use-ispc-for-explicit-vector-programming">Is it possible to use ispc for explicit vector programming?</a></li>
<li><a class="reference internal" href="#how-can-i-debug-my-ispc-programs-using-valgrind">How can I debug my ispc programs using Valgrind?</a></li>
<li><a class="reference internal" href="#foreach-statements-generate-more-complex-assembly-than-i-d-expect-what-s-going-on">foreach statements generate more complex assembly than I'd expect; what's going on?</a></li>
<li><a class="reference internal" href="#how-do-i-launch-an-individual-task-for-each-active-program-instance">How do I launch an individual task for each active program instance?</a></li>
</ul>
</li>
</ul>
<div class="section" id="understanding-ispc-s-output">
<h1>Understanding ispc's Output</h1>
<div class="section" id="how-can-i-see-the-assembly-language-generated-by-ispc">
<h2>How can I see the assembly language generated by ispc?</h2>
<p>The <tt class="docutils literal"><span class="pre">--emit-asm</span></tt> flag causes assembly output to be generated.  If the
<tt class="docutils literal"><span class="pre">-o</span></tt> command-line flag is also supplied, the assembly is stored in the
given file, or printed to standard output if <tt class="docutils literal">-</tt> is specified for the
filename.  For example, given the simple <tt class="docutils literal">ispc</tt> program:</p>
<pre class="literal-block">
export uniform int foo(uniform int a, uniform int b) {
    return a+b;
}
</pre>
<p>If the SSE4 target is used, then the following assembly is printed:</p>
<pre class="literal-block">
_foo:
        addl    %esi, %edi
        movl    %edi, %eax
        ret
</pre>
</div>
<div class="section" id="how-can-i-have-the-assembly-output-be-printed-using-intel-assembly-syntax">
<h2>How can I have the assembly output be printed using Intel assembly syntax?</h2>
<p>The <tt class="docutils literal">ispc</tt> compiler is currently only able to emit assembly with AT&amp;T
syntax, where the destination operand is the last operand after an
instruction.  If you'd prefer Intel assembly output, one option is to use
Agner Fog's <tt class="docutils literal">objconv</tt> tool: have <tt class="docutils literal">ispc</tt> emit a native object file and
then use <tt class="docutils literal">objconv</tt> to disassemble it, specifying the assembler syntax
that you prefer.  <tt class="docutils literal">objconv</tt> <a class="reference external" href="http://www.agner.org/optimize/#objconv">is available for download here</a>.</p>
</div>
<div class="section" id="why-are-there-multiple-versions-of-exported-ispc-functions-in-the-assembly-output">
<h2>Why are there multiple versions of exported ispc functions in the assembly output?</h2>
<p>Two versions of every function qualified with <tt class="docutils literal">export</tt> are generated:
one is to be called by other <tt class="docutils literal">ispc</tt> functions, and the
other is to be called by the application.  The application-callable
function has the original function's name, while the <tt class="docutils literal">ispc</tt>-callable
function has a mangled name that encodes the types of the function's
parameters.</p>
<p>The crucial difference between these two functions is that the
application-callable function doesn't take a parameter encoding the current
execution mask, while <tt class="docutils literal">ispc</tt>-callable functions have a hidden mask
parameter.  An implication of this difference is that the <tt class="docutils literal">export</tt>
function starts with the execution mask &quot;all on&quot;.  This allows a number of
improvements in the generated code, particularly on architectures that
don't have support for masked load and store instructions.</p>
<p>As an example, consider this short function, which loads a vector's worth
of values from two arrays in memory, adds them, and writes the result to an
output array.</p>
<pre class="literal-block">
export void foo(uniform float a[], uniform float b[],
                uniform float result[]) {
    float aa = a[programIndex], bb = b[programIndex];
    result[programIndex] = aa+bb;
}
</pre>
<p>Here is the assembly code for the application-callable instance of the
function.</p>
<pre class="literal-block">
_foo:
        movups        (%rsi), %xmm1
        movups        (%rdi), %xmm0
        addps         %xmm1, %xmm0
        movups        %xmm0, (%rdx)
        ret
</pre>
<p>And here is the assembly code for the <tt class="docutils literal">ispc</tt>-callable instance of the
function.</p>
<pre class="literal-block">
&quot;_foo___uptr&lt;Uf&gt;uptr&lt;Uf&gt;uptr&lt;Uf&gt;&quot;:
        movmskps      %xmm0, %eax
        cmpl          $15, %eax
        je            LBB0_3
        testl         %eax, %eax
        jne           LBB0_4
        ret
LBB0_3:
        movups        (%rsi), %xmm1
        movups        (%rdi), %xmm0
        addps         %xmm1, %xmm0
        movups        %xmm0, (%rdx)
        ret
LBB0_4:
####
####  Code elided; handle mixed mask case..
####
        ret
</pre>
<p>There are a few things to notice in this code.  First, the current program
mask is coming in via the <tt class="docutils literal">%xmm0</tt> register and the initial few
instructions in the function essentially check to see if the mask is all on
or all off.  If the mask is all on, the code at the label LBB0_3 executes;
it's the same as the code that was generated for <tt class="docutils literal">_foo</tt> above.  If the
mask is all off, then there's nothing to be done, and the function can
return immediately.</p>
<p>In the case of a mixed mask, a substantial amount of code is generated to
load from and then store to only the array elements that correspond to
program instances where the mask is on.  (This code is elided in the listing
above.)  This general pattern of having two code paths for the &quot;all on&quot; and
&quot;mixed&quot; mask cases is used in the code generated for all but the simplest
functions (where the overhead of the test isn't worthwhile).</p>
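<p>The two-path structure described above can be modeled in scalar C++.  This is
only an illustrative sketch, not code that <tt class="docutils literal">ispc</tt> emits: the names
<tt class="docutils literal">fooMasked</tt>, <tt class="docutils literal">fooExported</tt>, and <tt class="docutils literal">kGangSize</tt> are hypothetical, a 4-wide
gang is assumed, and the mask is modeled as one bit per program instance.</p>
<pre class="literal-block">
#include &lt;cstdint&gt;

// Assumed 4-wide gang, matching the SSE listings above.
constexpr int kGangSize = 4;
constexpr std::uint32_t kAllOn = (1u &lt;&lt; kGangSize) - 1;  // 0b1111 == 15

// Scalar model of the ispc-callable version of foo(): 'mask' has one bit
// per program instance, similar to what movmskps produces above.
void fooMasked(const float *a, const float *b, float *result,
               std::uint32_t mask) {
    if (mask == kAllOn) {
        // "All on" path (LBB0_3): every instance is running, so plain
        // full-width loads and stores are safe.
        for (int i = 0; i &lt; kGangSize; ++i)
            result[i] = a[i] + b[i];
    } else if (mask != 0) {
        // "Mixed" path (LBB0_4): touch only the lanes whose bit is set.
        for (int i = 0; i &lt; kGangSize; ++i)
            if (mask &amp; (1u &lt;&lt; i))
                result[i] = a[i] + b[i];
    }
    // mask == 0: all instances are off, so there is nothing to do.
}

// Scalar model of the application-callable version: it behaves as if the
// mask were "all on", so only the unmasked path is needed.
void fooExported(const float *a, const float *b, float *result) {
    fooMasked(a, b, result, kAllOn);
}
</pre>
<p>In this model, calling <tt class="docutils literal">fooMasked</tt> with a mask of <tt class="docutils literal">0x5</tt> updates only elements 0
and 2 of <tt class="docutils literal">result</tt>, while <tt class="docutils literal">fooExported</tt> always updates all four.</p>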
</div>
<div class="section" id="how-can-i-more-easily-see-gathers-and-scatters-in-generated-assembly">
<h2>How can I more easily see gathers and scatters in generated assembly?</h2>
<p>Because many CPU vector ISAs (SSE4, for example) don't have native gather and
scatter instructions, these memory operations are turned into sequences of
instructions in the code that <tt class="docutils literal">ispc</tt> generates.  In some cases, it can be
useful to see where gathers and scatters actually happen in code; there is
an otherwise undocumented command-line flag that provides this information.</p>
<p>Consider this simple program:</p>
<pre class="literal-block">
void set(uniform int a[], int value, int index) {
    a[index] = value;
}
</pre>
<p>When compiled normally to the SSE4 target, this program generates this
extensive code sequence, which makes it more difficult to see what the
program is actually doing.</p>
<pre class="literal-block">
&quot;_set___uptr&lt;Ui&gt;ii&quot;:
        pmulld        LCPI0_0(%rip), %xmm1
        movmskps      %xmm2, %eax
        testb         $1, %al
        je            LBB0_2
        movd          %xmm1, %ecx
        movd          %xmm0, (%rcx,%rdi)
LBB0_2:
        testb         $2, %al
        je            LBB0_4
        pextrd        $1, %xmm1, %ecx
        pextrd        $1, %xmm0, (%rcx,%rdi)
LBB0_4:
        testb         $4, %al
        je            LBB0_6
        pextrd        $2, %xmm1, %ecx
        pextrd        $2, %xmm0, (%rcx,%rdi)
LBB0_6:
        testb        $8, %al
        je            LBB0_8
        pextrd        $3, %xmm1, %eax
        pextrd        $3, %xmm0, (%rax,%rdi)
LBB0_8:
        ret
</pre>
<p>If this program is compiled with the
<tt class="docutils literal"><span class="pre">--opt=disable-handle-pseudo-memory-ops</span></tt> command-line flag, then the
scatter is left as an unresolved function call.  The resulting program
won't link, because of the unresolved symbol, but the assembly output is much
easier to understand:</p>
<pre class="literal-block">
&quot;_set___uptr&lt;Ui&gt;ii&quot;:
        movaps        %xmm0, %xmm3
        pmulld        LCPI0_0(%rip), %xmm1
        movdqa        %xmm1, %xmm0
        movaps        %xmm3, %xmm1
        jmp        ___pseudo_scatter_base_offsets32_32 ## TAILCALL
</pre>
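<p>For reference, the long SSE4 sequence shown earlier in this answer amounts to a
per-lane masked store.  The following scalar C++ model is a sketch under stated
assumptions: a 4-wide gang, a mask with one bit per program instance, and the
hypothetical names <tt class="docutils literal">scatterSet</tt> and <tt class="docutils literal">kGangSize</tt>.</p>
<pre class="literal-block">
#include &lt;cstdint&gt;

constexpr int kGangSize = 4;  // assumed SSE4 gang size

// Scalar model of set(): each running program instance stores its own
// 'value' at its own 'index'.  The testb/je pairs in the SSE4 listing
// correspond to the per-lane mask test below.
void scatterSet(int *a, const int value[kGangSize],
                const int index[kGangSize], std::uint32_t mask) {
    for (int lane = 0; lane &lt; kGangSize; ++lane) {
        if (mask &amp; (1u &lt;&lt; lane))          // is this program instance running?
            a[index[lane]] = value[lane];  // one scalar store per active lane
    }
}
</pre>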
</div>
</div>
<div class="section" id="language-details">
<h1>Language Details</h1>
<div class="section" id="what-is-the-difference-between-int-foo-and-int-foo">
<h2>What is the difference between &quot;int *foo&quot; and &quot;int foo[]&quot;?</h2>
<p>In C and C++, declaring a function to take a parameter <tt class="docutils literal">int *foo</tt> and
<tt class="docutils literal">int foo[]</tt> results in the same type for the parameter.  Both are
pointers to integers.  In <tt class="docutils literal">ispc</tt>, these are different types.  The first
one is a varying pointer to a uniform integer value in memory, while the
second results in a uniform pointer to the start of an array of varying
integer values in memory.</p>
<p>To understand why the first is a varying pointer to a uniform integer,
first recall that types without explicit rate qualifiers (<tt class="docutils literal">uniform</tt>,
<tt class="docutils literal">varying</tt>, or <tt class="docutils literal">soa&lt;&gt;</tt>) are <tt class="docutils literal">varying</tt> by default.  Second, recall from
the <a class="reference external" href="ispc.html#pointer-types">discussion of pointer types in the ispc User's Guide</a> that pointed-to
types without rate qualifiers are <tt class="docutils literal">uniform</tt> by default.  (This second
rule is discussed further below, in <a class="reference internal" href="#why-are-pointed-to-types-uniform-by-default">Why are pointed-to types &quot;uniform&quot; by
default?</a>.)  The type of <tt class="docutils literal">int *foo</tt> follows from these.</p>
<p>Conversely, in a function body, <tt class="docutils literal">int foo[10]</tt> represents a declaration of
a 10-element array of varying <tt class="docutils literal">int</tt> values.  Since we'd certainly like
to be able to pass such an array to a function that takes an <tt class="docutils literal">int []</tt>
parameter, the natural type for an <tt class="docutils literal">int []</tt> parameter is a uniform
pointer to varying integer values.</p>
<p>In terms of compatibility with C/C++, it's unfortunate that this
distinction exists, though any other set of rules seems to introduce more
awkwardness than this one.  (Though we're interested to hear ideas to
improve these rules!)</p>
</div>
<div class="section" id="why-are-pointed-to-types-uniform-by-default">
<h2>Why are pointed-to types &quot;uniform&quot; by default?</h2>
<p>In <tt class="docutils literal">ispc</tt>, types without rate qualifiers are &quot;varying&quot; by default, but
types pointed to by pointers without rate qualifiers are &quot;uniform&quot; by
default.  Why this difference?</p>
<pre class="literal-block">
int foo;  // no rate qualifier, &quot;varying int&quot;.
uniform int *foo;  // pointer type has no rate qualifier, pointed-to does.
                   // &quot;varying pointer to uniform int&quot;.
int *foo;  // neither pointer type nor pointed-to type (&quot;int&quot;) have
           // rate qualifiers. Pointer type is varying by default,
           // pointed-to is uniform. &quot;varying pointer to uniform int&quot;.
varying int *foo;   // varying pointer to varying int
</pre>
<p>The first rule, having types without rate qualifiers be varying by default,
is a default that keeps the number of &quot;uniform&quot; or &quot;varying&quot; qualifiers in
<tt class="docutils literal">ispc</tt> programs low.  Most <tt class="docutils literal">ispc</tt> programs use mostly &quot;varying&quot;
variables, so this rule allows most variables to be declared without also
requiring rate qualifiers.</p>
<p>On a related note, this rule allows many C/C++ functions to be used to
define equivalent functions in the SPMD execution model that <tt class="docutils literal">ispc</tt>
provides with little or no modification:</p>
<pre class="literal-block">
// scalar add in C/C++, SPMD/vector add in ispc
int add(int a, int b) { return a + b; }
</pre>
<p>This motivation also explains why <tt class="docutils literal">uniform int *foo</tt> represents a varying
pointer; having pointers be varying by default if they don't have rate
qualifiers similarly helps with porting code from C/C++ to <tt class="docutils literal">ispc</tt>.</p>
<p>The trickier issue is why pointed-to types are &quot;uniform&quot; by default.  In our
experience, data in memory that is accessed via pointers is most often
uniform; this generally includes all data that has been allocated and
initialized by the C/C++ application code. In practice, &quot;varying&quot; types are
more generally (but not exclusively) used for local data in <tt class="docutils literal">ispc</tt>
functions.  Thus, making the pointed-to type uniform by default leads to
more concise code for the most common cases.</p>
</div>
<div class="section" id="what-am-i-getting-an-error-about-assigning-a-varying-lvalue-to-a-reference-type">
<h2>Why am I getting an error about assigning a varying lvalue to a reference type?</h2>
<p>Given code like the following:</p>
<pre class="literal-block">
uniform float a[...];
int index = ...;
float &amp;r = a[index];
</pre>
<p><tt class="docutils literal">ispc</tt> issues the error &quot;Initializer for reference-type variable &quot;r&quot; must
have a uniform lvalue type.&quot;.  The underlying issue stems from how
references are represented in the code generated by <tt class="docutils literal">ispc</tt>.  Recall that
<tt class="docutils literal">ispc</tt> supports both uniform and varying pointer types--a uniform pointer
points to the same location in memory for all program instances in the
gang, while a varying pointer allows each program instance to have its own
pointer value.</p>
<p>References are represented as a pointer in the code generated by <tt class="docutils literal">ispc</tt>,
though this is generally opaque to the user; in <tt class="docutils literal">ispc</tt>, they are
specifically uniform pointers.  This design decision was made so that given
code like this:</p>
<pre class="literal-block">
extern void func(float &amp;val);
float foo = ...;
func(foo);
</pre>
<p>Then the reference would be handled efficiently as a single pointer, rather
than unnecessarily being turned into a gang-size of pointers.</p>
<p>However, an implication of this decision is that it's not possible for
references to refer to completely different things for each of the program
instances.  (And hence the error that is issued).  In cases where a unique
per-program-instance pointer is needed, a varying pointer should be used
instead of a reference.</p>
</div>
</div>
<div class="section" id="interoperability">
<h1>Interoperability</h1>
<div class="section" id="how-can-i-supply-an-initial-execution-mask-in-the-call-from-the-application">
<h2>How can I supply an initial execution mask in the call from the application?</h2>
<p>Recall that when execution transitions from the application code to an
<tt class="docutils literal">ispc</tt> function, all of the program instances are initially executing.
In some cases, it may be desired that only some of them are running, based on
a data-dependent condition computed in the application program.  This
situation can easily be handled via an additional parameter from the
application.</p>
<p>As a simple example, consider a case where the application code has an
array of <tt class="docutils literal">float</tt> values and we'd like the <tt class="docutils literal">ispc</tt> code to update
just specific values in that array, where the application has determined
which values are to be updated.  In C++ code, we might
have:</p>
<pre class="literal-block">
int count = ...;
float *array = new float[count];
bool *shouldUpdate = new bool[count];
// initialize array and shouldUpdate
ispc_func(array, shouldUpdate, count);
</pre>
<p>Then, the <tt class="docutils literal">ispc</tt> code could process this update as:</p>
<pre class="literal-block">
export void ispc_func(uniform float array[], uniform bool update[],
                      uniform int count) {
    foreach (i = 0 ... count) {
        cif (update[i] == true) {
            // update array[i]; foreach already gives each program
            // instance its own value of i
        }
    }
}
</pre>
<p>(In this case a &quot;coherent&quot; if statement is likely to be worthwhile if the
<tt class="docutils literal">update</tt> array will tend to have sections that are either all-true or
all-false.)</p>
</div>
<div class="section" id="how-can-i-generate-a-single-binary-executable-with-support-for-multiple-instruction-sets">
<h2>How can I generate a single binary executable with support for multiple instruction sets?</h2>
<p><tt class="docutils literal">ispc</tt> can generate output that supports multiple target instruction
sets, along with code that chooses the most appropriate one at runtime,
if multiple targets are specified with the <tt class="docutils literal"><span class="pre">--target</span></tt> command-line
argument.</p>
<p>For example, if you run the command:</p>
<pre class="literal-block">
ispc foo.ispc -o foo.o --target=sse2,sse4-x2,avx-x2
</pre>
<p>Then four object files will be generated: <tt class="docutils literal">foo_sse2.o</tt>, <tt class="docutils literal">foo_sse4.o</tt>,
<tt class="docutils literal">foo_avx.o</tt>, and <tt class="docutils literal">foo.o</tt> <a class="footnote-reference" href="#footnote-1" id="footnote-reference-1">[1]</a>. Link all of these into your executable, and
when you call a function in <tt class="docutils literal">foo.ispc</tt> from your application code,
<tt class="docutils literal">ispc</tt> will determine which instruction sets are supported by the CPU the
code is running on and will call the most appropriate version of the
function available.</p>
<table class="docutils footnote" frame="void" id="footnote-1" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#footnote-reference-1">[1]</a></td><td>Similarly, if you choose to generate assembly language output or
LLVM bitcode output, multiple versions of those files will be created.</td></tr>
</tbody>
</table>
<p>In general, the version of the function that runs will be the one for the
most capable instruction set, among those compiled, that the system supports.  If you only
compile SSE2 and SSE4 variants and run on a system that supports AVX, for
example, then the SSE4 variant will be executed.  If the system
is not able to run any of the available variants of the function (for
example, trying to run a function that only has SSE4 and AVX variants on a
system that only supports SSE2), then the standard library <tt class="docutils literal">abort()</tt>
function will be called.</p>
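<p>The selection rule described above can be sketched in C++ as follows.  This is
not the dispatch code that <tt class="docutils literal">ispc</tt> generates; the <tt class="docutils literal">Isa</tt> enumeration and the
<tt class="docutils literal">selectVariant</tt> function are hypothetical names, and the sketch assumes that a
CPU supporting one instruction set also supports all less capable ones.  It
returns -1 where the real dispatch function would call <tt class="docutils literal">abort()</tt>.</p>
<pre class="literal-block">
#include &lt;initializer_list&gt;

// Assumed ordering of instruction sets, from least to most capable.
enum Isa { SSE2 = 0, SSE4 = 1, AVX = 2 };

// Returns the most capable ISA that was both compiled in and is supported
// by the CPU, or -1 if there is none.  'compiled' lists the variants built
// with --target; 'cpuBest' is the most capable ISA the CPU supports.
int selectVariant(std::initializer_list&lt;Isa&gt; compiled, Isa cpuBest) {
    int best = -1;
    for (Isa isa : compiled)
        if (isa &lt;= cpuBest &amp;&amp; isa &gt; best)
            best = isa;
    return best;
}
</pre>
<p>With only SSE2 and SSE4 variants compiled, <tt class="docutils literal">selectVariant({SSE2, SSE4}, AVX)</tt>
yields <tt class="docutils literal">SSE4</tt>, matching the example above.</p>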
<p>One subtlety is that all non-static global variables (if any) must have the
same size and layout with all of the targets used.  For example, if you
have the global variables:</p>
<pre class="literal-block">
uniform int foo[2*programCount];
int bar;
</pre>
<p>and compile to both SSE2 and AVX targets, each of these variables will have
a different size on the two targets (the first due to <tt class="docutils literal">programCount</tt> having the value 4 for SSE2
and 8 for AVX, and the second due to <tt class="docutils literal">varying</tt> types having different
numbers of elements with the two targets--essentially the same issue as the
first).  <tt class="docutils literal">ispc</tt> issues an error in this case.</p>
</div>
<div class="section" id="how-can-i-determine-at-run-time-which-vector-instruction-set-s-instructions-were-selected-to-execute">
<h2>How can I determine at run-time which vector instruction set's instructions were selected to execute?</h2>
<p><tt class="docutils literal">ispc</tt> doesn't provide any API that allows querying which vector ISA's
instructions are running when multi-target compilation was used.  However,
this can be solved in &quot;user space&quot; by writing a small helper function.
Specifically, if you implement a function like this</p>
<pre class="literal-block">
export uniform int isa() {
#if defined(ISPC_TARGET_SSE2)
    return 0;
#elif defined(ISPC_TARGET_SSE4)
    return 1;
#elif defined(ISPC_TARGET_AVX)
    return 2;
#else
    return -1;
#endif
}
</pre>
<p>and then call it from your application code at runtime, it will return 0,
1, or 2, depending on which target's instructions are running.</p>
<p>The way this works is a little surprising, but it's a useful trick.  Of
course the preprocessor <tt class="docutils literal">#if</tt> checks are all compile-time only
operations.  What's actually happening is that the function is compiled
multiple times, once for each target, with the appropriate <tt class="docutils literal">ISPC_TARGET</tt>
preprocessor symbol set.  Then, a small dispatch function is generated for
the application to actually call.  This dispatch function in turn calls the
appropriate version of the function based on the CPU of the system it's
executing on, which in turn returns the appropriate value.</p>
<p>In a similar fashion, it's possible to find out at run-time the value of
<tt class="docutils literal">programCount</tt> for the target that's actually being used.</p>
<pre class="literal-block">
export uniform int width() { return programCount; }
</pre>
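<p>On the application side, the integer returned by the <tt class="docutils literal">isa()</tt> helper can be
mapped back to a readable name.  The following is a minimal sketch; the function
name <tt class="docutils literal">isaName</tt> is hypothetical, and the codes are simply the ones chosen in the
helper above.</p>
<pre class="literal-block">
#include &lt;string&gt;

// Maps the codes returned by the isa() helper above to readable names.
// The codes 0, 1, 2, and -1 are the ones chosen in that helper.
std::string isaName(int code) {
    switch (code) {
    case 0:  return "SSE2";
    case 1:  return "SSE4";
    case 2:  return "AVX";
    default: return "unknown";
    }
}
</pre>
<p>An application could then log something like <tt class="docutils literal"><span class="pre">isaName(ispc::isa())</span></tt>, assuming
the header generated by <tt class="docutils literal">ispc</tt> for the helper is included and declares it in
the <tt class="docutils literal">ispc</tt> namespace.</p>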
</div>
<div class="section" id="is-it-possible-to-inline-ispc-functions-in-c-c-code">
<h2>Is it possible to inline ispc functions in C/C++ code?</h2>
<p>If you're willing to use the <tt class="docutils literal">clang</tt> C/C++ compiler that's part of the
LLVM tool suite, then it is possible to inline <tt class="docutils literal">ispc</tt> code with C/C++
(and conversely, to inline C/C++ calls in <tt class="docutils literal">ispc</tt>).  Doing so can provide
performance advantages when calling out to short functions written in the
&quot;other&quot; language.  Note that you don't need to use <tt class="docutils literal">clang</tt> to compile all
of your C/C++ code, but only for the files where you want to be able to
inline.  In order to do this, you must have a full installation of LLVM
version 3.0 or later, including the <tt class="docutils literal">clang</tt> compiler.</p>
<p>The basic approach is to have the various compilers emit LLVM intermediate
representation (IR) code and to then use tools from LLVM to link together
the IR from the compilers and then re-optimize it, which gives the LLVM
optimizer the opportunity to do additional inlining and cross-function
optimizations.  If you have source files <tt class="docutils literal">foo.ispc</tt> and <tt class="docutils literal">foo.cpp</tt>,
first emit LLVM IR:</p>
<pre class="literal-block">
ispc --emit-llvm -o foo_ispc.bc foo.ispc
clang -O2 -c -emit-llvm -o foo_cpp.bc foo.cpp
</pre>
<p>Next, link the two IR files into a single file and run the LLVM optimizer
on the result:</p>
<pre class="literal-block">
llvm-link foo_ispc.bc foo_cpp.bc -o - | opt -O3 -o foo_opt.bc
</pre>
<p>And finally, generate a native object file:</p>
<pre class="literal-block">
llc -filetype=obj foo_opt.bc -o foo.o
</pre>
<p>This file can in turn be linked in with the rest of your object files when
linking your application.</p>
<p>(Note that if you're using the AVX instruction set, you must provide the
<tt class="docutils literal"><span class="pre">-mattr=+avx</span></tt> flag to <tt class="docutils literal">llc</tt>.)</p>
</div>
<div class="section" id="why-is-it-illegal-to-pass-varying-values-from-c-c-to-ispc-functions">
<h2>Why is it illegal to pass &quot;varying&quot; values from C/C++ to ispc functions?</h2>
<p>If any of the types in the parameter list to an exported function is
&quot;varying&quot; (including recursively, and members of structure types, etc.),
then <tt class="docutils literal">ispc</tt> will issue an error and refuse to compile the function:</p>
<pre class="literal-block">
% echo &quot;export int add(int x) { return ++x; }&quot; | ispc
&lt;stdin&gt;:1:12: Error: Illegal to return a &quot;varying&quot; type from exported function &quot;add&quot;
&lt;stdin&gt;:1:20: Error: Varying parameter &quot;x&quot; is illegal in an exported function.
</pre>
<p>While there's no fundamental reason why this isn't possible, recall the
definition of &quot;varying&quot; variables: they have one value for each program
instance in the gang.  As such, the number of values and amount of storage
required to represent a varying variable depends on the gang size
(i.e. <tt class="docutils literal">programCount</tt>), which can have different values depending on the
compilation target.</p>
<p><tt class="docutils literal">ispc</tt> therefore prohibits passing &quot;varying&quot; values between the
application and the <tt class="docutils literal">ispc</tt> program.  This prevents the
application-side code from depending on a particular gang size and
encourages portability to different gang sizes (a generally desirable
programming practice).</p>
<p>For cases where the size of data is actually fixed from the application
side, the value can be passed via a pointer to a short <tt class="docutils literal">uniform</tt> array,
as follows:</p>
<pre class="literal-block">
export void add4(uniform int ptr[4]) {
    foreach (i = 0 ... 4)
        ptr[i]++;
}
</pre>
<p>On the 4-wide SSE instruction set, this compiles to a single vector add
instruction (and associated move instructions), while it still also
efficiently computes the correct result on 8-wide AVX targets.</p>
</div>
</div>
<div class="section" id="programming-techniques">
<h1>Programming Techniques</h1>
<div class="section" id="what-primitives-are-there-for-communicating-between-spmd-program-instances">
<h2>What primitives are there for communicating between SPMD program instances?</h2>
<p>The <tt class="docutils literal">broadcast()</tt>, <tt class="docutils literal">rotate()</tt>, and <tt class="docutils literal">shuffle()</tt> standard library
routines provide a variety of mechanisms for the running program instances
to communicate values to each other during execution.  Note that there's no
need to synchronize the program instances before communicating between
them, due to the synchronized execution model of gangs of program instances
in <tt class="docutils literal">ispc</tt>.</p>
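<p>As a rough illustration of what these routines compute, the following scalar C++
model represents a gang as an array with one element per program instance.  It is
a sketch only: the names <tt class="docutils literal">Gang</tt>, <tt class="docutils literal">broadcastModel</tt>, <tt class="docutils literal">rotateModel</tt>, and
<tt class="docutils literal">shuffleModel</tt> are hypothetical, a 4-wide gang is assumed, and all program
instances are assumed to be running.</p>
<pre class="literal-block">
#include &lt;array&gt;

constexpr int kGangSize = 4;               // assumed gang size
using Gang = std::array&lt;int, kGangSize&gt;;   // one value per program instance

// broadcast(): every instance receives the value held by instance 'index'.
Gang broadcastModel(const Gang &amp;value, int index) {
    Gang out;
    out.fill(value[index]);
    return out;
}

// rotate(): instance i receives the value of the instance 'offset' steps
// away, wrapping around the gang.
Gang rotateModel(const Gang &amp;value, int offset) {
    Gang out;
    for (int i = 0; i &lt; kGangSize; ++i) {
        int src = ((i + offset) % kGangSize + kGangSize) % kGangSize;
        out[i] = value[src];
    }
    return out;
}

// shuffle(): instance i receives the value of instance permutation[i].
Gang shuffleModel(const Gang &amp;value, const Gang &amp;permutation) {
    Gang out;
    for (int i = 0; i &lt; kGangSize; ++i)
        out[i] = value[permutation[i]];
    return out;
}
</pre>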
</div>
<div class="section" id="how-can-a-gang-of-program-instances-generate-variable-amounts-of-output-efficiently">
<h2>How can a gang of program instances generate variable amounts of output efficiently?</h2>
<p>It's not unusual to have a gang of program instances where each program
instance generates a variable amount of output (perhaps some generate no
output, some generate one output value, some generate many output values
and so forth), and where one would like to have the output densely packed
in an output array.  The <tt class="docutils literal">exclusive_scan_add()</tt> function from the
standard library is quite useful in this situation.</p>
<p>Consider the following function:</p>
<pre class="literal-block">
uniform int func(uniform float outArray[], ...) {
   int numOut = ...;  // figure out how many to be output
   float outLocal[MAX_OUT]; // staging area

   // each program instance in the gang puts its results in
   //  outLocal[0], ..., outLocal[numOut-1]

   int startOffset = exclusive_scan_add(numOut);
   for (int i = 0; i &lt; numOut; ++i)
       outArray[startOffset + i] = outLocal[i];
   return reduce_add(numOut);
}
</pre>
<p>Here, each program instance has computed a number, <tt class="docutils literal">numOut</tt>, of values to
output, and has stored them in the <tt class="docutils literal">outLocal</tt> array.  Assume that four
program instances are running and that the first one wants to output one
value, the second two values, and the third and fourth three values each.
In this case, <tt class="docutils literal">exclusive_scan_add()</tt> will return the values (0, 1, 3, 6)
to the four program instances, respectively.</p>
<p>The first program instance will then write its one result to
<tt class="docutils literal">outArray[0]</tt>, the second will write its two values to <tt class="docutils literal">outArray[1]</tt>
and <tt class="docutils literal">outArray[2]</tt>, and so forth.  The <tt class="docutils literal">reduce_add()</tt> call at the end
returns the total number of values that all of the program instances have
written to the array.</p>
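<p>A concrete, self-contained variant of this pattern might look like the
following sketch.  The function <tt class="docutils literal">packIndices()</tt> and its choice of
output counts are purely illustrative, and the caller is assumed to supply
an <tt class="docutils literal">outArray</tt> with room for at least <tt class="docutils literal">3 * programCount</tt> values:</p>
<pre class="literal-block">
export uniform int packIndices(uniform float outArray[]) {
    // Illustrative only: each instance emits one to three values.
    int numOut = (programIndex % 3) + 1;

    // Sum of numOut over all lower-numbered running instances, which
    // is where this instance's output begins.
    int startOffset = exclusive_scan_add(numOut);

    // Instances with a smaller numOut drop out of the loop early.
    for (int i = 0; i &lt; numOut; ++i)
        outArray[startOffset + i] = programIndex;

    // Total number of values written by the gang.
    return (uniform int)reduce_add(numOut);
}
</pre>
<p>With four program instances running, all of them active, the counts are
(1, 2, 3, 1), the start offsets are (0, 1, 3, 6), and the function returns 7.</p>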
</div>
<div class="section" id="is-it-possible-to-use-ispc-for-explicit-vector-programming">
<h2>Is it possible to use ispc for explicit vector programming?</h2>
<p>The typical model for programming in <tt class="docutils literal">ispc</tt> is an <em>implicit</em> parallel
model, where one writes a program that is apparently doing scalar
computation on values and the program is then vectorized to run in parallel
across the SIMD lanes of a processor.  However, <tt class="docutils literal">ispc</tt> also has some
support for explicit vector programming, where the programmer operates
directly on vector values.  Some computations may be more effectively
described in the explicit model than in the implicit model.</p>
<p>This support is provided via <tt class="docutils literal">uniform</tt> instances of short vectors.
Specifically, if this short program</p>
<pre class="literal-block">
export uniform float&lt;8&gt; madd(uniform float&lt;8&gt; a, uniform float&lt;8&gt; b,
                             uniform float&lt;8&gt; c) {
    return a + b * c;
}
</pre>
<p>is compiled with the AVX target, <tt class="docutils literal">ispc</tt> generates the following assembly:</p>
<pre class="literal-block">
_madd:
    vmulps  %ymm2, %ymm1, %ymm1
    vaddps  %ymm0, %ymm1, %ymm0
    ret
</pre>
<p>(And similarly, if compiled with a 4-wide SSE target, two <tt class="docutils literal">mulps</tt> and two
<tt class="docutils literal">addps</tt> instructions are generated, and so forth.)</p>
<p>Note that <tt class="docutils literal">ispc</tt> doesn't currently support control flow based on
<tt class="docutils literal">uniform</tt> short vector types; it is thus not possible to write code like:</p>
<pre class="literal-block">
export uniform int&lt;8&gt; count(uniform float&lt;8&gt; a, uniform float&lt;8&gt; b) {
    uniform int&lt;8&gt; sum = 0;
    while (a++ &lt; b)
        ++sum;
    return sum;
}
</pre>
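<p>One possible workaround, sketched here rather than taken from the
<tt class="docutils literal">ispc</tt> documentation, is to drive the control flow from <tt class="docutils literal">uniform</tt>
scalars by visiting the vector elements one at a time.  This gives up
vectorization of the inner loop, so it only illustrates what can be
expressed; the function name <tt class="docutils literal">countPerElement()</tt> is hypothetical:</p>
<pre class="literal-block">
uniform int&lt;8&gt; countPerElement(uniform float&lt;8&gt; a, uniform float&lt;8&gt; b) {
    uniform int&lt;8&gt; sum = 0;
    // Control flow depends only on uniform scalars, one element at a time.
    for (uniform int i = 0; i &lt; 8; ++i) {
        uniform float ai = a[i];
        while (ai++ &lt; b[i])
            sum[i] = sum[i] + 1;
    }
    return sum;
}
</pre>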
</div>
<div class="section" id="how-can-i-debug-my-ispc-programs-using-valgrind">
<h2>How can I debug my ispc programs using Valgrind?</h2>
<p>The <a class="reference external" href="http://valgrind.org">valgrind</a> tool is an extremely useful memory checker for
Linux and OSX; it detects a range of memory errors, including accessing
memory after it has been freed, accessing memory beyond the end of an
array, accessing uninitialized stack variables, and so forth.
In general, applications that use <tt class="docutils literal">ispc</tt> code run with <tt class="docutils literal">valgrind</tt>
without modification and <tt class="docutils literal">valgrind</tt> will detect the same range of memory
errors in <tt class="docutils literal">ispc</tt> code that it does in C/C++ code.</p>
<p>One issue to be aware of is that older versions of <tt class="docutils literal">valgrind</tt> only
supported the SSE2 vector instructions; if you are using a version of
<tt class="docutils literal">valgrind</tt> older than the 3.7.0 release (5 November 2011), you should
compile your <tt class="docutils literal">ispc</tt> programs with <tt class="docutils literal"><span class="pre">--target=sse2</span></tt> before running them
through <tt class="docutils literal">valgrind</tt>.  (Note that if no target is specified, then <tt class="docutils literal">ispc</tt>
chooses a target based on the capabilities of the system you're running
<tt class="docutils literal">ispc</tt> on.)  If you run an <tt class="docutils literal">ispc</tt> program that uses instructions that
<tt class="docutils literal">valgrind</tt> doesn't support, you'll see an error message like:</p>
<pre class="literal-block">
vex amd64-&gt;IR: unhandled instruction bytes: 0xC5 0xFA 0x10 0x0 0xC5 0xFA 0x11 0x84
==46059== valgrind: Unrecognised instruction at address 0x100002707.
</pre>
<p>Version 3.7.0 of <tt class="docutils literal">valgrind</tt> added support for the SSE4.2 instruction
set; if you're using that version or a newer one (and your system supports
SSE4.2), then you can use <tt class="docutils literal"><span class="pre">--target=sse4</span></tt> when compiling to run with
<tt class="docutils literal">valgrind</tt>.</p>
<p>Note that <tt class="docutils literal">valgrind</tt> 3.7.0 does not support programs that use the AVX
instruction set; AVX support arrived in later <tt class="docutils literal">valgrind</tt> releases, so
check the release notes for the version you have installed.</p>
</div>
<div class="section" id="foreach-statements-generate-more-complex-assembly-than-i-d-expect-what-s-going-on">
<h2>foreach statements generate more complex assembly than I'd expect; what's going on?</h2>
<p>Given a simple <tt class="docutils literal">foreach</tt> loop like the following:</p>
<pre class="literal-block">
void foo(uniform float a[], uniform int count) {
    foreach (i = 0 ... count)
        a[i] *= 2;
}
</pre>
<p>the <tt class="docutils literal">ispc</tt> compiler generates approximately 40 instructions.  Why isn't
the generated code simpler?</p>
<p>There are two main components to the code: one handles
<tt class="docutils literal">programCount</tt>-sized chunks of elements of the array, and the other
handles any excess elements at the end of the array that don't completely
fill a gang.  The code for the main loop is essentially what one would
expect: a vector of values is loaded from the array, the multiply is done,
and the result is stored.</p>
<pre class="literal-block">
LBB0_2:                                 ## %foreach_full_body
    movslq  %edx, %rdx
    vmovups (%rdi,%rdx), %ymm1
    vmulps  %ymm0, %ymm1, %ymm1
    vmovups %ymm1, (%rdi,%rdx)
    addl    $32, %edx
    addl    $8, %eax
    cmpl    %ecx, %eax
    jl      LBB0_2
</pre>
<p>Then, there is a sequence of instructions that handles any additional
elements at the end of the array.  (These instructions don't execute if
there aren't any left-over values to process, but they do add to the
amount of generated code.)</p>
<pre class="literal-block">
## BB#4:                                ## %partial_inner_only
      vmovd   %eax, %xmm0
      vinsertf128     $1, %xmm0, %ymm0, %ymm0
      vpermilps       $0, %ymm0, %ymm0 ## ymm0 = ymm0[0,0,0,0,4,4,4,4]
      vextractf128    $1, %ymm0, %xmm3
      vmovd   %esi, %xmm2
      vmovaps LCPI0_1(%rip), %ymm1
      vextractf128    $1, %ymm1, %xmm4
      vpaddd  %xmm4, %xmm3, %xmm3
      # ....
      vmulps  LCPI0_0(%rip), %ymm1, %ymm1
      vmaskmovps      %ymm1, %ymm0, (%rdi,%rax)
</pre>
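<p>The two components described above can be written out by hand, which may
make the structure of the generated code easier to follow.  This is only an
approximation of what the compiler does, not its actual lowering; the
function name <tt class="docutils literal">foo_manual()</tt> is hypothetical and <tt class="docutils literal">count</tt> is assumed
to be non-negative:</p>
<pre class="literal-block">
void foo_manual(uniform float a[], uniform int count) {
    // Largest multiple of programCount that does not exceed count.
    uniform int fullCount = (count / programCount) * programCount;

    // Full gangs: every program instance has a valid element.
    for (uniform int base = 0; base &lt; fullCount; base += programCount) {
        int i = base + programIndex;
        a[i] *= 2;
    }

    // Partial gang: only instances that map to a valid element run.
    int i = fullCount + programIndex;
    if (i &lt; count)
        a[i] *= 2;
}
</pre>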
<p>If you know that the number of elements to be processed will always be an
exact multiple of 8, 16, etc., then adding a simple assignment to
<tt class="docutils literal">count</tt> like the one below gives the compiler enough information to
eliminate the code for the additional array elements.</p>
<pre class="literal-block">
void foo(uniform float a[], uniform int count) {
    // This assignment doesn't change the value of count
    // if it's a multiple of 16, but it gives the compiler
    // insight into this fact, allowing for simpler code to
    // be generated for the foreach loop.
    count = (count &amp; ~(16-1));
    foreach (i = 0 ... count)
        a[i] *= 2;
}
</pre>
<p>With this new version of <tt class="docutils literal">foo()</tt>, only the code for the first loop above
is generated.</p>
</div>
<div class="section" id="how-do-i-launch-an-individual-task-for-each-active-program-instance">
<h2>How do I launch an individual task for each active program instance?</h2>
<p>Recall from the <a class="reference external" href="ispc.html#task-parallelism-launch-and-sync-statements">discussion of &quot;launch&quot; in the ispc User's Guide</a> that a
<tt class="docutils literal">launch</tt> statement launches a single task corresponding to a single gang
of executing program instances, where the indices of the active program
instances are the same as were active when the <tt class="docutils literal">launch</tt> statement
executed.</p>
<p>In some situations, it's desirable to be able to launch an individual task
for each executing program instance.  For example, we might be performing
an iterative computation where a subset of the program instances determine
that an item they are responsible for requires additional processing.</p>
<pre class="literal-block">
bool itemNeedsMoreProcessing(int);
int itemNum = ...;
if (itemNeedsMoreProcessing(itemNum)) {
    // do additional work
}
</pre>
<p>For performance reasons, it may be desirable to apply an entire gang's
worth of computation to each item that needs additional processing;
there may be available parallelism in this computation such that we'd like
to process each of the items with SPMD computation.</p>
<p>In this case, the <tt class="docutils literal">foreach_active</tt> and <tt class="docutils literal">unmasked</tt> constructs can be
applied together to accomplish this goal.</p>
<pre class="literal-block">
// do additional work
task void doWork(uniform int index);
foreach_active (index) {
    unmasked {
        launch doWork(extract(itemNum, index));
    }
}
</pre>
<p>Recall that the body of the <tt class="docutils literal">foreach_active</tt> loop runs once for each
active program instance, with each active program instance's
<tt class="docutils literal">programIndex</tt> value available in <tt class="docutils literal">index</tt> in the above.  In the loop,
we can re-establish an &quot;all on&quot; execution mask, enabling execution in all
of the program instances in the gang, such that execution in <tt class="docutils literal">doWork()</tt>
starts with all instances running.  (Alternatively, the <tt class="docutils literal">unmasked</tt> block
could be in the definition of <tt class="docutils literal">doWork()</tt>.)</p>
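<p>Putting the fragments above together, a complete function might look like
the following sketch.  The body of <tt class="docutils literal">doWork()</tt>, the definition of
<tt class="docutils literal">itemNeedsMoreProcessing()</tt>, and the <tt class="docutils literal">processItems()</tt> wrapper with its
<tt class="docutils literal">firstItem</tt> parameter are all hypothetical placeholders:</p>
<pre class="literal-block">
// Hypothetical task: processes one item using a full gang.
task void doWork(uniform int item) {
    // SPMD computation for this item would go here.
}

// Hypothetical predicate, defined only to keep the sketch complete.
static inline bool itemNeedsMoreProcessing(int item) {
    return (item &amp; 1) != 0;
}

void processItems(uniform int firstItem) {
    int itemNum = firstItem + programIndex;
    if (itemNeedsMoreProcessing(itemNum)) {
        // One iteration per active program instance.
        foreach_active (index) {
            unmasked {
                launch doWork(extract(itemNum, index));
            }
        }
    }
    // Wait for all launched tasks to finish.
    sync;
}
</pre>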
</div>
</div>
</div>
    <div class="clearfix"></div>
    <div id="footer"> &copy; <strong>Intel Corporation</strong> | Valid <a href="http://validator.w3.org/check?uri=referer">XHTML</a> | <a href="http://jigsaw.w3.org/css-validator/check/referer">CSS</a> | ClearBlue  by: <a href="http://www.themebin.com/">ThemeBin</a>
      <!-- Please Do Not remove this link, thank u -->
      </div>
      </div>
      </div>
      </div>
</div>
</body>
</html>