external/kiss_fft/README.simd

If you are reading this, it means you think you may be interested in using the SIMD extensions within kissfft.

Beware! Beyond here there be dragons!

This API is not easy to use, is not well documented, and breaks the KISS principle.


Still reading? Okay, you may get rewarded for your patience with a considerable speedup
(2-3x) on Intel x86 machines with SSE if you are willing to jump through some hoops.

The basic idea is to use the packed 4-float __m128 data type as the scalar element.
This makes the data format fairly convoluted: each fft call performs 4 FFTs at once, on signals A, B, C and D.

For complex data, the data is interleaved as follows:
rA0,rB0,rC0,rD0,      iA0,iB0,iC0,iD0,   rA1,rB1,rC1,rD1, iA1,iB1,iC1,iD1 ...
where "rA0" is the real part of the zeroth sample for signal A and "iA0" is its imaginary part.

Real-only data is laid out:
rA0,rB0,rC0,rD0,     rA1,rB1,rC1,rD1,      ...
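
The two layouts above reduce to simple offset arithmetic. The following is only a sketch: the helper names simd_cpx_offset and simd_real_offset are hypothetical and are not part of kissfft. It treats the buffer as a plain array of floats and numbers signals A..D as 0..3.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers (not part of kissfft): offsets, counted in floats,
// into a buffer laid out as described above. `signal` is 0..3 for A..D.

// Offset of the real (imag == false) or imaginary (imag == true) part of
// complex sample `n` of `signal`. Each complex sample spans 8 floats:
// four real parts followed by four imaginary parts.
inline std::size_t simd_cpx_offset(std::size_t n, std::size_t signal, bool imag)
{
    assert(signal < 4);
    return n * 8 + (imag ? 4 : 0) + signal;
}

// Offset of real-only sample `n` of `signal`. Each sample spans 4 floats.
inline std::size_t simd_real_offset(std::size_t n, std::size_t signal)
{
    assert(signal < 4);
    return n * 4 + signal;
}
```

For example, rA1 in the complex layout is simd_cpx_offset(1, 0, false), which is float number 8, and iB1 is simd_cpx_offset(1, 1, true), which is float number 13.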

Compile with gcc flags something like
-O3 -mpreferred-stack-boundary=4  -DUSE_SIMD=1 -msse

Be aware of SIMD alignment.  This is the most likely cause of segfaults.
The code within kissfft uses scratch variables on the stack.
With SIMD, these must have addresses on 16 byte boundaries
(-mpreferred-stack-boundary=4 asks gcc to keep the stack aligned to 2^4 = 16 bytes).
Search on "SIMD alignment" for more info.
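
When chasing an alignment segfault, it can help to check the addresses of your own buffers before handing them to the FFT. The sketch below is an assumption-laden illustration, not kissfft code: is_aligned16 and aligned_stack_buffer_demo are hypothetical names, and alignas requires C++11. It does not change how kissfft places its own scratch variables; it only covers buffers that the caller declares.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of kissfft): true if p sits on a 16 byte boundary.
inline bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// A stack buffer of 64 floats (room for 16 packed groups of 4), forced onto a
// 16 byte boundary with alignas. Returns whether the address really is aligned.
inline bool aligned_stack_buffer_demo()
{
    alignas(16) float buf[64] = {0};
    assert(is_aligned16(buf));
    return is_aligned16(buf);
}
```

A check such as assert(is_aligned16(ptr)) placed just before the fft call tends to turn a hard-to-read segfault into an obvious failure at the call site.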

Robin at Divide Concept was kind enough to share his code for formatting to/from the SIMD kissfft.
I have not run it -- use it at your own risk.

#include <xmmintrin.h>  // __m128, _mm_set_ps

// source holds four signals back to back (A, then B, C, D), each size128 floats long.
// target receives size128 __m128 values (A_i,B_i,C_i,D_i) and must be 16 byte aligned.
void SSETools::pack128(float* target, float* source, unsigned long size128)
{
   __m128* pDest = (__m128*)target;
   __m128* pDestEnd = pDest+size128;
   float* source0=source;
   float* source1=source0+size128;
   float* source2=source1+size128;
   float* source3=source2+size128;

   while(pDest<pDestEnd)
   {
       // _mm_set_ps takes the highest element first, so *source0 ends up at the lowest address
       *pDest=_mm_set_ps(*source3,*source2,*source1,*source0);
       source0++;
       source1++;
       source2++;
       source3++;
       pDest++;
   }
}

// Inverse of pack128: source holds size128 groups of four interleaved floats;
// target receives the four signals back to back, each size128 floats long.
void SSETools::unpack128(float* target, float* source, unsigned long size128)
{
   float* pSrc = source;
   float* pSrcEnd = pSrc+size128*4;
   float* target0=target;
   float* target1=target0+size128;
   float* target2=target1+size128;
   float* target3=target2+size128;

   while(pSrc<pSrcEnd)
   {
       *target0=pSrc[0];
       *target1=pSrc[1];
       *target2=pSrc[2];
       *target3=pSrc[3];
       target0++;
       target1++;
       target2++;
       target3++;
       pSrc+=4;
   }
}
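
Since the routines above are untested, a scalar reference can be handy for checking them. The following is a minimal sketch with hypothetical names (pack_ref, unpack_ref) that needs no SSE. It assumes the same convention the code above appears to use: the unpacked buffer holds the four signals back to back, each n floats long.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference for the packing (hypothetical names, not part of kissfft).
// `source` holds four signals back to back, each n floats long:
//   A0..A(n-1), B0..B(n-1), C0..C(n-1), D0..D(n-1)
// The result interleaves them as A0,B0,C0,D0, A1,B1,C1,D1, ...
inline std::vector<float> pack_ref(const std::vector<float>& source, std::size_t n)
{
    assert(source.size() == 4 * n);
    std::vector<float> packed(4 * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t s = 0; s < 4; ++s)
            packed[4 * i + s] = source[s * n + i];
    return packed;
}

// Inverse of pack_ref: interleaved groups of four back to four separate signals.
inline std::vector<float> unpack_ref(const std::vector<float>& packed, std::size_t n)
{
    assert(packed.size() == 4 * n);
    std::vector<float> planar(4 * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t s = 0; s < 4; ++s)
            planar[s * n + i] = packed[4 * i + s];
    return planar;
}
```

Packing and then unpacking should give back the original buffer, and the output of pack_ref can be compared float by float against the target buffer written by the SSE pack routine.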