Introduction
I've been looking for a way to write math intensive code for the Propeller that can run in a single cog. I need to write code to do floating point quaternion multiplication, and I don't want to have to write it in assembly. So, I posted a question on the Parallax forums about how to compile from a high level language (HLL) to readable Propeller assembly (PASM). I got two responses: PropBasic and PropGCC. I took a look at each of them, and did a side by side comparison of their features and the code they produce. GCC seemed to be better, mostly because it can do certain optimizations.
For the most part, I hope to use this information to write a quaternion math object. Quaternion math is computationally intensive: one quaternion multiplication has 12 floating point additions and 16 floating point multiplications. This is particularly relevant for the Propeller since it has no floating point hardware or multiplier hardware. Thus, I can't have ~28 source code lines with just the math, it has to have routines for floating point operations.
The root of the problem is that most HLL compilers for the Propeller use an interpreter, rather than generating PASM directly. This is because the code space for a processor is so small: 496 32 bit words (instructions plus data). So it's very difficult to get a useful program in that amount of memory. Usually a compiler will store instructions in the hub RAM, EEPROM, external RAM, or an SD card. The compiler will place a interpreter in the cog RAM which fetches instructions and executes them. This allows for larger programs, but it is much slower since each instruction needs to be fetched.
With this testing I needed something that did not use a memory model, but just treated the available cog memory as unlimited and compiled directly to PASM. With this method it is up to the designer to make sure that the result fits in cog memory, and is interfaced to properly.
I evaluated the two languages that were suitable: PropBasic and PropGCC.
PropBasic
I used PropBasic version 00.01.14 (2011-07-26) to test with. The source code that I used is a simple program to multiply some numbers together, add them, and divide with them. This goes well with the math intensive but no I/O application that I need. It's probably not a good benchmark to use if you're going to be doing complicated serial communication or anything with delays, I/O, etc.
A side note about PropBasic: the syntax is a bit quirky. It requires that your code have only one operator/statement per line. So "num = a+b+c" is out. It's odd, but easy enough to work with.
1: DEVICE P8X32A, XTAL1, PLL16X
2: FREQ 80_000_000
3: num1 VAR LONG
4: num2 VAR LONG
5: num3 VAR LONG
6: result0 VAR LONG
7: PROGRAM Start
8: Start:
9: DO
10: num3 = num3 * num3
11: num2 = num2 * num2
12: num1 = num1 * num1
13: result0 = num1 + num2
14: result0 = result0 + num3
15: result0 = result0 / num3
16: LOOP
17: END
I used the following command to test with:
./PropBasic-bst.linux test.pbas
There doesn't seem to be any command line options to use. Anyway, that generated the following Spin file:
1: '{$BST PATH {REMOVED FROM POSTING}}
2: '' *** COMPILED WITH PropBasic VERSION 00.01.14 July 26, 2011 ***
3: '' This program tests the compiler for PropBasic.
4: '' Is result a command???
5: CON 'DEVICE P8X32A, XTAL1, PLL16X
6: _ClkMode = XTAL1 + PLL16X
7: _XInFreq = 5000000 'FREQ 80_000_000
8: ' num1 VAR LONG 'num1 VAR LONG
9: ' num2 VAR LONG 'num2 VAR LONG
10: ' num3 VAR LONG 'num3 VAR LONG
11: ' result0 VAR LONG 'result0 VAR LONG
12: PUB __Program 'PROGRAM Start
13: CogInit(0, @__Init, @__DATASTART)
14: DAT
15: org 0
16: __Init
17: __RAM
18: mov dira,__InitDirA
19: mov outa,__InitOutA
20: jmp #Start
21: Start 'Start:
22: __DO_1 ' DO
23: mov __temp1,num3 ' num3 = num3 * num3
24: mov __temp2,num3
25: abs __temp1,__temp1 WC
26: muxc __temp3,#1
27: abs __temp2,__temp2 WC, WZ
28: IF_C xor __temp3,#1
29: mov __temp4,#0
30: mov __temp5,#32
31: shr __temp1,#1 WC
32: __L0001
33: IF_C add __temp4,__temp2 WC
34: rcr __temp4,#1 WC
35: rcr __temp1,#1 WC
36: djnz __temp5,#__L0001
37: test __temp3,#1 WZ
38: IF_NZ neg __temp4,__temp4
39: IF_NZ neg __temp1,__temp1 WZ
40: IF_NZ sub __temp4,#1
41: mov num3,__temp1
42: mov __temp1,num2 ' num2 = num2 * num2
43: mov __temp2,num2
44: abs __temp1,__temp1 WC
45: muxc __temp3,#1
46: abs __temp2,__temp2 WC, WZ
47: IF_C xor __temp3,#1
48: mov __temp4,#0
49: mov __temp5,#32
50: shr __temp1,#1 WC
51: __L0002
52: IF_C add __temp4,__temp2 WC
53: rcr __temp4,#1 WC
54: rcr __temp1,#1 WC
55: djnz __temp5,#__L0002
56: test __temp3,#1 WZ
57: IF_NZ neg __temp4,__temp4
58: IF_NZ neg __temp1,__temp1 WZ
59: IF_NZ sub __temp4,#1
60: mov num2,__temp1
61: mov __temp1,num1 ' num1 = num1 * num1
62: mov __temp2,num1
63: abs __temp1,__temp1 WC
64: muxc __temp3,#1
65: abs __temp2,__temp2 WC, WZ
66: IF_C xor __temp3,#1
67: mov __temp4,#0
68: mov __temp5,#32
69: shr __temp1,#1 WC
70: __L0003
71: IF_C add __temp4,__temp2 WC
72: rcr __temp4,#1 WC
73: rcr __temp1,#1 WC
74: djnz __temp5,#__L0003
75: test __temp3,#1 WZ
76: IF_NZ neg __temp4,__temp4
77: IF_NZ neg __temp1,__temp1 WZ
78: IF_NZ sub __temp4,#1
79: mov num1,__temp1
80: mov result0,num1 ' result0 = num1 + num2
81: adds result0,num2
82: ' result0 = result0 + num3
83: adds result0,num3
84: mov __temp1,result0 ' result0 = result0 / num3
85: mov __temp2,num3
86: mov __temp4,#0
87: abs __temp1,__temp1 WC
88: muxc __temp5,#1
89: abs __temp2,__temp2 WC, WZ
90: IF_Z mov __temp1,#0
91: IF_Z jmp #__L0004
92: IF_C xor __temp5,#1
93: mov __temp3,#0
94: min __temp2,#1
95: __L0005
96: add __temp3,#1
97: shl __temp2,#1 WC
98: IF_NC jmp #__L0005
99: rcr __temp2,#1
100: __L0006
101: cmpsub __temp1,__temp2 WC
102: rcl __temp4,#1
103: shr __temp2,#1
104: djnz __temp3,#__L0006
105: test __temp5,#1 WZ
106: IF_NZ neg __temp4,__temp4
107: IF_NZ neg __temp1,__temp1
108: __L0004
109: mov result0,__temp4
110: jmp #__DO_1 ' LOOP
111: __LOOP_1
112: mov __temp1,#0 'END
113: waitpne __temp1,__temp1
114: '**********************************************************************
115: __InitDirA LONG 000000_00000000_00000000_00000000
116: __InitOutA LONG 000000_00000000_00000000_00000000
117: _FREQ LONG 80000000
118: __remainder
119: __temp1 RES 1
120: __temp2 RES 1
121: __temp3 RES 1
122: __temp4 RES 1
123: __temp5 RES 1
124: __param1 RES 1
125: __param2 RES 1
126: __param3 RES 1
127: __param4 RES 1
128: __paramcnt RES 1
129: num1 RES 1
130: num2 RES 1
131: num3 RES 1
132: result0 RES 1
133: FIT 492
134: CON
135: LSBFIRST = 0
136: MSBFIRST = 1
137: MSBPRE = 0
138: LSBPRE = 1
139: MSBPOST = 2
140: LSBPOST = 3
141: DAT
142: __DATASTART
1. Every source code line is in the .spin file as a comment, which is very helpful.I think the compiler did a good job of being faithful to the original code, but I noticed some things:
2. The multiplication and division is done inline, so each additional multiplication consumes 18 longs. It does share temporary variables however.
3. All variables are stored in cog RAM, and user defined variables use the user defined name.
4. The compiler added the remnants of some serial communication code: three longs at "__RAM" and a constants block.
5. The code is nicely formatted straight from the compiler (although it uses spaces instead of tabs).
Propeller GCC
I used the most recent (and only) version posted in the GCC downloads page (v0_2_3 from 2012-02-08). The source program I used was the same as from the PropBasic, except modified a bit for C.
1: #if defined(__propeller__)
2: #include <propeller.h>
3: #define int32_t int
4: #define int16_t short int
5: #else
6: #endif
7: int main()
8: {
9: for(;;){
10: volatile int num1, num2, num3, result0;
11: num3 = num3 * num3;
12: num2 = num2 * num2;
13: num1 = num1 * num1;
14: result0 = num1 + num2;
15: result0 = result0 + num3;
16: result0 = result0 / num3;
17: }
18: }
I based it off the fft_bench.c demo, which is why it has the various preprocessor statements at the begining. Note the use of the keyword "volatile" for the int declaration: wihtout it the compiler simply optimized away everything into a simple jump loop.
Anyway, I used the following command to generate the code:
propeller-elf-gcc -Os -S -mcog -mspin test.c
The options do the following:
-0s: optimize code for minimum size
-S: output source code as a file
-mcog: use the cog memory model (put everything in a single cog)
-mspin: generate the resulting spin file
There is also the -mfcache option, but in this case it did not generate code any differently.
And, when run it generated the following spin code:
1: '' spin code automatically generated by gcc
2: CON
3: _clkmode = xtal1+pll16x
4: _clkfreq = 80_000_000
5: __clkfreq = 0 '' pointer to clock frequency
6: '' adjust STACKSIZE to how much stack your program needs
7: STACKSIZE = 256
8: VAR
9: long cog '' cog that was started up by start method
10: long stack[STACKSIZE]
11: '' add parameters here
12: long param
13: '' add any appropriate methods below
14: PUB start
15: stop
16: cog := cognew(@entry, @param) + 1
17: PUB stop
18: if cog
19: cogstop(cog~ - 1)
20: DAT
21: org
22: entry
23: r0 mov sp,PAR
24: r1 mov r0,sp
25: r2 jmp #_main
26: r3 long 0
27: r4 long 0
28: r5 long 0
29: r6 long 0
30: r7 long 0
31: r8 long 0
32: r9 long 0
33: r10 long 0
34: r11 long 0
35: r12 long 0
36: r13 long 0
37: r14 long 0
38: lr long 0
39: sp long 0
40: '.text
41: long
42: 'global variable _main
43: _main
44: sub sp, #16
45: L_L2
46: mov r5, #4
47: add r5, sp
48: mov r6, #8
49: add r6, sp
50: mov r7, #12
51: add r7, sp
52: rdlong r0, r5
53: rdlong r1, r5
54: call #__MULSI
55: wrlong r0, r5
56: rdlong r0, r6
57: rdlong r1, r6
58: call #__MULSI
59: wrlong r0, r6
60: mov r6, #8
61: add r6, sp
62: rdlong r0, r7
63: rdlong r1, r7
64: call #__MULSI
65: wrlong r0, r7
66: mov r7, #12
67: add r7, sp
68: rdlong r7, r7
69: rdlong r6, r6
70: add r7, r6
71: wrlong r7, sp
72: rdlong r7, sp
73: rdlong r6, r5
74: add r7, r6
75: wrlong r7, sp
76: rdlong r0, sp
77: rdlong r1, r5
78: call #__DIVSI
79: wrlong r0, sp
80: jmp #L_L2
81: __MASK_0000FFFF long $0000FFFF
82: __TMP0 long 0
83: __MULSI
84: mov __TMP0,r0
85: min __TMP0,r1
86: max r1,r0
87: mov r0,#0
88: __MULSI_loop
89: shr r1,#1 wz,wc
90: IF_C add r0,__TMP0
91: add __TMP0,__TMP0
92: IF_NZ jmp #__MULSI_loop
93: __MULSI_ret ret
94: __MASK_00FF00FF long $00FF00FF
95: __MASK_0F0F0F0F long $0F0F0F0F
96: __MASK_33333333 long $33333333
97: __MASK_55555555 long $55555555
98: __CLZSI rev r0,#0
99: __CTZSI neg __TMP0,r0
100: and __TMP0,r0 wz
101: mov r0,#0
102: IF_Z mov r0,#1
103: test __TMP0, __MASK_0000FFFF wz
104: IF_Z add r0,#16
105: test __TMP0, __MASK_00FF00FF wz
106: IF_Z add r0,#8
107: test __TMP0, __MASK_0F0F0F0F wz
108: IF_Z add r0,#4
109: test __TMP0, __MASK_33333333 wz
110: IF_Z add r0,#2
111: test __TMP0, __MASK_55555555 wz
112: IF_Z add r0,#1
113: __CLZSI_ret ret
114: __DIVR long 0
115: __DIVCNT long 0
116: __UDIVSI
117: mov __DIVR,r0
118: call #__CLZSI
119: neg __DIVCNT,r0
120: mov r0,r1
121: call #__CLZSI
122: add __DIVCNT,r0
123: mov r0,#0
124: cmps __DIVCNT,#0 wz,wc
125: IF_C jmp #__UDIVSI_done
126: shl r1,__DIVCNT
127: add __DIVCNT,#1
128: __UDIVSI_loop
129: cmpsub __DIVR,r1 wz,wc
130: addx r0,r0
131: shr r1,#1
132: djnz __DIVCNT,#__UDIVSI_loop
133: __UDIVSI_done
134: mov r1,__DIVR
135: __UDIVSI_ret ret
136: __DIVSGN long 0
137: __DIVSI mov __DIVSGN,r0
138: xor __DIVSGN,r1
139: abs r0,r0 wc
140: muxc __DIVSGN,#1 wc
141: abs r1,r1
142: call #__UDIVSI
143: cmps __DIVSGN,#0 wz,wc
144: IF_B neg r0,r0
145: test __DIVSGN,#1 wz
146: IF_NZ neg r1,r1
147: __DIVSI_ret ret
Some things that I have noticed about this code:
1. The output lacks suitable comments, and the resultant code is rather difficult to understand. It doesn't use original variable names.
2. It creates a multiplication subroutine. This is slightly less efficient in execution time than putting it inline, but it is vastly more efficient on space.
3. The code stores variables in the hub, not the cog as expected.
4. The -Os option appears to be needed: with no optimization the output code is 192 lines. Interstingly, -O2 gives the same output as -0s.
5. The multiply loop ("__MULSI") is very compact (9 longs). It looks like it is O(1). It is also only 4 lines, so at most it will take 32*4 cycles to complete. I'm not sure how it works yet though (especially with a sign).
6. The divide routine is a bit more expensive: 51 longs. To support it though, the loop ("__UDIVSI") is as efficient as the multiply loop.
7. GCC isn't very efficient in memory management from the default: it creates a 256 long hub stack and a 16 long cog stack frame. This could probably be cleaned up manually.
8. It's missing a "FIT" statement at the end.
9. The generated code isn't very well formatted.
Next, I tried a slightly modified source:
1: #if defined(__propeller__)
2: #include <propeller.h>
3: #define int32_t int
4: #define int16_t short int
5: #else
6: #endif
7: int main()
8: {
9: for(;;){
10: int num1, num2, num3;
11: volatile int result0;
12: num3 = num3 * num3;
13: num2 = num2 * num2;
14: num1 = num1 * num1;
15: result0 = num1 + num2;
16: result0 = result0 + num3;
17: result0 = result0 / num3;
18: }
19: }
Note here that the only variable marked volatile is result0. I compiled with
propeller-elf-gcc -Os -S -mcog -mspin test.c
And got the following output:
1: '' spin code automatically generated by gcc
2: CON
3: _clkmode = xtal1+pll16x
4: _clkfreq = 80_000_000
5: __clkfreq = 0 '' pointer to clock frequency
6: '' adjust STACKSIZE to how much stack your program needs
7: STACKSIZE = 256
8: VAR
9: long cog '' cog that was started up by start method
10: long stack[STACKSIZE]
11: '' add parameters here
12: long param
13: '' add any appropriate methods below
14: PUB start
15: stop
16: cog := cognew(@entry, @param) + 1
17: PUB stop
18: if cog
19: cogstop(cog~ - 1)
20: DAT
21: org
22: entry
23: r0 mov sp,PAR
24: r1 mov r0,sp
25: r2 jmp #_main
26: r3 long 0
27: r4 long 0
28: r5 long 0
29: r6 long 0
30: r7 long 0
31: r8 long 0
32: r9 long 0
33: r10 long 0
34: r11 long 0
35: r12 long 0
36: r13 long 0
37: r14 long 0
38: lr long 0
39: sp long 0
40: '.text
41: long
42: 'global variable _main
43: _main
44: sub sp, #4
45: L_L2
46: mov r1, r7
47: mov r0, r7
48: call #__MULSI
49: mov r7, r0
50: mov r1, r4
51: mov r0, r4
52: call #__MULSI
53: mov r1, r5
54: mov r4, r0
55: mov r0, r5
56: call #__MULSI
57: mov r6, r0
58: add r6, r4
59: mov r5, r0
60: mov r1, r7
61: wrlong r6, sp
62: rdlong r6, sp
63: add r6, r7
64: wrlong r6, sp
65: rdlong r0, sp
66: call #__DIVSI
67: wrlong r0, sp
68: jmp #L_L2
69: __MASK_0000FFFF long $0000FFFF
70: __TMP0 long 0
71: __MULSI
72: mov __TMP0,r0
73: min __TMP0,r1
74: max r1,r0
75: mov r0,#0
76: __MULSI_loop
77: shr r1,#1 wz,wc
78: IF_C add r0,__TMP0
79: add __TMP0,__TMP0
80: IF_NZ jmp #__MULSI_loop
81: __MULSI_ret ret
82: __MASK_00FF00FF long $00FF00FF
83: __MASK_0F0F0F0F long $0F0F0F0F
84: __MASK_33333333 long $33333333
85: __MASK_55555555 long $55555555
86: __CLZSI rev r0,#0
87: __CTZSI neg __TMP0,r0
88: and __TMP0,r0 wz
89: mov r0,#0
90: IF_Z mov r0,#1
91: test __TMP0, __MASK_0000FFFF wz
92: IF_Z add r0,#16
93: test __TMP0, __MASK_00FF00FF wz
94: IF_Z add r0,#8
95: test __TMP0, __MASK_0F0F0F0F wz
96: IF_Z add r0,#4
97: test __TMP0, __MASK_33333333 wz
98: IF_Z add r0,#2
99: test __TMP0, __MASK_55555555 wz
100: IF_Z add r0,#1
101: __CLZSI_ret ret
102: __DIVR long 0
103: __DIVCNT long 0
104: __UDIVSI
105: mov __DIVR,r0
106: call #__CLZSI
107: neg __DIVCNT,r0
108: mov r0,r1
109: call #__CLZSI
110: add __DIVCNT,r0
111: mov r0,#0
112: cmps __DIVCNT,#0 wz,wc
113: IF_C jmp #__UDIVSI_done
114: shl r1,__DIVCNT
115: add __DIVCNT,#1
116: __UDIVSI_loop
117: cmpsub __DIVR,r1 wz,wc
118: addx r0,r0
119: shr r1,#1
120: djnz __DIVCNT,#__UDIVSI_loop
121: __UDIVSI_done
122: mov r1,__DIVR
123: __UDIVSI_ret ret
124: __DIVSGN long 0
125: __DIVSI mov __DIVSGN,r0
126: xor __DIVSGN,r1
127: abs r0,r0 wc
128: muxc __DIVSGN,#1 wc
129: abs r1,r1
130: call #__UDIVSI
131: cmps __DIVSGN,#0 wz,wc
132: IF_B neg r0,r0
133: test __DIVSGN,#1 wz
134: IF_NZ neg r1,r1
135: __DIVSI_ret ret
This is much better: the output no longer has a bunch of RDLONG and WRLONGs, and is hench much more efficient. Previously, the main loop was 34 lines (many of which are hub access), and now it is 22 lines. The other comments still apply though. Also as before, -mfcache did not change the output code.
Conclusion
I think I will look into Propeller GCC more. It seems to do a good job for compiling down to efficient Propeller assembly, and it isn't too hard to read the output. I hope that it will be improved over time as well. The PropBasic compiler has a more understandable output, but the inefficient use of cog RAM and the lack of updates (no changes in 8 months) has me worried. Propeller GCC seems to fit my requirements.