Poor C code generation
I'm grateful for the efforts of everyone involved. I know that I couldn't do any of it myself. And now, my rant.
I've been using the toolchains from this site for the last 8 months, so 4_7-2013q1 through 4_8-2013q4. I'm seeing some minor variations in the quality of the generated machine code, but no strides toward what I would call an okay result. Below I've included one example as fodder for discussion.
Am I missing something, like a command line option (see example below)?
If not, then why is this not progressing? I can think of plenty of plausible reasons: limited effort being sunk into things like C++11 feature completeness, no one noticing anymore what is happening at the lowest level, or everyone thinking the generated code is just fine the way it is (so shut up). But I'd like to hear from non-speculative sources.
Here is an example that I would argue is so fundamental that any C compiler should be able to produce the 'best' possible assembly (in this -Os case, meaning 'smallest'):
crap0.c:
#include <stdint.h>
extern uint32_t __bss_start[];
extern uint32_t __data_start[];
void Reset_Handler(void)
{
/* Clear .bss section (initialize with zeros) */
for (uint32_t* bss_ptr = __bss_start; bss_ptr != __data_start; ++bss_ptr) {
*bss_ptr = 0;
}
}
>[...]\
The latest version (4.8.3) yields the following:
0800018c <Reset_Handler>:
800018c: 4a06 ldr r2, [pc, #24] ; (80001a8 <Reset_
800018e: 4907 ldr r1, [pc, #28] ; (80001ac <Reset_
8000190: 1a89 subs r1, r1, r2
8000192: f021 0103 bic.w r1, r1, #3
8000196: 2300 movs r3, #0
8000198: 428b cmp r3, r1
800019a: d003 beq.n 80001a4 <Reset_
800019c: 2000 movs r0, #0
800019e: 50d0 str r0, [r2, r3]
80001a0: 3304 adds r3, #4
80001a2: e7f9 b.n 8000198 <Reset_Handler+0xc>
80001a4: 4770 bx lr
80001a6: bf00 nop
80001a8: 20000000 .word 0x20000000
80001ac: 20000030 .word 0x20000030
80001b0: [...]
First off, the compiler now (vs 4.7.x) includes an additional wide instruction (bic.w) because, why, it no longer trusts that 32-bit ints are 4-byte aligned? Yuck! Next, I see that the compiler now recognizes that it can work with an integer index, but then it squanders the opportunities opened up in so doing -- instead of counting toward zero it counts away from it, necessitating both an additional register and a compare instruction in the inner loop. Finally, a fourth register is utilized for the zero being stored so as to prevent a reload of the array start address at each iteration, but then that load of zero is left within the loop!
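For reference, `bic.w r1, r1, #3` clears the two low bits of the byte distance computed by the preceding `subs`. The sketch below, meant for a host machine rather than the target, shows that this masking cannot change the distance between two `uint32_t` elements of the same array, which is why the instruction looks redundant here. The helper names `byte_distance` and `clear_low_two_bits` are hypothetical and exist only for illustration; this says nothing about why the compiler emits the instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Byte distance between two word pointers, computed on integer
   addresses the way the subs instruction does in the listing. */
static uintptr_t byte_distance(const uint32_t *start, const uint32_t *end)
{
    assert(start <= end);  /* both must point into the same array */
    return (uintptr_t)end - (uintptr_t)start;
}

/* C equivalent of "bic rX, rX, #3": clear bits 0 and 1. */
static uintptr_t clear_low_two_bits(uintptr_t value)
{
    return value & ~(uintptr_t)3;
}
```

With a 12-word array standing in for the 0x30-byte region in the listing, `byte_distance(buf, buf + 12)` is 48, and passing that through `clear_low_two_bits` leaves it unchanged.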
Unless the index is needed somewhere, the result should have this form (note the inner loop is 50% shorter):
0800018c <Reset_Handler>:
800018c: ldr r1, [pc, #16] ; (80001a0)
800018e: ldr r2, [pc, #20] ; (80001a4)
8000190: subs r1, r1, r2
8000192: beq.n 800019c
8000194: movs r0, #0
8000196: str r0, [r2, r1]
8000198: adds r1, #4
800019a: bne.n 8000196
800019c: bx lr
800019e: nop
80001a0: 20000000 .word 0x20000000
80001a4: 20000030 .word 0x20000030
80001a8: [...]
In my experience, it's a fool's errand trying to write in C what one hopes that the compiler will generate -- the frontend optimizations shred any optimizing ideas. But in this rare case the result is close:
crap1.c:
#include <stdint.h>
extern uint32_t __bss_start[];
extern uint32_t __data_start[];
void Reset_Handler(void)
{
/* Clear .bss section (initialize with zeros) */
uint8_t* bss_end = (uint8_
uint32_t index = (uint8_t const*)__bss_start - bss_end;
if (index != 0)
{
uint32_t zero = 0;
do {
// BADNESS: cast changes alignment requirements
*(uint32_
index += 4;
} while (index != 0);
}
}
0800018c <Reset_Handler>:
800018c: 4b04 ldr r3, [pc, #16] ; (80001a0 <Reset_
800018e: 4a05 ldr r2, [pc, #20] ; (80001a4 <Reset_
8000190: 1a98 subs r0, r3, r2
8000192: d003 beq.n 800019c <Reset_
8000194: 2100 movs r1, #0
8000196: 5011 str r1, [r2, r0]
8000198: 3004 adds r0, #4
800019a: e7fa b.n 8000192 <Reset_Handler+0x6>
800019c: 4770 bx lr
800019e: bf00 nop
80001a0: 20000000 .word 0x20000000
80001a4: 20000030 .word 0x20000030
80001a8: [...]
Nevertheless, notice how the zero loading was pushed into the loop and how the loop was enlarged even further to stupidly share the conditional branch instruction at 0x8000192, all for a net anti-optimization of the original C code.
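To experiment with the two loop shapes away from the target, the sketch below restates them as ordinary functions that can be built on a host machine. It is a variant, not crap1.c itself: the bounds are passed as parameters instead of coming from the linker symbols, and the index counts words through a `ptrdiff_t` rather than bytes through a cast, which sidesteps the alignment issue flagged in the comment. The function names are hypothetical, and nothing here promises any particular code generation from any GCC version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pointer-walking form, as in crap0.c, with the bounds passed in. */
static void clear_words_ptr(uint32_t *start, uint32_t *end)
{
    assert(start <= end);  /* same array, start not past end */
    for (uint32_t *p = start; p != end; ++p) {
        *p = 0;
    }
}

/* Count-toward-zero form: a negative word index relative to the end
   climbs to zero, so the loop test is simply "index != 0". */
static void clear_words_neg_index(uint32_t *start, uint32_t *end)
{
    assert(start <= end);
    ptrdiff_t index = start - end;  /* zero or negative word count */
    while (index != 0) {
        end[index] = 0;             /* addresses start .. end-1 in order */
        ++index;
    }
}
```

Both functions clear exactly the words in `[start, end)` and do nothing when `start == end`, so they can be swapped freely while comparing the output of `objdump -d` for each.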
Does anyone working on gcc care about this kind of stuff? If not, no sweat, I'll not bother anymore.
-Gary
Question information
- Language: English
- Status: Solved
- Assignee: No assignee
- Solved by: Gary Fuehrer