heap corruption inconsistency

Posted: Thu Jul 19, 2018 4:59 am
by 0xffff
Hi,

I'm chasing a heap corruption issue, so I enabled comprehensive heap poisoning and built with gdbstub to break when it occurs. When it happened, I saw:

CORRUPT HEAP: Invalid data at 0x3ffdc0b8. Expected 0xfefefefe got 0xfefefeff
CORRUPT HEAP: Invalid data at 0x3ffdc190. Expected 0xfefefefe got 0xfefefeff
assertion "verify_fill_pattern(data, size, true, true, true)" failed: file "/dev/p/Firmware/esp-idf/components/heap/./multi_heap_poisoning.c", line 183, function: multi_heap_malloc
abort() was called at PC 0x400dfacb on core 0


However, in gdb I see:

(gdb) x/12x 0x3ffdc0b0
0x3ffdc0b0:	0xcececece	0xcececece	0xcececece	0xcececece
0x3ffdc0c0:	0xcececece	0xcececece	0xcececece	0xcececece
0x3ffdc0d0:	0xcececece	0xcececece	0xcececece	0xcececece

This seems inconsistent with the error message. What am I missing?

Re: heap corruption inconsistency

Posted: Thu Jul 19, 2018 5:59 am
by WiFive

Re: heap corruption inconsistency

Posted: Thu Jul 19, 2018 6:09 am
by ESP_Angus
Hi 0xffff,

This had me scratching my head for a minute as well!

The reason is that verify_fill_pattern() swaps each word from 0xfefefefe (free memory) to 0xcececece (allocated memory) as it sweeps through the memory region during allocation. It checks and swaps in a single pass for performance. Even if it finds an invalid word, it finishes the sweep before aborting, the idea being to report all of the invalid bytes in the region. By the time the assertion fires, the invalid words have already been overwritten with 0xcececece.
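To make the single-pass behaviour concrete, here is a minimal sketch. It is not the ESP-IDF source: sketch_verify_and_swap() and the two macro names are hypothetical stand-ins, and the real function takes more parameters. It only assumes the two fill words quoted in this thread.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define FREE_FILL_WORD   UINT32_C(0xfefefefe) /* pattern expected in free memory */
#define MALLOC_FILL_WORD UINT32_C(0xcececece) /* pattern left behind on allocation */

/* Hypothetical, simplified stand-in for verify_fill_pattern().
 * Each word is checked and then swapped in the same pass, and the
 * sweep keeps going after a mismatch so every bad word is reported. */
static bool sketch_verify_and_swap(uint32_t *words, size_t count)
{
    assert(words != NULL);
    bool valid = true;
    for (size_t i = 0; i < count; i++) {
        if (words[i] != FREE_FILL_WORD) {
            printf("Invalid data at %p. Expected 0x%08" PRIx32 " got 0x%08" PRIx32 "\n",
                   (void *)&words[i], FREE_FILL_WORD, words[i]);
            valid = false;
        }
        /* Overwritten even when the word was invalid, which is why a
         * later memory dump no longer shows the corrupted value. */
        words[i] = MALLOC_FILL_WORD;
    }
    return valid;
}
```

Calling this on a buffer that contains one 0xfefefeff word prints one error line and returns false, yet every word in the buffer reads 0xcececece afterwards, matching the gdb dump in the first post.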

Clearly this is a bit confusing when you go to do a post-mortem in gdb.

I'm going to benchmark the performance of checking all words before swapping them (I suspect it's OK on regular RAM but may have issues on PSRAM). If we can do this, I'll change the function (premature optimisation is the root of all evil, etc, etc).

If the performance impact is too high, we can at least stop swapping patterns from the invalid word onwards. You can make this change yourself by putting "swap_pattern = false;" underneath "valid = false;" at multi_heap_poisoning.c:149.
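As a rough illustration of that one-line change, here is a simplified sketch. sketch_verify_keep_evidence() and the macros are hypothetical names, and in this sketch swap_pattern is a local variable rather than a parameter as in the real function; only the "valid" and "swap_pattern" names come from the description above.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define FREE_FILL_WORD   UINT32_C(0xfefefefe) /* pattern expected in free memory */
#define MALLOC_FILL_WORD UINT32_C(0xcececece) /* pattern left behind on allocation */

/* Hypothetical, simplified variant: once an invalid word is seen,
 * stop swapping so the evidence survives for post-mortem debugging.
 * The sweep still continues, so later invalid words are reported too. */
static bool sketch_verify_keep_evidence(uint32_t *words, size_t count)
{
    assert(words != NULL);
    bool valid = true;
    bool swap_pattern = true;
    for (size_t i = 0; i < count; i++) {
        if (words[i] != FREE_FILL_WORD) {
            printf("Invalid data at %p. Expected 0x%08" PRIx32 " got 0x%08" PRIx32 "\n",
                   (void *)&words[i], FREE_FILL_WORD, words[i]);
            valid = false;
            swap_pattern = false; /* the suggested addition */
        }
        if (swap_pattern) {
            words[i] = MALLOC_FILL_WORD;
        }
    }
    return valid;
}
```

With this variant, words before the first invalid one still read 0xcececece, while the invalid word and everything after it keep their original contents, so a gdb dump at the abort shows the corrupted value directly.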

(BTW You can assume that every word in the buffer which doesn't trigger an error message was 0xfefefefe before it was 0xcececece, and the others were the values shown in the error message. 0xfefefeff usually means something has done "var++" on a freed address.)
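The "var++" case can be reproduced without any real use-after-free. The sketch below is only an illustration: sketch_poison_then_increment() and FREE_FILL_BYTE are hypothetical names, and it assumes freed memory is filled byte-wise with 0xfe, which is what the 0xfefefefe word implies.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FREE_FILL_BYTE 0xfe /* assumed byte behind the 0xfefefefe free pattern */

/* Hypothetical helper: poison one word as if it had just been freed,
 * then increment it the way stale code doing "var++" would. */
static uint32_t sketch_poison_then_increment(uint32_t *word)
{
    assert(word != NULL);
    memset(word, FREE_FILL_BYTE, sizeof *word); /* word now reads 0xfefefefe */
    (*word)++;                                  /* stale "var++" on the freed word */
    return *word;
}
```

The result is 0xfefefeff, the value in both error lines of the first post. A value such as 0xfefefefd would similarly hint at a "var--" on freed memory.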