Part One: Egress
While Postgres is an incredibly robust and defensively-written system, occasionally one does run into problems with data corruption. This can happen due to hardware issues, filesystem or other OS bugs, and sometimes, yes, even Postgres bugs. However, even when data corruption does occur, Postgres will generally contain the damage to specific chunks of data, and manual intervention will let you recover anything that was not affected.
This is the story of lessons learned from treating an extensive case of such corruption in the course of my job with the Heroku Postgres Department of Data. This post describes recognizing the problems and getting the data out of a corrupt system; the next details restoring it.
All data and metadata is mocked out; no customer data was used in the making of this post.
Josh Williams has a fantastic post about treating corruption; Josh’s post and tips from my colleagues (and Postgres contributors) Daniel Farina and Peter Geoghegan were instrumental in helping me wrap my head around all the nuts and bolts relevant here.
Now before we dig in, note that the easiest recovery is one you don’t need to perform at all. So how can you get a pristine copy of your database? Try your standby, if you have one: there’s a chance that the corruption only affects the primary, because it never made it into the WAL stream and was not present when the replica’s base backup was taken. Don’t have a standby, but have a base backup and the required WAL segments? Try making a fresh replica (this can take some time when the base backup is old, but it’s better than the alternative). You can try a fresh replica in parallel with other recovery efforts. Validate these with pg_dump, as discussed below, and then run a dump and restore (again, see below) to ensure that the corruption is quarantined. Seriously: don’t skip that validation and delude yourself into thinking it’s going to be okay.
This is by far the easiest and safest mechanism for treating corruption. It’s not a panacea, but we’ve found it to be unreasonably effective.
In my case, the existing standby and fresh standbys all failed and extensive manual recovery was required, but at least it makes for an interesting story.
If you do need to treat corruption manually, there are a few things you should keep in mind:
- Avoid writes: Postgres is in a precarious state; don’t tempt fate by asking it to do more than necessary. In particular, this is a good point to turn off autovacuum.
- Stop Postgres and make a filesystem copy of the full data directory, ideally to another disk if you suspect the hardware. The recovery process will involve clobbering data; you want to be able to start over or cross-reference with the original state of the system if you run into problems.
- Your final step in a corruption recovery should almost always be a dump and restore via pg_dump. This is the only way to ensure that you’re actually running on a sane system. This can take many hours for a large database, but if you skip this you might be back at square one before too long.
- The zero_damaged_pages setting may be an easier way to recover. It doesn’t always work, but it basically does what is outlined here for you (see the snippet just after this list).
- This ain’t Kansas anymore: this is effectively Byzantine failure territory. The advice here might work or everything might go horribly wrong. There are no fool-proof step-by-step instructions. You’re on your own. Furthermore, what’s outlined here is a general-purpose, fairly coarse approach to recovery. If you need to take a more precise approach or just need someone to blame, consider hiring a Postgres consultancy.
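For reference, here is roughly what those two knobs look like in practice (zero_damaged_pages is superuser-only and permanently destroys the contents of any damaged page it reads, so treat it as a last resort):

```
-- disable autovacuum for the duration of the recovery:
-- set autovacuum = off in postgresql.conf, then reload the configuration
SELECT pg_reload_conf();

-- let reads of damaged pages return zeroed pages instead of erroring out
-- (per-session, superuser only)
SET zero_damaged_pages = on;
```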
A bad sign
In this particular case, the problem was almost certainly caused by a filesystem or hardware issue. Postgres keeps track of the state of your database via three main sets of data: the heap, where all the physical data resides; the c(ommit)log, which tracks which transactions have been committed and which haven’t; and the write-ahead log (WAL), which does all the bookkeeping to ensure crash-safety and is a key mechanism in Postgres replication.
After initially noticing the corruption, I poked at the affected files with hexdump. A healthy clog file is an opaque bitmap of transaction status bits, but in the ASCII view of the hexdump, the last few lines of one of the clog files turned out to be human-readable text. If you see anything human-readable in the clog bitmaps, you should either start buying lottery tickets, or something very bad has happened.
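If you want to poke at your own clog files, the inspection is just something like this (assuming $PGDATA points at your data directory; the segment name is a placeholder):

```
$ hexdump -C $PGDATA/pg_clog/0998 | less
```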
I originally checked the file after seeing an unexpected error message that led me to file a Postgres bug. (The Postgres community, by the way, takes data integrity issues very seriously and addresses them quickly, so don’t hesitate to file a bug if you think you’re running into a legitimate Postgres problem.)
Trouble with transactions
The above happened on a replica, and checking the master revealed a different—but just as serious—problem in the logs:
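The error was of this general shape (the exact wording and transaction id here are illustrative):

```
ERROR:  could not access status of transaction 1983176713
DETAIL:  Could not open file "pg_clog/09BF": No such file or directory.
```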
clog files are bitmaps that live in the
pg_clog subdirectory of
your Postgres data directory and track which transactions have been
committed and which have aborted.
You can take a look in the subdirectory and see the various
files. If you don’t know where your data directory is, you can run the
following Postgres command (this must be run as a Postgres superuser,
postgres in most installations):
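Something like this does the trick:

```
-- must be run as a superuser on older releases
SHOW data_directory;
```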
In that directory, you’ll see a number of files:
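The listing is just the segment files, something like this (segment names illustrative):

```
$ ls $PGDATA/pg_clog
0000  0001  0002  0003  ...  0996  0997  0998
```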
These are sequential, and end way before the
09BF file that the
error message complained about, so that’s not a great sign. So what
can we do? A reasonable assumption might be that everything
referencing transactions in that suspect
clog file committed
(actually, that’s a terrible assumption, but it’s easy to fake
transaction metadata like that, and that’s the only way to see the
effects of these alleged transactions, so let’s roll with it). In
order to gin up a hunky-dory
clog file, where everything committed, we
can use the trusty
dd tool. If you’re not familiar with dd, you
can think of it as a bit scalpel for file surgery:
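The invocation was something along these lines (the path is a placeholder; a stock clog segment is 256kB, i.e. 32 pages of 8192 bytes):

```
$ sudo -u postgres dd if=/dev/zero of=$PGDATA/pg_clog/09BF bs=256K count=1
```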
of is the output
clog file you need to generate (dd has a
somewhat unconventional way to pass arguments). Everything else (block
size, block count, and input file—the handy
/dev/zero special file)
can stay the same. Note that you’ll probably need to run with sudo as
whatever user owns the other
clog files (typically postgres).
So does this fix our problem? One way to check is to re-run the query
that triggered the error in the first place. A more thorough check is
to just run pg_dump.
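Something like this exercises every live row without keeping the output (database name is a placeholder):

```
$ pg_dump mydb > /dev/null
```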
This works because generally, corruption manifests as either
references to bogus transactions (which can be addressed as above), or
some sort of failure when attempting to read a specific set of data
from disk. Postgres organizes data on disk in pages, normally 8192
bytes each. When things get corrupted, problems tend to be
page-aligned—either a page will be fine, or it’s wholly
suspect. Failure to read a page indicates that either the Postgres
page headers are corrupt, or that the data is mangled in such a way as
to cause errors in the per-datatype
recv deserialization functions.
Occasionally, a page will be corrupt in such a way as to crash Postgres when one attempts to read it. This is problematic for recovery, but not a showstopper—if we can figure out what pages are causing crashes, we can surgically excise them.
In either case, pg_dump attempts to read every single live page of data, so if you make it through that, you’re on your way to a functioning system. Note, however, that this does not validate the integrity of your indexes: another good reason to always go through a dump and restore.
Note also that the most pernicious aspects of corruption occur when none of the above trigger: when bits get shuffled around, but in such a way as to still represent valid data (just not the data you put there). Unfortunately, prior to Postgres 9.3’s checksums feature, there’s not much you can do about this. Fortunately, this is also the least likely problem, as corruption tends to cause chaos and very few datatypes are resilient to bit scrambling (numeric types being a notable exception).
Pages behaving badly
In this case, the pg_dump still failed with the same error a few more
times. After applying the same remedy a few times to different
clog files, I eventually got to a different error:
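It was the classic over-sized allocation complaint, something like this (the exact number is illustrative):

```
ERROR:  invalid memory alloc request size 2666329600
```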
While Postgres supports pretty big rows and fields, its internal memory management is limited to working with chunks up to 1GB. This request for 2.5GB seems shady. It’s likely the metadata about the size of the value stored on this page is corrupted.
Without significantly more digging, we have to give up on this data as lost and nuke the page. There are a few other errors you may see that likely point to corruption:
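A few representative examples (block numbers, file names, and toast values illustrative):

```
ERROR:  invalid page header in block 2145 of relation base/16384/16411

ERROR:  could not read block 4212 in file "base/16384/16411": read only 0 of 8192 bytes

ERROR:  missing chunk number 0 for toast value 58321 in pg_toast_16411
```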
So how do we track down the offending page? Fortunately, all rows in
Postgres have a hidden column called
ctid that is effectively the
“physical” row id. If you have the
ctid, you can figure out where
the page lives on disk. You can project the
ctid just like any other column:
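For example (table name and data illustrative):

```
=> SELECT ctid, id FROM users LIMIT 3;
 ctid  | id
-------+----
 (0,1) |  1
 (0,2) |  2
 (0,3) |  3
```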
The ctid values you see are (page, offset) pairs. Chances are,
projecting just the
ctid from an affected table is not going to
trigger the memory allocation error we saw above (because Postgres
doesn’t have to read any of the affected data).
With that, you can go row by row, projecting every value to see if it
causes problems. Doing this manually can take forever on a good-sized
table, so I wrote a PL/pgSQL function to do it (the function from Josh’s post could also have helped, but I didn’t think of that
at the time):
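A minimal sketch of the idea (not the exact function I used; names are placeholders, and format() needs Postgres 9.1 or later):

```
-- walk every ctid in a table and try to read each column of each row
-- individually, reporting any that fail
CREATE OR REPLACE FUNCTION check_table(target regclass)
RETURNS void AS $$
DECLARE
  row_ctid tid;
  col record;
  scratch text;
BEGIN
  FOR row_ctid IN EXECUTE format('SELECT ctid FROM %s', target) LOOP
    FOR col IN
      SELECT attname FROM pg_attribute
      WHERE attrelid = target AND attnum > 0 AND NOT attisdropped
    LOOP
      BEGIN
        -- casting to text forces the value to actually be read and detoasted
        EXECUTE format('SELECT %I::text FROM %s WHERE ctid = $1',
                       col.attname, target)
          INTO scratch USING row_ctid;
      EXCEPTION WHEN OTHERS THEN
        RAISE WARNING 'error reading column % at ctid %: %',
          col.attname, row_ctid, SQLERRM;
      END;
    END LOOP;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
```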
This looks at all the
ctids in the target table, and for each one,
tries to project the values in every column for that row. The columns
are projected one at a time, so that if one exhibits a problem and we
have to purge the row, we can still check the valid ones and save
that data beforehand.
This works for small and medium-sized tables, but takes forever on
anything sizable. I tried to optimize it first by doing a binary
search for bad tuples across
ctids, and then by taking advantage of
the fact that corruption is—as mentioned above—typically
page-oriented. In the end, neither of these panned out (for reasons
I’ll get to below), but I learned some interesting PL/pgSQL. In any
case, this approach was serviceable for a while and did help me make
progress. The output looks something like this:
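Think a stream of warnings along these lines (values illustrative, in the shape of the sketch above):

```
WARNING:  error reading column body at ctid (6,1): invalid memory alloc request size 2666329600
WARNING:  error reading column body at ctid (6,2): invalid memory alloc request size 1730588238
```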
Trouble on page 6! At this point, you can try to project the other columns from the affected rows (and any potentially unaffected rows) so you can save whatever data is still recoverable.
Once we find an affected page, we need to figure out where it lives on
disk. Again, we find what we’re looking for in the data directory.
The base subdirectory contains all the “actual” data (as opposed to
metadata) in your Postgres instance, so this is where we look. The
first level looks like this:
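Something like this (oids illustrative):

```
$ ls $PGDATA/base
1  12891  12896  16384
```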
This contains all the different databases (as in CREATE
DATABASE ...) in your Postgres instance. This query can help you
figure out the right subdirectory:
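Something like this:

```
-- map the database you're connected to back to its subdirectory name
SELECT oid, datname FROM pg_database WHERE datname = current_database();
```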
(oid is another hidden column, a surrogate logical identifier for
many system tables.) In there, you’ll see files for your tables and
other aspects of your database:
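Something like this (file names illustrative):

```
$ ls $PGDATA/base/16384 | head
112
113
12622
12624
16403
16403_fsm
16403_vm
16411
16411_fsm
16411_vm
```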
To find the right table, you might assume you can use the oid again (I
certainly did), but this won’t always work. The pg_class catalog table can show you the actual file you need (adjust the namespace and table name to match yours):
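Something like this (schema and table name are examples):

```
-- relfilenode is the file name on disk
SELECT relfilenode
  FROM pg_class c
  JOIN pg_namespace n ON c.relnamespace = n.oid
 WHERE n.nspname = 'public' AND c.relname = 'users';
```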
So we have the subdirectory for the database, the file for the table,
and the page and offset from the
ctid. Now we need to do some bit
surgery, just as with the
clog files above. This is where dd’s
“addressing” feature (via its
seek parameters) comes
in really handy:
dd makes it easy to treat a file as a sequence
of contiguous chunks, and to overwrite just one of these chunks.
Let’s go to town:
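Something like this (the path is built from the database oid and relfilenode found above; both are illustrative here):

```
$ sudo -u postgres dd if=/dev/zero of=$PGDATA/base/16384/16411 \
    bs=8192 count=1 seek=6 conv=notrunc
```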
bs=8192 indicates the Postgres page size,
count=1 means we
overwrite just one page, and
seek=6 means we want to seek to the 6th
8192-byte page. Again, we use
/dev/zero as a source. The very important conv=notrunc means don’t truncate the file; just write to
the middle. Remarkably, overwriting a page like this in Postgres is enough
to wipe it out with no ill effects. That’s a fantastic property for what
is effectively a massive binary serialization format.
In any case, doing these one-by-one can get tedious (and
error-prone). If you find a lot of these up front, you can even script it:
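A minimal sketch (the relation file and page numbers are placeholders):

```
#!/bin/bash
# zero out a list of known-bad pages in one relation file
RELFILE=$PGDATA/base/16384/16411
for page in 6 21 74; do
  dd if=/dev/zero of="$RELFILE" bs=8192 count=1 seek="$page" conv=notrunc
done
```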
Be extremely careful with this, though—it’s easy to fat-finger your way to wiping out large swathes of your database.
Trouble in purgatory
After a while of making progress like this, I ran into a problem that I couldn’t solve with PL/pgSQL:
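Reading certain pages crashed the server outright. From psql, a backend crash looks something like this:

```
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
```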
Postgres is written defensively, but if its data is scrambled badly enough, it will still occasionally wig out. With luck, it will recover in a few seconds and you can continue treating the corruption, but subsequent attempts to read the affected pages are likely to cause repeat crashes. This means that any PL/pgSQL functions that rely on being able to scan a whole table can’t really cope with server crashes.
Bash (and shell scripting in general) gets a bad rap—in many cases, deservedly so—but it’s still a fantastic way to glue together other programs. It seemed like an appropriate tool here:
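A minimal sketch of the approach (simplified to check one ctid at a time rather than a page at a time; the database and table names are placeholders):

```
#!/bin/bash
# Walk every ctid in a table, read that row via a TID scan, and report failures.
# If a read takes the server down, wait for it to come back and keep going.
DB=mydb
TABLE=users

for ctid in $(psql -At -d "$DB" -c "SELECT ctid FROM $TABLE"); do
  if ! psql -d "$DB" -c "SELECT * FROM $TABLE WHERE ctid = '$ctid'" > /dev/null 2>&1; then
    echo "problem reading ctid $ctid"
  fi

  # if that read crashed Postgres, just wait for it to come back up
  until psql -d "$DB" -c "SELECT 1" > /dev/null 2>&1; do
    echo "waiting for Postgres to come back up..."
    sleep 5
  done
done
```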
This selects all
ctids from a table and then iterates over them.
For each, it dumps all the data on that ctid’s page, and then skips to the next page (yes, it would probably be faster to query for only
distinct pages and iterate over just that; this was fast enough).
This prints errors on failed pages (I no longer cared about individual columns at this point; these few pages were a mess and I couldn’t get anything out), but the really nice part is that if Postgres crashes,
it also just prints an error and waits for it to come back up. This
lets you scan a full table even if Postgres crashes on every page,
and you can apply the page-zapping technique above.
Victory (sort of)
After a few more rounds of dealing with a potpourri of the issues above, I was eventually able to clear out all the errors and get pg_dump to complete successfully. Only a few dozen pages were affected, but sorting through everything took hours and was very mentally demanding: this is where any abstraction provided by your database breaks down, and you need to start thinking of your data in a completely different way.
I’ve also elided some of the more frustrating inconsistencies and
gotchas in dealing with a system in this state. Occasionally corrupt
pages that I had cleared would come back (autovacuum and other
connections were off, so I’m still not sure what caused these),
and in one case a manual
COPY worked even though an earlier run of the same command failed. Be prepared for weirdness.
Overall, though, Postgres held up admirably: it’s very impressive how well it does in the face of serious data corruption.
Unfortunately, after all this, the dump I got would still not restore cleanly. I’ll go over the (much simpler) steps I had to take to get things back into Postgres in the next post.