Tuple Flensing I: Treating Postgres Corruption

Part One: Egress

While Postgres is an incredibly robust and defensively-written system, occasionally one does run into problems with data corruption. This can happen due to hardware issues, filesystem or other OS bugs, and sometimes, yes, even Postgres bugs. However, even when data corruption does occur, Postgres will generally contain the damage to specific chunks of data, and manual intervention will let you recover anything that was not affected.

This is the story of lessons learned from treating an extensive case of such corruption in the course of my job with the Heroku Postgres Department of Data. This post describes recognizing the problems and getting the data out of a corrupt system; the next details restoring it.

All data and metadata is mocked out; no customer data was used in the making of this post.

Preface

Josh Williams has a fantastic post about treating corruption; Josh’s post and tips from my colleagues (and Postgres contributors) Daniel Farina and Peter Geoghegan were instrumental in helping me wrap my head around all the nuts and bolts relevant here.

Now before we dig in, note that the easiest recovery is one you don’t need to perform at all. So how can you get a pristine copy of your database? Try your standby. There’s a chance the corruption only affects the primary: it may never have made it into the WAL stream, or it may not have been present when the replica’s base backup was taken. Don’t have a standby, but have a base backup and the required WAL segments? Try making a fresh replica (this can take some time when the base backup is old, but it’s better than the alternative), and do it in parallel with your other recovery efforts. Validate these candidates with pg_dump, as discussed below, and then run a dump and restore (again, see below) to ensure that the corruption is quarantined. Seriously, don’t delude yourself into thinking it’s going to be okay.

This is by far the easiest and safest mechanism for treating corruption. It’s not a panacea, but we’ve found it to be unreasonably effective.

In my case, the existing standby and fresh standbys all failed and extensive manual recovery was required, but at least it makes for an interesting story.

If you do need to treat corruption manually, there are a few things you should keep in mind:

  • Avoid writes: Postgres is in a precarious state; don’t tempt fate by asking it to do more than necessary. In particular, this is a good point to turn off autovacuum.
  • Stop Postgres and make a filesystem copy of the full data directory, ideally to another disk if you suspect the hardware. The recovery process will involve clobbering data; you want to be able to start over or cross-reference with the original state of the system if you run into problems.
  • Your final step in a corruption recovery should almost always be a dump and restore via pg_dump. This is the only way to ensure that you’re actually running on a sane system. This can take many hours for a large database, but if you skip this you might be back at square one before too long.
  • The zero_damaged_pages setting may be an easier way to recover. It doesn’t always work, but it essentially automates the page-zeroing approach outlined here (see the sketch after this list).
  • This ain’t Kansas anymore: this is effectively Byzantine failure territory. The advice here might work or everything might go horribly wrong. There are no fool-proof step-by-step instructions. You’re on your own. Furthermore, what’s outlined here is a general-purpose, fairly coarse approach to recovery. If you need to take a more precise approach or just need someone to blame, consider hiring a Postgres consultancy.
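
For reference, here’s a minimal sketch of the zero_damaged_pages approach (the table name is made up). It only helps with pages whose headers are damaged, and it zeroes those pages as they’re read, so use it on a copy of your data and still finish with a dump and restore:

-- session-level and superuser-only: instead of erroring out on a page with a
-- bad header, Postgres warns, zeroes that page in memory, and keeps going
set zero_damaged_pages = on;
-- read the whole table (or run your pg_dump) while the setting is on
select count(*) from some_corrupt_table;
set zero_damaged_pages = off;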

A bad sign

In this particular case, the problem was almost certainly caused by a filesystem or hardware issue. Postgres keeps track of the state of your database via three main sets of data: the heap, where all the physical data resides; the c(ommit)log, which tracks which transactions have been committed and which haven’t; and the write-ahead log (WAL), which does all the bookkeeping to ensure crash-safety and is a key mechanism in Postgres replication.

After initially noticing the corruption, I poked at the affected files with hexdump, and noticed that one of the clog files, which normally look like this:

$ hexdump -C pg_clog/0000 | head
00000000  40 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |@UUUUUUUUUUUUUUU|
00000010  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
*
000000b0  55 55 55 55 55 55 55 55  55 55 65 55 55 59 55 95  |UUUUUUUUUUeUUYU.|
000000c0  55 55 55 55 55 55 59 55  55 55 55 55 55 55 55 55  |UUUUUUYUUUUUUUUU|
000000d0  55 95 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |U.UUUUUUUUUUUUUU|
000000e0  55 65 55 55 95 55 55 95  55 55 55 55 55 55 59 55  |UeUU.UU.UUUUUUYU|
000000f0  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
*
00014920  55 55 55 55 55 55 55 55  95 55 55 95 55 55 95 55  |UUUUUUUU.UU.UU.U|

…looked more like this instead:

$ hexdump -C pg_clog/0002 | head -20
00000000  40 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |@UUUUUUUUUUUUUUU|
00000010  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
*
00000040  02 00 00 00 00 00 00 00  00 00 00 00 00 00 03 00  |................|
00000050  01 00 16 00 07 09 20 ff  7f 0d 00 00 00 00 00 00  |...... .........|
00000060  e7 04 00 00 16 00 00 00  00 00 00 00 10 00 00 00  |................|
00000070  0c 18 66 bf 01 00 02 00  03 00 00 00 62 00 00 00  |..f.........b...|
00000080  98 02 00 00 98 02 00 00  00 00 00 00 da 00 00 00  |................|
00000090  a4 01 00 00 04 01 00 03  01 bc 02 00 00 64 08 00  |.............d..|
000000a0  00 00 01 10 7b 1f 50 3d  00 65 e1 43 3b 05 dd df  |....{.P=.e.C;...|
000000b0  3a 88 c4 e5 a7 0f 04 13  05 dd 5f 0f 04 ff 01 0f  |:........._.....|
000000c0  04 36 33 01 00 00 00 00  00 00 00 bc 02 00 00 01  |.63.............|
000000d0  00 00 00 01 00 00 00 38  ef 4b 3e 00 00 00 00 00  |.......8.K>.....|
000000e0  ee 0d 00 00 c8 06 00 00  04 01 00 03 01 19 00 00  |................|
000000f0  00 64 08 00 00 00 01 10  4c 00 00 00 00 61 67 67  |.d......L....agg|
00000100  72 65 67 61 74 00 65 5f  64 75 6d 6d 79 00 00 48  |regat.e_dummy..H|
00000110  00 00 00 73 65 6c 65 00  63 74 20 24 32 20 2b 20  |...sele.ct $2 + |
00000120  00 24 31 00 00 2c 00 00  00 00 74 65 78 74 6c 65  |.$1..,....textle|

Note the ASCII view of the last few lines. If you see anything human-readable in the clog bitmaps, you should either start buying lottery tickets, or something very bad has happened.

I originally checked the file after seeing an unexpected error message that led me to file a Postgres bug. (The Postgres community, by the way, takes data integrity issues very seriously and addresses them quickly, so don’t hesitate to file a bug if you think you’re running into a legitimate Postgres problem.)

Trouble with transactions

The above happened on a replica, and checking the master revealed a different—but just as serious—problem in the logs:

PG::Error: ERROR: could not access status of transaction 2616655872
DETAIL: Could not open file "pg_clog/09BF": No such file or directory.

The clog files are bitmaps that live in the pg_clog subdirectory of your Postgres data directory and track which transactions have been committed and which have aborted.
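
As a sanity check, you can work out for yourself which segment a given transaction ID lands in: with the default 8 kB page size, the clog packs two status bits per transaction, so each 256 kB segment file covers 1,048,576 transactions. A quick sketch of the arithmetic, using the xid from the error above:

-- 4 transactions per byte * 262144 bytes per segment = 1048576 per segment
select to_hex(2616655872 / 1048576) as clog_segment,          -- 9bf, i.e. file 09BF
       (2616655872 % 1048576) / 4  as byte_offset_in_segment; -- 131072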

You can take a look in the subdirectory and see the various clog files. If you don’t know where your data directory is, you can run the following Postgres command (this must be run as a Postgres superuser, e.g. postgres in most installations):

maciek=# show data_directory;
        data_directory        
------------------------------
 /var/lib/postgresql/9.2/main
(1 row)

In the pg_clog subdirectory of that data directory, you’ll see a number of clog files:

$ sudo ls -l /var/lib/postgresql/9.2/main/pg_clog
total 1072
-rw------- 1 postgres postgres 262144 Jan  9  2013 0000
-rw------- 1 postgres postgres 262144 Mar 12 11:53 0001
-rw------- 1 postgres postgres 262144 May  8 11:07 0002
...
-rw------- 1 postgres postgres 262144 Aug 12 13:41 0011
-rw------- 1 postgres postgres  32768 Aug 13 16:38 0012

These are sequential, and they end well before the 09BF file that the error message complained about, so that’s not a great sign. So what can we do? A reasonable assumption might be that everything referencing transactions in that suspect clog file committed (actually, that’s a terrible assumption, but it’s easy to fake transaction metadata like that, and it’s the only way to see the effects of these alleged transactions, so let’s roll with it). To gin up a hunky-dory clog file where everything committed, we can use the trusty dd tool. If you’re not familiar with dd, you can think of it as a bit scalpel for file surgery:

$ sudo -u postgres dd if=/dev/zero of=/var/lib/postgresql/9.2/main/pg_clog/09BF bs=256k count=1
1+0 records out
262144 bytes (262 kB) copied, 0.000437751 s, 599 MB/s

Here, of is the output clog file you need to generate (dd has a somewhat unconventional way to pass arguments). Everything else (block size, block count, and input file—the handy /dev/zero special file) can stay the same. Note that you’ll probably need to run with sudo as whatever user owns the other clog files (typically postgres).

Checking sanity

So does this fix our problem? One way to check is to re-run the query that triggered the error in the first place. A more thorough check is to just run pg_dump to /dev/null:

$ pg_dump --verbose >/dev/null my_db

This works because generally, corruption manifests as either references to bogus transactions (which can be addressed as above), or some sort of failure when attempting to read a specific set of data from disk. Postgres organizes data on disk in pages, normally 8192 bytes each. When things get corrupted, problems tend to be page-aligned—either a page will be fine, or it’s wholly suspect. Failure to read a page indicates that either the Postgres page headers are corrupt, or that the data is mangled in such a way as to cause errors in the per-datatype deserialization functions.
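
If you want to poke at an individual page from inside the database, the pageinspect contrib extension can help. A small sketch, assuming the extension is available and using a made-up table name (reading a badly mangled page this way can fail, or worse, so do it on your copy):

create extension if not exists pageinspect;
-- inspect the header of page 0 of the table; nonsense values here suggest
-- page-header corruption rather than mangled tuple data
select * from page_header(get_raw_page('some_table', 0));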

Occasionally, a page will be corrupt in such a way as to crash Postgres when one attempts to read it. This is problematic for recovery, but not a showstopper—if we can figure out what pages are causing crashes, we can surgically excise them.

In either case, pg_dump attempts to read every single live page of data, so if you make it through that, you’re on your way to a functioning system. Note, however, that this does not validate the integrity of your indexes: another good reason to always go through a dump and restore.
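
If the full dump and restore has to wait, one partial hedge against index damage is to rebuild the indexes you suspect; REINDEX rewrites an index from the heap, though of course it can’t repair heap corruption (the table and index names here are hypothetical):

-- rebuild all indexes on a table from its heap
reindex table some_table;
-- or rebuild a single suspect index
reindex index some_table_pkey;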

Note also that the most pernicious aspects of corruption occur when none of the above trigger: when bits get shuffled around, but in such a way as to still represent valid data (just not the data you put there). Unfortunately, prior to Postgres 9.3’s checksums feature, there’s not much you can do about this. Fortunately, this is also the least likely problem, as corruption tends to cause chaos and very few datatypes are resilient to bit scrambling (numeric types being a notable exception).
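
If you’re on a newer cluster, you can at least confirm whether checksums were enabled at initdb time; as far as I know, recent releases expose this as a read-only setting:

show data_checksums;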

Pages behaving badly

In this case, pg_dump still failed with the same kind of error a few more times. After applying the same remedy to a few different clog files, I eventually got to a different error:

ERROR:  invalid memory alloc request size 2667865904
STATEMENT:  COPY public.some_table (col1, col2) TO stdout

While Postgres supports pretty big rows and fields, its internal memory management is limited to working with chunks up to 1GB. This request for 2.5GB seems shady. It’s likely the metadata about the size of the value stored on this page is corrupted.
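
One relatively gentle way to hunt for a row like this (using the mocked table and column names from the error above) is pg_column_size, which reports the stored size of each value without reading the whole thing back, so an absurd length can show up before it ever triggers the allocation error:

select ctid,
       pg_column_size(col1) as col1_bytes,
       pg_column_size(col2) as col2_bytes
from public.some_table
order by greatest(pg_column_size(col1), pg_column_size(col2)) desc
limit 10;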

Without significantly more digging, we have to give up on this data as lost and nuke the page. There are a few other errors you may see that likely point to corruption:

pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: timestamp out of range
pg_dump: The command was: COPY public.table (id, created_at) TO stdout

and

pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  invalid page header in block 29 of relation "foo"
pg_dump: The command was: COPY public.foo (id, bar) TO stdout

and

pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: missing chunk number 0 for toast value 118848655 in pg_toast_2619
pg_dump: The command was: COPY public.bar (id, baz, quux) TO stdout

So how do we track down the offending page? Fortunately, all rows in Postgres have a hidden column called ctid that is effectively the “physical” row id. If you have the ctid, you can figure out where the page lives on disk. You can project the ctid just like any other column:

maciek=# select ctid, relname from pg_class;
  ctid  |                  relname                   
--------+--------------------------------------------
 (0,1)  | pg_statistic
 (0,2)  | pg_type
 (0,38) | pg_toast_75018
 (0,39) | pg_toast_2619
 (0,40) | pg_toast_2619_index
 (0,41) | pg_authid_rolname_index
 (0,42) | pg_authid_oid_index
 (0,43) | pg_attribute_relid_attnam_index
 (0,44) | pg_attribute_relid_attnum_index
...
 (7,73) | pg_toast_11620
 (7,77) | pg_toast_11620_index
(320 rows)

The ctid values you see are a (page, offset) pair. Chances are, projecting just the ctid from an affected table is not going to trigger the memory allocation error we saw above (because Postgres doesn’t have to read any of the affected data).
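
Since the page number is what we really care about for surgery, it can be handy to pull it out of the ctid directly. There’s no built-in accessor, but a slightly hacky cast through point does the trick (table name hypothetical):

-- list the distinct heap pages that a table's live rows sit on
select distinct (ctid::text::point)[0]::bigint as page
from some_table
order by 1;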

With that, you can go row by row, projecting every value to see if it causes problems. Doing this manually can take forever on a good-sized table, so I wrote a PL/pgSQL function to do it (the FETCH_COUNT trick from Josh’s post could also have helped, but I didn’t think of that at the time):

CREATE OR REPLACE FUNCTION public.check_ctids(t regclass)
 RETURNS void
 LANGUAGE plpgsql
AS $function$
DECLARE
  tquoted text;
  ctid tid;
  attr name;
BEGIN
SELECT t::text INTO tquoted;  -- a regclass rendered as text is already safely quoted
FOR ctid IN EXECUTE 'SELECT ctid FROM ' || tquoted LOOP
  RAISE NOTICE 'dumping data for ctid %', ctid;
  FOR attr IN EXECUTE 'SELECT quote_ident(attname) FROM pg_attribute WHERE attrelid = $1 and attnum > 0' USING t LOOP
    BEGIN
      -- read just this column of this row via dynamic SQL; corrupt data
      -- should raise an error here, which the handler below turns into a NOTICE
      EXECUTE 'SELECT ' || attr || ' FROM ' || tquoted || ' WHERE ctid = $1' USING ctid;
    EXCEPTION
      WHEN internal_error THEN
        RAISE NOTICE 'data for column % at ctid % is corrupt', attr, ctid;
    END;
  END LOOP;
END LOOP;
END;
$function$;

This looks at all the ctids in the target table, and for each one, tries to project the values in every column for that row. The columns are projected one at a time, so that if one exhibits a problem and we have to purge the row, we can still check the valid ones and save that data beforehand.

This works for small and medium-sized tables, but takes forever on anything sizable. I tried to optimize it first by doing a binary search for bad tuples across ctids, and then by taking advantage of the fact that corruption is—as mentioned above—typically page-oriented. In the end, neither of these panned out (for reasons I’ll get to below), but I learned some interesting PL/pgSQL. In any case, this approach was serviceable for a while and did help me make progress. The output looks something like this:

maciek=# select check_ctids('pg_class');
NOTICE:  dumping data for ctid (0,1)
NOTICE:  dumping data for ctid (0,2)
NOTICE:  dumping data for ctid (0,38)
NOTICE:  dumping data for ctid (0,39)
NOTICE:  dumping data for ctid (0,40)
...
NOTICE:  dumping data for ctid (6,4)
NOTICE:  data for column col1 at ctid (6,4) is corrupt
NOTICE:  data for column col2 at ctid (6,4) is corrupt
NOTICE:  dumping data for ctid (6,5)
NOTICE:  dumping data for ctid (6,6)
NOTICE:  data for column col1 at ctid (6,6) is corrupt
NOTICE:  data for column col3 at ctid (6,6) is corrupt
NOTICE:  dumping data for ctid (6,9)
NOTICE:  data for column col1 at ctid (6,9) is corrupt
NOTICE:  data for column col2 at ctid (6,9) is corrupt
NOTICE:  data for column col3 at ctid (6,9) is corrupt
NOTICE:  dumping data for ctid (6,13)
NOTICE:  data for column col1 at ctid (6,13) is corrupt
NOTICE:  data for column col2 at ctid (6,13) is corrupt
NOTICE:  data for column col3 at ctid (6,13) is corrupt
NOTICE:  dumping data for ctid (6,14)
...
 check_ctids
-------------
 
(1 row)

Trouble on page 6! At this point, you can try to project the other columns from the affected rows (and any potentially unaffected rows) so you can save whatever data is still recoverable.
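
For example, given the mocked output above, ctid (6,4) showed corruption in col1 and col2, so something along these lines might still rescue col3 (which you can then save off with psql’s \copy or similar):

-- col1 and col2 are a lost cause at this ctid, but col3 still reads cleanly
select col3 from some_table where ctid = '(6,4)';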

Once we find an affected page, we need to figure out where it lives on disk. Again, we find what we’re looking for in the data directory. The base subdirectory contains all the “actual” data (as opposed to metadata) in your Postgres instance, so this is where we look. The first level looks like this:

$ sudo ls -l /var/lib/postgresql/9.2/main/base
total 60
drwx------ 2 postgres postgres 12288 Jun 30 15:52 1
drwx------ 2 postgres postgres  4096 Jun 26  2012 12035
drwx------ 2 postgres postgres  4096 Aug  9 15:20 12040
drwx------ 2 postgres postgres 12288 Aug 15 16:59 32775
drwx------ 2 postgres postgres 20480 Aug 16 17:36 75568
drwx------ 2 postgres postgres  4096 Aug 18 02:25 76255
drwx------ 2 postgres postgres  4096 Jun 20 21:51 pgsql_tmp

This contains all the different databases (as in CREATE DATABASE ...) in your Postgres instance. This query can help you figure out the right subdirectory:

maciek=# select oid, datname from pg_database;
  oid  |     datname      
-------+------------------
     1 | template1
 12035 | template0
 12040 | postgres
 32775 | maciek
 75568 | observatory_test
 76255 | pqgotest
(6 rows)

(oid is another hidden column, a surrogate logical identifier for many system tables.) In there, you’ll see files for your tables and other aspects of your database:

$ sudo ls -l /var/lib/postgresql/9.2/main/base/32775
total 6796
-rw------- 1 postgres postgres 163840 Aug 18 01:50 11777
-rw------- 1 postgres postgres  24576 Aug  7 19:33 11777_fsm
-rw------- 1 postgres postgres   8192 Aug  7 19:33 11777_vm
...
-rw------- 1 postgres postgres   8192 Aug  7 19:31 75207
-rw------- 1 postgres postgres   8192 Aug  7 19:31 75208
-rw------- 1 postgres postgres      0 Aug 15 16:59 80343
-rw------- 1 postgres postgres      0 Aug 15 16:59 80346
-rw------- 1 postgres postgres    512 Jun  3 14:37 pg_filenode.map
-rw------- 1 postgres postgres 111220 Aug  9 15:21 pg_internal.init
-rw------- 1 postgres postgres      4 Jun  3 14:37 PG_VERSION

To find the right table, you might assume you can use the oid again (I certainly did), but this won’t always work. The pg_class system catalog table can show you the actual file you need (adjust the namespace as needed):

maciek=# select relfilenode from pg_class c
  inner join pg_namespace ns on c.relnamespace = ns.oid
  where relname = 'foo' and nspname = 'public';
 relfilenode 
-------------
       80343
(1 row)
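
As an aside, on 9.0 and later you can let Postgres do both lookups in one step with pg_relation_filepath:

-- returns the path relative to the data directory, e.g. base/32775/80343
select pg_relation_filepath('public.foo');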

So we have the subdirectory for the database, the file for the table, and the page and offset from the ctid. Now we need to do some bit surgery, just as with the clog files above. This is where dd’s “addressing” feature (via its bs and seek parameters) comes in really handy: dd makes it easy to treat a file as a sequence of contiguous chunks, and to overwrite just one of these chunks.

Let’s go to town:

$ sudo dd if=/dev/zero bs=8192 count=1 seek=6 \
    of=/var/lib/postgresql/9.2/main/base/32775/80343 conv=notrunc
1+0 records out
8192 bytes (8.2 kB) copied, 0.000437751 s, 18.7 MB/s

Here, bs=8192 indicates the Postgres page size, count=1 means we overwrite just one page, and seek=6 means we skip over the first six 8192-byte pages of the output file, landing right on page 6 (ctid page numbers start at zero). Again, we use /dev/zero as a source. The very important conv=notrunc means don’t truncate the file; just write into the middle of it. Remarkably, overwriting a page like this in Postgres is enough to wipe it out with no ill effects. That’s a fantastic property for what is effectively a massive binary serialization format.
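
One caveat: Postgres splits relations larger than 1GB into 1GB segment files (80343, 80343.1, 80343.2, and so on), so for a big table you have to pick the right segment file and adjust the seek accordingly. A sketch of the arithmetic, assuming the default 8 kB page and 1GB segment sizes:

-- where does heap page 200000 of a big table live?
-- 1 GB / 8 kB = 131072 pages per segment file
select 200000 / 131072 as segment_suffix,       -- 1, i.e. the 80343.1 file
       200000 % 131072 as seek_within_segment;  -- 68928: use this as dd's seek value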

In any case, doing these one-by-one can get tedious (and error-prone). If you find a lot of these up front, you can even script the dd invocations:

for bad_page in 89 203 221 227;
do
  sudo dd if=/dev/zero of=/var/lib/postgresql/9.2/main/base/16385/17889 \
    bs=8192 seek=${bad_page} count=1 conv=notrunc
done

Be extremely careful with this, though—it’s easy to fat-finger your way to wiping out large swathes of your database.

Trouble in purgatory

After a while of making progress like this, I ran into a problem that I couldn’t solve with PL/pgSQL:

LOG:  server process (PID 23544) was terminated by signal 11
LOG:  terminating any other active server processes
FATAL:  the database system is in recovery mode
LOG:  all server processes terminated; reinitializing

Postgres is written defensively, but if its data is scrambled badly enough, it will still occasionally wig out. With luck, it will recover in a few seconds and you can continue treating the corruption, but subsequent attempts to read the affected pages are likely to cause repeat crashes. This means that any PL/pgSQL functions that rely on being able to scan a whole table can’t really cope with server crashes.

Bash (and shell scripting in general) gets a bad rap—in many cases, deservedly so—but it’s still a fantastic way to glue together other programs. It seemed like an appropriate tool here:

#!/bin/bash

# Fill in with everything your own psql invocation needs
psql="psql -U <user> -d <database> -p <port> -h <host>"

# bail out with a usage message if no relation was given
relation=${1?"Usage: $0 relation"}

last_page=-1
for ctid in $(<<< "select ctid from $relation" $psql -qAt)
do
    page=$(echo $ctid | sed -r 's/\(([0-9]+),[0-9]+\)/\1/')
    if [ "$page" -eq "$last_page" ]
    then
        continue
    fi
    echo "dumping page $page"
    if ! <<< "select * from $relation where ctid >= '($page,1)' and ctid < '($((page+1)),1)'" $psql > /dev/null
    then
        echo "Failed on page $page"
        until <<< "select 1" $psql >/dev/null; do :; done
    fi
    last_page="$page"
done

This selects all ctids from a table and then iterates over them. For each, it dumps all the data on that ctid’s page, then skips subsequent ctids until the next page (yes, it would probably be faster to query for only the distinct pages and iterate over just those; this was fast enough). It prints an error for each failed page (I no longer cared about individual columns; these few pages were a mess and I couldn’t get anything out of them), but the really nice part is that if Postgres crashes, the script also just prints an error and waits for the server to come back up. This lets you scan a full table even if Postgres crashes on every page, and then you can apply the page-zapping technique above.

Victory (sort of)

After a few more rounds of dealing with a potpourri of the issues above, I was eventually able to clear out all the errors and get pg_dump to complete successfully. Only a few dozen pages were affected, but sorting through everything took hours and was very mentally demanding: this is where any abstraction provided by your database breaks down, and you need to start thinking of your data in a completely different way.

I’ve also elided some of the more frustrating inconsistencies and gotchas in dealing with a system in this state. Occasionally corrupt pages that I had cleared would come back (autovacuum and other connections were off, so I’m still not sure what caused these), and in one case a manual COPY worked even though pg_dump’s invocation of the same command failed. Be prepared for weirdness.

Overall, though, Postgres held up admirably: it’s very impressive how well it does in the face of serious data corruption.

Unfortunately, after all this, the dump I got would still not restore cleanly. I’ll go over the (much simpler) steps I had to take to get things back into Postgres in the next post.