Discussion: Database storage bloat
Tony Reina
2004-04-08 09:15:07 UTC
I'm developing a database for scientific recordings. These recordings
are traditionally saved as binary flat files for simplicity and
compact storage. Although I think ultimately having a database is
better than 1,000s of flat files in terms of data access, I've found
that the database (or at least my design) is pretty wasteful on
storage space compared with the binary flat files.

In particular, I tried importing all of the data from a binary flat
file that is 1.35 MB into a PostgreSQL database (a very small test
file; average production file is probably more like 100 MB). The
database directory ballooned from 4.1 MB to 92 MB (a bloat of 65X the
original storage of the binary flat file).

Now I know that table design and normalizing is important. As far as
my partner and I can tell, we've made good use of normalizing (no
redundancy), we've set field sizes to their theoretical skinniness,
and we've made use of foreign keys and views. I'm also aware that
indices/keys and other database internals will necessarily make the
DBMS solution bloated in terms of storage space. However, a 65X bloat
in space seems excessive.

Has anyone run across similar storage concerns? I'd be interested in
knowing if I just have really poorly designed tables, or if something
else is going on here. I figure a bloat of 3-4X would be permissible
(and possibly expected). But this bloat just seems too much.

Thanks.
-Tony
Uwe C. Schroeder
2004-04-08 09:57:45 UTC
Did you run vacuum full after your import?
Post by Tony Reina
I'm developing a database for scientific recordings. [...] The
database directory ballooned from 4.1 MB to 92 MB (a bloat of 65X the
original storage of the binary flat file). [...]
--
UC

--
Open Source Solutions 4U, LLC
2570 Fleetwood Drive
San Bruno, CA 94066
United States
Phone: +1 650 872 2425
Cell: +1 650 302 2405
Fax: +1 650 872 2417
Tony and Bryn Reina
2004-04-08 12:51:14 UTC
Yep. That's after a 'vacuum verbose analyze'.

-Tony

Steve Crawford
2004-04-08 17:26:22 UTC
Post by Tony and Bryn Reina
Yep. That's after a 'vacuum verbose analyze'.
No, he asked if you had run a "vacuum full". A "standard" vacuum just
marks space available for reuse - it does not shrink file sizes. A
"vacuum full" will shrink the files on disk.

Are you doing many updates on the table that is the space-using
culprit? Each update writes a new copy of the record; the old version
only becomes reclaimable once no remaining transactions need it. If you
were to, say, update the entire table to change a single character
field, you would basically double the size of the table.

A vacuum will mark the space that is no longer used as available for
reuse, so if you run regular vacuums you should achieve a fairly
static size.

One thing to consider: since this is how PostgreSQL achieves MVCC, you
may want to avoid updates to a table that has huge record sizes.

For example, say you wanted to store a "last viewed date" for each
file. If you had that date in the same table as your data, every
view would add considerable "bloat" to your table. If, instead, your
file were in a table by itself along with a sequence or other unique
key, then the "last viewed date" or other changing data could be put
into a separate table linked to the file-storage table by that unique
ID. Updating the date or other info about the file would then only
add bloat to the other table, and that bloat would be small.
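
A rough sketch of what I mean (table and column names here are made up):

CREATE TABLE filestore (
    file_id serial PRIMARY KEY,
    file_data bytea                 -- the big, rarely-updated payload
);

CREATE TABLE fileinfo (
    file_id integer PRIMARY KEY REFERENCES filestore (file_id),
    last_viewed date                -- small, frequently-updated columns live here
);

-- each view now rewrites only a tiny fileinfo row, not the big filestore row
UPDATE fileinfo SET last_viewed = CURRENT_DATE WHERE file_id = 1;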

Performance would probably improve as well due to smaller file sizes
and less read/write action on the disks.

Cheers,
Steve


Douglas Trainor
2004-04-08 11:41:37 UTC
Saying "we've set field sizes to their theoretical skinniness" makes me
think that
you may have the wrong data types. For example, you may have used CHAR
and not VARCHAR.
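
For instance (the widths are arbitrary):

SELECT 'abc'::char(40);     -- padded with blanks out to the full 40 characters
SELECT 'abc'::varchar(40);  -- stores and returns just the 3 characters

A char(n) column always pays for its declared width, while varchar and text
store only what you actually put in.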

douglas
Post by Tony Reina
I'm developing a database for scientific recordings. These recordings
are traditionally saved as binary flat files for simplicity and
compact storage. Although I think ultimately having a database is
better than 1,000s of flat files in terms of data access, I've found
that the database (or at least my design) is pretty wasteful on
storage space compared with the binary flat files.
In particular, I tried importing all of the data from a binary flat
file that is 1.35 MB into a PostgreSQL database (a very small test
file; average production file is probably more like 100 MB). The
database directory ballooned from 4.1 MB to 92 MB (a bloat of 65X the
original storage of the binary flat file).
Now I know that table design and normalizing is important. As far as
my partner and I can tell, we've made good use of normalizing (no
redundancy), we've set field sizes to their theoretical skinniness,
[...]
Tony and Bryn Reina
2004-04-08 13:01:26 UTC
I'm using 'text' instead of char. That seemed to cut the bloat down by about
30%. There are only about 11 tables in the DB. I included them at the bottom
in case you're interested.

Perhaps there's just something fundamentally boneheaded about my design
(re-reading "Database Design for Mere Mortals" as I write).

-Tony

CREATE SEQUENCE FileIndicies_fileID_seq;
CREATE TABLE FileIndicies (
fileID integer DEFAULT nextval('FileIndicies_fileID_seq') UNIQUE NOT NULL, --PK
szFileName text NOT NULL,
szFileType text,
CreationDate date,
CreationTime time,
dtimestampresolution float4,
dtimespan float4,
szAppName text,
szFileComment text,
PRIMARY KEY (fileID),
UNIQUE (szFileName, szFileType, CreationDate, CreationTime)
);

CREATE SEQUENCE EntityFile_dbIndex_seq;
CREATE TABLE EntityFile (
EntityID integer, --PK
fileID integer REFERENCES FileIndicies (fileID),
dbIndex integer DEFAULT nextval('EntityFile_dbIndex_seq') UNIQUE NOT NULL,
PRIMARY KEY (fileID, EntityID)
);

-- AnalogIndex
CREATE TABLE AnalogIndicies (
dbIndex integer REFERENCES EntityFile (dbIndex),
dsamplerate float4 CHECK (dsamplerate > (0)::float4),
dminval float4,
dmaxval float4 CHECK (dmaxval >= dminval),
szunits text,
dresolution float4 CHECK (dresolution > (0)::float4),
dlocationx float4,
dlocationy float4,
dlocationz float4,
dlocationuser float4,
dhighfreqcorner float4 CHECK (dhighfreqcorner >= (0)::float4),
dwhighfreqorder float4 CHECK (dwhighfreqorder >= (0)::float4),
szhighfiltertype text,
dlowfreqcorner float4 CHECK (dlowfreqcorner >= (0)::float4),
dwlowfreqorder float4 CHECK (dwlowfreqorder >= (0)::float4),
szlowfiltertype text,
szprobeinfo text,
PRIMARY KEY (dbIndex)
);

CREATE TABLE AnalogValues (
dbIndex integer REFERENCES EntityFile (dbIndex),
dtimestamp float4 NOT NULL,
dvalue float4,
PRIMARY KEY (dbIndex, dtimestamp)
);

CREATE TABLE EventIndicies (
dbIndex integer REFERENCES EntityFile (dbIndex),
dweventtype smallint CHECK (dweventtype >=0 AND dweventtype <= 4),
dwmindatalength smallint CHECK (dwmindatalength > (0)::float4),
dwmaxdatalength smallint CHECK (dwmaxdatalength >= dwmindatalength),
szcsvdesc text,
PRIMARY KEY (dbIndex)
);

CREATE TABLE EventValues (
dbIndex integer REFERENCES EntityFile (dbIndex),
dtimestamp float4,
dwdatabytesize smallint CHECK (dwdatabytesize > (0)::float4),
eventvalue text,
PRIMARY KEY (dbIndex, dtimestamp)
);

CREATE TABLE NeuralIndicies (
dbIndex integer REFERENCES EntityFile (dbIndex),
dwsourceentityid smallint,
dwsourceunitid smallint,
szprobeinfo text,
PRIMARY KEY (dbIndex)
);

CREATE TABLE NeuralValues (
dbIndex integer REFERENCES EntityFile (dbIndex),
dtimestamp float4 NOT NULL,
PRIMARY KEY (dbIndex, dtimestamp)
);

CREATE TABLE SegmentIndicies (
dbIndex integer REFERENCES EntityFile (dbIndex),
dwsourcecount smallint CHECK (dwsourcecount > (0)::float4),
dwminsamplecount smallint CHECK (dwminsamplecount > (0)::float4),
dwmaxsamplecount smallint CHECK (dwmaxsamplecount > (0)::float4),
dsamplerate float4 CHECK (dsamplerate > (0)::float4),
szunits text,
PRIMARY KEY (dbIndex)
);

CREATE TABLE SegmentSourceIndicies (
dbIndex integer REFERENCES EntityFile (dbIndex),
dwsourceid integer CHECK (dwsourceid >= 0),
dminval float4,
dmaxval float4 CHECK (dmaxval >= dminval),
dresolution float4 CHECK (dresolution > (0)::float4),
dsubsampleshift float4,
dlocationx float4,
dlocationy float4,
dlocationz float4,
dlocationuser float4,
dhighfreqcorner float4 CHECK (dhighfreqcorner >= (0)::float4),
dwhighfreqorder float4 CHECK (dwhighfreqorder >= (0)::float4),
szhighfiltertype text,
dlowfreqcorner float4 CHECK (dlowfreqcorner >= (0)::float4),
dwlowfreqorder float4 CHECK (dwlowfreqorder >= (0)::float4),
szlowfiltertype text,
szprobeinfo text,
PRIMARY KEY (dbIndex, dwsourceid)
);

CREATE TABLE SegmentValues (
dbIndex integer REFERENCES EntityFile (dbIndex),
dwunitid smallint,
dwsampleindex smallint,
dtimestamp float4,
dvalue float4,
PRIMARY KEY (dbIndex, dtimestamp, dwsampleindex, dwunitid)
);

CREATE VIEW SegmentData AS
SELECT a.szFileName, a.szFileType, a.creationdate, a.creationtime,
c.EntityID,
b.dwunitid, b.dwsampleindex, b.dtimestamp, b.dvalue
FROM FileIndicies AS a, SegmentValues AS b, EntityFile AS c
WHERE
b.dbIndex = c.dbIndex AND
a.fileID = c.fileID
ORDER BY a.szFileName, a.szFileType, a.creationdate, a.creationtime,
c.EntityID,
b.dwunitid, b.dwsampleindex, b.dtimestamp;




Tom Lane
2004-04-08 16:14:52 UTC
Post by Tony and Bryn Reina
There are only about 11 tables in the DB. I included them at the bottom
in case you're interested.
What are the actual sizes of the tables? Probably the most useful way
you could present that info is "vacuum verbose" output for each table.

regards, tom lane

Nicolas Kowalski
2004-04-08 13:33:24 UTC
Post by Tony Reina
Has anyone run across similar storage concerns? I'd be interested in
knowing if I just have really poorly designed tables, or if something
else is going on here. I figure a bloat of 3-4X would be permissible
(and possibly expected). But this bloat just seems too much.
Yes; however, my example is not very accurate because we are still
running an old version (7.2.1, Debian stable).

In our case, the dump file created by pg_dumpall is around 4 MB, but the
data directory is around 180 MB.

Regards.
--
Nicolas

Stephan Szabo
2004-04-08 13:46:20 UTC
Post by Tony Reina
[...] In particular, I tried importing all of the data from a binary flat
file that is 1.35 MB into a PostgreSQL database (a very small test
file; average production file is probably more like 100 MB). The
database directory ballooned from 4.1 MB to 92 MB (a bloat of 65X the
original storage of the binary flat file).
Well, an important question is where that space is going. It'd be
interesting to see a breakdown by directory and then by file (using
contrib/oid2name to work out which tables/indexes/etc. they are).

At least 16MB of that is probably going into the transaction log (IIRC
that's the default size for the segments) in pg_xlog.

Tony and Bryn Reina
2004-04-08 15:49:26 UTC
Post by Stephan Szabo
Well, an important question is where that space is going. It'd be
interesting to see a breakdown by directory and then by file (using
contrib/oid2name to work out which tables/indexes/etc. they are).
At least 16MB of that is probably going into the transaction log (IIRC
that's the default size for the segments) in pg_xlog.
Thanks Stephan. That at least helps me narrow it down to my one problem
table: segmentvalues. It is taking up almost 50 MB, and its primary key is
over 38 MB. Nothing else even comes close. Here's the relevant output from
oid2name, in order of size.

SIZE (KB)      OID   TABLENAME
48844        17296 = segmentvalues
38100        17298 = segmentvalues_pkey
 1108        16642 = pg_proc_proname_args_nsp_index
  852        17246 = neuralvalues
  676        17248 = neuralvalues_pkey
  ...

Tom asked to see the table schema. Here are the three tables relevant to the
bloat:

CREATE SEQUENCE FileIndicies_fileID_seq;
CREATE TABLE FileIndicies (
fileID integer DEFAULT nextval('FileIndicies_fileID_seq') UNIQUE NOT NULL, --PK
szFileName text NOT NULL,
szFileType text,
CreationDate date,
CreationTime time,
dtimestampresolution float4,
dtimespan float4,
szAppName text,
szFileComment text,
PRIMARY KEY (fileID),
UNIQUE (szFileName, szFileType, CreationDate, CreationTime)
);

CREATE SEQUENCE EntityFile_dbIndex_seq;
CREATE TABLE EntityFile (
EntityID integer, --PK
fileID integer REFERENCES FileIndicies (fileID),
dbIndex integer DEFAULT nextval('EntityFile_dbIndex_seq') UNIQUE NOT NULL,
PRIMARY KEY (fileID, EntityID)
);


CREATE TABLE SegmentValues (
dbIndex integer REFERENCES EntityFile (dbIndex),
dwunitid smallint,
dwsampleindex smallint,
dtimestamp float4,
dvalue float4,
PRIMARY KEY (dbIndex, dtimestamp, dwsampleindex, dwunitid)
);

I suppose one thing the binary flat file may be doing is not including the
time stamp in table SegmentValues. Since I know the sampling rate, I can
just calculate the timestamp on the fly by the rate times the index
(assuming no time offset). That would lose a float4 field, but would add
back a smallint field to the table.
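
Something like this is what I have in mind (just a sketch; the view name is
made up, and it assumes dsamplerate is in samples per second with no offset):

-- reconstruct the timestamp from the sample index and the sampling rate
CREATE VIEW SegmentValuesCalc AS
SELECT v.dbIndex, v.dwunitid, v.dwsampleindex,
v.dwsampleindex / i.dsamplerate AS dtimestamp,
v.dvalue
FROM SegmentValues AS v, SegmentIndicies AS i
WHERE v.dbIndex = i.dbIndex;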

Is there any further normalization that I'm missing?

-Tony







Tom Lane
2004-04-08 17:49:23 UTC
Post by Tony and Bryn Reina
CREATE TABLE SegmentValues (
dbIndex integer REFERENCES EntityFile (dbIndex),
dwunitid smallint,
dwsampleindex smallint,
dtimestamp float4,
dvalue float4,
PRIMARY KEY (dbIndex, dtimestamp, dwsampleindex, dwunitid)
);
I suppose one thing the binary flat file may be doing is not including the
time stamp in table SegmentValues. Since I know the sampling rate, I can
just calculate the timestamp on the fly by the rate times the index
(assuming no time offset). That would lose a float4 field, but would add
back a smallint field to the table.
That won't buy you anything at all --- the two bytes saved would be lost
again to alignment padding. (I'm assuming you're on PC hardware with
MAXALIGN = 4 bytes.)

I don't see orders-of-magnitude bloat here though. You've got 16 bytes
of useful data per row (which I suppose was 12 bytes in the flat file?).
There will be 28 bytes of overhead per table row. In addition the index
will require 12 data bytes + 12 overhead bytes per entry; allowing for
the fact that b-tree only likes to pack pages about 2/3rds full, we could
estimate index size as about 36 bytes per original row, giving an
aggregate bloat factor of 6.67X compared to a binary flat file if the
flat file needed 12 bytes per row.
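
In round numbers: 16 + 28 = 44 bytes per heap row, plus (12 + 12) * 3/2 = 36
bytes per index entry, is about 80 bytes per row against 12 bytes in the flat
file, or 80/12 = ~6.7X.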

The only way I could see to get to a 65X bloat factor would be if you'd
repeatedly updated the table rows without vacuuming.

regards, tom lane

Tony and Bryn Reina
2004-04-09 05:09:11 UTC
Post by Tom Lane
I don't see orders-of-magnitude bloat here though. You've got 16 bytes
of useful data per row (which I suppose was 12 bytes in the flat file?).
There will be 28 bytes of overhead per table row. In addition the index
will require 12 data bytes + 12 overhead bytes per entry; allowing for
the fact that b-tree only likes to pack pages about 2/3rds full, we could
estimate index size as about 36 bytes per original row, giving an
aggregate bloat factor of 6.67X compared to a binary flat file if the
flat file needed 12 bytes per row.
The only way I could see to get to a 65X bloat factor would be if you'd
repeatedly updated the table rows without vacuuming.
Thanks Tom (and everyone else). At least I feel more comfortable that
there's not something obvious. I did perform a 'vacuum full', but still no
dice. Thanks to Stephan I've narrowed my search down to one table, and I
have a ballpark theoretical bloat to shoot for (6.67X). I'm going to go
back and try to determine if I'm correctly interpreting the part of the
flat file that holds this table's data.

-Tony

Tony Reina
2004-04-11 20:57:40 UTC
I thought I'd post a follow-up to the question of storage bloat. I
tried approaching the problem by playing a little fast and loose with
database normalization and using arrays instead of flattening everything
out in the tables.

So my original table,

CREATE TABLE original (
dbIndex integer,
index1 smallint,
index2 smallint,
index3 smallint,
dvalue real
);

becomes

CREATE TABLE newtable (
dbIndex integer,
dvalue real[][][]
);
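
To make that concrete, this is roughly how rows go in and come out (the
values and subscripts below are just placeholders):

-- one recording block stored as a 3-D array of samples
INSERT INTO newtable (dbIndex, dvalue)
VALUES (1, '{{{0.1,0.2},{0.3,0.4}},{{0.5,0.6},{0.7,0.8}}}');

-- pull a single sample back out by its three subscripts
SELECT dvalue[2][1][2] FROM newtable WHERE dbIndex = 1;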

Now I'd expect to get better storage here just because I'm dropping 3
smallint fields (so 2 bytes x 3 for each value, or a 6-fold decrease).

However, in actuality I'm seeing a 30-fold decrease:

original table = 48 MB
newtable = 1.6 MB !!!!!

If I'm understanding the data directory correctly, the array tuples
are being TOASTed (which I think must compress them). The actual table
in the newtable format is only 8K, and the pg_toast table that goes with
it is 1.6 MB.

So I'm very pleased with the storage here. Plus, reads are faster than
with the original design. All in all, I think I've found a good solution
for my kind of data.

Now I know that the use of arrays is considered taboo in the books,
but in my case the data being stored actually do fall into an array
(matrix) structure naturally (it's a time-based recording, so there's a
sample for each time point and each X,Y position -- a natural 3-D
matrix).

What are the possible downsides that I may be missing?

Tom Lane
2004-04-08 14:15:19 UTC
Post by Tony Reina
However, a 65X bloat in space seems excessive.
Without concrete details about your schema, it's hard to answer that.
I suspect you made some inefficient choices, but have no data.

For starters you should do a database-wide VACUUM to ensure
pg_class.relpages is up to date, then look to see which tables and
indexes contain the most pages.
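
Something along these lines will do it (relpages is in disk pages, 8K by
default):

SELECT relname, relkind, relpages
FROM pg_class
ORDER BY relpages DESC
LIMIT 20;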

regards, tom lane
