Raw devices vs. Filesystems

Discussion:

Gregory S. Williamson

2004-04-05 19:43:21 UTC

No point to beating a dead horse (other than the sheer joy of the thing) since postgres does not have raw device support, but ...

raw devices, at least on solaris, are about 10 times as fast as cooked file systems for Informix. This might still be a gain for postgres' performance, but the portability issues remain.

raw device use in Informix is safer in terms of data because Informix does not ever have to use the regular file system and so issues of buffering and so on go away. My understanding -- fortunately not ever tried in the real world -- is that postgres' WAL log system is as reliable as Informix writing to raw devices.

raw devices can't be copied or tampered with using regular file tools (mv, cp etc.); this changes how backups get done but also adds a layer of insulation between valuable data and users.

Greg Williamson
DBA
GlobeXplorer LLC
-----Original Message-----
From: Christopher Browne [mailto:***@acm.org]
Sent: Mon 3/29/2004 10:28 AM
To: pgsql-***@postgresql.org
Cc:
Subject: Re: [ADMIN] Raw devices vs. Filesystems

Can you tell me (or at least guide me to a place where I can find the
answer) what are the benefits of filesystems over raw devices?

For PostgreSQL, filesystems have the merit that you can actually use
them. PostgreSQL doesn't support use of "raw devices."

Two major benefits of using filesystems as opposed to raw devices are
that:

a) The use of raw devices is dramatically non-portable; you have to
reimplement data access on every platform you are trying to
support;

b) The use of raw devices essentially mandates that you implement
some form of generic filesystem on top of them, which adds
considerable complexity to your code.

Two benefits to raw devices are claimed...

c) It's faster. But that assumes that the "cooked" filesystems are
implemented fairly badly. That was typically true, a dozen
years ago, but it isn't so typical now, particularly with a
fancy caching controller.

d) It guarantees application control of update ordering. Of course,
with a caching controller, or disk drives that lie to one degree
or another, those guarantees might be gone anyways.

There are other filesystem advantages, such as

e) Shifting "cooked" data around may be as simple as a "mv," whereas
reorganizing on raw disk requires DB-specific tools...
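Point (d) can also be illustrated from the filesystem side. The following is a minimal Python sketch, not PostgreSQL's actual code, of how an application on a cooked filesystem can enforce "log before data" ordering by calling fsync between the two writes. The function names and the two-file layout are hypothetical, and, as (d) notes, a caching controller or a drive that lies about write completion can still defeat the guarantee.

```python
import os


def durable_append(path, payload):
    """Append payload to path and return only after fsync has completed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the OS to push the data to stable storage before returning.
        os.fsync(fd)
    finally:
        os.close(fd)


def logged_update(log_path, data_path, record):
    """Write the log entry durably first, and only then touch the data file."""
    durable_append(log_path, b"LOG " + record + b"\n")
    durable_append(data_path, record + b"\n")
```

The ordering comes entirely from the first fsync returning before the second write starts; nothing here depends on bypassing the filesystem.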

And what filesystem is the best for postgresql performance?

That would depend, assortedly, on what OS you are using, what kind of
hardware you are running on, what kind of usage patterns you have, as
well as on how you define the notion of "best."

Absent any indication of any of those things, the best that can be
said is "that depends..."
--
(format nil "~S@~S" "cbbrowne" "acm.org")
http://cbbrowne.com/info/languages.html
TTY Message from The-XGP at MIT-AI:
The-***@AI 02/59/69 02:59:69
Your XGP output is startling.

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Chris Browne

2004-04-06 20:57:02 UTC

Post by Gregory S. Williamson
No point to beating a dead horse (other than the sheer joy of the
thing) since postgres does not have raw device support, but ... raw
devices, at least on solaris, are about 10 times as fast as cooked
file systems for Informix. This might still be a gain for postgres'
performance, but the portability issues remain.

--
select 'cbbrowne' || '@' || 'cbbrowne.com';
http://www.ntlug.org/~cbbrowne/nonrdbms.html
Rules of the Evil Overlord #1. "My Legions of Terror will have helmets
with clear plexiglass visors, not face-concealing ones."
<http://www.eviloverlord.com/>

Tom Lane

2004-04-07 05:26:02 UTC

Post by Chris Browne
That claim seems really rather remarkable.
It implies an entirely stunning degree of inefficiency in the
implementation of filesystems on Solaris.

Solaris has a reputation for having stunning degrees of inefficiency
in a number of places :-(. On the other hand I've also heard it praised
for its ability to survive partial hardware failures (eg, N out of M
CPUs down), so maybe that's the price you gotta pay.

But to get back to the point of this discussion: to allow PG to use raw
devices instead of filesystems, we'd first have to do a ton of
portability work (since raw disk access is nowhere standard), and
abandon our principle that Postgres does not run as root (since raw disk
access is not permitted to non-root processes by any sane sysadmin).
But that last is a mighty comforting principle to have, anytime someone
complains that their el cheapo whitebox PC locks up as soon as they
start to stress the database. I know I'd have wasted a lot more time
chasing random hardware breakages if I couldn't say "system freezes and
filesystem corruption are Clearly Not Our Fault".

After that, we get to implement our own filesystem-equivalent management
of disk space allocation, disk I/O scheduling, etc. Are we really
smarter than all those kernel hackers doing this for a living? I doubt it.
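To give a sense of what "filesystem-equivalent management of disk space allocation" would involve, here is a toy first-fit extent allocator in Python. It is a hypothetical sketch (the class name and block model are invented for illustration) and leaves out everything a real implementation would need, such as persistence of the free list, crash safety, and I/O scheduling.

```python
class ExtentAllocator:
    """Toy first-fit allocator over a raw device of `total_blocks` blocks."""

    def __init__(self, total_blocks):
        # Free space is kept as a sorted list of (start, length) extents.
        self.free = [(0, total_blocks)]

    def allocate(self, length):
        """Return the start block of a free extent of `length` blocks."""
        if length <= 0:
            raise ValueError("length must be positive")
        for i, (start, size) in enumerate(self.free):
            if size >= length:
                if size == length:
                    del self.free[i]
                else:
                    self.free[i] = (start + length, size - length)
                return start
        raise MemoryError("no free extent large enough")

    def release(self, start, length):
        """Return an extent to the free list and merge adjacent extents."""
        self.free.append((start, length))
        self.free.sort()
        merged = []
        for s, n in self.free:
            if merged and merged[-1][0] + merged[-1][1] == s:
                merged[-1] = (merged[-1][0], merged[-1][1] + n)
            else:
                merged.append((s, n))
        self.free = merged
```

Even this much is bookkeeping that a filesystem already does on the database's behalf, which is the point being made above.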

After that, we get to re-optimize all the existing Postgres behaviors
that are designed to sit on top of a standard Unix buffering filesystem
layer.

After that, we might reap some performance benefits. Or maybe not.
There's not a heck of a lot of hard evidence that we would --- and
what there is traces to twenty-year-old assumptions about disk drive
and OS behavior, which are quite unlikely to still apply today.

Personally, I have a lot of more-promising projects to pursue...

regards, tom lane

Harald Fuchs

2004-04-07 13:05:55 UTC

Post by Tom Lane
But to get back to the point of this discussion: to allow PG to use raw
devices instead of filesystems, we'd first have to do a ton of
portability work (since raw disk access is nowhere standard), and
abandon our principle that Postgres does not run as root (since raw disk
access is not permitted to non-root processes by any sane sysadmin).

Why not? In MySQL/InnoDB, you do a "chown mysql.daemon /dev/raw/raw1"
(or whatever raw disk you want to access), and that's all.
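Whether such a chown has had the intended effect can be checked from an unprivileged process. The small Python sketch below, with a hypothetical helper name, reports what kind of node a path is and whether the current user may read and write it. It says nothing about whether raw access is a good idea, only whether the permissions allow it.

```python
import os
import stat


def describe_access(path):
    """Return (kind, usable): the node type and whether we can read and write it."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return ("missing", False)
    if stat.S_ISCHR(mode):
        kind = "character device"
    elif stat.S_ISBLK(mode):
        kind = "block device"
    elif stat.S_ISREG(mode):
        kind = "regular file"
    else:
        kind = "other"
    # os.access checks permissions for the real user id of this process.
    usable = os.access(path, os.R_OK | os.W_OK)
    return (kind, usable)
```

Calling `describe_access("/dev/raw/raw1")` as the database user would be one way to confirm the setup described above, assuming that device exists on the system in question.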

Post by Tom Lane
After that, we get to implement our own filesystem-equivalent management
of disk space allocation, disk I/O scheduling, etc. Are we really
smarter than all those kernel hackers doing this for a living? I doubt it.

Ditto. I don't have hard numbers for MySQL, but I didn't see any
noticeable improvement when messing with raw disks (at least under
Linux).

Gregory S. Williamson

2004-04-06 21:23:42 UTC

Remarkable, perhaps, to you. Not in the Informix world. But irrelevant to postgres, no ?

-----Original Message-----
From: Chris Browne [mailto:***@acm.org]
Sent: Tuesday, April 06, 2004 1:57 PM
To: pgsql-***@postgresql.org
Subject: Re: [ADMIN] Raw devices vs. Filesystems

That claim seems really rather remarkable.

It implies an entirely stunning degree of inefficiency in the
implementation of filesystems on Solaris.

The amount of indirection involved in walking through i-nodes and such
is something I would expect to introduce some percentage of
performance loss, but for it to introduce overhead of over 900%
presumably implies that Sun (and/or Veritas) got something really
horribly wrong.
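The 900% figure follows directly from the "10 times as fast" claim: if the same work takes 1 unit of time on a raw device and 10 units through the filesystem, the extra 9 units are overhead relative to the raw baseline. As a quick check in Python (the function name is purely illustrative):

```python
def overhead_percent(speedup):
    """Extra time on the slower path, as a percentage of the faster path's time."""
    return (speedup - 1.0) * 100.0


print(overhead_percent(10))  # prints 900.0
```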
--
select 'cbbrowne' || '@' || 'cbbrowne.com';
http://www.ntlug.org/~cbbrowne/nonrdbms.html
Rules of the Evil Overlord #1. "My Legions of Terror will have helmets
with clear plexiglass visors, not face-concealing ones."
<http://www.eviloverlord.com/>

scott.marlowe

2004-04-06 22:46:31 UTC

Note that the inefficiency could well lie with Informix's file system
interfacing as easily as it could lie with the operating system. Do they
charge extra for being able to access raw devices or somehow make more
money by supporting them? If so, there could be a clear business case for
lots of uwaits() in the code path that handles file systems.

I'm just saying it's a possibility.

Post by Gregory S. Williamson
Remarkable, perhaps, to you. Not in the Informix world. But irrelevant to postgres, no ?
-----Original Message-----
Sent: Tuesday, April 06, 2004 1:57 PM
Subject: Re: [ADMIN] Raw devices vs. Filesystems

Marsh Ray

2004-04-07 02:56:38 UTC

Post by Chris Browne

I too am a little surprised by those numbers, but I think the potential
for a performance gain of that order is relevant.

As I once heard someone remark: "When you show up at a pool hall talking
those kind of odds, well, people start making phone calls."

- Marsh

Murthy Kambhampaty

2004-04-07 16:46:16 UTC

Post by Tom Lane
But to get back to the point of this discussion: to allow PG
to use raw devices instead of filesystems, we'd first have to do a ton of
portability work

...

[The following is said in a low, tentative voice :) ]

I wonder whether writing the postgresql data structures as HDF5 data structures (http://hdf.ncsa.uiuc.edu/whatishdf5.html) within a single HDF5 file (perhaps the WAL files would still reside elsewhere) would improve performance while letting HDF5 handle portability and other useful features, and so be a better solution than relying on filesystem features.

HDF5 actually provides an added portability advantage that postgresql does not currently enjoy:
"a completely portable file format, so that a file can be written on any system and read on any other"
(See http://hdf.ncsa.uiuc.edu/HDF5/RD100-2002/All_About_HDF5.pdf).
The HDF5 "distribution" includes tools for dumping data structures, etc. so if you're hooked on filesystem level operations, you have the ability to inspect postgresql data structures within the HDF5 file, i.e., "outside postgresql".

HDF5 is also designed for clustered/grid computing systems:
"The HDF5 format and library provide a powerful means of organizing and accessing data in a manner that allows scientists to share, process, and manipulate data in today's heterogeneous and quickly-evolving high-performance computational environment, including the emerging computational GRIDs." (http://hdf.ncsa.uiuc.edu/HDF5/RD100-2002/All_About_HDF5.pdf, p. 3).
So, the main purpose of this post is to suggest that HDF5's design moves a postgresql version built on a HDF5 datastore that much closer to being ready for cluster-computing environments, with respect to the datastore (there's still the shared memory, etc., that need to be addressed, but ...).

We're playing with HDF5 from Python (see the pytables project) for our "analytics" work, but that requires moving data out of postgresql. I suspect that an SQL interface to HDF5 data structures using postgresql would be a lot more convenient, and that postgresql would gain multiple benefits from having all its data structures in a single HDF5 file. OTOH, maybe us analytics types are better off with Python over HDF5 and "postgresql on HDF5" is not a net win for postgresql. Still, there seems to be a great advantage to having rich data structures to operate on rather than just "files", and allowing the HDF5 library to deal with portability, I/O efficiency, and clustering.

Hope my $0.02 worth was.

Cheers,
Murthy

Josh Berkus

2004-04-09 16:02:00 UTC

Grega,

Well, as I said, that's why I was asking - I'm willing to give it a go
if nobody can prove me wrong. :)

Why not? If you have time?

I thought you knew - OCFS, OCFS-Tools and OCFSv2 have not only been
open-source for quite a while now - they're released under the GPL.

Keen! Wonder if we can make them regret it.

Seriously, if Oracle opened this stuff, it's probably because they used some
GPL components in it. It also probably means that it won't work for
anything but Oracle ...

I don't know what that means to you (probably nothing good, as PostgreSQL
is released under the BSD license),

Well, it just means that we can't ship OCFS with PostgreSQL.

The question does spring up though, that Steve raised in another post -
just for the record, what POSIX semantics can a postmaster live without in
a filesystem?

You might want to ask that question again on Hackers. I don't know the
answer, myself.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Christopher Browne

2004-04-09 19:34:44 UTC

Post by Josh Berkus

Well, as I said, that's why I was asking - I'm willing to give it a go
if nobody can prove me wrong. :)

Why not? If you have time?

True enough.

Post by Josh Berkus

I thought you knew - OCFS, OCFS-Tools and OCFSv2 have not only been
open-source for quite a while now - they're released under the
GPL.

Keen! Wonder if we can make them regret it.
Seriously, if Oracle opened this stuff, it's probably because they
used some GPL components in it. It also probably means that it
won't work for anything but Oracle ...

It could be that the experiment shows that OCFS isn't all that
helpful. Or that it helps cover inadequacies in certain aspects of
how Oracle accesses filesystems.

If it _does_ show that it is helpful, then that may suggest a
filesystem implementation strategy useful for the BSD folks.

The main "failure case" would be if the exercise shows that using OCFS
is pretty futile.

--
select 'cbbrowne' || '@' || 'acm.org';
http://www3.sympatico.ca/cbbrowne/linux.html
Do you know where your towel is?