This is one of the few programming quotes that is not just abstract crap, but on...

smoyer · on Sept 23, 2012

+1 ... and it's been known for a long time. Back in the '70s, Fred Brooks said "Show me your [code] and conceal your [data structures], and I shall continue to be mystified. Show me your [data structures], and I won't usually need your [code]; it'll be obvious."

necrodome · on Sept 23, 2012

any concrete examples you can point to?

gbog · on Sept 23, 2012

Suppose you want to manage the marital status of some people. You could have one data structure, here a table in a database, where you keep (name, status) tuples. This is bad for many reasons: what if someone changes their name? If you need to allow undoing, how can you know what the previous status was before "married" was entered (widowed? single?)?

A better data structure here is an event table (who, did what, when). Then not only do you know the marital status, you also know where it comes from, and you can build much more on this data.
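A minimal sketch of such an event table, using Python's sqlite3 module. The table name, column names, and helper functions are all made up for illustration, and "who" is assumed to be a stable person id rather than a name, so that a name change does not break the history. The sketch also assumes that each event is named after the status it leads to.

```python
import sqlite3

# Hypothetical schema: an insert-only table of (who, did_what, happened_on).
# "when" is an SQL keyword, so the date column is called happened_on here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "id INTEGER PRIMARY KEY, who INTEGER, did_what TEXT, happened_on TEXT)"
)

def record(who, did_what, happened_on):
    """Append one event; existing rows are never updated."""
    conn.execute(
        "INSERT INTO events (who, did_what, happened_on) VALUES (?, ?, ?)",
        (who, did_what, happened_on),
    )

def history(who):
    """All events for one person, oldest first (ISO dates sort as text)."""
    rows = conn.execute(
        "SELECT did_what, happened_on FROM events "
        "WHERE who = ? ORDER BY happened_on, id",
        (who,),
    )
    return rows.fetchall()

def current_status(who):
    """The status is whatever the most recent event says."""
    events = history(who)
    return events[-1][0] if events else None

def previous_status(who):
    """What an undo of the latest event would fall back to."""
    events = history(who)
    return events[-2][0] if len(events) >= 2 else None

record(1, "single", "2001-01-01")
record(1, "married", "2008-06-14")
record(1, "widowed", "2011-03-02")
record(1, "married", "2012-09-01")
```

With these four events, the question from the (name, status) design has an answer: the status before the latest "married" was "widowed", and the full history is still there to build on.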

hnriot · on Sept 23, 2012

You're completely neglecting how the data will be used. By using an RCS-like data store of deltas, you penalize the common case of wanting to query the current state efficiently.

A better way would be to store a person table without the name and marital status, and use its primary key as a key into a detail table that has multiple rows for each individual, along with dates, so that each row represents the person's state: married, name, from, to.

But there are also about a dozen other better solutions than the one you describe.

gbog · on Sept 24, 2012

What you describe is just a denormalization of my solution. You are already optimizing a proposal that was designed as a very short example of a better data structure. That's absurd. If you often need to access the current marital status, you can cache it, store it on the client, use a materialized view, write it in a file along with other info, etc.

I would personally advise against the plain denormalization you propose (if I understood it correctly), because it means your application logic will have to handle it, and your data structure will not produce the very simple, straightforward code that is the hallmark of a good data structure.

lloeki · on Sept 24, 2012

If you want to keep marital status history, then the marital status is an SCD [0].

With Rails-like conventions, here's a minimalist Type II SCD definition:

    +--------+    +-----------------+    +----------+
    | people | -> | people_statuses | <- | statuses |
    +--------+    +-----------------+    +----------+
    | - id   |    | - id            |    | - id     |
    | - name |    | - person_id     |    | - label  |
    +--------+    | - status_id     |    +----------+
                  | - until         |                
                  +-----------------+

where people_statuses.until is the last date on which this relationship is valid.

The logic follows from the data structure definition in a natural manner.

'WHERE people_statuses.until IS NULL' will immediately give you the latest status of someone/everyone. You can trivially update one's status by UPDATEing 'until' and INSERTing a new record, wrapped in a transaction. Additionally, you can use 'until' as a guard WHERE clause for such an update to implement a form of optimistic locking.
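A rough sketch of that update, using Python's sqlite3 module rather than Rails. The three tables follow the diagram above; the sample rows and the helper names (current_row, change_status) are invented for illustration. The guard on 'until' IS NULL is one possible way to do the optimistic locking, not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE statuses (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE people_statuses (
        id        INTEGER PRIMARY KEY,
        person_id INTEGER REFERENCES people(id),
        status_id INTEGER REFERENCES statuses(id),
        "until"   TEXT  -- NULL marks the row that is still valid
    );
    INSERT INTO people   VALUES (1, 'Alice');
    INSERT INTO statuses VALUES (1, 'single'), (2, 'married');
    INSERT INTO people_statuses (person_id, status_id, "until")
        VALUES (1, 1, NULL);
""")

def current_row(person_id):
    """Return (row id, status id) of the open row, or None."""
    return conn.execute(
        'SELECT id, status_id FROM people_statuses '
        'WHERE person_id = ? AND "until" IS NULL',
        (person_id,),
    ).fetchone()

def change_status(person_id, expected_row_id, new_status_id, today):
    """Close the row the caller last saw and open a new one.

    Returns False if that row was already closed by someone else.
    """
    with conn:  # one transaction: commit on success, roll back on error
        closed = conn.execute(
            'UPDATE people_statuses SET "until" = ? '
            'WHERE id = ? AND "until" IS NULL',  # the guard clause
            (today, expected_row_id),
        )
        if closed.rowcount != 1:
            return False
        conn.execute(
            'INSERT INTO people_statuses (person_id, status_id, "until") '
            'VALUES (?, ?, NULL)',
            (person_id, new_status_id),
        )
    return True

seen_row_id = current_row(1)[0]
first_try = change_status(1, seen_row_id, 2, "2012-09-24")   # succeeds
stale_try = change_status(1, seen_row_id, 1, "2012-09-25")   # row already closed
```

The second call fails because the row it expected to close already has an 'until' date, which is the optimistic-locking behaviour: a writer working from stale data changes nothing.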

You can also add a column relating a person to another. With a slightly more complex query you can easily make the relation symmetric and remove the need for 'duplicate' reciprocal records.

Handling name changes and preserving navigable history in the people table is not much harder.

The Wikipedia page about SCDs gives interesting cases.

[0] http://en.wikipedia.org/wiki/Slowly_changing_dimension

gbog · on Sept 24, 2012

That's interesting, thanks. I still wonder whether a simpler event table is not a better data structure, mostly because it is an append-only structure (reads and inserts only), with no need for updates. Granted, getting the current status is a bit slower, but keeping pointers to the latest event plus chaining events can fix it.
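One way to read "pointers to the latest event plus chaining", sketched in plain Python with an in-memory list standing in for the table. Everything here (Event, events, latest, the function names) is hypothetical. Note that the pointer map itself is updated in place; only the event log is append-only.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Event:
    who: int
    did_what: str
    when: str
    prev: Optional[int]  # index of this person's previous event, if any

events = []   # append-only log of Event records
latest = {}   # who -> index of that person's most recent event

def append(who, did_what, when):
    """Add an event chained to the person's previous one."""
    events.append(Event(who, did_what, when, latest.get(who)))
    latest[who] = len(events) - 1

def current(who):
    """One lookup, no scan over the whole log."""
    idx = latest.get(who)
    return None if idx is None else events[idx].did_what

def history(who):
    """Walk the chain backwards, newest event first."""
    idx = latest.get(who)
    out = []
    while idx is not None:
        out.append(events[idx].did_what)
        idx = events[idx].prev
    return out

append(1, "single", "2001-01-01")
append(2, "single", "2003-05-05")
append(1, "married", "2008-06-14")
```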

mturmon · on Sept 23, 2012

Try to implement the game Asteroids without a thought about data structures, just start programming it procedurally as it comes to you. See how far you get in, say, four hours. Find a graphics library, of course.

Then, use a very simple object-oriented model, where everything on-screen (asteroids, ships, enemy ships, shots) has a draw method, a move method, a create method, and an I'm-hit method, together with logical internal state like position and velocity. See how far you get in the same four hours.

Note how much easier it is to do the second way. That's the power of data structures (and, admittedly, some simple OO ideas, deployed in a lightweight way).
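A bare-bones sketch of that object model in Python. No graphics library is used: draw just returns a description string, and the screen size, class names, and the asteroid-splitting rule are assumptions made up for the example. The constructor plays the role of the create method.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    """Anything on-screen: position and velocity plus move/draw/hit."""
    x: float
    y: float
    vx: float = 0.0
    vy: float = 0.0
    radius: float = 1.0
    alive: bool = True

    def move(self, dt, width=800.0, height=600.0):
        # Wrap around the screen edges.
        self.x = (self.x + self.vx * dt) % width
        self.y = (self.y + self.vy * dt) % height

    def draw(self):
        # Stand-in for a real graphics call.
        return f"{type(self).__name__} at ({self.x:.0f}, {self.y:.0f})"

    def hit(self):
        """Mark as destroyed; return any entities spawned by the hit."""
        self.alive = False
        return []

class Shot(Entity):
    pass

class Asteroid(Entity):
    def hit(self):
        self.alive = False
        if self.radius <= 1.0:
            return []  # smallest rocks just vanish
        half = self.radius / 2
        return [
            Asteroid(self.x, self.y, self.vx, self.vy, radius=half),
            Asteroid(self.x, self.y, -self.vx, -self.vy, radius=half),
        ]

def tick(entities, dt):
    """The main loop never needs to know what kind of thing it is moving."""
    for e in entities:
        e.move(dt)
    return [e.draw() for e in entities]

rock = Asteroid(790.0, 10.0, vx=20.0, radius=2.0)
frame = tick([rock, Shot(5.0, 5.0)], 1.0)
pieces = rock.hit()
```

The loop in tick stays the same no matter how many kinds of entity are added, which is most of the saving.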

YZF · on Sept 24, 2012

This is the power of designing before you code. It has nothing to do with data structures. Data structures are part of the design, but you're focusing on the object-oriented design aspect.

mturmon · on Sept 24, 2012

You know, I put in a caveat anticipating this objection, but since you raised it, let me offer the following counterpoint:

You could implement the approach I described in a language without any OO support, and still come out ahead.

Just do it in plain C using function pointers for methods, and write your own dispatcher. Or, do it in assembly (I implemented this approach in PDP-11 assembly).

If you do it this way, it really is all about the data structure -- the methods are subordinate to the data structure, acting just like other state. You could even change the methods after objects have been created -- for example, swap in the "evasive maneuver" move method once you fire on an alien ship, so that the alien goes from lazy drifting to taking evasive action.
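To illustrate without C or assembly, here is the same idea sketched in Python with no classes at all: each object is a plain record, its "methods" are just function-valued fields, and a small hand-written dispatcher calls them. The field names and the two move routines are invented for the example.

```python
def drift(obj, dt):
    """Lazy drifting: keep going in a straight line."""
    obj["x"] += obj["vx"] * dt

def evade(obj, dt):
    """Evasive maneuver: reverse direction and move twice as far."""
    obj["vx"] = -obj["vx"]
    obj["x"] += 2 * obj["vx"] * dt

def dispatch(obj, method, *args):
    """Hand-rolled method call: look up the function stored in the record."""
    return obj[method](obj, *args)

# The "move" slot is ordinary state, just like x and vx.
alien = {"x": 0.0, "vx": 1.0, "move": drift}

dispatch(alien, "move", 1.0)   # drifting: x goes from 0.0 to 1.0
alien["move"] = evade          # the player fired, so swap the behaviour
dispatch(alien, "move", 1.0)   # evading: vx flips to -1.0, x goes to -1.0
```

Swapping the behaviour is a single assignment to a field, which is what makes the methods subordinate to the data structure here.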

javajosh · on Sept 23, 2012

Well, let's say I decide to have two copies of an important number in my program. Now in various places in code I need to update that number. In every place I now need to do the update twice. Also, some of my code needs to read the number when it changes. I could go ahead and just call all of the routines that need to fire when I change the number.

Or I can decompose my application so that I maintain only one number, and modify the update mechanism to support firing off routines when it updates.

So, we went from two numbers, to one number and a lookup table (the table that tracks which routines need to fire). And the code will be vastly simpler in the second case.
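A small sketch of the second design in Python: one number, plus a table of routines to fire when it changes. The class name Observed and its methods are hypothetical, chosen just for this example.

```python
class Observed:
    """A single value plus the lookup table of routines to run on update."""

    def __init__(self, value):
        self._value = value
        self._listeners = []  # the table of routines that need to fire

    def subscribe(self, routine):
        self._listeners.append(routine)

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        # The one and only place where the number is updated.
        self._value = new_value
        for routine in self._listeners:
            routine(new_value)

log = []
score = Observed(0)
score.subscribe(log.append)                    # e.g. redraw a label
score.subscribe(lambda v: log.append(v * 2))   # e.g. recompute a total
score.value = 5
```

Code that changes the number only ever writes score.value; it no longer needs to know which routines depend on it.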

(This is probably not the best example - there are probably some nice database normalization problems that highlight the benefits of good data structures a lot better).