Proggers Bookclub

Clube do livro de programação.

abra um pr para editar essa pagina: github.com/guites/proggers.

Material do segundo encontro (15/04/2026)

Livro: Designing Data-Intensive Applications
Capítulo discutido: Capítulo 2 - Data models and Query Languages
Data da reunião: 15 de abril de 2026


Este foi o material usado para guiar a reunião. O resumo se encontra em https://guites.github.io/proggers/ddia/2/notes.html.

Data Modelling

“The limits of my language mean the limits of my world.” ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​​​​Wittgenstein

Choice of data model shapes how you think about the problem.

Most applications are built by layering one data model on top of another, and usually a common question is: how is it represented in terms of the next-lower layer?

  1. Your brain Layer ​​“I need to store the user name, email and age” you are only thinking about the problem now

  2. The data model layer ​ Then you/or a developer translate the data structures into something a computer can organize, like ​an spreadsheet, JSON files, XML docs or even just PODs

  3. The database layer ​ ​ The database you select to use, gets the data in those formats and figures out how to store it efficiently, or, how to represent that data as bytes in memory, disks or over the internet.

  4. The hardware layer ​ ​The computer turns all of that into actual electrical signals, ones and zeros on a chip or disk, also is the only layer that thinks about electricity.

Game Engine Example

[[Game Engine Architecture Vol I..pdf]]

[!cite] About Layers Like all software systems, game engines are built in layers. Normally upper layers depend on lower layers, but not vice versa. When a lower layer depends upon a higher layer, we call this a circular dependency.

Dependency cycles are to be avoided in any software system, because they lead to undesirable coupling between system make the software untestable and inhibit code reuse. This is especially true for a large‑scale system like a game engine.

[!Cite] About data and assets “Game engines deal with a wide range of asset types, from renderable geometry to materials and textures to animation data to audio. These assets are defined in part by the raw data produced by the artists when they use a tool like Maya, Photoshop or SoundForge.

Every game engine requires some kind of database to manage all of the metadata associated with the game’s assets. This database might be implemented using an honest‑to‑goodness relational database such as MySQL or Oracle, or it might be implemented as a collection of text files, managed by a revision control system such as Subversion, Perforce or Git. We’ll call this metadata the resource database in this book.

No matter in what format the resource database is stored and managed, some kind of user interface must be provided to allow users to author and edit the data. At Naughty Dog, we wrote a custom GUI in C# called Builder for this purpose.”

Relational Vs Document

Impedance Mismatch: friction between two ways of describing the same information.

“descoordenação” entre o modelo mental do programa e o modelo de armazenamento do banco

Your program = objects, linked lists, types, inheritence etc

Your database = tables, rows and columns

exemplo:

struct Item {
    std::string nome;
    double preco;
    int quantidade;
};

struct Pedido {
    int id;
    std::string cliente;
    std::vector<Item> itens;
};

tabela pedidos:

id | cliente

tabela itens_pedido:

id | pedido_id | nome | preco | quantidade

salvar um Pedido:

  1. inserir na table pedidos
  2. pega id
  3. pra cada item:

carregar um Pedido:

  1. busca o pedido
  2. busca todos os itens com pedido_id
  3. reconstroi vector<Item>

ou seja, constantemente:

Problema:

std::vector<Pedido> pedidos = carregarPedidos();

isso por baixo dos panos roda:

  1. query para pegar os pedidos:
SELECT id, cliente FROM pedidos;

e digamos que isso retorne ~100 pedidos

entao pra cada pedido, essa query vai rodar 1x:

SELECT nome, preco, quantidade 
FROM itens_pedido 
WHERE pedido_id = X;

ou seja,

1 query inicial + 100 queries = 101 queries = N+1

entao se:

Relational (SQL, since the 70s) won because it hid the messy storage stuff and let people just declare what they wanted.

It assumed data is rows in tables with relationships between them via foreign keys.

Document databases (aka noSQL) (MongoDB, CouchDB, etc.) came back into fashion because:

  1. Devs wanted JSON-shaped data that matches their objects
  2. Schemas felt rigid.
  3. Better locality, one document, one disk read.
  4. Some apps have tree-shaped data and joins felt dumb for that.

A person has a name, a job history (list), an education list, contact info. In a document DB you just store the whole résumé as one JSON blob: [[Pasted image 20260415093248.png]] [[Pasted image 20260415093248.png|322]]

{
  "user_id": 251,
  "first_name": "Bill",
  "last_name": "Gates",
  "positions": [
    {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
    {"job_title": "Co-founder", "organization": "Microsoft"}
  ],
  "education": [
    {"school_name": "Harvard University", "start": 1973, "end": 1975}
  ]
}

In SQL you’d shred that into a users table, a positions table, an education table, and join them back together. The document version has nice locality (one fetch, all the data). The SQL version is better when many people share the same data, like if “Microsoft” is its own entity that 10,000 résumés point to. Update the company name once, everyone sees it. In the document version you’d have to find every résumé and update it.

-- tables
CREATE TABLE users (
  user_id     INTEGER PRIMARY KEY,
  first_name  TEXT,
  last_name   TEXT
);

CREATE TABLE organizations (
  org_id  INTEGER PRIMARY KEY,
  name    TEXT
);

CREATE TABLE schools (
  school_id  INTEGER PRIMARY KEY,
  name       TEXT
);

CREATE TABLE positions (
  position_id  INTEGER PRIMARY KEY,
  user_id      INTEGER REFERENCES users(user_id),
  org_id       INTEGER REFERENCES organizations(org_id),
  job_title    TEXT
);

CREATE TABLE education (
  education_id  INTEGER PRIMARY KEY,
  user_id       INTEGER REFERENCES users(user_id),
  school_id     INTEGER REFERENCES schools(school_id),
  start_year    INTEGER,
  end_year      INTEGER
);

-- data
INSERT INTO users VALUES (251, 'Bill', 'Gates');

INSERT INTO organizations VALUES
  (1, 'Bill & Melinda Gates Foundation'),
  (2, 'Microsoft');

INSERT INTO schools VALUES
  (1, 'Harvard University');

INSERT INTO positions VALUES
  (1, 251, 1, 'Co-chair'),
  (2, 251, 2, 'Co-founder');

INSERT INTO education VALUES
  (1, 251, 1, 1973, 1975);

One JSON blob became 5 tables and 7 inserts.

To get the résumé back you do a join:

SELECT u.first_name, u.last_name, p.job_title, o.name AS organization
FROM users u
LEFT JOIN positions p     ON p.user_id = u.user_id
LEFT JOIN organizations o ON o.org_id  = p.org_id
WHERE u.user_id = 251;

Many-to-one & Many-to-many

To determine relationships in SQL:

One-to-Many (1:N): One entity instance relates to many instances of another entity.


one user  ─────────>  many positions
(Bill)           (Co-founder at MS, Co-chair at Gates Foundation)

users                          positions
┌─────────┬──────┐             ┌─────────┬───────────────┐
│ user_id │ name │             │ user_id │   job_title   │
├─────────┼──────┤             ├─────────┼───────────────┤
│   251   │ Bill │ <────────── │   251   │  Co-founder   │
└─────────┴──────┘    <─────── │   251   │   Co-chair    │
                               │   251   │   Trustee     │
                               └─────────┴───────────────┘

Bill has 3 positions. Each position belongs to exactly one Bill. One-to-many.

Many-to-One (N:1): Many instances of one entity relate to one instance of another entity. Same relationship, looked at from the other side. Many rows in B point at one row in A.

many positions  ─────────>  one organization
(Bill's job, Linus's job)    (Microsoft)

positions                      organizations
┌─────────┬────────┐           ┌────────┬───────────┐
│ user_id │ org_id │           │ org_id │   name    │
├─────────┼────────┤           ├────────┼───────────┤
│   251   │   1    │ ─────────>│   1    │ Microsoft │
│   252   │   1    │ ─────────>│        │           │
│   253   │   1    │ ─────────>│        │           │
└─────────┴────────┘           └────────┴───────────┘

Many positions, one Microsoft. “One-to-many” and “many-to-one” are literally the same rela=-tionship, you just pick which side you’re standing on.

Many-to-Many (N:M): Many instances of one entity relate to many instances of another entity. Both sides can have many of the other. A student takes many courses, a course has many students. Neither side “owns” the other.

students                  enrollments               courses
┌────────────┬──────┐    ┌────────────┬───────────┐    ┌───────────┬─────────┐
│ student_id │ name │    │ student_id │ course_id │    │ course_id │  title  │
├────────────┼──────┤    ├────────────┼───────────┤    ├───────────┼─────────┤
│     1      │ Ada  │ ◄──│     1      │    101    │───►│    101    │  Math   │
│     2      │ Bob  │ ◄──│     1      │    102    │───►│    102    │ Physics │
└────────────┴──────┘ ◄──│     2      │    101    │───►└───────────┴─────────┘
                         └────────────┴───────────┘

You can’t represent this with one foreign key, you will need a third table in the middle, called a join table.

This is the core trade-off:

  1. if the thing has its own identity and other things reference it, it deserves its own table.
  2. If the thing only exists as part of its parent and nobody else cares about it, stuff it inside the parent as a document.

TLDR:

[[Pasted image 20260415100707.png|116]]

Tree-shaped

Tree-shaped means the data branches downward and never loops back to itself.

              Resume (Bill)
                  │
        ┌─────────┼──────────┐
    Positions  Education  Contact info
        │          │
    ┌───┴───┐      │
   Job1    Job2  Harvard

Each thing has one parent. Nothing points sideways. Nothing points back up.

A document database is basically storing the whole tree as one nested blob:

{
  "name": "Bill",
  "positions": [{...}, {...}],
  "education": [{...}]
}

If your data is shaped like a tree, this is great. One read gets the whole thing.

But the moment your data stops being a tree, or the moment one node needs to point at another node that lives in a different tree, you’re not in tree territory anymore. You’re in graph territory. That’s when you start needing foreign keys, join tables, or an actual graph database.

Tree:​  ​​                       Not a tree (graph):

      A                                  A ──> B
     / \                                 │     │
    B   C                                v     v
   / \                                   C <── D
  D   E                                  │
                                         └──> A   <- loops back