Capítulo 2, Designing Data-Intensive Applications
Aviso: Este resumo foi gerado por IA usando como entrada o áudio transcrito da reunião.
Me ajude a gerar um relatório desta reunião de estudos que teve como foco
discussão do capitulo 2 (Data models and Query Languages) do
livro Designing Data Intensive Applications.
A transcrição está no arquivo "transcript.txt" no diretório atual.
Você deve criar um documento com os seguintes pontos:
1. Resumo (cerca de 300 palavras)
2. Principais tópicos debatidos (até 10)
3. Principais questões levantadas (ex. "Como gerenciar rollback em sistemas
com muitos usuários?")
4. Exemplos utilizados para ilustrar os conceitos abordados (tanto exemplo
presentes no livro quanto exemplos tragos do dia a dia profissional
dos participantes)
A reunião discutiu como os modelos de dados são organizados em camadas e como cada camada abstrai a complexidade da camada inferior. O grupo começou pela ideia de modelar o problema no domínio (ex.: sistema de estudantes), depois converter para estruturas manipuláveis pelo computador (JSON, CSV, structs), em seguida escolher o modelo e mecanismo de persistência, e por fim considerar os níveis mais baixos de armazenamento e execução. Foi reforçada a importância de manter independência entre camadas e evitar dependências circulares.
Na sequência, o debate central foi a escolha entre bancos relacionais, documentais (NoSQL) e, mais adiante, grafos. Um ponto recorrente foi que a decisão não deve ser guiada apenas por “ter relacionamento ou não”, mas principalmente por padrão de acesso aos dados. O grupo usou o contraste entre modelos orientados a objeto e tabelas relacionais para explicar o impedance mismatch e o risco de consultas N+1 quando a consulta é mal estruturada.
Também foi discutido que NoSQL não significa “sem SQL”, mas “Not Only SQL”, e que na prática muitos sistemas misturam abordagens (ex.: PostgreSQL com JSONB). Surgiram trade-offs claros: documentos favorecem localidade dos dados e leitura agregada de entidades; relacionais favorecem normalização, consistência e consultas declarativas otimizadas pelo motor do banco. Em cenários com muitos relacionamentos (especialmente N:N), o grupo observou que o modelo documental pode perder eficiência e clareza, exigindo agregações e junções mais manuais na aplicação.
No trecho final, a conversa avançou para grafos: nós, arestas, múltiplos tipos de relacionamento e linguagens declarativas específicas (como Cypher), com comparação ao custo de simular grafos em SQL puro. Como encaminhamento, foi sugerido montar um laboratório prático para comparar performance, ergonomia de consulta e aderência à regra de negócio entre modelos relacional, documental e grafo.
struct em C para explicar proximidade de dados
em memória.“The limits of my language mean the limits of my world.” Wittgenstein
Choice of data model shapes how you think about the problem.
Most applications are built by layering one data model on top of another, and usually a common question is: how is it represented in terms of the next-lower layer?
Your brain Layer “I need to store the user name, email and age” you are only thinking about the problem now
The data model layer Then you/or a developer translate the data structures into something a computer can organize, like an spreadsheet, JSON files, XML docs or even just PODs
The database layer The database you select to use, gets the data in those formats and figures out how to store it efficiently, or, how to represent that data as bytes in memory, disks or over the internet.
The hardware layer The computer turns all of that into actual electrical signals, ones and zeros on a chip or disk, also is the only layer that thinks about electricity.
[[Game Engine Architecture Vol I..pdf]]
[!cite] About Layers Like all software systems, game engines are built in layers. Normally upper layers depend on lower layers, but not vice versa. When a lower layer depends upon a higher layer, we call this a circular dependency.
Dependency cycles are to be avoided in any software system, because they lead to undesirable coupling between system make the software untestable and inhibit code reuse. This is especially true for a large‑scale system like a game engine.
[!Cite] About data and assets “Game engines deal with a wide range of asset types, from renderable geometry to materials and textures to animation data to audio. These assets are defined in part by the raw data produced by the artists when they use a tool like Maya, Photoshop or SoundForge.
Every game engine requires some kind of database to manage all of the metadata associated with the game’s assets. This database might be implemented using an honest‑to‑goodness relational database such as MySQL or Oracle, or it might be implemented as a collection of text files, managed by a revision control system such as Subversion, Perforce or Git. We’ll call this metadata the resource database in this book.
No matter in what format the resource database is stored and managed, some kind of user interface must be provided to allow users to author and edit the data. At Naughty Dog, we wrote a custom GUI in C# called Builder for this purpose.”
Impedance Mismatch: friction between two ways of describing the same information.
“descoordenação” entre o modelo mental do programa e o modelo de armazenamento do banco
Your program = objects, linked lists, types, inheritence etc
Your database = tables, rows and columns
exemplo:
struct Item {
std::string nome;
double preco;
int quantidade;
};
struct Pedido {
int id;
std::string cliente;
std::vector<Item> itens;
};tabela pedidos:
id | clientetabela itens_pedido:
id | pedido_id | nome | preco | quantidadesalvar um Pedido:
pedidosiditens_pedido com pedido_idcarregar um Pedido:
pedido_idvector<Item>ou seja, constantemente:
Problema:
std::vector<Pedido> pedidos = carregarPedidos();isso por baixo dos panos roda:
SELECT id, cliente FROM pedidos;e digamos que isso retorne ~100 pedidos
entao pra cada pedido, essa query vai rodar 1x:
SELECT nome, preco, quantidade
FROM itens_pedido
WHERE pedido_id = X;ou seja,
1 query inicial + 100 queries = 101 queries = N+1
entao se:
Relational (SQL, since the 70s) won because it hid the messy storage stuff and let people just declare what they wanted.
It assumed data is rows in tables with relationships between them via foreign keys.
Document databases (aka noSQL) (MongoDB, CouchDB, etc.) came back into fashion because:
A person has a name, a job history (list), an education list, contact info. In a document DB you just store the whole résumé as one JSON blob: [[Pasted image 20260415093248.png]] [[Pasted image 20260415093248.png|322]]
{
"user_id": 251,
"first_name": "Bill",
"last_name": "Gates",
"positions": [
{"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
{"job_title": "Co-founder", "organization": "Microsoft"}
],
"education": [
{"school_name": "Harvard University", "start": 1973, "end": 1975}
]
}In SQL you’d shred that into a users table, a
positions table, an education table, and join
them back together. The document version has nice locality (one fetch,
all the data). The SQL version is better when many people share the same
data, like if “Microsoft” is its own entity that 10,000 résumés point
to. Update the company name once, everyone sees it. In the document
version you’d have to find every résumé and update it.
-- tables
CREATE TABLE users (
user_id INTEGER PRIMARY KEY,
first_name TEXT,
last_name TEXT
);
CREATE TABLE organizations (
org_id INTEGER PRIMARY KEY,
name TEXT
);
CREATE TABLE schools (
school_id INTEGER PRIMARY KEY,
name TEXT
);
CREATE TABLE positions (
position_id INTEGER PRIMARY KEY,
user_id INTEGER REFERENCES users(user_id),
org_id INTEGER REFERENCES organizations(org_id),
job_title TEXT
);
CREATE TABLE education (
education_id INTEGER PRIMARY KEY,
user_id INTEGER REFERENCES users(user_id),
school_id INTEGER REFERENCES schools(school_id),
start_year INTEGER,
end_year INTEGER
);
-- data
INSERT INTO users VALUES (251, 'Bill', 'Gates');
INSERT INTO organizations VALUES
(1, 'Bill & Melinda Gates Foundation'),
(2, 'Microsoft');
INSERT INTO schools VALUES
(1, 'Harvard University');
INSERT INTO positions VALUES
(1, 251, 1, 'Co-chair'),
(2, 251, 2, 'Co-founder');
INSERT INTO education VALUES
(1, 251, 1, 1973, 1975);One JSON blob became 5 tables and 7 inserts.
To get the résumé back you do a join:
SELECT u.first_name, u.last_name, p.job_title, o.name AS organization
FROM users u
LEFT JOIN positions p ON p.user_id = u.user_id
LEFT JOIN organizations o ON o.org_id = p.org_id
WHERE u.user_id = 251;To determine relationships in SQL:
One-to-Many (1:N): One entity instance relates to many instances of another entity.
one user ─────────> many positions
(Bill) (Co-founder at MS, Co-chair at Gates Foundation)
users positions
┌─────────┬──────┐ ┌─────────┬───────────────┐
│ user_id │ name │ │ user_id │ job_title │
├─────────┼──────┤ ├─────────┼───────────────┤
│ 251 │ Bill │ <────────── │ 251 │ Co-founder │
└─────────┴──────┘ <─────── │ 251 │ Co-chair │
│ 251 │ Trustee │
└─────────┴───────────────┘
Bill has 3 positions. Each position belongs to exactly one Bill. One-to-many.
Many-to-One (N:1): Many instances of one entity relate to one instance of another entity. Same relationship, looked at from the other side. Many rows in B point at one row in A.
many positions ─────────> one organization
(Bill's job, Linus's job) (Microsoft)
positions organizations
┌─────────┬────────┐ ┌────────┬───────────┐
│ user_id │ org_id │ │ org_id │ name │
├─────────┼────────┤ ├────────┼───────────┤
│ 251 │ 1 │ ─────────>│ 1 │ Microsoft │
│ 252 │ 1 │ ─────────>│ │ │
│ 253 │ 1 │ ─────────>│ │ │
└─────────┴────────┘ └────────┴───────────┘
Many positions, one Microsoft. “One-to-many” and “many-to-one” are literally the same rela=-tionship, you just pick which side you’re standing on.
Many-to-Many (N:M): Many instances of one entity relate to many instances of another entity. Both sides can have many of the other. A student takes many courses, a course has many students. Neither side “owns” the other.
students enrollments courses
┌────────────┬──────┐ ┌────────────┬───────────┐ ┌───────────┬─────────┐
│ student_id │ name │ │ student_id │ course_id │ │ course_id │ title │
├────────────┼──────┤ ├────────────┼───────────┤ ├───────────┼─────────┤
│ 1 │ Ada │ ◄──│ 1 │ 101 │───►│ 101 │ Math │
│ 2 │ Bob │ ◄──│ 1 │ 102 │───►│ 102 │ Physics │
└────────────┴──────┘ ◄──│ 2 │ 101 │───►└───────────┴─────────┘
└────────────┴───────────┘
You can’t represent this with one foreign key, you will need a third table in the middle, called a join table.
This is the core trade-off:
TLDR:
[[Pasted image 20260415100707.png|116]]
Tree-shaped means the data branches downward and never loops back to itself.
Resume (Bill)
│
┌─────────┼──────────┐
Positions Education Contact info
│ │
┌───┴───┐ │
Job1 Job2 Harvard
Each thing has one parent. Nothing points sideways. Nothing points back up.
A document database is basically storing the whole tree as one nested blob:
{
"name": "Bill",
"positions": [{...}, {...}],
"education": [{...}]
}If your data is shaped like a tree, this is great. One read gets the whole thing.
But the moment your data stops being a tree, or the moment one node needs to point at another node that lives in a different tree, you’re not in tree territory anymore. You’re in graph territory. That’s when you start needing foreign keys, join tables, or an actual graph database.
Tree: Not a tree (graph):
A A ──> B
/ \ │ │
B C v v
/ \ C <── D
D E │
└──> A <- loops back