It started with a wrong assumption
When I set out to build real-time collaborative drawing, I thought the hard part would be WebSockets. Setting up a persistent connection, handling reconnects, broadcasting to the right room. That stuff has solid libraries. I figured I'd be done with the hard part in a day.
I was done with WebSockets in a day. Then I spent a week on a problem I didn't yet have a name for: conflict resolution.
The naive approach breaks immediately
The first version was simple: user draws something, send it to the server, server broadcasts to everyone else, everyone applies it. Works great with one user. With two users drawing simultaneously, it falls apart:
Both clients sent an operation at nearly the same time. The server received them in different orders. Each client applied the remote operation on top of a different local state. Now Client A shows a blue stroke where Client B shows red. They've diverged and there's no automatic way to reconcile.
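The divergence is easy to reproduce in a toy model. Assuming each operation just sets the color of one cell (a stand-in for any shared state, not the app's real data model), two clients that receive the same two operations in different orders end up with different states:

```javascript
// Hypothetical minimal model: an operation sets the color of one cell,
// and each client applies operations in the order it receives them.
function apply(state, op) {
  return { ...state, [op.cell]: op.color };
}

const opA = { cell: 5, color: "blue" }; // sent by Client A
const opB = { cell: 5, color: "red" };  // sent by Client B

// The network delivered the operations in different orders:
const clientA = [opA, opB].reduce(apply, {}); // A saw its own op first
const clientB = [opB, opA].reduce(apply, {}); // B saw its own op first

console.log(clientA[5]); // "red"
console.log(clientB[5]); // "blue" — the replicas have diverged
```

With naive broadcast, "last applied wins" depends on arrival order, and arrival order differs per client.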
For canvas drawing this might seem minor — who cares which color wins? But for collaborative text editing (or any structured data), divergence means your users are staring at completely different documents. This is the core problem.
The architecture that actually works
Before getting into conflict-resolution algorithms, the server architecture matters a lot. A solid real-time canvas server has two defining properties:
- Operations are logged, not just applied. Every draw operation gets appended to an immutable log before being applied to the canvas state. When a new client joins, you replay the log to reconstruct their state. This is how they catch up without receiving a potentially huge current-state snapshot.
- The server is the source of truth for ordering. The server sees all operations and defines their canonical order. Clients send operations; the server decides when (in what sequence) those operations happened. This is the key to solving divergence.
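As a sketch of that shape (the class and method names are mine, not the real server): an in-memory room that appends every operation to a log, assigns it a sequence number, broadcasts it, and replays the full log to late joiners:

```javascript
// Minimal sketch of a log-first room. Shapes are illustrative:
// a "client" is anything with a send(entry) method.
class Room {
  constructor() {
    this.log = [];       // append-only operation log
    this.clients = new Set();
    this.nextSeq = 0;    // the server assigns the canonical order
  }

  join(client) {
    // Replay the full log so a late joiner reconstructs current state.
    for (const entry of this.log) client.send(entry);
    this.clients.add(client);
  }

  receive(op) {
    // Log first, then broadcast: the sequence number IS the ordering.
    const entry = { seq: this.nextSeq++, op };
    this.log.push(entry);
    for (const client of this.clients) client.send(entry);
  }
}
```

Because every client applies entries in `seq` order, "which operation happened first" is no longer a question each client answers differently.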
Operational Transforms vs CRDTs — the real choice
Once you accept that the server orders operations, you need a way to transform operations so they make sense applied to a state that's different from what the client had when they generated the operation. This is called Operational Transform (OT).
OT says: when Client A's insert(pos=3) arrives at the server after Client B already did a delete(pos=3), transform A's operation to account for that delete. The position has shifted. The transformed operation is what gets applied and broadcast.
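One transform case from that example, assuming single-character operations with integer positions. This is nowhere near a full OT system, which needs a transform for every pair of operation types, including insert-vs-insert tiebreaks:

```javascript
// Shift a concurrent insert to account for a single-character delete
// the server already applied. (Illustrative, not a complete OT algebra.)
function transformInsertAgainstDelete(insertOp, deleteOp) {
  if (insertOp.pos > deleteOp.pos) {
    // A character before our insertion point was removed, so the
    // insertion point has shifted left by one.
    return { ...insertOp, pos: insertOp.pos - 1 };
  }
  return insertOp; // delete at or after our position: no shift needed
}
```

Multiply this by every pair of operation types your app supports and you see why the transform algebra gets hairy.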
OT works well but has a nasty problem: the transformation functions are hard to get right. For simple operations (insert/delete on a linear document) it's manageable. For complex operations on a 2D canvas (bezier curves, group transforms, layer reordering) the transform algebra gets complicated fast.
CRDTs (Conflict-free Replicated Data Types) take a different approach. Instead of transforming operations after the fact, you design your data structure so that any two concurrent operations commute: applying them in either order produces the same result, so replicas converge without server-side transformation. This is close to what Figma uses (their multiplayer system is CRDT-inspired, though it still routes everything through a central server), and it's why Figma can work offline and sync cleanly when you reconnect.
For my canvas app, I ended up using a simpler scheme: last-write-wins with Lamport timestamps. Each operation carries a logical clock value; when two operations conflict, the one with the higher timestamp wins, with ties broken by client id so every replica picks the same winner. It's not perfect — you can lose work — but for a drawing app where conflicts are rare and stakes are low, it's the pragmatic choice.
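The logical timestamps can be sketched with Lamport-style clocks (my naming and shapes, not the app's actual code). The two rules that matter: bump your clock past anything you observe, and break timestamp ties deterministically:

```javascript
// Last-write-wins with Lamport-style logical clocks. Illustrative sketch:
// one clock per client, ops target named cells.
let clock = 0;

function stamp(op, clientId) {
  clock += 1; // every local event advances the clock
  return { ...op, ts: clock, clientId };
}

function onRemoteOp(op) {
  // Lamport rule: advance the local clock past anything we observe,
  // so our next operation is logically "after" it.
  clock = Math.max(clock, op.ts);
}

// Deterministic winner: higher timestamp wins; client id breaks ties,
// so every replica resolves the same conflict the same way.
function wins(a, b) {
  return a.ts !== b.ts ? a.ts > b.ts : a.clientId > b.clientId;
}
```

The tiebreak is what makes the scheme safe: without it, two operations with equal timestamps would leave replicas free to disagree again.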
Cursor sync — the detail that makes it feel real
Here's something I didn't appreciate until I saw a demo without it: showing other users' cursors is what makes real-time collaboration feel collaborative. Without it, you just see strokes appearing. With it, you see someone else's intention — where they're about to draw.
Cursor sync has a different latency budget than drawing sync. For strokes, ~100ms of latency is fine — nobody notices. For cursors, anything much over ~80ms feels laggy because you're watching smooth mouse movement that suddenly teleports. You need either a separate high-frequency channel or interpolation on the client side.
```javascript
// Client-side cursor interpolation: exponential smoothing toward the
// latest received target, framerate-independent.
function interpolateCursor(current, target, dt) {
  // Convert per-millisecond smoothing into a 0..1 blend factor for this
  // frame. Clamping keeps long frames (e.g. after a tab switch) from
  // overshooting the target.
  const alpha = Math.min(1, dt * 0.01); // tune the 0.01
  return {
    x: current.x + (target.x - current.x) * alpha,
    y: current.y + (target.y - current.y) * alpha,
  };
}

// Called every animation frame, not on each WebSocket message.
let lastFrame = performance.now();
requestAnimationFrame(function animate(now) {
  const dt = now - lastFrame;
  remoteCursors.forEach(cursor => {
    cursor.rendered = interpolateCursor(cursor.rendered, cursor.target, dt);
  });
  drawCursors();
  lastFrame = now;
  requestAnimationFrame(animate);
});
```

The cursor positions received over WebSocket become targets. The rendered positions chase those targets smoothly every animation frame: 60fps rendering from ~10 WebSocket updates per second per user.
The latency budget
One thing I didn't think about until production: what's the acceptable latency for each type of event? They're different:
- Drawing strokes: <150ms feels live. >300ms feels broken.
- Cursor position: <80ms ideally, <120ms acceptable.
- Presence (who's online): 1-2 second updates are fine.
- Undo/redo: Must be instant and consistent — sync carefully.
These different tolerances suggest different architectures: draw strokes go through the reliable conflict-resolution path, cursor updates go through a lossy-but-fast path, presence goes through polling or a much lower-frequency channel.
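Those paths can live behind one dispatch function, even over a single socket. A self-contained sketch (message shapes are mine): strokes are appended reliably and in order, a cursor update overwrites the previous one, and presence replaces a whole snapshot:

```javascript
// Per-type handling on the client: reliable for strokes, lossy
// latest-only for cursors, snapshot-replacement for presence.
const strokeLog = [];            // every stroke kept, in arrival order
const cursorTargets = new Map(); // lossy: only the latest per user
let roster = [];                 // low-frequency presence snapshot

function handleMessage(msg) {
  switch (msg.type) {
    case "stroke":
      strokeLog.push(msg); // must not be dropped or reordered
      break;
    case "cursor":
      // Overwriting is fine: a stale cursor position is worthless.
      cursorTargets.set(msg.userId, { x: msg.x, y: msg.y });
      break;
    case "presence":
      roster = msg.users; // latest snapshot wins
      break;
  }
}
```

The point is that "lossy" is a feature for cursors: skipping an intermediate position costs nothing, while skipping a stroke corrupts the drawing.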
What I'd do differently
I'd use Yjs from the start. It's a production-grade CRDT library with first-class awareness (cursor sync) support and bindings for pretty much every editor framework. Rolling your own conflict resolution is a great learning exercise, but it's not what you want under a product.
And I'd instrument latency from day one. p50 WebSocket round-trip time is the metric that tells you whether your architecture is working. Everything else is a guess.
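Measuring it is cheap. Assuming the client sends timestamped pings and records each echo (a sketch, not a real protocol), a p50 over recent samples is a few lines:

```javascript
// Round-trip-time tracker using nearest-rank percentiles. Illustrative:
// in production you'd cap the sample buffer and report periodically.
const samples = [];

function recordRtt(sentAtMs, receivedAtMs) {
  samples.push(receivedAtMs - sentAtMs);
}

function percentile(p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}
```

Log `percentile(50)` (and `percentile(95)`, which is where users actually live) against the budgets above and you know immediately whether the architecture holds up.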