this started as "just record ar"

i thought this would be easy.

open ar session -> capture frames -> save -> upload

done.

then reality hit:

camera -> around 30 fps
imu -> 100 to 200 hz
depth -> around 15 hz
tracking randomly pauses
ios and android behave differently

and suddenly:

nothing lines up.

the core realization

this is not a camera problem.

this is a time synchronization problem across multiple asynchronous streams.

system architecture

here is the actual system we ended up building. each layer has one job:

native -> capture
orchestrator -> decisions
writer -> structure
upload -> reliability

this separation is what made the system stable.

the core loop (this is everything)

this runs around 30 times per second, and it is your ground truth generator.
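a minimal sketch of one iteration of that loop, under assumptions: `Snapshot`, `Orchestrator`, `capture` and `write` are hypothetical names, and skipping ticks while tracking is paused is an illustrative policy, not necessarily what our orchestrator does.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Snapshot:
    """latest state from the native layer (hypothetical shape)."""
    timestamp_ns: int
    pose: tuple      # placeholder for a 6dof pose
    tracking: bool   # False while tracking is paused


class Orchestrator:
    """decides, once per tick, whether the latest snapshot gets written."""

    def __init__(self, capture: Callable[[], Optional[Snapshot]],
                 write: Callable[[Snapshot], None]) -> None:
        self.capture = capture   # native layer: returns the latest snapshot
        self.write = write       # writer layer: persists one snapshot
        self.skipped = 0

    def step(self) -> bool:
        """run one iteration; returns True if something was written."""
        snap = self.capture()
        if snap is None or not snap.tracking:
            self.skipped += 1    # assumption: skip ticks without valid tracking
            return False
        self.write(snap)
        return True


# usage: three ticks, tracking drops out on the second one
feed = iter([
    Snapshot(0, (0, 0, 0, 0, 0, 0), True),
    Snapshot(33_333_333, (0, 0, 0, 0, 0, 0), False),
    Snapshot(66_666_666, (0, 0, 0, 0, 0, 0), True),
])
written: List[Snapshot] = []
loop = Orchestrator(lambda: next(feed, None), written.append)
results = [loop.step() for _ in range(3)]
```

the point of the shape: capture and write are injected, so the decision logic can be tested without a device.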

deterministic sampling (the real unlock)

initially we did:

if a frame arrives -> record it

this breaks instantly.

the correct model

sample on a fixed clock, not on frame arrival: at each tick, record the latest available state.

why this matters

this gives you:

device-independent datasets
stable cadence (15 hz, 30 hz)
zero long-term drift
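one way to sketch fixed-cadence sampling, assuming nanosecond timestamps and a sorted list of frame arrival times. `tick_times` and `sample_latest` are hypothetical helpers. each tick is computed from the tick index by multiplication, so rounding error never accumulates.

```python
from bisect import bisect_right
from typing import List, Optional

NS_PER_SEC = 1_000_000_000


def tick_times(start_ns: int, rate_hz: int, count: int) -> List[int]:
    """tick k is derived from k directly, never from the previous tick."""
    return [start_ns + (k * NS_PER_SEC) // rate_hz for k in range(count)]


def sample_latest(frame_ts: List[int], ticks: List[int]) -> List[Optional[int]]:
    """for each tick, index of the newest frame at or before it (None if none yet)."""
    picks: List[Optional[int]] = []
    for t in ticks:
        i = bisect_right(frame_ts, t) - 1
        picks.append(i if i >= 0 else None)
    return picks


# usage: jittery ~30 fps arrivals, sampled at a steady 15 hz
frames = [0, 31_000_000, 70_000_000, 99_000_000, 140_000_000]
ticks = tick_times(0, 15, 3)
picks = sample_latest(frames, ticks)
```

here the ticks land at 0, 66_666_666 and 133_333_333 ns, and the picked frame indices are 0, 1 and 3, regardless of how the camera jittered in between.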

what ar frameworks actually give you

both arkit and arcore are doing slam:

visual tracking (camera)
inertial tracking (imu)
map reconstruction

they output:

pose (6dof)
camera frame
feature points
depth (if available)

but:

these are not synchronized streams.

the real problem: multi-rate data

everything must align on:

timestamp, not frame index.
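a small sketch of timestamp-based alignment: for a frame timestamp, find the nearest sample in another stream and reject it if the gap is too large. `nearest` and `max_gap_ns` are hypothetical names, and nearest-neighbor matching is just one option (interpolation is another).

```python
from bisect import bisect_left
from typing import List, Optional


def nearest(ts_sorted: List[int], t: int, max_gap_ns: int) -> Optional[int]:
    """index of the sample closest to t, or None if it is further than max_gap_ns."""
    i = bisect_left(ts_sorted, t)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(ts_sorted)]
    if not candidates:
        return None
    best = min(candidates, key=lambda j: abs(ts_sorted[j] - t))
    return best if abs(ts_sorted[best] - t) <= max_gap_ns else None


# usage: imu at 200 hz (every 5 ms), depth at roughly 15 hz
MS = 1_000_000
imu_ts = [0, 5 * MS, 10 * MS, 15 * MS, 20 * MS]
depth_ts = [0, 66 * MS]

imu_match = nearest(imu_ts, 12 * MS, max_gap_ns=5 * MS)       # closest is 10 ms
depth_match = nearest(depth_ts, 33 * MS, max_gap_ns=10 * MS)  # nothing close enough
```

returning None for a stale match matters: a frame paired with depth from 33 ms ago is worse than a frame with no depth.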

deep dive: mcap (why this was a game changer)

we moved our recordings to mcap.

what mcap actually is

mcap is a container format for heterogeneous timestamped data (mcap.dev).

it's not an encoding. it's a container that wraps multiple already-encoded streams into one file.

why it exists

before mcap:

ros bags (hard to use outside ros)
sqlite logs (not self-contained)
custom formats (painful)

mcap solves this by being:

self-contained
multi-stream
indexed
append-only (Foxglove)

mcap mental model

each stream = topic

each entry = timestamped message

actual file structure (internal)

magic -> header -> data section (chunks) -> summary section (indexes, statistics) -> footer -> magic

key concept: records

mcap is built from records:

schema -> defines structure
channel -> defines topic
message -> actual data
chunk -> batch of messages
index -> fast lookup

(Monday Morning Haskell)
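to make the record relationships concrete, here is a toy in-memory model. it is not the mcap library and not the binary layout; `ToyLog` and its methods are hypothetical. the `ros2msg` and `cdr` strings are the encoding names mcap uses for ros2 data, and the `/imu` topic is made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Schema:
    id: int
    name: str       # e.g. "sensor_msgs/msg/Imu"
    encoding: str   # e.g. "ros2msg"


@dataclass(frozen=True)
class Channel:
    id: int
    topic: str
    schema_id: int          # points at a Schema
    message_encoding: str   # e.g. "cdr"


@dataclass(frozen=True)
class Message:
    channel_id: int   # points at a Channel
    log_time_ns: int
    data: bytes


@dataclass
class ToyLog:
    schemas: Dict[int, Schema] = field(default_factory=dict)
    channels: Dict[int, Channel] = field(default_factory=dict)
    messages: List[Message] = field(default_factory=list)

    def add_schema(self, name: str, encoding: str) -> int:
        sid = len(self.schemas) + 1
        self.schemas[sid] = Schema(sid, name, encoding)
        return sid

    def add_channel(self, topic: str, schema_id: int, message_encoding: str) -> int:
        self.schemas[schema_id]  # raises KeyError if the schema was never registered
        cid = len(self.channels) + 1
        self.channels[cid] = Channel(cid, topic, schema_id, message_encoding)
        return cid

    def add_message(self, channel_id: int, log_time_ns: int, data: bytes) -> None:
        self.channels[channel_id]  # raises KeyError for an unknown channel
        self.messages.append(Message(channel_id, log_time_ns, data))

    def topic_of(self, msg: Message) -> str:
        return self.channels[msg.channel_id].topic


log = ToyLog()
imu_schema = log.add_schema("sensor_msgs/msg/Imu", "ros2msg")
imu_channel = log.add_channel("/imu", imu_schema, "cdr")
log.add_message(imu_channel, 1_000, b"\x00")
```

a message carries only a channel id and a timestamp. the topic name and the schema are stored once and looked up, which is what keeps per-message overhead small.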

why chunking matters

our system uses:

android -> around 1 mb chunks
ios -> around 512 kb

this gives:

high write throughput
fewer disk ops
recoverable files

mcap's append-only design even allows recovery after crashes (Foxglove).
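the batching idea can be sketched like this: buffer small messages and hand them to the disk as one write once a size threshold is crossed. `ChunkBuffer` and `sink` are hypothetical, and the 10-byte threshold is only there to keep the example tiny.

```python
from typing import Callable, List


class ChunkBuffer:
    """collects payloads and emits them as one chunk once chunk_size is reached."""

    def __init__(self, chunk_size: int, sink: Callable[[bytes], None]) -> None:
        self.chunk_size = chunk_size
        self.sink = sink              # called once per chunk, e.g. file.write
        self._parts: List[bytes] = []
        self._size = 0

    def add(self, payload: bytes) -> None:
        self._parts.append(payload)
        self._size += len(payload)
        if self._size >= self.chunk_size:
            self.flush()

    def flush(self) -> None:
        """write whatever is buffered; call this when recording stops."""
        if self._parts:
            self.sink(b"".join(self._parts))
            self._parts = []
            self._size = 0


# usage: five 4-byte messages with a 10-byte threshold
chunks: List[bytes] = []
buf = ChunkBuffer(chunk_size=10, sink=chunks.append)
for _ in range(5):
    buf.add(b"abcd")
buf.flush()
chunk_sizes = [len(c) for c in chunks]
```

five messages turn into two writes (12 bytes, then 8). the tradeoff is that anything still in the buffer at crash time is lost, so bigger chunks mean fewer disk ops but a larger worst-case loss.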

indexing (this is huge)

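without index:

scan every message from the start until you reach the timestamp you want

with index:

jump straight to the chunk that contains it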

mcap supports:

topic-based lookup
timestamp-based seeking
partial reads over network (Segments.ai)

this is critical when:

files are 500 mb and larger
data is remote
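a sketch of why the index makes seeking cheap: with one entry per chunk (time range plus byte offset), finding the chunk for a timestamp is a binary search instead of a full scan. `ChunkEntry` and `find_chunk` are hypothetical, and the sketch assumes chunks are sorted by start time and do not overlap.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class ChunkEntry(NamedTuple):
    start_ns: int   # timestamp of the first message in the chunk
    end_ns: int     # timestamp of the last message in the chunk
    offset: int     # byte offset of the chunk in the file


def find_chunk(index: List[ChunkEntry], t_ns: int) -> Optional[ChunkEntry]:
    """return the chunk whose time range contains t_ns, if any."""
    starts = [e.start_ns for e in index]
    i = bisect_right(starts, t_ns) - 1
    if i >= 0 and index[i].end_ns >= t_ns:
        return index[i]
    return None


index = [
    ChunkEntry(0, 99, 0),
    ChunkEntry(100, 199, 4096),
    ChunkEntry(300, 399, 8192),
]
hit = find_chunk(index, 150)
miss = find_chunk(index, 250)   # falls in the gap between chunks
```

for a remote file, the returned offset is what you would turn into a ranged read, so only the chunk you need crosses the network.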

serialization layer (ros2 + cdr)

our pipeline uses:

ros2 message schemas
cdr encoding

cdr (common data representation):

binary serialization format used by dds and ros2
ensures cross-language compatibility

so each message becomes:

ros2 message -> cdr bytes -> mcap message (topic + timestamp)
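for a feel of what cdr bytes look like, here is a hand-rolled sketch for the simplest ros2 type, `builtin_interfaces/msg/Time` (an int32 `sec` and a uint32 `nanosec`). it assumes little-endian plain cdr with the 4-byte encapsulation header ros2 puts in front. `encode_time` and `decode_time` are hypothetical helpers, and a real pipeline would use a generated serializer instead.

```python
import struct
from typing import Tuple

# encapsulation header: representation id 0x0001 (plain cdr, little-endian) + 2 option bytes
CDR_LE_HEADER = b"\x00\x01\x00\x00"


def encode_time(sec: int, nanosec: int) -> bytes:
    """serialize builtin_interfaces/msg/Time as header + int32 + uint32."""
    return CDR_LE_HEADER + struct.pack("<iI", sec, nanosec)


def decode_time(buf: bytes) -> Tuple[int, int]:
    """inverse of encode_time; rejects any other encapsulation."""
    if buf[:4] != CDR_LE_HEADER:
        raise ValueError("unsupported encapsulation header")
    sec, nanosec = struct.unpack_from("<iI", buf, 4)
    return sec, nanosec


payload = encode_time(1, 500)
roundtrip = decode_time(payload)
```

the payload is 12 bytes: 4 for the header and 4 for each field. larger messages also need alignment padding between fields, which this sketch does not cover.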

ios vs android architecture

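android (pull model)

the app asks arcore for the latest frame whenever it wants one (session.update()).

ios (push model)

arkit delivers frames to a delegate callback (session(_:didUpdate:)) at its own pace.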

why this matters

| problem        | android | ios      |
| -------------- | ------- | -------- |
| timing control | easy    | hard     |
| buffering      | minimal | required |
| backpressure   | rare    | common   |

this is why:

ios needs queues and backpressure handling
android can stay simpler
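a sketch of the kind of queue a push model needs: bounded, safe to push from a callback thread, and dropping the oldest frame when the consumer falls behind. `FrameQueue` is hypothetical, and drop-oldest is one backpressure policy among several (drop-newest and blocking are others).

```python
import threading
from collections import deque
from typing import Any, Deque, Optional


class FrameQueue:
    """bounded queue that drops the oldest item instead of growing without limit."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.dropped = 0
        self._items: Deque[Any] = deque()
        self._lock = threading.Lock()

    def push(self, frame: Any) -> None:
        """called from the framework callback; never blocks for long."""
        with self._lock:
            if len(self._items) >= self.capacity:
                self._items.popleft()   # backpressure: discard the stalest frame
                self.dropped += 1
            self._items.append(frame)

    def pop(self) -> Optional[Any]:
        """called from the sampling loop; returns None when empty."""
        with self._lock:
            return self._items.popleft() if self._items else None


# usage: the producer pushes five frames before the consumer reads any
q = FrameQueue(capacity=3)
for frame_id in range(5):
    q.push(frame_id)
drained = [q.pop() for _ in range(4)]
```

counting drops instead of hiding them is deliberate: a dataset with known gaps is usable, one with silent gaps is not.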

upload architecture

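naive approach

record -> upload the whole file in one request

actual system

record -> split into parts -> upload parts in parallel -> resume after failures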

implementation idea

track which parts the server has confirmed, and only send the ones that are still missing.

this gives:

resumability
parallelism
reliability
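a sketch of that idea in python, under assumptions: `plan_parts`, `upload_missing` and the `send` callback are hypothetical, and the set of finished part indices stands in for whatever state the app persists between launches.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Set, Tuple

Part = Tuple[int, int]  # (byte offset, length)


def plan_parts(total_size: int, part_size: int) -> List[Part]:
    """split a file of total_size bytes into fixed-size parts (last one may be short)."""
    return [(off, min(part_size, total_size - off))
            for off in range(0, total_size, part_size)]


def upload_missing(parts: List[Part], done: Set[int],
                   send: Callable[[int, int, int], None], workers: int = 4) -> Set[int]:
    """upload every part not in done, in parallel; failed parts stay missing."""
    pending = [i for i in range(len(parts)) if i not in done]

    def attempt(i: int) -> Tuple[int, bool]:
        try:
            send(i, *parts[i])   # send(index, offset, length); raises on failure
            return i, True
        except Exception:
            return i, False

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i, ok in pool.map(attempt, pending):
            if ok:
                done.add(i)
    return done


# usage: part 1 fails once, then succeeds on the second pass
parts = plan_parts(total_size=10, part_size=4)
failures_left = {1: 1}


def flaky_send(index: int, offset: int, length: int) -> None:
    if failures_left.get(index, 0) > 0:
        failures_left[index] -= 1
        raise IOError("simulated network error")


done: Set[int] = set()
upload_missing(parts, done, flaky_send)
after_first_pass = set(done)
upload_missing(parts, done, flaky_send)
```

the second call only touches part 1. that is the whole resumability story: persist `done`, and a restart costs one part, not the file.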

hardest problems (real ones)

these took the most time:

imu and frame timestamp alignment
tracking loss compensation
deterministic sampling correctness
writer backpressure
storage exhaustion handling

these are invisible in demos, but they define production systems.

final takeaway

ar recording is a time synchronization problem, not a camera problem.

once you understand this:

sampling becomes obvious
mcap makes sense
uploads become solvable

closing

this started as:

let's record ar

it became:

real-time systems
data engineering
serialization design
distributed uploads

and honestly, that's what made it worth building.