this started as "just record ar"
i thought this would be easy.
open ar session -> capture frames -> save -> upload
done.
then reality hit:
- camera -> around 30 fps
- imu -> 100 to 200 hz
- depth -> around 15 hz
- tracking randomly pauses
- ios and android behave differently
and suddenly:
nothing lines up.
the core realization
this is not a camera problem.
this is a time synchronization problem across multiple asynchronous streams.
system architecture
the system we ended up building has four layers. each layer has one job:
- native -> capture
- orchestrator -> decisions
- writer -> structure
- upload -> reliability
this separation is what made the system stable.
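the four layers can be sketched as minimal interfaces. names and types here are illustrative, not the real api:

```python
# illustrative layer boundaries: native capture feeds an orchestrator,
# the orchestrator hands structured records to a writer, and the
# uploader only ever sees finished files.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Sample:
    stream: str        # "camera", "imu", "depth"
    timestamp_ns: int  # device clock, nanoseconds
    payload: bytes

class Capture(Protocol):
    def poll(self) -> list[Sample]: ...              # native -> capture only

class Orchestrator(Protocol):
    def decide(self, samples: list[Sample]) -> list[Sample]: ...  # decisions

class Writer(Protocol):
    def write(self, sample: Sample) -> None: ...     # structure

class Uploader(Protocol):
    def enqueue(self, path: str) -> None: ...        # reliability
```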
the core loop (this is everything)
this runs around 30 times per second.
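a minimal sketch of that loop, assuming hypothetical helpers `latest_frame`, `latest_pose`, `drain_imu`, and `write_record`:

```python
TICK_HZ = 30
TICK_NS = 1_000_000_000 // TICK_HZ

def run_capture_loop(now_ns, latest_frame, latest_pose, drain_imu,
                     write_record, ticks):
    """drive capture off a fixed clock, not off sensor callbacks."""
    next_tick = now_ns()
    for _ in range(ticks):
        t = next_tick
        frame = latest_frame(t)   # most recent camera frame at tick time
        pose = latest_pose(t)     # most recent 6dof pose at tick time
        imu = drain_imu(t)        # all imu samples since the last tick
        write_record(t, frame, pose, imu)
        # a real implementation would sleep/vsync until next_tick here
        next_tick += TICK_NS      # fixed cadence: drift never accumulates
```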
this loop is your ground truth generator.
deterministic sampling (the real unlock)
initially we did:
if a frame arrives -> record it
this breaks instantly: cadence now depends on each device's frame rate and on dropped frames, so no two recordings line up.
the correct model
sample on a fixed timestamp grid: for each grid tick, record the latest data at or before it, regardless of when frames happened to arrive.
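a sketch of grid-based sampling (function name and shapes are illustrative):

```python
import bisect

def grid_sample(frames_ns, start_ns, hz, count):
    """for each grid timestamp, pick the latest frame at or before it.
    frames_ns must be sorted ascending; returns None where no frame
    has arrived yet."""
    step = 1_000_000_000 // hz
    out = []
    for i in range(count):
        t = start_ns + i * step
        j = bisect.bisect_right(frames_ns, t) - 1
        out.append(frames_ns[j] if j >= 0 else None)
    return out
```

the same recording replayed at 15 hz on any device yields the same grid, which is what makes datasets device-independent.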
why this matters
this gives you:
- device-independent datasets
- stable cadence (15 hz, 30 hz)
- zero long-term drift
what ar frameworks actually give you
both arkit and arcore are doing slam:
- visual tracking (camera)
- inertial tracking (imu)
- map reconstruction
they output:
- pose (6dof)
- camera frame
- feature points
- depth (if available)
but:
these are not synchronized streams.
the real problem: multi-rate data
everything must align on:
timestamp, not frame index.
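as a sketch, aligning one stream to another by nearest timestamp (names and the tolerance are illustrative):

```python
import bisect

def align(frame_ts, imu_ts, tol_ns):
    """for each frame timestamp, find the nearest imu timestamp,
    or None if nothing lies within tol_ns. both lists sorted ascending."""
    out = []
    for t in frame_ts:
        i = bisect.bisect_left(imu_ts, t)
        cands = [imu_ts[j] for j in (i - 1, i) if 0 <= j < len(imu_ts)]
        best = min(cands, key=lambda u: abs(u - t), default=None)
        out.append(best if best is not None and abs(best - t) <= tol_ns else None)
    return out
```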
deep dive: mcap (why this was a game changer)
we moved from one ad-hoc file per stream to a single mcap container.
what mcap actually is
mcap is a container format for heterogeneous timestamped data (mcap.dev).
it's not an encoding; it wraps multiple streams into one file.
why it exists
before mcap:
- ros bags (hard to use outside ros)
- sqlite logs (not self-contained)
- custom formats (painful)
mcap solves this by being:
- self-contained
- multi-stream
- indexed
- append-only (Foxglove)
mcap mental model
each stream = topic
each entry = timestamped message
actual file structure (internal)
key concept: records
mcap is built from records:
- schema -> defines structure
- channel -> defines topic
- message -> actual data
- chunk -> batch of messages
- index -> fast lookup
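those record types can be modeled conceptually like this (a mental model only, not the exact byte layout):

```python
from dataclasses import dataclass, field

@dataclass
class Schema:
    id: int
    name: str           # e.g. "sensor_msgs/msg/Imu"
    encoding: str       # e.g. "ros2msg"
    data: bytes         # the schema text itself

@dataclass
class Channel:
    id: int
    schema_id: int          # points at a Schema
    topic: str              # e.g. "/imu"
    message_encoding: str   # e.g. "cdr"

@dataclass
class Message:
    channel_id: int         # points at a Channel
    log_time_ns: int
    data: bytes

@dataclass
class Chunk:
    messages: list[Message] = field(default_factory=list)
```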
why chunking matters
our system:
- android -> around 1 mb chunks
- ios -> around 512 kb
this gives:
- high write throughput
- fewer disk ops
- recoverable files
mcap's append-only design even allows recovery after crashes (Foxglove).
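a sketch of the chunking idea (flush callback and sizes are illustrative):

```python
class ChunkedWriter:
    """buffer serialized messages and flush them in large chunks
    (e.g. ~1 mb on android, ~512 kb on ios) to cut disk ops."""
    def __init__(self, flush, chunk_bytes=1_000_000):
        self._flush = flush              # callable taking list[bytes]
        self._chunk_bytes = chunk_bytes
        self._buf = []
        self._size = 0

    def write(self, msg: bytes):
        self._buf.append(msg)
        self._size += len(msg)
        if self._size >= self._chunk_bytes:
            self.close_chunk()

    def close_chunk(self):
        if self._buf:
            self._flush(self._buf)       # append-only: earlier chunks stay valid
            self._buf, self._size = [], 0
```

because closed chunks are never rewritten, a crash only costs you the unflushed tail.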
indexing (this is huge)
without an index, a reader has to scan the whole file to find anything.
with an index, it can jump straight to a topic and time range.
mcap supports:
- topic-based lookup
- timestamp-based seeking
- partial reads over network (Segments.ai)
this is critical when:
- files are 500 mb and larger
- data is remote
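a sketch of what an index buys you: binary search over (timestamp, offset) pairs instead of a full scan (shapes are illustrative):

```python
import bisect

def seek_range(index, start_ns, end_ns):
    """index: sorted list of (timestamp_ns, file_offset) entries for
    one topic. returns offsets whose timestamps fall in
    [start_ns, end_ns] without touching the rest of the file."""
    ts = [t for t, _ in index]
    lo = bisect.bisect_left(ts, start_ns)
    hi = bisect.bisect_right(ts, end_ns)
    return [off for _, off in index[lo:hi]]
```

with range requests, those offsets translate directly into partial reads over the network.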
serialization layer (ros2 + cdr)
our pipeline uses:
- ros2 message schemas
- cdr encoding
cdr (common data representation):
- binary serialization format used by dds and ros2
- ensures cross-language compatibility
so each message becomes a self-describing unit: a schema, a channel, and cdr-encoded bytes.
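as an illustration of the idea, packing a pose little-endian. this is not a spec-exact cdr encoder (real cdr adds an encapsulation header and alignment padding), just the flavor of it:

```python
import struct

def pack_pose(timestamp_ns, tx, ty, tz, qx, qy, qz, qw):
    # little-endian: one int64 timestamp, then translation + quaternion
    return struct.pack("<q7d", timestamp_ns, tx, ty, tz, qx, qy, qz, qw)

def unpack_pose(buf):
    return struct.unpack("<q7d", buf)
```

any language that can read the same layout gets the same values back, which is the cross-language point of cdr.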
ios vs android architecture
android (pull model)
arcore hands you the latest frame when you ask for it (`session.update()`), so your loop controls timing.
ios (push model)
arkit pushes frames at you through delegate callbacks (`session(_:didUpdate:)`), whether you are ready or not.
why this matters
| problem        | android | ios      |
| -------------- | ------- | -------- |
| timing control | easy    | hard     |
| buffering      | minimal | required |
| backpressure   | rare    | common   |
this is why:
- ios needs queues and backpressure handling
- android can stay simpler
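a sketch of push-model backpressure: a bounded queue that drops the oldest frame instead of growing without bound (sizes are illustrative):

```python
from collections import deque

class FrameQueue:
    """bounded queue for the push model: when frames arrive faster
    than the writer drains them, drop the oldest rather than run
    out of memory."""
    def __init__(self, maxlen=4):
        self._q = deque(maxlen=maxlen)  # deque evicts from the left when full
        self.dropped = 0                # observability: count what we lost

    def push(self, frame):
        if len(self._q) == self._q.maxlen:
            self.dropped += 1
        self._q.append(frame)

    def pop(self):
        return self._q.popleft() if self._q else None
```

dropping the oldest (not the newest) keeps latency bounded, which matters more than completeness for a live capture.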
upload architecture
naive approach
upload the whole file in one request, and start over on any failure.
actual system
split the file into chunks, upload them with retries, and record which chunks completed so an interrupted upload can resume.
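a sketch of a resumable chunked upload, where `put_part` stands in for whatever transport call you actually use:

```python
def upload_chunks(data, put_part, done_parts, chunk_bytes=5_000_000, retries=3):
    """split data into fixed-size parts, skip parts already in
    done_parts, retry each part a few times. put_part(index, payload)
    is a hypothetical transport call that raises OSError on failure."""
    parts = [data[i:i + chunk_bytes] for i in range(0, len(data), chunk_bytes)]
    for idx, part in enumerate(parts):
        if idx in done_parts:
            continue                      # resumability: skip finished parts
        for attempt in range(retries):
            try:
                put_part(idx, part)
                done_parts.add(idx)       # persist this set to survive restarts
                break
            except OSError:
                if attempt == retries - 1:
                    raise                 # give up after the final retry
    return done_parts
```

independent parts are also what makes parallel upload safe: any worker can own any incomplete index.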
this gives:
- resumability
- parallelism
- reliability
hardest problems (real ones)
these took the most time:
- imu and frame timestamp alignment
- tracking loss compensation
- deterministic sampling correctness
- writer backpressure
- storage exhaustion handling
these are invisible in demos, but they define production systems.
final takeaway
once you understand this:
- sampling becomes obvious
- mcap makes sense
- uploads become solvable
closing
this started as:
let's record ar
it became:
- real-time systems
- data engineering
- serialization design
- distributed uploads
and honestly, that's what made it worth building.