this started as "just record ar"
i thought this would be easy.
open ar session → capture frames → save → upload. done.
then reality hit:
- camera → around 30 fps
- imu → 100 to 200 hz
- depth → around 15 hz
- tracking randomly pauses
- ios and android behave differently
and suddenly nothing lines up.
the core realization
this is not a camera problem. this is a time synchronization problem across multiple asynchronous streams.
system architecture
here is the actual system we ended up building:
each layer has one job:
- native: capture raw frames + sensor data
- orchestrator: sampling and sync decisions
- writer: serialize into structured format
- upload: reliability and recovery
this separation is what made the system stable.
the core loop (this is everything)
this runs around 30 times per second:
val shouldSample = isTracking && frameSampler.shouldEncodeFrame(timestamp)
if (shouldSample) {
    val pose = frameProcessor.extractPose(camera)
    datasetWriter.writePoseRow(timestamp, pose, trackingState)
    val pointCloud = frameProcessor.extractPointCloudFrame(frame, timestamp)
    datasetWriter.writePointCloud(timestamp, pointCloud)
    val depth = frameProcessor.extractDepthImage(frame)
    datasetWriter.writeDepthFrame(timestamp, depth)
    val jpeg = eglManager.renderCameraToJpeg(...)
    datasetWriter.writeCompressedRgbFrame(timestamp, jpeg)
}
val imuSamples = imuCollector.drainSamples()
datasetWriter.writeImuSamples(imuSamples)
this loop is your ground truth generator.
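the imu runs several times faster than the frame loop, so something has to hold samples between iterations. here is a minimal sketch of what a collector with a `drainSamples()` call could look like. `ImuReading` and `ImuBuffer` are made-up names for illustration, not our actual classes.

```kotlin
// Hypothetical sample type: one accelerometer reading with its sensor timestamp.
data class ImuReading(val timestampNs: Long, val ax: Float, val ay: Float, val az: Float)

// Hypothetical collector: the sensor thread adds, the frame loop drains.
class ImuBuffer {
    private val lock = Any()
    private var pending = ArrayList<ImuReading>()

    /** Called from the sensor callback, typically at 100 to 200 Hz. */
    fun add(reading: ImuReading) {
        synchronized(lock) { pending.add(reading) }
    }

    /** Called from the frame loop; returns everything collected since the last drain. */
    fun drainSamples(): List<ImuReading> = synchronized(lock) {
        val drained = pending
        pending = ArrayList()  // swap in a fresh list so the lock is held only briefly
        drained
    }
}
```

every sample keeps its own timestamp, so it does not matter that samples are written in batches once per frame.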
deterministic sampling (the real unlock)
initially we did:
if a frame arrives → record it
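this breaks instantly: frames arrive at whatever rate the device manages, so the recorded cadence changes from device to device and from moment to moment.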
the correct model
elapsed = timestamp - sessionStart - trackingPauseOffset
expectedFrames = floor(elapsed / frameInterval) + 1
if expectedFrames > deterministicFrameIndex:
    deterministicFrameIndex = expectedFrames
    encode
else:
    skip
why this matters
Time →
│----│----│----│----│
t0   t1   t2   t3
Only sample at boundaries
this gives you:
- device-independent datasets
- stable cadence (15 hz, 30 hz)
- zero long-term drift
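the pseudocode above translates almost line for line into kotlin. this is a sketch, not our production sampler: the class name, the nanosecond units and the `addTrackingPause` helper are assumptions made for the example.

```kotlin
// Hypothetical sampler: decides from elapsed session time alone whether to encode a frame.
class DeterministicFrameSampler(
    private val sessionStartNs: Long,
    targetHz: Int
) {
    private val frameIntervalNs: Long = 1_000_000_000L / targetHz
    private var trackingPauseOffsetNs: Long = 0L

    var deterministicFrameIndex: Long = 0L
        private set

    /** Call when tracking resumes, with the length of the pause that just ended. */
    fun addTrackingPause(durationNs: Long) {
        trackingPauseOffsetNs += durationNs
    }

    fun shouldEncodeFrame(timestampNs: Long): Boolean {
        val elapsedNs = timestampNs - sessionStartNs - trackingPauseOffsetNs
        if (elapsedNs < 0) return false
        // Integer division equals floor() for non-negative values.
        val expectedFrames = elapsedNs / frameIntervalNs + 1
        if (expectedFrames > deterministicFrameIndex) {
            deterministicFrameIndex = expectedFrames
            return true   // crossed a sampling boundary: encode
        }
        return false      // still inside the current interval: skip
    }
}
```

with a 15 hz target and a camera delivering roughly 30 fps, every second frame crosses a boundary and gets encoded. because the index is recomputed from elapsed time on every call, a late or dropped frame does not shift the frames that follow.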
what ar frameworks actually give you
both arkit and arcore are doing slam:
- visual tracking (camera)
- inertial tracking (imu)
- map reconstruction
they output:
- pose (6dof)
- camera frame
- feature points
- depth (if available)
these are not synchronized streams.
the real problem: multi-rate data
everything must align on timestamp, not frame index.
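to make "align on timestamp" concrete, here is a small sketch that finds the imu sample closest in time to a camera frame. it assumes the samples are already sorted by timestamp. `ImuSample` and `nearestSampleIndex` are illustrative names.

```kotlin
// Hypothetical sample type; only the timestamp matters for alignment.
data class ImuSample(val timestampNs: Long, val gyroZ: Double)

/**
 * Returns the index of the sample whose timestamp is closest to targetNs.
 * Assumes samples is non-empty and sorted by timestamp.
 */
fun nearestSampleIndex(samples: List<ImuSample>, targetNs: Long): Int {
    require(samples.isNotEmpty()) { "no samples to align against" }
    // binarySearch returns the match index, or -(insertionPoint) - 1 when there is no exact match.
    val found = samples.binarySearch { it.timestampNs.compareTo(targetNs) }
    if (found >= 0) return found
    val insertion = -found - 1
    if (insertion == 0) return 0
    if (insertion == samples.size) return samples.size - 1
    val before = samples[insertion - 1]
    val after = samples[insertion]
    // Pick whichever neighbour is closer in time; ties go to the earlier sample.
    return if (targetNs - before.timestampNs <= after.timestampNs - targetNs) insertion - 1 else insertion
}
```

a frame index would be useless here: frame 10 on one device and frame 10 on another can be tens of milliseconds apart, while the timestamps stay comparable.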
deep dive: mcap (why this was a game changer)
we moved from a folder of loose files:
frames/
imu.csv
pose.csv
depth/
to a single container:
session.mcap
what mcap actually is
mcap is a container format for heterogeneous timestamped data (mcap.dev). it is not an encoding: it wraps multiple streams into one file and leaves the payload encoding to you.
why it exists
before mcap:
- ros bags (hard to use outside ros)
- sqlite logs (not self-contained)
- custom formats (painful)
mcap solves this by being:
- self-contained: everything in one file
- multi-stream: multiple topics, one container
- indexed: seek by topic or timestamp
- append-only: recoverable after crashes
mcap mental model
session.mcap
├── /camera/pose
├── /device/imu
├── /camera/rgb
├── /camera/depth
└── metadata
each stream = topic. each entry = timestamped message.
actual file structure (internal)
[Magic]
[Header]
[Data Section]
├── Chunk
│ ├── Message
│ ├── Message
│ └── ...
[Summary / Index]
[Footer]
[Magic]
key concept: records
mcap is built from records:
- schema → defines structure
- channel → defines topic
- message → actual data
- chunk → batch of messages
- index → fast lookup
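one way to picture how these records reference each other is a few plain kotlin classes. this is a simplified mental model, not the mcap library api: the class names are invented and the fields are a small subset of what the real records carry.

```kotlin
// Hypothetical, simplified models of three MCAP record types.
data class SchemaRecord(val id: Int, val name: String, val encoding: String)

data class ChannelRecord(
    val id: Int,
    val schemaId: Int,           // which schema describes this channel's messages
    val topic: String,           // e.g. "/device/imu"
    val messageEncoding: String  // e.g. "cdr"
)

class MessageRecord(
    val channelId: Int,          // which channel (and therefore topic) this belongs to
    val logTimeNs: Long,         // timestamp used for ordering and seeking
    val data: ByteArray          // opaque serialized payload
)

/** Resolves a message to its topic by following channelId. */
fun topicOf(message: MessageRecord, channels: Map<Int, ChannelRecord>): String =
    channels.getValue(message.channelId).topic
```

the point is the indirection: a message carries only a channel id and a timestamp, so topic names and schemas are stored once per stream instead of once per message.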
why chunking matters
Incoming Data
│
▼
Buffer (chunk)
│
├── flush → disk
└── continue
our system:
- android → around 1 mb chunks
- ios → around 512 kb
this gives:
- high write throughput
- fewer disk ops
- recoverable files
mcap's append-only design even allows recovery after crashes (Foxglove).
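the buffering idea itself is small. here is a rough sketch of a size-triggered chunk buffer, assuming a `flush` callback that writes one finished chunk to disk. `ChunkBuffer` is an illustrative name, and a real mcap writer also handles compression and chunk indexes, which are left out here.

```kotlin
import java.io.ByteArrayOutputStream

// Hypothetical buffer: accumulates serialized messages and hands them off in chunks.
class ChunkBuffer(
    private val maxChunkBytes: Int,
    private val flush: (ByteArray) -> Unit  // assumed to append one chunk to the file
) {
    private val buffer = ByteArrayOutputStream()

    fun append(message: ByteArray) {
        buffer.write(message, 0, message.size)
        // Flush once the threshold is reached, so disk writes are few and large.
        if (buffer.size() >= maxChunkBytes) flushNow()
    }

    /** Also call this when the session ends, so the tail of the data is not lost. */
    fun flushNow() {
        if (buffer.size() == 0) return
        flush(buffer.toByteArray())
        buffer.reset()
    }
}
```

the chunk size is a trade-off: larger chunks mean fewer disk operations, smaller chunks mean less unflushed data at risk if the app dies mid-session.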
indexing (this is huge)
without index
scan entire file → find data
with index
jump → read exact time range
mcap supports:
- topic-based lookup
- timestamp-based seeking
- partial reads over network (Segments.ai)
this is critical when files are 500 mb and larger, or when data is remote.
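as an illustration of why an index turns a scan into a jump, here is a toy time index that maps each chunk's start timestamp to its byte offset. it is a sketch of the idea only: `ChunkTimeIndex` is a made-up class, and mcap's real index records hold more than this.

```kotlin
import java.util.TreeMap

// Hypothetical index: chunk start time -> byte offset of that chunk in the file.
class ChunkTimeIndex {
    private val offsetsByStartNs = TreeMap<Long, Long>()

    fun addChunk(startNs: Long, fileOffset: Long) {
        offsetsByStartNs[startNs] = fileOffset
    }

    /** Offsets of the chunks that may contain messages in [fromNs, toNs]. */
    fun chunkOffsetsFor(fromNs: Long, toNs: Long): List<Long> {
        require(fromNs <= toNs) { "empty time range" }
        if (offsetsByStartNs.isEmpty()) return emptyList()
        // Begin at the chunk that starts at or before fromNs, since it may span fromNs.
        val firstKey = offsetsByStartNs.floorKey(fromNs) ?: offsetsByStartNs.firstKey()
        if (firstKey > toNs) return emptyList()
        return offsetsByStartNs.subMap(firstKey, true, toNs, true).values.toList()
    }
}
```

a reader that has these offsets can seek straight to the relevant chunks, or request just those byte ranges when the file lives on a server.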
serialization layer (ros2 + cdr)
our pipeline uses ros2 message schemas with cdr encoding.
cdr (common data representation) is a binary serialization format used by dds and ros2. it ensures cross-language compatibility.
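to give a feel for what a cdr payload looks like, here is a deliberately simplified sketch that lays out a pose (3 position doubles + 4 quaternion doubles) in little-endian order behind the 4-byte header that, as far as i know, ros2 puts in front of cdr payloads. treat it as an illustration under that assumption. `encodePoseCdr` is a hypothetical helper, and a real pipeline should use a proper cdr serializer that handles alignment, strings and sequences.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

/**
 * Hypothetical helper: encodes a pose as 7 little-endian float64 values.
 * Simplified CDR-style layout, valid here only because every field is a double,
 * so no padding is needed after the 4-byte header.
 */
fun encodePoseCdr(position: DoubleArray, orientation: DoubleArray): ByteArray {
    require(position.size == 3) { "position must be x, y, z" }
    require(orientation.size == 4) { "orientation must be x, y, z, w" }
    val buffer = ByteBuffer.allocate(4 + 7 * 8).order(ByteOrder.LITTLE_ENDIAN)
    // Assumed encapsulation header: 00 01 marks little-endian CDR, followed by two option bytes.
    buffer.put(byteArrayOf(0x00, 0x01, 0x00, 0x00))
    position.forEach { buffer.putDouble(it) }
    orientation.forEach { buffer.putDouble(it) }
    return buffer.array()
}
```

the result is 60 bytes with no field names inside. that is why the schema has to travel with the channel: without it the payload is just numbers.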
so each message becomes:
(topic, timestamp, binary payload)
ios vs android architecture
while (true) {
    val frame = session.update()
    process(frame)
}
we drive the loop ourselves, so timing control is straightforward.
func session(_ session: ARSession, didUpdate frame: ARFrame) {
    processingQueue.async {
        process(frame)
    }
}
the os pushes frames at us. we have to absorb whatever cadence ARKit decides on.
why this matters
| problem | android | ios |
|---|---|---|
| timing control | easy | hard |
| buffering | minimal | required |
| backpressure | rare | common |
ios needs queues and backpressure handling. android can stay simpler.
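backpressure here mostly means deciding what to drop when processing falls behind. below is a sketch of one possible policy, a bounded queue that discards the oldest frame when full. it is written in kotlin to match the other examples even though the problem shows up on ios. `LatestFrameQueue` is an invented name and this is not necessarily the policy we shipped.

```kotlin
// Hypothetical bounded queue: when full, the oldest frame is dropped to make room.
class LatestFrameQueue<T>(private val capacity: Int) {
    private val frames = ArrayDeque<T>()

    var droppedCount: Int = 0
        private set

    /** Called by the producer (the OS frame callback). Never blocks. */
    @Synchronized
    fun offer(frame: T) {
        if (frames.size == capacity) {
            frames.removeFirst()  // stale frame: processing fell behind
            droppedCount++
        }
        frames.addLast(frame)
    }

    /** Called by the processing loop; returns null when nothing is waiting. */
    @Synchronized
    fun poll(): T? = frames.removeFirstOrNull()
}
```

dropping frames is tolerable because sampling is decided by timestamp, not by arrival count. a dropped frame costs one sample at most and never shifts the ones after it.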
upload architecture
naive
file → upload → fail → restart from zero
actual
split into 30mb chunks → parallel upload → retry failed parts → complete
MCAP File
│
▼
Split (30MB chunks)
│
▼
Parallel Upload
│
▼
Retry Failed Parts
│
▼
Complete Upload
implementation idea
// part size is given in bytes (30 MB)
final parts = splitFile(file, 30 * 1024 * 1024);
await Future.wait(parts.map(uploadPart));
await completeMultipartUpload();
this gives us resumability, parallelism, and reliability.
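the snippet above leaves out the retry bookkeeping, so here is a sketch of that part in kotlin. it is sequential to stay short (the real thing uploads in parallel). `PartRange`, `partRanges` and `uploadAllParts` are hypothetical names, and `uploadPart` stands in for whatever call your storage backend provides.

```kotlin
// Hypothetical description of one part: its number and byte range within the file.
data class PartRange(val partNumber: Int, val offset: Long, val length: Long)

/** Splits a file of fileSize bytes into consecutive parts of at most partSize bytes. */
fun partRanges(fileSize: Long, partSize: Long): List<PartRange> {
    require(partSize > 0) { "part size must be positive" }
    val parts = ArrayList<PartRange>()
    var offset = 0L
    var number = 1
    while (offset < fileSize) {
        val length = minOf(partSize, fileSize - offset)
        parts.add(PartRange(number++, offset, length))
        offset += length
    }
    return parts
}

/**
 * Tries every part, then retries only the ones that failed, for up to maxRounds rounds.
 * uploadPart returns true on success. Returns the parts that never succeeded.
 */
fun uploadAllParts(
    parts: List<PartRange>,
    maxRounds: Int,
    uploadPart: (PartRange) -> Boolean
): List<PartRange> {
    var pending = parts
    repeat(maxRounds) {
        if (pending.isEmpty()) return pending
        pending = pending.filterNot(uploadPart)  // keep only the failures for the next round
    }
    return pending
}
```

only call the completion step when the returned list is empty. anything left in it is exactly what a resumed upload needs to send.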
hardest problems (real ones)
these took the most time:
these are invisible in demos. but they define production systems.
final takeaway
the core idea
AR Recording ≠ video capture.
AR Recording = a time-synchronized multi-stream system.
once you understand this:
- sampling becomes obvious
- mcap makes sense
- uploads become solvable
closing
this started as "let's record ar". it became:
- real-time systems
- data engineering
- serialization design
- distributed uploads
and honestly, that's what made it worth building.