Praxis

Turn human actions into training data for physical AI

Praxis converts real-world behavior into structured, machine-readable representations usable for robotics and imitation learning.

Turning video into structured action data
  • Observe: track hands and objects
  • Decompose: break the action into phases
  • Structure: produce structured action data
{
  "task": "Pick the battery and place it down.",
  "duration_sec": 3.65,
  "segments": [
    {
      "action": "reach",
      "language": "Right hand reaches for the battery.",
      "object": "battery",
      "right": {
        "grasp_type": "precision",
        "contact_state": "no_contact"
      },
      "t_start": 0.0, "t_end": 0.3
    },
    {
      "action": "contact",
      "language": "Right hand touches the battery.",
      "object": "battery",
      "right": {
        "grasp_type": "precision",
        "contact_state": "making_contact"
      },
      "t_start": 0.3, "t_end": 0.4
    },
    {
      "action": "grasp",
      "language": "Right hand picks up the battery.",
      "object": "battery",
      "right": {
        "grasp_type": "pinch",
        "contact_state": "making_contact"
      },
      "t_start": 0.4, "t_end": 0.6
    },
    {
      "action": "translate",
      "language": "Right hand slowly carries the battery.",
      "object": "battery",
      "right": {
        "contact_state": "in_contact"
      },
      "t_start": 0.6, "t_end": 2.6
    },
    {
      "action": "release",
      "language": "Right hand slowly places the battery.",
      "object": "battery",
      "right": {
        "contact_state": "breaking_contact"
      },
      "t_start": 2.6, "t_end": 3.5
    }
  ]
}

Every moment becomes structured action data.

Time-aligned sequences of actions and interactions.

Per-hand interaction signals, including grasp, contact, and timing.

The same pipeline runs across all episodes, producing consistent data ready for training.
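As a minimal sketch of what "consistent data" means in practice, an episode like the JSON above can be loaded and checked for time alignment before training. The helper name `load_episode` and the contiguity rule are assumptions for illustration, not Praxis's actual API:

```python
import json
from typing import Any

def load_episode(raw: str) -> dict[str, Any]:
    """Parse one episode and verify its segments are time-aligned.

    Hypothetical helper: assumes the schema shown above, where each
    segment carries `t_start`/`t_end` and segments are contiguous.
    """
    episode = json.loads(raw)
    segments = episode["segments"]
    for prev, curr in zip(segments, segments[1:]):
        # Each segment should begin exactly where the previous one ended.
        assert prev["t_end"] == curr["t_start"], "gap between segments"
    # The last segment must not run past the clip duration.
    assert segments[-1]["t_end"] <= episode["duration_sec"]
    return episode
```

Running the same validation over every episode is what makes a corpus of recordings usable as one training set.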

Why this matters

Robots don’t fail because they can’t see.

They fail because they don’t understand how actions unfold over time.

Vision isn’t the bottleneck. Representation is.

Most datasets reduce actions to:

  • frames
  • objects
  • outcomes

Real-world manipulation requires:

  • transitions
  • contact
  • adjustment
  • failure and recovery

Praxis captures how actions unfold—not just what happens, but how it happens over time.

Without this, manipulation doesn’t transfer reliably.

What Praxis enables

Praxis turns human behavior into training-ready representations for physical AI systems.

Train manipulation policies from structured, per-hand action sequences

Use structured action sequences instead of raw video to train more stable, reliable policies.
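One common way to consume such sequences is to rasterize them into per-frame supervision targets. The sketch below assumes the segment schema shown earlier; `frame_labels` and the `"idle"` fallback are illustrative names, not part of Praxis:

```python
def frame_labels(segments, fps=30, duration=None):
    """Return one action label per video frame.

    Illustrative sketch: maps time-stamped segments onto a fixed frame
    rate so each frame gets the action active at that instant.
    """
    duration = duration or max(s["t_end"] for s in segments)
    n_frames = int(round(duration * fps))
    labels = []
    for i in range(n_frames):
        t = i / fps
        # Find the segment covering time t; fall back to "idle" in gaps.
        label = next((s["action"] for s in segments
                      if s["t_start"] <= t < s["t_end"]), "idle")
        labels.append(label)
    return labels
```

A policy trained against labels like these sees the action structure of an episode, not just its pixels.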

Handle failure and recovery

Capture slips, retries, and adjustments, not just successful outcomes.

Learn from action sequences

Learn from how actions unfold over time, not just final states.

Transfer skills across environments

Generalize behaviors beyond the original recording setup.

What makes Praxis data different

Most datasets capture results. Praxis captures how actions unfold over time.

Per-hand structure, not just scenes

Each hand is modeled separately with role, grasp, and interaction context.

Failure is modeled, not filtered out

Captures slips, retries, and recovery with cause and outcome.

Contact is signal, not noise

Model how hands actually engage with objects: where contact happens, when it begins, and how it evolves over time.
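Because each segment carries a `contact_state`, the evolution of contact can be read back out as a per-hand timeline. A minimal sketch, using the field names from the example episode (`contact_timeline` is an assumed helper, not Praxis's API):

```python
def contact_timeline(segments, hand="right"):
    """List (t_start, contact_state) transitions for one hand.

    Hedged sketch: emits an entry each time the hand's contact state
    changes, skipping segments that repeat the previous state.
    """
    timeline = []
    last = None
    for seg in segments:
        state = seg.get(hand, {}).get("contact_state")
        if state is not None and state != last:
            timeline.append((seg["t_start"], state))
            last = state
    return timeline
```

For the battery episode above, this yields the no_contact → making_contact → in_contact → breaking_contact progression with its timestamps.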

Interaction, not just objects

Capture how objects are used: grasp points, contact patterns, and intent, not just object identity.

Fully traceable data

Every action is linked to its source—perception or human refinement.

Built for learning systems that need more than pixels and labels.

For robots that need to handle the real world.

We’re working with a small number of teams using structured action data to improve policy learning and real-world robustness.