iOS Custom Pose Estimation

Note

If you haven’t set up the SDK yet, make sure to go through the setup directions first. You’ll need to add the Core library to the app before using a specific feature API or custom model. Follow the iOS setup or Android setup directions.

Use a custom 2D pose model to estimate the position of object keypoints in images and video.

If you want to track the position of people in an image or video, see documentation for Human Pose Estimation.

1. Build the FritzVisionPoseModel

To create the pose estimation model, complete the following steps.

Add FritzVision to your Podfile

Include Fritz/Vision in your Podfile.

pod 'Fritz/Vision'

Make sure to run pod install to pick up the latest changes.

pod install

Add the model to your Xcode project

Download your trained pose estimation Core ML model from the Fritz webapp and drag the file into your Xcode project.

This will trigger Xcode to generate the Swift files necessary to interact with your model.

Define a Skeleton

Define a new enum that conforms to the SkeletonType protocol. Each case of the enum is the name and index of a keypoint matching the output of your model.

Additionally, specify the objectName property; this should be the name of the object the model detects.

For example, if your pose estimation model predicts the location of each finger tip on a hand, you might define a skeleton as follows:

public enum HandSkeleton: Int, SkeletonType {
  public static let objectName = "hand"

  case thumb
  case index
  case middle
  case ring
  case pinky
}
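Because the enum’s raw Int values are assigned implicitly (thumb is 0 through pinky at 4), make sure the order of the cases matches the keypoint order in your model’s output.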

Note

Model initialization

It’s important to initialize one instance of the model so that you are not loading the entire model into memory on each execution. Usually this is a property on a ViewController. When loading the model in a ViewController, the following approaches are recommended:

Lazy-load the model

By lazy-loading the model, you won’t load it until the first prediction. This avoids loading the model prematurely, but it may make the first prediction take slightly longer.

class MyViewController: UIViewController {
  lazy var model = FritzVisionHumanPoseModelFast()
}

Load model in viewDidLoad

By loading the model in viewDidLoad, you’ll ensure that you’re not loading the model before the view controller is loaded. The model will be ready to go for the first prediction.

class MyViewController: UIViewController {
  var model: FritzVisionHumanPoseModelFast!

  override func viewDidLoad() {
    super.viewDidLoad()
    model = FritzVisionHumanPoseModelFast()
  }
}

Alternatively, you can initialize the model property directly. However, if the ViewController is instantiated by a Storyboard and is the Initial View Controller, its properties will be initialized before the AppDelegate’s application(_:didFinishLaunchingWithOptions:) method runs. This can cause the app to crash if the model is loaded before FritzCore.configure() is called.
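For reference, here is a minimal sketch of configuring the SDK in the AppDelegate before any view controller initializes a model (assuming the standard setup from the iOS setup directions):

import UIKit
import Fritz

@UIApplicationMain
class AppDelegate: UIResponder, UIApplicationDelegate {

  func application(
    _ application: UIApplication,
    didFinishLaunchingWithOptions launchOptions: [UIApplication.LaunchOptionsKey: Any]?
  ) -> Bool {
    // Configure Fritz before any view controller loads a model.
    FritzCore.configure()
    return true
  }
}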

Create a custom FritzVisionPosePredictor<T> class

To leverage the built-in pre- and post-processing provided by Fritz AI, create a predictor for your model from the FritzVisionPosePredictor<T> base class.

The base class takes the skeleton defined in the previous step as its type parameter.

import Fritz

// Register your model with the Fritz SDK
extension hand_pose_model: SwiftIdentifiedModel {
  static let modelIdentifier = "your-model-id"
  static let packagedModelVersion = 1
}

// Create the predictor
let handPoseModel = FritzVisionPosePredictor<HandSkeleton>(
  model: hand_pose_model()
)

2. Create FritzVisionImage

FritzVisionImage supports different image formats.

  • Using a CMSampleBuffer

    If you are using a CMSampleBuffer from the built-in camera, first create the FritzVisionImage instance:

    Swift:

    let image = FritzVisionImage(buffer: sampleBuffer)

    Objective-C:

    FritzVisionImage *visionImage = [[FritzVisionImage alloc] initWithBuffer: sampleBuffer];
    // or
    FritzVisionImage *visionImage = [[FritzVisionImage alloc] initWithImage: uiImage];
    

    The image orientation data needs to be properly set for predictions to work. Use FritzVisionImageMetadata to customize the orientation for an image. By default, if you specify FritzVisionImageMetadata, the orientation will be .right:

    Swift:

    image.metadata = FritzVisionImageMetadata()
    image.metadata?.orientation = .left

    Objective-C:

    // Add metadata
    visionImage.metadata = [FritzVisionImageMetadata new];
    visionImage.metadata.orientation = FritzImageOrientationLeft;
    

    Note

    Data passed in from the camera will generally need the orientation set. When using a CMSampleBuffer to create a FritzVisionImage the orientation will change depending on which camera and device orientation you are using.

    When using the back camera in the portrait Device Orientation, the orientation should be .right (the default if you specify FritzVisionImageMetadata on the image). When using the front facing camera in portrait Device Orientation, the orientation should be .left.

    You can initialize the FritzImageOrientation with the AVCaptureConnection to infer orientation (if the Device Orientation is portrait):

    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        let image = FritzVisionImage(sampleBuffer: sampleBuffer, connection: connection)
        ...
    }
    
  • Using a UIImage

    If you are using a UIImage, create the FritzVisionImage instance:

    let image = FritzVisionImage(image: uiImage)
    

    The image orientation data needs to be properly set for predictions to work. Use FritzVisionImageMetadata to customize the orientation for an image:

    image.metadata = FritzVisionImageMetadata()
    image.metadata?.orientation = .right
    

    Note

    UIImage can have associated UIImageOrientation data (for example when capturing a photo from the camera). To make sure the model is correctly handling the orientation data, initialize the FritzImageOrientation with the image’s image orientation:

    image.metadata?.orientation = FritzImageOrientation(image.imageOrientation)
    

3. Run pose predictions

Configure Pose Prediction

Before running pose estimation, you can configure the prediction with a FritzVisionPoseModelOptions object.

Settings

  • imageCropAndScaleOption: .scaleFit (default). Crop and scale option for how to resize and crop the image for the model.

  • minPartThreshold: 0.50 (default). Minimum confidence score a part must have to be included in a pose.

  • minPoseThreshold: 0.50 (default). Minimum confidence score a pose must have to be included in the result.

  • smoothingOptions: OneEuroPointFilter.low (default). Pose smoothing options for predictions. By default, light smoothing is applied; for more details see Pose Smoothing. Setting this to nil will disable pose smoothing.

  • nmsRadius: 20 (default). Non-maximum suppression (NMS) distance for Part instances. Two parts suppress each other if they are less than nmsRadius pixels apart.

For example, to build a more lenient FritzVisionPoseModelOptions object:

let options = FritzVisionPoseModelOptions()
options.minPartThreshold = 0.3
options.minPoseThreshold = 0.3
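
As noted above, smoothing can be disabled entirely by setting smoothingOptions to nil. A minimal sketch, passing the options to the predictor at prediction time:

let options = FritzVisionPoseModelOptions()
options.smoothingOptions = nil  // Disable pose smoothing.

guard let poseResult = try? handPoseModel.predict(image, options: options) else { return }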

Run Pose Estimation Model

Use the handPoseModel predictor you created earlier to run predictions:

guard let poseResult = try? handPoseModel.predict(image),
  let pose = poseResult.pose()
  else { return }

// Overlay the pose on the input image.
let imageWithPose = image.draw(pose: pose)

4. Get information about poses

Once you have a FritzVisionPoseResult object you can either access the pose result directly or overlay poses on the input image.

Use the pose result directly

You can access the results and all detected keypoints from the FritzVisionPoseResult object. All results are by default normalized from 0 to 1, with (0, 0) in the top left of an up-oriented image.

// Created from model prediction.
let poseResult: FritzVisionPoseResult

let pose = poseResult.pose()

Each Pose has a [Keypoint] and a score. Here is an example using the keypoints to detect whether the thumb and index fingers are visible in a Pose predicted via the hand pose model described above.

guard let pose = poseResult.pose() else { return }

let fingers: [Keypoint<HandSkeleton>] = [
  pose.getKeypoint(for: .thumb),
  pose.getKeypoint(for: .index)
].compactMap { $0 }
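
Since keypoints are normalized, you can compare their coordinates directly. As a rough sketch (assuming each Keypoint exposes its normalized location as a position point, an assumption not shown in the snippets above), you could check whether the thumb and index fingertips are close enough to count as a pinch:

// Hypothetical pinch check; assumes Keypoint has a `position: CGPoint`.
if let thumb = pose.getKeypoint(for: .thumb),
   let index = pose.getKeypoint(for: .index) {
  let dx = thumb.position.x - index.position.x
  let dy = thumb.position.y - index.position.y
  // Distance in normalized image coordinates (0 to 1).
  let distance = (dx * dx + dy * dy).squareRoot()
  print("Pinching: \(distance < 0.1)")
}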

Overlay pose on input image

You can overlay the pose on the input image to get an idea of how the pose detection performs:

// Created from model prediction.
let poseResult: FritzVisionPoseResult
let pose = poseResult.pose()
let imageWithPose = image.draw(pose: pose)

Multi-pose Estimation

Note

Multi-pose estimation allows developers to track multiple objects in the same image.

Detect multiple objects

Follow the steps above in Run Pose Estimation Model to get a FritzVisionPoseResult.

Once you have a FritzVisionPoseResult<Skeleton> object you can either access the pose result directly or overlay poses on the input image.

Use the pose result directly

You can access multiple poses and all detected keypoints from the FritzVisionPoseResult<Skeleton> object.

// Created from model prediction.
let poseResult: FritzVisionPoseResult<HandSkeleton>
let poses = poseResult.poses(limit: 10)

Overlay poses on input image

You can overlay the poses on the input image to get an idea of how the pose detection performs:

// Created from model prediction.
let poseResult: FritzVisionPoseResult<HandSkeleton>
let poses = poseResult.poses()
let imageWithPose = image.draw(poses: poses)

Pose Smoothing

To help improve stability of predictions between frames, use the PoseSmoother class constrained to either the OneEuroPointFilter or SavitzkyGolayPointFilter filter classes.

1-Euro Filter

“The 1-Euro filter (“one Euro filter”) is a simple algorithm to filter noisy signals for high precision and responsiveness. It uses a first order low-pass filter with an adaptive cutoff frequency: at low speeds, a low cutoff stabilizes the signal by reducing jitter, but as speed increases, the cutoff is increased to reduce lag.”

- 1-Euro point filter Paper

The 1-Euro filter runs in real-time with parameters minCutoff and beta which control the amount of lag and jitter.

Parameters

  • minCutoff: 1.0 (default). Minimum frequency cutoff. Lower values will decrease jitter but increase lag.

  • beta: 0.0 (default). Higher values of beta will help reduce lag, but may increase jitter.

  • derivateCutoff: 1.0 (default). Max derivative value allowed. Increasing will allow more sudden movements.

To get a better understanding of how different parameter values affect the results, we recommend trying out the 1-Euro Filter Demo.

let poseSmoother = PoseSmoother<OneEuroPointFilter, HandSkeleton>(
  options: .init(minCutoff: 1.0, beta: 0.0)
)

func smooth(pose: Pose<HandSkeleton>) -> Pose<HandSkeleton> {
    return poseSmoother.smooth(pose)
}
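
Putting it together, you can smooth each frame’s predicted pose before drawing it, reusing the handPoseModel predictor and the smooth(pose:) helper above:

guard let poseResult = try? handPoseModel.predict(image),
  let pose = poseResult.pose()
  else { return }

// Smooth the raw prediction, then overlay it on the input image.
let smoothedPose = smooth(pose: pose)
let imageWithPose = image.draw(pose: smoothedPose)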

Savitzky-Golay Filter

A Savitzky–Golay filter is a digital filter that can be applied to a set of digital data points for the purpose of smoothing the data, that is, to increase the precision of the data without distorting the signal tendency. This is achieved, in a process known as convolution, by fitting successive sub-sets of adjacent data points with a low-degree polynomial by the method of linear least squares.

- Savitzky-Golay wiki

The Savitzky-Golay filter essentially fits a polynomial to a window of data and uses that to smooth data points. The size of the buffer will determine the lag using the filter. If you want to minimize lag, we recommend using the 1-Euro filter.

Parameters

  • leftScan: 2 (default). Number of data points in the window to look back to approximate the polynomial.

  • rightScan: 2 (default). Number of data points in the window to look forward to approximate the polynomial.

  • polynomialOrder: 2 (default). Order of the polynomial to approximate.

let poseSmoother = PoseSmoother<SavitzkyGolayPointFilter, HandSkeleton>(
  options: .init()
)

func smooth(pose: Pose<HandSkeleton>) -> Pose<HandSkeleton> {
    return poseSmoother.smooth(pose)
}

5. Use the record method on the predictor to collect data

The FritzVisionPosePredictor used to make predictions has a record method allowing you to send an image, a model-predicted annotation, and a user-generated annotation back to your Fritz AI account.

guard let result = try? handPoseModel.predict(image, options: options),
  let pose = result.pose() else { return }

// Implement your own custom UX for users to label an image and create a Pose
// object called modifiedPose
handPoseModel.record(image, predicted: pose, modified: modifiedPose)