2025-04-24
Deep Learning

Open-YOLO 3D

3D Point Cloud

Introduction

Open-vocabulary 3D instance segmentation is crucial for applications like robotics and augmented reality, where systems must identify and interact with previously unseen objects. Existing methods either rely on closed-set models or lift features from computationally expensive 2D foundation models such as SAM and CLIP into 3D space, resulting in slow inference, typically 5 to 10 minutes per scene, which hinders real-time deployment.

To address this challenge, the proposed method, Open-YOLO 3D, introduces a faster and more efficient approach by leveraging 2D object detectors instead of segmentation-heavy foundation models. It takes bounding-box predictions from an open-vocabulary 2D detector across multiple RGB frames and combines them with class-agnostic 3D instance masks. By constructing low-granularity label maps and projecting the 3D point cloud onto them using the camera parameters, Open-YOLO 3D significantly reduces computational cost while maintaining high segmentation accuracy, enabling practical deployment in real-world robotics tasks like material handling and inventory management.

Figure 1: Open-world 3D instance segmentation pipeline (Open-YOLO 3D)

Method

1. Instance Proposal Generation

A 3D instance segmentation network generates binary mask proposals over the input point cloud, with each mask corresponding to a potential object instance.
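As a rough mental model, the proposals can be stored as a boolean matrix over the point cloud, one row per candidate instance; the shapes and names below are illustrative, not the paper's actual API.

import torch

num_points, num_proposals = 100_000, 150  # hypothetical scene size / proposal count

# Each row is one binary, class-agnostic instance mask over the full point cloud.
proposals = torch.zeros(num_proposals, num_points, dtype=torch.bool)
proposals[0, 10:500] = True  # e.g. proposal 0 covers points 10..499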

2. 2D Detection & Label Map Construction

An open-vocabulary 2D object detector predicts bounding boxes and class labels for each RGB frame. These are used to build low-granularity label maps, marking object regions in each image.
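A minimal sketch of how such a label map could be built, assuming boxes are painted largest-first so that smaller (likely foreground) boxes overwrite larger ones; the function and variable names are hypothetical.

import numpy as np

def build_label_map(boxes, class_ids, height, width):
    """boxes: (K, 4) array of [x1, y1, x2, y2]; class_ids: (K,) ints."""
    label_map = np.zeros((height, width), dtype=np.int32)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    for k in np.argsort(-areas):                    # paint largest boxes first
        x1, y1, x2, y2 = boxes[k].astype(int)
        label_map[y1:y2, x1:x2] = class_ids[k] + 1  # reserve 0 for background
    return label_map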

3. 3D-to-2D Projection

All 3D points are projected onto the 2D frames using the intrinsic and extrinsic camera parameters, yielding Nf projections per point, one for each of the Nf frames.
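A standard pinhole-projection sketch, assuming world-frame points, a 3x3 intrinsic matrix K, and a 4x4 world-to-camera extrinsic matrix (points behind the camera must be filtered out downstream).

import numpy as np

def project_points(points_world, K, world_to_cam):
    """points_world: (N, 3) -> (N, 2) pixel coords and (N,) camera-frame depths."""
    ones = np.ones((points_world.shape[0], 1))
    # Homogeneous world coords -> camera coords.
    pts_cam = (world_to_cam @ np.hstack([points_world, ones]).T).T[:, :3]
    depths = pts_cam[:, 2]
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / depths[:, None]  # perspective division; mask depths <= 0 later
    return uv, depths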

4. Accelerated Visibility Computation (VACc)

A fast visibility computation determines where each 3D mask is clearly visible across frames. The top-k most visible projections are selected for each instance.
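A hedged sketch of the idea (not the paper's exact accelerated implementation): count, per mask and per frame, how many of the mask's points project inside the image with positive depth, then keep the k frames with the highest counts.

import torch

def topk_visible_frames(uv, depths, proposals, height, width, k=5):
    """uv: (F, N, 2) pixels, depths: (F, N), proposals: (M, N) bool -> (M, k) frame indices."""
    inside = ((uv[..., 0] >= 0) & (uv[..., 0] < width) &
              (uv[..., 1] >= 0) & (uv[..., 1] < height) &
              (depths > 0))                        # (F, N) per-frame visibility
    counts = proposals.float() @ inside.float().T  # (M, F) visible-point counts
    return counts.topk(k, dim=1).indices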

5. Per-Point Label Assignment

Using (x, y) coordinates from the top-k visible projections, per-point labels are retrieved from the corresponding label maps, filtering out occluded or out-of-frame points.
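An illustrative lookup for a single frame, consistent with the sketches above (label 0 is background; the names are hypothetical):

import torch

def lookup_labels(uv, depths, label_map):
    """uv: (N, 2) float pixels, depths: (N,), label_map: (H, W) int -> (N,) labels."""
    h, w = label_map.shape
    x = uv[:, 0].long().clamp(0, w - 1)
    y = uv[:, 1].long().clamp(0, h - 1)
    # Occluded, behind-camera, or out-of-frame points fall back to background.
    valid = ((uv[:, 0] >= 0) & (uv[:, 0] < w) &
             (uv[:, 1] >= 0) & (uv[:, 1] < h) & (depths > 0))
    return torch.where(valid, label_map[y, x], torch.zeros_like(x))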

6. Multi-View Prompt Distribution

Per-point labels from multiple views are aggregated to create a Multi-View Prompt Distribution, allowing assignment of a final prompt ID (class) to each 3D instance mask.
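A minimal voting sketch under the same assumptions: pool the labels of a mask's points over its top-k views and take the most frequent non-background label as the mask's prompt ID.

import torch

def assign_prompt_ids(per_view_labels, proposals, num_classes):
    """per_view_labels: (k, N) int, proposals: (M, N) bool -> (M,) prompt/class ids."""
    prompt_ids = torch.zeros(proposals.shape[0], dtype=torch.long)
    for m in range(proposals.shape[0]):
        votes = per_view_labels[:, proposals[m]].flatten()
        votes = votes[votes > 0]                  # drop background/occluded points
        if votes.numel() > 0:
            hist = torch.bincount(votes, minlength=num_classes + 1)
            prompt_ids[m] = hist.argmax()         # most-voted prompt wins
    return prompt_ids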

Why It Matters

  • Replaces heavy foundation models (SAM, CLIP) with fast 2D object detectors.
  • Achieves up to 16× faster inference than prior methods.
  • Suitable for real-world robotics scenarios like inventory management and material handling.
  • Maintains competitive accuracy while enabling real-time deployment.

Evaluation
Figure 2: Open-YOLO 3D evaluation results

Code

import os.path as osp

import torch
from tqdm import tqdm

# Repo-local helpers; the exact module paths are assumed from the Open-YOLO 3D codebase.
from utils import load_yaml, SCENE_NAMES_REPLICA, SCENE_NAMES_SCANNET200
from evaluator import InstSegEvaluator
from openyolo3d import OpenYolo3D


def test(dataset_type, path_to_3d_masks, is_gt):
    # Load the dataset-specific config and resolve the data/ground-truth paths.
    config = load_yaml(osp.join(f'./pretrained/config_{dataset_type}.yaml'))
    path_2_dataset = osp.join('./data', dataset_type)
    gt_dir = osp.join('./data', dataset_type, 'ground_truth')
    depth_scale = config["openyolo3d"]["depth_scale"]

    # Pick the scene list and input representation for the chosen dataset.
    if dataset_type == "replica":
        scene_names = SCENE_NAMES_REPLICA
        datatype = "point cloud"
    elif dataset_type == "scannet200":
        scene_names = SCENE_NAMES_SCANNET200
        datatype = "mesh"
    else:
        raise ValueError(f"Unsupported dataset type: {dataset_type}")

    evaluator = InstSegEvaluator(dataset_type)
    openyolo3d = OpenYolo3D(f"./pretrained/config_{dataset_type}.yaml")

    # Run the full pipeline once per scene and collect the predictions.
    predictions = {}
    for scene_name in tqdm(scene_names):
        scene_id = scene_name.replace("scene", "")
        processed_file = (osp.join(path_2_dataset, scene_name, f"{scene_id}.npy")
                          if dataset_type == "scannet200" else None)
        prediction = openyolo3d.predict(path_2_scene_data=osp.join(path_2_dataset, scene_name),
                                        depth_scale=depth_scale,
                                        datatype=datatype,
                                        processed_scene=processed_file,
                                        path_to_3d_masks=path_to_3d_masks,
                                        is_gt=is_gt)
        predictions.update(prediction)

    # Reshape the predictions into the evaluator's expected format; every mask
    # gets a unit score since proposals are not ranked here.
    preds = {}
    print("Evaluation ...")
    for scene_name in tqdm(scene_names):
        preds[scene_name] = {
            'pred_masks': predictions[scene_name][0].cpu().numpy(),
            'pred_scores': torch.ones_like(predictions[scene_name][2]).cpu().numpy(),
            'pred_classes': predictions[scene_name][1].cpu().numpy()}

    inst_AP = evaluator.evaluate_full(preds, gt_dir, dataset=dataset_type)
    return inst_AP
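
A minimal invocation under the directory layout assumed above (the mask path below is hypothetical):

test(dataset_type="scannet200",
     path_to_3d_masks="./pretrained/class_agnostic_masks",
     is_gt=False)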

Related Topics

Several approaches have tackled open-vocabulary semantic and instance segmentation by leveraging foundation models like CLIP for unknown class discovery. OpenScene lifts 2D CLIP features into 3D for semantic segmentation, while ConceptGraphs builds open-vocabulary scene graphs for broader tasks like object grounding and navigation. OpenMask3D focuses on 3D instance segmentation using class-agnostic masks combined with SAM and CLIP features, whereas some methods avoid foundation models altogether, relying instead on weak supervision.

  • Open-vocabulary semantic segmentation (OVSS): Uses CLIP to align pixel features with text embeddings for zero-shot segmentation.
  • AttrSeg: Decomposes class names into attribute phrases, then aggregates them into class representations.
  • Open-vocabulary instance segmentation (OVIS): Predicts masks for novel objects using:
    • Cross-modal pseudo-labeling with teacher-student models.
    • Annotation-free vision-language supervision at box/pixel level.