
Lesson 4.1: What is DataGradients?

The Importance of Computer Vision Dataset Profiling

In computer vision, your model’s strength is directly tied to the quality of your training data. Spotting issues in your dataset early helps you avoid training problems and can explain model underperformance. It is easy to overlook how the compatibility between your data and your model design influences the model’s effectiveness. Specific characteristics of your dataset, such as many small objects or a few large ones, can drive certain design decisions. Understanding these attributes helps you make the right choices, from selecting the model, to choosing the loss function, to picking the optimization method.

Data Visualization Libraries and Generic EDA Tools Fall Short

Most dataset profiling tools are data visualization libraries or general-purpose statistical analysis tools that work well for tabular data, but fall short when applied to image data.

Image data has distinct features that set it apart from tabular data. Images are collections of pixels organized in multiple color channels. This makes it challenging for generic EDA tools to give you meaningful insight.

But your concerns go beyond just pixels.

In deep learning, you also have labels and annotations. Masks and bounding boxes come in many formats: the label may come first or last, and box coordinates may be given as xyxy, xywh, or cxcywh. Images can exhibit various color schemes: bright, dark, gray, or colorful. Objects in images can take various shapes: wide, narrow, small, or large. Objects can be convex or concave, and bounding boxes may intersect. Classes can be balanced or imbalanced, and may or may not be evenly distributed across images.
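To make the box formats mentioned above concrete, here is a minimal sketch of converting between them. It assumes the common conventions that xyxy means (x_min, y_min, x_max, y_max), xywh means (x_min, y_min, width, height), and cxcywh means (center_x, center_y, width, height). The function names are illustrative and are not part of DataGradients.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def xywh_to_xyxy(box: Box) -> Box:
    """(x_min, y_min, width, height) -> (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def cxcywh_to_xyxy(box: Box) -> Box:
    """(center_x, center_y, width, height) -> (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def xyxy_to_cxcywh(box: Box) -> Box:
    """(x_min, y_min, x_max, y_max) -> (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


# The same box expressed in the three formats.
box_xywh = (10.0, 20.0, 30.0, 40.0)
box_xyxy = xywh_to_xyxy(box_xywh)      # (10.0, 20.0, 40.0, 60.0)
box_cxcywh = xyxy_to_cxcywh(box_xyxy)  # (25.0, 40.0, 30.0, 40.0)
```

Mixing up these conventions silently produces boxes in the wrong place or with the wrong size, which is one reason a profiling step that visualizes annotations is useful.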

Generic EDA tools and data visualization libraries can’t take image data and quickly summarize it with simple statistical plots like histograms or scatter plots.

Fortunately, there is a tool that can – DataGradients!

DataGradients is an invaluable tool for computer vision practitioners. It automatically extracts features from datasets, such as image-level evaluations, class distribution statistics, and heatmap generation, and presents them in a user-friendly report. Its statistical analysis of your computer vision dataset focuses on common data problems, pitfalls, and general characteristics that may affect the model design or the training process. 

Key Dataset Features

DataGradients analyzes your object detection or semantic segmentation dataset and delivers insights about:

  • The nature of the objects depicted (convexity, fine details)
  • The size distribution of the objects (segments or bounding boxes)
  • Class distribution 
  • Image brightness and color distribution
  • Image aspect ratios and resolution

Common Dataset Issues

As we saw in Units 1-3, having insights about these key dataset features can help identify various common dataset issues. These include:

1. Corrupted Data

Extreme brightness values – unusual brightness levels might indicate image corruption
Anomalous channel statistics – unexpected per-channel mean and standard deviation values can flag corrupted data
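As a rough illustration of the two checks above, the sketch below computes a simple brightness value and per-channel statistics for an image, then flags images that look suspicious. It is not how DataGradients implements these features. The flattened list-of-RGB-tuples image representation and the thresholds (`low`, `high`, `min_std`) are assumptions chosen for the example.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def channel_stats(pixels: List[Pixel]) -> Dict[str, Tuple[float, float]]:
    """Per-channel (mean, standard deviation) of a flattened RGB image."""
    stats = {}
    for index, name in enumerate("RGB"):
        values = [pixel[index] for pixel in pixels]
        stats[name] = (mean(values), pstdev(values))
    return stats


def brightness(pixels: List[Pixel]) -> float:
    """Mean of the per-pixel channel average, a simple brightness proxy."""
    return mean(sum(pixel) / 3 for pixel in pixels)


def is_suspicious(pixels: List[Pixel], low: float = 5.0,
                  high: float = 250.0, min_std: float = 1.0) -> bool:
    """Flag near-black or near-white images and images with a flat channel."""
    value = brightness(pixels)
    flat_channel = any(std < min_std for _, std in channel_stats(pixels).values())
    return value < low or value > high or flat_channel


# Hypothetical 2x2 images, flattened to lists of pixels.
normal_image = [(10, 200, 30), (120, 80, 240), (60, 60, 60), (200, 150, 100)]
black_image = [(0, 0, 0)] * 4

flags = [is_suspicious(normal_image), is_suspicious(black_image)]
```

In practice the thresholds would be set relative to the statistics of the whole dataset rather than fixed in advance.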

2. Labeling Errors

Unusual object areas – object areas that are much smaller or larger than expected for a particular class might suggest labeling errors. For instance, cats that are on average much bigger than cars would be a warning sign.

Object location anomalies – if objects of a particular class are consistently found in unlikely locations, this might indicate a labeling mistake. For instance, a sky class that usually appears in the lower part of the image would be suspicious.
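The two labeling checks above can be sketched with a per-class summary of box area and vertical position. This is only an illustration under stated assumptions: the annotations are hypothetical, boxes are in normalized xyxy format with y increasing downward, and `per_class_summary` is an illustrative helper rather than a DataGradients function.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical annotations: (class_name, x_min, y_min, x_max, y_max),
# with coordinates normalized to [0, 1] and y = 0 at the top of the image.
annotations = [
    ("cat", 0.10, 0.60, 0.20, 0.70),
    ("cat", 0.40, 0.55, 0.52, 0.68),
    ("car", 0.05, 0.40, 0.55, 0.80),
    ("car", 0.30, 0.35, 0.90, 0.85),
    ("sky", 0.00, 0.00, 1.00, 0.40),
]


def per_class_summary(boxes):
    """Mean box area and mean vertical center for each class."""
    areas = defaultdict(list)
    centers_y = defaultdict(list)
    for name, x1, y1, x2, y2 in boxes:
        areas[name].append((x2 - x1) * (y2 - y1))
        centers_y[name].append((y1 + y2) / 2)
    return {
        name: {
            "mean_area": mean(areas[name]),
            "mean_center_y": mean(centers_y[name]),
        }
        for name in areas
    }


summary = per_class_summary(annotations)

# Sanity checks that encode expectations about the classes.
cats_smaller_than_cars = summary["cat"]["mean_area"] < summary["car"]["mean_area"]
sky_in_upper_half = summary["sky"]["mean_center_y"] < 0.5
```

If either check failed on a real dataset, the annotations for that class would be worth inspecting by hand.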

3. Faulty Augmentations

Unstable objects post-augmentation – if augmented data consistently produces objects whose distribution is too far from that of the original data, this might indicate a bad augmentation. Shifting the distribution slightly is good for robustness, but only up to a point.

4. Disparities Between the Training Set and the Validation Set

Class distribution disparities – a common mistake is having a class that is underrepresented in the training set but not in the validation set, which severely limits the model’s ability to learn that class.
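A class distribution disparity of this kind can be checked by comparing the relative frequency of each class in the two splits. The sketch below uses made-up label lists, and the helper names are illustrative assumptions, not a DataGradients API.

```python
from collections import Counter


def class_frequencies(labels):
    """Relative frequency of each class in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def frequency_gaps(train_labels, val_labels):
    """Validation frequency minus training frequency, per class."""
    train = class_frequencies(train_labels)
    val = class_frequencies(val_labels)
    classes = sorted(set(train) | set(val))
    return {name: val.get(name, 0.0) - train.get(name, 0.0) for name in classes}


# Hypothetical splits: bicycles are rare in training but common in validation.
train_labels = ["car"] * 90 + ["bicycle"] * 10
val_labels = ["car"] * 50 + ["bicycle"] * 50

gaps = frequency_gaps(train_labels, val_labels)
```

A large positive gap, as for `bicycle` here, means the class is far more common in validation than in training, which is the situation described above.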

Brightness and color distribution disparities – differences in image brightness or color distribution may indicate that the training and validation sets were captured under different conditions. For instance, if most training images were taken in bright daylight while most validation images were captured in low light, the model could perform poorly on the validation set.

Using DataGradients’s Insights for Better Model Design

You can also use DataGradients’s insights to inform your model design choices. We will further explore the relationship between dataset characteristics and model design in Unit 5. But for now, we can present it in broad strokes:

1. The Impact of Object Size Distribution on Model Design 

When training a model, it is essential to determine whether your data consists of numerous small objects or just a few large objects in each image. This information can impact your decisions about skip connections, downscaling, receptive field, and model depth. A common pitfall is discarding a model’s initial skip connections because they are inefficient, even though the data needs the high-frequency information those connections carry.
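One simple way to answer the question above is to bucket objects by the fraction of the image they cover and to count objects per image. The sketch below does this for a hypothetical dataset. The `small` and `large` thresholds are arbitrary assumptions for illustration, not values used by DataGradients.

```python
from collections import Counter


def size_bucket(box_area, image_area, small=0.01, large=0.25):
    """Classify an object by the fraction of the image it covers."""
    ratio = box_area / image_area
    if ratio < small:
        return "small"
    if ratio > large:
        return "large"
    return "medium"


# Hypothetical dataset: (image_width, image_height, [(box_width, box_height), ...])
images = [
    (640, 480, [(20, 20), (30, 25), (15, 40)]),
    (640, 480, [(400, 300)]),
    (1280, 720, [(50, 40), (200, 150)]),
]

buckets = Counter()
for width, height, boxes in images:
    for box_w, box_h in boxes:
        buckets[size_bucket(box_w * box_h, width * height)] += 1

objects_per_image = sum(len(boxes) for _, _, boxes in images) / len(images)
```

A dataset dominated by the small bucket suggests being cautious about aggressive downscaling and about removing early, high-resolution skip connections.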

2. The Impact of Object Characteristics on Model Design

Consider factors such as the convexity and fine details of the segments in your data. Once again, a typical mistake is eliminating a model’s inefficient initial skip connections when your data includes segments with intricate details that require high-frequency, high-resolution information.

The Benefits of Using DataGradients

To conclude this introductory lesson: DataGradients, Deci’s computer vision dataset profiler, is a practical and user-friendly solution for evaluating your computer vision datasets. It offers detailed analysis features that can significantly aid data scientists and researchers in understanding their dataset’s quality and characteristics while keeping the data private, secure, and on-premises.
