	---
	title: Multimodal Product Classification
	emoji: 📈
	colorFrom: purple
	colorTo: yellow
	sdk: gradio
	sdk_version: 5.44.0
	app_file: app.py
	pinned: true
	license: mit
	short_description: Product classification using image and text
	---

	# 🛍️ Multimodal Product Classification with Gradio

	## Table of Contents

	1. [Project Description](#1-project-description)
	2. [Methodology & Key Features](#2-methodology--key-features)
	3. [Technology Stack](#3-technology-stack)
	4. [Model Details](#4-model-details)

	## 1. Project Description

	This project implements a multimodal product classification system for Best Buy products. The core objective is to categorize products using both their text descriptions and images. The system was trained on a dataset of almost 50,000 items.

	The entire system is deployed as a lightweight web application built with Gradio. The app allows users to:

	- Use both text and an image for the most accurate prediction.
	- Run predictions using only text or only an image to understand the contribution of each data modality.
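
The three input modes can be sketched as a small routing function plus a Gradio wrapper. This is a minimal illustration, not the code in `app.py`: the names `select_mode` and `build_demo`, the widget labels, and the top-3 output are all assumptions made for the example.

```python
from typing import Optional


def select_mode(text: Optional[str], image: Optional[object]) -> str:
    """Pick a prediction mode from whichever inputs the user supplied."""
    has_text = bool(text and text.strip())
    has_image = image is not None
    if has_text and has_image:
        return "multimodal"
    if has_text:
        return "text"
    if has_image:
        return "image"
    raise ValueError("Provide a text description, an image, or both.")


def build_demo(predict_fn):
    """Wrap a prediction callable in a Gradio interface (not launched here).

    `predict_fn(text, image)` is a hypothetical callable that returns a
    dict mapping category names to confidences.
    """
    import gradio as gr  # imported lazily so select_mode works without Gradio

    return gr.Interface(
        fn=predict_fn,
        inputs=[
            gr.Textbox(label="Product description"),
            gr.Image(type="pil", label="Product image"),
        ],
        outputs=gr.Label(num_top_classes=3),
    )
```

Gradio passes an empty string for a blank textbox and `None` for a missing image, which is why `select_mode` checks both conditions.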

	This project shows how text and image data can be combined in a single classification system, and how much each modality contributes on its own.

	> [!IMPORTANT]
	>
	> - Check out the deployed app here: 👉️ [Multimodal Product Classification App](https://huggingface.co/spaces/iBrokeTheCode/Multimodal_Product_Classification) 👈️
	> - Check out the Jupyter Notebook for a detailed walkthrough of the project here: 👉️ [Jupyter Notebook](https://huggingface.co/spaces/iBrokeTheCode/Multimodal_Product_Classification/blob/main/notebook_guide.ipynb) 👈️

	![App](./assets/app-demo.jpg)

	## 2. Methodology & Key Features

	- Core Task: Multimodal Product Classification on a Best Buy dataset.
	- Pipeline:
	  - Data: A dataset of ~50,000 products, each with a text description and an image.
	  - Feature Extraction: Pre-trained models are used to convert raw text and image data into high-dimensional embedding vectors.
	  - Classification: A custom-trained Multilayer Perceptron (MLP) model performs the final classification based on the embeddings.
	- Key Features:
	  - Multimodal: Combines text and image data for a more accurate prediction.
	  - Single-Service Deployment: The entire application runs as a single, deployable Gradio app.
	  - Flexible Inputs: The app supports multimodal, text-only, and image-only prediction modes.
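
The feature-extraction step of the pipeline above might look roughly like the sketch below. It uses the two pre-trained models named in this README, but several details are assumptions: the `facebook/` prefix on the Hub checkpoint id, the use of the pooled ConvNeXt V2 output, and simple concatenation as the way the two embeddings are combined. The helper names are hypothetical.

```python
from typing import Sequence

TEXT_DIM = 384   # embedding size of all-MiniLM-L6-v2
IMAGE_DIM = 768  # pooled output size of ConvNeXt V2 tiny


def embed_text(description: str):
    """Return a 384-dimensional sentence embedding for one description."""
    from sentence_transformers import SentenceTransformer

    # Loaded per call for brevity; a real app would load the model once.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(description)


def embed_image(image):
    """Return a 768-dimensional pooled embedding for one PIL image."""
    from transformers import AutoImageProcessor, TFConvNextV2Model

    checkpoint = "facebook/convnextv2-tiny-22k-224"  # assumed Hub id
    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = TFConvNextV2Model.from_pretrained(checkpoint)
    inputs = processor(images=image, return_tensors="tf")
    return model(**inputs).pooler_output[0].numpy()


def fuse(text_vec: Sequence[float], image_vec: Sequence[float]) -> list:
    """Concatenate both embeddings into one classifier input (assumed fusion)."""
    if len(text_vec) != TEXT_DIM or len(image_vec) != IMAGE_DIM:
        raise ValueError("unexpected embedding size")
    return list(text_vec) + list(image_vec)
```

Under these assumptions the combined input has 384 + 768 = 1152 features.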

	## 3. Technology Stack

	This project was built using the following technologies:

	Deployment & Hosting:

	- [Gradio](https://gradio.app/) – interactive web app frontend.
	- [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces) – for cost-effective deployment.

	Modeling & Training:

	- [TensorFlow / Keras](https://www.tensorflow.org/) – used to train the final MLP classification model.
	- [Sentence-Transformers](https://www.sbert.net/) – for generating text embeddings.
	- [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) – for the image feature extractor (`TFConvNextV2Model`).

	Development Tools:

	- [Ruff](https://github.com/charliermarsh/ruff) – Python linter and formatter.
	- [uv](https://github.com/astral-sh/uv) – fast Python package installer and resolver.

	## 4. Model Details

	The final classification is performed by a custom-trained Multilayer Perceptron (MLP) model that takes the extracted embeddings as input.

	- Text Embedding Model: `SentenceTransformer` (`all-MiniLM-L6-v2`)
	- Image Embedding Model: `TFConvNextV2Model` (`convnextv2-tiny-22k-224`)
	- Classifier: A custom MLP model trained on top of the embeddings.
	- Classes: The model classifies products into a set of specific Best Buy product categories.
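
As an illustration of what an MLP over precomputed embeddings can look like in Keras, here is a minimal sketch. The input sizes follow from the two embedding models (384 for `all-MiniLM-L6-v2`, 768 for ConvNeXt V2 tiny); the hidden-layer width, dropout rate, optimizer, and the idea of one model per input mode are assumptions, not the architecture actually trained for this project.

```python
# Input size per prediction mode; "multimodal" assumes concatenated embeddings.
INPUT_DIMS = {"text": 384, "image": 768, "multimodal": 384 + 768}


def build_mlp(mode: str, num_classes: int):
    """Build a small softmax classifier over precomputed embeddings."""
    import tensorflow as tf  # imported lazily; only needed to build the model

    model = tf.keras.Sequential(
        [
            tf.keras.Input(shape=(INPUT_DIMS[mode],)),
            tf.keras.layers.Dense(256, activation="relu"),  # assumed width
            tf.keras.layers.Dropout(0.3),                   # assumed rate
            tf.keras.layers.Dense(num_classes, activation="softmax"),
        ]
    )
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # integer class labels
        metrics=["accuracy"],
    )
    return model
```

Calling `build_mlp("multimodal", num_classes)` with the number of product categories would return a compiled model ready for `model.fit` on the embedding matrix and integer labels.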

	| Model               | Modality | Accuracy | Macro Avg F1-Score | Weighted Avg F1-Score |
	| :------------------ | :------- | :------- | :----------------- | :-------------------- |
	| Random Forest       | Text     | 0.90     | 0.83               | 0.90                  |
	| Logistic Regression | Text     | 0.90     | 0.84               | 0.90                  |
	| Random Forest       | Image    | 0.80     | 0.70               | 0.79                  |
	| Random Forest       | Combined | 0.89     | 0.79               | 0.89                  |
	| Logistic Regression | Combined | 0.89     | 0.83               | 0.89                  |
	| MLP                 | Image    | 0.84     | 0.77               | 0.84                  |
	| MLP                 | Text     | 0.92     | 0.87               | 0.92                  |
	| MLP                 | Combined | 0.92     | 0.85               | 0.92                  |
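
The table reports both macro and weighted F1. The sketch below shows how the two averages differ, using made-up toy labels (the class names `tv` and `cable` are illustrative, not taken from the dataset). Macro F1 averages per-class scores equally, while weighted F1 weights each class by its number of true samples, so a macro score below the weighted score is consistent with smaller classes being predicted less well.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: (f1, support)} for every label that occurs in y_true."""
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in support:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = (2 * tp / denom if denom else 0.0, support[label])
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(scores):
    """Mean of the per-class F1 scores, weighted by class support."""
    total = sum(n for _, n in scores.values())
    return sum(f1 * n for f1, n in scores.values()) / total


# Toy data: a frequent class predicted well, a rare class predicted poorly.
y_true = ["tv"] * 8 + ["cable"] * 2
y_pred = ["tv"] * 8 + ["tv", "cable"]
scores = f1_per_class(y_true, y_pred)

print(round(macro_f1(scores), 3))     # 0.804
print(round(weighted_f1(scores), 3))  # 0.886
```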

	> [!TIP]
	>
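	> On the test set, the multimodal MLP reached 92% accuracy and a 92% weighted F1-score, the best results in the table. The text-only MLP matches it on both metrics and has a slightly higher macro F1 (0.87 vs. 0.85), while the image-only MLP trails at 84% accuracy, so most of the predictive signal comes from the text descriptions.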