---
language: en
license: apache-2.0
library_name: transformers
tags:
- code
- java
- codet5
- optimization
- code-generation
datasets:
- nlpctx/java_optimisation
base_model: Salesforce/codet5-small
pipeline_tag: text-generation
model-index:
- name: codet5-java-optimizer
  results: []
---

# CodeT5-small Java Optimization Model

A fine-tuned [Salesforce/codet5-small](https://huggingface.co/Salesforce/codet5-small) model for Java code optimization tasks.

- **Model**: [nlpctx/codet5-java-optimizer](https://huggingface.co/nlpctx/codet5-java-optimizer)
- **Dataset**: [nlpctx/java_optimisation](https://huggingface.co/datasets/nlpctx/java_optimisation)
- **Base Model**: [Salesforce/codet5-small](https://huggingface.co/Salesforce/codet5-small)

## Overview

This repository contains a CodeT5-small model fine-tuned for Java code optimization. The model takes verbose or inefficient Java code as input and generates a more efficient version.

## Model Information

- **Base Model**: Salesforce/codet5-small
- **Training Dataset**: [nlpctx/java_optimisation](https://huggingface.co/datasets/nlpctx/java_optimisation)
- **Framework**: HuggingFace Transformers with Seq2SeqTrainer
- **Training Setup**: Dual-GPU DataParallel (Kaggle T4×2)
- **Dataset Size**: ~6K training / 680 validation Java optimization pairs
- **Optimization Focus**: Java code refactoring and performance improvements

## Files

- `config.json` - Model configuration
- `generation_config.json` - Generation parameters
- `model.safetensors` - Model weights (safetensors format)
- `merges.txt` - BPE merges file
- `special_tokens_map.json` - Special tokens mapping
- `tokenizer_config.json` - Tokenizer configuration
- `vocab.json` - Vocabulary file

## Usage

```python
from transformers import T5ForConditionalGeneration, RobertaTokenizer
import torch

# Load model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("nlpctx/codet5-java-optimizer")
tokenizer = RobertaTokenizer.from_pretrained("nlpctx/codet5-java-optimizer")
model.eval()

# Prepare input Java code; truncate inputs longer than 512 tokens
java_code = "your Java code here"
input_ids = tokenizer(
    java_code, return_tensors="pt", truncation=True, max_length=512
).input_ids

# Generate optimized code with beam search
with torch.no_grad():
    outputs = model.generate(
        input_ids,
        max_length=512,
        num_beams=4,
        early_stopping=True
    )

# Decode the best beam back to source text
optimized_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(optimized_code)
```

## Example Optimizations

The model has been trained to recognize and optimize common Java patterns:

- **Switch Expressions**: Converting verbose switch statements to switch expressions
- **Collection Operations**: Replacing manual iterator removal with `removeIf()`
- **String Handling**: Optimizing string concatenation with `StringBuilder`
- **Loop Optimizations**: Improving iterative constructs
- **And more...**

## Training Details

The model was fine-tuned using:

- **Base Model**: Salesforce/codet5-small
- **Dataset**: nlpctx/java_optimisation from Hugging Face
- **Training Framework**: Seq2SeqTrainer with DataParallel
- **Hardware**: Kaggle T4×2 (dual GPU)
- **Approach**: Standard supervised fine-tuning on Java optimization pairs

## License

This model is licensed under the **Apache 2.0** license, matching the original [Salesforce/codet5-small](https://huggingface.co/Salesforce/codet5-small) model.

## Acknowledgements

- Model based on [Salesforce/codet5-small](https://huggingface.co/Salesforce/codet5-small)
- Training data from the [nlpctx/java_optimisation](https://huggingface.co/datasets/nlpctx/java_optimisation) dataset
- Built with [HuggingFace Transformers](https://github.com/huggingface/transformers)
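
## Appendix: Trying the Listed Patterns

The patterns named under Example Optimizations can be tried with a small wrapper around the Usage code. The following is a minimal sketch, not part of the released code: the names `SAMPLE_INPUTS`, `load_optimizer`, and `optimize_java` are hypothetical, and the Java snippets are illustrative inputs written for this sketch rather than samples from the dataset. It assumes the same generation settings as in Usage and makes no claim about what the model will output.

```python
from typing import Any, Tuple

# Hypothetical "before" snippets for two of the listed patterns.
# Written for illustration only; they are NOT taken from nlpctx/java_optimisation.
SAMPLE_INPUTS = {
    # Manual iterator removal, a candidate for removeIf()
    "collection_removal": (
        "Iterator<String> it = names.iterator();\n"
        "while (it.hasNext()) {\n"
        "    if (it.next().isEmpty()) {\n"
        "        it.remove();\n"
        "    }\n"
        "}"
    ),
    # String concatenation in a loop, a candidate for StringBuilder
    "string_concatenation": (
        'String result = "";\n'
        "for (String part : parts) {\n"
        "    result += part;\n"
        "}"
    ),
}


def load_optimizer(repo_id: str = "nlpctx/codet5-java-optimizer") -> Tuple[Any, Any]:
    """Load the model and tokenizer as in the Usage section (downloads weights)."""
    # Imported lazily so the rest of this module works without transformers installed.
    from transformers import RobertaTokenizer, T5ForConditionalGeneration

    model = T5ForConditionalGeneration.from_pretrained(repo_id)
    tokenizer = RobertaTokenizer.from_pretrained(repo_id)
    model.eval()
    return model, tokenizer


def optimize_java(
    java_code: str,
    model: Any,
    tokenizer: Any,
    max_length: int = 512,
    num_beams: int = 4,
) -> str:
    """Run one Java snippet through the model and return the decoded text."""
    # Truncate long inputs to max_length tokens (assumed to match the
    # 512 used for generation in the Usage section).
    inputs = tokenizer(
        java_code, return_tensors="pt", truncation=True, max_length=max_length
    )
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        num_beams=num_beams,
        early_stopping=True,
    )
    # outputs[0] is the highest-scoring beam.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

With the model downloaded, `optimize_java(SAMPLE_INPUTS["collection_removal"], *load_optimizer())` would return the model's suggested rewrite as a string. Passing the model and tokenizer in as arguments keeps them loaded once across many snippets.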