Tags: ChatGPT, CNN, Copilot, Deep Learning, Gemini, Keras, LLaMA, LLMs for code generation, TensorFlow, Time Efficiency, Transfer Learning, Vision Transformer
Abstract:
This study presents a comparative analysis of large language models (LLMs) in automating the development of deep learning solutions for image classification. In Experiment 1, we investigate the capabilities of four LLMs (ChatGPT, Copilot, Gemini, and LLaMA) in generating TensorFlow-based deep learning code using Keras. The generated code is evaluated on test accuracy, execution time, and quality metrics such as readability and maintainability. Building on this, Experiment 2 introduces the MultiLLM CodeEval Hub, a framework that automates the evaluation of LLM-generated code across multiple metrics. The system employs several LLMs operating in parallel to assess aspects such as robustness, performance, and code quality. Our results highlight the diverse strengths of LLMs in generating and optimizing deep learning models while uncovering gaps in error handling and architectural efficiency. The MultiLLM CodeEval Hub streamlines code evaluation, offering developers and organizations a structured, data-driven approach to selecting optimal deep learning architectures. By automating both code generation and evaluation, this research demonstrates the transformative potential of LLMs in advancing machine learning workflows. Our findings underscore the role of AI in fostering innovation, reducing development time, and improving model reliability for real-world applications.
Towards Autonomous Deep Learning: Comparative Analysis of AI-Generated and AI-Evaluated Code Using LLMs for Computer Vision Tasks
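The abstract describes the MultiLLM CodeEval Hub as running multiple LLM evaluators in parallel over a generated code snippet and scoring it on metrics such as robustness, performance, and code quality. The paper does not specify an implementation, so the following is only a hedged, minimal sketch of that evaluation loop: the evaluator names are the four models from the study, but the `evaluate` function here is a hypothetical stand-in that uses crude textual heuristics in place of real LLM API calls, and `run_hub` simply averages the per-metric scores.

```python
# Minimal sketch of a MultiLLM-CodeEval-style parallel evaluation loop.
# The scoring heuristics are illustrative stand-ins for real LLM calls,
# which the paper does not specify.
from concurrent.futures import ThreadPoolExecutor

# Example LLM-generated Keras snippet to be evaluated (text only; never executed).
CODE_SNIPPET = """
model = keras.Sequential([layers.Conv2D(32, 3, activation='relu'),
                          layers.Flatten(), layers.Dense(10)])
"""

def evaluate(evaluator_name, code):
    # Hypothetical stand-in for one LLM evaluator: score cheap textual
    # proxies for the metrics the framework tracks.
    longest_line = max((len(l) for l in code.splitlines()), default=1)
    scores = {
        "robustness": 1.0 if "try" in code else 0.0,        # crude error-handling check
        "code_quality": min(1.0, 80 / max(longest_line, 1)),  # line-length proxy
        "performance": 1.0 if "Conv2D" in code else 0.5,    # conv layers beat dense-only
    }
    return evaluator_name, scores

def run_hub(evaluators, code):
    # Query every evaluator concurrently, then average per-metric scores.
    with ThreadPoolExecutor() as pool:
        results = dict(pool.map(lambda name: evaluate(name, code), evaluators))
    metrics = results[evaluators[0]].keys()
    return {m: sum(r[m] for r in results.values()) / len(results) for m in metrics}

summary = run_hub(["ChatGPT", "Copilot", "Gemini", "LLaMA"], CODE_SNIPPET)
print(summary)
```

In a real system each evaluator would be a separate LLM API call returning structured scores, and the aggregation step could weight evaluators or flag disagreement instead of taking a plain mean.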