
Completed
Posted
Paid on delivery
(Note: Promptfoo preferred) I have a fixed bank of question-answering prompts and want to put 4-6 large language models through their paces (the free but latest versions of the models below):
- ChatGPT
- Copilot
- Gemini
- DeepSeek
Use promptfoo if you can (preferable), or build the harness in Python with tools such as LangChain, OpenAI SDK, or the Hugging Face ecosystem. I will supply the prompt set with answers. If additional scoring logic is required, please design it so I can tweak thresholds later.
Deliverables
• Well-documented scripts/notebooks that call each chosen model, log raw outputs, and capture latency and pricing data
• An automated scoring module that grades answers against ground truth and produces aggregate metrics
• A concise comparison report (tables or charts are fine) highlighting strengths, weaknesses, and any notable trade-offs between accuracy, speed, and cost
• Use parameters for the prompts and models, so I can easily test new prompts and new models
Acceptance criteria
• Use a laptop I'll provide (connect via AnyDesk details that I'll supply) or produce proper documentation that can be easily followed to set up the experiment on my local laptop (i.e. if you run experiments on your computer).
• Code is modular enough to swap in an additional model with minimal edits
If this sounds like your kind of project, let's talk through the models you'd like to include and any API keys you may already have on hand. (See attached "Sample [login to view URL]")
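As a rough illustration of the requested harness, here is a minimal Python sketch that runs a prompt bank against a model while logging the raw output, latency, and an estimated cost per call. It assumes the OpenAI Python SDK with OPENAI_API_KEY set; the model name, the single sample prompt, and the per-token prices are placeholders to replace with the supplied prompt set and current provider rates.

```python
# Minimal harness sketch: run a prompt bank against one or more models and
# log raw outputs, latency, and an estimated cost per call to runs.csv.
# Assumptions: OpenAI Python SDK installed, OPENAI_API_KEY set; model names
# and per-token prices below are placeholders to adjust.
import csv
import time

from openai import OpenAI

client = OpenAI()

# Placeholder pricing (USD per 1M tokens) -- verify against current provider rates.
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

MODELS = ["gpt-4o-mini"]           # add/remove models here
PROMPTS = [                        # replace with the supplied Q&A bank
    {"id": "q1", "question": "What is the capital of France?", "answer": "Paris"},
]

def run_once(model: str, question: str) -> dict:
    """Call one model on one question; return output, latency, and estimated cost."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    latency = time.perf_counter() - start
    price = PRICES.get(model, {"input": 0.0, "output": 0.0})
    cost = (resp.usage.prompt_tokens * price["input"]
            + resp.usage.completion_tokens * price["output"]) / 1_000_000
    return {"model": model, "output": resp.choices[0].message.content,
            "latency_s": round(latency, 3), "cost_usd": round(cost, 6)}

if __name__ == "__main__":
    with open("runs.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["model", "id", "output", "latency_s", "cost_usd"])
        writer.writeheader()
        for model in MODELS:
            for p in PROMPTS:
                row = run_once(model, p["question"])
                row["id"] = p["id"]
                writer.writerow(row)
```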
Project ID: 40318455
33 proposals
Remote project
Active 22 days ago

Hello, I can work on this project. I am an AI Engineer with 8 years of experience in LLM, Prompt Engineering, Python, and Data [login to view URL]. Let's connect.
$90 USD in 2 days
6.4
33 freelancers are bidding on average $130 USD for this project

Hi there, I’m Ivaylo, and I’m excited to help you design a rigorous, reusable benchmarking harness for multi-LLM accuracy and cost. I’ll implement a modular Python workflow (preferably leveraging PromptFoo when available) that orchestrates 4-6 large models against your fixed Q&A prompt bank: ChatGPT, Copilot, Gemini, and DeepSeek, with room to add more later. The solution will produce:
- Well-documented scripts/notebooks that call each model, log raw outputs, measure latency, and capture pricing data.
- A pluggable scoring module that compares answers to ground truth and emits per-model and aggregate metrics.
- A concise, visual comparison report to highlight accuracy vs speed and cost trade-offs.
- Parameterized prompts and model selections to enable quick retests with new prompts or models.
Deliverables will be clean, modular, and easy to run on a local laptop, with setup instructions suitable for your environment or AnyDesk-driven access. I’ll design the code to accept additional models with minimal edits, and I’ll include clear README notes on how to adjust thresholds later. If you’re ready, I’m happy to discuss model inclusion, your API keys on hand, and any constraints you want prioritized.
Best regards, Ivaylo
$155 USD in 6 days
5.3
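Ivaylo's proposal above, like the brief itself, emphasizes parameterized prompts and model selections. A minimal sketch of one way to drive that from a config file follows; the file name and field names are illustrative, not a fixed format.

```python
# Sketch of config-driven parameterization: models and the prompt bank live in
# a JSON file, so testing a new model or prompt means editing config, not code.
# The file name and field names below are illustrative, not a fixed format.
import json

# Example experiment.json:
# {
#   "models": ["gpt-4o-mini", "gemini-1.5-flash"],
#   "prompts": [{"id": "q1", "question": "2 + 2 = ?", "answer": "4"}]
# }

def load_config(path: str = "experiment.json") -> dict:
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    # Fail fast on missing keys so a malformed config is caught before any API spend.
    for key in ("models", "prompts"):
        if key not in cfg:
            raise KeyError(f"config is missing required key: {key!r}")
    return cfg

if __name__ == "__main__":
    cfg = load_config()
    print(f"{len(cfg['models'])} models x {len(cfg['prompts'])} prompts configured")
```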

Hello, I bring over 7 years of experience in Data Visualization, Statistical Analysis, and Python to this project. I have carefully reviewed the project requirements and am confident in my ability to deliver a comprehensive solution. To execute this project, I propose creating a Python-based harness utilizing tools such as LangChain, OpenAI SDK, or the Hugging Face ecosystem to test the language models - ChatGPT, Copilot, Gemini, and DeepSeek. I will develop well-documented scripts/notebooks to call each model, log outputs, capture latency and pricing data, and create an automated scoring module for grading answers against ground truth. The deliverables will include a concise comparison report highlighting the strengths, weaknesses, and trade-offs between accuracy, speed, and cost. I am keen on discussing further details and the selection of models for the project. Please connect in chat for a detailed discussion. You can visit my Profile: https://www.freelancer.com/u/HiraMahmood4072 Thank you.
$100 USD in 2 days
5.3

⭐⭐⭐⭐⭐ ✅ Hi there, hope you are doing well! I have worked on benchmarking multiple large language models using Python and frameworks like LangChain, enabling quick and clear performance comparisons based on prompt accuracy and cost efficiency. From my experience, the key to successfully completing your project is building a modular, well-documented scoring and logging system that allows easy tweaking and model interchange.
Approach:
⭕ Utilize promptfoo or build custom Python scripts with LangChain and OpenAI SDK.
⭕ Implement modular code to integrate 4-6 LLMs, supporting ease of adding new models.
⭕ Capture raw outputs, latency, and cost details for each prompt.
⭕ Develop flexible scoring logic with adjustable thresholds for accuracy.
⭕ Generate insightful comparison reports with clear tables and visual charts.
⭕ Provide thorough documentation or remote setup assistance via AnyDesk.
❓ Do you have API keys ready for all models you want tested?
❓ Would you like me to include additional models beyond the four listed?
❓ What format do you prefer for the final comparison report?
I am confident in delivering a robust, extensible benchmarking solution tailored to your needs with clear insights into accuracy, speed, and cost trade-offs. Looking forward to collaborating with you.
Best regards, Nam
$200 USD in 3 days
3.8

Hi, Solid brief — this is exactly the kind of structured eval work I enjoy. Promptfoo is my first choice here too; it's purpose-built for this. Here's my approach:
• Harness: Promptfoo config covering ChatGPT, Copilot, Gemini, and DeepSeek. Models and prompts defined as parameters — swapping in a new model is a one-line config change.
• Scoring: Ground-truth assertions via Promptfoo's built-in graders plus a custom Python scoring module for threshold-based logic you can tune without touching the core harness.
• Logging: Raw outputs, latency, and cost-per-query captured automatically per run and written to CSV/JSON for audit trail.
• Report: Auto-generated comparison table (accuracy, speed, cost) with charts. Clear winner/weakness summary per model.
• Docs: Step-by-step setup guide so you can reproduce everything on your laptop independently, plus AnyDesk walkthrough if preferred.
All scripts modular and well-commented — adding a fifth model later won't require restructuring anything. Just bring your prompt set and API keys and we're ready to run. Available this week — want to connect and walk through the config?
Best, Ken
$140 USD in 7 days
3.9
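Ken's proposal mentions a custom Python scoring module with threshold-based logic that can be tuned without touching the harness. A minimal sketch of that idea follows, using normalized exact match plus a fuzzy-similarity fallback; the SequenceMatcher metric and the 0.8 threshold are placeholder choices, and whether it is wired in as a promptfoo Python assertion or called from a standalone script should be confirmed against the promptfoo docs.

```python
# Sketch of a tunable scoring module: normalized exact match with a fuzzy
# similarity fallback. The SequenceMatcher metric and the 0.8 threshold are
# placeholder choices meant to be adjusted later.
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't count."""
    return " ".join(text.lower().split())

def score_answer(output: str, expected: str, threshold: float = 0.8) -> dict:
    """Return pass/fail plus a similarity score against the ground-truth answer."""
    a, b = normalize(output), normalize(expected)
    similarity = SequenceMatcher(None, a, b).ratio()
    return {
        "pass": a == b or similarity >= threshold,
        "score": round(similarity, 3),
        "reason": f"similarity={similarity:.2f}, threshold={threshold}",
    }

# Example: score_answer("paris", "Paris") -> {"pass": True, "score": 1.0, ...}
```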

Hi there, I see you’re looking to benchmark multiple large language models using a set of question-answering prompts. With 4+ years of experience in Python and data analysis, I can help you set up scripts that will log the outputs, latency, and cost metrics for each model you mentioned. I’ll use Promptfoo if possible, or build the harness in Python with tools like LangChain or the OpenAI SDK. I'll ensure the code is modular, making it easy for you to swap in new models or prompts. Additionally, I’ll create an automated scoring module that allows you to tweak thresholds as needed and a concise comparison report to summarize the results. One question I have is whether there are specific metrics you prioritize most in the comparison report, like accuracy over cost or speed? Best regards, Arslan Shahid
$30 USD in 3 days
3.7

**DO NOT PAY ME UNTIL I COMPLETE! :)** Hello my valuable client :) My profile is new over here but I have 7 years of experience in this field. I have completely understood your project. Also, I will provide you free maintenance on your project for 1 year after project completion. I can definitely complete this in your timeframe. Give me one chance to prove myself. Hit the chat button to get started. If you do not like my work then you don't need to pay me any money, so don't worry and have faith in me :) I am eagerly waiting for your message.
$150 USD in 7 days
2.9

Hi, I can efficiently set up and execute your experiment across the specified language models using either Promptfoo or a Python-based solution with LangChain and the OpenAI SDK. My extensive experience with model evaluations ensures I will deliver a robust, well-documented framework that logs raw outputs, captures latency, and tracks pricing effectively. I’ll implement an automated scoring module to grade responses against your provided ground truth, allowing for adjustable thresholds to fit your needs. The comparison report will clearly outline each model's strengths and weaknesses, providing you with actionable insights on accuracy, speed, and cost. To ensure modularity, the code will be designed for easy model swapping with minimal edits. I’m prepared to connect via AnyDesk or to provide comprehensive documentation for your local setup. Let’s discuss the models you want to include and any API keys you might have. Thank you.
$156.50 USD in 7 days
2.7

Greetings, I read your job post and I understand you want to benchmark multiple LLMs (ChatGPT, Copilot, Gemini, DeepSeek) using a fixed prompt set, capturing outputs, latency, cost, and accuracy, preferably via Promptfoo or a Python-based modular harness. I’ve carefully reviewed the attached file and understand the expected output format and scoring requirements. Here’s how I’ll approach it:
Modular experiment harness: Build a Python script/notebook (or Promptfoo config) that calls each model, logs raw outputs, latency, and pricing
Automated scoring: Compare model outputs against your ground-truth answers, with adjustable thresholds for accuracy or partial credit
Data capture & visualization: Aggregate metrics, generate tables/charts to highlight accuracy, speed, and cost trade-offs
Flexible design: Parameters for prompts and models, making it easy to swap models or add new prompts with minimal edits
Documentation & setup: Provide clear instructions or live AnyDesk guidance so the experiment can run on your laptop seamlessly
With 5+ years of Python, LLM integration, prompt engineering, and statistical analysis experience, I can make this framework fully reproducible, easy to extend, and ready for benchmarking. Do you want me to start by setting up a Promptfoo harness, or first build a Python notebook version so you can see logging and scoring in action?
Regards, Shoaib
$149 USD in 4 days
2.8
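Several bids, including Shoaib's above, describe aggregating per-call logs into comparison tables. A minimal pandas sketch of that step, assuming a run log like the one produced by the harness sketch earlier on this page plus a "correct" column added by the scoring step:

```python
# Sketch of the aggregation step: turn the per-call run log into a per-model
# comparison table. Assumes runs.csv has columns model, correct, latency_s,
# cost_usd, where "correct" was added by the scoring step (0/1 per prompt).
import pandas as pd

df = pd.read_csv("runs.csv")

summary = (
    df.groupby("model")
      .agg(
          accuracy=("correct", "mean"),
          avg_latency_s=("latency_s", "mean"),
          total_cost_usd=("cost_usd", "sum"),
          n_prompts=("correct", "size"),
      )
      .round(3)
      .sort_values("accuracy", ascending=False)
)

summary.to_csv("summary.csv")   # reused by the charting sketch further down
print(summary.to_string())
```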

Hello Visionary, I read your brief and I’m confident I can deliver a repeatable, modular benchmark for ChatGPT, Copilot, Gemini and DeepSeek using promptfoo where possible, and falling back to a clean Python harness (LangChain/OpenAI/HF) when needed. I’ll instrument each model call to log raw outputs, latency, and cost data, and build a scoring module that compares answers to your supplied ground truth with configurable thresholds so you can tweak sensitivity later. My approach is pragmatic: implement a model adapter layer so new models plug in with minimal code, use prompt parameters and model-config files, and produce reproducible notebooks/scripts that log everything and generate concise comparison visuals and aggregate metrics. I’ll include clear AnyDesk-ready setup steps and standalone documentation so you can run experiments locally. If this sounds good I’ll sketch the exact model adapters I plan to include and a short runbook for the laptop access. Which of the listed models (ChatGPT, Copilot, Gemini, DeepSeek) do you already have API access keys for, and are there any version constraints (e.g., gpt-4o vs gpt-4)? Thanks, Daniel
$200 USD in 4 days
2.2
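Daniel's proposal describes a model adapter layer so new models plug in with minimal code. A minimal sketch of that pattern follows; the class and method names are illustrative, not an existing library API.

```python
# Sketch of the adapter-layer idea: every provider implements the same small
# interface, so adding a model means one new class plus a registry entry.
# Names here are illustrative, not an existing library API.
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class ModelResponse:
    text: str
    latency_s: float
    cost_usd: float

class ModelAdapter(ABC):
    name: str

    @abstractmethod
    def ask(self, question: str) -> ModelResponse:
        """Send one prompt and return the normalized response record."""

class EchoAdapter(ModelAdapter):
    """Stand-in adapter for testing the harness without API keys."""
    name = "echo"

    def ask(self, question: str) -> ModelResponse:
        return ModelResponse(text=question, latency_s=0.0, cost_usd=0.0)

REGISTRY = {a.name: a for a in [EchoAdapter()]}  # real provider adapters register here

def run(model_name: str, question: str) -> ModelResponse:
    return REGISTRY[model_name].ask(question)
```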

Hello, I’ve read your Multi-LLM benchmarking brief and I’m confident I can deliver a clean, reusable harness that runs ChatGPT, Copilot, Gemini and DeepSeek against your ground-truth prompts. I prefer promptfoo and will use it where supported; otherwise I’ll supply a Python alternative using LangChain/Hugging Face/OpenAI SDK. My focus will be maintainability: modular code, clear parameterization for prompts/models, and an extensible scoring module with tweakable thresholds. I’ll log raw outputs, latency, and cost, and produce concise charts/tables that highlight accuracy, speed, and cost trade-offs. I can work on your laptop via AnyDesk or provide step-by-step setup docs. Next step: confirm which model APIs/keys you have and whether you want promptfoo-first or Python-first delivery. Which model APIs and keys do you already have available, and would you prefer I prioritize promptfoo or a Python-based harness? Sincerely, Cindy Viorina
$30 USD in 18 days
2.2

Hello, I specialize in data analysis and predictive modeling using Python, helping businesses turn raw data into meaningful insights and informed decisions.
What I Can Do:
- Clean and prepare raw datasets (e.g., missing values, outliers, feature engineering)
- Perform exploratory data analysis (e.g., trends, correlations)
- Build machine learning models (e.g., classification, regression, forecasting)
- Create clear visualizations and analytical reports
- Explain results in business-friendly terms
What I Need From You:
- Dataset access or sample data
- Business objective or key questions
- Expected output format (e.g., report, notebook, dashboard)
Let’s turn your data into actionable insights.
Best Regards, Oleh
$140 USD in 7 days
0.6

Hey, I am ready when you are.✅ I’ve worked on something very similar. What really matters here is the challenge of benchmarking multiple large language models accurately and efficiently. The tricky part is usually ensuring the scoring logic is flexible for future adjustments. I recently built a similar system using Python with the Hugging Face ecosystem, evaluating model performance against a set of prompts and ground truth data. While I haven't specifically used Promptfoo, I have experience with designing automated scoring modules and generating comparison reports. I would approach this project by setting up well-documented scripts to call each model, capture relevant data, and create a modular system for easy model integration. Let's chat! -Ruslan
$140 USD in 7 days
0.0

Hey, I just went through your job description and noticed you need someone skilled in Statistical Analysis, Prompt Engineering, Data Visualization, Large Language Models, and Python. That’s right up my alley. You can check my profile — I’m a software engineer working on large-scale apps as a lead developer with U.S. and European teams. I’ve handled several projects using these exact tools and technologies.
Before we proceed, I’d like to clarify a few things:
Are these all the project requirements or is there more to it?
Do you already have any work done, or will this start from scratch?
What’s your preferred deadline for completion?
Why Work With Me?
1) Over 230 successful projects completed.
2) I have not received a single bad feedback in the last 5-6 years.
3) You will find 5-star feedback on the last 100+ major projects, which shows my clients are happy with my work.
4) Long-term track record of happy clients and repeat work.
I prioritize quality, deadlines, and clear communication.
Availability: 9am – 9pm Eastern Time (Full-time freelancer)
I can share recent examples of similar projects in chat. Let’s connect and discuss your vision in detail.
Kind Regards, Imran Haider
$30 USD in 12 days
0.0

Hello there What are your main concerns about balancing accuracy and cost when benchmarking these large language models? How do you want to handle scoring threshold adjustments to keep the system flexible? Ensuring modular code to swap models easily requires clear abstraction layers. Automating scoring against ground truth is tricky because scoring must be both accurate and adaptable. I will build modular, well-documented Python scripts using promptfoo as preferred. They will log outputs, latency, and pricing, and support plug-and-play of prompts and models. I will deliver the scoring module and a clear comparison report with charts. We can connect via AnyDesk or I can provide detailed setup instructions. I would be happy to discuss further on chat. Best regards. Dorofii
$180 USD in 2 days
0.0

Hi there, It sounds like you're running into challenges building a clean, comparable benchmark across ChatGPT, Copilot, Gemini, and DeepSeek. I've spent the past 5 years building evaluation harnesses just like this, and I can set up a promptfoo workflow or a Python-based system that logs raw outputs, latency, token costs, and performs automated scoring. I'll modularize each model wrapper, implement a tunable scoring engine, and document the full setup so you can run everything on your provided laptop or locally. Best regards,
$155 USD in 6 days
0.0

Hello, Thank you for outlining your multi-LLM benchmarking needs. At DemiVision LLC, we specialize in designing robust, modular evaluation frameworks for language models, leveraging Python, prompt engineering, and visualization tools. We’ve worked extensively with LLMs (ChatGPT, Copilot, Gemini, DeepSeek, and more), and are experienced in both promptfoo and custom Python harnesses using LangChain, OpenAI SDK, and Hugging Face. We understand your goal: a reproducible, easily extensible pipeline for benchmarking model accuracy, latency, and cost against a fixed set of prompts and answers. Our approach will center on a well-documented script or notebook that logs every model call, captures timing and pricing data, and supports parameter-driven configuration for flexibility. The scoring module will be modular, allowing you to adjust thresholds as needed. Results will be presented through clear tables and charts, highlighting trade-offs between models. We’re happy to work directly on your laptop via AnyDesk or deliver step-by-step setup documentation for seamless local execution. Adding new models or prompt sets will be straightforward, ensuring long-term usability. Let’s discuss which models you’d like prioritized, and how we can align on API access. Looking forward to collaborating on this insightful benchmarking project!
$140 USD in 5 days
0.0
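DemiVision's bid, like several others, proposes presenting results through clear tables and charts. A minimal matplotlib sketch of one such chart, assuming the summary.csv written by the aggregation sketch earlier on this page:

```python
# Sketch of the charting step: a per-model accuracy bar chart built from the
# summary.csv written by the aggregation sketch earlier (columns assumed:
# model, accuracy, avg_latency_s, total_cost_usd, n_prompts).
import matplotlib.pyplot as plt
import pandas as pd

summary = pd.read_csv("summary.csv", index_col="model")

fig, ax = plt.subplots(figsize=(6, 4))
summary["accuracy"].plot.bar(ax=ax)
ax.set_ylabel("Accuracy (fraction of prompts correct)")
ax.set_title("Per-model accuracy on the prompt bank")
fig.tight_layout()
fig.savefig("accuracy_by_model.png", dpi=150)
```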

I can build a modular benchmarking framework to evaluate multiple LLMs on accuracy, latency, and cost using Promptfoo (preferred) or a Python-based stack (LangChain/OpenAI SDK/Hugging Face) depending on your setup. I will create a configurable pipeline that runs your prompt set across selected models (ChatGPT, Copilot, Gemini, DeepSeek), logs raw outputs, measures response time, and estimates cost per request. A flexible scoring module will compare outputs against ground truth using adjustable thresholds (exact match, semantic similarity, or custom rules). You’ll receive clean, well-documented scripts/notebooks, parameterized configs for easy model/prompt swapping, and a concise report with tables/visualizations highlighting performance trade-offs. Setup will be reproducible on your machine, or I can assist remotely.
Victoria Behei
$100 USD in 7 days
0.0
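Victoria's bid lists semantic similarity as one possible scoring mode alongside exact match and custom rules. A minimal sketch of embedding-based semantic scoring, assuming the OpenAI embeddings endpoint; the embedding model name and the 0.85 threshold are placeholders to tune.

```python
# Sketch of embedding-based semantic scoring: embed the model output and the
# ground-truth answer, then compare by cosine similarity against a tunable
# threshold. Assumes the OpenAI SDK with OPENAI_API_KEY set; the embedding
# model name and the 0.85 threshold are placeholders.
import math

from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def semantic_match(output: str, expected: str, threshold: float = 0.85) -> bool:
    return cosine(embed(output), embed(expected)) >= threshold
```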

Hello, With over 9 years of experience in Python, Data Visualization, and Prompt Engineering, I am well-equipped to tackle your Multi-LLM Accuracy & Cost Benchmarking project. I understand your requirement to test 4-6 large language models using a fixed bank of question-answering prompts and to deliver well-documented scripts/notebooks, an automated scoring module, and a concise comparison report. I will ensure effective communication throughout the project and am excited to collaborate with you on this task. I look forward to discussing the models to include and any API keys needed for the project. Thanks.
$250 USD in 7 days
0.0

Hello there, I am excited about the opportunity to work on your project involving testing 4-6 large language models using fixed bank question-answering prompts, specifically ChatGPT, Copilot, Gemini, and DeepSeek. I propose to leverage Promptfoo for this task, or alternatively, build the harness in Python using tools like LangChain, OpenAI SDK, or the Hugging Face ecosystem as per your preference. I will ensure the deliverables include well-documented scripts/notebooks, an automated scoring module, and a concise comparison report to highlight key insights. I am confident in meeting the acceptance criteria and delivering high-quality results for your project. Regards, Daniel
$120 USD in 5 days
0.0

Hi there, I’ve read your request on benchmarking 4-6 large language models (ChatGPT, Copilot, Gemini, DeepSeek) using Promptfoo or a Python harness. I’ll quickly diagnose common bottlenecks like prompt alignment, API/version mismatches, and cost tracking, then build a modular pipeline to run prompts, log outputs, latency, and pricing.
My plan:
• set up a parameterized harness (Promptfoo or LangChain)
• implement a robust scoring module against ground truth
• generate a concise accuracy-speed-cost report
• document setup for easy local replication
I recently solved a similar multi-model benchmarking task with clear, tweakable thresholds. Can you share which API keys are already available and preferred models to start with?
Thando
$120 USD in 3 days
0.0

Edinburgh, United Kingdom
Payment method verified
Joined November 14, 2013