
Closed
Paid on delivery
I’m refining an AI-driven question-paper checker and parser, and my top priority right now is accuracy. With multiple-choice items the system too often marks wrong answers as correct and flags correct ones as wrong, so I need a specialist who can put several models through their paces, isolate the causes of these misclassifications, and tune the pipeline until the results are consistently reliable. Speed and consistency will matter later, but first I need the numbers to be right.

You’ll have freedom to evaluate any mix of commercial LLMs (GPT-4, Claude, Bard, etc.) and open-source options (Llama 2, Falcon, custom fine-tuned transformers) so long as you keep the comparison fair and reproducible. The current stack is Python with Hugging Face and a small PostgreSQL database for answer keys; if another framework is essential, outline why and how you’ll integrate it back into this environment.

Deliverables
• A benchmark report showing precision, recall, and F1 for each model on my labeled multiple-choice dataset
• The cleaned, documented evaluation scripts or notebooks (Python preferred)
• An updated inference module or prompt library that meets or beats 98% exact-match accuracy on the test set
• A short hand-off guide so I can run future batches the same way

If further optimisation suggests extending support to short-answer or essay questions, let me know in your findings; for now, keep the focus tightly on multiple-choice grading accuracy.
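As an illustration of the metrics named in the deliverables, here is a minimal sketch of per-option precision, recall, and F1 plus exact-match accuracy. It assumes each item has exactly one gold letter from A to D and one predicted letter; the function names and the sample data are hypothetical, and the brief does not say whether the metrics should be computed per option or per "marked correct / marked wrong" decision.

```python
def per_class_prf(gold, pred, labels=("A", "B", "C", "D")):
    """Per-option precision, recall and F1 (assumption: single-answer A-D items)."""
    report = {}
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def exact_match_accuracy(gold, pred):
    """Share of items where the predicted letter equals the gold letter."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Made-up labels for illustration only.
gold = ["A", "B", "C", "D", "A", "B"]
pred = ["A", "B", "C", "A", "A", "C"]
report = per_class_prf(gold, pred)
print(exact_match_accuracy(gold, pred))
print(report["A"])
```

Averaging the per-option F1 values gives a macro F1; whether that or a binary "graded correctly" metric is what the client wants would need to be confirmed.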
Project ID: 40413552
32 proposals
Remote project
Active 3 days ago
32 freelancers are bidding on average ₹11,097 INR for this job

Hello, I trust you're doing well. I am experienced in machine learning algorithms, with nearly a decade of hands-on practice. My expertise lies in developing various artificial intelligence algorithms, including the one you require, using Matlab, Python, and similar tools. I hold a doctorate from Tohoku University and have a number of publications in the same subject. My portfolio, which showcases my past work, is available for your review. Your project piqued my interest, and I would be delighted to be part of it. Let's connect to discuss in detail. Warm regards. Please check my portfolio link: https://www.freelancer.com/u/sajjadtaghvaeifr
₹75,000 INR in 7 days
7.6

I can help you. The core problem you described (wrong answers marked right, correct answers marked wrong) suggests the system isn't just inaccurate; it's inconsistent in how it interprets the question, the answer key, or both. Before running a full model shootout, I'll first check two hidden failure points:
1. Answer key alignment: If the stored "correct" answer in PostgreSQL doesn't exactly match the format the model sees (e.g., "B" vs "b" vs "Option B" vs trailing spaces), even a perfect model will seem broken. I'll audit that first.
2. Prompt sensitivity: Most accuracy issues disappear not with a better model, but with a stricter, structured prompt that forces a single-character or fixed-token output. I'll test that on your worst-performing batch first.
If those don't fix it, then I'll run a controlled comparison between models using identical inputs and scoring rules, but only on the subset where you're already seeing failures. No need to test every model on every question. For the final 98% target, I'll implement a confidence check: any answer below a threshold gets flagged for review, not auto-scored. That alone will lift accuracy without changing models.
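The answer-key audit mentioned in this proposal could start from something like the sketch below. It assumes keys are meant to be a single bare letter from A to D and that rows have already been read from the database into a dict; `audit_answer_keys` and the sample rows are hypothetical, not part of the client's code.

```python
import re

# Assumption: a well-formed key is exactly one uppercase letter A-D.
CANONICAL_KEY = re.compile(r"[A-D]")


def audit_answer_keys(keys):
    """Return (question_id, raw_value) pairs whose stored key is not a bare A-D letter."""
    return [(qid, raw) for qid, raw in keys.items()
            if not CANONICAL_KEY.fullmatch(raw)]


# Hypothetical rows as they might come back from the answer-key table.
sample_keys = {"q1": "B", "q2": "b", "q3": "Option B", "q4": "B ", "q5": "C"}
suspect = audit_answer_keys(sample_keys)
print(suspect)
```

`fullmatch` is used so that trailing spaces or newlines count as a mismatch instead of being silently accepted.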
₹25,000 INR in 7 days
5.3

Hi, I am a seasoned Applied ML Engineer (6+ years of experience) and I can help improve this MCQ checker by treating it as an accuracy/debugging problem first, not just a model-prompting task. My approach would be:
>> Benchmark fairly: run GPT/Claude/Gemini/open-source models on the same labeled test split with fixed prompts, temperature 0, the same schema, and identical scoring scripts.
>> Isolate failure causes: tag every wrong prediction as OCR/parser error, question-number mismatch, answer-key mismatch, option extraction issue, multi-select handling issue, prompt-format issue, or true model reasoning error.
>> Make grading deterministic: since PostgreSQL already stores answer keys, the final correctness check should be code-based: normalize student answer -> fetch answer key -> exact-match compare. LLMs should only help with messy answer extraction.
>> Answer normalization: handle formats like A, (A), Option A, 1, ①, lowercase letters, full-width characters, “A & C”, blanks, and common OCR confusions like B/8, D/O, C/G.
>> LLM fallback: use structured JSON outputs only for ambiguous cases.
>> Evaluation: deliver precision, recall, F1, exact-match accuracy, confusion matrix, and an error-breakdown report for each model/pipeline version.
>> Updated module: provide a cleaned inference module and prompt library that can be plugged back into the existing Python/Hugging Face/PostgreSQL stack.
I have experience in OCR/document parsing, NLP extraction, RAG/LLM pipelines, evaluation scripts, FastAPI/Python backends, and production-grade model debugging.
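The "normalize -> fetch answer key -> exact-match compare" flow from this proposal can be sketched as follows. This is an illustration under stated assumptions, not the bidder's implementation: options are assumed to be A to D, digits 1 to 4 are assumed to map to A to D, the key is passed in directly instead of being fetched from PostgreSQL, and OCR confusions such as B/8 are deliberately left out because they conflict with the digit mapping.

```python
import re
import unicodedata

# Assumption: numeric answers 1-4 correspond to options A-D.
DIGIT_TO_LETTER = {"1": "A", "2": "B", "3": "C", "4": "D"}


def normalize_answer(raw):
    """Map a raw answer string to a frozenset of option letters (empty if blank)."""
    # NFKC folds full-width letters and circled digits into plain ASCII.
    text = unicodedata.normalize("NFKC", raw or "").upper()
    tokens = re.findall(r"\b[A-D1-4]\b", text)
    return frozenset(DIGIT_TO_LETTER.get(token, token) for token in tokens)


def grade(student_raw, key_raw):
    """Deterministic check: True only when the normalized answer equals the key."""
    key = normalize_answer(key_raw)
    if not key:
        raise ValueError(f"unparseable answer key: {key_raw!r}")
    return normalize_answer(student_raw) == key


print(grade(" (b) ", "B"))       # format variation only
print(grade("A & C", "C, A"))    # multi-select, order ignored
print(grade("", "B"))            # blank answer
```

Returning a set makes multi-select answers comparable regardless of order, and a blank response never matches a non-empty key.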
₹12,500 INR in 3 days
4.4

Hi, I saw you are building an AI-driven question-paper checker and need help improving accuracy, especially for multiple-choice grading. I can help you evaluate multiple LLMs and open-source models in a fair and reproducible way, identify why misclassifications are happening, and tune your pipeline to improve correctness. I will benchmark models like GPT, Claude, and open-source options using your labeled dataset and provide precision, recall, and F1 results along with a clear comparison. I will also clean and improve your evaluation scripts in Python, ensure they work with your Hugging Face and PostgreSQL setup, and deliver an updated inference approach or prompt strategy aimed at achieving very high exact-match accuracy. Finally, I will include a simple hand-off guide so you can rerun evaluations anytime with new data. If needed, I can also suggest future improvements for short-answer and essay grading once multiple-choice accuracy is stabilized.
₹7,000 INR in 1 day
3.8

Hi, I have checked your requirements and I would suggest using an LLM together with an optimization algorithm, e.g. a genetic algorithm, that keeps the best-scoring solutions and discards the irrelevant ones. Another option is a RAG setup built on the given data, so that when a similar question comes in, it can easily decide which option to choose. Best
₹7,000 INR in 7 days
3.6

Hi, I can help you improve the accuracy of your AI-based question paper checker. I have experience working with Python, Hugging Face, and LLM evaluation workflows, including benchmarking and debugging classification errors in NLP pipelines. I can systematically compare models like GPT-4, Claude, and open-source transformers, run reproducible evaluations, and identify where misclassifications are coming from. I’ll provide a clean benchmark report with precision/recall/F1, well-documented evaluation scripts, and an improved inference/prompting setup aimed at maximizing exact-match accuracy. I’ll also include a simple hand-off guide so you can rerun evaluations easily in future. Happy to get started.
₹7,000 INR in 4 days
2.9

Hello there, if MCQs are being marked wrong, the issue is rarely just the model; it is usually evaluation logic, prompt ambiguity, or answer normalization breaking consistency. I understand your priority is accuracy first, and pushing toward a reliable 98 percent exact match before worrying about speed or scaling. Here is how I would approach this systematically:
1. Build a controlled benchmark pipeline to test multiple LLMs and transformer models on your labeled dataset
2. Identify failure patterns such as option parsing errors, formatting drift, and answer key mismatches
3. Standardize prompts and output formats to eliminate ambiguity in model responses
4. Implement strict post-processing and normalization for consistent answer comparison
5. Deliver a tuned inference layer with measurable gains in precision, recall, and F1
I have worked with Python and Hugging Face pipelines where small inconsistencies in parsing caused major accuracy drops, so the focus will stay on reproducibility and clean evaluation. One critical question before I design the benchmark: are your answer keys strictly structured (A, B, C, D), or do they sometimes include text variations that need semantic matching? Regards VK
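Steps 3 and 4 of this proposal (a standardized output format plus strict post-processing) could look roughly like the sketch below. The prompt wording, `build_prompt`, and `parse_model_output` are hypothetical, and four options labelled A to D are assumed; no particular model API is called.

```python
import re

# Hypothetical prompt template; the wording is an illustration only.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Options:\n{options}\n"
    "Reply with exactly one letter from A, B, C, D and nothing else."
)


def build_prompt(question, options):
    """Render a question and its four options into the fixed template."""
    lines = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", options))
    return PROMPT_TEMPLATE.format(question=question, options=lines)


def parse_model_output(output):
    """Return the option letter, or None when the reply breaks the one-letter contract."""
    cleaned = output.strip().upper()
    return cleaned if re.fullmatch(r"[A-D]", cleaned) else None


prompt = build_prompt("2 + 2 = ?", ["3", "4", "5", "6"])
print(prompt)
print(parse_model_output(" b\n"))              # accepted after trimming
print(parse_model_output("The answer is B"))   # rejected
```

Rejecting anything that is not a single letter, instead of trying to guess the intended option, keeps parsing failures visible as their own error category.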
₹7,000 INR in 7 days
2.7

Hello, your focus on accuracy first is exactly right: MCQ grading errors usually stem from ambiguous parsing, weak answer normalization, or inconsistent model prompting. I can help you systematically diagnose and fix this. My approach starts with building a reproducible evaluation pipeline in Python (Hugging Face + PostgreSQL compatible). I’ll benchmark multiple models (GPT, Claude, and strong open-source options like Llama-based variants) on your labeled dataset using strict exact-match, along with precision, recall, and F1. From there, I’ll isolate failure modes, e.g., option misalignment (A/B/C/D shifts), OCR/parse noise, prompt ambiguity, or model reasoning drift. I typically resolve these with:
• Deterministic answer extraction layers
• Structured prompting / function-calling
• Post-processing validation (answer key alignment)
• Lightweight fine-tuning or calibration if needed
I’ve worked on AI evaluation pipelines, LLM benchmarking, and NLP systems where accuracy and reproducibility were critical. My past work includes building grading engines, document parsers, and model comparison frameworks with measurable performance gains. You’ll receive clean, well-documented scripts/notebooks, a benchmark report, and an upgraded inference module targeting ≥98% exact-match accuracy, plus a clear handoff guide for future runs. Happy to share relevant portfolio samples and discuss your dataset specifics. Best Regards, JP
₹7,000 INR in 7 days
2.4

I'm Zainab, a highly accomplished Python developer and AI enthusiast, thrilled by this opportunity to enhance your AI-driven grading system. Drawing on my experience in this domain, I will carefully evaluate the suitability and performance of a diverse range of LLMs and open-source models, such as Falcon and other models available through Hugging Face, and even custom fine-tuned transformers. This comprehensive approach ensures we make an informed decision about the best fit for your system. Using my strong command of Python and my data management skills, I will deliver detailed benchmark reports featuring precision, recall, and F1 scores for each model on your labeled dataset. I won't just offer you statistical results, but also clean, easy-to-understand Jupyter notebooks where you can follow every step and reproduce any outcome. Moreover, my commitment extends beyond the evaluation phase: I will furnish an updated inference module or prompt library that reaches 98% exact-match accuracy not only on the test set but also when running future batches. With me on board, expect top-notch quality based on meticulous research and systematic intervention throughout. Reach out today for proficiency that optimizes your AI grading accuracy!
₹5,000 INR in 7 days
1.9

Hi, I can fix your AI grading accuracy. I've solved this exact problem many times. Here is what I will do:
• Benchmark GPT, Claude, and open-source models on your labeled MCQ set with fair, reproducible tests.
• Trace misclassifications to isolate parsing, prompt, or model issues and tune the pipeline.
• Deliver clean Python evaluation scripts, an updated inference module, and a hand-off guide to reach 98%+ exact-match accuracy.
• 10 days free support after delivery
• Milestone-based payment
Reply "YES" and
Best regards, syed ribal
₹12,500 INR in 3 days
0.0

I recently completed a project improving the accuracy of an AI-driven grading system by refining model evaluation and tuning the processing pipeline, which boosted precision and reduced misclassifications significantly. I am new to Freelancer, but I bring real experience from working on large-scale AI projects with companies like Meta and TSMC, where I focused on creating efficient, reliable solutions for complex problems. Your need for a clean, reproducible benchmarking process aligns well with my approach. I will deliver a seamless integration of evaluation scripts and ensure the inference module reliably achieves high exact-match accuracy, focusing on your Python and Hugging Face stack. I work with simplicity and structure, building solutions that are reliable and maintainable long term, avoiding unnecessary complexity from the start to ensure your pipeline is robust and easy to extend. I am ready to begin refining your system for consistent, accurate multiple-choice grading with clear, actionable benchmarks. If this aligns with your project, feel free to reach out. Regards Patrick.
₹9,400 INR in 4 days
0.0

Hi, your focus on accuracy first is exactly the right approach for an AI grading system, and this is something I’ve worked on in similar evaluation and model-tuning pipelines. I’m experienced with Python, Hugging Face, and LLM evaluation workflows, including benchmarking across both commercial (GPT, Claude) and open-source models (Llama, Falcon). I can help diagnose why your system is misclassifying MCQs and bring it to a reliable, production-ready accuracy level.
How I’ll approach this:
• Audit your current pipeline (prompting, parsing, answer matching logic)
• Build a reproducible benchmarking setup (precision, recall, F1) across multiple models
• Identify error patterns (format ambiguity, option mapping, reasoning gaps)
• Improve prompts and/or implement structured output validation
• Add fallback logic or ensemble methods if needed for higher accuracy
Deliverables:
• Detailed benchmark report comparing models on your dataset
• Clean, well-documented Python evaluation scripts/notebooks
• Optimized inference pipeline or prompt set targeting 98%+ exact-match accuracy
• Simple hand-off guide for running future evaluations
If needed, I can also suggest improvements like constrained decoding, answer normalization, or lightweight fine-tuning to further boost consistency. I focus on reproducibility, clarity, and measurable improvement, not guesswork. Available to start immediately. Best regards,
₹10,000 INR in 7 days
0.0

I am Haziq Hasnol from Malacca, Malaysia, and I am excited about the opportunity to work on refining an AI-driven question-paper checker and parser. My expertise in AI model development, machine learning, and Python programming, combined with experience in multiple-choice answer validation, positions me well to tackle the current challenges in accuracy. The current problem, as mentioned, involves correct answers being flagged as wrong and wrong answers being marked as correct, and I am confident that through a systematic evaluation of various models and fine-tuning, I can isolate the causes of these misclassifications and deliver a solution that ensures high precision and recall.
₹7,000 INR in 7 days
0.0

Hi! I have experience working with Python, Hugging Face, and model evaluation pipelines, and I can help improve the accuracy of your MCQ checker. I’ll benchmark multiple models (commercial LLMs and open-source), identify where misclassifications happen, and refine the pipeline to reach high reliability. I’ll provide clear evaluation metrics (precision, recall, F1), clean scripts, and an improved inference setup targeting ~98% accuracy. I focus on practical, reproducible solutions and clean documentation so you can run it easily later.
₹9,000 INR in 6 days
0.0

I will fix your misclassifications on MCQs before touching speed or features. My approach:
• Audit your dataset – I'll first identify whether errors come from ambiguous answer keys, tokenization mismatches, or model confidence thresholds.
• Benchmark 5+ models – GPT-4, Claude 3, Gemini, Llama 3, and a fine-tuned DeBERTa (best for MCQ logic). Reproducible scripts with fixed seeds.
• Root-cause isolation – Per-question confusion matrices to see which items fail (e.g., negation, "all of the above," similar distractors).
• Tuning pipeline – Prompt ensembling + logit calibration + rule-based overrides. Target: 98% exact-match accuracy.
• Deliverables – Clean Python notebooks + inference module + hand-off guide. PostgreSQL integration retained.
Why me: I've solved MCQ grading errors for e-learning platforms before. I know when to use an LLM vs. a lightweight classifier. I will not move to short-answer until your MCQs are flawless.
Timeline: 5–7 days for benchmark + tuned pipeline.
Let’s talk numbers: I will show you precision/recall/F1 per model before you pay for final integration.
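One simple way to see which items fail, in the spirit of the root-cause step above, is to count how many models miss each question. The sketch below is a simplification (a per-question failure tally, not a full confusion matrix), and the model names and predictions are made up for illustration.

```python
from collections import Counter


def failures_by_question(gold, predictions_by_model):
    """Count, per question id, how many models gave a non-matching answer."""
    failures = Counter()
    for preds in predictions_by_model.values():
        for qid, answer in gold.items():
            # A missing prediction counts as a failure too.
            if preds.get(qid) != answer:
                failures[qid] += 1
    return failures


# Hypothetical gold labels and model outputs.
gold = {"q1": "A", "q2": "C", "q3": "B"}
predictions_by_model = {
    "model_x": {"q1": "A", "q2": "D", "q3": "B"},
    "model_y": {"q1": "A", "q2": "B", "q3": "C"},
}
failures = failures_by_question(gold, predictions_by_model)
print(failures.most_common())
```

Questions that every model misses are good candidates for a closer look at the item itself or its answer key before blaming any single model.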
₹7,000 INR in 7 days
0.0

Hey, I'll skip the pitch and get straight to what I noticed: your misclassification issue almost certainly isn't a model problem. It's upstream: answer normalization not handling format variation ("B" vs "b)" vs "(B)"), prompts that ask the model to reason instead of compare, or OCR noise getting blamed on the LLM. Swapping models without diagnosing that first just moves the problem around. Here's what I'd deliver in 3 days:
Day 1: forensics on your labeled dataset, grouping every misclassified example by failure type so you know exactly where the pipeline breaks.
Day 2: evaluation harness built around your PostgreSQL schema, fair model comparison (GPT-4, Claude, open-source) under identical conditions, benchmark report with precision, recall, and F1.
Day 3: updated inference module and prompt library hitting ≥98% exact-match, plus a hand-off guide you can run future batches from without reverse-engineering anything.
I'm a self-taught developer who ships production tools with real users. I find root-cause work like this genuinely interesting, which probably shows. Happy to start with a small trial on a dataset subset if you'd like to see the approach before committing fully.
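The Day 1 idea of grouping misclassified examples by failure type could be prototyped along these lines. The category names and the heuristics are assumptions chosen for illustration; a real taxonomy would come from inspecting the actual failures.

```python
from collections import defaultdict


def classify_failure(raw_output, key):
    """Assign a coarse, heuristic failure type to one mismatching output."""
    cleaned = raw_output.strip().upper()
    if not cleaned:
        return "empty_output"
    if cleaned.strip("().") == key:
        return "format_only"      # e.g. "b)" or "(B)" when the key is "B"
    if len(cleaned) > 3:
        return "verbose_output"   # the model wrote a sentence, not a letter
    return "wrong_option"


def group_failures(rows):
    """rows: iterable of (question_id, raw_output, key); returns type -> list of ids."""
    groups = defaultdict(list)
    for qid, raw_output, key in rows:
        if raw_output != key:
            groups[classify_failure(raw_output, key)].append(qid)
    return dict(groups)


# Hypothetical graded rows.
rows = [
    ("q1", "B", "B"),
    ("q2", "b)", "B"),
    ("q3", "", "A"),
    ("q4", "C", "B"),
    ("q5", "The answer is B", "B"),
]
groups = group_failures(rows)
print(groups)
```

If most failures land in `format_only` or `verbose_output`, the fix is in normalization or prompting, which matches the proposal's point that the problem is often upstream of the model.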
₹2,000 INR in 3 days
0.0

Hello, I am Rada Rabadi, an AI Engineer with a Bachelor’s degree in Computer Science and a Master’s degree in Artificial Intelligence. I have strong experience in Python, machine learning, NLP, model evaluation, and prompt optimization. For your AI-driven question paper checker project, I can help improve the accuracy of MCQ answer validation by evaluating multiple LLMs and open-source models such as GPT-4, Claude, Llama, and Hugging Face transformers. My focus will be on identifying the root causes of misclassification, improving prompt logic, optimizing parsing methods, and refining the inference pipeline to achieve highly reliable results. I will provide:
* Benchmark report with Precision, Recall, and F1-score comparison
* Clean and documented Python scripts / notebooks
* Improved inference module or prompt system targeting 98%+ exact-match accuracy
* Clear handoff guide for future usage and scalability
I am detail-oriented, analytical, and committed to delivering accurate and practical AI solutions. I would be glad to discuss your current system and help optimize it efficiently. Looking forward to working with you. Best regards, Rada Rabadi
₹7,000 INR in 7 days
0.0

Hello, I am a Data Scientist with experience in NLP, LLM evaluation, and AI pipeline optimization. I can help benchmark commercial and open-source models, identify misclassification issues, and improve your multiple-choice grading system for higher precision, recall, and F1 performance. I work with Python, Hugging Face, and database-integrated workflows, and can deliver:
• Reproducible benchmark reports
• Clean evaluation scripts/notebooks
• Optimized inference or prompt pipelines
• Clear hand-off documentation
My focus is on accuracy, reliability, and measurable model improvement. Best regards,
₹10,000 INR in 7 days
0.0

I’m refining an AI-driven question-paper checker and parser, and my top priority right now is accuracy. With multiple-choice items the system too often marks wrong answers as correct and flags correct ones as wrong, so I need a specialist who can put several models through their paces, isolate the causes of these misclassifications, and tune the pipeline until the results are consistently reliable. Speed and consistency will matter later, but first I need the numbers to be right. You’ll have freedom to evaluate any mix of commercial LLMs (GPT-4, Claude, Bard, etc.) and open-source options (Llama 2, Falcon, custom fine-tuned transformers) so long as you keep the comparison fair and reproducible. The current stack is Python with Hugging Face and a small PostgreSQL
₹7,000 INR in 7 days
2.0

Hi, I can help you check and improve the accuracy of your multiple-choice answer checker. From your description, I would not start by only switching models. The issue may come from the prompt, answer-key matching, option formatting, parser output, label issues, or the evaluation logic. I would first build a simple reproducible benchmark, review the wrong cases, and then improve the pipeline based on the actual error patterns. I can help with a Python evaluation script or notebook, baseline metrics such as exact-match accuracy, precision, recall, and F1, error analysis of wrong predictions, comparison of a small fair set of models or prompt strategies, and improvements to the prompt or inference logic. I have experience with AI system design, LLM workflows, prompt engineering, structured outputs, Python, Hugging Face, model evaluation, and backend AI integration. Your Python + Hugging Face + PostgreSQL stack should be fine, so I would avoid adding another framework unless it is really needed. I can work toward the 98% accuracy target, but I would first need to review the dataset, answer-key format, and failed examples to confirm how realistic it is. Could you share a small dataset sample, current model output, and a few wrong cases?
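A "simple reproducible benchmark" of the kind this proposal describes might begin as a loop that feeds every model the same items and applies the same scoring rule. In the sketch below each model is represented by a plain Python callable; the stand-in models and the tiny dataset are hypothetical, and real API or Hugging Face calls would be wrapped behind the same interface.

```python
def run_benchmark(models, dataset):
    """Score every model on identical items with the same exact-match rule.

    models: dict of name -> callable(question) -> option letter
    dataset: list of (question, gold_letter) pairs
    """
    scores = {}
    for name, predict in models.items():
        correct = sum(predict(question) == gold for question, gold in dataset)
        scores[name] = correct / len(dataset)
    return scores


# Hypothetical dataset and stand-in "models" for illustration.
dataset = [("Q1 text", "B"), ("Q2 text", "A"), ("Q3 text", "C")]
models = {
    "always_a": lambda question: "A",
    "lookup": dict(dataset).get,  # a stand-in that happens to know every answer
}
scores = run_benchmark(models, dataset)
print(scores)
```

Keeping the dataset and scoring inside one function means any difference between scores comes from the models and not from how each one was evaluated.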
₹9,700 INR in 7 days
0.0

NOIDA, India
Payment method verified
Member since Oct 29, 2014