Introduction
Research asks, “Can this work?” Engineering asks, “Can this keep working for real users under real constraints?”
That transition changes the project. A promising notebook is not yet a product capability. It still needs a defined user problem, reliable data, an evaluation plan, a deployment path, monitoring, cost control, and a maintenance plan.
This post is a pre-project checklist for AI and ML work. It is meant to be used before the team commits to a build.
Start With Impact
Before choosing a model, define the value of the system.
Ask:
- Who is the user or stakeholder?
- What decision or workflow will change?
- What does success look like in business, product, or research terms?
- What happens if the model is wrong?
- How much latency, cost, or uncertainty can the user tolerate?
- Is machine learning necessary, or would rules, search, analytics, or a simpler workflow solve the problem?
The last question matters. ML is expensive to operate, so if a simpler system solves the problem, use the simpler system.
Define the Product Contract
Turn the idea into a contract the team can evaluate.
Goal
Write one sentence that describes the system:
Given [input], the system should produce [output] for [user] so that [decision or workflow] improves.
Examples:
- Given a support ticket, predict its routing queue so the support team can respond faster.
- Given a document collection and a user query, retrieve relevant passages so an assistant can answer with sources.
- Given transaction features, estimate fraud risk so suspicious activity can be reviewed.
Constraints
Capture constraints early:
- Latency budget.
- Throughput target.
- Availability target.
- Cost ceiling.
- Privacy and compliance requirements.
- Human review requirements.
- Supported languages, regions, or user groups.
- Deployment environment.
These constraints shape the model and infrastructure choices more than model preference does.
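One way to keep the contract honest is to write it down as data the team can check measurements against. The sketch below is illustrative only: the class names, field names, and thresholds are assumptions made for this example, not a standard schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Constraints:
    # Hypothetical fields; pick the ones that matter for your system.
    latency_budget_ms: int
    monthly_cost_ceiling_usd: float
    availability_target: float
    requires_human_review: bool


@dataclass(frozen=True)
class ProductContract:
    input_description: str
    output_description: str
    user: str
    workflow: str
    constraints: Constraints

    def goal_sentence(self) -> str:
        """Render the one-sentence goal from the template in this post."""
        return (
            f"Given {self.input_description}, the system should produce "
            f"{self.output_description} for {self.user} "
            f"so that {self.workflow} improves."
        )

    def violations(self, p95_latency_ms: float, monthly_cost_usd: float) -> List[str]:
        """Compare measured values against the agreed constraints."""
        problems = []
        if p95_latency_ms > self.constraints.latency_budget_ms:
            problems.append(
                f"p95 latency {p95_latency_ms} ms exceeds budget "
                f"{self.constraints.latency_budget_ms} ms"
            )
        if monthly_cost_usd > self.constraints.monthly_cost_ceiling_usd:
            problems.append(
                f"monthly cost {monthly_cost_usd} USD exceeds ceiling "
                f"{self.constraints.monthly_cost_ceiling_usd} USD"
            )
        return problems


contract = ProductContract(
    input_description="a support ticket",
    output_description="a routing queue",
    user="the support team",
    workflow="response time",
    constraints=Constraints(300, 2000.0, 0.999, False),
)
print(contract.goal_sentence())
print(contract.violations(p95_latency_ms=450, monthly_cost_usd=1500.0))
```

Keeping the contract in version control next to the code makes it easier to notice when a model change quietly breaks a constraint.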
Failure Modes
List unacceptable failures:
- Harmful or unsafe outputs.
- Unfair treatment of important user groups.
- Silent data leakage.
- Hallucinated citations.
- High-confidence wrong predictions.
- Slow responses during peak traffic.
- Model behavior that cannot be audited.
If the failure mode is severe, design the fallback before the model is deployed.
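A fallback can be as simple as a wrapper that refuses to trust the model when it errors or when its confidence is low. The following is a minimal sketch under stated assumptions: `model` and `fallback` are hypothetical callables, the model is assumed to return a `(label, confidence)` pair, and the `0.8` threshold is a placeholder to be tuned on real data.

```python
from typing import Any, Callable, Tuple


def predict_with_fallback(
    model: Callable[[Any], Tuple[str, float]],
    fallback: Callable[[Any], str],
    features: Any,
    min_confidence: float = 0.8,  # assumed threshold, tune per use case
) -> Tuple[str, str]:
    """Return (label, source), where source records which path produced the label."""
    try:
        label, confidence = model(features)
    except Exception:
        # The model or one of its dependencies failed: use the safe path.
        return fallback(features), "fallback:error"
    if confidence < min_confidence:
        # The model answered but is not confident enough to act on.
        return fallback(features), "fallback:low_confidence"
    return label, "model"


# Stub callables standing in for a real model and a real fallback rule.
def confident_model(features):
    return "billing", 0.95


def unsure_model(features):
    return "billing", 0.40


def broken_model(features):
    raise RuntimeError("feature store unavailable")


def route_to_human(features):
    return "manual_review"


print(predict_with_fallback(confident_model, route_to_human, {"text": "refund"}))
print(predict_with_fallback(unsure_model, route_to_human, {"text": "refund"}))
print(predict_with_fallback(broken_model, route_to_human, {"text": "refund"}))
```

Returning the source alongside the label lets monitoring track how often the fallback fires, which is itself a useful health signal.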
Data Readiness
Data is usually the real project.
Check:
- Source: where the data comes from and who owns it.
- Access: whether the team can legally and technically use it.
- Definition: what each field means.
- Coverage: which users, time periods, languages, or domains are missing.
- Quality: missing values, duplicates, noise, outliers, and inconsistent labels.
- Freshness: how often the data changes.
- Versioning: how datasets and feature definitions are tracked.
- Privacy: whether sensitive fields need removal, masking, or access control.
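Some of the quality checks above can be automated early. The sketch below counts missing values and duplicate keys over a list of records, using only the standard library. The function name, the field names, and the choice to treat empty strings as missing are assumptions for illustration.

```python
from collections import Counter
from typing import Any, Dict, List


def quality_report(
    rows: List[Dict[str, Any]],
    required_fields: List[str],
    key_field: str,
) -> Dict[str, Any]:
    """Count missing required values and find keys that appear more than once."""
    missing = Counter()
    for row in rows:
        for name in required_fields:
            # Treat absent fields, None, and empty strings as missing.
            if row.get(name) in (None, ""):
                missing[name] += 1
    key_counts = Counter(
        row[key_field] for row in rows if row.get(key_field) is not None
    )
    duplicate_keys = sorted((k for k, n in key_counts.items() if n > 1), key=str)
    return {
        "rows": len(rows),
        "missing": dict(missing),
        "duplicate_keys": duplicate_keys,
    }


rows = [
    {"id": "a", "text": "hi", "label": "x"},
    {"id": "b", "text": "", "label": "y"},
    {"id": "a", "text": "yo", "label": None},
]
print(quality_report(rows, required_fields=["text", "label"], key_field="id"))
```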
For supervised learning, label quality deserves its own plan. Define label guidelines, review disagreements, and measure inter-annotator agreement when humans provide labels.
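One common way to quantify consistency between two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The post does not prescribe a specific statistic, so treat this as one reasonable option. A minimal sketch for two annotators labeling the same items:

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label, so agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


annotator_1 = ["x", "x", "y", "y"]
annotator_2 = ["x", "x", "y", "x"]
print(cohens_kappa(annotator_1, annotator_2))
```

In this toy example the annotators agree on three of four items (0.75), chance agreement is 0.5, and kappa works out to 0.5.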
For LLM systems, data readiness includes prompts, retrieval corpora, documents, chunking, metadata, and source trust.
Evaluation Plan
A project without an evaluation plan is a demo, not an engineering effort.
Define:
- Offline metrics.
- Slice metrics.
- Baseline model or baseline workflow.
- Human evaluation criteria, if needed.
- Regression tests.
- Production guardrail metrics.
Examples:
- Classification: precision, recall, F1, calibration, and confusion matrix by slice.
- Ranking or retrieval: recall@k, MRR, nDCG, and qualitative failure review.
- Generation: task success, factuality, citation accuracy, refusal behavior, latency, and cost.
- Forecasting: error by horizon, seasonality, and segment.
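For the retrieval metrics listed above, recall@k and MRR are short enough to implement directly, which helps when checking a library's numbers. This is a sketch with hypothetical document IDs; it assumes each query has at least one known relevant document for recall@k.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall@k is undefined without relevant documents")
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Iterable[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not queries:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


queries = [
    (["d3", "d1"], {"d1"}),  # first relevant hit at rank 2
    (["d2"], {"d2"}),        # first relevant hit at rank 1
    (["d5"], {"d6"}),        # no relevant hit
]
print(recall_at_k(["d3", "d1", "d7"], {"d1", "d9"}, k=2))
print(mean_reciprocal_rank(queries))
```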
Avoid optimizing only the headline metric. Real systems fail in slices.
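To make slice failures visible, compute the same metric per slice rather than once over the whole test set. The sketch below reports precision and recall per slice for a binary classifier. The record layout `(slice_name, y_true, y_pred)` and the slice names are assumptions for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def precision_recall_by_slice(
    records: Iterable[Tuple[str, int, int]],
    positive: int = 1,
) -> Dict[str, Dict[str, Optional[float]]]:
    """Per-slice precision and recall; None where the metric is undefined."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for slice_name, y_true, y_pred in records:
        c = counts[slice_name]  # creates the slice entry even for true negatives
        if y_pred == positive and y_true == positive:
            c["tp"] += 1
        elif y_pred == positive:
            c["fp"] += 1
        elif y_true == positive:
            c["fn"] += 1
    result = {}
    for name, c in counts.items():
        predicted_pos = c["tp"] + c["fp"]
        actual_pos = c["tp"] + c["fn"]
        result[name] = {
            "precision": c["tp"] / predicted_pos if predicted_pos else None,
            "recall": c["tp"] / actual_pos if actual_pos else None,
        }
    return result


records = [
    ("en", 1, 1), ("en", 0, 1), ("en", 1, 1),
    ("de", 1, 0), ("de", 1, 1), ("de", 0, 0),
]
for name, metrics in precision_recall_by_slice(records).items():
    print(name, metrics)
```

Here the hypothetical "de" slice has half the recall of "en", a gap that a single aggregate number would blur.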
Model and Algorithm Choice
Start with a baseline.
A good baseline is:
- Easy to train and debug.
- Cheap enough to run often.
- Strong enough to reveal whether the data has signal.
- Simple enough to compare against future models.
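The simplest baseline that satisfies these criteria for classification is often "always predict the most common class". The sketch below is one illustrative choice, not the only reasonable baseline; the class name and labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


class MajorityClassBaseline:
    """Predicts the most frequent training label for every input."""

    def fit(self, labels: Sequence[Hashable]) -> "MajorityClassBaseline":
        if not labels:
            raise ValueError("need at least one training label")
        self.prediction_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n_items: int) -> List[Hashable]:
        return [self.prediction_] * n_items


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


train_labels = ["billing", "billing", "billing", "tech"]
test_labels = ["billing", "tech", "billing", "billing"]

baseline = MajorityClassBaseline().fit(train_labels)
predictions = baseline.predict(len(test_labels))
print(baseline.prediction_, accuracy(test_labels, predictions))
```

Any candidate model that cannot clearly beat this number on the same test set has not yet shown that it learned anything from the features.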
Then decide how much complexity the problem deserves:
- Use rules when requirements are deterministic.
- Use classical ML when tabular signal is strong and interpretability matters.
- Use deep learning when representation learning is necessary.
- Use retrieval or search when the system needs grounding in changing knowledge.
- Use LLMs when language reasoning, transformation, or generation is central.
Model choice should follow the product contract, not the hype cycle.
System Design
An ML system usually contains more than the model:
- Data ingestion.
- Feature or document processing.
- Training or fine-tuning.
- Evaluation.
- Model registry.
- Serving.
- Logging.
- Monitoring.
- Feedback collection.
- Retraining or update workflow.
For more detail, see:
- Testing in Machine Learning
- Deep Learning Training: A Practical Guide
- MLOps Post-Training Considerations
Infrastructure Questions
Answer these before implementation:
- Where will training run?
- Where will inference run?
- What hardware is required?
- What is the expected request volume?
- How will model artifacts be stored and promoted?
- How will configs and secrets be managed?
- How will logs and metrics be collected?
- How will the system roll back?
- What happens if a dependency is unavailable?
For many teams, the right early answer is not a large platform. It is a small, reproducible pipeline with clear ownership.
Human Interface
The interface determines how users experience model uncertainty.
Design for:
- Clear inputs and outputs.
- Confidence or uncertainty when useful.
- Editable outputs when users remain responsible.
- Explanations that match the user’s actual decision.
- Feedback capture.
- Safe fallback paths.
For high-stakes use cases, avoid pretending the model is an oracle. Make review, override, and audit paths obvious.
Production Risks
Data Drift
Input distributions change. Users change behavior. Vendors change schemas. External events shift patterns.
Plan drift monitoring before launch.
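One widely used drift signal for a single numeric feature is the Population Stability Index (PSI), which compares binned proportions between a reference sample and a recent sample. The post does not name a specific method, so this is a sketch of one option. The bin count and the small `eps` floor for empty bins are assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,  # floor for empty bins so the log stays finite
) -> float:
    """PSI between two samples, using equal-width bins over the reference range."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if lo == hi:
        raise ValueError("reference sample has no spread to bin")
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range go into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    p = proportions(reference)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


reference = [i / 100 for i in range(100)]      # roughly uniform on [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]  # concentrated in the upper half

print(population_stability_index(reference, reference))
print(population_stability_index(reference, shifted))
```

Identical samples give a PSI of zero, and larger values mean a larger shift. Alert thresholds are conventions rather than guarantees, so calibrate them against historical data for each feature.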
Feedback Loops
If model outputs influence future data, the model can train on its own effects. Recommendation, moderation, pricing, and ranking systems are especially exposed.
Hidden Coupling
Models depend on feature definitions, data pipelines, schemas, prompts, retrieval indexes, and downstream consumers. A small upstream change can break behavior without changing model code.
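A lightweight guard against this kind of coupling is to validate incoming records against the schema the model was trained on. The sketch below uses a hypothetical schema and field names, and checks only presence and Python type. A real system would likely also check ranges, units, and categorical values.

```python
from typing import Any, Dict, List

# Hypothetical schema for illustration: field name -> expected Python type.
EXPECTED_SCHEMA = {
    "amount": float,
    "country": str,
    "account_age_days": int,
}


def schema_violations(record: Dict[str, Any], schema: Dict[str, type]) -> List[str]:
    """List missing fields, wrong types, and unexpected fields in one record."""
    problems = []
    for name, expected_type in schema.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(record[name]).__name__}"
            )
    # Fields the model never saw often signal an upstream change.
    for name in sorted(set(record) - set(schema)):
        problems.append(f"unexpected field: {name}")
    return problems


# An upstream change started sending amount as a string and added a new field.
record = {"amount": "12.50", "country": "DE", "currency": "EUR"}
for problem in schema_violations(record, EXPECTED_SCHEMA):
    print(problem)
```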
Cost Growth
Inference cost can grow faster than expected. Track cost per request, cost per successful task, and cost by user segment.
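The three cost views above can be derived from the same request log. This is a minimal sketch that assumes each logged request carries a cost, a success flag, and a user segment. The field names are hypothetical.

```python
from typing import Any, Dict, List


def cost_summary(requests: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Cost per request, cost per successful task, and total cost by segment."""
    if not requests:
        raise ValueError("need at least one request")
    total = sum(r["cost_usd"] for r in requests)
    successes = sum(1 for r in requests if r["succeeded"])
    by_segment: Dict[str, float] = {}
    for r in requests:
        by_segment[r["segment"]] = by_segment.get(r["segment"], 0.0) + r["cost_usd"]
    return {
        "cost_per_request": total / len(requests),
        # Failed requests still cost money, so they raise the cost per success.
        "cost_per_successful_task": total / successes if successes else None,
        "cost_by_segment": by_segment,
    }


request_log = [
    {"cost_usd": 0.5, "succeeded": True, "segment": "free"},
    {"cost_usd": 0.25, "succeeded": False, "segment": "free"},
    {"cost_usd": 0.75, "succeeded": True, "segment": "pro"},
]
print(cost_summary(request_log))
```

Cost per successful task is usually the more honest number, because retries and failed generations are paid for even when the user gets nothing.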
Ownership Gaps
Someone must own:
- Data quality.
- Model quality.
- Deployment.
- Monitoring.
- Incident response.
- Retraining.
- Documentation.
If ownership is unclear, the system will decay.
Launch Checklist
Before launch, the team should be able to answer “yes” to these:
- The target user and workflow are clear.
- There is a baseline.
- Evaluation covers important slices.
- The model can be reproduced from code, config, and data version.
- The deployment path is tested.
- Logs and metrics are in place.
- The system has fallback behavior.
- Rollback is possible.
- Privacy and compliance requirements are reviewed.
- The team knows who owns incidents and updates.
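For the reproducibility item in the checklist, one simple habit is to record a fingerprint of the code version, config, and data version with every trained model. The sketch below shows one possible approach. The function name and inputs are assumptions, and it presumes the config is JSON-serializable.

```python
import hashlib
import json
from typing import Any, Dict


def run_fingerprint(code_commit: str, config: Dict[str, Any], data_version: str) -> str:
    """Deterministic SHA-256 hash of everything needed to reproduce a training run."""
    payload = json.dumps(
        {"code_commit": code_commit, "config": config, "data_version": data_version},
        sort_keys=True,         # key order must not change the hash
        separators=(",", ":"),  # fixed separators keep the serialization stable
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder values for illustration only.
fp_1 = run_fingerprint("abc123", {"lr": 0.001, "epochs": 5}, "tickets-v1")
fp_2 = run_fingerprint("abc123", {"epochs": 5, "lr": 0.001}, "tickets-v1")
fp_3 = run_fingerprint("abc123", {"lr": 0.001, "epochs": 5}, "tickets-v2")

print(fp_1 == fp_2)  # same inputs in a different key order
print(fp_1 == fp_3)  # different data version
```

Storing the fingerprint with the model artifact gives rollback and incident review a quick way to confirm which inputs produced a deployed model.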
Closing
The best ML projects start with engineering humility. Define the user problem, prove the data has signal, measure what matters, and make the system observable before it becomes important.
Research quality asks whether an idea is promising. Engineering quality asks whether the promise survives contact with users, infrastructure, cost, and time.