GitHub Copilot is one of the most advanced AI developer tools and certainly the most widely adopted one. This post explores how GitHub transitioned from testing to fine-tuning, and finally to scaling this innovative AI tool in a production environment.
A very important part of being an AI engineer is being equally good at software and product engineering. This post contains equal parts AI, software and product engineering to give you a holistic view.
The GitHub Next team took over 3 years to build the tool. To go from idea to production, they used the Find it, Nail it, Scale it methodology.
Find it: Identify and Isolate
In the current AI landscape, it is very easy to come up with a solution looking for a problem. It is imperative to instead identify, isolate and scope a single problem.
Identifying the target market: For Copilot, this is developers at all stages. Beginners can quickly learn syntax and get something working, while experts can speed up their workflows with predictive text.
A coding assistant was always on the minds of the GitHub Next team. Only after the launch of OpenAI's GPT-3 did they consider the technology good enough to start exploring how a code generation tool might work.
Focusing on just one problem: Instead of addressing every developer pain point with the OpenAI APIs, Copilot initially aimed to solve a single problem: predicting code in the IDE.
An important question here is why they specifically chose the IDE as the initial problem. For three reasons:
1. The chat format was not yet as popular: Chat as a modality only became widely familiar after the launch of ChatGPT.
2. IDEs are the primary input surface for every developer: Having predictive text show up directly in the IDE made the interaction almost seamless and instantly interactive.
3. Constantly switching back and forth between a web browser and the IDE is time-consuming and bad UX.
Balancing product ambition with quality: Start by evaluating what an LLM can actually do. Comprehensive LLM evaluation is a task in itself, but it is imperative before launching a product.
The GitHub Next team crowdsourced self-contained problems to help test the model. They explored generating entire commits, but on realising the state of the art couldn't support that functionality at the time, they landed on generating code at the function level.
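To make that concrete, here is a minimal sketch of what a function-level evaluation harness can look like: each self-contained problem pairs a function signature with a few unit tests, and the model's completion passes only if the tests do. The helper and the sample problem are illustrative, not GitHub's actual harness.

```python
# Minimal, illustrative function-level evaluation harness (not GitHub's actual setup).
def generate_completion(prompt: str) -> str:
    # Placeholder for a call to the code model; hardcoded here so the sketch runs.
    return "    return s == s[::-1]\n"

def solves_problem(prompt: str, tests: str) -> bool:
    """Run the completed function plus its unit tests in a scratch namespace."""
    namespace: dict = {}
    try:
        exec(prompt + generate_completion(prompt), namespace)  # define the function
        exec(tests, namespace)                                 # assertions raise on failure
        return True
    except Exception:
        return False

problems = [{
    "prompt": "def is_palindrome(s: str) -> bool:\n",
    "tests": "assert is_palindrome('level')\nassert not is_palindrome('github')",
}]

pass_rate = sum(solves_problem(p["prompt"], p["tests"]) for p in problems) / len(problems)
print(f"pass rate: {pass_rate:.0%}")
```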
Nail it: Iterate
Building quick iteration cycles into the product development process allows teams to fail and learn fast. The first model GitHub got from OpenAI was a Python-only model, then a JavaScript model, and finally a multilingual one.
GitHub then collaborated with OpenAI to launch Codex, a 170-billion-parameter model. The third iteration of the Codex model is where the team could finally feel the improvement.
The model was available as an API which businesses could build on. While the base model was good, the team wanted to make improvements before going live.
Model Improvements
Prompt Crafting
With LLMs, you have to be very specific with the inputs and the expected output. Thinking of an LLM as a document completion model, the art of prompt crafting is really all about creating a ‘pseudo-document’ that will lead the model to a completion that benefits the customer.
Context is a very important part of prompting an LLM. Initially, Copilot was provided with the context of the current file and the user's prompt. The team noted that the results varied a lot and were not consistent.
The secret sauce: The team soon realised that additional context was valuable and improved outputs.
Where did they get additional context from? Similar lines in the user's neighbouring editor tabs!
The concept of neighbouring tabs helped increase the acceptance rate of Copilot’s suggestions by 5%.
A side benefit of this, and part of why Copilot can make you 55% faster, is that it surfaces relevant code from other tabs that you would otherwise have to flip back and forth to find.
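As a rough illustration of the idea, the sketch below assembles a prompt from the code before the cursor plus the most similar snippets found in other open tabs. The Jaccard-style token overlap used for scoring is an assumption for illustration, not Copilot's published heuristic.

```python
# Illustrative sketch: build a prompt from the current file plus similar snippets
# drawn from neighbouring editor tabs. The similarity heuristic is an assumption.

def tokens(text: str) -> set[str]:
    return set(text.split())

def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def build_prompt(prefix: str, neighbouring_tabs: dict[str, str], top_k: int = 2) -> str:
    """Prepend the most similar snippets from other tabs, then the code before the cursor."""
    scored = sorted(
        neighbouring_tabs.items(),
        key=lambda item: jaccard(prefix, item[1]),
        reverse=True,
    )
    context_blocks = [f"# Snippet from {path}\n{snippet}" for path, snippet in scored[:top_k]]
    return "\n\n".join(context_blocks + [prefix])

prompt = build_prompt(
    prefix="def parse_config(path):\n    ",
    neighbouring_tabs={"settings.py": "DEFAULT_CONFIG = {'debug': False}"},
)
```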
Fine-Tuning
Fine-tuning involves training a pre-trained base model on a specific, smaller dataset that is relevant to a particular use case. This allows the model to capture the nuances of the new data and improves performance for that specific task.
When fine-tuning, it is extremely important to define what constitutes a statistically "good" response. At the time, training a 170-billion-parameter model like Codex came with its own inherent complexities.
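For readers unfamiliar with what fine-tuning looks like in practice, here is a minimal sketch using Hugging Face Transformers on a tiny, project-specific set of code snippets. The model name, dataset and hyperparameters are placeholders; GitHub's actual training setup is not public.

```python
# Illustrative fine-tuning sketch; model, data and hyperparameters are placeholders.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bigcode/santacoder"  # any small, open code model works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

snippets = ["def add(a: int, b: int) -> int:\n    return a + b\n"]  # project-specific code
dataset = Dataset.from_dict({"text": snippets}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="copilot-style-finetune", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```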
Why did GitHub want to fine-tune?
There is not much information provided on this but here are a couple of reasons why they possibly wanted to:
Personalised suggestions: Training the base model on a specific user's codebase could provide more focused, customised completions in the style that user codes in.
Continual learning: Once deployed, Copilot would collect gold-standard data on whether a suggestion was retained or not. This could serve as the basis for improving the model.
Enterprise specific: Copilot is available to enterprises, and training on internal languages and codebases might have been a path GitHub considered.
Language specific: The GitHub and OpenAI teams believed every coding language would require its own fine-tuned model. But the field of generative AI was advancing so rapidly that this turned out to be unnecessary.
Scale it: Optimise quality, usability and costs
Ensuring consistent results
LLMs are inherently probabilistic, so they won't always produce the same output. However, in a production setting, the same prompt and the same context should produce the same suggestions from the AI model. The GitHub team set up a pipeline to address this challenge:
- Caching the responses: Not only did this reduce the variability in suggestions, it also reduced latency, improving performance.
- Randomness parameters: Parameters like temperature control the randomness of an LLM's outputs. The GitHub team tuned these parameters to reduce the randomness; see the sketch after this list for both ideas.
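A minimal sketch of both ideas, assuming a hypothetical call_model function standing in for the real completion API:

```python
# Illustrative sketch: a cached, low-temperature completion wrapper.
# `call_model` is a hypothetical stand-in for the underlying model API;
# the exact parameters GitHub tuned are not public.
from functools import lru_cache

def call_model(prompt: str, temperature: float) -> str:
    """Placeholder for the actual completion API call."""
    ...

@lru_cache(maxsize=10_000)
def get_suggestion(prompt: str, temperature: float = 0.0) -> str:
    # Identical prompt + context -> identical cached suggestion, and no repeat
    # round-trip to the model, which also cuts latency. A temperature near 0
    # keeps sampling close to deterministic.
    return call_model(prompt, temperature)
```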
Metrics
Metrics serve as the compass in the iterative process of refining and scaling AI technologies. GitHub Copilot's development highlighted how vital it is to choose and track the right metrics to ensure the AI not only functions as intended but also aligns with user expectations and needs.
One base metric that was tracked was the acceptance rate of suggestions. Along the same lines, code retention rate was tracked: how much of the suggested code a developer kept versus edited away.
When gathering feedback, the team found that the model's performance had decreased. This led to the implementation of further guardrail metrics to ensure only the best models are launched. One example is the percentage of single-line vs multi-line suggestions. Another possible metric could be error reduction rate.
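As an illustration, the sketch below computes acceptance rate and a simple retention score from logged suggestion events. The event schema and the use of difflib similarity as a retention proxy are assumptions for the example, not GitHub's instrumentation.

```python
# Illustrative metrics over logged suggestion events (schema invented for the example).
from difflib import SequenceMatcher

events = [
    {"suggested": "return a + b", "accepted": True,  "final": "return a + b"},
    {"suggested": "return a - b", "accepted": False, "final": ""},
]

acceptance_rate = sum(e["accepted"] for e in events) / len(events)

def retention(suggested: str, final: str) -> float:
    """Share of the suggested text still present in the final code."""
    return SequenceMatcher(None, suggested, final).ratio()

accepted = [e for e in events if e["accepted"]]
avg_retention = sum(retention(e["suggested"], e["final"]) for e in accepted) / max(1, len(accepted))

print(f"acceptance: {acceptance_rate:.0%}, retention (accepted): {avg_retention:.0%}")
```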
Optimising Costs
- Moving to the cloud:
Originally, GitHub Copilot ran on the back of the OpenAI API. As the product grew, they scaled onto Microsoft Azure infrastructure. This move brought reduced costs and more autonomy over guardrails, releases and iterative improvements.
- Just one prediction:
Before GitHub Copilot settled on ghost text (the grey text that appears inline as a coding suggestion), the tool would generate 10 suggestions and display them all at once. Not only was this bad UX, there was a major compute cost attached to it. Further, they measured that most people chose the first suggestion, so they did away with the list of 10 in favour of just one.
Other Takeaways
A neat insight from Idan Gazit when designing Copilot was, “We have to design apps not only for models whose outputs need evaluation by humans, but also for humans who are learning how to interact with AI.”
Integrate experimentation and tight feedback loops into the design process: This is especially critical when working with LLMs, where outputs are probabilistic and most end users are just learning how to interact with AI models.
As you scale, continue to leverage user feedback and prioritise user needs. Doing so will ensure that your product is built to deliver consistent results and real value.
Copilot in VSCode
A small section on actually running GitHub Copilot. It's extremely simple: just download the extension from the VS Code marketplace. The extension is paid, but a lesser-known fact is that it's free for students!
For those who don't want to spend money on this, or who want to run an LLM locally, consider using Continue with Ollama.
Need help building a ML Product?
Book a 1-on-1 session with me!