Engineering a Machine Learning Product

Our goal felt hidden, out of reach.

Our product, Frontier, focuses on building better student writers in grades 4-8 by challenging them with engaging narrative, informational, and argumentative writing projects: how does severe weather affect our lives? What does it take to be a gymnast? Do cats make good pets?

The goal to focus on building better writers through fast feedback grew out of our observations in teacher interviews and product research. We noticed state and national education standards increasingly emphasizing writing. We also discovered difficulties assessing student writing in a fast and complete way. Teachers were feeling the mounting pressure of teaching writing better, but lacked the opportunity to actually fulfill those requirements.

We thought it might be possible to provide fast, specific, actionable writing feedback to students using machine learning assisted automation. However, we were only three people in a small company that could only spare a limited time for open-ended exploration. We had to quickly identify what was actually worth delivering, and then successfully deliver the product feature.

This story is about how our team was able to rapidly discover and iterate with the help of certain tools, design decisions, and practical software engineering—ultimately delivering features that nurture students to be more confident, engaged writers.

Different Kinds of Risk

As Priya, the product manager on our team (alongside our CTO, Luke), sums up here, we focused first on product risk: the risk that we’re building the wrong product. We identified a valuable output we wanted: specific, actionable feedback for student writing. We defined a data gathering strategy: using human graders and a custom rubric of specific questions to assess student writing in sufficient detail. And we experimented with metrics for keeping us on track: for example, bad feedback we don’t give is less troublesome than bad feedback we do give (we’ll talk more about evaluating success later).

But product risk was only one of the three kinds of risk I observed.

The second risk area was skill risk. Machine learning was a new capability for our team, and none of us had specific, recent experience with machine learning tools. If we rushed in without some knowledge, conversations, and skill building, we might spend too much time exploring possibilities, or end exploration too early and be overconfident in our solution ideas. There’s a neat analogy to the Explore/Exploit tradeoff that comes up in machine learning: how many options do you explore before taking an action?

The last risk area was the familiar technology risk. The reason many advocate iterations, prototyping, and continuous delivery is to account for the uncertainty of how long it will take to build something. To launch and run an ML based solution, we needed to learn certain languages or libraries deeply, maintain and grow a codebase that supported new processes and needed to be adaptable, and also manage data annotation and exploration in new ways.

To manage these risks, we organized our work into three successive phases:

Definition: what should we build?
Discovery: what should we deliver?
Delivery: how should we actually deliver the product?

Chapter 1: Definition

Product and technology research

While the two engineering roles focused on the technology and skill problems, the product role led the effort to solve the product problem: interviewing, prototyping, researching, and continually maintaining focus on product and business value. It’s useful to do this first, because without this piece, you end up spinning forever on amazing possibilities that aren’t useful to your users. For more, you can read this post, where Priya distills the key activities you should be doing as a Product Manager.

This step also involved identifying and defining the inputs into the product: data about student writing. Unlike the pre-defined academic problems I had encountered, this was the real world of undefined information that we had to contextualize and understand from scratch. In many machine learning projects, curating and annotating a dataset is a foundational requirement: it’s what allows you to train models, validate their accuracy, and measure your solution’s real-world performance. Priya helped us define and iterate on our data curation strategy by first defining a custom rubric of questions with which to assess student writing on very specific points. She consulted existing rubrics from state and national assessments, distilled common questions, and then tested feedback that we could provide based on grading done using that rubric. We eventually had questions like: “Does the writing include an introductory sentence or paragraph?” Our first such rubric helped us all understand the pieces of student writing in much more detail.

Rapidly Ramping on Practical Machine Learning

From the perspective of a senior software engineer without rigorous prior exposure to machine learning beyond college courses, learning ML is a daunting task, especially because the field is so broad and active.

The first step was simple: learn more about the state of the field. We zeroed in on the fast.ai Deep Learning 1 course. It’s free, self-paced, pragmatic, non-academic in focus, and it exposes you directly to cutting-edge approaches.

I would recommend this course over all others. Also, I’d recommend Paperspace — it’s a hosted environment with cheap access to good GPUs.

Another very useful resource is Hands-On Machine Learning with Scikit-Learn & TensorFlow. The first two chapters broadly introduce machine learning, build an overall conceptual framework, and then provide direct experience in doing a real machine learning project.

All three of us also set up one-on-one meetings, a dedicated Slack room, and regular meeting times to talk in person. By building a steady rhythm of pairing sessions, meetings, and discussions, we had many opportunities to talk and learn from each other. Sharing the same small office helped too!

Making some hard decisions

After we got our feet wet with dives into product risk reduction and skill risk reduction, we started to focus our efforts and define what we were building. This meant doing the following:

Choosing against deep learning

Based on what we learned by talking to others, from our experience figuring out our input data, and from how explainable we wanted to make our solution, we decided not to entertain fancy deep learning algorithms.

One factor in this decision was that our input data was not that rich, up front. Since we support many kinds of writing projects, our use case is very different from, say, a standardized test agency. While we had thousands of submissions, when split among different writing projects and further split among more salient sub-questions, we had an order of magnitude fewer submissions.

Another factor was that we had observed non-deep learning algorithms to be a lot more explainable. It was easier to reason about a simple regression algorithm than about a black box “universal language model” that categorizes text in some fashion. Also, we wanted to deeply (ha!) understand what contributed to a categorization made by any of our solution ideas.

So, our decision meant using “classic machine learning” algorithms, and using Scikit Learn as our go-to library. It fit our data availability and need-for-explanation well. We couldn’t have made such a decision without having ramped up a bit on machine learning earlier, or without understanding our input data!
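To make the explainability point concrete, here is a minimal sketch of how a linear model's learned weights can be read directly. The toy sentences, the "uses transition words" label, and the choice of CountVectorizer with LogisticRegression are all assumptions made for illustration; they are not our real data or the exact model we shipped.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up toy data (an assumption for illustration, not real student writing).
texts = [
    "First, cats are quiet. Second, cats are clean.",
    "In conclusion, cats make good pets.",
    "cats cats cats i like cats",
    "my cat is nice and i like it",
    "First, storms damage homes. In conclusion, storms are dangerous.",
    "storms are bad and scary",
]
labels = [1, 1, 0, 0, 1, 0]  # hypothetical rubric answer: 1 = uses transition words

# Turn each text into word counts, then fit a plain logistic regression.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression().fit(X, labels)

# Each word gets exactly one weight, so we can see what pushed a decision.
weights = {
    word: model.coef_[0][column]
    for word, column in vectorizer.vocabulary_.items()
}
for word, weight in sorted(weights.items(), key=lambda item: item[1], reverse=True):
    print(f"{word:12s} {weight:+.3f}")
```

Words that only appear in positively labeled texts end up with positive weights, and words that only appear in negatively labeled texts end up with negative ones, which is the kind of direct reading a black-box model does not offer.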

Setting an aggressive timeline to ship something

We wanted to keep ourselves accountable, so we set an aggressive agenda to ship something to students within three months. This was informed by both the business risk reduction efforts and also engineering deep dives and experiments. This kind of time constraint was actually freeing — it let us focus our energies on what we could feasibly do, and deflected aspirational divergences.

Chapter 2: Discovery

Identifying core needs

One of the big unknowns we had was: how do we structure our code and data to enable us to iteratively learn, build, and ship something?

We tackled this problem by thinking through our core needs:

We needed to pull in student writing data and play with it without interrupting other teams or products.
We needed some basic web based tools for internal users for annotation, exploration, and product discovery.
We needed to allow ourselves to rapidly switch directions in infrastructure (scale, technology, architecture) since we’d already seen many shifts.
We wanted a flexible local Jupyter Notebook environment with access to our entire dataset so we could learn, discover, and experiment with convenient and fast hardware.

Tools, Technologies, and Decisions

Given our core needs, we decided to use:

Python 3 — great ecosystem and support, and not a big jump for an otherwise Ruby shop.
hug — a Python 3 library for making JSON APIs that has a lot of batteries included. Plus, it means our minimal API was just a module of plain functions, and was accessible easily via an HTTP API, a programmatic interface (you know, import and so on), and even as a CLI if desired.
Docker — Docker is a way to selectively virtualize a system so it can be consistently built and used the same way no matter the underlying computer. With Docker, we had to solve the configuration of all our necessary machine learning and other dependencies and libraries only once. After we figured it out once, we could then run it the same regardless of our local computers or cloud server architecture. We could run one command to setup a new developer’s computer or to tear it down.
Convox on AWS –– Convox is like Heroku, but running on your own AWS infrastructure. Convox uses Docker, which allows us to support basically all language or OS choices, and to approximate the production environment locally. It also has a killer advantage if you’re using AWS already: it uses AWS services in fairly standard ways under the hood. ELB, Elasticache, ECS, RDS — we had years of familiarity with these services, and they were more reliable than a custom-run cluster. Infrastructure you don’t understand is quite the liability! Also of note: automatic rollbacks, built in Workflows that let us avoid Jenkins/other build automation tools, built in Slack and Github/Gitlab integrations, simple promotion of any prior release, and automated-or-simplified updates for security.
Separate database, and simplified data ingestion — we just made our own Postgres database, separate from the main product database. Instead of investing in data pipelines or AWS Glue or similar, we just made a simple cron job (powered by an AWS Lambda function under the hood of Convox’s high-level implementation) to import data in an idempotent way into a custom table that denormalized a whole bunch of otherwise hard to manage relationships. This allowed us to get away with just a handful of tables with any meaningful schema, and it imposed no requirements on the live product (we only read from a private read replica of the main database). This way, we were free to screw up and we had all the data we needed in one context.
In-built Jupyter Notebook server with full access to our own data — we provided out-of-the-box support for Jupyter Notebooks in the local development environment. What’s great is that the Project Jupyter folks have ready-made Docker images and documentation. We now had a way to do one-off explorations with lots of comments and commentary embedded right in readily presentable documents — they render in all their read-only glory right in Github!

And last but not least,

Pandas — the crown jewel of Python, as far as we’re concerned. This single library did more to make data accessible for us than anything else. Once you get the basics down, working with databases or CSVs or even your own cuts or combinations of them is seamless. When combined with Jupyter Notebook and its inline visualization support, your Notebook becomes a data playground. Python for Data Analysis is a great way to learn it!
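As a rough sketch of how the idempotent import and the Pandas workflow fit together: the example below uses an in-memory SQLite database as a stand-in for our Postgres database, and the table name, columns, and rows are hypothetical. In Postgres, the equivalent upsert would typically be written with INSERT ... ON CONFLICT.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the separate Postgres database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE submissions (
        submission_id INTEGER PRIMARY KEY,
        project TEXT NOT NULL,
        grade INTEGER NOT NULL,
        body TEXT NOT NULL
    )
    """
)


def import_rows(conn, rows):
    """Insert rows, overwriting any row with the same primary key.

    Because existing rows are replaced rather than duplicated,
    running the import twice leaves the table unchanged.
    """
    conn.executemany("INSERT OR REPLACE INTO submissions VALUES (?, ?, ?, ?)", rows)
    conn.commit()


# Hypothetical denormalized rows: one flat record per submission.
rows = [
    (1, "severe-weather", 5, "Storms can knock out power."),
    (2, "cats-as-pets", 6, "Cats make good pets because they are quiet."),
    (3, "cats-as-pets", 4, "I like cats."),
]
import_rows(conn, rows)
import_rows(conn, rows)  # simulate the cron job running a second time

# Load the table straight into a DataFrame and explore it.
df = pd.read_sql_query("SELECT * FROM submissions", conn)
df["word_count"] = df["body"].str.split().str.len()
per_project = df.groupby("project")["word_count"].mean()
print(per_project)
```

Keeping the import idempotent means a failed or repeated run never needs cleanup, which matters when the job is a simple cron task rather than a managed pipeline.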

Experimentation & more experimentation

Once we had a basic project setup with access to data and with configuration and dependencies all figured out, we started to spike on many new things. For most such spikes, this meant creating new Jupyter Notebooks that both made some small thing work and documented how the spike was done.

Over time, we saw some commonalities and decided to make the process of spiking much easier.

There are a couple of high level concepts in Scikit Learn that were extremely useful: Pipelines and Grid Search.

A Scikit Learn Pipeline is basically a very high level generalization of a machine learning model. What you can do is specify a certain composition of various data-transforming functions, with a final “estimation” function. Then, you can train this model on some data using general functions that work on any Pipeline at all.
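A minimal sketch of such a Pipeline might look like the following. The toy texts, the labels, and the particular steps (TfidfVectorizer feeding LogisticRegression) are assumptions for illustration, not the steps we actually used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy data (assumption): 1 = has a concluding sentence.
texts = [
    "In conclusion, cats make good pets.",
    "To sum up, storms are dangerous.",
    "In conclusion, gymnasts train hard.",
    "cats are quiet",
    "storms knock out power",
    "gymnasts practice every day",
]
labels = [1, 1, 1, 0, 0, 0]

# Data-transforming steps come first; the final step is the estimator.
pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression()),
    ]
)

# The same fit/predict calls work for any Pipeline, whatever its steps are.
pipeline.fit(texts, labels)
predictions = pipeline.predict(["In conclusion, dogs are loyal."])
print(predictions)
```

Because the whole chain behaves like a single model, swapping one step for another does not change the code that trains or evaluates it.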

Scikit Learn’s Grid Search is even cooler. Hyperparameters are the settings of a machine learning algorithm that you choose before training rather than learn from the data: for example, the number of clusters to form if using k-means clustering. With Grid Search, you specify high-level search criteria (lists or ranges of candidate hyperparameter values), provide a Pipeline and a dataset, select how to evaluate, and it then tries every combination of those hyperparameters and reports the best-scoring one!
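Here is a small sketch of a grid search over a Pipeline using scikit-learn's GridSearchCV. The toy data, the step names, and the candidate values in the grid are assumptions chosen only to keep the example tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy data (assumption): 1 = has a concluding sentence.
texts = [
    "In conclusion, cats make good pets.",
    "To sum up, storms are dangerous.",
    "In conclusion, gymnasts train hard.",
    "In conclusion, weather matters.",
    "cats are quiet",
    "storms knock out power",
    "gymnasts practice every day",
    "weather changes quickly",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression()),
    ]
)

# Keys are "<step name>__<parameter name>"; values are the candidates to try.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# 2 x 3 = 6 combinations, each scored with 2-fold cross-validation.
# scoring could be swapped for "precision" or "recall" to match product goals.
search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

On a dataset this small the winning combination is not meaningful; the point is only the shape of the search.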

We built a whole host of our own Transformers and our own higher level evaluation functions. This meant that our explorations could trivially pit any combination of Transformers against each other. Plus, because of the pipe-like structure, we could print out and understand each step’s effect on the data.
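As an illustration of what a custom Transformer can look like, here is a hypothetical one that turns each text into two simple numbers. The class name TextStats and the features it computes are invented for this sketch; our real Transformers were more involved.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class TextStats(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: text -> [word count, sentence-ending marks]."""

    def fit(self, X, y=None):
        # Nothing to learn from the data; return self so it chains in a Pipeline.
        return self

    def transform(self, X):
        rows = []
        for text in X:
            words = len(text.split())
            sentence_marks = sum(text.count(mark) for mark in ".!?")
            rows.append([words, sentence_marks])
        return np.array(rows, dtype=float)


texts = ["I like cats. They are quiet!", "storms"]
features = TextStats().fit_transform(texts)

pipeline = Pipeline([("stats", TextStats()), ("scale", StandardScaler())])

# Walk the steps one at a time to see what each one does to the data.
data = texts
for name, step in pipeline.steps:
    data = step.fit_transform(data)
    print(name, data.tolist())
```

Any class with fit and transform methods can slot into a Pipeline this way, which is what lets different Transformers be compared against each other with the same evaluation code.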

Also, by using Scikit Learn’s library of already implemented algorithms and great documentation, we were able to avoid a bunch of work. It’s a very helpful library.

Finally, spaCy was a dream to use. There’s so much performant, well built, up to date functionality in that one library!

Chapter 3: Delivery

Norming on Evaluation

Delivery is hard enough without uncertainties about how a given component of the system will behave given new inputs. With machine learning based systems, we adopt new kinds of uncertainties.

A big thing that helped keep us grounded in the face of such uncertainties was discussing and defining how to evaluate the job of assessing a piece of student writing, no matter who did it. This is more than just making a custom rubric (something a human could do too), which we did. This is more than just “number of goals vs number of shots on goal.”

We focused on two metrics common in machine learning: precision and recall.

Precision: “how many of the selected items are relevant?” Or, the exactness of the estimation.

Recall: “how many relevant items are selected?” Or, the completeness of the estimation.

Let’s say you’re trying to make a Sorting Machine that looks at each piece of fruit and puts it into an “orange” bucket only if it decides the fruit is an orange. You run an experiment by providing it with some already-known apples and oranges in random order. Let’s define some numbers too: 2 apples and 8 oranges.

Then, you assess the bucket, counting the number of apples and oranges. Let’s say you find 4 oranges and 1 apple in the “orange” bucket.

Now, the precision score for this Sorting Machine would be 4 True Positive oranges divided by 4 True Positive oranges plus 1 False Positive orange-that-was-actually-an-apple. 4/5, or 0.8.

However, the recall score for this Sorting Machine would be 4 True Positive oranges divided by 4 True Positive oranges plus 4 False Negative oranges. 4/8, or 0.5.
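The arithmetic above can be checked with a few lines of Python. The list layout is just one convenient way to encode the experiment: 8 oranges and 2 apples, with 4 oranges and 1 apple ending up in the “orange” bucket.

```python
# What each fruit really is: 8 oranges followed by 2 apples.
actual = ["orange"] * 8 + ["apple"] * 2

# Whether the Sorting Machine put each fruit into the "orange" bucket:
# 4 oranges in, 4 oranges missed, 1 apple in, 1 apple correctly left out.
in_bucket = [True] * 4 + [False] * 4 + [True] + [False]

pairs = list(zip(actual, in_bucket))
true_positives = sum(fruit == "orange" and chosen for fruit, chosen in pairs)
false_positives = sum(fruit == "apple" and chosen for fruit, chosen in pairs)
false_negatives = sum(fruit == "orange" and not chosen for fruit, chosen in pairs)

# Precision: of everything in the bucket, how much is really an orange?
precision = true_positives / (true_positives + false_positives)
# Recall: of all the real oranges, how many made it into the bucket?
recall = true_positives / (true_positives + false_negatives)

print(precision)  # 0.8
print(recall)     # 0.5
```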

We asked ourselves questions like: do we care about covering all cases of such-and-such writing mistake? How sure should we be that we’re correctly asking students to correct such-and-such mistake we noticed?

And we answered such questions by doing explorations and measuring these values for our various solutions, along with other product research and also a dash of common sense. This also extended to determining when we considered something “production worthy,” and when considering future retrospectives.
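One way to picture the tradeoff behind these questions is a confidence cutoff: only give feedback when a model is sufficiently sure. The sketch below is hypothetical; the scores, the labels, and the idea of a single threshold are assumptions for illustration rather than a description of our production logic.

```python
# Hypothetical model confidence that each piece of writing contains a mistake.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
# Whether a human grader says the mistake is really there (made-up labels).
is_mistake = [True, True, True, False, True, False, True, False]


def precision_recall_at(threshold):
    """Give feedback only when the score reaches the threshold, then score it."""
    flagged = [score >= threshold for score in scores]
    tp = sum(f and m for f, m in zip(flagged, is_mistake))
    fp = sum(f and not m for f, m in zip(flagged, is_mistake))
    fn = sum(not f and m for f, m in zip(flagged, is_mistake))
    return tp / (tp + fp), tp / (tp + fn)


for threshold in (0.5, 0.75):
    precision, recall = precision_recall_at(threshold)
    print(threshold, precision, recall)
```

With these made-up numbers, raising the cutoff from 0.5 to 0.75 removes the one piece of bad feedback (precision rises from 0.8 to 1.0) at the cost of staying silent on more real mistakes (recall drops from 0.8 to 0.6). That matches the earlier principle that bad feedback we don’t give is less troublesome than bad feedback we do give.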

Importance of Rapid Feedback

This last Delivery phase wasn’t entirely separate from Discovery — doing both at once is quite useful, actually!

As we were doing our exploration in the Discovery phase, we were continually building things in small spikes to get a sense for what we could actually deliver.

We learned a lot from a brief initial attempt to use Prodigy, from the makers of spaCy. We decided to make an annotation tool that only required a single action from a human to complete one unit of annotation, with a very clear rubric. Our thought was to then use that annotated data to produce feedback and see how useful and effective that feedback actually was.

We would make a rubric, collect some data, spike on some machine learning algorithms applied to this data (or apply learnings from synthetic spikes on unrelated but more plentiful data), produce feedback from the results (available in an internal web app), and then try out the feedback.

Well, we did this and discovered a need to adjust the rubric. Several times. And we also revised our ideas about what kind of feedback we should give many times.

Once we started doing half-week attempts to annotate some student writings using a way more focused set of questions, and then exploring them quickly in a Jupyter Notebook, we discovered we could get a lot of quality data very quickly and change directions very quickly too.

So, I would suggest aggressively optimizing for faster feedback loops in the evaluation of your solution.

Cohesive Experience, Incrementally Improved

To deliver better student feedback, we couldn’t just slap something together. For a meaningful product change, we had to think holistically about the teacher and the student experience. We decided to adjust the student writing experience fundamentally. Our UI/UX research allowed us to refine our core writing experience for students, and also to test the student feedback in a meaningful test setting.

It was so great to visit actual classrooms, and talk to real teachers and students! I’d say that’s the part that I enjoyed even as an engineer. Direct access to the problem domain, and enough autonomy to actually effect change — this is a powerful combo.

From this experience, we decided to first implement a simpler version of our work as several “Smart Supports” — specific, actionable writing feedback delivered directly and immediately to students, and tested and monitored for effectiveness.

Adjusting for production use

We decided to build a new version of our data annotation system and teacher-email system from scratch. And we decided not to use hug or Python in the backend for most use cases. This cost us time in the short term, but it let us avoid introducing a new production dependency on a new service. For all the benefits of microservice architecture, in a small team setting, ensuring site reliability given engineering resource constraints is a hard, continual challenge. What’s convenient for greenfield experimental work is not always convenient for a production app meant to be maintainable by all of our engineers.

We decided to keep our internal, Python-dependent repository only internal, and to build everything user-facing using our existing tools: Rails on the backend, and React on the frontend.

This meant we specialized Python for machine learning needs only. We already had a solution for “web backend” that was sufficient: Rails. Using Python for those needs would just introduce yet another way to do the same job.

We also discovered we could deliver Smart Supports informed by our machine learning work that themselves didn’t need an active machine learning model. This did not work for every kind of Support we imagined, but when it did work, it meant significant reductions in complexity, coupled with data gathering that informed future upgrades to the Support.

Learnings

A big learning for me: how unreasonably useful it is to keep developer tooling flexible, powerful, but easy to change. There was one Pull Request, I remember, where I utterly changed the frontend app we had, and it didn’t feel hard at all! And there was another PR where Luke spiked on how to produce an entirely new kind of Smart Support, and I remember being amazed at how I could actually understand the process because I could read all the steps that led up to the final result in the Jupyter Notebook — and then later, I replicated that on my own with little additional effort!

Another big learning: there is a sense of exploding new possibilities in Natural Language Processing. It isn’t business as usual. ULMFiT is one of the more recent, exciting developments. But there are so many more.

What this experience highlights is that while there are many unique challenges when you try to apply machine learning in a small-team, real-product setting, there are many practical lessons from Software Engineering that can help nurture and accelerate such an effort. There’s a lot of great work out there that we can build on, and practical ways to course correct if we find ourselves on the wrong path.

Learn more about Frontier, our solution for building better writers through inquiry-based learning.

You can read more about how we’re building better writers at eSpark Learning here.

And if you’re a Product Manager, you can read more about things to do before taking a machine learning course here.