December 2021
Feedback from Maran Nelson, Jason Laska, Waseem Daher, Jonas Gebhardt, Mackenzie Burnett and Dhruv Maheshwari.
Over the last ten years, many companies have created human-in-the-loop services that combine a mix of humans and algorithms. Now that some time has passed, we can tease out some patterns from their collective successes and failures. As someone who started a company in this space, my hope is that this retrospective can help prospective founders, investors, or companies navigating this space save time and fund more impactful projects.
A service is considered human-in-the-loop if it organizes its workflows with the intent to introduce models or heuristics that learn from the work of the humans executing the workflows. In this post, I will make reference to two common forms of human-in-the-loop: worker-in-the-loop, where workers employed by the service oversee or carry out the work on the end user’s behalf, and user-in-the-loop, where end users themselves oversee and correct the machine’s suggestions as they work.
‘Online’ and ‘offline’ are two terms frequently used in the ML community to describe human-in-the-loop systems. Online refers to when human labor is used to oversee an ML system for the immediate benefit of an end user -- consistent with worker-in-the-loop as defined above. Offline refers to when human labor is used only to train an ML model (e.g. Mechanical Turk workers annotating images), with the model productionized once training is done.
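To make the online, worker-in-the-loop pattern concrete, here is a minimal sketch of a task handler. The model, worker_queue, and training_store interfaces and the confidence threshold are hypothetical stand-ins, not a description of any particular company’s system:

```python
from dataclasses import dataclass

# Hypothetical cutoff for trusting the model without human oversight.
CONFIDENCE_THRESHOLD = 0.95

@dataclass
class Task:
    payload: str  # e.g. an inbound email the service must interpret

def handle_task(task: Task, model, worker_queue, training_store) -> str:
    """Online worker-in-the-loop: the model answers when it is confident,
    a human worker answers otherwise, and every human answer is saved as
    a training example for the next model iteration."""
    label, confidence = model.predict(task.payload)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label                              # fully automated path
    answer = worker_queue.ask(task.payload)       # escalate to a human worker
    training_store.save(task.payload, answer)     # human work becomes training data
    return answer
```

The defining property is in the last two lines: escalations are not just exception handling, they generate the labeled data that moves more of the workload under the automation threshold over time.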
While there is much to say about the broader role of human-in-the-loop in our transition to a machine-intelligence-infused economy, this post is mostly focused on learnings about worker-in-the-loop services.
In 2018, 80.2% of the US labor force was employed in the services sector. While there has been impressive progress in machine learning, the majority of jobs in the services sector (e.g. customer service, vehicle operators, analysts, accountants) continue to demand more intelligence than machines can provide alone. Some fraction of this work will get fully automated, but most services will inevitably transition to human-in-the-loop. Within this category, we can outsource work to human-augmented services (worker-in-the-loop), or augment users directly with ML-driven software suggestions (user-in-the-loop).
When does each human-backed approach make sense? Worker-in-the-loop makes sense when you don’t want to put the burden of overseeing incorrect machine predictions on your end user. If the user was going to do the work anyway, they may be fine with that oversight -- e.g. Tesla Autopilot or Gmail’s Smart Reply (user-in-the-loop). But where delegating the entire workflow is preferable to the end user, oversight defeats the purpose.
Worker-in-the-loop enables human-quality services at machine scale and efficiency. With WITL, you systematize and automate the routine parts of the workflow, reducing what remains to core ‘inference tasks’ that are devoid of customer-specific context. In an ideal WITL system, any human anywhere (once trained) can help handle tasks for any customer of the service -- unlocking the labor liquidity that otherwise makes service businesses difficult to scale, and doing so with better gross margins. For this reason, it’s a compelling approach for creating new venture-backed businesses, which require scale. Additionally, selling ML-infused software to legacy services businesses may be more difficult than using it to compete with them directly. Owning the end-to-end experience can 1) make the experience better and 2) capture more of the value created.
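As an illustration of what a customer-agnostic ‘inference task’ might look like, here is a small sketch using a scheduling-style example; the field names, the redaction step, and the question posed are invented for illustration rather than a description of Clara’s actual pipeline:

```python
import re

def to_inference_task(email_text: str, timezone: str) -> dict:
    """Reduce a raw, customer-specific message to a generic inference task.

    The routine parts of the workflow (pulling availability, applying the
    customer's preferences, sending the reply) are handled by software; the
    narrow question that remains can be answered by any trained worker, for
    any customer, without additional context."""
    # Crude redaction so the task carries less customer-specific information
    # (a real system would strip far more than email addresses).
    redacted = re.sub(r"\S+@\S+", "[email]", email_text)
    return {
        "question": "Which proposed meeting times does this message accept?",
        "text": redacted,
        "timezone": timezone,
    }

# Any trained worker, anywhere, can pick this task up regardless of whose
# inbox it came from -- the labor liquidity described above.
task = to_inference_task("Tuesday at 3pm works for me. -- bob@example.com", "US/Pacific")
```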
Still, the most common question investors asked us while we were developing a worker-in-the-loop scheduling service (Clara Labs) was “how long until the humans are gone?” Look no further than the speculation as to when (not if) Uber and Lyft will replace all of their drivers with self-driving systems. Investors are seeking scalable, high gross-margin businesses; large human workforces don’t obviously fit the mold. Investors aren’t solely to blame: AI founders similarly pitch optimistic automation timelines, and Hollywood continues to produce runaway-superintelligence films.
In practice, keeping humans in the loop has been one of the most effective ways of bringing machine intelligence to market for quality-critical applications. At Clara, we used worker-in-the-loop to make significant automation progress on a difficult problem (natural language scheduling). Tesla Autopilot (user-in-the-loop) was able to quickly ship self-driving technology to a large number of customers, each of whom improves the technology every time they use it. Some companies (e.g. x.ai, Cruise) even start out as worker-in-the-loop with the ambition to transition to user-in-the-loop once their system reaches a certain level of quality and they’ve harvested enough training data.
A quick caveat before we proceed: successfully building a worker-in-the-loop system is unfortunately still a non-trivial R&D and operational effort. Although this post shares observations about which markets have and haven’t worked, there are many tactical obstacles to overcome, which will likely be the subject of a subsequent post.
Horizontal WITL services (e.g. Magic, Facebook M, Fin, GoButler) that tried to support a wide range of customer requests (e.g. scheduling, shopping, travel booking) received a lot more public attention than their vertical counterparts. Notably, these projects were often funded by large technology companies or backed by investors seeking massive returns. However, the desire to offer broader service functionality in order to reach a larger market was the structural reason these projects struggled.
Vertical services, by contrast, can much more effectively deliver a high-performing service at scale -- one that customers know when to use and what to use it for. So, which vertical services have worked best for WITL?
Overall, the fastest-growing WITL services have been those that serve new, vertical demands for intelligence that companies also prefer to outsource. This makes logical sense, as prospective customers tend not to have incumbent providers you need to compete with. For the same reason, this category may see more total failures -- you have to convince the customer both to outsource the task in the first place and to find budget they previously hadn’t allocated.
WITL services that attempt to convince customers to outsource workflows they’re not used to outsourcing have had more difficulty. These companies have to convince people both a) to invest in trying outsourcing and b) to believe the service will do a better job than their current way of getting the work done. One of the biggest challenges with delegation is ensuring the supplier has sufficient context to do work of similar or better quality than you could have done yourself.
This was most evident in our experience with our original product at Clara Labs: a worker-in-the-loop assistant that could schedule meetings over natural language, just like a human assistant. Despite the service outperforming human-level support, we found that the average professional had a difficult time giving up the way they were used to getting the work done. Our best customers were people who had worked with an assistant before. We eventually shifted the product to support a more structured and more commonly outsourced workflow: recruiting coordination.
In the same vein, many WITL companies have taken on verticals where they can differentiate on service performance rather than having to change customers’ outsourcing preferences. These services meet customers where they are today and provide an abstraction they’re already used to. The following are examples of companies that pursued typically-outsourced services in existing verticals:
Which of these verticals will work out best for WITL? The opportunity to use WITL to differentiate from existing legacy providers is greatest where customers within the vertical are highly similar.
Even within some verticals, customers may have vastly differing requirements. The similarity of customers within a vertical is the best predictor of whether you can maximize the TAM (total addressable market) of a WITL service.
For example, consider tier 1 customer support. Every company’s customer service operation is unique; the inquiries from their customers and the answers to those inquiries are dissimilar across companies. As a consequence, there have not been successful WITL customer support services. Recruiting coordination, on the other hand, tends to be more similar across companies -- you’re not having fundamentally different scheduling conversations with candidates based on the company -- scheduling an interview is scheduling an interview.
For this reason, we may increasingly see segmentation within verticals (e.g. a WITL accounting service for brick-and-mortar businesses, or WITL customer support for e-commerce), where customers have greater similarity in their requirements, enabling a high degree of optimization and automation of the service. While these are logical businesses, this specialization comes with the tradeoff of a smaller TAM, so it’s unclear whether they’ll be able to scale into billion-dollar companies. The TAM reduction matters all the more because WITL companies will have lower gross margins than most venture-backed companies (e.g. SaaS).
Though the tactical details of making WITL services scale are not deeply explored in this post, here are a few considerations that affect the difficulty of making WITL services work at all:
Overall, there’s been a lot more WITL activity serving businesses than consumers. Why is that? Consumers have tended to prefer augmentative interfaces (i.e. user-in-the-loop) where they retain agency, whereas businesses are accustomed to delegation and outsourcing as a core part of how work gets done.
Although less common, consumer WITL services have similarly found success through vertical focus. As an example, Operator developed a personal shopping service, but shopping on behalf of consumers turns out to be an incredibly broad and difficult experience to tackle. One of the more successful consumer WITL companies has been Stitch Fix, which took the narrowing of focus a step further by building a personal apparel-stylist service. Stitch Fix used worker-in-the-loop to bring a previously premium, vertical service (stylists) to an internet-scale population. Notably, Stitch Fix also removed the speed constraint: because it’s a delivery service, the consumer’s expectation is on the order of days, not minutes or hours.
Even Google Assistant’s approach to offering highly intelligent services evolved into Google Duplex (a worker-backed AI service), which only helps consumers book reservations -- a tiny fraction of the total functionality of Google Assistant.
On the other hand, large consumer internet companies have frequently developed worker-in-the-loop services for ‘behind-the-scenes’ operations that are often invisible to consumers. These can be thought of as new, vertical demands for intelligence, but ones these companies did not prefer to fully outsource.
As teams of technologists tasked themselves with augmenting large human workforces, they’ve uncovered a missing set of tools and infrastructure that would make building worker-in-the-loop systems easier. B12 built and open-sourced Orchestra, a system that helps coordinate teams of human experts and machines. Fin pivoted from a horizontal consumer-facing assistant to building a ‘game tape/analytics’ product (Fin Analytics) for operational workflows. Scale improved on Mechanical Turk, the historical labor option for those who wanted to build a worker-in-the-loop system but didn’t want to manage their own labor force or work with a BPO (business process outsourcing firm).
There’s likely a lot more opportunity to both help large companies with existing human operations transform into WITL operations, and to make it easier for founders to build a new generation of worker-in-the-loop services.
Segmentation within already-outsourced verticals, infrastructure for building WITL services, and new demands for human-level intelligence at scale are all compelling areas for future investment. If you’re thinking about starting a worker-in-the-loop company or investing in one, I’d love to talk more about these ideas (DM me on Twitter), but first consider asking the following questions:
There are several additional topics worth expanding on that I omitted for the sake of getting this post out -- so if you’re interested in any of the following, do let me know: