Michael Akilian

Worker-in-the-loop Retrospective

Feedback from Maran Nelson, Jason Laska, Waseem Daher, Jonas Gebhardt, Mackenzie Burnett and Dhruv Maheshwari.

Personal motivation

Over the last ten years, many companies have created human-in-the-loop services that combine a mix of humans and algorithms. Now that some time has passed, we can tease out some patterns from their collective successes and failures. As someone who started a company in this space, my hope is that this retrospective can help prospective founders, investors, or companies navigating this space save time and fund more impactful projects.

A service is considered human-in-the-loop if it organizes its workflows with the intent to introduce models or heuristics that learn from the work of the humans executing the workflows. In this post, I will make reference to two common forms of human-in-the-loop:

  • User-in-the-loop (UITL): The end-user is interacting with suggestions from a software heuristic/ML system.
  • Worker-in-the-loop (WITL): A worker is paid to monitor suggestions from a software heuristic/ML system developed by the same company that pays the worker, but for the ultimate benefit of an end-user.

‘Online’ and ‘Offline’ are two terms frequently used in the ML community to describe human-in-the-loop systems. Online refers to human labor overseeing an ML system for the immediate benefit of an end user – this is consistent with ‘worker-in-the-loop’ as defined above. Offline refers to human labor used only to train an ML model (e.g. Mechanical Turk workers annotating images), with the model productionized once training is done.
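To make the worker-in-the-loop definition concrete, here is a minimal sketch of the online loop in Python. All names (`WorkerInTheLoop`, `toy_suggest`, `toy_review`) are hypothetical and exist only for illustration; this is not how any particular company's system is built. The model proposes an answer, a paid worker accepts or corrects it, and every outcome is logged so that later models can learn from the worker's decisions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class WorkerInTheLoop:
    """Hypothetical WITL loop: a model suggests, a paid worker reviews."""

    suggest: Callable[[str], str]        # stand-in for the heuristic/ML system
    review: Callable[[str, str], str]    # stand-in for the worker's decision
    training_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def handle(self, request: str) -> str:
        """Serve one end-user request and record what the worker decided."""
        suggestion = self.suggest(request)
        final = self.review(request, suggestion)
        # Keep (request, suggestion, final answer) as future training data.
        self.training_log.append((request, suggestion, final))
        return final

    def acceptance_rate(self) -> float:
        """Fraction of suggestions the worker accepted unchanged."""
        if not self.training_log:
            return 0.0
        accepted = sum(1 for _, s, f in self.training_log if s == f)
        return accepted / len(self.training_log)


# Toy stand-ins (assumptions for illustration only).
def toy_suggest(request: str) -> str:
    return "Tuesday 10am"


def toy_review(request: str, suggestion: str) -> str:
    # The worker accepts the suggestion only if it matches the request.
    return suggestion if "Tuesday" in request else "Friday 2pm"


loop = WorkerInTheLoop(suggest=toy_suggest, review=toy_review)
loop.handle("Can we meet Tuesday?")
loop.handle("Can we meet Friday?")
print(loop.acceptance_rate())
```

In this sketch the acceptance rate is a rough proxy for how much of the workflow the model could already handle on its own. A user-in-the-loop system would have the same shape, with the end-user playing the role of `review`.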

Human-in-the-loop visual

While there is much to say about the broader role of human-in-the-loop in our transition to a machine-intelligence-infused economy, this post is mostly focused on learnings about worker-in-the-loop services.

Why human-in-the-loop at all?

In 2018, 80.2% of the US labor force was employed in the services sector. While there has been impressive progress in machine learning, the majority of jobs in the services sector (e.g. customer service, vehicle operators, analysts, accountants) continue to demand more intelligence than machines are able to provide alone. Some fraction of work will get fully automated, but most services will inevitably transition to human-in-the-loop. Within this category, we can outsource work to human-augmented services (worker-in-the-loop), or augment users directly with ML-driven software suggestions (user-in-the-loop).

When does each human-backed approach make sense? Worker-in-the-loop makes sense when you don’t want to put the burden of overseeing incorrect machine predictions on your end-user. Oftentimes, if the user was going to do the work anyway, they may be okay with it – e.g. Tesla Autopilot or Gmail’s Smart Reply (user-in-the-loop). Where delegation of the entire workflow is preferable to the end-user, oversight defeats the purpose.

Gmail Smart Reply

Worker-in-the-loop enables human quality services at machine scale and efficiency. With WITL, you systematize and automate the routine parts of the workflow, and get to the core ‘inference tasks’ that are now devoid of customer-specific context. In an ideal WITL system, any human anywhere (once trained) can help handle tasks for any customer of the service – unlocking the labor liquidity that otherwise makes service businesses difficult to scale, and with better gross margins. For this reason, it’s a compelling approach in the creation of new venture-backed businesses which require scale. Additionally, selling ML-infused software to legacy services businesses may be more difficult than using it to directly compete with them. Owning the end-to-end experience can 1) make the experience better and 2) capture more of the value created.

Still, the most common question investors asked us while developing a worker-in-the-loop scheduling service (Clara Labs) was “how long until the humans are gone?” Look no further than the speculations as to when (not if) Uber and Lyft will replace all of their drivers with self-driving systems. Investors are seeking scalable, high gross-margin businesses; large human workforces don’t obviously fit the mold. Investors aren’t solely to blame: AI founders similarly pitch optimistic automation timelines, and Hollywood continues to produce runaway superintelligence films.

In practice, keeping humans in the loop has been one of the most effective ways of bringing machine intelligence to market for quality-critical applications. At Clara, we used worker-in-the-loop to make significant automation progress on a difficult problem (natural language scheduling). Tesla Autopilot (user-in-the-loop) was able to quickly ship self-driving car technology to a large number of customers, each of whom improves the technology every time they use it. Some companies (e.g. x.ai, Cruise) even start out as worker-in-the-loop with the ambition to transition to user-in-the-loop once their system reaches a certain level of quality and they’ve harvested enough training data.

A quick caveat before we proceed: successfully building a worker-in-the-loop system is unfortunately still a non-trivial R&D and operational effort. Although this post shares observations about which markets have and haven’t worked, there are many tactical obstacles to overcome, which will likely be the subject of a subsequent post.

Worker-in-the-loop observations

Horizontal services have struggled

Horizontal WITL services (e.g. Magic, Facebook M, Fin, GoButler) that tried to support a wide range of customer requests (e.g. scheduling, shopping, travel booking) received a lot more public attention than their vertical counterparts. Notably, these projects were often funded by large technology companies or backed by investors seeking massive returns. However, the desire to offer broader service functionality to reach a larger market was the structural reason these projects have struggled. Here is why:

Service Breadth

  • Too much human-backed functionality is hard to scale:
    • Horizontal services can’t scale as they support too many types of customer requests to optimize any specific internal workflow. The R&D teams working on optimizing these services tend to just build extremely customized customer ticketing software for their human operators — a far cry from “workers as a bridge to automation”.
    • The only companies that effectively provide horizontal services are actually labor marketplaces (e.g. Upwork, Fiverr, Mechanical Turk). They serve as the routing layer and defer optimization to the individual suppliers in their marketplace. This allows customers a wider breadth of options (the UX of a horizontal service), but at the expense of consistency and reliability of outcomes.
  • Reducing functionality to help with scaling horizontal services doesn’t work:
    • As horizontal services attempt to scale, they often need to stop offering the unlimited functionality that made them appealing in the first place.
    • Reducing functionality tends to lower the value of the service, as customers can no longer ‘ask for anything’ and now have to remember what the service can and can’t do.

Verticals have worked better

Vertical services can much more effectively deliver a high performing service at scale that customers know when to use and what to use it for. So, which vertical services have worked the best for WITL?

Service Breadth

Upper-right quadrant: New market, easy outsource

Overall, the fastest-growing WITL services have been those that supplied new, vertical demands for intelligence that companies also preferred to outsource. This makes logical sense, as the prospective customers tend not to have current providers that you need to compete with. For the same reason, this category may have more total failures, as you have to convince the customer both to outsource the task in the first place and to find budget where none was previously allocated.

  • Localization/transcription/creation of internet content (e.g. Unbabel, Lilt, Rev, B12): Fueled by the insatiable demand for making internet content relevant to more audiences.
  • Training data for ML teams (e.g. Scale, Figure Eight, Mighty.AI): Fueled by the insatiable demand for well-curated, high-quality training data. These services tend to be used to power ‘offline’ human-in-the-loop for their customers (i.e. training models), but the services themselves can be conceptualized as ‘worker-in-the-loop’ services, albeit with less stringent response-time SLAs.

Left quadrants: Hard outsource

WITL services that attempted to convince customers to outsource workflows that they’re not used to outsourcing have had more difficulty. These companies have to convince people to both a) invest in trying outsourcing and b) believe they’ll be able to do a better job than their current way of getting the work done. One of the biggest challenges with delegation is ensuring the supplier has sufficient context to do a similar or better quality of work than you could have yourself.

This was most evident in our experience working on our original product at Clara Labs: we built a worker-in-the-loop assistant that could schedule meetings over natural language, just like a human assistant. Despite outperforming human-level support, we found that the average professional had a difficult time giving up the way they were used to getting the work done. Our best customers were people who had worked with an assistant before. We eventually shifted the product to support a more structured and commonly outsourced workflow: recruiting coordination.

Bottom-right quadrant: Easy outsource, old market

In the same vein, many WITL companies have taken on verticals where they can differentiate on service performance as opposed to changing customer outsourcing preference. These services can meet customers where they are today and provide the abstraction they’re already used to. The following are examples of companies that pursued typically-outsourced services in existing verticals:

Which of these verticals will work out the best for WITL? The opportunity to differentiate from existing legacy providers using WITL is most present where there is a high degree of similarity between customers in the vertical.

Is pursuing an already outsourced vertical sufficient for success?

Even within some verticals, customers may have vastly differing requirements. The similarity of customers within a vertical is the best predictor of whether you can maximize the TAM (total addressable market) of a WITL service.

For example, consider tier 1 customer support. Every company’s customer service operation is unique; the inquiries from their customers and the answers to those inquiries are dissimilar across companies. As a consequence, there have not been successful WITL customer support services. Recruiting coordination, on the other hand, tends to be more similar across companies – you’re not having fundamentally different scheduling conversations with candidates based on the company – scheduling an interview is scheduling an interview.

For this reason, we may increasingly see segmentation within verticals (e.g. a WITL accounting service for brick-and-mortar businesses, or WITL customer support for e-commerce) where customers have greater similarity in their requirements, enabling a high degree of optimization and automation of the service. While these are logical businesses, the specialization comes with the tradeoff of a smaller TAM, so it’s unclear whether they’ll be able to scale to become billion-dollar companies. This TAM reduction is especially important as WITL companies will definitely have lower gross margins than most venture-backed companies (e.g. SaaS).

Though the tactical details of making WITL services scale are not deeply explored in this post, here are a few considerations that affect the difficulty of making WITL services work at all:

  • Speed requirements: For example, at Clara, we would respond to all customer requests within 60 minutes at the 95th percentile. This meant building a lot of rigor around our supply and demand modeling and onboarding new workers quickly, especially as we scaled up the number of customers. On the other hand, a training data annotation service may have less extreme SLAs – on the order of days or weeks.
  • Liquify-able work: An implicit assumption for these services is that human jobs can be decomposed into sub-parts that are executable by either machine or human in isolation. This tends to be more feasible for repetitive and non-creative operations work.
  • Quality bar: Human-in-the-loop is generally used because machines alone can’t match the quality needed for a given application, but there are degrees of quality requirements beyond that bar. Though HITL can help a system outperform human quality overall, reaching that higher bar is an additional level of complexity to tackle.
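As a small illustration of the speed requirement above, the following Python sketch checks whether a batch of response times meets a percentile-based SLA such as "within 60 minutes at the 95th percentile". The function names and the choice of the nearest-rank percentile method are assumptions made for this example, not a description of how Clara measured it.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Nearest rank: the smallest value with at least pct% of the data at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_sla(response_minutes: Sequence[float],
              sla_minutes: float = 60.0,
              pct: float = 95.0) -> bool:
    """True if the pct-th percentile response time is within the SLA."""
    return percentile_nearest_rank(response_minutes, pct) <= sla_minutes


# Hypothetical response times in minutes: 19 fast replies and one slow outlier.
sample = [10.0] * 19 + [120.0]
print(meets_sla(sample))
```

A percentile target tolerates a small share of slow responses, which is why staffing (supply) has to be modeled against request volume (demand): a second slow reply in the same 20-request sample would already break the target.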

Consumer services have been less common

Overall, there’s been a lot more WITL activity servicing businesses than there has been for consumers. Consumers have tended to prefer augmentative interfaces (e.g. user-in-the-loop) where they retain agency, as opposed to businesses which are accustomed to delegation and outsourcing as core to work. Why is that?

  • Justification: While focusing on competitive advantages and outsourcing ancillary functions has been a well-accepted management strategy for decades, consumers have neither the same ruthless justification nor the funds to pursue that strategy.
  • Speed matters: Because consumers outsource less, they’re used to the alternative of “I’ll just do this myself right now”. The overhead in delegating to someone else and waiting for them to complete the task can be a difficult obstacle to overcome.
  • Less structured: Companies are incentivized to structure their workflows and processes, which makes them easier to delegate. Consumers, on the other hand, lead less predictable lives with less consistent structure.

Although less common, consumer WITL services have similarly found success in vertical focus. As an example, Operator developed a personal shopping service, but it turns out shopping on behalf of consumers is an incredibly broad and difficult experience to tackle. One of the more successful WITL consumer companies has been Stitch Fix, which took narrowing focus a step further by building a personal apparel stylist service. Stitch Fix used worker-in-the-loop to bring a previously premium, vertical service (stylists) to an internet-scale population. Notably, Stitch Fix also removed the speed constraint, as it’s a delivery service and the consumer’s expectation is on the order of days, not minutes or hours.

Service breadth comparison

Even Google Assistant’s approach for offering highly intelligent services evolved into Google Duplex (a worker-backed AI service), which only helps consumers book reservations, a tiny fraction of the total functionality in Google Assistant.

On the other hand, large consumer internet companies have frequently developed worker-in-the-loop services for ‘behind-the-scenes’ operations that are often invisible to consumers. These can be thought of as new, vertical demands for intelligence, but ones that customers did not prefer to fully outsource.

  • Fulfillment of internet-scale delivery (e.g. Amazon’s warehouses): Fueled by the massive shift in consumer behavior to shopping online.
  • Online content moderation (e.g. Facebook, Google, Twitter): Fueled by the need to moderate the explosion of online user-generated content.

Emerging category: infrastructure for WITL services

As teams of technologists tasked themselves with augmenting large human workforces, they’ve uncovered a gap in the tools and infrastructure needed to make building worker-in-the-loop systems easier. B12 built and open-sourced Orchestra, a system that helps orchestrate a team of human experts and machines. Fin pivoted from a horizontal consumer-facing assistant to building a ‘game tape/analytics’ product (Fin Analytics) for operational workflows. Scale improved on Mechanical Turk, the historical labor alternative if you wanted to build a worker-in-the-loop system but didn’t want to manage your own labor force or work with a BPO (business process outsourcing) provider.

There’s likely a lot more opportunity to both help large companies with existing human operations transform into WITL operations, and to make it easier for founders to build a new generation of worker-in-the-loop services.

Future investment

Segmentation within already outsourced verticals, infrastructure for building WITL services, and new demands for human-level intelligence at scale are compelling areas for future investment. If you’re thinking about starting a worker-in-the-loop company or investing in some, I’d love to talk more about these ideas (DM me on Twitter), but first consider asking the following questions:

  • Is this a use case where quality is critical and customers would prefer to not have to oversee incorrect machine predictions (i.e. delegate the work)?
  • Is this a use case where customers already outsource this behavior? If not, is there sufficient reason to believe this service will change customer preference?
  • Is this a use case where a large number of customers will have a high degree of similarity in their requests to the service (i.e. a vertical or segment within a vertical), and the TAM is still meaningful?
  • Is this an emerging category of demand for intelligence with few alternatives, or an existing category of demand where the claim is that worker-in-the-loop will outperform existing legacy providers?
  • If all the above are true, how challenging will it be to scale such a service (e.g. difficulty of quality bar, response time SLAs, and whether the type of work is liquify-able)?

Future topics

There are several additional topics worth expanding on, which I omitted for the sake of getting this post out – so if you’re interested in any of the following do let me know:

  • Worker-in-the-loop landscape: I’ve begun compiling a landscape of worker-in-the-loop companies and projects, which this post drew on for examples and intuition. Given the breadth of the industries in the services economy, I’ve likely missed examples from companies that may technically be classified as ‘worker-in-the-loop’. If there are other whole sub-categories I’ve missed, I’d love to know as I continue to compile the full landscape.
  • Worker-in-the-loop in practice: How do you actually automate human workflows, delivering on human quality services at machine scale? What do you buy vs build in house when attempting to build the human and machine parts of these services? What’s the difference between ‘worker-in-the-loop’ and employing an operations team that uses something like Zendesk?
  • Services resistant to automation: Why service jobs are, and will largely continue to be, resistant to full automation, including the exceptions to the rule.