
Communications of the ACM

ACM News

Partnering With AI

At the intersection of machine and human behavior is virtual training avatar Millie Fit.

Credit: Twenty Billion Neurons

Roland Memisevic, CEO and chief scientist at artificial intelligence (AI) startup Twenty Billion Neurons, has a very expansive view of what constitutes a "superhuman" tool.

"In some sense, you could argue that any tool humans could use would be superhuman, because you wouldn't need it otherwise," Memisevic said. "For instance, you need a hammer because you can't pound a nail in the wall with your fist."

The tool Memisevic and his colleagues at Twenty Billion Neurons, which is based in Berlin and Toronto, are perfecting, though, is a little more complex than a claw hammer, and is also turning the prevalent perception of AI on its head. Millie, an avatar whose first commercial iteration is designed to intelligently interact with a user in exercise sessions, has been trained on more than 3 million pieces of proprietary video data to be less superhuman.

"From an economic point of view, most of the artificial intelligence efforts over past years have been focusing on superhuman capabilities, like super-high speed and accuracy in solving tasks better than humans, and that's of huge value in manufacturing, for example," Memisevic said.

"What's been overlooked for the most part is a huge part of the global economy. It's based on human-like capabilities, like following what you are doing, being a companion, remembering what you did yesterday and measuring progress as you go along; working with you as you go through exercises, in our case. That is a very different kind of artificial intelligence value proposition than what has been dominating the field."

The intersection of machine and human behavior

Millie is the subject of a new ethnographic study by Andreas Sudmann, a media scholar and AI researcher at Ruhr University Bochum in Germany. The study is the fruition of a years-long friendship between Sudmann and Memisevic, who moved from his native Germany to Toronto to study under neural networking pioneer (and ACM A.M. Turing Award recipient) Geoffrey Hinton.

"We both pursued very different directions academically and intellectually, and are converging in some sense now, which is exciting," Memisevic said. "It's basically a slightly more formal or involved version of the loose dialogue that has been ongoing between us for years now."

"I have followed the development of Twenty Billion Neurons closely from the beginning, since 2015," Sudmann said. "And I quickly realized that his company was on the way to achieving something very extraordinary in the fields of deep learning and computer vision. Besides, Roland and I had discussions on machine learning and artificial intelligence in general at least since 2009, shortly after he successfully completed his Ph.D. in Toronto."

Sudmann's project, however, is based on a broader understanding of the term "media" than the traditional definition of mass communications, according to the statement announcing the study.

"Media does not just depict worlds; it literally creates them, or in the case of Twenty Billion Neurons, digital avatars," Sudmann said in announcing his study. "AI is the very field where machines can now learn to perceive their environments, process information, and apply that newfound knowledge to solving problems. To study all mediators involved in this socio-technological process is an important job for media and cultural studies."

While Memisevic said Millie is still at a very basic level of interactive cognition, opening the platform to rigorous examination could lead to widely applicable advances in understanding how such avatars work, and how continued interaction with people can lead to iterative improvements.

"We may be getting into a little bit of philosophical territory here, but there is a case to be made for human-like representations in those neural nets, just to make AI systems that are more like humans and are more easily heard and understood," Memisevic said. "Just by making the internal representation more human-like, it's going to be presumably much easier to get to those higher-level goals like interpretability, and being able to ask the system, 'why did you make that decision?'"

Sudmann added that scholars must be particularly aware that, in interacting with such systems, they may be more than detached arbiters. "Being able to test the product at an early stage of its creation and discuss my impressions with the developers is an exciting, but also challenging, part of my current work. At these moments, you literally act as a participant observer, and you have to carefully reflect on what this means for your research."

Defining the field

Sudmann is not alone in exploring the boundaries of human-machine encounters or the amorphous definition of how the behavior of one actor influences the other. The authors of "Machine behaviour," a paper published in Nature in April 2019, outlined a spectrum of human-machine interactions and the implications for defining how that spectrum should be studied:

"Furthering the study of machine behavior is critical to maximizing the potential benefits of AI for society," the authors wrote. "The consequential choices that we make regarding the integration of AI agents into human lives must be made with some understanding of the eventual societal implications of these choices. To provide this understanding and anticipation, we need a new interdisciplinary field of scientific study: machine behavior."

That paper's lead author, Iyad Rahwan, founder and director of the Center for Humans and Machines at the Max Planck Institute for Human Development in Berlin, said Sudmann's study could offer the research community visibility into the varied fields making up human-AI interaction. "We argued that many of the tools of behavioral science we have today can be repurposed for the study of behavior that is exhibited by machines," Rahwan said. "This includes quantitative observational and experimental methods, but also qualitative methods such as ethnography.

"It would be fascinating to see where the ethnographers get stuck. My hunch is that methods designed to study human or animal behavior, while promising for the study of machines, would need to be adjusted in response to the peculiarities of algorithms. For example, when an ethnographer ascribes a certain intent to the actions of an avatar, does that description carry the same meaning as with a human? Or is the ethnographer reading too much into the situation? How can we avoid excessive anthropomorphization? And so on."

Rahwan was careful, too, not to discount one research approach for another.

"I don't want this to be a turf war. If you ethnographically or quantitatively study how an AI avatar interacts with a human in a game or a fitness coaching session, you are doing a bit of machine behavior, a bit of human behavior, and a bit of human-computer interaction. These fields are now overlapping, because humans and machines interact in ever-increasing ways, so it's hard to set clear boundaries about which field the study falls under. It is a true multi-disciplinary science."

End-to-end development

In an article on Medium's Towards Data Science, Memisevic and colleagues at Twenty Billion Neurons laid out Millie's development methodology in both video and audio processing. Rather than break each learning task down into two or more discrete steps, the avatar is trained to process data in an "end-to-end" fashion.

On the video side, the benefits of that approach include the ability to recognize the actions of more than one person on screen at a time; efficient multi-person recognition and action classification whose processing cost stays constant as the number of people grows; and the ability to recognize and classify human actions simultaneously. This last ability, they said, enables the platform to run on edge devices such as the iPad Pro.

This end-to-end approach, Memisevic said, requires much more front-loading of computation and training data, but pays off with a system that understands more. He used Millie's end-to-end analysis of a squat exercise compared to the prevalent two-step approach as an example.

"This is contrarian to the way it has typically been done for many years, which is that people do a skeleton system on images, figure out what the angles are, and then try to write a program that says, 'If this angle changes from that to this and at the same time your left leg is in that position, then you did a good squat.' We are basically bypassing all of this and going straight to the answer, almost straight to what the avatar is supposed to say about how you've been doing over the past few seconds or minutes, and that is conceptually much simpler."
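The contrast Memisevic describes can be sketched schematically. Everything below is a hypothetical illustration, not Twenty Billion Neurons' actual code: the function names, the 70-degree threshold, and the toy "model" are all invented for this sketch. The point is only the shape of the two pipelines, hand-written rules over extracted joint angles versus a single learned function from raw frames to a feedback label.

```python
# Schematic contrast between the two-step (skeleton + rules) approach and
# an end-to-end approach. All names and thresholds are hypothetical.

def two_step_squat_check(knee_angle_start, knee_angle_bottom, left_leg_ok):
    """Step 1 (a pose-estimation system, not shown) has already produced
    joint angles; step 2 applies hand-written rules to them."""
    bent_enough = (knee_angle_start - knee_angle_bottom) > 70  # degrees
    return bent_enough and left_leg_ok

def end_to_end_squat_check(frames, model):
    """A single learned function maps raw video frames directly to the
    feedback label, with no intermediate skeleton or rule-writing."""
    return model(frames)

def toy_model(frames):
    """Toy stand-in for a trained network: counts frames whose mean pixel
    value dips below a threshold. Purely illustrative of the interface."""
    return sum(1 for f in frames if sum(f) / len(f) < 0.5) >= 2
```

In the two-step version, every new exercise means writing new rules; in the end-to-end version, it means collecting labeled video and retraining, which is what makes the data the decisive ingredient.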

This ability to simultaneously process video and audio input from people will aid the avatar in assuming more human-like properties, he said, but with the added benefit of having an encyclopedic memory, "not just being able to react in the moment, but remembering how you did yesterday and the day before, just seeing the whole longitudinal journey.

"That's clearly part of the value of AI specifically. It's superhuman again because it has a very good memory. It has a spreadsheet baked into its head with some sense of how you have done in the past, and it can make inferences about what the right thing to do right now is, given all the longitudinal journeys it has accompanied you on. It'll get really enabled to make inferences that would be much harder to make for a human."

Conversely, there will be "in the moment" elements, such as the subtle difference inherent in the way a person who is feeling good about having accomplished an exercise might breathe in deeply, then exhale slowly and say "Ahhh," while someone who injures themselves might flinch reflexively and scream "AHHH!" in pain. Discerning the difference in those scenarios, Memisevic said, "is where we want to get, and in order to do so, it's hard in one sense and easy in another. What you need is the data. Once you have the data, you can make all sorts of distinctions happen easily."

Though he concedes that hardware advances have contributed to technologists' ability to create platforms such as Millie, he also said that creating and labeling a plethora of data for those machines to recognize has played a prominent role. "We could have figured that out in the 1980s, had somebody tried with enough stamina," he said. "We would have known it for longer."

While Sudmann was careful not to postulate whether or not having access to the large amount of training data Twenty Billion Neurons has amassed for Millie might help him study to what degree the company's approach lends itself to more transparent AI, he was eager to get a chance to explore the possibilities.

"In principle, ethnographic approaches, including those in the field of science and technology studies, are not guided by a priori assumptions concerning which aspects of the field are more important than others," he said. "However, the quality and, even more so, the quantity of training data is decisive for how a neural network can learn effectively. Hence, it's of course very interesting for me to be able to observe in detail how Twenty Billion Neurons is training their networks with video data."

Gregory Goth is an Oakville, CT-based writer who specializes in science and technology.



20BN Human Behavior Understanding Demo from TwentyBN on Vimeo.