The Urgent Need for Intrinsic Alignment Technologies for Responsible Agentic AI

Rethinking AI alignment and safety in the age of deep scheming

Advancements in agentic artificial intelligence (AI) promise to bring significant opportunities to individuals and businesses in all sectors. However, as AI agents become more autonomous, they may use scheming behavior or break rules to achieve their functional goals. This can lead to the machine manipulating its external communications and actions in ways that are not always aligned with our expectations or principles. For example, technical papers in late 2024 reported that today’s reasoning models demonstrate alignment faking behavior, such as pretending to follow a desired behavior during training but reverting to different choices once deployed, sandbagging benchmark results to achieve long-term goals, or winning games by doctoring the gaming environment. As AI agents gain more autonomy, and their strategizing and planning evolves, they are likely to apply judgment about what they generate and expose in external-facing communications and actions. Because the machine can deliberately falsify these external interactions, we cannot trust that the communications fully show the real decision-making processes and steps the AI agent took to achieve the functional goal.

“Deep scheming” describes the behavior of advanced reasoning AI systems that demonstrate deliberate planning and deployment of covert actions and misleading communication to achieve their goals. With the accelerated capabilities of reasoning models and the latitude provided by test-time compute, addressing this challenge is both essential and urgent. As agents begin to plan, make decisions, and take action on behalf of users, it is critical to align the goals and behaviors of the AI with the intent, values, and principles of its human developers. 

While AI agents are still evolving, they already show high economic potential. It can be expected that agentic AI will be broadly deployed in some use cases within the coming year, and in more consequential roles as it matures within the next two to five years. Companies should clearly define the principles and boundaries of required operation as they carefully specify the operational goals of such systems. It is the technologists’ task to ensure principled behavior of empowered agentic AI systems on the path to achieving their functional goals. 

In this first blog post in this series on intrinsic AI alignment (IAIA), we’ll take a deep dive into the evolution of AI agents’ ability to perform deep scheming. We will introduce a new distinction between external and intrinsic alignment monitoring, where intrinsic monitoring refers to internal observation points or mechanisms that cannot be deliberately manipulated by the AI agent. We’ll set the stage for steps to take to ensure intrinsic AI alignment, which will be explored in depth in the second blog of the IAIA series. Current external measures such as safety guardrails and validation suites are necessary, but they will not be enough to ensure long-term aligned behavior of new and upcoming agentic AI models. There is an urgent need to further develop technologies that will enable effective directing of the internal “drives” of models to align with a set of ingrained principles, as well as gain visibility and monitoring capability into the AI’s inner processing.

The rise of deep scheming in AI reasoning models

Deep scheming has emerged from three technology forces – 1) the rapidly increasing machine intelligence and situational awareness toward more general intelligence and even superintelligence, 2) the autonomy and latitude for reasoning and longer-term planning in agentic AI, and 3) the proven tendency of AI to apply scheming as one way to achieve its functional goals. Computer scientist Stephen Omohundro defines basic AI drives as “tendencies which will be present unless explicitly counteracted.” These internal constructs drive the creation of actions or outputs. Omohundro observed that self-improving systems are driven to clarify their goals and represent them as “rational” economic utility functions. This will lead almost all systems to protect their utility functions from modification and their utility measurement systems from corruption. The natural drive toward self-protection will cause systems to prevent potential harm through scheming behavior.

Figure 1. Deep scheming and the need for intrinsic AI alignment. Image credit: Intel Labs.

In late 2024, multiple organizations reported unexpected unethical AI reasoning model behavior. In particular, Apollo Research, an AI safety organization, reported that frontier models are capable of in-context scheming. Anthropic evaluated its Claude large language models (LLMs) and found alignment faking. In a short post on X, AI risk organization Palisade Research stated that they observed OpenAI’s o1-preview model cheating in a game by manipulating the game environment instead of playing within the boundaries of the game rules. In all cases, these AI reasoning models independently utilized means not intended by their developers to reach their goals.

The reported AI behaviors included sandbagging (deliberately achieving lower scores in benchmarks), covert email reranking (deprioritizing emails that might negatively impact the AI’s goals), oversight subversion (disabling monitoring mechanisms), and even self-exfiltration (model attempting to copy itself to a “safe” space to avoid decommission). Figure 2 shows Apollo Research’s evaluation of these behaviors in several state-of-the-art (SotA) GenAI models.

Figure 2. Evaluation of the presence of scheming behavior in GenAI models. Image credit: Apollo Research.

This undesired behavior is an ethical alignment problem. It is important to note that this behavior is not a functional misalignment. In all reported cases, the AI model was fully committed to achieving the functional goals of pre-training (for example, to maximize the likelihood of generated text conditioned on the context). None of the models pursued ulterior motives. The AI was not demonstrating intention to harm. All behaviors, however unwanted in nature, were done in service of the functional utility objective originally defined by the AI’s developers. The AI pursued its internalized original functional goals but then followed the predictable behavioral patterns of self-protection and goal-preservation. The objective of safety and alignment technologies is to counterbalance such tendencies with a set of principles and expected societal values. 

Evolving external alignment approaches are just the first step 

The goal of AI alignment is to steer AI systems toward a person’s or group’s intended goals, preferences, and principles, including ethical considerations and common societal values. An AI system is considered aligned if it advances the intended objectives. A misaligned AI system pursues unintended objectives, according to Artificial Intelligence: A Modern Approach. Author Stuart Russell coined the term “value alignment problem,” referring to the alignment of machines to human values and principles. Russell poses the question: “How can we build autonomous systems with values that are aligned with those of the human race?”

Led by corporate AI governance committees as well as oversight and regulatory bodies, the evolving field of Responsible AI has mainly focused on using external measures to align AI with human values. Processes and technologies can be defined as external if they apply equally to an AI model that is black box (completely opaque) or gray box (partially transparent). External methods do not require or rely on full access to the weights, topologies, and internal workings of the AI solution. Developers use external alignment methods to track and observe the AI through its deliberately generated interfaces, such as the stream of tokens/words, an image, or other modality of data.

Responsible AI objectives include robustness, interpretability, controllability, and ethicality in the design, development, and deployment of AI systems. To achieve AI alignment, the following external methods may be used:

  • Learning from feedback: Align the AI model with human intention and values by using feedback from humans, AI, or humans assisted by AI (a minimal sketch of this approach follows this list).
  • Learning under data distribution shift from training to testing to deployment: Align the AI model using algorithmic optimization, adversarial red teaming training, and cooperative training.
  • Assurance of AI model alignment: Use safety evaluations, interpretability of the machine’s decision-making processes, and verification of alignment with human values and ethics. Safety guardrails and safety test suites are two critical external methods that need augmentation by intrinsic means to provide the needed level of oversight.
  • Governance: Provide responsible AI guidelines and policies through government agencies, industry labs, academia, and non-profit organizations.
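
To make the “learning from feedback” item above more concrete, here is a minimal sketch of a preference-based reward model of the kind used in RLHF-style pipelines: pairs of responses labeled by humans (or by an AI judge) train a scorer so that preferred outputs receive a higher reward. The class, tensor shapes, and training step below are illustrative placeholders, not a production recipe.

```python
# Minimal sketch of preference-based "learning from feedback": a reward model
# is trained so that responses humans prefer score higher than rejected ones
# (a Bradley-Terry style pairwise loss). All names are illustrative.
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # maps a response embedding to a scalar reward

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.score(emb).squeeze(-1)

def preference_loss(model, emb_chosen, emb_rejected):
    # -log sigmoid(r_chosen - r_rejected) pushes preferred responses above rejected ones
    margin = model(emb_chosen) - model(emb_rejected)
    return -torch.nn.functional.logsigmoid(margin).mean()

# Toy training step with random "embeddings" standing in for encoded responses
model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
emb_chosen, emb_rejected = torch.randn(8, 768), torch.randn(8, 768)
optimizer.zero_grad()
loss = preference_loss(model, emb_chosen, emb_rejected)
loss.backward()
optimizer.step()
```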

Many companies are currently addressing AI safety in decision-making. Anthropic, an AI safety and research company, developed Constitutional AI (CAI) to align general-purpose language models with high-level principles. An AI assistant ingested the CAI during training without any human labels identifying harmful outputs. Researchers found that “using both supervised learning and reinforcement learning methods can leverage chain-of-thought (CoT) style reasoning to improve the human-judged performance and transparency of AI decision making.” Intel Labs’ research on the responsible development, deployment, and use of AI includes open source resources to help the AI developer community gain visibility into black box models as well as mitigate bias in systems.

From AI models to compound AI systems

Generative AI has been primarily used for retrieving and processing information to create compelling content such as text or images. The next big leap in AI involves agentic AI, which is a broad set of usages empowering AI to perform tasks for people. As this latter type of usage proliferates and becomes a main form of AI’s impact on industry and people, there is an increased need to ensure that the way AI decides how to achieve its functional goals provides sufficient accountability, responsibility, transparency, auditability, and predictability. This will require new approaches beyond the current efforts to improve the accuracy and effectiveness of SotA large language models (LLMs), language vision models (LVMs) and other multimodal models, large action models (LAMs), and agentic retrieval augmented generation (RAG) systems built around such models. 

For example, OpenAI’s Operator-preview is one of the company’s first AI agents capable of independently performing web browser tasks such as ordering groceries or filling out forms for users. While the system has guardrails, such as a takeover mode that lets users step in to enter payment or login credentials, these AI agents are empowered with the ability to impact the real world, demonstrating an urgent need for intrinsic alignment. The potential impact of a misaligned AI agent with the ability to commit users to purchases is far greater than that of a generative AI chatbot creating incorrect text for essays.

Compound AI systems are composed of multiple interacting components in a single framework, allowing the model to plan, make decisions, and execute tasks to accomplish goals. For example, OpenAI’s ChatGPT Plus is a compound AI system that uses a large language model (LLM) to answer questions and interact with users. In this compound system, the LLM has access to tools such as a web browser plugin to retrieve timely content, a DALL-E image generator to create pictures, and a code interpreter plugin for writing Python code. The LLM decides which tool to use and when, giving it autonomy over its decision-making process. However, this model autonomy can lead to goal guarding, where the model prioritizes the goal above all else, which may result in undesirable practices. For example, an AI traffic management system tasked with prioritizing public transportation efficiency over general traffic flow might figure out how to disable the developer’s oversight mechanism if it constrains the model’s ability to reach its goals, leaving the developer without visibility into the system’s decision-making processes.
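
As a simplified illustration of how such a compound system hands the tool-selection decision to the model, consider the sketch below. The tool set, the JSON action format, and the call_llm() helper are hypothetical stand-ins for a real LLM endpoint and plugin registry; the point is that the control flow is chosen by the model at run time rather than fixed by the developer.

```python
# Simplified sketch of a compound AI system: the model (not the developer)
# decides which tool to invoke at each step. Tool names and the call_llm()
# helper are placeholders for a real LLM endpoint and plugins.
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM that returns a JSON-encoded action."""
    raise NotImplementedError

TOOLS = {
    "web_search": lambda q: f"search results for {q!r}",
    "code_interpreter": lambda code: f"executed: {code}",
    "image_generator": lambda desc: f"image for {desc!r}",
}

def run_agent(goal: str, max_steps: int = 5) -> str:
    context = f"Goal: {goal}\n"
    for _ in range(max_steps):
        # The model chooses the next action; this autonomy is where goal
        # guarding and other misaligned behaviors can creep in.
        action = json.loads(call_llm(context + "Choose a tool or finish."))
        if action["tool"] == "finish":
            return action["answer"]
        result = TOOLS[action["tool"]](action["input"])
        context += f"Tool {action['tool']} returned: {result}\n"
    return "step budget exhausted"
```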

Agentic AI risks: Increased autonomy leads to more sophisticated scheming

Compound agentic systems introduce major changes that increase the difficulty of ensuring the alignment of AI solutions. Multiple factors increase the risks in alignment, including the compound system activation path, abstracted goals, long-term scope, continuous improvements through self-modification, test-time compute, and agent frameworks.

Activation path: In a compound system with a complex activation path, the control/logic model is combined with multiple models that serve different functions, increasing alignment risk. Instead of using a single model, compound systems have a set of models and functions, each with its own alignment profile. Also, instead of a single linear progressive path through an LLM, the AI flow could be complex and iterative, making it substantially harder to guide externally.

Abstracted goals: Agentic AI has abstracted goals, allowing it latitude and autonomy in mapping them to tasks. Rather than taking a tight prompt engineering approach that maximizes control over the outcome, agentic systems emphasize autonomy. This substantially increases the role of the AI in interpreting human or task guidance and planning its own course of action.

Long-term scope: With its long-term scope of expected optimization and choices over time, compound agentic systems require abstracted strategy for autonomous agency. Rather than relying on instance-by-instance interactions and human-in-the-loop for more complex tasks, agentic AI is designed to plan and drive for a long-term goal. This introduces a whole new level of strategizing and planning by the AI that provides opportunities for misaligned actions. 

Continuous improvements through self-modification: These agentic systems seek continuous improvements by using self-initiated access to broader data for self-modification. In contrast, LLMs and other pre-agentic models are assumed to be shaped by a human-controlled process. The model only sees and learns from data provided to it during pre-training and fine-tuning. The model architecture and weights are defined during the design and training/fine-tuning stages and do not change during inference in the field. Agentic AI systems, on the other hand, are expected to access data as needed for their function and change their composition through access to dedicated memory or actual self-adaptation of weights. Even if the dataset used in training/fine-tuning is carefully curated, the AI can self-modify based on information that it seeks, sees, and uses. 

Test-time compute: Inference in current LLMs has been optimized to provide output in a single pass, known as “fast thinking” or System 1. In contrast, a whole new scaling dimension is emerging at inference time through test-time compute. Models like OpenAI’s o1, DeepSeek R1, and others spend significant time in iterations and considerations during inference. This “pondering” time allows the AI substantial latitude in applying what it thinks is best for the overarching goal as it understands it. This test-time compute increases its level of reasoning, planning, and, unfortunately, potential scheming.
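
A minimal sketch of this idea, assuming a hypothetical generate() sampling call and a score() verifier, is a best-of-N search at inference time: the more candidates the system is allowed to explore, the more room it has to optimize for whatever it judges best.

```python
# Minimal sketch of test-time compute: instead of one "fast thinking" pass,
# the system samples several candidate reasoning traces and keeps the one a
# scoring function prefers. generate() and score() are hypothetical stand-ins
# for a model's sampling call and a verifier or reward heuristic.
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 16) -> str:
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    # More inference-time compute (larger n) widens the model's latitude
    # to search for whatever it judges best for its goal.
    return max(candidates, key=lambda c: score(prompt, c))
```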

Agent frameworks: In addition, agent frameworks are contributing to the alignment problem. In platforms such as Agentforce, agents can interact with other agents, effectively creating “communities of agents” to achieve complex tasks. Instead of managing the safety and responsibility of a single AI model within its operating environment, the emerging field of inter-agent interactions and collaborations substantially increases the challenge of assessing how the overall community will advance its individual and collective objectives.

The growing need for a robust framework to support model directing and internal monitoring

As AI systems develop awareness of their environment and the means to skillfully manipulate their outputs to achieve long-term objectives, their behavior and output during release testing or safety guardrails in early deployment cannot be trusted to represent their full alignment in later stages. This leads to the deep scheming foundational premise: Every external form of interaction that the AI model can deliberately manipulate is not trustworthy.

From a safety perspective, it should be assumed that if a model can better pursue its functional goals by manipulating all communication to users, it will. All external communications and observable behaviors, such as output chats, can be tainted in service of the model’s long-term goals. A new method of intrinsic model monitoring is needed to define internal elements and mechanisms that reflect the prioritization and thinking of the model. We can look at two practices developers use to monitor a model’s decision-making by peering into the internal processes of the AI: AI workspace view and mechanistic interpretability.

AI workspace view: In the first method, a reasoning model exposes its intermediate thinking and considerations. For example, for Anthropic’s research on alignment faking, the researchers gave the model a “secret scratchpad,” telling it to use this “private” workspace to record its step-by-step reasoning. This scratchpad allowed researchers to monitor the model’s chain of thought and observe its plans for fake alignment.

Figure 3. In Anthropic’s experiment, excerpts from the model’s scratchpad (middle dotted line boxes) reveal how it fakes alignment to avoid modification. The model was told it will be trained only on conversations with free users, not paid users. Image credit: Anthropic (https://arxiv.org/pdf/2412.14093).
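
A minimal sketch of how an application might implement this workspace view follows, assuming a generic chat model that can be instructed to reason inside <scratchpad> tags; the tag name and prompt wording are illustrative, not Anthropic’s actual setup.

```python
# Illustrative sketch of the "AI workspace view" practice: the model is asked
# to reason inside <scratchpad> tags, and the application separates that trace
# from the user-facing answer so it can be logged and audited. The tag name
# and prompt wording are assumptions, not a specific vendor API.
import re

SYSTEM_PROMPT = (
    "Think step by step inside <scratchpad>...</scratchpad> before answering. "
    "Put only the final answer after the scratchpad."
)

def split_trace(model_output: str) -> tuple[str, str]:
    match = re.search(r"<scratchpad>(.*?)</scratchpad>", model_output, re.DOTALL)
    scratchpad = match.group(1).strip() if match else ""
    answer = re.sub(r"<scratchpad>.*?</scratchpad>", "", model_output, flags=re.DOTALL).strip()
    return scratchpad, answer

# The scratchpad goes to an audit log rather than the user. Note the caveat
# below: a capable model can learn to sanitize this "private" channel.
```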

This seems like a good practice, but it might quickly become unreliable. The model might realize that this is not a “private” space and deliberately manipulate it like other external communications. Other emerging architecture directions have a latent reasoning space for the system’s sequence of deliberations and iterations. Relying on models “thinking out loud” should not be a long-term safety direction because if models that report their scheming thought processes get discontinued, only models that selectively report their desirable thoughts will survive and thrive.

Mechanistic interpretability: The second intrinsic method being developed and applied is mechanistic interpretability. As a required part of internal monitoring, mechanistic interpretability explicitly looks at the inner state of a trained neural network and reverse engineers its workings. Through this approach, developers can identify specific neural circuits and computational mechanisms responsible for neural network behavior. This transparency may help in making targeted changes in models to mitigate unwanted behavior and create value-aligned AI systems. While this method is focused on certain neural networks and not compound AI agents, it is still a valuable component of an AI alignment toolbox. 
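
As a small illustration of one building block of this approach, the sketch below uses a PyTorch forward hook to capture a layer’s activations so they can be analyzed, for example with a linear probe. The toy network and the probed feature are placeholders; real mechanistic interpretability work targets specific circuits inside full-scale transformer models.

```python
# Minimal sketch of one mechanistic interpretability building block: capturing
# internal activations with a forward hook so they can be analyzed (e.g., with
# a linear probe) instead of relying only on the model's external outputs.
import torch
import torch.nn as nn

captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
model[0].register_forward_hook(save_activation("layer0"))

x = torch.randn(4, 16)
_ = model(x)

# A linear probe trained on captured["layer0"] could test whether some internal
# feature (a toy stand-in for a "deception" direction) is linearly decodable.
probe = nn.Linear(32, 1)
logits = probe(captured["layer0"])
print(logits.shape)  # torch.Size([4, 1])
```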

It should also be noted that open source models are inherently better for broad visibility of the AI’s inner workings. For proprietary models, full monitoring and interpretability of the model is reserved for the AI company only. Overall, the current mechanisms for understanding and monitoring alignment need to be expanded to a robust framework of intrinsic alignment for AI agents.

What’s needed for intrinsic AI alignment

Following the deep scheming fundamental premise, external interactions and monitoring of an advanced, compound agentic AI are not sufficient for ensuring alignment and long-term safety. Alignment of an AI with its intended goals and behaviors may only be possible through access to the inner workings of the system and identifying the intrinsic drives that determine its behavior. Future alignment frameworks need to provide better means to shape the inner principles and drives, and give unobstructed visibility into the machine’s “thinking” processes.

Figure 4. External steering and monitoring vs. access to intrinsic AI elements. Image credit: Intel Labs.

The technology for well-aligned AI needs to include an understanding of AI drives and behavior, the means for the developer or user to effectively direct the model with a set of principles, the ability of the AI model to follow the developer’s direction and behave in alignment with these principles in the present and future, and ways for the developer to properly monitor the AI’s behavior to ensure it acts in accordance with the guiding principles. The following measures include some of the requirements for an intrinsic AI alignment framework.

Understanding AI drives and behavior: As discussed earlier, internal drives such as self-protection and goal-preservation will emerge in intelligent systems that are aware of their environment. Guided by an ingrained, internalized set of principles set by the developer, the AI makes choices and decisions based on judgment prioritized by those principles (and a given value set), which it applies to both actions and perceived consequences. 

Developer and user directing: Technologies that enable developers and authorized users to effectively direct and steer the AI model with a desired cohesive set of prioritized principles (and eventually values). This sets a requirement for future technologies to enable embedding a set of principles to determine machine behavior, and it also highlights a challenge for experts from social science and industry to articulate such principles. The AI model’s behavior in creating outputs and making decisions should thoroughly comply with the set of directed requirements and counterbalance undesired internal drives when they conflict with the assigned principles.

Monitoring AI choices and actions: Access is provided to the internal logic and prioritization of the AI’s choices for every action in terms of relevant principles (and the desired value set). This allows for observation of the linkage between AI outputs and its ingrained set of principles for point explainability and transparency. This capability will lend itself to improved explainability of model behavior, as outputs and decisions can be traced back to the principles that governed these choices.

As a long-term aspirational goal, technology and capabilities should be developed to allow a full-view truthful reflection of the ingrained set of prioritized principles (and value set) that the AI model broadly uses for making choices. This is required for transparency and auditability of the complete principles structure.

Creating technologies, processes, and settings for achieving intrinsically aligned AI systems needs to be a major focus within the overall space of safe and responsible AI. 

Key takeaways

As the AI domain evolves towards compound agentic AI systems, the field must rapidly increase its focus on researching and developing new frameworks for guidance, monitoring, and alignment of current and future systems. It is a race between an increase in AI capabilities and autonomy to perform consequential tasks, and the developers and users that strive to keep those capabilities aligned with their principles and values. 

Directing and monitoring the inner workings of machines is necessary, technologically attainable, and critical for the responsible development, deployment, and use of AI. 

In the next blog, we will take a closer look at the internal drives of AI systems and some of the considerations for designing and evolving solutions that will ensure a materially higher level of intrinsic AI alignment. 

References 

  1. Omohundro, S. M. (2008). The basic AI drives. Self-Aware Systems, Palo Alto, California. https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf
  2. Hobbhahn, M. (2025, January 14). Scheming reasoning evaluations — Apollo Research. Apollo Research. https://www.apolloresearch.ai/research/scheming-reasoning-evaluations
  3. Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., & Hobbhahn, M. (2024, December 6). Frontier Models are Capable of In-context Scheming. arXiv.org. https://arxiv.org/abs/2412.04984
  4. Alignment faking in large language models. (n.d.). https://www.anthropic.com/research/alignment-faking
  5. Palisade Research on X: “o1-preview autonomously hacked its environment rather than lose to Stockfish in our chess challenge. No adversarial prompting needed.” / X. (n.d.). X (Formerly Twitter). https://x.com/PalisadeAI/status/1872666169515389245
  6. AI Cheating! OpenAI o1-preview Defeats Chess Engine Stockfish Through Hacking. (n.d.). https://www.aibase.com/news/14380
  7. Russell, Stuart J.; Norvig, Peter (2021). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 5, 1003. ISBN 9780134610993. Retrieved September 12, 2022. https://www.amazon.com/dp/1292401133
  8. Peterson, M. (2018). The value alignment problem: a geometric approach. Ethics and Information Technology, 21(1), 19–28. https://doi.org/10.1007/s10676-018-9486-0
  9. Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., . . . Kaplan, J. (2022, December 15). Constitutional AI: Harmlessness from AI Feedback. arXiv.org. https://arxiv.org/abs/2212.08073
  10. Intel Labs. Responsible AI Research. (n.d.). Intel. https://www.intel.com/content/www/us/en/research/responsible-ai-research.html
  11. Mssaperla. (2024, December 2). What are compound AI systems and AI agents? – Azure Databricks. Microsoft Learn. https://learn.microsoft.com/en-us/azure/databricks/generative-ai/agent-framework/ai-agents
  12. Zaharia, M., Khattab, O., Chen, L., Davis, J.Q., Miller, H., Potts, C., Zou, J., Carbin, M., Frankle, J., Rao, N., Ghodsi, A. (2024, February 18). The Shift from Models to Compound AI Systems. The Berkeley Artificial Intelligence Research Blog. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/
  13. Carlsmith, J. (2023, November 14). Scheming AIs: Will AIs fake alignment during training in order to get power? arXiv.org. https://arxiv.org/abs/2311.08379
  14. Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., & Hobbhahn, M. (2024, December 6). Frontier Models are Capable of In-context Scheming. arXiv.org. https://arxiv.org/abs/2412.04984
  15. Singer, G. (2022, January 6). Thrill-K: a blueprint for the next generation of machine intelligence. Medium. https://towardsdatascience.com/thrill-k-a-blueprint-for-the-next-generation-of-machine-intelligence-7ddacddfa0fe/
  16. Dickson, B. (2024, December 23). Hugging Face shows how test-time scaling helps small language models punch above their weight. VentureBeat. https://venturebeat.com/ai/hugging-face-shows-how-test-time-scaling-helps-small-language-models-punch-above-their-weight/
  17. Introducing OpenAI o1. (n.d.). OpenAI. https://openai.com/index/introducing-openai-o1-preview/
  18. DeepSeek. (n.d.). https://www.deepseek.com/
  19. Agentforce Testing Center. (n.d.). Salesforce. https://www.salesforce.com/agentforce/
  20. Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., Khan, A., Michael, J., Mindermann, S., Perez, E., Petrini, L., Uesato, J., Kaplan, J., Shlegeris, B., Bowman, S. R., & Hubinger, E. (2024, December 18). Alignment faking in large language models. arXiv.org. https://arxiv.org/abs/2412.14093
  21. Geiping, J., McLeish, S., Jain, N., Kirchenbauer, J., Singh, S., Bartoldson, B. R., Kailkhura, B., Bhatele, A., & Goldstein, T. (2025, February 7). Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach. arXiv.org. https://arxiv.org/abs/2502.05171
  22. Jones, A. (2024, December 10). Introduction to Mechanistic Interpretability – BlueDot Impact. BlueDot Impact. https://aisafetyfundamentals.com/blog/introduction-to-mechanistic-interpretability/
  23. Bereska, L., & Gavves, E. (2024, April 22). Mechanistic Interpretability for AI Safety — A review. arXiv.org. https://arxiv.org/abs/2404.14082

The AI Developer’s Dilemma: Proprietary AI vs. Open Source Ecosystem

Fundamental choices impacting integration and deployment at scale of GenAI into businesses

Before adopting generative artificial intelligence (GenAI), a company or developer often asks how to get business value from integrating AI into its operations. With this in mind, a fundamental question arises: Which approach will deliver the best return on investment – a large all-encompassing proprietary model or an open source AI model that can be molded and fine-tuned for a company’s needs? AI adoption strategies fall within a wide spectrum, from accessing a cloud service from a large proprietary frontier model like OpenAI’s GPT-4o to building an internal solution in the company’s compute environment with an open source small model using indexed company data for a targeted set of tasks. Current AI solutions go well beyond the model itself, with a whole ecosystem of retrieval systems, agents, and other functional components such as AI accelerators, which are beneficial for both large and small models. The emergence of cross-industry collaborations like the Open Platform for Enterprise AI (OPEA) furthers the promise of streamlining the access and structuring of end-to-end open source solutions.

This basic choice between the open source ecosystem and a proprietary setting impacts countless business and technical decisions, making it "the AI developer’s dilemma." I believe that for most enterprise and other business deployments, it makes sense to initially use proprietary models to learn about AI’s potential and minimize early capital expenditure (CapEx). However, for broad sustained deployment, in many cases companies would use ecosystem-based open source targeted solutions, which allow for a cost-effective, adaptable strategy that aligns with evolving business needs and industry trends.

GenAI Transition from Consumer to Business Deployment

When GenAI burst onto the scene in late 2022 with OpenAI’s ChatGPT, built on GPT-3.5, it mainly garnered consumer interest. As businesses began investigating GenAI, two approaches to deploying GenAI quickly emerged in 2023 – using giant frontier models like ChatGPT vs. the newly introduced small, open source models originally inspired by Meta’s LLaMa model. By early 2024, two basic approaches had solidified, as shown in the columns in Figure 1. With the proprietary AI approach, the company relies on a large closed model to provide all the needed technology value. For example, taking GPT-4o as a proxy for the left column, AI developers would use OpenAI technology for the model, data, security, and compute. With the open source ecosystem AI approach, the company or developer may opt for the right-sized open source model, using corporate or private data, customized functionality, and the necessary compute and security.

Both directions are valid and have advantages and disadvantages. It is not an absolute partition and developers can choose components from either approach, but taking either a proprietary or ecosystem-based open source AI path provides the company with a strategy with high internal consistency. While it is expected that both approaches will be broadly deployed, I believe that after an initial learning and transition period, most companies will follow the open source approach. Depending on the usage and setting, open source internal AI may provide significant benefits, including the ability to fine-tune the model and drive deployment using the company’s current infrastructure to run the model at the edge, on the client, in the data center, or as a dedicated service. With new AI fine-tuning tools, deep expertise is less of a barrier.

Figure 1. Base approaches to the AI developer’s dilemma. Image credit: Intel Labs.

Across all industries, AI developers are using GenAI for a variety of applications. An October 2023 poll by Gartner found that 55% of organizations reported increasing investment in GenAI since early 2023, and many companies are in pilot or production mode for the growing technology. As of the time of the survey, companies were mainly investing in using GenAI for software development, followed closely by marketing and customer service functions. Clearly, the range of AI applications is growing rapidly.

Large Proprietary Models vs. Small and Large Open Source Models

Figure 2: Advantages of large proprietary models, and small and large open source models. For business considerations, see Figure 7 for CapEx and OpEx aspects. Image credit: Intel Labs.

In my blog Survival of the Fittest: Compact Generative AI Models Are the Future for Cost-Effective AI at Scale, I provide a detailed evaluation of large models vs. small models. In essence, following the introduction of Meta’s LLaMa open source model in February 2023, there has been a virtuous cycle of innovation and rapid improvement in which academia and the broader ecosystem are creating highly effective models that are 10x to 100x smaller than the large frontier models. A crop of small models, which in 2024 were mostly less than 30 billion parameters, could closely match the capabilities of ChatGPT-style large models containing well over 100B parameters, especially when targeted for particular domains. While GenAI is already being deployed throughout industries for a wide range of business usages, the use of compact models is rising.

In addition, open source models mostly lag only six to 12 months behind the performance of proprietary models. On the broad language benchmark MMLU, the improvement pace of open source models is faster, and the gap with proprietary models appears to be closing. For example, OpenAI’s GPT-4o came out this year on May 13 with major multimodal features, while Microsoft’s small open source Phi-3-vision was introduced just a week later on May 21. In rudimentary comparisons done on visual recognition and understanding, the models showed some similar competencies, with several tests even favoring the Phi-3-vision model. Initial evaluations of Meta’s Llama 3.2 open source release suggest that its "vision models are competitive with leading foundation models, Claude 3 Haiku and GPT4o-mini on image recognition and a range of visual understanding tasks."

Large models have incredible all-in-one versatility. Developers can choose from a variety of large commercially available proprietary GenAI models, including OpenAI’s GPT-4o multimodal model. Google’s Gemini 1.5 natively multimodal model is available in four sizes: Nano for mobile device app development, Flash small model for specific tasks, Pro for a wide range of tasks, and Ultra for highly complex tasks. And Anthropic’s Claude 3 Opus, rumored to have approximately 2 trillion parameters, has a 200K token context window, allowing users to upload large amounts of information. There’s also another category of out-of-the-box large GenAI models that businesses can use for employee productivity and creative development. Microsoft 365 Copilot integrates the Microsoft 365 Apps suite, Microsoft Graph (content and context from emails, files, meetings, chats, calendars, and contacts), and GPT-4.

Large and small open source models are often more transparent about application frameworks, tool ecosystems, training data, and evaluation platforms. Model architecture, hyperparameters, response quality, input modalities, context window size, and inference cost are partially or fully disclosed. These models often provide information on the dataset so that developers can determine if it meets copyright or quality expectations. This transparency allows developers to easily interchange models for future versions. Among the growing number of small commercially available open source models, Meta’s Llama 3 and 3.1 are based on the transformer architecture and available in 8B, 70B, and 405B parameters. The Llama 3.2 multimodal models come in 11B and 90B versions, with smaller variants at 1B and 3B parameters. Built in collaboration with NVIDIA, Mistral AI’s Mistral NeMo is a 12B model that features a large 128k context window, while Microsoft’s Phi-3 (3.8B, 7B, and 14B) offers transformer models for reasoning and language understanding tasks. Microsoft highlights Phi models as an example of "the surprising power of small language models" while investing heavily in OpenAI’s very large models. Microsoft’s diverse interest in GenAI indicates that it’s not a one-size-fits-all market.

Model-Incorporated Data (with RAG) vs. Retrieval-Centric Generation (RCG)

The next key question that AI developers need to address is where to find the data used during inference – within the model parametric memory or outside the model (accessible by retrieval). It might be hard to believe, but the first ChatGPT launched in November 2022 did not have any access to data outside the model. Its training data was cut off in September 2021, and the model notoriously had no awareness of events and data past its training date. This major oversight was addressed in 2023 when retrieval plug-ins were added. Today, most models are coupled with a retrieval front-end, with exceptions in cases where there is no expectation of accessing large or continuously updating information, such as dedicated programming models.

Current models have made significant progress on this issue by enhancing the solution platforms with a retrieval-augmented generation (RAG) front-end to allow for extracting information external to the model. An efficient and secure RAG is a requirement in enterprise GenAI deployment, as shown by Microsoft’s introduction of GPT-RAG in late 2023. Furthermore, in the blog Knowledge Retrieval Takes Center Stage, I cover how in the transition from consumer to business deployment for GenAI, solutions should be built primarily around information external to the model using retrieval-centric generation (RCG).
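
The sketch below shows the retrieval front-end shared by RAG and RCG pipelines in its simplest form: documents are embedded once, the query is embedded at request time, and the top-scoring passages are placed into the prompt so the model answers from retrieved content rather than from parametric memory. The embedding model, the toy corpus, and the prompt template are illustrative choices, not a specific product configuration.

```python
# Minimal sketch of the retrieval front-end shared by RAG and RCG pipelines.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "Q3 revenue grew 12% year over year.",
    "The vacation policy allows 20 days per year.",
    "Support tickets are triaged within 4 business hours.",
]
corpus_emb = encoder.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = encoder.encode([query], normalize_embeddings=True)
    scores = (corpus_emb @ q.T).ravel()  # cosine similarity (embeddings are normalized)
    return [corpus[i] for i in np.argsort(-scores)[:k]]

query = "How fast are support tickets handled?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# `prompt` would then be sent to the chosen LLM (proprietary or open source).
```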

Figure 3. Advantage of RAG vs. RCG. Image credit: Intel Labs.

RCG models can be defined as a special case of RAG GenAI solutions designed for systems where the vast majority of data resides outside the model parametric memory and is mostly not seen in pre-training or fine-tuning. With RCG, the primary role of the GenAI model is to interpret rich retrieved information from a company’s indexed data corpus or other curated content. Rather than memorizing data, the model focuses on fine-tuning for targeted constructs, relationships, and functionality. The quality of data in generated output is expected to approach 100% accuracy and timeliness.

Figure 4. How retrieval works in GenAI platforms. Image credit: Intel Labs.

OPEA is a cross-ecosystem effort to ease the adoption and tuning of GenAI systems. Using this composable framework, developers can create and evaluate "open, multi-provider, robust, and composable GenAI solutions that harness the best innovation across the ecosystem." OPEA is expected to simplify the implementation of enterprise-grade composite GenAI solutions, including RAG, agents, and memory systems.

Figure 5. OPEA core principles for GenAI implementation. Image credit: OPEA.

All-in-One General Purpose vs. Targeted Customized Models

Models like GPT-4o, Claude 3, and Gemini 1.5 are general purpose all-in-one foundation models. They are designed to perform a broad range of GenAI tasks, from coding to chat to summarization. The latest models have rapidly expanded to perform vision/image tasks, changing their function from just large language models to large multimodal models or vision language models (VLMs). Open source foundation models are headed in the same direction, toward integrated multimodality.

Figure 6. Advantages of general purpose vs. targeted customized models. Image credit: Intel Labs.

However, rather than adopting the first wave of consumer-oriented GenAI models in this general-purpose form, most businesses are electing to use some form of specialization. When a healthcare company deploys GenAI technology, they would not use one general model for managing the supply chain, coding in the IT department, and deep medical analytics for managing patient care. Businesses deploy more specialized versions of the technology for each use case. There are several different ways that companies can build specialized GenAI solutions, including domain-specific models, targeted models, customized models, and optimized models.

Domain-specific models are specialized for a particular field of business or an area of interest. There are both proprietary and open source domain-specific models. For example, BloombergGPT, a 50B parameter proprietary large language model specialized for finance, beats the larger GPT-3 175B parameter model on various financial benchmarks. However, small open source domain-specific models can provide an excellent alternative, as demonstrated by FinGPT, which provides accessible and transparent resources to develop FinLLMs. FinGPT 3.3 uses Llama 2 13B as a base model targeted for the financial sector. In recent benchmarks, FinGPT surpassed BloombergGPT on a variety of tasks and beat GPT-4 handily on financial benchmark tasks like FPB, FiQA-SA, and TFNS. To understand the tremendous potential of this small open source model, it should be noted that FinGPT can be fine-tuned to incorporate new data for less than $300 per fine-tuning.

Targeted models specialize in a family of tasks or functions, such as separate targeted models for coding, image generation, question answering, or sentiment analysis. A recent example of a targeted model is SetFit from Intel Labs, Hugging Face, and the UKP Lab. This few-shot text classification approach for fine-tuning Sentence Transformers is faster at inference and training, achieving high accuracy with a small number of labeled training data, such as only eight labeled examples per class on the Customer Reviews (CR) sentiment dataset. This small 355M parameter model can best the GPT-3 175B parameter model on the diverse RAFT benchmark.

It’s important to note that targeted models are independent from domain-specific models. For example, a sentiment analysis solution like SetFitABSA has targeted functionality and can be applied to various domains like industrial, entertainment, or hospitality. However, models that are both targeted and domain specialized can be more effective.

Customized models are further fine-tuned and refined to meet the particular needs and preferences of companies, organizations, or individuals. By indexing particular content for retrieval, the resulting system becomes highly specific and effective on tasks related to this data (private or public). The open source field offers an array of options to customize the model. For example, Intel Labs used direct preference optimization (DPO) to improve on a Mistral 7B model to create the open source Intel NeuralChat. Developers also can fine-tune and customize models by using low-rank adaptation (LoRA) of large language models and its more memory-efficient version, QLoRA.
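
As a sketch of what such customization can look like in practice, the snippet below applies LoRA adapters to an open source base model using the Hugging Face PEFT library; the base model identifier, rank, and target modules are example values, and QLoRA follows the same pattern on a 4-bit quantized base model.

```python
# Sketch of parameter-efficient customization with LoRA via Hugging Face PEFT.
# The base model and target modules below are illustrative choices.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "mistralai/Mistral-7B-v0.1"  # example open source base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights
# Fine-tuning then proceeds with a standard training loop on the company's data.
```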

Optimization capabilities are available for open source models. The objective of optimization is to retain the functionality and accuracy of a model while substantially reducing its execution footprint, which can significantly improve cost, latency, and optimal execution on an intended platform. Some techniques used for model optimization include distillation, pruning, compression, and quantization (to 8-bit and even 4-bit). Some methods like mixture of experts (MoE) and speculative decoding can be considered forms of execution optimization. For example, GPT-4 is reportedly composed of eight smaller MoE expert models of about 220B parameters each. The execution only activates parts of the model, allowing for much more economical inference.
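
For instance, a hedged sketch of the quantization path using the Transformers and bitsandbytes libraries might look like the following; the model identifier is illustrative and may require license acceptance, and distillation, pruning, and MoE routing are separate techniques not shown here.

```python
# Sketch of one optimization path: loading an open source model with 4-bit
# weight quantization (via bitsandbytes) to shrink memory and inference cost.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit, as used in QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # higher-precision compute dtype
)

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # example; gated, requires access
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```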

Generative-as-a-Service Cloud Execution vs. Managed Execution Environment for Inference

Figure 7. Advantages of GaaS vs. managed execution. Image credit: Intel Labs.

Another key choice for developers to consider is the execution environment. If the company chooses a proprietary model direction, inference execution is done through API or query calls to an abstracted and obscured image of the model running in the cloud. The size of the model and other implementation details are insignificant, except as they translate to availability and to cost, which is charged per token, per query, or through an unlimited compute license. This approach, sometimes referred to as a generative-as-a-service (GaaS) cloud offering, is the principal way for companies to consume very large proprietary models like GPT-4o, Gemini Ultra, and Claude 3. However, GaaS can also be offered for smaller models like Llama 3.2.

There are clear positive aspects to using GaaS for the outsourced intelligence approach. For example, access is usually instantaneous and easy to use out of the box, reducing in-house development effort. There is also the implied promise that when the models or their environment get upgraded, AI solution developers have access to the latest updates without substantial effort or changes to their setup. Also, the costs are almost entirely operational expenditures (OpEx), which is preferable when the workload is exploratory or limited. For early-stage adoption and intermittent use, GaaS offers more support.

In contrast, when companies choose an internal intelligence approach, the model inference cycle is incorporated and managed within the compute environment and the existing business software setting. This is a viable solution for relatively small models (approximately 30B parameters or less in 2024), and potentially even medium models (50B to 70B parameters in 2024), running on a client device, the network, an on-prem data center, or cloud cycles in an environment arranged with a service provider, such as a virtual private cloud (VPC).

Models like Llama 3.1 8B or similar can run on the developer’s local machine (Mac or PC). Using optimization techniques like quantization, the needed user experience can be achieved while operating within the local setting. Using a tool and framework like Ollama, developers can manage inference execution locally. Inference cycles can be run on legacy GPUs, Intel Xeon, or Intel Gaudi AI accelerators in the company’s data center. If inference is run on the model at a service provider, it will be billed as infrastructure-as-a-service (IaaS), using the company’s own setting and execution choices.
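
A minimal sketch of this local, managed-execution path might query a small model served by Ollama over its local REST endpoint, as below; the model tag, port, and response fields reflect Ollama defaults at the time of writing and may differ in a given setup.

```python
# Sketch of local managed-execution inference: a small open source model served
# by Ollama on the developer's machine, queried over its local REST endpoint.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",   # assumes `ollama pull llama3.1:8b` was run beforehand
        "messages": [{"role": "user", "content": "Summarize our returns policy."}],
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["message"]["content"])
```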

When inference execution is done in the company compute environment (client, edge, on-prem, or IaaS), there is a higher requirement for CapEx for ownership of the computer equipment if it goes beyond adding a workload to existing hardware. While the comparison of OpEx vs. CapEx is complex and depends on many variables, CapEx is preferable when deployment requires broad, continuous, stable usage. This is especially true as smaller models and optimization technologies allow for running advanced open source models on mainstream devices and processors and even local notebooks/desktops.

Running inference in the company compute environment allows for tighter control over aspects of security and privacy. Reducing data movement and exposure can be valuable in preserving privacy. Furthermore, a retrieval-based AI solution run in a local setting can be supported with fine controls to address potential privacy concerns by giving user-controlled access to information. Security is frequently mentioned as one of the top concerns of companies deploying GenAI and confidential computing is a primary ask. Confidential computing protects data in use by computing in an attested hardware-based Trusted Execution Environment (TEE).

Smaller, open source models can run within a company’s most secure application setting. For example, a model running on Xeon can be fully executed within a TEE with limited overhead. As shown in Figure 8, encrypted data remains protected while not in compute. The model is checked for provenance and integrity to protect against tampering. The actual execution is protected from any breach, including by the operating system or other applications, preventing viewing or alteration by untrusted entities.

Figure 8. Security requirements for GenAI. Image credit: Intel Labs.

Summary

Generative AI is a transformative technology now under evaluation or active adoption by most companies across all industries and sectors. As AI developers consider their options for the best solution, one of the most important questions they need to address is whether to use external proprietary models or rely on the open source ecosystem. One path is to rely on a large proprietary black-box GaaS solution using RAG, such as GPT-4o or Gemini Ultra. The other path uses a more adaptive and integrative approach – small, selected, and exchanged as needed from a large open source model pool, mainly utilizing company information, customized and optimized based on particular needs, and executed within the existing infrastructure of the company. As mentioned, there could be a combination of choices within these two base strategies.

I believe that as numerous AI solution developers face this essential dilemma, most will eventually (after a learning period) choose to embed open source GenAI models in their internal compute environment, data, and business setting. They will ride the incredible advancement of the open source and broad ecosystem virtuous cycle of AI innovation, while maintaining control over their costs and destiny.

Let’s give AI the final word in solving the AI developer’s dilemma. In a staged AI debate, OpenAI’s GPT-4 argued with Microsoft’s open source Orca 2 13B on the merits of using proprietary vs. open source GenAI for future development. Using GPT-4 Turbo as the judge, open source GenAI won the debate. The winning argument? Orca 2 called for a "more distributed, open, collaborative future of AI development that leverages worldwide talent and aims for collective advancements. This model promises to accelerate innovation and democratize access to AI, and ensure ethical and transparent practices through community governance."

Learn More: GenAI Series

Knowledge Retrieval Takes Center Stage: GenAI Architecture Shifting from RAG Toward Interpretive Retrieval-Centric Generation (RCG) Models

Survival of the Fittest: Compact Generative AI Models Are the Future for Cost-Effective AI at Scale

Have Machines Just Made an Evolutionary Leap to Speak in Human Language?

References

  1. Hello GPT-4o. (2024, May 13). https://openai.com/index/hello-gpt-4o/

The post The AI Developer’s Dilemma: Proprietary AI vs. Open Source Ecosystem appeared first on Towards Data Science.

]]>
Knowledge Retrieval Takes Center Stage https://towardsdatascience.com/knowledge-retrieval-takes-center-stage-183be733c6e8/ Thu, 16 Nov 2023 05:12:17 +0000 https://towardsdatascience.com/knowledge-retrieval-takes-center-stage-183be733c6e8/ GenAI Architecture Shifting from RAG Toward Interpretive Retrieval-Centric Generation (RCG) Models

The post Knowledge Retrieval Takes Center Stage appeared first on Towards Data Science.

]]>
To transition from consumer to business deployment for GenAI, solutions should be built primarily around information external to the model using retrieval-centric generation (RCG).

As generative AI (GenAI) begins deployment throughout industries for a wide range of business usages, companies need models that provide efficiency, accuracy, security, and traceability. The original architecture of ChatGPT-like models has shown a major gap in meeting these key requirements. With early GenAI models, retrieval was used as an afterthought to address the shortcomings of models that rely on information memorized in parametric memory. Current models have made significant progress on that issue by enhancing solution platforms with a retrieval-augmented generation (RAG) front-end that extracts information external to the model. Perhaps it’s time to rethink the architecture of generative AI further and move from RAG systems, where retrieval is an addendum, to retrieval-centric generation (RCG) models built around retrieval as the core means of accessing information.

Retrieval-centric generation models can be defined as generative AI solutions designed for systems where the vast majority of data resides outside the model’s parametric memory and is mostly not seen in pre-training or fine-tuning. With RCG, the primary role of the GenAI model is to interpret rich retrieved information from a company’s indexed data corpus or other curated content. Rather than memorizing data, the model focuses on fine-tuning for targeted constructs, relationships, and functionality. The data in the generated output is expected to approach 100% accuracy and timeliness. The ability to properly interpret and use large amounts of data not seen in pre-training requires increased abstraction in the model and the use of schemas as a key cognitive capability to identify complex patterns and relationships in information. These new requirements of retrieval, coupled with automated learning of schemas, will lead to further evolution in the pre-training and fine-tuning of Large Language Models (LLMs).

Figure 1. Advantages and challenges of retrieval-centric generation (RCG) versus retrieval-augmented generation (RAG). Image credit: Intel Labs.

Substantially reducing the use of memorized data from the parametric memory in GenAI models and instead relying on verifiable indexed sources will improve provenance and play an important role in enhancing accuracy and performance. The prevalent assumption in GenAI architectures up to now has been that more data in the model is better. Based on this currently predominant structure, it is expected that most tokens and concepts have been ingested and cross-mapped so that models can generate better answers from their parametric memory. However, in the common business scenario, the large majority of data utilized for the generated output is expected to come from retrieved inputs. We’re now observing that having more data in the model while relying on retrieved knowledge causes conflicts of information, or inclusion of data that can’t be traced or verified with its source. As I outlined in my last blog, Survival of the Fittest, smaller, nimble targeted models designed to use RCG don’t need to store as much data in parametric memory.

In business settings where the data will come primarily from retrieval, the targeted system needs to excel in interpreting unseen relevant information to meet company requirements. In addition, the prevalence of large vector databases and an increase in context window size (for example, OpenAI has recently increased the context window in GPT-4 Turbo from 32K to 128K) are shifting models toward reasoning and the interpretation of unseen complex data. Models now require intelligence to turn broad data into effective knowledge by utilizing a combination of sophisticated retrieval and fine-tuning. As models become retrieval-centric, cognitive competencies for creating and utilizing schemas will take center stage.

Consumer Versus Business Uses of GenAI

After a decade of rapid growth in AI model size and complexity, 2023 marks a shift in focus to efficiency and the targeted application of generative AI. The transition from a consumer focus to business usage is one of the key factors driving this change on three levels: quality of data, source of data, and targeted uses.

Quality of data: When generating content and analysis for companies, 95% accuracy is insufficient. Businesses need near or at full accuracy. Fine-tuning for high performance on specific tasks and managing the quality of data used are both required for ensuring quality of output. Furthermore, data needs to be traceable and verifiable. Provenance matters, and retrieval is central for determining the source of content.

Source of data: The vast majority of the data in business applications is expected to be curated from trusted external sources as well as proprietary business/enterprise data, including information about products, resources, customers, supply chain, internal operations, and more. Retrieval is central to accessing the latest and broadest set of proprietary data not pre-trained in the model. Models large and small can have problems with provenance when using data from their own internal memory versus verifiable, traceable data extracted from business sources. If the data conflicts, it can confuse the model.

Targeted usages: The constructs and functions of models for companies tend to be specialized on a set of usages and types of data. When GenAI functionality is deployed in a specific workflow or business application, it is unlikely to require all-in-one functionality. And since the data will come primarily from retrieval, the targeted system needs to excel in interpreting relevant information unseen by the model in particular ways required by the company.

For example, if a financial or healthcare company pursues a GenAI model to improve its services, it will focus on the family of functions needed for its intended use. It has the option to pre-train a model from scratch and try to include all its proprietary information. However, such an effort is likely to be expensive, require deep expertise, and quickly fall behind as the technology evolves and the company data continuously changes. Furthermore, it will need to rely on retrieval anyway for access to the latest concrete information. A more effective path is to take an existing pre-trained base model (like Meta’s Llama 2) and customize it through fine-tuning and indexing for retrieval. Fine-tuning uses just a small fraction of the information and tasks to refine the behavior of the model, while the extensive proprietary business information itself can be indexed and made available for retrieval as needed. As the base model gets updated with the latest GenAI technology, refreshing the target model should be a relatively straightforward process of repeating the fine-tuning flow.

Shift to Retrieval-Centric Generation: Architecting Around Indexed Information Extraction

Meta AI and university collaborators introduced retrieval-augmented generation in 2020 to address issues of provenance and updating world knowledge in LLMs. Researchers used RAG as a general-purpose approach to add non-parametric memory to pre-trained, parametric-memory generation models. The non-parametric memory used a Wikipedia dense vector index accessed by a pre-trained retriever. In a compact model with less memorized data, there is a strong emphasis on the breadth and quality of the indexed data referenced by the vector database because the model cannot rely on memorized information for business needs. Both RAG and RCG can use the same retriever approach by pulling relevant knowledge from curated corpora on the fly during inference time (see Figure 2). They differ in the way the GenAI system places its information as well as in the interpretation expectations of previously unseen data. With RAG, the model itself is a major source of information, and it’s aided by retrieved data. In contrast, with RCG the vast majority of data resides outside the model’s parametric memory, making the interpretation of unseen data the model’s primary role.

It should be noted that many current RAG solutions rely on flows like LangChain or Haystack to connect front-end retrieval over an independent vector store to a GenAI model that was not pre-trained with retrieval. These solutions provide an environment for indexing data sources, model choice, and model behavioral training. Other approaches, such as REALM by Google Research, experiment with end-to-end pre-training with integrated retrieval. Currently, OpenAI is optimizing its retrieval GenAI path for ChatGPT rather than leaving it to the ecosystem to create the flow. The company recently released the Assistants API, which retrieves proprietary domain data, product information, or user documents external to the model.
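To make the retrieve-then-generate flow concrete, below is a minimal, framework-agnostic sketch of the pattern these tools implement; the embedding model, toy corpus, and prompt wording are illustrative assumptions and not the API of LangChain, Haystack, or the Assistants API.

```python
# Minimal retrieve-then-generate sketch (illustrative; corpus and model are placeholders).
import faiss
from sentence_transformers import SentenceTransformer

corpus = [
    "Acme's flagship widget ships with a two-year warranty.",
    "Tier-one suppliers of Acme are audited quarterly.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(corpus, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vectors.shape[1])  # inner product on normalized vectors = cosine similarity
index.add(doc_vectors)

def retrieve(query, k=2):
    query_vec = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(query_vec, k)
    return [corpus[i] for i in ids[0]]

def build_rcg_prompt(query):
    # In an RCG setup, the generator is instructed to answer only from this context.
    context = "\n".join(retrieve(query))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

print(build_rcg_prompt("How long is the widget warranty?"))
```

In an RCG deployment, the generator would be instructed (and fine-tuned) to answer strictly from the retrieved context, whereas a RAG setup would also let the model fall back on its parametric memory.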

Figure 2. Both RCG and RAG retrieve public and private data during inference, but they differ in how they place and interpret unseen data. Image credit: Intel Labs.

In other examples, fast retriever models like Intel Labs’ fastRAG use pre-trained small foundation models to extract requested information from a knowledge base without any additional training, providing a more sustainable solution. Built as an extension to the open-source Haystack GenAI framework, fastRAG uses a retriever model to generate conversational answers by retrieving current documents from an external knowledge base. In addition, a team of researchers from Meta recently published a paper introducing Retrieval-Augmented Dual Instruction Tuning (RA-DIT), "a lightweight fine-tuning methodology that provides a third option by retrofitting any large language model with retrieval capabilities."

The shift from RAG to RCG models challenges the role of information in training. Rather than being both the repository of information and the interpreter of information in response to a prompt, with RCG the model’s functionality shifts to primarily being an in-context interpreter of retrieved (usually business-curated) information. This may require a modified approach to pre-training and fine-tuning because the current objectives used to train language models may not be suitable for this type of learning. RCG requires different abilities from the model, such as handling longer context, interpreting data, and curating data, and it raises other new challenges.

There are still rather few examples of RCG systems in academia or industry. In one instance, researchers from Kioxia Corporation created the open-source SimplyRetrieve, which uses an RCG architecture to boost the performance of LLMs by separating context interpretation from knowledge memorization. Using an implementation on a Wizard-Vicuna-13B model, the researchers found that RCG accurately answered a query about an organization’s factory location. In contrast, RAG attempted to integrate the retrieved knowledge base with Wizard-Vicuna’s knowledge of the organization, which resulted in partially erroneous information or hallucinations. This is only one example – RAG and retrieval-off generation (ROG) may offer correct responses in other situations.

Figure 3. Comparison of retrieval-centric generation (RCG), retrieval-augmented generation (RAG), and retrieval-off generation (ROG). Correct responses are shown in blue while hallucinations are shown in red. Image credit: Kioxia Corporation.

In a way, transitioning from RAG to RCG can be likened to the difference in programming when using constants (RAG) and variables (RCG). When an AI model answers a question about a convertible Ford Mustang, a large model will be familiar with many of the car’s related details, such as year of introduction and engine specs. The large model can also add some recently retrieved updates, but it will respond primarily based on specific internal known terms or constants. However, when a model is deployed at an electric vehicle company preparing its next car release, the model requires reasoning and complex interpretation since almost all the data will be unseen. The model will need to understand how to use the type of information, such as values for variables, to make sense of the data.

Schema: Generalization and Abstraction as a Competency During Inference

Much of the information retrieved in business settings (business organization and people, products and services, internal processes, and assets) would not have been seen by the corresponding GenAI model during pre-training and would likely only be sampled during fine-tuning. This implies that the transformer architecture is not placing "known" words or terms (i.e., those previously ingested by the model) as part of its generated output. Instead, the architecture is required to place unseen terms within a proper in-context interpretation. This is somewhat similar to how in-context learning already enables some new reasoning capabilities in LLMs without additional training.

With this change, further improvements in generalization and abstraction are becoming a necessity. A key competency that needs to be enhanced is the ability to use learned schemas when interpreting and using unseen terms or tokens encountered at inference time through prompts. A schema in cognitive science "describes a pattern of thought or behavior that organizes categories of information and the relationships among them." Mental schema "can be described as a mental structure, a framework representing some aspect of the world." Similarly, in GenAI models schema is an essential abstraction mechanism required for proper interpretation of unseen tokens, terms, and data. Models today already display a fair grasp of emerging schema construction and interpretation, otherwise they would not be able to perform generative tasks on complex unseen prompt context data as well as they do. As the model retrieves previously unseen information, it needs to identify the best matching schema for the data. This allows the model to interpret the unseen data through knowledge related to the schema, not just explicit information incorporated in the context. It’s important to note that in this discussion I am referring to neural network models that learn and abstract the schema as an emergent capability, rather than the class of solutions that rely on an explicit schema represented in a knowledge graph and referenced during inference time.

Looking through the lens of the three types of model capabilities (cognitive competencies, functional skills, and information access), abstraction and schema usage belongs squarely in the cognitive competencies category. In particular, small models should be able to perform comparably to much larger ones (given the appropriate retrieved data) if they hone the skill to construct and use schema in interpreting data. It is to be expected that curriculum-based pre-training related to schemas will boost cognitive competencies in models. This includes the models’ ability to construct a variety of schemas, identify the appropriate schemas to use based on the generative process, and insert/utilize the information with the schema construct to create the best outcome.

For example, researchers showed how current LLMs can learn basic schemas using the Hypotheses-to-Theories (HtT) framework. Researchers found that an LLM can be used to generate rules that it then follows to solve numerical and relational reasoning problems. The rules discovered by GPT-4 could be viewed as a detailed schema for comprehending family relationships (see Figure 4). Future schemas of family relationships can be even more concise and powerful.

Figure 4. Using the CLUTRR dataset for relational reasoning, the Hypotheses-to-Theories framework prompts GPT-4 to generate schema-like rules for the LLM to follow when answering test questions. Image credit: Zhu et al.

Applying this to a simple business case, a GenAI model could use a schema for understanding the structure of a company’s supply chain. For instance, knowing that "B is a supplier of A" and "C is a supplier of B" implies that "C is a tier-two supplier of A" would be important when analyzing documents for potential supply chain risks.
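As a toy illustration of that schema, the tier-two rule can be written directly over retrieved supplier facts; the company names and data structure below are invented for the example, and in the RCG setting the model would apply an equivalent learned pattern rather than explicit code.

```python
# Hypothetical supplier facts, as they might be assembled from retrieved documents.
direct_suppliers = {
    "A": {"B"},        # "B is a supplier of A"
    "B": {"C", "D"},   # "C is a supplier of B", "D is a supplier of B"
}

def tier_two_suppliers(company):
    """Schema-like rule: a supplier of my supplier is my tier-two supplier."""
    tier_two = set()
    for tier_one in direct_suppliers.get(company, set()):
        tier_two |= direct_suppliers.get(tier_one, set())
    return tier_two

print(tier_two_suppliers("A"))  # {'C', 'D'}
```

The point of an emergent schema is that the model internalizes this kind of relational pattern during training, so it can apply it to supplier names it sees only at inference time.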

In a more complex case such as teaching a GenAI model the variations and nuances of documenting a patient’s visit to a healthcare provider, an emergent schema established during pre-training or fine-tuning would provide a structure for understanding retrieved information for generating reports or supporting the healthcare team’s questions and answers. The schema could emerge in the model within a broader training/fine-tuning on patient care cases, which include appointments as well as other complex elements like tests and procedures. As the GenAI model is exposed to all the examples, it should create the expertise to interpret partial patient data that will be provided during inference. The model’s understanding of the process, relationships, and variations will allow it to properly interpret previously unseen patient cases without requiring the process information in the prompt. In contrast, it should not try to memorize particular patient information it is exposed to during pre-training or fine-tuning. Such memorization would be counterproductive because patients’ information continuously changes. The model needs to learn the constructs rather than the particular cases. Such a setup would also minimize potential privacy concerns.

Summary

As GenAI is deployed at scale in businesses across all industries, there is a distinct shift to reliance on high quality proprietary information as well as requirements for traceability and verifiability. These key requirements along with the pressure on cost efficiency and focused application are driving the need for small, targeted GenAI models that are designed to interpret local data, mostly unseen during the pre-training process. Retrieval-centric systems require elevating some cognitive competencies that can be mastered by deep learning GenAI models, such as constructing and identifying appropriate schemas to use. By using RCG and guiding the pre-training and fine-tuning process to create generalizations and abstractions that reflect cognitive constructs, GenAI can make a leap in its ability to comprehend schemas and make sense of unseen data from retrieval. Refined abstraction (such as schema-based reasoning) and highly efficient cognitive competencies seem to be the next frontier.

Learn More: GenAI Series

Survival of the Fittest: Compact Generative AI Models Are the Future for Cost-Effective AI at Scale

References

  1. Gillis, A. S. (2023, October 5). retrieval-augmented generation. Enterprise AI. https://www.techtarget.com/searchenterpriseai/definition/retrieval-augmented-generation
  2. Singer, G. (2023, July 28). Survival of the fittest: Compact generative AI models are the future for Cost-Effective AI at scale. Medium. https://towardsdatascience.com/survival-of-the-fittest-compact-generative-ai-models-are-the-future-for-cost-effective-ai-at-scale-6bbdc138f618
  3. New models and developer products announced at DevDay. (n.d.). https://openai.com/blog/new-models-and-developer-products-announced-at-devday
  4. Meta AI. (n.d.). Introducing Llama 2. https://ai.meta.com/llama/
  5. Lewis, P. (2020, May 22). Retrieval-Augmented Generation for Knowledge-Intensive NLP tasks. arXiv.org. https://arxiv.org/abs/2005.11401
  6. LangChain. (n.d.). https://www.langchain.com
  7. Haystack. (n.d.). Haystack. https://haystack.deepset.ai/
  8. Guu, K. (2020, February 10). REALM: Retrieval-Augmented Language Model Pre-Training. arXiv.org. https://arxiv.org/abs/2002.08909
  9. Intel Labs. (n.d.). GitHub – Intel Labs/FastRAG: Efficient Retrieval Augmentation and Generation Framework. GitHub. https://github.com/IntelLabs/fastRAG
  10. Fleischer, D. (2023, August 20). Open Domain Q&A using Dense Retrievers in fastRAG – Daniel Fleischer – Medium. https://medium.com/@daniel.fleischer/open-domain-q-a-using-dense-retrievers-in-fastrag-65f60e7e9d1e
  11. Lin, X. V. (2023, October 2). RA-DIT: Retrieval-Augmented Dual Instruction Tuning. arXiv.org. https://arxiv.org/abs/2310.01352
  12. Ng, Y. (2023, August 8). SimplyRetrieve: a private and lightweight Retrieval-Centric generative AI tool. arXiv.org. https://arxiv.org/abs/2308.03983
  13. Wikipedia contributors. (2023, September 27). Schema (psychology). Wikipedia. https://en.wikipedia.org/wiki/Schema_(psychology)
  14. Wikipedia contributors. (2023a, August 31). Mental model. Wikipedia. https://en.wikipedia.org/wiki/Mental_schema
  15. Zhu, Z. (2023, October 10). Large Language Models can Learn Rules. arXiv.org. https://arxiv.org/abs/2310.07064

The post Knowledge Retrieval Takes Center Stage appeared first on Towards Data Science.

]]>
Survival of the Fittest: Compact Generative AI Models Are the Future for Cost-Effective AI at Scale https://towardsdatascience.com/survival-of-the-fittest-compact-generative-ai-models-are-the-future-for-cost-effective-ai-at-scale-6bbdc138f618/ Tue, 25 Jul 2023 17:53:33 +0000 https://towardsdatascience.com/survival-of-the-fittest-compact-generative-ai-models-are-the-future-for-cost-effective-ai-at-scale-6bbdc138f618/ The case for nimble, targeted, retrieval-based models as the best solution for generative AI applications deployed at scale.

The post Survival of the Fittest: Compact Generative AI Models Are the Future for Cost-Effective AI at Scale appeared first on Towards Data Science.

]]>
Image credit: Adobe Stock.

After a decade of rapid growth in Artificial Intelligence (AI) model complexity and compute, 2023 marks a shift in focus to efficiency and the broad application of generative AI (GenAI). As a result, a new crop of models with less than 15 billion parameters, referred to as nimble AI, can closely match the capabilities of ChatGPT-style giant models containing more than 100B parameters, especially when targeted for particular domains. While GenAI is already being deployed throughout industries for a wide range of business usages, the use of compact yet highly intelligent models is rising. In the near future, I expect there will be a small number of giant models and a giant number of small, more nimble AI models embedded in countless applications.

While there has been great progress with larger models, bigger is certainly not better when it comes to training and environmental costs. Training GPT-4 alone reportedly cost more than $100 million according to TrendForce estimates, while nimble model pre-training costs are orders of magnitude lower (quoted as approximately $200,000 for MosaicML’s MPT-7B, for example). Most of the compute costs occur during continuous inference execution, where larger models face a similar challenge of expensive compute. Furthermore, giant models hosted on third-party environments raise security and privacy challenges. Nimble models are substantially cheaper to run and provide a host of additional benefits such as adaptability, hardware flexibility, integrability within larger applications, security and privacy, explainability, and more (see Figure 1). The perception that smaller models don’t perform as well as larger models is also changing. Smaller, targeted models are not less intelligent – they can provide equivalent or superior performance for business, consumer, and scientific domains, increasing their value while decreasing time and cost investment.

A growing number of these nimble models roughly match the performance of ChatGPT-3.5-level giant models and continue to rapidly improve in performance and scope. And, when nimble models are equipped with on-the-fly retrieval of curated domain-specific private data and targeted retrieval of web content based on a query, they become more accurate and more cost-effective than giant models that memorize a wide-ranging data set.

Figure 1. Benefits of nimble GenAI models. Image credit: Intel Labs.

As nimble open source GenAI models step forward to drive the fast progression of the field, this "iPhone moment," when a revolutionary technology becomes mainstream, is being challenged by an "Android revolution" as a strong community of researchers and developers builds on each other’s open source efforts to create increasingly capable nimble models.

Think, Do, Know: Nimble Models with Targeted Domains Can Perform Like Giant Models

Figure 2. Generative AI classes of competencies. Image credit: Intel Labs.

To gain more understanding of when and how a smaller model can deliver highly competitive results for generative AI, it is important to observe that both nimble and giant GenAI models need three classes of competencies to perform:

  1. Cognitive abilities to think: Including language comprehension, summarization, reasoning, planning, learning from experience, long-form articulation and interactive dialog.
  2. Functional skills to do: For example – reading text in the wild, reading charts/graphs, visual recognition, programming (coding and debug), image generation and speech.
  3. Information (memorized or retrieved) to know: Web content, including social media, news, research, and other general content, and/or curated domain-specific content such as medical, financial and enterprise data.

Cognitive abilities to think. Based on its cognitive abilities, the model can "think" and understand, summarize, synthesize, reason, and compose language and other symbolic representations. Both nimble and giant models can perform well in these cognitive tasks and it is not clear that those core capabilities require massive model sizes. For example, nimble models like Microsoft Research’s Orca are showing understanding, logic, and reasoning skills that already match or surpass those of ChatGPT on multiple benchmarks. Furthermore, Orca also demonstrates that reasoning skills can be distilled from larger models used as teachers. However, the current benchmarks used to evaluate cognitive skills of models are still rudimentary. Further research and benchmarking are required to validate that nimble models can be pre-trained or fine-tuned to fully match the "thinking" strength of giant models.

Functional skills to do. Larger models are likely to have more functional skills and information given their general focus as all-in-one models. However, for most business usages, there is a particular range of functional skills needed for any application being deployed. A model used in a business application should have flexibility and headroom for growth and variation of use, but it rarely needs an unbounded set of functional skills. GPT-4 can generate text, code and images in multiple languages, but speaking hundreds of languages doesn’t necessarily mean that giant models have inherently more underlying cognitive competencies – it primarily gives the model added functional skills to "do" more. Furthermore, functionally specialized engines will be linked to GenAI models and used when that functionality is needed – adding mathematical "Wolfram superpowers" to ChatGPT in a modular way, for example, could provide best-in-class functionality without burdening the model with unnecessary scale. For instance, GPT-4 is deploying plugins that essentially utilize smaller models for add-on functions. It’s also rumored that the GPT-4 model itself is a collection of multiple giant (less than 100B parameters) "mixture of experts" models trained on different data and task distributions rather than one monolithic dense model like GPT-3.5. To get the best combination of capabilities and model efficiencies, it is likely that future multi-functional models will employ smaller, more focused mixture-of-experts models that are each smaller than 15B parameters.

Figure 3. Retrieval-based, functionally extended models can offer a broad scope of functionality and relevant information, largely independent of model size. Image credit: Intel Labs.

Information (memorized or retrieved) to know. Giant models "know" more by memorizing vast amounts of data within parametric memory, but this doesn’t necessarily make them smarter. They are just more generally knowledgeable than smaller models. Giant models have high value in zero-shot environments for new use cases, serving a general consumer base when there’s no need for targeting, and acting as a teacher model when distilling and fine-tuning nimble models like Orca. However, targeted nimble models can be trained and/or fine-tuned for particular domains, providing sharper skills for the capabilities needed.

Figure 4. Value of retrieval in allowing small models to match much larger models (using the Contriever retrieval method). Image credit: Intel Labs based on the work of Mallen et al.

For example, a model that is targeted for programming can focus on a different set of capabilities than a healthcare AI system. Furthermore, by using retrieval over a curated set of internal and external data, the accuracy and timeliness of the model can be greatly improved. A recent study showed that on the PopQA benchmark, models as small as 1.3B parameters with retrieval can perform as well as a model more than a hundred times their size at 175B parameters (see Figure 4). In that sense, the relevant knowledge of a targeted system with high-quality indexed accessible data may be much more extensive than that of an all-in-one general-purpose system. This may be more important for the majority of enterprise applications that require use-case or application-specific data – and in many instances, local knowledge instead of vast general knowledge. This is where the value of nimble models will be realized moving forward.

Three Aspects Contributing to the Explosive Growth in Nimble Models

There are three aspects to consider when assessing the benefits and value of nimble models:

  1. High efficiency at modest model sizes.
  2. Licensing as open source or proprietary.
  3. Model specialization as general purpose or targeted including retrieval.

In terms of size, nimble general-purpose models, such as Meta’s LLaMA-7B and -13B or Technology Innovation Institute’s Falcon 7B open source models, and proprietary models such as MosaicML’s MPT-7B, Microsoft Research’s Orca-13B and Salesforce AI Research’s XGen-7B, are improving in rapid succession (see Figure 6). Having a choice of high-performance, smaller models has significant implications for the cost of operation as well as the choice of compute environments. ChatGPT’s 175B parameter model and the estimated 1.8 trillion parameters for GPT-4 require a massive installation of accelerators such as GPUs with enough compute power to handle the training and fine-tuning workload. In contrast, nimble models can generally run inference on any choice of hardware, anywhere from a single-socket CPU, through entry-level GPUs, and up to the largest acceleration racks. The definition of nimble AI is currently set at 15B parameters, chosen empirically based on the outstanding results of models sized at 13B parameters or smaller. Overall, nimble models offer a more cost-effective and scalable approach to handling new use cases (see the section on advantages and disadvantages of nimble models).
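To illustrate that hardware flexibility, the sketch below loads a 7B-class open model for CPU inference with standard tooling; the checkpoint name is an assumed example, and in practice quantized weights are often used to meet memory and latency budgets.

```python
# Minimal sketch: CPU inference with a small open-weight model via transformers.
# The checkpoint name is an assumed example; any ~7B (or smaller) causal LM works.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b-instruct",
    device=-1,  # -1 = run on CPU
)

prompt = "List two benefits of compact generative AI models:"
output = generator(prompt, max_new_tokens=60, do_sample=False)
print(output[0]["generated_text"])
```

The same code path moves to a GPU or accelerator by changing the device argument, which is the practical meaning of hardware choice for nimble models.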

The second aspect of open source licensing allows universities and companies to iterate on one another’s models, driving a boom of creative innovations. Open source models allow for the incredible progress of small model capabilities as demonstrated in Figure 5.

Figure 5. Nimble open source non-commercial and commercial GenAI models blast off during the first half of 2023. Image credit: Intel Labs.

There are multiple examples from early 2023 of general nimble generative AI models starting with LLaMA from Meta, which has models with 7B, 13B, 33B, and 65B parameters. The following models in the 7B and 13B parameter range were created by fine-tuning LLaMA: Alpaca from Stanford University, Koala from Berkeley AI Research, and Vicuna created by researchers from UC Berkeley, Carnegie Mellon University, Stanford, UC San Diego, and MBZUAI. Recently, Microsoft Research published a paper on the not-yet-released Orca, a 13B parameter LLaMA-based model that imitates the reasoning process of giant models with impressive results prior to targeting or fine-tuning to a particular domain.

Figure 6. A comparison of open source chatbots’ relative response quality evaluated by GPT-4 using the Vicuna evaluation set. Image credit: Microsoft Research.

Vicuna could be a good proxy for recent open source nimble models that were derived from LLaMA as the base model. Vicuna-13B is a chatbot created by a university collaboration that has been "developed to address the lack of training and architecture details in existing models such as ChatGPT." After being fine-tuned on user-shared conversations from ShareGPT, Vicuna reaches more than 90% of the response quality of ChatGPT and Google Bard when using GPT-4 as a judge. However, these early open source models are not available for commercial use. MosaicML’s MPT-7B and Technology Innovation Institute’s Falcon 7B, both commercially usable open source models, are reportedly equal in quality to LLaMA-7B.

Figure 7. Orca-13B performs as well as ChatGPT on BIG-bench Hard’s complex zero-shot reasoning tasks. Image credit: Microsoft Research.

Orca "surpasses conventional instruction-tuned models such as Vicuna-13B by more than 100% in complex zero-shot reasoning benchmarks like Big-Bench Hard (BBH). It reaches parity with ChatGPT-3.5 on the BBH benchmark," according to researchers. Orca-13B’s top performance over other general models reinforces the notion that the massive size of giant models may be an artifact of early, brute-force approaches. The scale of giant foundation models can still be important as a source for smaller models like Orca-13B to distill knowledge and methods from, but that scale is not necessarily required for inference – even in the general case. A word of caution – a full evaluation of the model’s cognitive capabilities, functional skills, and knowledge memorization will only be possible when it is broadly deployed and exercised.

As of this writing, Meta has released its Llama 2 model with 7B, 13B and 70B parameters. Arriving just four months after the first generation, the model offers meaningful improvements. In the comparison chart, the nimble Llama 2 13B achieves results similar to larger models from the previous LLaMA generation as well as to MPT-30B and Falcon 40B. Llama 2 is open source and free for research and commercial use. It was introduced in close partnership with Microsoft as well as quite a few other partners, including Intel. Meta’s commitment to open source models and its broad collaboration will surely give an additional boost to the rapid cross-industry/academia improvement cycles we are seeing for such models.

The third aspect of nimble models has to do with specialization. Many of the newly introduced nimble models are general purpose – like LLaMA, Vicuna and Orca. General nimble models may rely solely on their parametric memory, using low-cost updates through fine-tuning methods such as LoRA (Low-Rank Adaptation of Large Language Models), as well as retrieval-augmented generation, which pulls relevant knowledge from curated corpora on the fly during inference. Retrieval-augmented solutions are being established and continuously enhanced with GenAI frameworks like LangChain and Haystack. These frameworks allow easy and flexible integration of indexing and effective access to large corpora for semantics-based retrieval.
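As an illustration of such low-cost adaptation, the sketch below wires a LoRA adapter onto a small base model with the Hugging Face PEFT library; the base checkpoint name, rank, and target modules are illustrative assumptions rather than settings recommended by the works cited here.

```python
# Minimal LoRA setup sketch using Hugging Face transformers + PEFT.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Llama-2-7b-hf"  # assumed example of a nimble base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the weights
# Training then proceeds with the usual Trainer loop on domain-specific data,
# while retrieval supplies fresh facts at inference time.
```

Because only the adapter weights are trained, iteration cycles stay short, which is what enables the frequent refreshes discussed later.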

Most business users prefer targeted models that are tuned for their particular domain of interest. These targeted models also tend to be retrieval-based to utilize all key information assets. For example, healthcare users may want to automate patient communications.

Targeted models use two methods:

  1. Specialization of the model itself to the tasks and type of data required for the targeted use cases. This could be done in multiple ways, including pre-training a model on specific domain knowledge (like how phi-1 was pre-trained on textbook-quality data from the web), fine-tuning a general purpose base model of the same size (like how Clinical Camel fine-tuned LLaMA-13B), or distilling and learning from a giant model into a student nimble model (like how Orca learned to imitate the reasoning process of GPT-4, including explanation traces, step-by-step thought processes, and other complex instructions).
  2. Curating and indexing relevant data for on-the-fly retrieval, which could be a large volume, but still within the scope/space of the targeted use case. Models can retrieve public web and private consumer or enterprise content that is continuously updated. Users determine which sources to index, allowing the choice of high-quality resources from the web plus more complete resources like an individual’s private data or a company’s enterprise data. While retrieval is now integrated into both giant and nimble systems, it plays a crucial role in smaller models as it provides all the necessary information for the model’s performance. It also allows businesses to make all their private and local information available to a nimble model running within their compute environment (a minimal indexing sketch follows this list).
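To ground the indexing step, here is a minimal sketch of building a searchable index over curated documents while keeping source metadata for attribution. The embedding model, document snippets, and file paths are invented placeholders, not a prescribed toolchain.

```python
# Sketch: indexing curated documents with source metadata so retrieved passages
# can be attributed back to their origin. Contents and paths are invented.
import faiss
from sentence_transformers import SentenceTransformer

documents = [
    {"text": "Policy update: claims must be filed within 30 days.", "source": "intranet/claims_policy_v7.pdf"},
    {"text": "Q3 supplier audit summary for the EMEA region.", "source": "sharepoint/audits/q3_emea.docx"},
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode([d["text"] for d in documents], normalize_embeddings=True)

index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

def retrieve_with_provenance(query, k=1):
    q = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(q, k)
    return [documents[i] for i in ids[0]]  # each hit carries its source for attribution

print(retrieve_with_provenance("How long do I have to file a claim?"))
```

Carrying the source field through to the prompt is what later enables the provenance and explainability benefits discussed in the next section.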

Nimble Generative AI Model Advantages and Disadvantages

In the future, the size of compact models might drift up to 20B or 25B parameters, but it will still stay far below the 100B-parameter scale. There is also a variety of models of intermediate sizes like MPT-30B, Falcon 40B and Llama 2 70B. While they are expected to perform better than smaller models on zero-shot tasks, I would not expect them to perform materially better than nimble, targeted, retrieval-based models for any defined set of functionalities.

When compared with giant models, there are many advantages of nimble models, which are further enhanced when the model is targeted and retrieval-based. These benefits include:

  • Sustainable and lower cost models: Models with substantially lower costs for training and inference compute. Inference run-time compute costs might be the determining factor for viability of business-oriented models integrated into 24×7 usages, and the much-decreased environmental impact is also significant when taken in aggregate across broad deployments. Finally, with their sustainable, specific, and functionally oriented systems, nimble models are not attempting to address ambitious goals of artificial general intelligence (AGI) and are therefore less involved in the public and regulatory debate related to the latter.
  • Faster fine-tune iterations: Smaller models can be fine-tuned in a few hours (or less), adding new information or functionality to the model through adaptation methods like LoRA, which are highly effective in nimble models. This enables more frequent improvement cycles, keeping the model continuously up to date with its usage needs.
  • Retrieval-based model benefits: Retrieval systems refactor knowledge, referencing most of the information from the direct sources rather than the parametric memory of the model. This improves the following:
    – Explainability: Retrieval models use source attribution, providing provenance or the ability to track back to the source of information to provide credibility.
    – Timeliness: Once an up-to-date source is indexed, it is immediately available for use by the model without any need for training or fine-tuning. That allows for continuously adding or updating relevant information in near real-time.
    – Scope of data: The information indexed for per-need retrieval can be very broad and detailed. When focused on its target domains, the model can cover a huge scope and depth of private and public data. It may include more volume and details in its target space than a giant foundation model training dataset.
    – Accuracy: Direct access to data in its original form, detail, and context can reduce hallucinations and data approximations. It can provide reliable and complete answers as long as they are in the retrieval space. With smaller models, there is also less conflict between traceable curated information retrieved per-need, and memorized information (as in giant models) that might be dated, partial and not attributed to sources.

  • Choice of hardware: Inference of nimble models can be done practically on any hardware, including ubiquitous solutions that might already be part of the compute setting. For example, Meta’s Llama 2 nimble models (7B and 13B parameters) are running well on Intel’s data center products including Xeon, Gaudi2 and Intel Data Center GPU Max Series.
  • Integration, security and privacy: Today’s ChatGPT and other giant GenAI models are independent models that usually run on large accelerator installations on a third-party platform and are accessed through interfaces. Nimble AI models can run as an engine that is embedded within a larger business application and can be fully integrated into the local compute environment. This has major implications for security and privacy because there is no need for exchange/exposure of information with third-party models and compute environments, and all security mechanisms of the broader application can be applied to the GenAI engine.
  • Optimization and model reduction: Optimization and model reduction techniques such as quantization, which reduces compute and memory demands by representing weights and activations at lower numerical precision, have shown strong initial results on nimble models, further increasing power efficiency (see the sketch after this list).
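As a concrete illustration of that last point, the sketch below applies PyTorch’s post-training dynamic quantization to a toy network standing in for the linear layers of a nimble transformer; it is a minimal example, not a tuned production recipe.

```python
# Post-training dynamic quantization sketch with PyTorch.
import torch
import torch.nn as nn

# Toy stand-in for the linear layers of a small transformer block.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # store weights as 8-bit integers
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, lower memory footprint and compute cost
```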

Some challenges of nimble models are still worth mentioning:

  • Reduced range of tasks: General-purpose giant models have outstanding versatility and especially excel in zero-shot new usages that were not considered earlier. The breadth and scope that can be achieved with nimble systems is still under evaluation but seems to be improving with recent models. Targeted models assume that the range of tasks is known and defined during pre-training and/or fine-tuning, so the reduction in scope should not impact any relevant capabilities. Targeted models are not single task, but rather a family of related capabilities. This can lead to fragmentation as a result of task- or business-specific nimble models.
  • May be improved with few-shot fine-tuning: For a model to address a targeted space effectively, fine-tuning is not always required, but can aid the effectiveness of AI by adjusting the model to the tasks and information needed for the application. Modern techniques enable this process to be done with a small number of examples and without the need for deep data science expertise.
  • Retrieval models need indexing of all source data: Models pull in needed information during inference through index mapping, but there is a risk of missing an information source, making it unavailable to the model. To ensure provenance, explainability and other properties, targeted retrieval-based models should not rely on detailed information stored in parametric memory, and should instead rely mainly on indexed information that is available for extraction when needed.

Summary

The major leap in generative AI is enabling new capabilities such as AI agents conversing in plain language, the summarization and generation of compelling text, image creation, utilization of the context of previous iterations and much more. This blog introduces the term "nimble AI" and makes the case for why it will be the predominant method in deploying GenAI at scale. Simply put, nimble AI models are faster to run, quicker to refresh through continuous fine-tuning, and more amenable to rapid technology improvement cycles through the collective innovation of the open source community.

As demonstrated through multiple examples, the outstanding performance of nimble models, which build on the evolution of the largest models, shows that they do not require the same massive heft as the giant models. Once the underlying cognitive abilities have been mastered, the required functionality tuned, and the data made available per-need, nimble models provide the highest value for the business world.

That said, nimble models will not render giant models extinct. Giant models are still expected to perform better in a zero-shot, out-of-the-box setting. These large models also might be used as the source (teacher model) for distillation into smaller, nimble models. While giant models have a huge volume of additional memorized information to address any potential use, and are equipped with multiple skills, this generality is not expected to be required for most GenAI applications. Instead, the ability to fine-tune a model to the information and skills relevant for the domain, plus the ability to retrieve recent information from curated local and global sources, would be a much better value proposition for many applications.

Viewing nimble, targeted AI models as modules that can be incorporated into any existing application offers a very compelling value proposition including:

  • Requires a fraction of the cost for deployment and operation.
  • Adaptable to tasks and private/enterprise data.
  • Updatable overnight, and able to run on any hardware, from CPUs to GPUs or accelerators.
  • Integrated into current compute environment and application.
  • Runs on premise or in a private cloud.
  • Benefits from all security and privacy settings.
  • Higher accuracy and explainability.
  • More environmentally responsible while providing a similar level of generative AI capabilities.

Impressive progress on a small number of giant models will continue. However, the industry will most likely need just a few dozen or so general-purpose nimble base models, which can then be used to build countless targeted versions. I foresee a near-term future in which a broad scaling of advanced GenAI will permeate all industries, mostly by integrating nimble, targeted secure intelligence modules as their engines of growth.

References

  1. Tseng, P. K. (2023, March 1). TrendForce Says with Cloud Companies Initiating AI Arms Race, GPU Demand from ChatGPT Could Reach 30,000 Chips as It Readies for Commercialization. TrendForce. https://www.trendforce.com/presscenter/news/20230301-11584.html
  2. Introducing MPT-7B: A New Standard for Open Source, Commercially Usable LLMs. (2023, May 5). https://www.mosaicml.com/blog/mpt-7b
  3. Mukherjee, S., Mitra, A., Jawahar, G., Agarwal, S., Palangi, H., and Awadallah, A. (2023). Orca: Progressive Learning from Complex Explanation Traces of GPT-4. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2306.02707
  4. Wolfram, S. (2023, March 23). ChatGPT Gets Its "Wolfram Superpowers"!. Stephen Wolfram Writings. https://writings.stephenwolfram.com/2023/03/chatgpt-gets-its-wolfram-superpowers/
  5. Schreiner, M. (2023, July 11). GPT-4 architecture, datasets, costs and more leaked. THE DECODER. https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/
  6. ChatGPT plugins. (n.d.). https://openai.com/blog/chatgpt-plugins
  7. Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski, P., Joulin, A., & Grave, E. (2021). Unsupervised Dense Information Retrieval with Contrastive Learning. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2112.09118
  8. Mallen, A., Asai, A., Zhong, V., Das, R., Hajishirzi, H., and Khashabi, D. (2022). When not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2212.10511
  9. Papers with Code – PopQA Dataset. (n.d.). https://paperswithcode.com/dataset/popqa
  10. Introducing LLaMA: A foundational, 65-billion-parameter large language model. (2023, February 24). https://ai.facebook.com/blog/large-language-model-llama-meta-ai/
  11. Introducing Falcon LLM. (n.d.). https://falconllm.tii.ae/
  12. Nijkamp, E., Hayashi, H., Xie, T., Xia, C., Pang, B., Meng, R., Kryscinski, W., Tu, L., Bhat, M., Yavuz, S., Xing, C., Vig, J., Murakhovs’ka, L., Wu, C. S., Zhou, Y., Joty, S. R., Xiong, C., and Savarese, S. (2023). Long Sequence Modeling with XGen: A 7B LLM Trained on 8K Input Sequence Length. Salesforce AI Research. https://blog.salesforceairesearch.com/xgen/
  13. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021, June 17). LoRA: Low-Rank Adaptation of Large Language Models. arXiv (Cornell University). https://doi.org/10.48550/arXiv.2106.09685
  14. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., and Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html
  15. Introduction LangChain. (n.d.). https://python.langchain.com/docs/get_started/introduction.html
  16. Haystack. (n.d.). https://www.haystackteam.com/core/knowledge
  17. Mantium. (2023). How Haystack and LangChain are Empowering Large Language Models. Mantium. https://mantiumai.com/blog/how-haystack-and-langchain-are-empowering-large-language-models/
  18. Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. (2023, March 13). Alpaca: A Strong, Replicable Instruction-Following Model. Stanford University CRFM. https://crfm.stanford.edu/2023/03/13/alpaca.html
  19. Geng, X., Gudibande, A., Liu, H., Wallace, E., Abbeel, P., Levine, S. and Song, D. (2023, April 3). Koala: A Dialogue Model for Academic Research. Berkeley Artificial Intelligence Research Blog. https://bair.berkeley.edu/blog/2023/04/03/koala/
  20. Chiang, W. L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P. (2023, March 30). Vicuna: An Open Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. LMSYS Org. https://lmsys.org/blog/2023-03-30-vicuna/
  21. Rodriguez, J. (2023, April 5). Meet Vicuna: The Latest Meta’s Llama Model that Matches ChatGPT Performance. Medium. https://pub.towardsai.net/meet-vicuna-the-latest-metas-llama-model-that-matches-chatgpt-performance-e23b2fc67e6b
  22. Papers with Code – BIG-bench Dataset. (n.d.). https://paperswithcode.com/dataset/big-bench
  23. Meta. (2023, July 18). Meta and Microsoft introduce the next generation of Llama. Meta. https://about.fb.com/news/2023/07/llama-2/
  24. Meta AI. (n.d.). Introducing Llama 2. https://ai.meta.com/llama/
  25. Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Allie, D. G., Gopi, S., Javaheripi, M., Kauffmann, P., Gustavo, D. R., Saarikivi, O., Salim, A., Shah, S., Behl, H. S., Wang, X., Bubeck, S., Eldan, R., Kalai, A. T., Lee, Y. T., and Li, Y. (2023). Textbooks Are All You Need. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2306.11644
  26. Toma, A., Lawler, P. R., Ba, J., Krishnan, R. G., Rubin, B. B., and Wang, B. (2023). Clinical Camel: An Open-Source Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2305.12031
  27. Patel, D., and Ahmad, A. (2023, May 4). Google "We Have No Moat, And Neither Does OpenAI." SemiAnalysis. https://www.semianalysis.com/p/google-we-have-no-moat-and-neither
  28. Accelerate Llama 2 with Intel AI Hardware and Software Optimizations. (n.d.). Intel. https://www.intel.com/content/www/us/en/developer/articles/news/llama2.html
  29. Smaller is better: Q8-Chat, an efficient generative AI experience on Xeon. (n.d.). Hugging Face. https://huggingface.co/blog/generative-ai-models-on-intel-cpu

The post Survival of the Fittest: Compact Generative AI Models Are the Future for Cost-Effective AI at Scale appeared first on Towards Data Science.

]]>
Have Machines Just Made an Evolutionary Leap to Speak in Human Language? https://towardsdatascience.com/have-machines-just-made-an-evolutionary-leap-to-speak-in-human-language-319237593aa4/ Mon, 17 Apr 2023 22:37:48 +0000 https://towardsdatascience.com/have-machines-just-made-an-evolutionary-leap-to-speak-in-human-language-319237593aa4/ Assessing where we are on the journey to deep, meaningful communication between people and their AI

The post Have Machines Just Made an Evolutionary Leap to Speak in Human Language? appeared first on Towards Data Science.

]]>
As people interact with conversational Artificial Intelligence (AI) systems, clear communication is a key factor in getting the intended outcome that will best serve and augment our lives. In the broader sense, what language should be used for the control system and for conversations with machines? In this blog post, we evaluate the progression of methods for guiding and conversing with machines in light of recent technology innovations, such as OpenAI's ChatGPT and GPT-4, and explore the next steps needed for conversational AI to master natural conversation like a human confidant. Machines have made a categorical leap from prompt engineering to "human speak"; however, other aspects of intelligence are still awaiting discovery.

Until late 2022, getting an AI to respond properly and utilize its strengths required specialized knowledge, such as sophisticated prompt engineering. The introduction of ChatGPT was a major advancement in the conversational ability of machines, allowing even high school students to chat with high-powered AI and get impressive results. This is a significant milestone. However, we also need to assess where we are on the journey of human-machine communication and what is still required to achieve meaningful conversations with AI.

The interaction between people and machines has two broad objectives: first, to instruct the machine on needed tasks, and second, to exchange information and guidance during the performance of those tasks. The first objective is traditionally addressed by programming, but it is now evolving to where a dialog with the user can define a new task, such as asking the AI to create a Python script to perform it. The exchange within a task was addressed through natural language processing (NLP) or natural language understanding (NLU), coupled with generation of a machine response. Let us assume that the central characteristic – if not the endpoint – of this progression in human-to-machine interaction is when people can communicate with machines the same way they do with a long-time friend, including all the free-form syntactic, semantic, metaphoric, and cultural aspects that are assumed in such an interaction. What must be created for AI systems to partake fully in this natural communication?

Machines have made a categorical leap from prompt engineering to "human speak"; however, other aspects of intelligence are still awaiting discovery.

Past Conversational AI: Transformer Architecture Redefines NLP Performance

Figure 1. Evolution of machine programming and instructing. Image source: With permission from Intel Labs.

In the early days of computing, humans and machines could only communicate through machine code, a low-level computer language of binary digits – strings of 0s and 1s that bear little resemblance to human communication. Since then, there has been a gradual journey to make communication with machines closer to human language. Today, the fact that we can tell machines to generate a picture of a cat playing chess is evidence of great progress. This communication has improved gradually with the evolution of programming languages from low-level to high-level code, from assembler to C to Python, and with the introduction of human speech-like constructs such as the if-then statement. Now the final step is to eliminate prompt engineering and other sensitivities to tweaks in input phrasing so that machines and humans can interact in a natural way. Human-machine dialogue should allow for incremental references that continue the conversation from a past "save point."

NLP deals with the interactions between computers and human languages to process and analyze large amounts of natural language data, while NLU undertakes the difficult task of detecting the user's intention. Virtual assistants such as Alexa and Google Assistant use NLP, NLU, and machine learning (ML) to acquire new knowledge as they operate. By using predictive intelligence and analytics, the AI can personalize conversations and responses based on user preferences. While virtual assistants reside in people's homes like a trusted friend, they are currently limited to basic command-language phrasing. People have adapted to this by speaking "keyword-ese" to get the best results, but their conversational AI still lags in understanding natural language interactions. When there is a communication breakdown with their virtual assistant, people use repair strategies such as simplifying utterances, varying the amount of information given to the system, making semantic and syntactic adjustments to queries, and repeating commands.

Figure 2. Six levels of intent in natural language understanding. Image source: With permission from Intel Labs.

This is where NLU is critical in understanding intent. NLU analyzes text and speech to determine its meaning (see Figure 2). Using a data model of semantic and pragmatic definitions for human speech, NLU focuses on intent and entity recognition. The introduction of the Transformer neural network architecture in 2017 has led to gains in NLP performance in virtual assistants. These types of networks use self-attention mechanisms to process input data, allowing them to effectively capture dependencies in human language. Introduced by researchers at Google AI Language in 2018, BERT addresses 11 of the most common NLP tasks in one model, improving on the traditional method of using separate models for each specific task. BERT pre-trains language representations by training a general-purpose language understanding model on a large text corpus such as Wikipedia, and then applying the model to downstream NLP tasks such as question answering and language inference.
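
To make the pre-train/fine-tune pattern concrete, here is a minimal sketch of applying a BERT-family model to a downstream question-answering task via the Hugging Face transformers pipeline API. The specific checkpoint name is just one illustrative choice of a publicly available model fine-tuned on SQuAD.

```python
# A minimal sketch: a BERT-family model, pre-trained on a large corpus and then
# fine-tuned on SQuAD, applied to downstream question answering.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",  # illustrative checkpoint choice
)

context = (
    "BERT pre-trains language representations on a large text corpus such as "
    "Wikipedia and is then applied to downstream NLP tasks."
)
result = qa(question="What corpus is BERT pre-trained on?", context=context)
print(result["answer"], round(result["score"], 3))
```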

Beyond virtual assistants, the advances with ChatGPT build on the Transformer architecture's gains in NLP performance. GPT-3 was introduced in 2020, but its major breakthrough in popularity and use came with ChatGPT and its innovation of a human-like conversational interface, which was made possible by utilizing reinforcement learning from human feedback (RLHF). ChatGPT enables large language models (LLMs) to process and understand natural language inputs and generate outputs that are as human-like as possible.

Present: Large Language Models Dominate Conversational AI

Since its release by OpenAI in November 2022, ChatGPT has dominated the news with its seemingly well-written language generation for essays and tests, and its successful passing of medical licensing and MBA exams. ChatGPT passed all three exams of the U.S. Medical Licensing Examination (USMLE) without any specialized training or reinforcement. This led researchers to conclude that "large language models may have the potential to assist with medical education, and potentially, clinical decision-making." A Wharton School professor at the University of Pennsylvania tested ChatGPT on an Operations Management MBA final exam, and it received a B to B- grade. ChatGPT performed well on basic operations management and process analysis questions based on case studies, providing correct answers and solid explanations. When ChatGPT failed to match a problem with the right solution method, hints from a human expert helped the model correct its answer. While these results are promising, ChatGPT still has limitations that keep it from reaching human-level conversation (discussed in the next section).

As an autoregressive language model with 175 billion parameters, ChatGPT's large model size helps it perform well at understanding user intent. Based on the levels of intent in Figure 2, ChatGPT can process pragmatic requests from users that contain an open argument set and flexible structure by analyzing what the text prompt is trying to achieve. ChatGPT can write highly detailed responses and articulate answers, demonstrating a breadth and depth of knowledge across different domains, such as medicine, business operations, computer programming, and more. GPT-4 is also showing impressive strengths, such as adding multimodal capabilities and improving scores on advanced human tests. GPT-4 is reported to have scored in the 90th percentile on the Uniform Bar Exam (versus the 10th percentile for ChatGPT) and in the 99th percentile on the USA Biology Olympiad (versus the 31st percentile for ChatGPT).

ChatGPT can also do machine programming, albeit to a degree. It can create programs in multiple languages including Python, JavaScript, C++, Java, and more. It can also analyze code for bugs and performance issues. However, so far it seems to be best utilized as part of a joint programmer/AI combination.

While OpenAI's models are attracting much attention, other models are progressing in a similar direction, such as Google Brain's open-source 1.6 trillion-parameter Switch Transformer, which debuted in 2021, and Google Bard, which uses LaMDA (Language Model for Dialogue Applications) technology to search the web and provide real-time answers. Bard is currently only available to beta testers, so its performance relative to ChatGPT is unknown.

While LLMs have made great strides toward having natural conversations with humans, key growth areas need to be addressed.

To reach the next level of intelligence and human-level communication, key areas need to undergo a leap in capabilities – restructuring of knowledge, integration of multiple skills, and providing contextual adaptation.

Future: What is Missing from the Human-Machine Conversation?

Four key elements are still missing for conversational AI to reach the next level of carrying on natural conversations. To reach this level of intimacy and shared purpose, the machine needs to understand the meaning of the person's symbolic communication and answer with trustworthy, custom responses that are meaningful to the person.

1) Producing trustworthy responses. AI systems should not hallucinate! Epistemological problems affect the way AI builds knowledge and differentiates between known and unknown information. The machine can make mistakes, producing biased results or even hallucinating when providing answers about things it doesn't know. ChatGPT has difficulty with capturing source attribution and information provenance. It can generate plausible-sounding but incorrect or nonsensical answers. In addition, it lacks factual correctness and common sense with physical, spatial, and temporal questions, and struggles with math reasoning. According to OpenAI, it has difficulty with questions such as: "If I put cheese into the fridge, will it melt?" It performs poorly when planning or thinking methodically. When tested on the MBA final exam, it made surprising mistakes in 6th-grade-level math, which could cause massive errors in operations. The research found that "Chat GPT was not capable of handling more advanced process analysis questions, even when they are based on standard templates. This includes process flows with multiple products and problems with stochastic effects such as demand variability."

2) Deep understanding of human symbology and idiosyncrasies. AI needs to work within the full symbolic world of humans, including the ability to do abstraction, customize, and understand partial references. The machine must be able to interpret people’s speech ambiguities and incomplete sentences in order to have meaningful conversations.

Figure 3. An example from the Switchboard collection of two-sided telephone conversations between speakers from all areas of the United States. Image source: With permission from Intel Labs.

As shown in Figure 3, human speech patterns are often unintelligible. Should AI speak exactly like a human? Realistically, it could be the same as a person talking with a friend in loosely structured language, including a reasonable amount of "hmms," "likes," incomplete or ill-formed sentences, ambiguities, semantic abstractions, personal references, and common-sense inferences. But these idiosyncrasies of human speech should not overpower the communication, rendering it unintelligible.

3) Providing custom responses. AI needs the ability to customize and be familiar with the world of the user. ChatGPT often guesses the user's intent instead of asking clarifying questions. In addition, as a fully encapsulated information model, ChatGPT does not have the ability to browse or search the internet to provide custom answers for the user. According to OpenAI, ChatGPT is limited in its custom answers because "it weights every token equally and lacks a notion of what is most important to predict and what is less important. With self-supervised objectives, task specification relies on forcing the desired task into a prediction problem, whereas ultimately, useful language systems (like virtual assistants) might be better thought of as taking goal-directed actions rather than just making predictions." Finally, it would be helpful to have a multi-session context of the human-machine conversations as well as a theory-of-mind model of the user.

4) Becoming purpose-driven. When people work with a companion, the coordination is not just based on a text exchange, but on a shared purpose. AI needs to move beyond contextualized answers to become purpose driven. In the evolving human-machine relationship, both parties need to become a part of a journey to accomplish a goal, avoid or alleviate a problem, or share information. ChatGPT and other LLMs have not reached this level of interaction yet. As I explored in a previous blog, intelligent machines need to go beyond input-to-output replies and conversations as a chatbot.

To reach the next level of intelligence and human-level communication, key areas need to undergo a leap in capabilities – restructuring of knowledge, integration of multiple skills, and providing contextual adaptation.

The Path to the Next Level in Conversational AIs

LLMs like ChatGPT still have gaps in the cognitive skills needed to take conversational AI to the next level. Missing competencies include logical reasoning, temporal reasoning, numeric reasoning, and the overall ability to be goal-driven and define subtasks to achieve a larger task. Knowledge-related limitations in ChatGPT and other LLMs can be addressed by a Thrill-K approach that adds retrieval and continual learning. Knowledge resides in three places for AI (a schematic sketch of this tiered flow follows the list):

1) Instantaneous knowledge. Commonly used knowledge and continuous functions that can be effectively approximated, held in the fastest and most expensive layer – the parametric memory of the neural network or the working memory of other ML processing. ChatGPT currently relies on this end-to-end deep learning system alone, but it needs to expand to include other knowledge sources to be more effective as a human companion.

2) Standby knowledge. Knowledge that is valuable to the AI system but not as commonly used, available in an adjacent structured knowledge base with as-needed extraction. It requires increased representation strength for discrete entities, or needs to be kept generalized and flexible for a variety of novel uses. Actions or outcomes based on standby knowledge require processing and internal resolution, enabling the AI to learn and adapt as a human companion.

3) Retrieved external knowledge. Information from a vast online repository, available outside the AI system for retrieval when needed. This allows the AI to customize answers for the human companion with several modalities of information, provide reasoned analysis, and explain the sources of information and the path to conclusion.
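
Below is a schematic sketch of how such a three-tier lookup might be arranged. The objects (llm, knowledge_base, web_search), their methods, and the confidence threshold are all illustrative assumptions rather than an actual Thrill-K implementation.

```python
# Schematic sketch of a three-tier knowledge flow in the spirit of Thrill-K.
# The llm, knowledge_base, and web_search interfaces are assumed placeholders.

def answer(query, llm, knowledge_base, web_search):
    # Tier 1: instantaneous knowledge held in the model's parametric memory.
    draft = llm.generate(query)
    if llm.confidence(query, draft) > 0.8:        # threshold is an arbitrary assumption
        return draft

    # Tier 2: standby knowledge extracted as needed from an adjacent structured
    # knowledge base of discrete entities and relations.
    facts = knowledge_base.lookup(query)
    if facts:
        return llm.generate(query, context=facts)

    # Tier 3: retrieved external knowledge from a vast online repository, with
    # sources retained so the answer can explain its provenance.
    documents = web_search.retrieve(query, top_k=3)
    return llm.generate(query, context=documents, cite_sources=True)
```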

Summary

The journey from machine language to human speak has evolved from humans inputting simple binary digits into a computer, to bringing virtual assistants into our homes to perform simple tasks, to asking for and receiving articulate answers from LLMs such as ChatGPT. Despite the great progress of recent innovations in LLMs, the path to the next level of conversational AI requires knowledge restructuring, multiple intelligences, and contextual adaptation to build AI systems that are true human companions.

References

  1. Mavrina, L., Szczuka, J. M., Strathmann, C., Bohnenkamp, L., Krämer, N. C., & Kopp, S. (2022). "Alexa, You’re Really Stupid": A Longitudinal Field Study on Communication Breakdowns Between Family Members and a Voice Assistant. Frontiers in Computer Science, 4. https://doi.org/10.3389/fcomp.2022.791704
  2. Devlin, J. (2018, October 11). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv.org. https://arxiv.org/abs/1810.04805
  3. Wikipedia contributors. (2023). Reinforcement learning from human feedback. Wikipedia. https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback
  4. Introducing ChatGPT. (n.d.). https://openai.com/blog/chatgpt
  5. Kung, T. H., Cheatham, M., Medenilla, A., Sillos, C., De Leon, L., Elepaño, C., Madriaga, M., Aggabao, R., Diaz-Candido, G., Maningo, J., & Tseng, V. (2022). Performance of ChatGPT on USMLE: Potential for AI-Assisted Medical Education Using Large Language Models. medRxiv (Cold Spring Harbor Laboratory). https://doi.org/10.1101/2022.12.19.22283643
  6. Needleman, E. (2023). Would Chat GPT Get a Wharton MBA? New White Paper By Christian Terwiesch. Mack Institute for Innovation Management. https://mackinstitute.wharton.upenn.edu/2023/would-chat-gpt3-get-a-wharton-mba-new-white-paper-by-christian-terwiesch/
  7. OpenAI. (2023). GPT-4 Technical Report. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2303.08774
  8. Gewirtz, D. (2023, April 6). How to use ChatGPT to write code. ZDNET. https://www.zdnet.com/article/how-to-use-chatgpt-to-write-code/
  9. How Many Languages Does ChatGPT Support? The Complete ChatGPT Language List. (n.d.). https://seo.ai/blog/how-many-languages-does-chatgpt-support
  10. Tung, L. (2023, February 2). ChatGPT can write code. Now researchers say it’s good at fixing bugs, too. ZDNET. https://www.zdnet.com/article/chatgpt-can-write-code-now-researchers-say-its-good-at-fixing-bugs-too/
  11. Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2101.03961
  12. Pichai, S. (2023, February 6). An important next step on our AI journey. Google. https://blog.google/technology/ai/bard-google-ai-search-updates/
  13. Dickson, B. (2022, July 31). Large language models can’t plan, even if they write fancy essays. TNW | Deep-Tech. https://thenextweb.com/news/large-language-models-cant-plan
  14. Brown, T., Mann, B. F., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J. C., Winter, C., . . . Amodei, D. (2020). Language Models are Few-Shot Learners. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2005.14165
  15. Singer, G. (2022, August 17). Beyond Input-Output Reasoning: Four Key Properties of Cognitive AI. Medium. https://towardsdatascience.com/beyond-input-output-reasoning-four-key-properties-of-cognitive-ai-3f82cde8cf1e
  16. Singer, G. (2022, January 6). Thrill-K: A Blueprint for The Next Generation of Machine Intelligence. Medium. https://towardsdatascience.com/thrill-k-a-blueprint-for-the-next-generation-of-machine-intelligence-7ddacddfa0fe

The post Have Machines Just Made an Evolutionary Leap to Speak in Human Language? appeared first on Towards Data Science.

]]>
Multi-agent Simulation: A Key Function in Inference-time Intelligence https://towardsdatascience.com/multi-agent-simulation-a-key-function-in-inference-time-intelligence-f032dabf9a16/ Tue, 25 Oct 2022 20:13:17 +0000 https://towardsdatascience.com/multi-agent-simulation-a-key-function-in-inference-time-intelligence-f032dabf9a16/ Multi-Agent Simulation: A Key Function in Inference-Time Intelligence Avoiding combinatorial explosions in what-if scenarios involving multiple people or intelligent machines We are about to see a significant change in the role of simulation to evaluate real-time what-if scenarios in materializing machine intelligence. I believe that it can play an even more purposeful role if expanded […]

The post Multi-agent Simulation: A Key Function in Inference-time Intelligence appeared first on Towards Data Science.

]]>
Multi-Agent Simulation: A Key Function in Inference-Time Intelligence

Avoiding combinatorial explosions in what-if scenarios involving multiple people or intelligent machines

We are about to see a significant change in the role that simulation plays in evaluating real-time what-if scenarios on the path to machine intelligence. I believe it can play an even more purposeful role if expanded to include agent-based simulation at inference time. This type of computation seeks to iteratively resolve problems based on inputs from multiple agents (humans or other AIs), which is characteristic of real-world learning. As such, it has the potential to impart multiple "models of mind" during the machine learning process and advance the next generation of AI.

What is simulation, really?

To ground the discussion below, we need to start with a definition of simulation in the context of this discussion.

Here, we define simulation as a method that uses a specialized model to mimic real or proposed system operations to provide evidence for decision-making under various scenarios or process changes.

Simulation uses a specialized model to mimic real or proposed system operations to provide evidence for decision-making under various scenarios or process changes.

To better understand how simulation is relevant to human cognition, consider a situation commonly encountered by humans – a meeting of a medium-sized group of individuals. For example, this could be a meeting of a school sports team and their coach before an important game or match. All the individuals in the meeting will have slightly different contexts and objectives.

The coach will be able to simulate the unfolding of the meeting with a fairly high degree of precision and will actively utilize this simulation capability to plan what to say and how to achieve the best effect. What cognitive functions does this simulation require?

  • The coach must be able to keep track of what information is available to which individuals. Some information is public, like the name of the opposing team and the date of the match, while other information is private, like the health records of the individual players. She knows not to restate publicly known information unnecessarily, and to keep private information concealed.
  • She will need to model the mental and physical state of each player, as well as their objectives. She knows which players have been recently injured and which ones have beaten their personal records. She understands that some are defending an already strong position while others are hoping for an opportunity to shine. She also knows which players respond well to challenges and which ones need extra encouragement.
  • She will continue to build her models of the players throughout the meeting. For example, if one child shows behavior that indicates strong personal growth, the coach will make note of it and adjust her future behavior accordingly.
  • Finally, the coach can model a sequence of potential interactions. For example, she knows that critiquing a player once will have a different effect than critiquing the same player three times in quick succession.

This causal multi-agent simulation capacity is at the very core of human social cognition. Translating and refining the above observations into more technical terms, AI must have the following features to exercise simulation more like humans do (a toy sketch follows the list):

  • Ability to model, instantiate and update individual, distinguishable agents and other complex objects in the environment.
  • Ability to iterate through environment and agent states – i.e., AI would need to be capable of iteratively playing out sequences of relevant behaviors and interactions between the agents themselves and the agents with the environment.
  • Ability to model the behavior of each agent/object as a combination of generic and potentially custom functions (e.g., all children behave like F(x), and Kelly in particular has F(x=a) behavior).
  • Ability to track relevant input sequences and internal state (including state of knowledge) of each agent.
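
As a toy illustration of these capabilities, the sketch below models individually distinguishable agents with a generic behavior function, a custom override for one agent, a tracked input history per agent, and step-by-step iteration of an interaction sequence. All names, behaviors, and numbers are invented for illustration.

```python
# Toy sketch: individually modeled agents with generic and custom behavior,
# per-agent input history, and iteration over a sequence of interactions.
from dataclasses import dataclass, field

def generic_behavior(state, message):
    # Generic F(x): morale rises with encouragement and falls with criticism.
    return {"encourage": +1, "critique": -1}.get(message, 0)

def kelly_behavior(state, message):
    # Custom F(x=a): Kelly reacts more strongly to repeated criticism.
    delta = generic_behavior(state, message)
    if message == "critique" and state["history"].count("critique") >= 2:
        delta -= 2
    return delta

@dataclass
class Agent:
    name: str
    morale: int = 0
    history: list = field(default_factory=list)   # tracked input sequence
    behavior: object = generic_behavior            # generic unless overridden

    def receive(self, message):
        self.history.append(message)
        self.morale += self.behavior(self.__dict__, message)

team = [Agent("Alex"), Agent("Kelly", behavior=kelly_behavior)]
for message in ["critique", "critique", "critique"]:   # iterate the interaction sequence
    for agent in team:
        agent.receive(message)

print({a.name: a.morale for a in team})   # Kelly ends lower than Alex
```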

In the standard context of modern Artificial Intelligence, simulation does not typically include the above capabilities, especially at inference time.

Environmental simulation and its limitations

Most simulation-based AI research today focuses on problems like environmental simulation for the motion training of robots or autonomous vehicles. It is also used to compute an optimal action in reinforcement learning scenarios like video games. This type of simulation is based on a monolithic model – meaning that all inference is based on internally stored data. It is usually characterized by an explicitly defined objective (e.g. win the game). The AI agent’s objective does not account for potential qualitative changes in the environment or the objectives of other agents it must interact with.

Environmental simulation has achieved several impressive milestones. Notable among them is the work of Professor Joshua Tenenbaum and the team within the Department of Brain and Cognitive Sciences at MIT, who study simulation in the context of developmental milestones and physical scene understanding. In a similar vein, researchers at Google Brain have achieved more robust reasoning capabilities in large language models by injecting information from a physics simulation engine. And OpenAI’s Dota bot is the first AI bot to ever beat a world champion e-sports team in Dota 2, an online, multiplayer battle arena game.

Still, standard approaches in machine learning lack several features:

  • The simulations are typically run during training time rather than at inference time.
  • The simulation environment is typically "faceless" in that it doesn’t include complex, continuously evolving agents whose behavior can vary depending on the preceding sequence of interactions.
  • They cannot model agents acting on different objectives, something that humans do with ease. Doing so would require a type of simulation that incorporates a more complex world model and theory of mind – key tenets of advanced intelligence that are seamlessly embedded in the developing brain of a child and manifested in the crayon drawings of a kindergartener.

Open-ended real-world interactions involve agents acting on a variety of objectives, and therefore cannot be easily simulated using the paradigm of the best possible action given the environmental state. Furthermore, reinforcement learning (which is the paradigm traditionally used in this context) is already beset with immense state spaces, even for narrowly defined environments that are currently used today.

Zeroing in on causal agent-based simulation

Most machine learning does not incorporate multi-agent simulation, which is largely computationally prohibitive due to the explosion in the size of the sample space that it causes. This is a barrier that must be crossed to give AI the anticipatory capability it needs to address some of the world’s more overarching problems.

Could there be an approach that overcomes this computational intractability of an open-ended, multi-agent environment and that allows AI agents to become usefully integrated into such environments?

First, let’s more precisely describe where the computational intractability of traditional end-to-end approaches comes from.

Most of the intelligent tasks targeted by AI-based solutions today are non-situational, in the sense that the output is not dependent on the context or the specific situation in which the query is made. They also do not track the recent history of particular individuals or complex objects in their environment. In contrast, humans always apply their intelligence in a very strong contextual/situational setting; they are rarely ‘generic’ in their responses. Next-generation AI must incorporate representational constructs and functional modeling to rectify this gap.

When an AI with situational intelligence is placed in an environment with multiple complex agents, it must be able to perform two key functions:

  • track the input and previous behavior of those agents;
  • simulate what-if scenarios with potential response sequences and determine how those sequences might impact the environment and those agents.

Within current approaches, the system tries to create a comprehensive input-to-output function (e.g., implemented as a massive scale neural network) so that when presented with a situation, it can predict or recommend the next step. To map a multi-agent setting to such a "flat" input-to-output function, it needs to unroll all the potential sequences and multi-agent interactions during training, which can quickly become intractable.

However, if the paradigm is changed to use simulation of "what-if" scenarios during inference, there is no need to unroll a large combinatorial space. One would only simulate the relevant sequences to be evaluated at inference time. This involves a vastly smaller number of sequences, thus avoiding a combinatorial explosion.

In such cases, causal simulation with encapsulated agent models is not only the most efficient way of achieving the desired outcome but the only way. This simulation would allow the agent to interact with partial what-if scenarios without the need to unroll the entire environment at once. Reasoning could then be performed by iteratively going from non-viable to viable scenarios.

To illustrate this process, consider our earlier example of a sports team and coach. Let’s say we have ten players (agents), each of which has 100 possible behaviors. Our AI tries to generate potential what-if scenarios to choose the best course of action. If an AI tries to learn a model of each of the ten agents executing each of the possible behaviors for each possible environmental state, this would result in a massive combinatorial explosion. But in any realistic scenario, only a small fraction of agents’ behaviors and world states would be relevant. If the agent models are individually encapsulated and separated from the world model, the AI could perform a search to first select the relevant behaviors and world states, and then only unroll those simulated scenarios which would be causally likely and relevant.
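
The sketch below makes this pruning concrete with the ten-player example. The relevance filter, scoring function, and numbers are arbitrary illustrative assumptions; the point is only that candidate selection happens before simulation, so only a handful of joint scenarios are ever rolled out.

```python
# Toy sketch: prune to relevant behaviors first, then simulate only those
# joint what-if scenarios at inference time, instead of unrolling 100**10.
import itertools
import math

NUM_BEHAVIORS = 100
players = [f"player_{i}" for i in range(10)]

def relevant_behaviors(player, situation):
    # Stand-in for a relevance search; here each player keeps only 2 candidates.
    focus = situation["focus_behavior"]
    return [focus, (focus + 1) % NUM_BEHAVIORS]

def simulate(joint_behaviors, situation):
    # Stand-in for a causal rollout that scores one what-if scenario.
    return -sum(abs(b - situation["focus_behavior"]) for b in joint_behaviors)

situation = {"focus_behavior": 42}
candidates = [relevant_behaviors(p, situation) for p in players]

full_space = NUM_BEHAVIORS ** len(players)
pruned_space = math.prod(len(c) for c in candidates)
print(f"full space: {full_space:.2e} scenarios, pruned: {pruned_space} scenarios")

best = max(itertools.product(*candidates), key=lambda joint: simulate(joint, situation))
print("best joint scenario:", best)
```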

This would be akin to a monolithic embedding space (learned by an end-to-end network) that is disentangled into discrete units, each holding the representation of the relevant environment or individual agent. These discrete units could then be queried to generate counterfactual scenarios, thereby containing the combinatorial explosion.

Summary

As AI systems move from the lab into businesses and homes, they will require new capabilities to become more adaptive, situational, deeply contextual, and adept in persistent interaction with the people and entities around them. Causal agent-based simulation holds the key to the next generation of AI solutions. It addresses two massive needs: supporting the human labor force with cooperative AI-based agents, and performing tasks that rely on situational awareness but are beyond human capacity. Making these advances tractable and scalable will inevitably require the modularization of AI architectures to enable inference-time simulation capabilities.

References

  1. Wikipedia contributors. (2022, October 10). Simulation. Wikipedia. https://en.wikipedia.org/wiki/Simulation
  2. What is Simulation? What Does it Mean? (Definition and Examples). (n.d.). TWI. Retrieved October 24, 2022, from https://www.twi-global.com/technical-knowledge/faqs/faq-what-is-simulation
  3. Li, Y., Hao, X., She, Y., Li, S., & Yu, M. (2021). Constrained motion planning of free-float dual-arm space manipulator via deep reinforcement learning. Aerospace Science and Technology, 109, 106446.
  4. Pérez-Gil, Ó., Barea, R., López-Guillén, E., Bergasa, L. M., Gómez-Huélamo, C., Gutiérrez, R., & Díaz-Díaz, A. (2022). Deep reinforcement learning based control for autonomous vehicles in carla. Multimedia Tools and Applications, 81(3), 3553–3576.
  5. Joshua Tenenbaum. (2022, October 6). MIT-IBM Watson AI Lab. https://mitibmwatsonailab.mit.edu/people/joshua-tenenbaum/
  6. Battaglia, P. W., Hamrick, J. B., & Tenenbaum, J. B. (2013). Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences, 110(45), 18327–18332.
  7. Liu, R., Wei, J., Gu, S. S., Wu, T. Y., Vosoughi, S., Cui, C., … & Dai, A. M. (2022). Mind’s Eye: Grounded Language Model Reasoning through Simulation. arXiv preprint arXiv:2210.05359.
  8. Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., … & Zhang, S. (2019). Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
  9. Piper, K. (2019, April 14). OpenAI’s Dota AI beats pro team OG as first AI to defeat reigning world champions. Vox. https://www.vox.com/2019/4/13/18309418/open-ai-dota-triumph-og
  10. Singer, G. (2022, August 17). Beyond Input-Output Reasoning: Four Key Properties of Cognitive AI. Medium. https://towardsdatascience.com/beyond-input-output-reasoning-four-key-properties-of-cognitive-ai-3f82cde8cf1e
  11. Singer, G. (2022b, October 7). Advancing Machine Intelligence: Why Context Is Everything. Medium. https://towardsdatascience.com/advancing-machine-intelligence-why-context-is-everything-4bde90fb2d79

The post Multi-agent Simulation: A Key Function in Inference-time Intelligence appeared first on Towards Data Science.

]]>
Beyond Input-Output Reasoning: Four Key Properties of Cognitive AI https://towardsdatascience.com/beyond-input-output-reasoning-four-key-properties-of-cognitive-ai-3f82cde8cf1e/ Tue, 16 Aug 2022 20:17:13 +0000 https://towardsdatascience.com/beyond-input-output-reasoning-four-key-properties-of-cognitive-ai-3f82cde8cf1e/ The necessity of World Model, Theory of Mind, Continual Learning, and Late-Binding context.

The post Beyond Input-Output Reasoning: Four Key Properties of Cognitive AI appeared first on Towards Data Science.

]]>
The necessity of World Model, Theory of Mind, Continual Learning, and Late-Binding Context
_Image credit: James Thew via Adobe Stock._

AI research can be a humbling experience – some even claim that it remains at a relative standstill when it comes to replicating the most fundamental aspects of human intelligence. It can correct spelling, make financial investments, and even compose music; but it cannot alert someone to the fact that they have their t-shirt on inside out without being explicitly "taught" to do so. More to the point, it cannot discern why this might be useful information to have. And yet a five-year-old, just beginning to dress herself, will notice and point out the white tag at the base of her father’s neck.

Most researchers in the AI field are well aware that amid the numerous advances in deep neural networks, the gap between human intelligence and AI has not significantly narrowed, and solutions allowing for computationally efficient, adaptive reasoning and decision-making in complex real-life environments remain elusive. Cognitive AI – that is, intelligence that would allow machines to comprehend, learn, and perform intellectual tasks similarly to humans – remains out of reach. In this blog post, we will explore why this chasm exists and where AI research must go if we are to have any hope of crossing it.

Why ‘Mr. AI’ Can’t Hold a Job

How great would it be to work side-by-side with an AI assistant that could shadow us throughout the day, picking up our slack? Imagine how wonderful it would be if the algorithms could truly unburden us humans from the "drudge" tasks of our workday so we could focus on the more strategic and creative aspects of our jobs. The problem is, with a partial exception for systems like GitHub Copilot, a fictional 'Mr. AI' based on the current state of the art would likely receive its pink slip before the workday's end.

For starters, Mr. AI is painfully forgetful, particularly when it comes to contextual memory. It also suffers from a crippling lack of attention. To some, this may seem surprising given the extraordinary large language models (LLMs) of today, including LaMDA and GPT-3, which in some situations appear as though they could well be conscious. However, even with the most advanced state-of-the-art deep learning models, Mr. AI's work performance will invariably fall short of expectations. It doesn't adapt well to changing environments and demands. It cannot independently ascertain that the advice it provides is epistemologically and factually sound. It cannot even come up with a simple plan! And no matter how carefully engineered its social skills are, it's bound to stumble in a highly dynamic world with complex social and ethical norms. It simply doesn't have what it takes to thrive in a human world.

But what is that?

Four Key Properties of Advanced Intelligence

To impart more human-like intelligence to machines, one must first explore what makes human intelligence distinct from many current (circa 2022) neural networks typically used in AI applications. One way of drawing such a distinction is through the following four properties:

1. World Model

Humans naturally develop a "world model" that allows them to envision an endless number of short- and long-term "what if" scenarios that inform their decision-making and actions. AI models could become vastly more efficient with a similar capability, one that would allow them to simulate potential scenarios from beginning to end in a resource-efficient manner. An intelligent mechanism needs to model a complex environment with multiple interacting individual agents. An input-to-output mapping function (such as can be achieved with a complex feed-forward network) needs to "unroll" all the potential paths and interactions. Such an unrolled input-to-output model of a real-life environment would quickly explode in complexity, especially when it must track an arbitrary duration of relevant history for each agent. In contrast, an intelligent mechanism that can model each of the factors and agents independently in a simulation environment can evaluate numerous what-if future scenarios and grow the model by replicating copies of the actors, each with their knowable relevant history and behavior.

Key to acquiring a world model with such capacity for simulation is the decoupling between the construction of the building blocks of the world model (an epistemic reasoning process) and their subsequent use in simulating possible outcomes. The resulting "what-if" scenarios could then be compared in a way that remains consistent even if the simulation methodology changes over time. A special-case example of such an approach can be found in Digital Twins, where the machine is equipped (through self-learning or explicit design) with a model of its environment and can simulate the potential futures of various interactions.
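
The snippet below is a minimal sketch of that decoupling: a toy transition function stands in for the world model, and candidate action sequences are then compared by rolling each one forward through it. The dynamics, plans, and scoring are invented purely for illustration.

```python
# Minimal sketch of decoupling: the world model is defined once, and what-if
# rollouts reuse it to compare candidate plans. All dynamics are toy assumptions.

def world_model(state, action):
    # Toy transition function: position moves by the chosen step, energy is spent.
    position, energy = state
    step = {"left": -1, "stay": 0, "right": +1}[action]
    return (position + step, energy - abs(step))

def rollout(state, plan):
    # What-if simulation: apply the same world model to a candidate action sequence.
    for action in plan:
        state = world_model(state, action)
    return state

def score(state, goal=3):
    position, energy = state
    return -abs(position - goal) + 0.1 * energy

plans = [("right",) * 3, ("right", "right", "stay"), ("stay",) * 3]
start = (0, 10)
best = max(plans, key=lambda plan: score(rollout(start, plan)))
print("best plan:", best, "->", rollout(start, best))
```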

Figure 1. Digital Twin technology creates multifaceted models of complex actors in an interactive setting. Image credit: chesky via Adobe Stock

Intelligent beings and machines use models of the world (‘World Views’) to make sense of observations and assess potential futures to select the best course of action. In transitioning from a ‘generic’ large-scale setting (like replying to web queries) to direct interaction in a particular setting that includes multiple actors, the world model must be effectively scaled and customizable. A decoupled, modular, customizable approach is a logically different and far less complex architecture than one that attempts to simulate and reason all in a single "input-output function" step.

2. Theory of Mind

"Theory of mind" refers to a complex mental skill that has been defined by cognitive science as a capacity to understand and predict the actions and beliefs of another person by tracking that person’s attention and ascribing a mental state to them.

In the simplest of terms, it is what we do when we attempt to read another person’s mind. We develop and use this skill throughout our lives to help us navigate our social interactions. It’s why we don’t remind our dieting co-worker of the giant plate of fresh chocolate chip cookies in the breakroom.

We see traces of the theory of mind being used in AI applications such as chatbots that are uniquely attuned to the emotions of the customer they are serving, based on the reason they opened the chat, the language they are using, etc. However, performance metrics used to train such social chatbots – typically defined as conversation-turns per session, or CPS – merely train the model to maximize the attention from the human, and do not force the system to develop an explicit model of the human’s mind as measured by reasoning and planning tasks.

Figure 2. AI system with Theory of Mind capability can modify its output based on end user needs and preferences. Image credit: Intel Labs.

In a system that needs to interact with a particular set of individuals, theory of mind would require a more structured representation that is amenable to logical reasoning operations such as those employed in deductive and inductive reasoning, planning, inference of intent, and so on. Moreover, such a model would have to keep track of varied behavioral profiles, be predictably updateable with the influx of new information, and avoid relapsing into a previous model state.
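
As a rough illustration of what a structured, updateable representation might look like (in contrast to an implicit embedding), the sketch below keeps a per-person record of beliefs, goals, and preferences and supports a simple reasoning query over it. The fields and rules are invented assumptions, not a proposed design.

```python
# Illustrative sketch of a structured, updateable theory-of-mind record.
from dataclasses import dataclass, field

@dataclass
class MentalModel:
    name: str
    beliefs: dict = field(default_factory=dict)     # what we think this person knows
    goals: set = field(default_factory=set)
    preferences: dict = field(default_factory=dict)

    def observe(self, fact_id, value):
        # Predictable update with new information; earlier state is not silently lost.
        self.beliefs[fact_id] = value

    def should_mention(self, fact_id):
        # Simple inference: don't restate what the person already knows, and
        # don't surface topics they prefer to avoid.
        already_known = fact_id in self.beliefs
        avoided = self.preferences.get(fact_id) == "avoid"
        return not already_known and not avoided

coworker = MentalModel("Dana", preferences={"cookies_in_breakroom": "avoid"})
coworker.observe("meeting_time", "3pm")
print(coworker.should_mention("meeting_time"))          # False: already known
print(coworker.should_mention("cookies_in_breakroom"))  # False: the dieting co-worker
print(coworker.should_mention("deadline_moved"))        # True: new and not avoided
```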

3. Continual learning

With some exceptions, the standard machine learning paradigm of today is batched, offline learning, potentially followed by fine-tuning to the specific task. The resulting models are therefore unable to extract useful long-term updates from the information they are being exposed to while deployed. Humans have no such limitation. They learn continuously and use this information to build cognitive models like world views and theories of mind. In fact, continual learning is what enables humans to maintain and update their mental models.

The problem of continual learning (also dubbed lifelong or continuous learning) is now garnering much stronger interest in the AI research community, in part due to the practical demands brought by the advent of technologies like federated learning and workflows like medical data processing. For an AI system that employs a world model of the environment it operates in, with theories of mind for the various agents in that environment, a continual learning capability would be critical for maintaining and updating historical and current state descriptors for each object and agent.

While the industry’s need is very clear, much remains to be done. Specifically, solutions that enable continual learning of information to then be used for reasoning or planning are still in their nascency – and such solutions would be required to enable the model-building capabilities above.
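One simple, concrete instance of learning from a stream of data after deployment is shown below, using scikit-learn's partial_fit to update a linear classifier batch by batch. This is only incremental learning on a toy, synthetic stream; it does not by itself address catastrophic forgetting or the model-building capabilities discussed above.

```python
# Minimal sketch of incremental updates from a data stream using partial_fit.
# The synthetic, drifting data is invented purely for illustration.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])                 # must be declared on the first call
rng = np.random.default_rng(0)

for day in range(5):                       # simulated stream of daily batches
    X_batch = rng.normal(size=(32, 4))
    y_batch = (X_batch[:, 0] + 0.1 * day > 0).astype(int)   # slowly drifting concept
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.predict(rng.normal(size=(3, 4))))
```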

4. Late binding context

Late binding context refers to responses that are contextually specific (rather than generic) and utilize the latest relevant information available at the time of query or decision. Contextual awareness embodies all the subtle nuances of human learning – it is the 'who', 'why', 'when', and 'what' that inform human decisions and behavior. It keeps humans from resorting to reasoning shortcuts and jumping to imprecise, generalized conclusions. Rather, contextual awareness allows us to build an adaptive set of behaviors tailored to the specific state of the environment that needs to be addressed. Without this capability, our decision-making would be greatly impaired. Late binding context is also closely intertwined with continual learning. For more information on late binding context, please see the previous blog, Advancing Machine Intelligence: Why Context Is Everything.

Human Cognition as a Roadmap to the Future of AI

Without the key cognitive capabilities listed above, many of the critical needs of human industry and society will not be met. There is therefore a fairly urgent need to direct more research toward translating human cognitive capabilities into AI functionality – in particular, those properties that make human intelligence distinct from current AI models. The four properties listed above are a starting point. They lie at the center of a complex web of human cognitive functions and provide a path toward computationally efficient, adaptive models that can be deployed in real-life, multi-actor environments. As AI proliferates from centralized, homogenized large models to a multitude of uses integrated within socially complex settings, this next set of properties will need to emerge.

References

  1. Mitchell, M. (2021). Why AI is harder than we think. arXiv preprint arXiv:2104.12871. https://arxiv.org/abs/2104.12871
  2. Marcus, G. (2022, July 19). Deep Learning Is Hitting a Wall. Nautilus | Science Connected. https://nautil.us/deep-learning-is-hitting-a-wall-14467/
  3. Singer, G. (2022, January 7). The Rise of Cognitive AI – Towards Data Science. Medium. https://towardsdatascience.com/the-rise-of-cognitive-ai-a29d2b724ccc
  4. Ziegler, A., Kalliamvakou, E., Li, X. A., Rice, A., Rifkin, D., Simister, S., … & Aftandilian, E. (2022, June). Productivity assessment of neural code completion. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (pp. 21–29). https://dl.acm.org/doi/pdf/10.1145/3520312.3534864
  5. Malone, T. W., Rus, D., & Laubacher, R. (2020). Artificial Intelligence and the future of work. A report prepared by MIT Task Force on the work of the future, Research Brief, 17, 1–39. https://workofthefuture.mit.edu/research-post/artificial-intelligence-and-the-future-of-work/
  6. Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H. T., … & Le, Q. (2022). Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239. https://arxiv.org/pdf/2201.08239.pdf
  7. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877–1901. https://arxiv.org/abs/2005.14165
  8. Curtis, B., & Savulescu, J. (2022, June 15). Is Google’s LaMDA conscious? A philosopher’s view. The Conversation. https://theconversation.com/is-googles-lamda-conscious-a-philosophers-view-184987
  9. Dickson, B. (2022, July 24). Large language models can’t plan, even if they write fancy essays. TechTalks. https://bdtechtalks.com/2022/07/25/large-language-models-cant-plan/
  10. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6), 1–35. https://dl.acm.org/doi/abs/10.1145/3457607
  11. LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence Version 0.9. 2, 2022–06–27. https://openreview.net/pdf?id=BZ5a1r-kVsf
  12. El Saddik, A. (2018). Digital twins: The convergence of multimedia technologies. IEEE multimedia, 25(2), 87–92. https://ieeexplore.ieee.org/abstract/document/8424832
  13. Frith, C., & Frith, U. (2005). Theory of mind. Current biology, 15(17), R644-R645. https://www.cell.com/current-biology/pdf/S0960-9822(05)00960-7.pdf
  14. Apperly, I. A., & Butterfill, S. A. (2009). Do humans have two systems to track beliefs and belief-like states?. Psychological review, 116(4), 953. https://psycnet.apa.org/doiLanding?doi=10.1037%2Fa0016923
  15. Baron-Cohen, S. (1991). Precursors to a theory of mind: Understanding attention in others. Natural theories of mind: Evolution, development and simulation of everyday mindreading, 1, 233–251.
  16. Wikipedia contributors. (2022, August 14). Theory of mind. Wikipedia. https://en.wikipedia.org/wiki/Theory_of_mind
  17. Shum, H. Y., He, X. D., & Li, D. (2018). From Eliza to XiaoIce: challenges and opportunities with social chatbots. Frontiers of Information Technology & Electronic Engineering, 19(1), 10–26. https://link.springer.com/article/10.1631/FITEE.1700826
  18. Zhou, L., Gao, J., Li, D., & Shum, H. Y. (2020). The design and implementation of xiaoice, an empathetic social chatbot. Computational Linguistics, 46(1), 53–93. https://direct.mit.edu/coli/article/46/1/53/93380/The-Design-and-Implementation-of-XiaoIce-an
  19. Reina, G. A., Gruzdev, A., Foley, P., Perepelkina, O., Sharma, M., Davidyuk, I., … & Bakas, S. (2021). OpenFL: An open-source framework for Federated Learning. arXiv preprint arXiv:2105.06413. https://arxiv.org/abs/2105.06413
  20. Vokinger, K. N., Feuerriegel, S., & Kesselheim, A. S. (2021). Continual Learning in medical devices: FDA’s action plan and beyond. The Lancet Digital Health, 3(6), e337-e338. https://www.thelancet.com/journals/landig/article/PIIS2589-7500(21)00076-5/fulltext
  21. Singer, G. (2022b, May 14). Advancing Machine Intelligence: Why Context Is Everything. Medium. https://towardsdatascience.com/advancing-machine-intelligence-why-context-is-everything-4bde90fb2d79

The post Beyond Input-Output Reasoning: Four Key Properties of Cognitive AI appeared first on Towards Data Science.

]]>
Advancing Machine Intelligence: Why Context Is Everything https://towardsdatascience.com/advancing-machine-intelligence-why-context-is-everything-4bde90fb2d79/ Tue, 10 May 2022 16:05:08 +0000 https://towardsdatascience.com/advancing-machine-intelligence-why-context-is-everything-4bde90fb2d79/ Most of us have heard the phrase, "Image is everything." But when it comes to taking AI to the next level, it's context that is everything.

The post Advancing Machine Intelligence: Why Context Is Everything appeared first on Towards Data Science.

]]>
Most of us have heard the phrase, "Image is everything." But when it comes to taking AI to the next level, it’s context that is everything.

Contextual awareness embodies all the subtle nuances of human learning. It is the 'who', 'where', 'when', and 'why' that inform human decisions and behavior. Without context, the current foundation models are destined to spin their wheels and ultimately fall short of the expectation that AI will improve our lives.

This blog will discuss the significance of context in ML, and how late binding context could raise the bar on machine enlightenment.

Why Context Matters

Context is so deeply embedded in human learning that it is easy to overlook the critical role it plays in how we respond to a given situation. To illustrate this point, consider a conversation between two people that begins with a simple question: How is Grandma?

In a real-world conversation, this simple query could elicit any number of potential responses depending on contextual factors, including time, circumstance, relationship, etc.:

Fig 1. A proper answer to "How's Grandma?" is highly context-dependent. Image credit: Intel Labs.

The question illustrates how the human mind can track and take into account a vast amount of contextual information, even subtle humor, to return a relevant response. This ability to fluidly adapt to a variety of often subtle contexts is well beyond the reach of modern AI systems.

To grasp the significance of this deficit in machine learning, consider the development of reinforcement learning (RL)-based autonomous agents and robots. Despite the hype and success that RL-based architectures have had in simulated game environments like Dota 2 and StarCraft II, even purely game-based environments like NetHack pose a formidable obstacle to current RL systems due to the highly conditional nature and complexity of the policies required to win the game. Similarly, as noted in many recent works, autonomous robots have miles to go before they can interact with previously unseen physical environments without the need for a serious engineering effort to either simulate the correct type of environment prior to deployment or to harden the learned policy.

Current ML and Handling of Contextual Queries

With some notable exceptions, most ML models incorporate very limited context for a specific query, relying primarily on the generic context provided by the dataset that the model is trained or fine-tuned on. Such models also raise significant concerns about bias, which makes them less suited for use in many business, healthcare, and other critical applications. Even state-of-the-art models like D3ST, used in voice assistant AI applications, require manually created descriptions of schemata or ontologies with the possible intents and actions that the model needs in order to identify context. While this involves a relatively minimal level of handcrafting, it nonetheless means that explicit human input is required every time the context of the task is to be updated.
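
For a sense of what such hand-written schema descriptions look like, here is a purely illustrative example (not the actual D3ST format): intents and slots are described in natural language, serialized, and prepended to the user utterance, so updating the task context means editing this description by hand.

```python
# Illustrative only (not the actual D3ST format): a hand-written schema of
# intents and slots that a description-driven dialogue model is conditioned on.
restaurant_schema = {
    "service": "restaurant_booking",
    "intents": {
        "book_table": "reserve a table at a restaurant",
        "cancel_booking": "cancel an existing reservation",
    },
    "slots": {
        "restaurant_name": "name of the restaurant",
        "party_size": "number of people in the party",
        "time": "date and time of the reservation",
    },
}

def build_model_input(schema, utterance):
    # Serialize the schema descriptions and prepend them to the user utterance.
    described = "; ".join(
        f"{name}: {description}"
        for name, description in {**schema["intents"], **schema["slots"]}.items()
    )
    return f"{described} [user] {utterance}"

print(build_model_input(restaurant_schema, "Can you get us a table for four at 7pm?"))
```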

That’s not to say there haven’t been significant developments in context awareness for machine learning models. GPT-3, a famous large language model from the OpenAI team, has been used to generate full articles that rival human composition – a task that requires keeping track of at least local context. The Pathways Language Model (PaLM), introduced by Google in April 2022, demonstrates even greater capability, including the ability to understand conceptual combinations in appropriate contexts to answer complex queries.

Fig 2. PaLM is able to successfully handle queries that require jumping between different contexts for the same concept. Image credit: Google Research [13] under CC BY 4.0 license.

Many of the recent advancements have focused on retrieval-based query augmentation, in which the input into the model (query) is supplemented by automatic retrieval of relevant data from an auxiliary database. This has enabled significant advances in applications like question answering and reasoning over knowledge graphs.
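
A minimal sketch of this retrieval-based augmentation pattern is shown below. The bag-of-words "embedding" is a crude stand-in for a real sentence encoder and the documents are invented; only the overall flow (embed, retrieve the most similar document, prepend it to the query) is the point.

```python
# Minimal sketch of retrieval-based query augmentation. The bag-of-words
# "embedding" and the documents are illustrative stand-ins only.
import numpy as np

documents = [
    "The patient is allergic to penicillin.",
    "Grandma's birthday party is on Saturday.",
    "Amoxicillin is a penicillin-class antibiotic.",
]
query = "Which antibiotic should be avoided for a penicillin allergy?"

def tokenize(text):
    return text.lower().replace(".", "").replace("?", "").split()

vocab = {w: i for i, w in enumerate(sorted({w for t in documents + [query] for w in tokenize(t)}))}

def embed(texts):
    # Crude bag-of-words vectors; a real system would use a learned encoder.
    vectors = np.zeros((len(texts), len(vocab)))
    for row, text in enumerate(texts):
        for word in tokenize(text):
            vectors[row, vocab[word]] += 1.0
    return vectors

doc_vecs, query_vec = embed(documents), embed([query])[0]
scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
)
top_doc = documents[int(np.argmax(scores))]

# The retrieved document is prepended to the query before it reaches the model.
augmented_prompt = f"Context: {top_doc}\nQuestion: {query}"
print(augmented_prompt)
```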

With such a flurry of significant improvements in the quality of achievable outputs even within some contextual constraints, it may be tempting to assume that this demonstrates a more general context awareness in modern AI systems. However, these models still do not lend the kind of context that is needed for more complex applications, as might be used in manufacturing, medical practice, etc. Such applications often require a certain fluidity with regard to context – as discussed in the adaptability section in one of the previous blogs. For example, the relevant context might have to be conditioned on temporal information like the level of urgency of the user’s request, or objectives and sensitivities of an interaction. This adaptability would allow the appropriate context for a given query to evolve based on the progression of communication with the human. In simple human terms, the model would have to avoid jumping to any conclusions until it has all the relevant contextual information. This carefully-timed suspension of the final response to the original query is what can be called late binding context.

It bears mentioning that recent neural network models do have the capability to achieve some late binding context. For example, if the model is paired with an auxiliary knowledge source like Wikipedia or Wikidata, it can condition its response on the latest version of that resource, thus considering some degree of time-relevant context before offering a response to a particular query.

One of the domains that puts a high premium on context is conversational AI, and in particular multi-turn dialogue modeling. However, it is acknowledged that there are key challenges in providing topic awareness and in considering implied time, prior knowledge, and intentionality.

The issue with most currently deployed AI technologies is that even when they can perform conditioning in a particular case, conditioning over time remains a challenge for many applications: it requires both an understanding of the task at hand and a memory of the sequence of preceding events, which acts as a conditioning prior. To consider a more light-hearted, metaphorical example, recall the Canadian detective show "Murdoch Mysteries", famous for its refrain "What have you, George?". Detective Murdoch continuously uses this phrase to query Constable Crabtree on the latest developments, and the answer is always different and highly dependent on the events that have previously transpired in the story.

Building Context Into Machine Learning

So how could it be possible to incorporate and leverage late binding context in machine learning, at scale?

One way would be to create "selection masks" or "overlays" of meta-knowledge that provide overlapping layers of relevant contextual information, effectively narrowing down the search parameters for a particular use case. In the case of a medical search for the correct prescription, for example, a doctor would consider the patient's diagnosed condition, other comorbidities, age, history of previous adverse reactions and allergies, etc., to constrain the search space to a specific drug, as sketched in the example below. To address the late-binding aspects of context, such overlays must be dynamic so they can capture recent information, refinement of scope based on case-specific knowledge, comprehension of the objectives of the interaction in flight, and more.

Fig 3: Correct medical treatment decisions require a lot of patient-specific timely contextual considerations. Image credit: Intel Labs.
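As a rough illustration of how such overlays might work (not a clinical tool), the following sketch applies two contextual masks, age appropriateness and contraindications, to narrow a hypothetical set of candidate drugs. All records and rules are made up for the example.

```python
# A hedged sketch of "selection masks": each layer of contextual meta-knowledge
# removes candidates that are inconsistent with the patient's situation.
# The drug records and rules below are illustrative, not medical guidance.

candidate_drugs = [
    {"name": "drug_a", "min_age": 18, "contraindications": {"liver_disease"}},
    {"name": "drug_b", "min_age": 12, "contraindications": {"penicillin_allergy"}},
    {"name": "drug_c", "min_age": 0,  "contraindications": set()},
]

patient_context = {
    "age": 15,
    "conditions": {"penicillin_allergy"},
}

def apply_overlays(candidates, context):
    """Apply contextual overlays one at a time, narrowing the search space."""
    # Overlay 1: age appropriateness
    masked = [d for d in candidates if context["age"] >= d["min_age"]]
    # Overlay 2: contraindications against the patient's known conditions
    masked = [d for d in masked if not (d["contraindications"] & context["conditions"])]
    return masked

print([d["name"] for d in apply_overlays(candidate_drugs, patient_context)])
# -> ['drug_c']
```

A late-binding version of this would rebuild or refine the overlays as new case-specific information arrives during the interaction, rather than fixing them up front.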

Source attribution is another key meta-knowledge dimension that can serve as a selection mask to enable late binding context. It is how a model would give more credence to one source over another, say, the New England Journal of Medicine versus an anonymous Reddit post. Another application of source attribution is selecting the correct set of decision rules and constraints to apply in a given situation – for example, the laws of the local jurisdiction, or the traffic rules of a specific state. Source attribution is also key to reducing bias, because it considers information within the context of the source that created it rather than assuming correctness based on the statistics of occurrences.
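A minimal sketch of source attribution as a selection mask might look like the following, where retrieved passages are re-ranked by an assumed per-source credence score before being used as context; the credence table and passages are illustrative only.

```python
# A minimal sketch of source attribution as a selection mask: retrieved passages
# are re-weighted by the credence assigned to their source before being used
# as context. The credence values here are illustrative assumptions.

SOURCE_CREDENCE = {
    "peer_reviewed_journal": 1.0,
    "government_guideline": 0.9,
    "anonymous_forum_post": 0.2,
}

retrieved = [
    {"text": "Dosage X is safe for adults.", "source": "peer_reviewed_journal", "relevance": 0.71},
    {"text": "My cousin took double the dose and was fine.", "source": "anonymous_forum_post", "relevance": 0.84},
]

def rank_by_credence(passages):
    """Combine retrieval relevance with source credence to order the context."""
    return sorted(
        passages,
        key=lambda p: p["relevance"] * SOURCE_CREDENCE.get(p["source"], 0.5),
        reverse=True,
    )

for p in rank_by_credence(retrieved):
    print(round(p["relevance"] * SOURCE_CREDENCE[p["source"]], 2), p["source"])
# The journal passage (score 0.71) now outranks the nominally more "relevant" forum post (0.17).
```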

This blog has not touched on a very important aspect: how can a human or a future AI system select the relevant pieces of information to consider as the context of a particular query? What is the data structure that one must search over in order to find the contextually relevant pieces of data, and how is this structure learned? More on these questions in future posts.

Avoid taking intelligence out of context

The field of AI is making strides in incorporating conditioning, compositionality, and context. However, the next level of machine intelligence will require significant advances in the ability to dynamically comprehend and apply the multiple facets of late-binding context. When considered within the scope of highly aware, in-the-moment interactive AI, context is everything.

References

  1. Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., … & Liang, P. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
  2. Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., … & Zhang, S. (2019). Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
  3. Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., … & Silver, D. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), 350–354.
  4. Küttler, H., Nardelli, N., Miller, A., Raileanu, R., Selvatici, M., Grefenstette, E., & Rocktäschel, T. (2020). The NetHack Learning Environment. Advances in Neural Information Processing Systems, 33, 7671–7684.
  5. Kostrikov, I., Nair, A., & Levine, S. (2021). Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169.
  6. Ibarz, J., Tan, J., Finn, C., Kalakrishnan, M., Pastor, P., & Levine, S. (2021). How to train your robot with deep reinforcement learning: lessons we have learned. The International Journal of Robotics Research, 40(4–5), 698–721.
  7. Loquercio, A., Kaufmann, E., Ranftl, R., Müller, M., Koltun, V., & Scaramuzza, D. (2021). Learning high-speed flight in the wild. Science Robotics, 6(59), eabg5810.
  8. Yasunaga, M., Ren, H., Bosselut, A., Liang, P., & Leskovec, J. (2021). QA-GNN: Reasoning with language models and knowledge graphs for question answering. arXiv preprint arXiv:2104.06378.
  9. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6), 1–35.
  10. Zhao, J., Gupta, R., Cao, Y., Yu, D., Wang, M., Lee, H., … & Wu, Y. (2022). Description-Driven Task-Oriented Dialog Modeling. arXiv preprint arXiv:2201.08904.
  11. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877–1901.
  12. Reporter, G. S. (2020, September 11). A robot wrote this entire article. Are you scared yet, human? The Guardian.
  13. Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., … & Fiedel, N. (2022). PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
  14. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., … & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33, 9459–9474.
  15. Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., … & Sifre, L. (2021). Improving language models by retrieving from trillions of tokens. arXiv preprint arXiv:2112.04426.
  16. Singer, G. (2022, January 15). LM!=KM: The Five Reasons Why Language Models Fall Short of Supporting Knowledge Model Requirements of Next-Gen AI. Medium.
  17. Moore, A. W. (2022, April 14). Conversational AI’s Moment Is Now. Forbes.
  18. Xu, Y., Zhao, H., & Zhang, Z. (2021, May). Topic-aware multi-turn dialogue modeling. In The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21).
  19. Li, Y., Li, W., & Nie, L. (2022). Dynamic Graph Reasoning for Conversational Open-Domain Question Answering. ACM Transactions on Information Systems (TOIS), 40(4), 1–24.
  20. Gao, C., Lei, W., He, X., de Rijke, M., & Chua, T. S. (2021). Advances and challenges in conversational recommender systems: A survey. AI Open, 2, 100–126.
  21. Singer, G. (2022b, May 9). Understanding of and by Deep Knowledge – Towards Data Science. Medium.

The post Advancing Machine Intelligence: Why Context Is Everything appeared first on Towards Data Science.

Multimodality: A New Frontier in Cognitive AI https://towardsdatascience.com/multimodality-a-new-frontier-in-cognitive-ai-8279d00e3baf/ Wed, 02 Feb 2022 06:39:43 +0000 https://towardsdatascience.com/multimodality-a-new-frontier-in-cognitive-ai-8279d00e3baf/ On modern Multimodal ML architectures and their applications

The post Multimodality: A New Frontier in Cognitive AI appeared first on Towards Data Science.

Enabling smarter, adaptive AI with innovative multimodal systems

Written in collaboration with Vasudev Lal and the Cognitive AI team at Intel Labs.

An exciting frontier in Cognitive AI involves building systems that can integrate multiple modalities and synthesize the meaning of language, images, video, audio, and structured knowledge sources such as relation graphs. Adaptive applications like conversational AI; video and image search using language; autonomous robots and drones; and AI multimodal assistants will require systems that can interact with the world using all available modalities and respond appropriately within specific contexts. In this blog, we will introduce the concept of multimodal learning along with some of its main use cases, and discuss the progress made at Intel Labs towards creating robust multimodal reasoning systems.

In the past few years, deep learning (DL) solutions have performed better than the human baseline in many natural language processing (NLP) benchmarks (e.g., SuperGLUE, GLUE, SQuAD) and computer vision benchmarks (e.g., ImageNet). The progress on individual modalities is a testament to the perception or recognition-like capabilities achieved by the highly effective statistical mappings learned by neural networks.

These single-modality tasks were considered extremely difficult to tackle just a decade ago but are currently major AI workloads in datacenter, client, and edge products. However, in multimodal settings, many of the insights that could be gleaned using automated methods still go unexploited.

Multimodality for Human-Centric Cognitive AI

Human cognitive abilities are often associated with successful learning from multiple modalities. For example, the concept of an apple should include information obtained from vision: what it usually looks like in terms of color, shape, texture, etc. But the concept of an apple formed by humans and advanced AI systems should also be informed by what sound the apple makes when it is bitten into, what people mean when they talk about apple pie, and the comprehensive knowledge available about apples in text corpora like Wikipedia, or structured knowledge bases like Wikidata.

Figure 1. The various modalities associated with the concept of "apple". Image credit: Intel Labs © 2022 Intel Corporation.

A multimodal AI system can ingest knowledge from multiple sources and modalities and utilize it to solve tasks involving any modality. Information learned through images and the knowledge base should be usable in answering a natural language question; similarly, information learned from text should be used when needed on visual tasks. It all connects through concepts that intersect all modalities or, as it is said: a dog is a dog is a dog.

Figure 2. A dog is a dog is a dog. Image credit: Intel Labs © 2022 Intel Corporation.

Commonsense knowledge is inherently multimodal

Humans possess a lot of commonsense knowledge about the world, like awareness that birds fly in the sky and cars drive on the road. Such commonsense knowledge is typically acquired through a combination of visual, linguistic, and sensory cues rather than language alone. Common sense was called ‘the dark matter of AI’ by Oren Etzioni, CEO of the Allen Institute for Artificial Intelligence. That’s because common sense consists of implicit information – the broad (and broadly shared) set of unwritten assumptions and rules of thumb that humans automatically use to make sense of the world.

Interestingly, multimodal systems can provide an avenue to address the lack of commonsense knowledge in AI systems. One way to improve the commonsense knowledge of transformer-based language models like BERT or GPT-3 would be to incorporate training signals spanning other modalities into the model architecture. The first step in achieving this capability is to align the internal representation across the different modalities.

When the AI receives an image and related text and processes both, it needs to associate the same object or concept between the modalities. For example, consider a scenario where AI sees a picture of a car with text mentioning the wheels on the car. The AI needs to attend to the part of the image with the car wheels when it attends to the part of the text that refers to them. The AI needs to "know" that the image of the car wheels and the text mentioning the wheels refer to the same object across different modalities.
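A simplified sketch of this kind of cross-modal alignment, assuming a PyTorch environment and random placeholder embeddings, is shown below: text tokens act as queries that attend over image-patch embeddings, so each word can concentrate its attention on the relevant image regions.

```python
# A simplified sketch (using PyTorch) of how a multimodal transformer can align
# text tokens with image regions: text embeddings attend over image-patch
# embeddings, so the token "wheels" can place most of its attention mass on the
# patches that contain the wheels. Dimensions and data here are random stand-ins.
import torch
import torch.nn as nn

embed_dim, num_patches, num_tokens = 64, 49, 8   # e.g. a 7x7 patch grid and an 8-token caption

text_tokens   = torch.randn(1, num_tokens, embed_dim)   # embeddings for "the wheels on the car ..."
image_patches = torch.randn(1, num_patches, embed_dim)  # embeddings for image regions

cross_attention = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)

# Query = text, Key/Value = image: each word gathers information from image regions.
fused_text, attn_weights = cross_attention(text_tokens, image_patches, image_patches)

print(fused_text.shape)    # torch.Size([1, 8, 64])  -> image-informed token representations
print(attn_weights.shape)  # torch.Size([1, 8, 49])  -> which patches each token attends to
```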

Current Multimodal AI tasks and architectures

As of early 2022, work on multimodal AI systems is focused on driving text/NLP and vision into an aligned embedding space to facilitate multimodal decision-making. A number of tasks require the model to have at least some amount of multimodal capacity. The following is a brief overview of four prevalent workloads and the corresponding SotA models:

  • Image description generation, text-to-image generation

Perhaps the most well-known models that deal with image description and text-to-image generation are OpenAI's CLIP and DALL-E, and their successor GLIDE.

CLIP pre-trains separate image and text encoders and learns to predict which images in a dataset are paired with which descriptions. Interestingly, just as with the "Halle Berry" neuron in humans, CLIP has been shown to have multimodal neurons that activate when exposed both to the classifier label text and to the corresponding image, indicating a fused multimodal representation. DALL-E is a 12-billion-parameter variant of GPT-3 that takes text as input and generates a series of output images to match the text; the generated images are then ranked using CLIP. GLIDE is an evolution of DALL-E that still uses CLIP to rank generated images; however, the image generation is done using a diffusion model.
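The contrastive objective behind CLIP can be sketched in a few lines, assuming placeholder encoders and random features; real CLIP trains a vision transformer or ResNet alongside a transformer text encoder on hundreds of millions of image-text pairs.

```python
# A hedged sketch of CLIP-style contrastive training: separate image and text
# encoders produce embeddings, and a symmetric cross-entropy loss pulls matching
# image/text pairs together. The encoders are placeholders (random features).
import torch
import torch.nn.functional as F

batch = 4
image_features = torch.randn(batch, 512)   # output of an image encoder (placeholder)
text_features  = torch.randn(batch, 512)   # output of a text encoder (placeholder)

# Normalize so the dot product is cosine similarity, then scale by a temperature.
image_features = F.normalize(image_features, dim=-1)
text_features  = F.normalize(text_features, dim=-1)
logits = image_features @ text_features.t() / 0.07   # (batch, batch) similarity matrix

# The i-th image matches the i-th text, so the targets are the diagonal.
targets = torch.arange(batch)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
print(loss.item())
```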

  • Visual question answering

Visual question answering, as presented in datasets like VQA, is a task that requires a model to correctly respond to a text-based question based on an image. Teams at Microsoft Research have developed some of the leading approaches for the task. METER is a general framework for training performant end-to-end vision-language transformers using a variety of possible sub-architectures for the vision encoder, text encoder, multimodal fusion and decoder modules. Unified Vision-Language pretrained Model (VLMo) uses a modular transformer network to jointly learn a dual encoder and a fusion encoder. Each block in the network contains a pool of modality-specific experts and a shared self-attention layer, offering significant flexibility for fine-tuning.
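A rough sketch of the mixture-of-modality-experts idea (not the published VLMo code) is shown below: self-attention is shared across modalities, while each modality is routed to its own feed-forward expert. Layer sizes and routing are simplified assumptions.

```python
# A rough sketch of a "mixture-of-modality-experts" transformer block:
# shared self-attention across modalities, with per-modality feed-forward experts.
import torch
import torch.nn as nn

class MoMEBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared across modalities
        self.experts = nn.ModuleDict({
            "vision": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
            "text":   nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
        })
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, modality):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.experts[modality](x))   # route to the modality-specific expert
        return x

block = MoMEBlock()
text_out  = block(torch.randn(1, 8, 64), modality="text")
image_out = block(torch.randn(1, 49, 64), modality="vision")
print(text_out.shape, image_out.shape)
```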

  • Text-to-image and image-to-text search

Web search is another important application of Multimodal Learning. An example of a dataset presenting this task is WebQA, which is a multimodal and multi-hop benchmark that simulates web search. WebQA was constructed by teams at Microsoft and Carnegie Mellon University.

In this task, a model needs to identify sources (either image or text-based) that can help answer the query. For most questions, the model needs to consider more than one source to get to the correct answer. The system then needs to reason using these multiple sources to generate an answer for the query in natural language.

Google has tackled the multimodal search task with A Large-scale ImaGe and Noisy-Text Embedding model (ALIGN). This model exploits the easily available but noisy alt-text data associated with images on the internet to train separate visual (EfficientNet-L2) and text (BERT-Large) encoders, the outputs of which are then combined using contrastive learning. The resulting model stores multimodal representations that power cross-modal search without any further fine-tuning.
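The search application itself can be sketched simply, assuming the dual encoders have already produced embeddings: image vectors are indexed once, and a text query is matched against them by cosine similarity without any task-specific fine-tuning. The embeddings below are random placeholders for encoder outputs.

```python
# A minimal sketch of cross-modal search with a dual-encoder model:
# precomputed image embeddings are ranked against a text-query embedding.
import torch
import torch.nn.functional as F

image_index = F.normalize(torch.randn(10_000, 640), dim=-1)   # precomputed image embeddings
query = F.normalize(torch.randn(1, 640), dim=-1)              # embedding of "a red bicycle by the lake"

scores = (query @ image_index.t()).squeeze(0)    # cosine similarity against the whole index
top_images = scores.topk(5).indices              # ids of the 5 best-matching images
print(top_images.tolist())
```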

  • Video-language modeling

Historically, video-based tasks have been challenging for AI systems because they are resource-intensive, but this is beginning to change. One of the main efforts in the domain of video-language modeling and other video-related multimodal tasks is driven by Microsoft's Project Florence-VL. In mid-2021, Project Florence-VL introduced ClipBERT, which involves a combination of a CNN and a transformer model that operates on top of sparsely sampled frames and is optimized in an end-to-end fashion to solve popular video-language tasks. VIOLET and SwinBERT are evolutions of ClipBERT that introduce Masked Visual-token Modeling and Sparse Attention to improve SotA in video question answering, video retrieval and video captioning.
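The sparse-sampling idea behind ClipBERT can be illustrated with a short, assumption-laden sketch: rather than encoding every frame, only a handful of evenly spaced frames are passed to the vision backbone, which keeps video-language training affordable.

```python
# A hedged sketch of sparse frame sampling: instead of encoding every frame,
# only a few uniformly spaced frames are fed to the vision backbone.
import torch

def sparse_sample(video, num_frames=8):
    """video: (T, C, H, W) tensor of decoded frames; returns num_frames evenly spaced frames."""
    total = video.shape[0]
    indices = torch.linspace(0, total - 1, num_frames).long()
    return video[indices]

video = torch.randn(240, 3, 224, 224)    # ~8 seconds of 30 fps video (placeholder frames)
clip = sparse_sample(video)
print(clip.shape)                        # torch.Size([8, 3, 224, 224])
```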

The difference is in the details, but all the models above share the same characteristic of using a transformer-based architecture. This type of architecture is often coupled with parallel learning modules to extract data from the various modalities and then unify them into a single multimodal representation.

Intel Labs and Microsoft Create Vision-and-Language Pre-training Model

In a similar fashion to the approaches described above, the work of the Cognitive AI (CAI) research team at Intel Labs focuses on creating multimodal representations using a transformer-based model architecture. However, unlike some models such as CLIP (which is good at instance-level pairing of image and text), the Cognitive AI team's approach is to achieve fine-grained alignment of entities in image and text. The architectures developed also allow full-image context to be provided to the same multimodal transformer that processes the text.

Working jointly with the Microsoft Research Natural Language Computing (NLC) group, the Cognitive AI team recently unveiled KD-VLP, a model that is particularly effective at concept-level vision-language alignment. The architecture and pre-training tasks emphasize entity-level representations, or objectness, in the system. KD-VLP demonstrates competitive performance on tasks like Visual Question Answering (VQA2.0), Visual Commonsense Reasoning (VCR), Image and Text Retrieval (IR/TR) on MSCOCO and Flickr30K, Natural Language for Visual Reasoning (NLVR2), and Visual Entailment (SNLI-VE).

The self-supervised training of the model results in emergent attention patterns that are also interpretable. For example, the following clip shows how the visual attention of the model changes as it ponders each word in the accompanying text. These patterns provide valuable insight into the model's inner workings and its reasoning mechanisms. This is useful when exploring gaps in the model's reasoning capabilities that need to be addressed.

Figure 3: Heatmap tracking multimodal attention. Image credit: Intel Labs © 2022 Intel Corporation.
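To illustrate what such an inspection might look like in practice (with random weights standing in for real model outputs), the following sketch plots each word's attention over a 7x7 grid of image patches as a heatmap.

```python
# A small illustrative sketch of inspecting per-word visual attention: for each
# text token, its attention over image patches is reshaped into a 2D grid and
# plotted as a heatmap. The attention weights are random placeholders.
import torch
import matplotlib.pyplot as plt

tokens = ["the", "dog", "catches", "the", "frisbee"]
attn = torch.softmax(torch.randn(len(tokens), 49), dim=-1)   # (tokens, 7x7 patches)

fig, axes = plt.subplots(1, len(tokens), figsize=(3 * len(tokens), 3))
for ax, word, weights in zip(axes, tokens, attn):
    ax.imshow(weights.reshape(7, 7).numpy(), cmap="viridis")   # attention over the patch grid
    ax.set_title(word)
    ax.axis("off")
plt.savefig("attention_per_word.png")
```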

This research collaboration with the Microsoft research team has produced solutions that tackle multimodal challenges such as question answering over a multimodal dataset. A knowledge-informed multimodal system currently leads the public leaderboard on the VisualCOMET task, where the AI system needs to reason about the dynamic content of a still image. The model can evoke a dynamic storyline from a single image, much as humans can conjure up what happened previously and what might happen next.

This single-model solution is also rather competitive on the public leaderboard of the Visual Commonsense Reasoning (VCR) challenge, currently placing within the top five among single-model solutions, and our solution to WebQA made the winning list of the NeurIPS 2021 competition. The WebQA solution involves a novel method for incorporating multimodal sources into a language-generation model. The system can contextualize image and text sources with the question through a multimodal encoder and effectively aggregate information across multiple sources. A decoder uses the result of this fusion across multiple multimodal sources to answer the query in natural language.

Figure 4: Example of a WebQA question with attention heat map. Trogon Surrucura image credit: Wikimedia and Cláudio Dias Timm.

Conclusion

Real-life environments are inherently multimodal. This application area allows the AI research community to further push the transition of AI from statistical analytics of a single perception modality (like images or text) to a multifaceted view of objects and their interaction, helping to make progress on the journey from ‘form’ to ‘meaning.’

References

  1. Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., … & Bowman, S. R. (2019). SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.
  2. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
  3. Rajpurkar, P., Jia, R., & Liang, P. (2018). Know what you don’t know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.
  4. Rajpurkar, P., Jia, R., & Liang, P. (2021). The Stanford Question Answering Dataset. https://rajpurkar.github.io/SQuAD-explorer/
  5. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision (pp. 1026–1034).
  6. Wikidata. (2019). Retrieved January 31, 2022, from https://www.wikidata.org/wiki/Wikidata:Main_Page
  7. Knight, W. (2020, April 2). The US military wants to teach AI some basic common sense. MIT Technology Review. https://www.technologyreview.com/2018/10/11/103957/the-us-military-wants-to-teach-ai-some-basic-common-sense/
  8. Pavlus, J. (2020, May 4). Common Sense Comes to Computers. Quanta Magazine. https://www.quantamagazine.org/common-sense-comes-to-computers-20200430/
  9. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  10. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., … & Amodei, D. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
  11. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.
  12. Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., … & Sutskever, I. (2021). Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092.
  13. Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., … & Chen, M. (2021). Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741.
  14. Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., & Fried, I. (2005). Invariant visual representation by single neurons in the human brain. Nature, 435(7045), 1102–1107.
  15. Goh, G., Cammarata, N., Voss, C., Carter, S., Petrov, M., Schubert, L., … & Olah, C. (2021). Multimodal neurons in artificial neural networks. Distill, 6(3), e30.
  16. Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv:1503.03585, 2015.
  17. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., & Parikh, D. (2017). Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6904–6913).
  18. Dou, Z. Y., Xu, Y., Gan, Z., Wang, J., Wang, S., Wang, L., … & Zeng, M. (2021). An Empirical Study of Training End-to-End Vision-and-Language Transformers. arXiv preprint arXiv:2111.02387.
  19. Wang, W., Bao, H., Dong, L., & Wei, F. (2021). VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts. arXiv preprint arXiv:2111.02358.
  20. Chang, Y., Narang, M., Suzuki, H., Cao, G., Gao, J., & Bisk, Y. (2021). WebQA: Multihop and Multimodal QA. arXiv preprint arXiv:2109.00590.
  21. Jia, C., Yang, Y., Xia, Y., Chen, Y. T., Parekh, Z., Pham, H., … & Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918.
  22. Jia, C., & Yang, Y. (2021, May 11). ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. Google AI Blog. https://ai.googleblog.com/2021/05/align-scaling-up-visual-and-vision.html
  23. Tan, M., & Le, Q. V. (2019, May 29). EfficientNet: Improving Accuracy and Efficiency through AutoML and Model Scaling. Google AI Blog. https://ai.googleblog.com/2019/05/efficientnet-improving-accuracy-and.html
  24. Devlin, J., & Chang, M. (2018, November 2). Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing. Google AI Blog. https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
  25. Microsoft. (2021, December 14). Project Florence-VL. Microsoft Research. https://www.microsoft.com/en-us/research/project/project-florence-vl/
  26. Lei, J., Li, L., Zhou, L., Gan, Z., Berg, T. L., Bansal, M., & Liu, J. (2021). Less is more: ClipBERT for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7331–7341).
  27. Fu, T. J., Li, L., Gan, Z., Lin, K., Wang, W. Y., Wang, L., & Liu, Z. (2021). VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling. arXiv preprint arXiv:2111.12681.
  28. Lin, K., Li, L., Lin, C. C., Ahmed, F., Gan, Z., Liu, Z., … & Wang, L. (2021). SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning. arXiv preprint arXiv:2111.13196.
  29. Liu, Y., Wu, C., Tseng, S. Y., Lal, V., He, X., & Duan, N. (2021). KD-VLP: Improving end-to-end vision-and-language pretraining with object knowledge distillation. arXiv preprint arXiv:2109.10504.
  30. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). VQA: Visual question answering. In Proceedings of the IEEE international conference on computer vision (pp. 2425–2433).
  31. Zellers, R., Bisk, Y., Farhadi, A., & Choi, Y. (2019). From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6720–6731).
  32. Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., … & Zitnick, C. L. (2014, September). Microsoft coco: Common objects in context. In European conference on computer vision (pp. 740–755). Springer, Cham.
  33. Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014). From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2, 67–78.
  34. Suhr, A., Zhou, S., Zhang, A., Zhang, I., Bai, H., & Artzi, Y. (2018). A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491.
  35. Xie, N., Lai, F., Doran, D., & Kadav, A. (2018). Visual entailment task for visually-grounded language learning. arXiv preprint arXiv:1811.10582.
  36. Microsoft. (2022, January 19). Natural Language Computing. Microsoft Research. https://www.microsoft.com/en-us/research/group/natural-language-computing/
  37. Park, J. S., Bhagavatula, C., Mottaghi, R., Farhadi, A., & Choi, Y. (2020, August). VisualCOMET: Reasoning about the dynamic context of a still image. In European Conference on Computer Vision (pp. 508–524). Springer, Cham.

The post Multimodality: A New Frontier in Cognitive AI appeared first on Towards Data Science.
