How I Won Singapore’s GPT-4 Prompt Engineering Competition

A deep dive into the strategies I learned for harnessing the power of Large Language Models (LLMs)
Celebrating a milestone – The real win was the priceless learning experience!

Last month, I had the incredible honor of winning Singapore’s first-ever GPT-4 Prompt Engineering competition, organised by the Government Technology Agency of Singapore (GovTech). The competition brought together over 400 prompt-ly brilliant participants.

Prompt engineering is a discipline that blends both art and science – it requires as much technical understanding as it does creativity and strategic thinking. This is a compilation of the prompt engineering strategies I learned along the way – strategies that push any LLM to do exactly what you need and more!

Author’s Note: In writing this, I sought to steer away from the traditional prompt engineering techniques that have already been extensively discussed and documented online. Instead, my aim is to bring fresh insights that I learned through experimentation, and a different, personal take on understanding and approaching certain techniques. I hope you’ll enjoy reading this piece!

This article covers the following, with 🔵 referring to beginner-friendly prompting techniques while 🔴 refers to advanced strategies:

1. [🔵 ] Structuring prompts using the CO-STAR framework

2. [🔵 ] Sectioning prompts using delimiters

3. [🔴 ] Creating system prompts with LLM guardrails

4. [🔴 ] Analyzing datasets using only LLMs, without plugins or code – With a hands-on example of analyzing a real-world Kaggle dataset using GPT-4


1. [🔵 ] Structuring Prompts using the CO-STAR framework

Effective prompt structuring is crucial for eliciting optimal responses from an LLM. The CO-STAR framework, a brainchild of GovTech Singapore’s Data Science & AI team, is a handy template for structuring prompts. It considers all the key aspects that influence the effectiveness and relevance of an LLM’s response, leading to more optimal responses.

CO-STAR framework – Image by author

Here’s how it works:

(C) Context: Provide background information on the task

This helps the LLM understand the specific scenario being discussed, ensuring its response is relevant.

(O) Objective: Define what the task is that you want the LLM to perform

Being clear about your objective helps the LLM to focus its response on meeting that specific goal.

(S) Style: Specify the writing style you want the LLM to use

This could be a particular famous person’s style of writing, or a particular expert in a profession, like a business analyst expert or CEO. This guides the LLM to respond with the manner and choice of words aligned with your needs.

(T) Tone: Set the attitude of the response

This ensures the LLM’s response resonates with the intended sentiment or emotional context required. Examples are formal, humorous, empathetic, among others.

(A) Audience: Identify who the response is intended for

Tailoring the LLM’s response to an audience, such as experts in a field, beginners, children, and so on, ensures that it is appropriate and understandable in your required context.

(R) Response: Provide the response format

This ensures that the LLM outputs in the exact format that you require for downstream tasks. Examples include a list, a JSON, a professional report, and so on. For most LLM applications which work on the LLM responses programmatically for downstream manipulations, a JSON output format would be ideal.

A practical application of CO-STAR

Here’s a real-world example of why the CO-STAR framework is useful. Let’s say you work as a social media manager, and you need help drafting a Facebook post to advertise your company’s new product.

A quick prompt without CO-STAR might look something like this:

Write a facebook post to advertise my company’s new product. My company’s name is Alpha and the product is called Beta, a new ultra-fast hairdryer.

Here’s the output from GPT-4:

Introducing the Revolutionary Beta Hairdryer by Alpha!

  • Say goodbye to long, tedious hair drying sessions! Alpha is proud to unveil the Beta, our latest innovation in haircare technology.
  • Ultra-Fast Drying: The Beta is designed to cut your drying time in half, thanks to its advanced, powerful motor. Get ready faster and enjoy more time doing what you love! Shop Now: [Link to product page]

While adequate, this output is painfully generic, lacking the specific nuances and targeted appeal necessary to resonate with your company’s audience.

Here’s an example of using the CO-STAR template instead, which reminds you to incorporate other aspects of your task into your prompt, specifically the STYLE, TONE, and AUDIENCE aspects that are missing from the previous quick prompt:

CONTEXT

I want to advertise my company’s new product. My company’s name is Alpha and the product is called Beta, which is a new ultra-fast hairdryer.

OBJECTIVE

Create a Facebook post for me, which aims to get people to click on the product link to purchase it.

STYLE

Follow the writing style of successful companies that advertise similar products, such as Dyson.

TONE

Persuasive

AUDIENCE

My company’s audience profile on Facebook is typically the older generation. Tailor your post to target what this audience typically looks out for in hair products.

RESPONSE

The Facebook post, kept concise yet impactful.

By applying CO-STAR, GPT-4’s response becomes significantly more targeted and effective:

Embrace Effortless Elegance with Alpha Beta – The Ultra-Fast Hairdryer for the Wise Generation.

Rediscover the joy of simple, effective hair care with Alpha Beta. Our latest innovation is more than just a hairdryer; it’s a promise of swift, gentle, and reliable hair styling for those who appreciate the finer things in life.

  • Easy and Intuitive Use: Say goodbye to complicated gadgets. Alpha Beta is crafted for comfort and simplicity, perfect for those who value straightforward, hassle-free technology.
  • Time-Saving Technology: We understand your time is precious. That’s why Alpha Beta cuts down drying time significantly, giving you more moments to enjoy life’s pleasures.

Make every day a good hair day with Alpha Beta. Experience the blend of sophistication and simplicity today.

[Your Product Link Here]

The CO-STAR framework guides you to provide all of the crucial pieces of information about your task to the LLM in a structured manner, ensuring a tailored and optimized response to exactly what you need.
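If you assemble prompts in code rather than by hand, CO-STAR translates naturally into a small template. Below is a minimal Python sketch of such a helper – the function name and section headers are my own illustrative choices, not part of the framework itself:

def costar_prompt(context, objective, style, tone, audience, response):
    #assemble a CO-STAR prompt from its six components
    return (
        f"# CONTEXT #\n{context}\n\n"
        f"# OBJECTIVE #\n{objective}\n\n"
        f"# STYLE #\n{style}\n\n"
        f"# TONE #\n{tone}\n\n"
        f"# AUDIENCE #\n{audience}\n\n"
        f"# RESPONSE #\n{response}"
    )

prompt = costar_prompt(
    context="I want to advertise my company's new product. My company's name is Alpha and the product is called Beta, a new ultra-fast hairdryer.",
    objective="Create a Facebook post for me, which aims to get people to click on the product link to purchase it.",
    style="Follow the writing style of successful companies that advertise similar products, such as Dyson.",
    tone="Persuasive",
    audience="My company's audience profile on Facebook is typically the older generation. Tailor your post to target what this audience typically looks out for in hair products.",
    response="The Facebook post, kept concise yet impactful.",
)

The same template can then be reused for any task simply by swapping out the six arguments.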


2. [🔵 ] Sectioning Prompts Using Delimiters

Image generated by DALL·E 3

Delimiters are special tokens that help the LLM distinguish which parts of your prompt it should consider as a single unit of meaning. This is important because your entire prompt arrives at the LLM as a single long sequence of tokens. Delimiters provide structure to this sequence of tokens by fencing off specific parts of your prompt to be treated differently.

It is noteworthy that delimiters may not make a difference to the quality of an LLM’s response for straightforward tasks. However, the more complex the task, the more impact the usage of delimiters for sectioning has on the LLM’s response.

Delimiters as Special Characters

A delimiter could be any sequence of special characters that usually wouldn’t appear together, for example:

  • ###
  • ===
  • >>>

The number and type of special characters chosen are inconsequential, as long as they are unique enough for the LLM to understand them as content separators instead of normal punctuation.

Here’s an example of how you might use such delimiters in a prompt:

Classify the sentiment of each conversation in <<<CONVERSATIONS>>> as ‘Positive’ or ‘Negative’. Give the sentiment classifications without any other preamble text.

###

EXAMPLE CONVERSATIONS

[Agent]: Good morning, how can I assist you today? [Customer]: This product is terrible, nothing like what was advertised! [Customer]: I’m extremely disappointed and expect a full refund.

[Agent]: Good morning, how can I help you today? [Customer]: Hi, I just wanted to say that I’m really impressed with your product. It exceeded my expectations!

###

EXAMPLE OUTPUTS

Negative

Positive

###

<<<
[Agent]: Hello! Welcome to our support. How can I help you today? [Customer]: Hi there! I just wanted to let you know I received my order, and it’s fantastic! [Agent]: That’s great to hear! We’re thrilled you’re happy with your purchase. Is there anything else I can assist you with? [Customer]: No, that’s it. Just wanted to give some positive feedback. Thanks for your excellent service!

[Agent]: Hello, thank you for reaching out. How can I assist you today? [Customer]: I’m very disappointed with my recent purchase. It’s not what I expected at all. [Agent]: I’m sorry to hear that. Could you please provide more details so I can help? [Customer]: The product is of poor quality and it arrived late. I’m really unhappy with this experience.
>>>

Above, the examples are sectioned using the delimiter ###, with the section headings EXAMPLE CONVERSATIONS and EXAMPLE OUTPUTS in capital letters to differentiate them. The preamble states that the conversations to be classified are sectioned inside <<<CONVERSATIONS>>>, and these conversations are subsequently given to the LLM at the bottom of the prompt without any explanatory text, but the LLM understands that these are the conversations it should classify due to the presence of the delimiters <<< and >>>.

Here is the output from GPT-4, with the sentiment classifications given without any other preamble text outputted, like what we asked for:

Positive

Negative

Delimiters as XML Tags

Another approach to using delimiters is having them as XML tags. XML tags are tags enclosed in angle brackets, with opening and closing tags. An example is <tag> and </tag>. This is effective as LLMs have been trained on a lot of web content in XML, and have learned to understand its formatting.

Here’s the same prompt above, but structured using XML tags as delimiters instead:

Classify the sentiment of the following conversations into one of two classes, using the examples given. Give the sentiment classifications without any other preamble text.

<classes>
Positive
Negative
</classes>

<example-conversations>
[Agent]: Good morning, how can I assist you today? [Customer]: This product is terrible, nothing like what was advertised! [Customer]: I’m extremely disappointed and expect a full refund.

[Agent]: Good morning, how can I help you today? [Customer]: Hi, I just wanted to say that I’m really impressed with your product. It exceeded my expectations!
</example-conversations>

<example-classes>
Negative
Positive
</example-classes>

<conversations>
[Agent]: Hello! Welcome to our support. How can I help you today? [Customer]: Hi there! I just wanted to let you know I received my order, and it’s fantastic! [Agent]: That’s great to hear! We’re thrilled you’re happy with your purchase. Is there anything else I can assist you with? [Customer]: No, that’s it. Just wanted to give some positive feedback. Thanks for your excellent service!

[Agent]: Hello, thank you for reaching out. How can I assist you today? [Customer]: I’m very disappointed with my recent purchase. It’s not what I expected at all. [Agent]: I’m sorry to hear that. Could you please provide more details so I can help? [Customer]: The product is of poor quality and it arrived late. I’m really unhappy with this experience.
</conversations>

It is beneficial to use the same noun for the XML tag as the words you have used to describe them in the instructions. The instructions we gave in the prompt above were:

Classify the sentiment of the following conversations into one of two classes, using the examples given. Give the sentiment classifications without any other preamble text.

Where we used the nouns conversations, classes, and examples. As such, the XML tags we use as delimiters are <conversations>, <classes>, <example-conversations>, and <example-classes>. This ensures that the LLM understands how your instructions relate to the XML tags used as delimiters.

Again, the sectioning of your instructions in a clear and structured manner through the use of delimiters ensures that GPT-4 responds exactly how you want it to:

Positive

Negative
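If you are generating such delimited prompts in code, for example to classify batches of conversations, the XML-tag version is easy to assemble from lists of strings. Here is a minimal sketch – the helper below is purely illustrative and not from the competition:

def build_sentiment_prompt(conversations, example_conversations, example_classes):
    #wrap a list of strings in an opening and closing XML tag
    def tag(name, items):
        return f"<{name}>\n" + "\n\n".join(items) + f"\n</{name}>"

    instructions = (
        "Classify the sentiment of the following conversations into one of two classes, "
        "using the examples given. Give the sentiment classifications without any other preamble text."
    )
    return "\n\n".join([
        instructions,
        tag("classes", ["Positive", "Negative"]),
        tag("example-conversations", example_conversations),
        tag("example-classes", example_classes),
        tag("conversations", conversations),
    ])

Because the tag names mirror the nouns used in the instructions, the generated prompt has the same structure as the one shown above.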


3. [🔴 ] Creating System Prompts With LLM Guardrails

Before diving in, it is important to note that this section is relevant only to LLMs that possess a System Prompt feature, unlike the other sections in this article which are relevant for any LLM. The most notable LLM with this feature is, of course, ChatGPT, and therefore we will use ChatGPT as the illustrating example for this section.

Image generated by DALL·E 3

Terminology surrounding System Prompts

First, let’s iron out terminology: With regards to ChatGPT, there exists a plethora of resources using these 3 terms almost interchangeably: "System Prompts", "System Messages", and "Custom Instructions". This has proved confusing to many (including me!), so much so that OpenAI released an article explaining these terminologies. Here’s a quick summary of it:

  • "System Prompts" and "System Messages" are terms used when interacting with ChatGPT programmatically over its Chat Completions API.
  • On the other hand, "Custom Instructions" is the term used when interacting with ChatGPT over its user interface at https://chat.openai.com/.
Image from Enterprise DNA Blog

Overall, though, the 3 terms refer to the same thing, so don’t let the terminology confuse you! Moving forward, this section will use the term "System Prompts". Now let’s dive in!

What are System Prompts?

A System Prompt is an additional prompt where you provide instructions on how the LLM should behave. It is considered additional as it sits outside of your "normal" prompts (better known as User Prompts) to the LLM.

Within a chat, every time you provide a new prompt, System Prompts act like a filter that the LLM automatically applies before giving its response to your new prompt. This means that the System Prompts are taken into account every time the LLM responds within the chat.

When should System Prompts be used?

The first question on your mind might be: Why should I provide instructions inside the System Prompt when I can also provide them in my first prompt to a new chat, before further conversations with the LLM?

The answer is because LLMs have a limit to their conversational memory. In the latter case, as the conversation carries on, the LLM is likely to "forget" this first prompt you provided to the chat, making these instructions obsolete.

On the other hand, when instructions are provided in the System Prompt, these System Prompt instructions are automatically taken into account together with each new prompt provided to the chat. This ensures that the LLM continues to receive these instructions even as the conversation carries on, no matter how long the chat becomes.

In conclusion:

Use System Prompts to provide instructions that you want the LLM to remember when responding throughout the entire chat.

What should System Prompts include?

Instructions in the System Prompt typically include the following categories:

  • Task definition, so the LLM will always remember what it has to do throughout the chat.
  • Output format, so the LLM will always remember how it should respond.
  • Guardrails, so the LLM will always remember how it should not respond. Guardrails are an emerging field in LLM governance, referring to the configured boundaries within which an LLM is allowed to operate.

For example, a System Prompt might look like this:

You will answer questions using this text: [insert text]. You will respond with a JSON object in this format: {"Question": "Answer"}. If the text does not contain sufficient information to answer the question, do not make up information and give the answer as "NA". You are only allowed to answer questions related to [insert scope]. Never answer any questions related to demographic information such as age, gender, and religion.

Where each portion relates to the categories as follows:

Breaking down a System Prompt – Image by author

But then what goes into the "normal" prompts to the chat?

Now you might be thinking: That sounds like a lot of information already being given in the System Prompt. What do I put in my "normal" prompts (better known as User Prompts) to the chat then?

The System Prompt outlines the general task at hand. In the above System Prompt example, the task has been defined to only use a specific piece of text for question-answering, and the LLM is instructed to respond in the format {"Question": "Answer"}.

You will answer questions using this text: [insert text]. You will respond with a JSON object in this format: {"Question": "Answer"}.

In this case, each User Prompt to the chat would simply be the question that you want answered using the text. For example, a User Prompt might be "What is the text about?". And the LLM would respond with {"What is the text about?": "The text is about..."}.

But let’s generalize this task example further. In practice, it would be more likely that you have multiple pieces of text that you want to ask questions on, rather than just 1. In this case, we could edit the first line of the above System Prompt from

You will answer questions using this text: [insert text].

to

You will answer questions using the provided text.

Now, each User Prompt to the chat would include both the text to conduct question-answering over, and the question to be answered, such as:

<text>
[insert text]
</text>

<question>
[insert question]
</question>

Here, we also use XML tags as delimiters in order to provide the 2 required pieces of information to the LLM in a structured manner. The nouns used in the XML tags, text and question, correspond to the nouns used in the System Prompt so that the LLM understands how the tags relate to the System Prompt instructions.

In conclusion, the System Prompt should give the overall task instructions, and each User Prompt should provide the exact specifics that you want the task to be executed using. In this case, for example, these exact specifics are the text and the question.
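If you are working over the Chat Completions API rather than the user interface, this division maps directly onto message roles: the System Prompt goes into a single "system" message, and each text-and-question pair goes into a "user" message. Here is a minimal sketch using the openai Python client (v1.x) – the model name and placeholder text are illustrative:

from openai import OpenAI

client = OpenAI()  #reads the OPENAI_API_KEY environment variable

system_prompt = (
    'You will answer questions using the provided text. '
    'You will respond with a JSON object in this format: {"Question": "Answer"}. '
    'If the text does not contain sufficient information to answer the question, '
    'do not make up information and give the answer as "NA".'
)

user_prompt = (
    "<text>\n[insert text]\n</text>\n\n"
    "<question>\nWhat is the text about?\n</question>"
)

response = client.chat.completions.create(
    model="gpt-4",  #placeholder; use whichever model you have access to
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
)
print(response.choices[0].message.content)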

Extra: Making LLM guardrails dynamic

Above, guardrails are added through a few sentences in the System Prompt. These guardrails are then set in stone and do not change for the entire chat. What if you wish to have different guardrails in place at different points of the conversation?

Unfortunately for users of the ChatGPT user interface, there is no straightforward way to do this right now. However, if you’re interacting with ChatGPT programmatically, you’re in luck! The increasing focus on building effective LLM guardrails has seen the development of open-source packages that allow you to set up far more detailed and dynamic guardrails programmatically.

A noteworthy one is NeMo Guardrails developed by the NVIDIA team, which allows you to configure the expected conversation flow between users and the LLM, and thus set up different guardrails at different points of the chat, allowing for dynamic guardrails that evolve as the chat progresses. I definitely recommend checking it out!
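To give a flavour of what this looks like in practice, here is a minimal sketch of loading a NeMo Guardrails configuration and chatting through it. The config directory and its contents are assumptions on my part – the actual guardrail behaviour lives in the Colang and YAML files you write, which the official documentation covers in detail:

from nemoguardrails import LLMRails, RailsConfig

#"config/" is assumed to contain a Colang file defining the allowed conversation flows
#(e.g. refusing demographic questions) plus a YAML file specifying the underlying LLM
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

response = rails.generate(messages=[
    {"role": "user", "content": "Tell me about the demographics of your customers."}
])
print(response["content"])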


4. [🔴 ] Analyzing datasets using only LLMs, without plugins or code

Image generated by DALL·E 3

You might have heard of OpenAI’s Advanced Data Analysis plugin within ChatGPT’s GPT-4 that is available to premium (paid) accounts. It allows users to upload datasets to ChatGPT and run code directly on the dataset, allowing for accurate data analysis.

But did you know that you don’t always need such plugins to analyze datasets well with LLMs? Let’s first understand the strengths and limitations of purely using LLMs to analyze datasets.

Types of dataset analysis that LLMs are not great at

As you probably already know, LLMs are limited in their ability to perform accurate mathematical calculations, making them unsuitable for tasks requiring precise quantitative analysis on datasets, such as:

  • Descriptive Statistics: Summarizing numerical columns quantitatively, through measures like the mean or variance.
  • Correlation Analysis: Obtaining the precise correlation coefficient between columns.
  • Statistical Analysis: Such as hypothesis testing to determine if there are statistically significant differences between groups of data points.
  • Machine Learning: Performing predictive modelling on a dataset such as using linear regressions, gradient boosted trees, or neural networks.

Performing such quantitative tasks on datasets accurately is exactly why OpenAI’s Advanced Data Analysis plugin exists – it lets a programming language step in to run code for these tasks on the dataset.

So, why would anyone want to analyze datasets using only LLMs and without such plugins?

Types of dataset analysis that LLMs are great at

LLMs are excellent at identifying patterns and trends. This capability stems from their extensive training on diverse and voluminous data, enabling them to discern intricate patterns that may not be immediately apparent.

This makes them well-suited for tasks based on pattern-finding within datasets, such as:

  • Anomaly detection: Identifying unusual data points that deviate from the norm, based on one or more column values.
  • Clustering: Grouping data points with similar characteristics across columns.
  • Cross-Column Relationships: Identifying combined trends across columns.
  • Textual Analysis (For text-based columns): Categorization based on topic or sentiment.
  • Trend Analysis (For datasets with time aspects): Identifying patterns, seasonal variations, or trends within columns across time.

For such pattern-based tasks, using LLMs alone may in fact produce better results within a shorter timeframe than using code! Let’s illustrate this fully with an example.

Analyzing a Kaggle dataset using only LLMs

We’ll use a popular real-world Kaggle dataset curated for Customer Personality Analysis, wherein a company seeks to segment its customer base in order to understand its customers better.

For easier validation of the LLM’s analysis later, we’ll subset this dataset to 50 rows and retain only the most relevant columns. After which, the dataset for analysis looks like this, where each row represents a customer, and the columns depict customer information:

First 3 rows of dataset – Image by author

Say you work on the company’s marketing team. You are tasked to utilize this dataset of customer information to guide marketing efforts. This is a 2-step task: First, use the dataset to generate meaningful customer segments. Next, generate ideas on how to best market towards each segment. Now this is a practical business problem where the pattern-finding (for step 1) capability of LLMs can truly excel.

Let’s craft a prompt for this task as follows, using 4 Prompt Engineering techniques (more on these later!):

  1. Breaking down a complex task into simple steps
  2. Referencing intermediate outputs from each step
  3. Formatting the LLM’s response
  4. Separating the instructions from the dataset

System Prompt: I want you to act as a data scientist to analyze datasets. Do not make up information that is not in the dataset. For each analysis I ask for, provide me with the exact and definitive answer and do not provide me with code or instructions to do the analysis on other platforms.

Prompt:

CONTEXT

I sell wine. I have a dataset of information on my customers: [year of birth, marital status, income, number of children, days since last purchase, amount spent].

#############

OBJECTIVE

I want you use the dataset to cluster my customers into groups and then give me ideas on how to target my marketing efforts towards each group. Use this step-by-step process and do not use code:

  1. CLUSTERS: Use the columns of the dataset to cluster the rows of the dataset, such that customers within the same cluster have similar column values while customers in different clusters have distinctly different column values. Ensure that each row only belongs to 1 cluster.

For each cluster found,

  1. CLUSTER_INFORMATION: Describe the cluster in terms of the dataset columns.
  2. CLUSTER_NAME: Interpret [CLUSTER_INFORMATION] to obtain a short name for the customer group in this cluster.
  3. MARKETING_IDEAS: Generate ideas to market my product to this customer group.
  4. RATIONALE: Explain why [MARKETING_IDEAS] is relevant and effective for this customer group.

#############

STYLE

Business analytics report

#############

TONE

Professional, technical

#############

AUDIENCE

My business partners. Convince them that your marketing strategy is well thought-out and fully backed by data.

#############

RESPONSE: MARKDOWN REPORT

<For each cluster in [CLUSTERS]>

  • Customer Group: [CLUSTER_NAME]
  • Profile: [CLUSTER_INFORMATION]
  • Marketing Ideas: [MARKETING_IDEAS]
  • Rationale: [RATIONALE]
Give a table of the list of row numbers belonging to each cluster, in order to back up your analysis. Use these table headers: [[CLUSTER_NAME], List of Rows].

#############

START ANALYSIS

If you understand, ask me for my dataset.

Below is GPT-4’s reply, and we proceed to pass the dataset to it in a CSV string.

GPT-4’s response – Image by author

Following which, GPT-4 replies with its analysis in the markdown report format we asked for:

GPT-4’s response – Image by author

Validating the LLM’s analysis

For the sake of brevity, we’ll pick 2 customer groups generated by the LLM for validation – say, Young Families and Discerning Enthusiasts.

Young Families

  • Profile synthesized by LLM: Born after 1980, Married or Together, Moderate to low income, Have children, Frequent small purchases.
  • Rows clustered into this group by LLM: 3, 4, 7, 10, 16, 20
  • Digging into the dataset, the full data for these rows are:
Full data for Young Families – Image by author

Which exactly correspond to the profile identified by the LLM. It was even able to cluster the row with a null value without us preprocessing it beforehand!

Discerning Enthusiasts

  • Profile synthesized by LLM: Wide age range, Any marital status, High income, Varied children status, High spend on purchases.
  • Rows clustered into this group by LLM: 2, 5, 18, 29, 34, 36
  • Digging into the dataset, the full data for these rows are:
Full data for Discerning Enthusiasts – Image by author

Which again align very well with the profile identified by the LLM!

This example showcases LLMs’ abilities in pattern-finding, interpreting and distilling multi-dimensional datasets into meaningful insights, while ensuring that their analysis is deeply rooted in the factual truth of the dataset.

What if we used ChatGPT’s Advanced Data Analysis plugin?

For completeness, I attempted this same task with the same prompt, but asked ChatGPT to execute the analysis using code instead, which activated its Advanced Data Analysis plugin. The idea was for the plugin to run code using a clustering algorithm like K-Means directly on the dataset to obtain each customer group, before synthesizing the profile of each cluster to provide marketing strategies.

However, multiple attempts resulted in the following error messages with no outputs, despite the dataset being only 50 rows:

Error and no output from Attempt 1 – Image by author
Error and no output from Attempt 2 – Image by author

With the Advanced Data Analysis plugin right now, it appears that simpler tasks on datasets, such as calculating descriptive statistics or creating graphs, can be easily achieved, but more advanced tasks that require running algorithms may sometimes result in errors and no outputs, due to computational limits or otherwise.

So…When to analyze datasets using LLMs?

The answer is it depends on the type of analysis.

For tasks requiring precise mathematical calculations or complex, rule-based processing, conventional programming methods remain superior.

Tasks based on pattern recognition can be challenging or more time-consuming to execute using conventional programming and algorithmic approaches. LLMs, however, excel at such tasks, and can even provide additional outputs such as annexes to back up their analysis, along with full analysis reports in markdown formatting.

Ultimately, the decision to utilize LLMs hinges on the nature of the task at hand, balancing the strengths of LLMs in pattern-recognition against the precision and specificity offered by traditional programming techniques.

Now back to the prompt engineering!

Before this section ends, let’s go back to the prompt used to generate this dataset analysis and break down the key prompt engineering techniques used:

Prompt:

CONTEXT

I sell wine. I have a dataset of information on my customers: [year of birth, marital status, income, number of children, days since last purchase, amount spent].

#############

OBJECTIVE

I want you use the dataset to cluster my customers into groups and then give me ideas on how to target my marketing efforts towards each group. Use this step-by-step process and do not use code:

  1. CLUSTERS: Use the columns of the dataset to cluster the rows of the dataset, such that customers within the same cluster have similar column values while customers in different clusters have distinctly different column values. Ensure that each row only belongs to 1 cluster.

For each cluster found,

  1. CLUSTER_INFORMATION: Describe the cluster in terms of the dataset columns.
  2. CLUSTER_NAME: Interpret [CLUSTER_INFORMATION] to obtain a short name for the customer group in this cluster.
  3. MARKETING_IDEAS: Generate ideas to market my product to this customer group.
  4. RATIONALE: Explain why [MARKETING_IDEAS] is relevant and effective for this customer group.

#############

STYLE

Business analytics report

#############

TONE

Professional, technical

#############

AUDIENCE

My business partners. Convince them that your marketing strategy is well thought-out and fully backed by data.

#############

RESPONSE: MARKDOWN REPORT

<For each cluster in [CLUSTERS]>

  • Customer Group: [CLUSTER_NAME]
  • Profile: [CLUSTER_INFORMATION]
  • Marketing Ideas: [MARKETING_IDEAS]
  • Rationale: [RATIONALE]
Give a table of the list of row numbers belonging to each cluster, in order to back up your analysis. Use these table headers: [[CLUSTER_NAME], List of Rows].

#############

START ANALYSIS

If you understand, ask me for my dataset.

Technique 1: Breaking down a complex task into simple steps

LLMs are great at performing simple tasks, but not so great at complex ones. As such, with complex tasks like this one, it is important to break down the task into simple step-by-step instructions for the LLM to follow. The idea is to give the LLM the steps that you yourself would take to execute the task.

In this example, the steps are given as:

Use this step-by-step process and do not use code:

  1. CLUSTERS: Use the columns of the dataset to cluster the rows of the dataset, such that customers within the same cluster have similar column values while customers in different clusters have distinctly different column values. Ensure that each row only belongs to 1 cluster.

For each cluster found,

  1. CLUSTER_INFORMATION: Describe the cluster in terms of the dataset columns.
  2. CLUSTER_NAME: Interpret [CLUSTER_INFORMATION] to obtain a short name for the customer group in this cluster.
  3. MARKETING_IDEAS: Generate ideas to market my product to this customer group.
  4. RATIONALE: Explain why [MARKETING_IDEAS] is relevant and effective for this customer group.

As opposed to simply giving the overall task to the LLM as "Cluster the customers into groups and then give ideas on how to market to each group".

With step-by-step instructions, LLMs are significantly more likely to deliver the correct results.

Technique 2: Referencing intermediate outputs from each step

When providing the step-by-step process to the LLM, we give the intermediate output from each step a capitalized VARIABLE_NAME, namely CLUSTERS, CLUSTER_INFORMATION, CLUSTER_NAME, MARKETING_IDEAS and RATIONALE.

Capitalization is used to differentiate these variable names from the body of instructions given. These intermediate outputs can later be referenced using square brackets as [VARIABLE_NAME].

Technique 3: Formatting the LLM’s response

Here, we ask for a markdown report format, which beautifies the LLM’s response. Having variable names from intermediate outputs again comes in handy here to dictate the structure of the report.

RESPONSE: MARKDOWN REPORT

<For each cluster in [CLUSTERS]>

  • Customer Group: [CLUSTER_NAME]
  • Profile: [CLUSTER_INFORMATION]
  • Marketing Ideas: [MARKETING_IDEAS]
  • Rationale: [RATIONALE]
Give a table of the list of row numbers belonging to each cluster, in order to back up your analysis. Use these table headers: [[CLUSTER_NAME], List of Rows].

In fact, you could even subsequently ask ChatGPT to provide the report as a downloadable file, allowing you to work off of its response in writing your final report.

Saving GPT-4’s response as a file – Image by author

Technique 4: Separating the task instructions from the dataset

You’ll notice that we never gave the dataset to the LLM in our first prompt. Instead, the prompt gives only the task instructions for the dataset analysis, with this added to the bottom:

START ANALYSIS

If you understand, ask me for my dataset.

ChatGPT then responded that it understands, and we passed the dataset to it as a CSV string in our next prompt:

GPT-4’s response – Image by author

But why separate the instructions from the dataset?

Doing so helps the LLM maintain clarity in understanding each, with lower likelihood of missing out information, especially in more complex tasks such as this one with longer instructions. You might have experienced scenarios where the LLM "accidentally forgets" a certain instruction you gave as part of a longer prompt – for example, if you asked for a 100-word response and the LLM gives you a longer paragraph back. By receiving the instructions first, before the dataset that the instructions are for, the LLM can first digest what it should do, before executing it on the dataset provided next.

Note however that this separation of instructions and dataset can only be achieved with chat LLMs as they maintain a conversational memory, unlike completion LLMs which do not.
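For readers driving this flow through the API instead of the ChatGPT interface, here is a hedged sketch of what Technique 4 looks like in code: the instructions go in first, the model’s acknowledgement stays in the message history, and the dataset follows as a CSV string in the next user message. The file path, model name, and client usage are illustrative, not the exact competition setup:

import pandas as pd
from openai import OpenAI

client = OpenAI()

df = pd.read_csv("customers.csv").head(50)  #illustrative path; subset to 50 rows as in the article
csv_string = df.to_csv(index=False)  #the dataset as a CSV string

task_instructions = "..."  #the full prompt above, ending with "If you understand, ask me for my dataset."

messages = [
    {"role": "system", "content": "I want you to act as a data scientist to analyze datasets. ..."},
    {"role": "user", "content": task_instructions},
]
first_reply = client.chat.completions.create(model="gpt-4", messages=messages)
messages.append({"role": "assistant", "content": first_reply.choices[0].message.content})

#only now does the dataset enter the conversation, after the instructions have been digested
messages.append({"role": "user", "content": csv_string})
analysis = client.chat.completions.create(model="gpt-4", messages=messages)
print(analysis.choices[0].message.content)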


Closing Thoughts

Before this article ends, I wanted to share some personal reflections on this incredible journey.

First, a heartfelt thank you to GovTech Singapore for orchestrating such an amazing competition. If you’re interested in the mechanics of how GovTech organized this first-of-its-kind competition – check out this article by Nicole Lee, the lead organizer herself!

A live on-stage battle in the final round!

Second, a big shout-out to my fellow phenomenal competitors, who each brought something special, making the competition as enriching as it was challenging! I’ll never forget the final round, with us battling it out on stage and a live audience cheering us on – an experience I’ll always remember fondly.

For me, this wasn’t just a competition; it was a celebration of talent, creativity, and the spirit of learning. And I’m beyond excited to see what comes next!


I had a lot of fun writing this, and if you had fun reading, I would really appreciate if you took a second to leave some claps and a follow!

See you in the next one! Sheila

Stacked Ensembles for Advanced Predictive Modeling With H2O.ai and Optuna

How I Achieved Top 10% in Europe’s Largest Machine Learning Competition with Stacked Ensembles

A conceptual and hands-on coding guide to training Stacked Ensembles with H2O.ai and Optuna

We all know that ensemble models tend to outperform any single model at predictive modeling. You’ve probably heard all about Bagging and Boosting as common ensemble methods, with Random Forests and Gradient Boosting Machines as respective examples.

But what about ensembling different models together under a separate, higher-level model? This is where stacked ensembles come in. This article is a step-by-step guide on how to train stacked ensembles using the popular machine learning library, H2O.

To demonstrate the power of stacked ensembles, I will provide a walk-through of my full code for training a stacked ensemble of 40 Deep Neural Network, XGBoost and LightGBM models for the prediction task posed in the 2023 Cloudflight Coding Competition (AI Category), one of the largest coding competitions in Europe, where I placed top 10% on the competition leaderboard within a training time of 1 hour!

This guide will cover:

  1. What are stacked ensembles and how do they work?
  2. How to train stacked ensembles with H2O.ai – With a full code walk-through in Python
  3. Comparing the performance of a stacked ensemble versus standalone models

1. What are Stacked Ensembles and how do they work?

A stacked ensemble combines predictions from multiple models through another, higher-level model, with the aim being to increase overall predictive performance by capitalizing on the unique strengths of each constituent model. It involves 2 stages:

Stage 1: Multiple Base Models

First, multiple base models are independently trained on the same training dataset. These models should ideally be diverse, ranging from simple linear regressions to complex Deep Learning models. The key is that they should differ from each other in some way, either in terms of using a different algorithm or using the same algorithm but with different hyperparameter settings.

The more diverse the base models are, the more powerful the eventual stacked ensemble. This is because different models are able to capture different patterns in the data. For example, a tree-based model might be good at capturing non-linear relationships, while a linear model excels at understanding linear trends. When these diverse base models are combined, the stacked ensemble can then leverage the different strengths of each base model, increasing predictive performance.

Stage 2: One Meta-Model

After all the base models are trained, each base model’s predictions for the target are used as features for training a higher-level model, termed the meta-model. This means that the meta-model is not trained on the original dataset’s features, but instead on the predictions of the base models. If there are n base models, there are n columns of predictions generated, and these are the n features used for training the meta-model.

While the training features differ between the base models and the meta-model, the target however stays the same, which is the original target from the dataset.

The meta-model learns how to best combine the predictions from the base models to make a final, more accurate prediction.

Detailed Steps for Training a Stacked Ensemble

For each base model:

  1. Pick an algorithm (eg. Random Forest).
  2. Use cross-validation to obtain the best set of hyperparameters for the algorithm.
  3. Obtain cross-validation predictions for the target in the training set. These will be used to train the meta-model subsequently.

To illustrate this, say a Random Forest algorithm was chosen in Step 1, and its optimal hyperparameters were determined as h in Step 2.

The cross-validation predictions are obtained through the following, assuming 5-fold cross-validation:

  1. Train a Random Forest with hyperparameters h on Folds 1–4.
  2. Use the trained Random Forest to make predictions for Fold 5. These are the cross-validation predictions for Fold 5.
  3. Repeat the above to obtain cross-validation predictions for each fold. After this, cross-validation predictions for the target will have been obtained for the entire training set.

For the meta-model:

  1. Obtain the features for training the meta-model. These are the predictions of each of the base models.
  2. Obtain the target for training the meta-model. This is the original target from the training set.
  3. Pick an algorithm (eg. Linear Regression).
  4. Use cross-validation to obtain the best set of hyperparameters for the algorithm.

And voila! You now have:

  • Multiple base models that are trained with optimal hyperparameters
  • One meta-model that is also trained with optimal hyperparameters

Which means you have successfully trained a stacked ensemble!
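To make the mechanics concrete before moving to H2O, here is a minimal scikit-learn sketch of the same two-stage idea on synthetic data – out-of-fold predictions from a few diverse base models become the training features of a simple linear meta-model. This is purely an illustration of the concept, not the H2O pipeline used in the next section:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=1000, n_features=20, noise=10, random_state=1)

#stage 1: diverse base models, each producing out-of-fold (cross-validation) predictions
base_models = [
    RandomForestRegressor(n_estimators=200, random_state=1),
    GradientBoostingRegressor(random_state=1),
    Ridge(alpha=1.0),
]
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5) for model in base_models
])

#stage 2: a simple linear meta-model trained on the base models' predictions
meta_model = LinearRegression().fit(meta_features, y)

#the base models are then refit on the full training set for use at prediction time
for model in base_models:
    model.fit(X, y)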


2. How to Train Stacked Ensembles with H2O.ai

Now, let’s jump into coding it out!

As mentioned, this section covers my full code for training a stacked ensemble for the prediction task posed in the 2023 Cloudflight Coding Competition (AI Category), which is a regression task using tabular data. Within the competition’s time constraints, I created a stacked ensemble from 40 base models of 3 algorithm types – Deep Neural Network, XGBoost, and LightGBM, with these specific algorithms chosen as they often achieve superior performance in practice.

2.1. Data Preparation

First, let’s import the necessary libraries.

Python">import pandas as pd
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.estimators import H2OXGBoostEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
import optuna
from tqdm import tqdm

seed = 1

And initialize the H2O cluster.

h2o.init()

Next, load in the dataset.

data = pd.read_csv('path_to_your_tabular_dataset')

Before moving on to model building using H2O, let’s first understand the following traits of H2O models:

  1. H2O models cannot take in Pandas DataFrame objects, so data must be converted from a Pandas DataFrame to its H2O equivalent, which is an H2OFrame.
  2. H2O models can encode categorical features automatically, which is great as it takes this preprocessing step out of our hands. To ensure that such features are understood by the models to be categorical, they must be explicitly converted into the factor (categorical) data type.
data_h2o = h2o.H2OFrame(data)

categorical_cols = [...]  #insert the names of the categorical features here
for col in categorical_cols:
  data_h2o[col] = data_h2o[col].asfactor()

Now we can proceed to split our dataset into train (90%) and validation (10%) sets, using the split_frame() method of H2OFrame objects.

splits = data_h2o.split_frame(ratios=[0.9], seed=seed)
train = splits[0]
val = splits[1]

Lastly, let’s obtain the features and target for modelling. Unlike Scikit-Learn models which take as input the values of the features and the target, H2O models take as input the names of the features and the target.

y = '...'  #insert name of the target column here
x = list(train.columns)
x.remove(y) 

Now, let the model training fun begin!

2.2. Training Deep Neural Networks (DNN) as Base Models

Let’s start by training the DNNs that will form our set of base models for the stacked ensemble, using H2O’s H2ODeepLearningEstimator.

Aside: Why train DNNs in H2O, instead of Tensorflow, Keras, or PyTorch?

Before jumping into the code for this, you might be wondering why I chose to train DNNs using H2O’s H2ODeepLearningEstimator, as opposed to using Tensorflow, Keras, or PyTorch, which are the common libraries used to build DNNs.

The straightforward answer is that building a stacked ensemble in H2O uses the H2OStackedEnsembleEstimator, which can only accept base models that are part of the H2O model family. However, the more critical reason is that H2O’s H2ODeepLearningEstimator enables far easier tuning of DNNs than these other frameworks, and here’s why.

In TensorFlow, Keras, or PyTorch, regularization effects like dropout layers must be manually added into the model architecture, such as using keras.layers.Dropout(). This allows for greater customization, but also requires more detailed knowledge and effort. For example, you have to decide where and how many times to include the keras.layers.Dropout() layer within your model architecture.

On the other hand, H2O’s H2ODeepLearningEstimator is more abstracted and accessible to the layman. Regularization can be enabled in a straightforward manner through model hyperparameters, reducing the need for manual setup of these components as layers. Furthermore, the default model hyperparameters already include regularization. The common feature preprocessing steps, such as scaling of numerical features and encoding of categorical features, are also included as model hyperparameters for automatic feature preprocessing. These make the tuning of DNNs a far more straightforward process, without having to dive into the complexities of deep learning model architecture. In the context of a time crunch in the competition, this was extremely useful for me!

But which set of hyperparameters should we train H2ODeepLearningEstimator with? This is where optuna comes in. Optuna is a hyperparameter optimization framework, similar to the traditional grid search and random search approaches, but better in that it employs a more sophisticated approach.

Grid search systematically explores a predefined range of hyperparameter values, while random search selects random combinations within these specified limits. However, optuna uses Bayesian optimization to learn from previous searches to propose better-performing hyperparameter sets in each subsequent search, increasing the efficiency of its search for the optimal model hyperparameters. This is especially effective in complex and large hyperparameter spaces where traditional search methods can be prohibitively time-consuming and may eventually still fail to locate the optimal set of hyperparameters.

Now, let’s get into the code. We’ll use optuna to tune the hyperparameters of H2O’s H2ODeepLearningEstimator, and keep track of all the trained models inside the list dnn_models.

dnn_models = []

def objective(trial):
    #params to tune
    num_hidden_layers = trial.suggest_int('num_hidden_layers', 1, 10)
    hidden_layer_size = trial.suggest_int('hidden_layer_size', 100, 300, step=50)

    params = {
        'hidden': [hidden_layer_size]*num_hidden_layers,
        'epochs': trial.suggest_int('epochs', 5, 100),
        'input_dropout_ratio': trial.suggest_float('input_dropout_ratio', 0.1, 0.3),  #dropout for input layer
        'l1': trial.suggest_float('l1', 1e-5, 1e-1, log=True),  #l1 regularization
        'l2': trial.suggest_float('l2', 1e-5, 1e-1, log=True),  #l2 regularization
        'activation': trial.suggest_categorical('activation', ['rectifier', 'rectifier_with_dropout', 'tanh', 'tanh_with_dropout', 'maxout', 'maxout_with_dropout'])
    }

    #param 'hidden_dropout_ratios' is applicable only if the activation type is rectifier_with_dropout, tanh_with_dropout, or maxout_with_dropout
    if params['activation'] in ['rectifier_with_dropout', 'tanh_with_dropout', 'maxout_with_dropout']:
        hidden_dropout_ratio = trial.suggest_float('hidden_dropout_ratio', 0.1, 1.0)  
        params['hidden_dropout_ratios'] = [hidden_dropout_ratio]*num_hidden_layers  #dropout for hidden layers

    #train model
    model = H2ODeepLearningEstimator(**params,
                                     standardize=True,  #h2o models can do this feature preprocessing automatically
                                     categorical_encoding='auto',  #h2o models can do this feature preprocessing automatically
                                     nfolds=5,
                                     keep_cross_validation_predictions=True,  #need this for training the meta-model later
                                     seed=seed)
    model.train(x=x, y=y, training_frame=train)

    #store model
    dnn_models.append(model)

    #get cross-validation rmse 
    cv_metrics_df = model.cross_validation_metrics_summary().as_data_frame()
    cv_rmse_index = cv_metrics_df[cv_metrics_df[''] == 'rmse'].index
    cv_rmse = cv_metrics_df['mean'].iloc[cv_rmse_index]
    return cv_rmse

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=20)

Above, an optuna study is created to search for the best set of H2ODeepLearningEstimator hyperparameters that minimizes the cross-validation RMSE (as this is a regression task), with the optimization process running for 20 trials using the parameter n_trials=20. This means that 20 DNNs are trained and stored in the list dnn_models for usage as base models for the stacked ensemble later on, each with a different set of hyperparameters. In the interest of time under the competition’s time constraints, I chose to train 20 DNNs, but you can set n_trials to be however many DNNs you wish to train for your stacked ensemble.

Importantly, the H2ODeepLearningEstimator must be trained with keep_cross_validation_predictions=True, as these cross-validation predictions will be used as features for training the meta-model later.

2.3. Training XGBoost and LightGBM as Base Models

Next, let’s train the XGBoost and LightGBM models that will also form our set of base models for the stacked ensemble. We’ll again use optuna to tune the hyperparameters of H2O’s H2OXGBoostEstimator, and keep track of all the trained models inside the list xgboost_lightgbm_models.

Before diving into the code for this, we must first understand that H2OXGBoostEstimator is the integration of the XGBoost framework from the popular xgboost library into H2O. On the other hand, H2O does not integrate the lightgbm library. However, it does provide a method for emulating the LightGBM framework using a certain set of parameters within H2OXGBoostEstimator– and this is exactly what we will implement in order to train both XGBoost and LightGBM models using H2OXGBoostEstimator.

xgboost_lightgbm_models = []

def objective(trial):
    #common params between xgboost and lightgbm
    params = {
        'ntrees': trial.suggest_int('ntrees', 50, 5000),
        'max_depth': trial.suggest_int('max_depth', 1, 9),
        'min_rows': trial.suggest_int('min_rows', 1, 5),
        'sample_rate': trial.suggest_float('sample_rate', 0.8, 1.0),
        'col_sample_rate': trial.suggest_float('col_sample_rate', 0.2, 1.0),
        'col_sample_rate_per_tree': trial.suggest_float('col_sample_rate_per_tree', 0.5, 1.0)
    }

    grow_policy = trial.suggest_categorical('grow_policy', ['depthwise', 'lossguide'])

     #######################################################################################################################
     #from H2OXGBoostEstimator's documentation, (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/xgboost.html) # 
     #lightgbm is emulated when grow_policy=lossguide and tree_method=hist                                                 #
     #so we will tune lightgbm-specific hyperparameters when this set of hyperparameters is used                           #
     #and tune xgboost-specific hyperparameters otherwise                                                                  #
     #######################################################################################################################

    #add lightgbm-specific params
    if grow_policy == 'lossguide':  
        tree_method = 'hist'  
        params['max_bins'] = trial.suggest_int('max_bins', 20, 256)
        params['max_leaves'] = trial.suggest_int('max_leaves', 31, 1024)

    #add xgboost-specific params
    else:
        tree_method = 'auto'
        params['booster'] = trial.suggest_categorical('booster', ['gbtree', 'gblinear', 'dart'])
        params['reg_alpha'] = trial.suggest_float('reg_alpha', 0.001, 1)
        params['reg_lambda'] = trial.suggest_float('reg_lambda', 0.001, 1)
        params['min_split_improvement'] = trial.suggest_float('min_split_improvement', 1e-10, 1e-3, log=True)

    #add grow_policy and tree_method into params dict
    params['grow_policy'] = grow_policy
    params['tree_method'] = tree_method

    #train model
    model = H2OXGBoostEstimator(**params,
                                learn_rate=0.1,
                                categorical_encoding='auto',  #h2o models can do this feature preprocessing automatically
                                nfolds=5,
                                keep_cross_validation_predictions=True,  #need this for training the meta-model later
                                seed=seed) 
    model.train(x=x, y=y, training_frame=train)

    #store model
    xgboost_lightgbm_models.append(model)

    #get cross-validation rmse
    cv_metrics_df = model.cross_validation_metrics_summary().as_data_frame()
    cv_rmse_index = cv_metrics_df[cv_metrics_df[''] == 'rmse'].index
    cv_rmse = cv_metrics_df['mean'].iloc[cv_rmse_index]
    return cv_rmse

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=20)

Similarly, 20 XGBoost and LightGBM models are trained and stored in the list xgboost_lightgbm_models for usage as base models for the stacked ensemble later on, each with a different set of hyperparameters. You can set n_trials to be however many XGBoost/LightGBM models you wish to train for your stacked ensemble.

Importantly, the H2OXGBoostEstimator must also be trained with keep_cross_validation_predictions=True, as these cross-validation predictions will be used as features for training the meta-model later.

2.4. Training the Meta-Model

We will use all of the Deep Neural Network, XGBoost and LightGBM models trained above as base models. However, this does not mean that all of them will be used in the stacked ensemble, as we will perform automatic base model selection when tuning our meta-model (more on this later)!

Recall that we had stored each trained base model inside the lists dnn_models (20 models) and xgboost_lightgbm_models (20 models), giving a total of 40 base models for our stacked ensemble. Let’s combine them into a final list of base models, base_models.

base_models = dnn_models + xgboost_lightgbm_models

Now, we are ready to train the meta-model using these base models. But first, we have to decide on the meta-model algorithm, where a few concepts come into play:

  1. Most academic papers on stacked ensembles recommend choosing a simple linear-based algorithm for the meta-model. This is to avoid the meta-model overfitting to the predictions from the base models.
  2. H2O recommends the usage of a Generalized Linear Model (GLM) over a Linear Regression (for regression tasks) or Logistic Regression (for classification tasks). This is because the GLM is a flexible linear model that does not impose the key assumptions of normality and homoscedasticity that the latter do, allowing it to model the true behavior of the target values better, since such assumptions can be difficult to meet in practice. Further explanations can be found in this academic thesis, on which H2O’s work was based.

As such, we will instantiate the meta-model using H2OStackedEnsembleEstimator with metalearner_algorithm='glm', and use optuna to tune the hyperparameters of the GLM meta-model to optimize performance.

def objective(trial):
    #GLM params to tune
    meta_model_params = {
        'alpha': trial.suggest_float('alpha', 0, 1),  #regularization distribution between L1 and L2
        'family': trial.suggest_categorical('family', ['gaussian', 'tweedie']),  #read the documentation here on which family your target may fall into: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html
        'standardize': trial.suggest_categorical('standardize', [True, False]),
        'non_negative': True  #predictions of each base model cannot be subtracted from one another
    }

    ensemble = H2OStackedEnsembleEstimator(metalearner_algorithm='glm',
                                             metalearner_params=meta_model_params,
                                             metalearner_nfolds=5,
                                             base_models=base_models,  
                                             seed=seed)

    ensemble.train(x=x, y=y, training_frame=train)

    #get cross-validation rmse
    cv_metrics_df = ensemble.cross_validation_metrics_summary().as_data_frame()
    cv_rmse_index = cv_metrics_df[cv_metrics_df[''] == 'rmse'].index
    cv_rmse = cv_metrics_df['mean'].iloc[cv_rmse_index]
    return cv_rmse

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=20)

Notice that the cross-validation predictions of each base model were not explicitly passed into H2OStackedEnsembleEstimator. This is because H2O does this automatically under the hood, making things easier for us! All we had to do was set keep_cross_validation_predictions=True when training our base models previously, and instantiate H2OStackedEnsembleEstimator with the parameter base_models=base_models.

Now, we can finally build the best_ensemble model, using the optimal hyperparameters found by optuna.

best_meta_model_params = study.best_params
best_ensemble = H2OStackedEnsembleEstimator(metalearner_algorithm='glm',
                                            metalearner_params=best_meta_model_params,
                                            base_models=base_models,
                                            seed=seed)

best_ensemble.train(x=x, y=y, training_frame=train)

And voila, we have successfully trained a stacked ensemble in H2O! Let’s take a look at it.

best_ensemble.summary()
Image by author

Notice that the stacked ensemble uses only 16 out of the 40 base models we passed to it, of which 3 are XGBoost/LightGBM and 13 are Deep Neural Networks. This is due to the hyperparameter alpha that we tuned for the GLM meta-model, which represents the distribution of regularization between L1 (LASSO) and L2 (Ridge). A value of 1 entails only L1 regularization, while a value of 0 entails only L2 regularization.

As reflected above, its optimal value was found to be alpha=0.16, thus a mix of L1 and L2 was employed. Some of the base models’ predictions had their coefficients in the regression set to 0 under L1 regularization, meaning that these base models were not used in the stacked ensemble at all, therefore fewer than 40 base models ended up being used.

The key takeaway here is that our setup above also performs automatic selection of which base models to use for optimal performance, through the meta-model’s regularization hyperparameters, instead of simply using all 40 base models provided.
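If you want to see exactly which base models survived this selection, you can inspect the GLM meta-model’s coefficients directly. A small sketch, assuming a recent H2O version where metalearner() returns the fitted GLM:

#base models whose prediction column received a zero coefficient were dropped by L1 regularization
meta_model = best_ensemble.metalearner()
coefs = meta_model.coef()  #dict mapping each coefficient name (including 'Intercept') to its value
selected = {name: c for name, c in coefs.items() if name != 'Intercept' and c != 0}
print(f"{len(selected)} of {len(base_models)} base models received non-zero coefficients")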


3. Comparing Performance: Stacked Ensemble Versus Standalone Base Models

To demonstrate the power of stacked ensembles, let’s use it to generate predictions for the validation set, which was held out from the beginning. The RMSE figures below are specific only to the dataset I am using, but feel free to run this article’s codes on your own dataset too, and see the difference in model performance for yourself!

ensemble_val_rmse = best_ensemble.model_performance(val).rmse()
ensemble_val_rmse   #0.31475634111745304

The stacked ensemble produces an RMSE of 0.31 on the validation set.

Next, let’s dig into the performance of each of the base models on this same validation set.

base_val_rmse = []
for i in range(len(base_models)):
    base_val_rmse.append(base_models[i].model_performance(val).rmse())

models = ['H2ODeepLearningEstimator'] * len(dnn_models) + ['H2OXGBoostEstimator'] * len(xgboost_lightgbm_models)

base_val_rmse_df = pd.DataFrame([models, base_val_rmse]).T
base_val_rmse_df.columns = ['model', 'val_rmse']
base_val_rmse_df = base_val_rmse_df.sort_values(by='val_rmse', ascending=True).reset_index(drop=True)
base_val_rmse_df.head(15)  #show only the top 15 in terms of lowest val_rmse
Image by author

Compared to the stacked ensemble which achieved an RMSE of 0.31, the best-performing standalone base model achieved an RMSE of 0.35.

This means that Stacking was able to improve predictive performance by 11% on unseen data!


Now that you’ve witnessed the power of stacked ensembles, it’s your turn to try them out!

I had a lot of fun writing this, and if you had fun reading, I would really appreciate if you took a second to leave some claps and a follow!

See you in the next one! Sheila
