6 Common LLM Customization Strategies Briefly Explained
https://towardsdatascience.com/6-common-llm-customization-strategies-briefly-explained/
From Theory to Practice: Understanding RAG, Agents, Fine-Tuning, and More

Why Customize LLMs?

Large Language Models (LLMs) are deep learning models pre-trained with self-supervised learning, requiring vast amounts of training data, long training times and a large number of parameters. LLMs have revolutionized natural language processing, especially in the last two years, demonstrating remarkable capabilities in understanding and generating human-like text. However, these general-purpose models’ out-of-the-box performance may not always meet specific business needs or domain requirements. LLMs alone cannot answer questions that rely on proprietary company data or closed-book settings, making them relatively generic in their applications. Training an LLM from scratch is largely infeasible for small to medium teams due to the demand for massive amounts of training data and compute resources. Therefore, a wide range of LLM customization strategies have been developed in recent years to tune the models for various scenarios that require specialized knowledge.

The customization strategies can be broadly split into two types:

  • Using a frozen model: These techniques don’t necessitate updating model parameters and are typically accomplished through in-context learning or prompt engineering. They are cost-effective since they alter the model’s behavior without incurring extensive training costs, and are therefore widely explored in both industry and academia, with new research papers published on a daily basis.
  • Updating model parameters: This is a relatively resource-intensive approach that requires tuning a pre-trained LLM using custom datasets designed for the intended purpose. This includes popular techniques like Fine-Tuning and Reinforcement Learning from Human Feedback (RLHF).

These two broad customization paradigms branch out into various specialized techniques including LoRA fine-tuning, Chain of Thought, Retrieval Augmented Generation, ReAct, and Agent frameworks. Each technique offers distinct advantages and trade-offs regarding computational resources, implementation complexity, and performance improvements.

This article is also available as a video here

How to Choose LLMs?

The first step of customizing LLMs is to select an appropriate foundation model as the baseline. Community-based platforms such as Hugging Face offer a wide range of open-source pre-trained models contributed by top companies and communities, such as the Llama series from Meta and Gemma from Google. Hugging Face additionally provides leaderboards, for example the “Open LLM Leaderboard”, to compare LLMs on industry-standard metrics and tasks (e.g. MMLU). Cloud providers (e.g. AWS) and AI companies (e.g. OpenAI and Anthropic) also offer access to proprietary models, typically as paid services with restricted access. The following factors are essential to consider when choosing an LLM.

Open source or proprietary model: Open-source models allow full customization and self-hosting but require technical expertise, while proprietary models offer immediate access and often better-quality responses but at higher cost.

Task and metrics: Models excel at different tasks including question-answering, summarization, code generation etc. Compare benchmark metrics and test on domain-specific tasks to determine the appropriate models.

Architecture: In general, decoder-only models (e.g. the GPT series) perform better at text generation, while encoder-decoder models (e.g. T5) handle translation well. More architectures are emerging and showing promising results, for instance DeepSeek’s Mixture of Experts (MoE) models.

Number of Parameters and Size: Larger models (70B-175B parameters) offer better performance but need more computing power. Smaller models (7B-13B) run faster and cheaper but may have reduced capabilities.

After determining a base LLM, let’s explore the six most common strategies for LLM customization, ranked in order of resource consumption from least to most intensive:

  • Prompt Engineering
  • Decoding and Sampling Strategy
  • Retrieval Augmented Generation
  • Agent
  • Fine Tuning
  • Reinforcement Learning from Human Feedback

If you’d prefer a video walkthrough of these concepts, please check out my video on “6 Common LLM Customization Strategies Briefly Explained”.

LLM Customization Techniques

1. Prompt Engineering

A prompt is the input text sent to an LLM to elicit an AI-generated response; it can be composed of instructions, context, input data and an output indicator.

Instructions: This provides a task description or instruction for how the model should perform.

Context: This is external information to guide the model to respond within a certain scope.

Input data: This is the input for which you want a response.

Output indicator: This specifies the output type or format.

Prompt Engineering involves crafting these prompt components strategically to shape and control the model’s response. Basic prompt engineering techniques include zero-shot, one-shot, and few-shot prompting. Users can apply basic prompt engineering techniques directly while interacting with the LLM, making it an efficient approach to align the model’s behavior with a novel objective. API implementation is also an option, and more details are introduced in my previous article “A Simple Pipeline for Integrating LLM Prompt with Knowledge Graph”.
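
To illustrate, below is a minimal sketch of assembling a prompt from the four components above (instruction, context, input data and output indicator) and sending it through the OpenAI Python client; the model name and the sentiment example are illustrative assumptions, not part of the original article, and an API key is assumed to be configured in the environment.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# assemble the prompt from its four components
instruction = "Classify the sentiment of the review as Positive or Negative."
context = "Example: 'The picnic was ruined by rain.' -> Negative"
input_data = "Review: 'The sunny afternoon made the hike wonderful.'"
output_indicator = "Answer with a single word."

prompt = "\n".join([instruction, context, input_data, output_indicator])

# hypothetical model name used for illustration
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)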

Due to the efficiency and effectiveness of prompt engineering, more complex approaches are explored and developed to advance the logical structure of prompts.

Chain of Thought (CoT) asks LLMs to break down complex reasoning tasks into step-by-step thought processes, improving performance on multi-step problems. Each step explicitly exposes its reasoning outcome, which serves as the preceding context for subsequent steps until arriving at the answer.

Tree of Thoughts extends CoT by considering multiple different reasoning branches and self-evaluating choices to decide the next best action. It is more effective for tasks that involve initial decisions, strategies for the future and exploration of multiple solutions.

Automatic reasoning and tool use (ART) builds upon the CoT process: it deconstructs complex tasks and allows the model to select few-shot examples from a task library and use predefined external tools like search and code generation.

Synergizing reasoning and acting (ReAct) combines reasoning trajectories with an action space, where the model searches through the action space and determines the next best action based on environmental observations.

Techniques like CoT and ReAct are often combined with an agentic workflow to strengthen their capabilities. These techniques will be introduced in more detail in the “Agent” section.

Further Reading

2. Decoding and Sampling Strategy

The decoding strategy can be controlled at model inference time through inference parameters (e.g. temperature, top-p, top-k), determining the randomness and diversity of model responses. Greedy search, beam search and sampling are three common decoding strategies for auto-regressive model generation.

During the autoregressive generation process, the LLM outputs one token at a time based on a probability distribution over candidate tokens conditioned on the previous tokens. By default, greedy search is applied to produce the next token with the highest probability.

In contrast, beam search decoding considers multiple hypotheses of next-best tokens and selects the hypothesis with the highest combined probability across all tokens in the text sequence. The code snippet below uses the transformers library to specify the number of beam paths (e.g. num_beams=5 considers 5 distinct hypotheses) during the model generation process.

from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name / tokenizer_name and prompt are assumed to be defined, e.g.
# model_name = tokenizer_name = "gpt2"; prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
inputs = tokenizer(prompt, return_tensors="pt")

model = AutoModelForCausalLM.from_pretrained(model_name)
# beam search decoding with 5 beam paths
outputs = model.generate(**inputs, num_beams=5)

Sampling strategy is the third approach to control the randomness of model responses by adjusting these inference parameters:

  • Temperature: Lowering the temperature makes the probability distribution sharper by increasing the likelihood of generating high-probability words and decreasing the likelihood of generating low-probability words. When temperature = 0, it becomes equivalent to greedy search (least creative); higher temperatures produce more diverse and creative outputs (a small numeric illustration follows this list).
  • Top K sampling: This method filters the K most probable next tokens and redistributes the probability among those tokens. The model then samples from this filtered set of tokens.
  • Top P sampling: Instead of sampling from the K most probable tokens, top-p sampling selects from the smallest possible set of tokens whose cumulative probability exceeds the threshold p.
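
As a quick illustration of the temperature effect mentioned above, the sketch below applies temperature scaling to a toy set of logits with NumPy; the logit values are made up purely for demonstration.

import numpy as np

def softmax_with_temperature(logits, temperature):
    # scale logits by the temperature before applying softmax
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - np.max(scaled))  # subtract max for numerical stability
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5, 0.1]  # toy logits for four candidate tokens

# low temperature sharpens the distribution, high temperature flattens it
print(softmax_with_temperature(logits, temperature=0.5))
print(softmax_with_temperature(logits, temperature=1.0))
print(softmax_with_temperature(logits, temperature=2.0))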

The example code snippet below restricts sampling to the 50 most likely tokens (top_k=50), further filtered to the smallest set of tokens whose cumulative probability exceeds 0.95 (top_p=0.95).

sample_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
) 

Further Reading

3. RAG

Retrieval Augmented Generation (RAG), initially introduced in the paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, has been demonstrated as a promising solution that integrates external knowledge and reduces common LLM “hallucination” issues when handling domain-specific or specialized queries. RAG dynamically pulls relevant information from a knowledge base and generally does not involve extensive training to update LLM parameters, making it a cost-effective strategy to adapt a general-purpose LLM to a specialized domain.

A RAG system can be decomposed into a retrieval stage and a generation stage. The objective of the retrieval process is to find content within the knowledge base that is closely related to the user query, achieved by chunking external knowledge, creating embeddings, indexing and similarity search.

  1. Chunking: Documents are divided into smaller segments, with each segment containing a distinct unit of information.
  2. Create embeddings: An embedding model compresses each information chunk into a vector representation. The user query is also converted into its vector representation through the same vectorization process, so that the user query can be compared in the same dimensional space.
  3. Indexing: This process stores these text chunks and their vector embeddings as key-value pairs, enabling efficient and scalable search functionality. For large external knowledge bases that exceed memory capacity, vector databases offer efficient long-term storage.
  4. Similarity search: Similarity scores between the query embedding and the text chunk embeddings are calculated and used to find the information most relevant to the user query.

The generation stage of the RAG system then combines the retrieved information with the user query to form the augmented query, which is passed to the LLM to generate the context-rich response.
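
To make the retrieval stage more concrete before the framework-based example below, here is a minimal sketch of chunking, embedding, indexing and similarity search using only NumPy; the toy bag-of-words embedding and the sample texts are illustrative stand-ins for a real embedding model and knowledge base.

import numpy as np

# step 1 - chunking: each chunk is a distinct unit of information
chunks = [
    "RAG retrieves relevant context from an external knowledge base.",
    "Fine-tuning updates the model weights using custom datasets.",
    "Prompt engineering shapes model behavior without any training.",
]
query = "How does retrieval from a knowledge base work?"

# step 2 - create embeddings: a toy bag-of-words vectorizer stands in for a real embedding model
vocabulary = sorted({word for text in chunks + [query] for word in text.lower().split()})

def embed(text):
    words = text.lower().split()
    return np.array([words.count(term) for term in vocabulary], dtype=float)

# step 3 - indexing: store chunks together with their vector representations
index = [(chunk, embed(chunk)) for chunk in chunks]

# step 4 - similarity search between the query embedding and chunk embeddings
def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = embed(query)
best_chunk, _ = max(index, key=lambda item: cosine_similarity(query_vec, item[1]))

# generation stage: combine the retrieved chunk with the user query
augmented_query = f"Context: {best_chunk}\nQuestion: {query}"
print(augmented_query)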

Code Snippet

The code snippet below first specifies the LLM and embedding model, then combines the external knowledge base documents into a single Document, creates an index from the document, defines the query_engine based on the index, and queries the query_engine with the user prompt.

from llama_index.core import Settings, Document, VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# configure the LLM and embedding model used by the pipeline
Settings.llm = OpenAI(model="gpt-3.5-turbo")
# a local HuggingFace embedding model, resolved by LlamaIndex via the "local:" prefix
Settings.embed_model = "local:BAAI/bge-small-en-v1.5"

# `documents` is assumed to be loaded beforehand (e.g. with SimpleDirectoryReader)
document = Document(text="\n\n".join([doc.text for doc in documents]))
index = VectorStoreIndex.from_documents([document])
query_engine = index.as_query_engine()
response = query_engine.query(
    "Tell me about LLM customization strategies."
)

The example above shows a simple RAG system. Advanced RAG improves on this by introducing pre-retrieval and post-retrieval strategies to reduce pitfalls such as limited synergy between the retrieval and generation processes. For example, the rerank technique reorders the retrieved information using a model capable of understanding bidirectional context, and integration with a knowledge graph enables advanced query routing. More use cases can be found on the LlamaIndex website.

Further Reading

4. Agent

The LLM Agent was a trending topic in 2024 and will likely remain a main focus in the GenAI field in 2025. Compared to RAG, an Agent excels at creating query routes and planning LLM-based workflows, with the following benefits:

  • Maintaining memory and state of previous model generated responses.
  • Leveraging various tools based on specific criteria. This tool-using capability sets agents apart from basic RAG systems by giving the LLM independent control over tool selection.
  • Breaking down a complex task into smaller steps and planning for a sequence of actions.
  • Collaborating with other agents to form an orchestrated system.

Several in-context learning techniques (e.g. CoT, ReAct) can be implemented through the agentic framework, and we will discuss ReAct in more detail. ReAct, which stands for “Synergizing Reasoning and Acting in Language Models”, is composed of three key elements – actions, thoughts and observations. The framework was introduced by researchers from Google Research and Princeton University, and it builds upon Chain of Thought by integrating the reasoning steps with an action space that enables tool use and function calling. Additionally, the ReAct framework emphasizes determining the next best action based on environmental observations.

The example from the original paper demonstrates ReAct’s inner working process: the LLM generates the first thought and acts by calling the function “Search [Apple Remote]”, then observes the feedback from its first output. The second thought is based on the previous observation, leading to a different action “Search [Front Row]”. This process iterates until the goal is reached. The research shows that, by interacting with a simple Wikipedia API, ReAct overcomes the hallucination and error propagation issues more often observed in chain-of-thought reasoning. Furthermore, through the implementation of decision traces, the ReAct framework also increases the model’s interpretability, trustworthiness and diagnosability.

Example from “ReAct: Synergizing Reasoning and Acting in Language Models” (Yao et al., 2022)

Code Snippet

This demonstrates a ReAct-based agent implementation using LlamaIndex. First, it defines two functions (multiply and add). Second, these two functions are wrapped as FunctionTool objects, forming the agent’s action space, to be executed based on its reasoning.

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool

# create basic function tools
def multiply(a: float, b: float) -> float:
    return a * b
multiply_tool = FunctionTool.from_defaults(fn=multiply)

def add(a: float, b: float) -> float:
    return a + b
add_tool = FunctionTool.from_defaults(fn=add)

# `llm` is assumed to be an LLM instance defined earlier, e.g. OpenAI(model="gpt-3.5-turbo")
agent = ReActAgent.from_tools([multiply_tool, add_tool], llm=llm, verbose=True)

The advantages of an agentic workflow are more substantial when combined with self-reflection or self-correction. It is a rapidly growing domain with a variety of agent architectures being explored. For instance, the Reflexion framework facilitates iterative learning by providing a summary of verbal feedback from the environment and storing the feedback in the model’s memory, while the CRITIC framework empowers frozen LLMs to self-verify through interacting with external tools such as code interpreters and API calls.

Further Reading

5. Fine-Tuning

Fine-tuning is the process of feeding niche and specialized datasets to modify the LLM so that it is more aligned with a certain objective. It differs from prompt engineering and RAG as it updates the LLM weights and parameters. Full fine-tuning refers to updating all weights of the pretrained LLM through backpropagation, which requires large memory to store all weights and parameters and may suffer from a significant reduction in ability on other tasks (i.e. catastrophic forgetting). Therefore, PEFT (parameter-efficient fine-tuning) is more widely used to mitigate these caveats while saving the time and cost of model training. There are three categories of PEFT methods:

  • Selective: Select a subset of initial LLM parameters to fine-tune, which can be more computationally intensive compared to other PEFT methods.
  • Reparameterization: Adjust model weights through training the weights of low-rank representations. For example, Low-Rank Adaptation (LoRA) falls into this category; it accelerates fine-tuning by representing the weight updates with two smaller matrices (a minimal LoRA sketch follows this list).
  • Additive: Add additional trainable layers to the model, including techniques like adapters and soft prompts.
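
As a minimal sketch of the reparameterization approach, the snippet below wraps a pretrained causal LM with LoRA adapters using the Hugging Face peft library; the base model choice (gpt2), its target module name and the LoRA hyperparameters are illustrative assumptions.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# load a small base model (illustrative choice)
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# configure low-rank adapters: only the small update matrices are trained
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection module in GPT-2
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only a small fraction of weights are trainable

The resulting peft_model can then be trained with a standard training loop such as the transformers Trainer shown in the next code snippet.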

The fine-tuning process is similar to the deep learning training process, requiring the following inputs:

  • training and evaluation datasets
  • training arguments that define the hyperparameters, e.g. learning rate and optimizer
  • a pretrained LLM
  • compute metrics and the objective function that the algorithm should be optimized for

Code Snippet

Below is an example of implementing fine-tuning using the transformers Trainer.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    eval_strategy="epoch"
)

# model, train_dataset, eval_dataset and compute_metrics are assumed to be defined
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Fine-tuning has a wide range of use cases. For instance, instruction fine-tuning optimizes LLMs for conversations and following instructions by training them on prompt-completion pairs. Another example is domain adaptation, an unsupervised fine-tuning method that helps LLMs specialize in specific knowledge domains.

Further Reading

6. RLHF

Reinforcement learning from human feedback (RLHF) is a reinforcement learning technique that fine-tunes LLMs based on human preferences. RLHF operates by training a reward model based on human feedback and using this model as a reward function to optimize a reinforcement learning policy through PPO (Proximal Policy Optimization). The process requires two sets of training data: a preference dataset for training the reward model, and a prompt dataset used in the reinforcement learning loop.

Let’s break it down into steps:

  1. Gather a preference dataset annotated by human labelers who rate different completions generated by the model based on human preference. An example format of the preference dataset is {input_text, candidate1, candidate2, human_preference}, indicating which candidate response is preferred.
  2. Train a reward model using the preference dataset. The reward model is essentially a regression model that outputs a scalar indicating the quality of the model-generated response. The objective of the reward model is to maximize the score gap between the winning candidate and the losing candidate (a minimal sketch of this pairwise objective follows this list).
  3. Use the reward model in a reinforcement learning loop to fine-tune the LLM. The policy is updated so that the LLM generates responses that maximize the reward produced by the reward model. This process utilizes the prompt dataset, which is a collection of prompts in the format of {prompt, response, rewards}.
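
To illustrate the core idea behind step 2, the sketch below computes the pairwise reward-model loss in plain PyTorch: the reward model should score the preferred (winning) completion higher than the rejected (losing) one. The tiny linear reward model and the random embeddings are made up purely for demonstration.

import torch
import torch.nn as nn

# a toy reward model: maps a pooled text embedding to a scalar reward
class RewardModel(nn.Module):
    def __init__(self, hidden_size=16):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, pooled_embedding):
        return self.scorer(pooled_embedding).squeeze(-1)

reward_model = RewardModel()

# stand-in embeddings for the winning and losing candidate responses
chosen_embedding = torch.randn(4, 16)    # batch of 4 preferred completions
rejected_embedding = torch.randn(4, 16)  # batch of 4 rejected completions

chosen_reward = reward_model(chosen_embedding)
rejected_reward = reward_model(rejected_embedding)

# pairwise loss: maximize the margin between winning and losing scores
loss = -torch.nn.functional.logsigmoid(chosen_reward - rejected_reward).mean()
loss.backward()
print(loss.item())

In practice, a pretrained language model with a scalar head plays the role of this toy reward model, and the preference dataset supplies the chosen/rejected pairs.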

Code Snippet

The open-source library TRL (Transformer Reinforcement Learning) is widely used for implementing RLHF, and it provides a template that shows the basic RLHF setup:

  1. Initialize the base model and tokenizer from a pretrained checkpoint
  2. Configure the PPO hyperparameters (PPOConfig), such as learning rate, epochs, and batch sizes
  3. Create the PPO trainer (PPOTrainer) by combining the model, tokenizer, and training data
  4. The training loop uses the step() method to iteratively update the model to optimize the rewards calculated from the query and model response

# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl import create_reference_model
from transformers import AutoTokenizer

# initiate the pretrained model (with a value head) and tokenizer
# (model_name, learning_rate, batch sizes, dataset and collator are assumed to be defined)
ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = create_reference_model(ppo_model)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# define the hyperparameters of the PPO algorithm
config = PPOConfig(
    model_name=model_name,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

# initiate the PPO trainer with the model, its frozen reference copy and the training data
ppo_trainer = PPOTrainer(
    config=config,
    model=ppo_model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    dataset=dataset["train"],
    data_collator=collator
)

# ppo_trainer is iteratively updated through the rewards
ppo_trainer.step(query_tensors, response_tensors, rewards)

RLHF is widely applied for aligning model responses with human preferences. Common use cases involve reducing response toxicity and model hallucination. However, it does have the downside of requiring a large amount of human-annotated data as well as the computation costs associated with policy optimization. Therefore, alternatives like Reinforcement Learning from AI Feedback (RLAIF) and Direct Preference Optimization (DPO) have been introduced to mitigate these limitations.

Further Reading

Take-Home Message

This article briefly explains six essential LLM customization strategies: prompt engineering, decoding strategy, RAG, Agent, fine-tuning, and RLHF. I hope you find it helpful for understanding the pros and cons of each strategy as well as how to implement them based on the practical examples.

7 Tips to Future-Proof Machine Learning Projects
https://towardsdatascience.com/7-tips-to-future-proof-machine-learning-projects-582397875edc/
An Introduction to Developing More Collaborative, Reproducible and Reusable ML Code

7 Tips to Future-Proof ML Project (image by author)

There can be a knowledge gap when transitioning from exploratory Machine Learning projects, typical in research and study, to industry-level projects. This is because industry projects generally have three additional goals: being collaborative, reproducible, and reusable, which serve the purpose of enhancing business continuity, increasing efficiency and reducing cost. Although I am nowhere near finding a perfect solution, I would like to document some tips for transforming exploratory, notebook-based ML code into an industry-ready project that is designed with more scalability and sustainability.

I have categorized these tips into three key strategies:

  • Improvement 1: Modularization – Break Down Code into Smaller Pieces
  • Improvement 2: Versioning – Data, Code and Model Versioning
  • Improvement 3: Consistency – Consistent Structure and Naming Convention

Improvement 1: Modularization – Break Down Code into Smaller Pieces

Problem Statement

One struggle I have faced is having only one notebook for the entire data science project – which is common while learning data science. As you may have experienced, there are repeatable code components in a data science lifecycle; for instance, the same data preprocessing steps are applied to transform both training data and inference data. If not handled properly, this results in different versions of the same function being copied and reused at multiple locations. Not only does it decrease the consistency of the code, but it also makes troubleshooting the entire notebook more challenging.

Bad Example

train_data = train_data.drop(['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am'], axis=1)
numeric_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']
train_data[numeric_cols] = train_data[numeric_cols].fillna(train_data[numeric_cols].mean())
train_data['Month'] = pd.to_datetime(train_data['Date']).dt.month.apply(str)

inference_data = inference_data.drop(['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am'], axis=1)
numeric_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']
inference_data[numeric_cols] = inference_data[numeric_cols].fillna(inference_data[numeric_cols].mean())
inference_data['Month'] = pd.to_datetime(inference_data['Date']).dt.month.apply(str)

Tip 1: Reuse code where possible by creating and importing functions and modules

Good Example 1

def data_preparation(data):
    data = data.drop(['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am'], axis=1)
    numeric_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']
    data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())
    data['Month'] = pd.to_datetime(data['Date']).dt.month.apply(str)
    return data
train_preprocessed = data_preparation(train_data)
inference_preprocessed = data_preparation(inference_data)

In this example, we extract the common processing steps into a data_preparation function and apply it to train_data and inference_data. Breaking down a long script into self-contained components like this makes it easier to unit test and troubleshoot. Additionally, it reduces the risk of inconsistency that arises when we keep multiple copies of the same processing steps and accidentally modify or misspell one of them.

Good Example 2

Furthermore, we can store this function in a standalone Python module (i.e. preprocessing.py below) and import the function from this file.

## file preprocessing.py ##
def data_preparation(data):
    data = data.drop(['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am'], axis=1)
    numeric_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']
    data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())
    data['Month'] = pd.to_datetime(data['Date']).dt.month.apply(str)
    return data

## main script / notebook ##
from preprocessing import data_preparation
train_preprocessed = data_preparation(train_data)
inference_preprocessed = data_preparation(inference_data)

This makes it readily accessible and reusable for applications in other projects or by other team members.

Tip 2: Keep parameters in a separate config file

To further improve the script, we can store parameters, e.g. the dropped columns, in a separate configuration file (i.e. parameters.py below) and import them into the module.

Good Example 3

## parameters.py ##
DROP_COLS = ['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am']
NUM_COLS = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']

## preprocessing.py ##
from parameters import DROP_COLS, NUM_COLS
def data_preparation(data):
    data = data.drop(DROP_COLS, axis=1)
    data[NUM_COLS] = data[NUM_COLS].fillna(data[NUM_COLS].mean())
    data['Month'] = pd.to_datetime(data['Date']).dt.month.apply(str)
    return data

This is beneficial when the parameters remain constant within one iteration of the ML pipeline but may change as the pipeline evolves over time. While modularizing a simple script like the above might seem unnecessary, it becomes effective as the script grows more complicated.

This approach is also widely used for storing and parsing model hyperparameters, or for storing API tokens and password credentials in a secure location without exposing them in the script.
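
For credentials specifically, a common pattern is to read them from environment variables (or a local .env file excluded from version control) rather than hard-coding them. A minimal sketch, assuming an API_TOKEN variable has been set in the environment:

import os

# read the token from the environment instead of hard-coding it in the script
api_token = os.environ.get("API_TOKEN")
if api_token is None:
    raise ValueError("API_TOKEN is not set - export it in your environment or .env file")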

Improvement 2: Versioning – Data, Code and Model Versioning

Problem Statement

An unexpected trend was identified in the model output, which required revisiting the project. However, only a single output file was generated as the end result of the code. Since production data changes over time, regenerating the model output or tracing back to the source has become nearly impossible. Furthermore, the model cannot be reused for future predictions.

Tip 3: Data Versioning

Industry data and production data are hardly static and may update on a daily basis. Therefore, it is crucial to take a snapshot of the data at the point in time when it is used for model training or prediction. A common practice is to use a timestamp to version the data.

from datetime import datetime
timestamp = datetime.today().strftime('%Y%m%d')
train_data.to_csv(f'train_data_{timestamp}.csv')
## output 
>>> train_data_20240101.csv

There are more elegant services and solutions in the industry. DVC is a good example if you are looking for tools that make the process more streamlined.

It is also important to save data snapshots throughout the project lifecycle, for example: raw data, processed data, train data, validation data, test data and inference data. This reduces the need to rerun the code from scratch each time. Besides, if data drift is detected in the final model output, keeping a record of the intermediate steps helps to identify where the changes occurred.

Tip 4: Model Versioning

Depending on the training data, preprocessing pipeline and hyperparameters, models developed from the same algorithm can vary significantly from each other, so it is essential to keep track of different model configurations during the model experimentation phase. Since models themselves also have a certain level of randomness, the output can differ even when trained on the same dataset with the same process. This extends beyond machine learning or deep learning models: PCA and data transformations that require fitting on training data also have a dimension of randomness, which means that setting a random seed is important to mitigate variation in the output.
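
A minimal sketch of fixing the random seeds mentioned above (the seed value itself is arbitrary):

import random
import numpy as np

SEED = 42
random.seed(SEED)     # Python's built-in random module
np.random.seed(SEED)  # NumPy-based operations
# scikit-learn estimators accept the seed explicitly, e.g.
# PCA(n_components=2, random_state=SEED)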

While model experiment tracking is a long journey to learn, the first thing you can do is to save the model to the working directory. There are multiple ways to save a trained model in Python. For example, we can use the pickle library.

import pickle

model_filename = 'model.pkl'
# pickle.dump expects a file object, so open the file in binary write mode
with open(model_filename, 'wb') as f:
    pickle.dump(model, f)

You may want to choose a more descriptive name for your filename, and it is always helpful to provide a brief description that explains the model variant.

To load the model:

with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
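
Building on the timestamp idea from Tip 3, here is a hedged sketch of saving the model under a versioned, descriptive filename together with a small metadata record explaining the variant; the names and description used below are illustrative.

import json
import pickle
from datetime import datetime

timestamp = datetime.today().strftime('%Y%m%d')
model_filename = f'rain_classifier_rf_{timestamp}.pkl'  # descriptive, versioned name

with open(model_filename, 'wb') as f:
    pickle.dump(model, f)

# keep a short description of the model variant next to the artifact
metadata = {
    'model_file': model_filename,
    'description': 'random forest trained on train_data_20240101.csv',
    'created_at': timestamp,
}
with open(model_filename.replace('.pkl', '.json'), 'w') as f:
    json.dump(metadata, f, indent=2)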

Tip 5: Code Versioning

The third recommendation is to save the queries that have been used to generate any output data, e.g. the SQL script for extracting the raw data. When executing batch inference, save the script with the precise date instead of a relative date. This helps to keep a record of the time snapshot for future reference.

-- use a relative date
SELECT ID, Date, MinTemp, MaxTemp
FROM dim_temperature
WHERE Date <= DATEADD(year,-1,GETDATE())

-- use a precise date
SELECT ID, Date, MinTemp, MaxTemp
FROM dim_temperature
WHERE Date <= '2024-01-01' AND Date >= '2023-01-01'

Furthermore, Git is undoubtedly an essential code versioning tool when collaborating on a data science project with other teammates. In short, it helps to track changes and revert to a previous checkpoint when necessary.

Improvement 3: Consistency – Consistent Structure and Naming Convention

Problem Statement

All the data, files, and scripts are stored in one flat directory structure. The code is convoluted within one notebook. It becomes difficult to figure out any dependencies, and there is a risk of accidentally executing a line of code that overwrites previous output data. Due to the lack of reusability and consistency, every ML project is built from scratch without a standard workflow or structure.

Tip 6: Consistent Directory Structure

As the field of Data Science and Machine Learning has matured over time, consistent frameworks and project lifecycles have gradually developed, such as CRISP-DM and TDSP. Therefore, we can build the project directory structure to adhere to a standard framework. For instance, "cookiecutter data science" provides a logical, reasonably standardized, but flexible project structure for doing and sharing data science work. Based on its recommended directory structure, I have adjusted it to the reduced version below, which I also allow to evolve over time.

Feel free to develop a structure that best suits your workflow and can be used as a template to design all future projects. In addition to the benefit of consistency, it is a powerful way to organize thoughts and create a high level architecture during the development phase.

├── data
│   ├── output      <- The output data from the model. 
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. 
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── code              <- Source code for use in this project.
│   ├── __init__.py    <- Makes code a Python module
│   │
│   ├── data           <- Scripts to generate and process data
│   │   ├── data_preparation.py
│   │   └── data_preprocessing.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── inference_model.py
│   │   └── train_model.py
│   │
│   └── analysis  <- Scripts to create exploratory and results oriented visualizations
│       └── analysis.py
│

Tip 7: Consistent Naming Convention

Another way to introduce more consistency and reduce friction in team collaboration is keeping a naming convention for your models, data and code. There isn’t a single best practice; it’s about finding a method that suits your specific use cases. You may derive some insights from the HuggingFace or Kaggle model hubs, for example <model-name>-<parameters>-<model-version> or <model-name>-<data-version>-<use-case>.

And of course, documentation is always preferred to add extra details behind the names. However, it is often easier said than done; we may stick to it for the first few times until we completely forget about maintaining the same convention. One tip I’ve learned is to create a template file with the naming convention and save it in the working directory. It can then serve as a reminder and be easily duplicated when creating new files.


Thanks for reaching the end. If you would like to read more of my articles on Medium, I would really appreciate your support by signing up for a Medium membership.


Take Home Message

The article discussed how to future-proof machine learning projects with three key improvements: modularization, versioning, and maintaining consistency.

Improvement 1: Modularization

  • Tip 1: Reuse code when possible by importing functions, modules
  • Tip 2: Keep parameters in a separate config file

Improvement 2: Versioning

  • Tip 3: Data versioning
  • Tip 4: Model versioning
  • Tip 5: Code versioning

Improvement 3: Consistency

  • Tip 6: Consistent directory structure
  • Tip 7: Consistent naming convention

More Related Articles

Get Started in Data Science

Practical Guides to Machine Learning

How to Self-Learn Data Science


Originally published at https://www.visual-design.net on Feb 24th, 2024.

How to Develop a Streamlit Data Analytics Web App in 3 Steps
https://towardsdatascience.com/how-to-develop-a-data-analytics-web-app-in-3-steps-92cd5e901c52/
Step-by-Step Guide to Build Your First YouTube Analytics App

Photo by Tran Mau Tri Tam on Unsplash

Most of the time, data science / data analytics projects end up delivering a static report or dashboard, which tremendously undersells the effort and thought put into the process. Alternatively, a web app is a great way to demonstrate your data analytics work, and it can be further expanded into a self-served, interactive service. However, as data scientists or data analysts, we are not trained to develop software or websites. In this article, I would like to introduce tools like Streamlit and Plotly that allow us to easily develop and serve data analytics projects through a web app, with the following three steps:

Develop Data Analytic Web App in 3 Steps (image from author’s website)
  1. Extract Data and Build Database
  2. Define Data Analytics Process as Functions
  3. Construct Web App Interface

Afterwards, we will be able to create a simple web app like this:

Web App Demo (image by author)

Step 1. Extract Data and Build Database

Step 1 in Developing Data Analytics Web App (image by author)

We will use YouTube data as an example here, since it is relevant to our daily life. The YouTube Data API allows us to get public YouTube data, such as video statistics (e.g. number of likes, views) or content details (e.g. tags, title, comments). To set up the YouTube API, it is required to sign up for a Google Developer account and set up an API key. Here are some resources I found helpful to get started with the YouTube API.

These resources take us through how to create a YouTube API key and install the required library (e.g. the Google API client that provides googleapiclient.discovery). After these dependencies have been resolved, we set up the connection to the API using Python and your own API key, with the following commands:

from googleapiclient.discovery import build
youtube = build('youtube', 'v3', developerKey='<your_api_key>')  # replace with your own API key

After establishing the connection, it’s time to explore what data is available for your Data Science projects. To do this, take a look at the YouTube Data API documentation, which provides an overview of the different kinds of data that can be accessed.

YouTube Data API reference list (screenshot by author)

We will use "Videos" as an example for this project. The list() method allows us to request the "Video Resource" by passing the part parameter and several filters. The part parameter specifies which components of the Video Resource you would like to fetch; here I am getting snippet, statistics, and contentDetails. Have a look at this documentation, which details all fields you can get from the videos().list() method. We then specify the following filter parameters to limit the results returned from this request.

  • chart='mostPopular': get the most popular videos
  • regionCode='US': videos from the US
  • videoCategoryId=1: get the videos from a specific video category (e.g. 1 is for Film & Animation), which can be found in this catalog of video category IDs.
  • maxResults=20: return a maximum of 20 videos

video_request = youtube.videos().list(
                part='snippet,statistics,contentDetails',
                chart='mostPopular',
                regionCode='US',
                videoCategoryId=1,
                maxResults=20
              )
response = video_request.execute()

We then execute the request using video_request.execute(), and the response is returned in JSON format, which typically looks like the snapshot below.

response in JSON format (image by author)

All information is stored under "items" in the response. We then extract the ‘items’ key and create the dataframe video_df by normalizing the JSON.

from pandas import json_normalize

video_df = json_normalize(response['items'])

As a result, we manage to tidy up the output into a structure that is easier to manipulate.

video_df (image by author)

To go a step further with working with JSON in Python, I recommend reading the article "How to Best Work with JSON in Python".

Step 2. Define Data Analytics Process as Function

Step 2 in Developing Data Analytics Web App (image by author)

We can package multiple code statements into one function, so that it can be iteratively executed and easily embedded with other web app components at a later stage.

Define extractYouTubeData()

For instance, we can encapsulate the data extraction process above into a function: extractYouTubeData(youtube, categoryId), which allows us to pass a categoryId variable and output the top 20 popular videos under that category as video_df. In this way, we can get the user’s input on which category they would like to select, then feed the input into this function and get the corresponding top 20 videos.

def extractYouTubeData(youtube, categoryId):
    video_request = youtube.videos().list(
        part='snippet,statistics,contentDetails',
        chart='mostPopular',
        regionCode='US',
        videoCategoryId=categoryId,
        maxResults=20
    )
    response = video_request.execute()
    video_df = json_normalize(response['items'])
    return video_df

We can use video_df.info() to get all fields in this dataframe.

fields in video_df (image by author)

With this valuable dataset, we can carry out a large variety of analyses, such as exploratory data analysis, sentiment analysis, topic modeling etc.

I would like to start with designing the app for some exploratory data analysis on these most popular YouTube videos:

  • video duration vs. the number of likes
  • the most frequently occurred tags

In future articles, I will explore more techniques such as topic modeling and natural language processing to analyze the video titles and comments. Therefore, if you would like to read more of my articles on Medium, I would really appreciate your support by signing up for a Medium membership ☕.

Define plotVideoDurationStats()

I would like to know whether video duration has some correlation with the number of likes for these popular videos. To achieve this, we first need to transform contentDetails.duration from the ISO 8601 duration format into numeric values using isodate.parse_duration().total_seconds(). Then we can use a scatter plot to visualize the video duration against the like count. This is carried out using Plotly, which allows a more interactive experience for end users. The code snippet below returns the Plotly figure which will be embedded into our web app.

import isodate
import plotly.express as px

def plotVideoDurationStats(video_df):
    video_df['contentDetails.duration'] = video_df['contentDetails.duration'].astype(str)
    video_df['duration'] = video_df['contentDetails.duration'].apply(lambda x: isodate.parse_duration(x).total_seconds())
    fig = px.scatter(video_df, x="duration", y='statistics.likeCount', color_discrete_sequence=px.colors.qualitative.Safe)
    return fig

figure output from plotVideoDurationStats (image by author)

To explore more tutorials based on Plotly, check out these blogs below:

An Interactive Guide to Hypothesis Testing in Python

How to Use Plotly for More Insightful and Interactive Data Explorations

Define plotTopNTags()

This function creates the figure of the top N tags of a certain video category. First, we iterate through all snippet.tags and collect the tags into a list. We then create tags_freq_df, which describes the counts of the top N most frequent tags. Lastly, we use px.bar() to display the chart.

def plotTopNTags(video_df, topN):
    tags = []
    for i in video_df['snippet.tags']:
        if type(i) != float:
            tags.extend(i)
    tags_df = pd.DataFrame(tags)
    tags_freq_df = tags_df.value_counts().iloc[:topN].rename_axis('tag').reset_index(name='frequency')
    fig = px.bar(tags_freq_df, x='tag', y='frequency')
    return fig

figure output from plotTopNTags() (image by author)

Step 3. Construct Web App Interface

Step 3 in Developing Data Analytics Web App (image by author)

We will use Streamlit to develop the web app interface. It is the easiest tool I have found so far for web app development running on top of Python. It saves us the hassle of handling HTTP requests, defining routes, or writing HTML and CSS code.

Run pip install streamlit to install Streamlit on your machine, or use this documentation to install Streamlit in your preferred development environment.

Creating a web app component is very easy using Streamlit. For example displaying a title is as simple as below:

import streamlit as st
st.title('Trending YouTube Videos')

Here we need several components to build the web app.

1) input: a dropdown menu for users to select video category

dropdown menu (image by author)

This code snippet allows us to create a dropdown menu with the prompt "Select YouTube Video Category" and options to choose from: ‘Film & Animation’, ‘Music’, ‘Sports’, ‘Pets & Animals’.

videoCategory = st.selectbox(
    'Select YouTube Video Category',
    ('Film & Animation', 'Music', 'Sports', 'Pets & Animals')
)

2) input: a slider for users to choose the number of tags

slider (image by author)

This defines the slider and specifies the slider range from 0 to 20.

topN = st.slider('Select the number of tags to display',0, 20)

3) output: a figure that displays the video duration vs. number of likes

video duration vs. number of likes (image by author)

We first create videoCategoryDict to convert the category name into a categoryId, then pass the categoryId through the extractYouTubeData() function that we defined previously. Check out this page for a full list of video categories and their corresponding categoryId.

We then call the plotVideoDurationStats() function and display the Plotly chart using st.plotly_chart().

videoCategoryDict = {'Film & Animation': 1, 'Music': 10, 'Sports': 17, 'Pets & Animals': 15}
categoryId = videoCategoryDict[videoCategory]
video_df = extractYouTubeData(youtube, categoryId)
duration_fig = plotVideoDurationStats(video_df)
fig_title1 = 'Durations(seconds) vs Likes in Top ' + videoCategory + ' Videos'
st.subheader(fig_title1)
st.plotly_chart(duration_fig)

4) output: a figure that displays the top tags in that video category

top tags in the video category (image by author)

The last component requires us to feed the user’s input for the number of tags to the function plotTopNTags(), and create the plot by calling the function.

tag_fig = plotTopNTags(video_df, topN)
fig_title2 = 'Top '+ str(topN) + ' Tags in ' + videoCategory + ' Videos'
st.subheader(fig_title2)
st.plotly_chart(tag_fig)

These code statements can all be contained in a single Python file (e.g. YouTubeDataApp.py). Then we navigate to the command line and use streamlit run YouTubeDataApp.py to run the app in a web browser.


Take-Home Message

Building a web app may seem intimidating for data analysts and data scientists. This post covers the following three steps to get your hands on building your first web app and extend your data analytics projects to a self-served platform:

  • Extract Data and Build Database
  • Define Data Analytics Process as Functions
  • Construct Web App Interface

More Resources Like This

Get Started in Data Science

EDA and Feature Engineering Techniques

How to Use Plotly for More Insightful and Interactive Data Explorations


Originally published at https://www.visual-design.net on Feb 23rd, 2023.

A Visual Learner’s Guide to Explain, Implement and Interpret Principal Component Analysis
https://towardsdatascience.com/a-visual-learners-guide-to-explain-implement-and-interpret-principal-component-analysis-cc9b345b75be/


Linear Algebra for Machine Learning – Covariance Matrix, Eigenvector and Principal Component

In my previous article, we have talked about applying linear algebra for data representation in Machine Learning algorithms, but the application of linear algebra in ML is much broader than that.

How is Linear Algebra Applied for Machine Learning?

This article will introduce more linear algebra concepts with a focus on how they are applied for dimensionality reduction, especially Principal Component Analysis (PCA). In the second half of this post, we will also implement and interpret PCA using a few lines of code with the help of Python scikit-learn.


When to Use PCA?

High-dimensional data is a common issue in machine learning practice, as we typically feed a large number of features for model training. This results in the caveat of models having less interpretability and higher complexity – also known as the curse of dimensionality. PCA can be beneficial when the dataset is high-dimensional (i.e. contains many features) and it is widely applied for dimensionality reduction.

Additionally, PCA is also used for discovering hidden relationships among features and revealing underlying patterns that can be very insightful. PCA attempts to find linear components that capture as much variance in the data as possible, and the first principal component (PC1) is typically composed of the features that contribute the most to model predictions.

How Does PCA Work?

The objective of PCA is to find the principal components that represent the data variance in a lower dimension, and we are going to unfold the process into the following steps:

  1. represent the data variance using covariance matrix
  2. eigenvector and eigenvalue capture data variance in a lower dimensionality
  3. principal components are the eigenvectors of the covariance matrix

To understand how PCA works, we need to answer the questions of what the covariance matrix and eigenvectors/eigenvalues are. It is also helpful to fundamentally shift our perspective from viewing matrix multiplication as a math operation to viewing it as a visual transformation.

Matrix Transformation

We have previously introduced how the matrix dot product is computed from a math operation perspective. We can also interpret the dot product as a visual transformation, which assists in understanding more complex linear algebra concepts. As illustrated below, let us use a 2×2 matrix as an example. We split the matrix vertically into two vectors, where the left one represents the basis vector of the x-axis and the right one represents the basis vector of the y-axis. Therefore, a matrix represents a 2D space constructed by the x-axis and y-axis.

matrix transformation – identity matrix (image by author)

It is not hard to understand that an identity matrix has [1,0] as the basis vector on the x-axis and [0,1] as the basis vector on the y-axis, so that the dot product between any vectors and the identity matrix will return the vector itself.

Matrix transformation boils down to changing the scale and shifting the direction of the axis. For example, changing the basis vector of x-axis from [1,0] to [2,0] means that the mapping space has been scaled two times in the x coordinate direction.

matrix transformation – x-axis scaled matrix (image by author)

We can additionally combine both the x-axis and y-axis for more complicated scaling, rotating or shearing transformations. A typical example is the mirror matrix, where we swap the x and y axes. For a given vector [1,2], we will get [2,1] after the mirror transformation.

matrix transformation – mirror matrix (image by author)

If you would like to practice these transformations in Python and skip the manual calculations, we can use the following code to perform the dot products and visualize the results of the transformations using the plt.quiver() function.

import numpy as np
import matplotlib.pyplot as plt
# define matrices and vector
x_scaled_matrix = np.array([[2,0],[0,1]])
mirror_matrix = np.array([[0,1],[1,0]])
v = np.array([1,2])
# matrix transformation
mirrored_v = mirror_matrix.dot(v)
x_scaled_v = x_scaled_matrix.dot(v)
# plot transformed vectors
origin = np.array([[0, 0], [0, 0]])
plt.quiver(*origin, v[0], v[1], color=['black'],scale=10, label='original vector')
plt.quiver(*origin, mirrored_v[0], mirrored_v[1] , color=['#D3E7EE'], scale=10, label='mirrored vector' )
plt.quiver(*origin, x_scaled_v[0], x_scaled_v[1] , color=['#C6A477'], scale=10, label='x_scaled vector')
plt.legend(loc ="lower right")

matrix transformation result in python (image by author)

Covariance Matrix

In Short: covariance matrix represents the pairwise correlations among a group of variables in a matrix form.

The covariance matrix is another critical concept in the PCA process that represents the data variance in the dataset. To understand the details of the covariance matrix, we first need to know that covariance measures the magnitude of how one random variable varies with another random variable. For two random variables x and y, their covariance is formulated as below, and a higher covariance value indicates a stronger correlation between the two variables.

covariance formula (image by author)
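
For reference, a standard form of the sample covariance (with the usual n−1 denominator, assuming that is what the figure depicts) can be written as:

\mathrm{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})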

When given a set of variables (e.g. x1, x2, … xn) in a dataset, the covariance matrix is used to represent the covariance value between each variable pair in a matrix format.

covariance matrix (image by author)

Multiplying any vector with the covariance matrix will transform it towards the direction that captures the trend of variance in the original dataset.

Let us use a simple example to simulate the effect of this transformation. First, we randomly generate the variables x0 and x1, and then compute the covariance matrix.

# generate random variables x0 and x1
import random
x0 = [round(random.uniform(-1, 1),2) for i in range(0,100)]
x1 = [round(2 * i + random.uniform(-1, 1) ,2) for i in x0]

# compute covariance matrix
X = np.stack((x0, x1), axis=0)
covariance_matrix = np.cov(X)
print('covariance matrix\n', covariance_matrix)

We then transform some random vectors by taking the dot product between each of them and the covariance matrix.

# plot original data points
plt.scatter(x0, x1, color=['#D3E7EE'])

# vectors before transformation
v_original = [np.array([[1,0.2]]), np.array([[-1,1.5]]), np.array([[1.5,-1.3]]), np.array([[1,1.4]])]

# vectors after transformation
for v in v_original:
    v_transformed = v.dot(covariance_matrix)
    origin = np.array([[0, 0], [0, 0]])
    plt.quiver(*origin, v[:, 0], v[:, 1], color=['black'], scale=4)
    plt.quiver(*origin, v_transformed[:, 0], v_transformed[:, 1] , color=['#C6A477'], scale=10)

# plot formatting
plt.axis('scaled')   
plt.xlim([-2.5,2.5])
plt.ylim([-2.5,2.5])
vectors transformed by covariance matrix (image by author)

The original vectors prior to the transformation are in black, and the transformed vectors are in brown. As you can see, the original vectors that pointed in different directions have become more aligned with the general trend displayed in the original dataset (i.e. the blue dots). Because of this property, the covariance matrix is important to PCA in terms of describing the relationship between features.

Eigenvalue and Eigenvector

In Short: an eigenvector (v) of a matrix (A) remains in the same direction after the matrix transformation, hence Av = λv, where λ represents the corresponding eigenvalue. Representing data using eigenvectors and eigenvalues reduces the dimensionality while maintaining the data variance as much as possible.

To bring more intuition to this concept, we can use a simple demonstration. For example, we have the matrix [[0,1],[1,0]], and one of the eigenvectors of this matrix is [1,1] with a corresponding eigenvalue of 1.

eigenvector and eigenvalue (image by author)

From the matrix transformation section, we know that the matrix [[0,1],[1,0]] acts as a mirror matrix that swaps the x, y coordinates of a vector. Therefore, the direction of the vector [1,1] does not change after the mirror transformation, so it meets the criteria of being an eigenvector of the matrix. The eigenvalue 1 indicates that the vector remains at the same scale and in the same direction as prior to the transformation. Consequently, we are able to represent the effect of a matrix transformation A (i.e. 2 dimensional) using a scalar λ (i.e. 1 dimension), and the eigenvalue tells us how much variance is preserved by the eigenvector.
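If you would like to verify this numerically, the short sketch below uses np.linalg.eig to confirm that [1,1] stays in the same direction under the mirror matrix (the expected values are shown as comments).

import numpy as np
# eigen decomposition of the mirror matrix
mirror_matrix = np.array([[0, 1], [1, 0]])
eigenvalues, eigenvectors = np.linalg.eig(mirror_matrix)
print(eigenvalues)    # one eigenvalue is 1, the other is -1
print(eigenvectors)   # each column is a (normalized) eigenvector
# check Av = λv for the direction [1, 1]
v = np.array([1, 1])
print(mirror_matrix.dot(v))   # [1 1], i.e. 1 * v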

Let’s continue with the example above and use this code snippet to overlay the eigenvector with the greatest eigenvalue (in red color). As you can see, it is aligned with the direction with the greatest data variance.

from numpy.linalg import eig
eigenvalue,eigenvector = eig(covariance_matrix)
plt.quiver(*origin, eigenvector[:,1][0], eigenvector[:,1][1] , color=['red'], scale=4, label='eigenvector')
visualize eigenvector (image by author)

Principal Components

We have now discussed that the covariance matrix can represent the data variance when multiple variables are present, and that eigenvectors can capture the data variance in a lower dimensionality. By computing the eigenvectors/eigenvalues of the covariance matrix, we get the principal components. A matrix has more than one eigenvector, and they are typically arranged in descending order of their eigenvalues, denoted by PC1, PC2 … PCn. The first principal component (PC1) is the eigenvector with the highest eigenvalue, which is the red vector shown in the image, and it explains the maximum variance in the data. Therefore, when using principal components to reduce data dimensionality, we select the ones with higher eigenvalues as they preserve more information from the original dataset.
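To tie the theory back to code, the short sketch below (reusing the covariance_matrix computed earlier) ranks the eigenvectors by their eigenvalues in descending order, which is exactly how PC1, PC2 … PCn are defined, and estimates the proportion of variance each one explains.

# rank eigenvectors of the covariance matrix by eigenvalue, largest first
eigenvalue, eigenvector = np.linalg.eig(covariance_matrix)
order = np.argsort(eigenvalue)[::-1]
sorted_eigenvalues = eigenvalue[order]
sorted_eigenvectors = eigenvector[:, order]   # first column corresponds to PC1, second to PC2
# proportion of total variance captured by each principal component
explained_variance_ratio = sorted_eigenvalues / sorted_eigenvalues.sum()
print(explained_variance_ratio)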


Thanks for reading this far. If you would like to read more of my articles on Medium, I would really appreciate your support by signing up for a Medium membership ☕.


PCA Implementation in Machine Learning

We have walked through the theory behind PCA and now let’s step into the practical part. Luckily, scikit-learn has provided us an easy implementation of PCA. We will use the public dataset "college major" from fivethirtyeight GitHub repository [1].

1. Standardize data into the same scale

PCA is sensitive to data with different scales, as covariance matrix requires the data at the same scale to measure the correlation between features with a consistent standard. To achieve that, data standardization is applied before PCA, which means that each feature has a mean of zero and a standard deviation of one. We use the following code snippet to perform data standardization. If you wish to know more data transformation techniques such as normalization, min-max scaling, please visit my article on "3 Common Techniques for Data Transformation".

3 Common Techniques for Data Transformation

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)

2. Apply PCA on the scaled data

We then import PCA from sklearn.decomposition and specify the number of components to generate. The number of components is determined by how much of the data variance we want the principal components to explain. Here we will generate 3 components to balance the trade-off between explained variance and dimensionality.

from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca_data = pca.fit_transform(scaled_df)

3. Visualize explained variance using scree plot

Some information of the original dataset will be lost after shrinking it to a lower dimensionality, hence it is important to keep as much information as possible while limiting the number of principal components. To help us with the interpretation, we can visualize the explained variance using a scree plot. Explained variance of a principal component indicates the magnitude of data variance in the direction of the eigenvector and it correlates to the eigenvalue. Higher explained variance means that it preserves more information and the one with highest explained variance is the first principal component. We can use the explained_variance_ratio_ attribute to get the explained variance. The code snippet below visualizes the explained variance and also the cumulative variance (i.e. sum of variance if we add previous principal components together).

import matplotlib.pyplot as plt
principal_components = ['PC1', 'PC2', 'PC3']
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)
plt.figure(figsize=(10, 6))
plt.bar(principal_components, explained_variance, color='#D3E7EE')
plt.plot(principal_components, cumulative_variance, 'o-', linewidth=2, color='#C6A477')

# add cumulative variance as the annotation
for i,j in zip(principal_components, cumulative_variance):
   plt.annotate(str(round(j,2)), xy=(i, j))
scree plot (image by author)

The scree plot tells us about the explained variance when three principal components are generated. The first principal component (PC1) explains 60% of the variance, and 84% of the variance is explained by the first 3 components combined.

4. Interpret the principal components composition

Principal components additionally provide us with some evidence of the importance of the original features. By evaluating the magnitude and direction of the coefficients for each original feature, we know whether the feature is strongly correlated with the component. As shown below, we generate the coefficients of the features with respect to the components.

import pandas as pd
pca_component_df = pd.DataFrame(pca.components_, columns = df.columns)
pca_component_df
component coefficients (image by author)

Additionally, we can use heatmap from seaborn library to highlight the features with high absolute coefficient values.

import seaborn as sns
# create custom color palette
customPalette = sns.color_palette("blend:#D3E7EE,#C6A477", as_cmap=True)

# create heatmap
plt.figure(figsize=(24,3))
sns.heatmap(pca_component_df, cmap=customPalette, annot=True)
component coefficients heatmap (image by author)

If we interpret PC1 (i.e. row 0), we can see that multiple features have a relatively high association with PC1, such as "Total" (number of enrolled students), "Employed", "Full_time", "Unemployed" etc., indicating that these features contribute more to the data variance. Additionally, you may notice that some features are strongly correlated with each other, and PCA brings the extra benefit of removing multicollinearity among these features.

5. Use principal components in ML algorithm

Finally, we have reduced the dimensionality to a handful of principal components, which are ready to be utilized as the new features in machine learning algorithms. To do so, we are going to use the transformed dataset from the output of the PCA process – pca_df. We can examine the shape of this dataset using pca_df.shape and we get 173 rows and 3 columns. We then add the label (e.g. "Rank") back to this dataset with the 3 principal components from the PCA process, and this becomes the new dataframe to build the ML model.

pca_df = pd.DataFrame(pca_data)
# label_df is assumed to hold the "Rank" label column that was set aside before scaling
new_df = pd.concat([pca_df,label_df], axis = 1)
new_df.columns = ["PC1", "PC2", "PC3", "Rank"]
new_df (image by author)

The remaining process will follow the standard procedure of a machine learning lifecycle, that is – splitting the dataset into train and test sets, building the model and then evaluating it (a minimal sketch follows the link below). Here we won’t dive into the details of building ML models, but if you are interested, please have a look at my article on classification algorithms as the starting point.

Top 6 Machine Learning Algorithms for Classification
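For illustration only, here is a minimal sketch of what that remaining process could look like, assuming we treat the numeric "Rank" column in new_df as the prediction target and fit a simple regression model (this sketch is not part of the original workflow).

# a minimal sketch: train-test split and a simple baseline model on the principal components
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X = new_df[["PC1", "PC2", "PC3"]]
y = new_df["Rank"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print("R^2 on the test set:", model.score(X_test, y_test))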


Take-Home Message

In the previous article, we introduced using linear algebra for data representation in machine learning. Now we have introduced another common use case of linear algebra in ML for dimensionality reduction – Principal Component Analysis (PCA). We firstly discussed the theory behind PCA:

  1. represent the data variance using the covariance matrix
  2. use eigenvectors and eigenvalues to capture data variance in a lower dimensionality
  3. the principal components are the eigenvectors of the covariance matrix, ranked by their eigenvalues

Furthermore, we utilize scikit-learn to implement PCA through the following procedures:

  1. standardize data into the same scale
  2. apply PCA on the scaled data
  3. visualize explained variance using scree plot
  4. interpret the principal components composition
  5. use principal components in ML algorithms

More Articles Like This

How is Linear Algebra Applied for Machine Learning?

Practical Guides to Machine Learning

EDA and Feature Engineering Techniques

Reference

[1] College Major (FiveThirtyEight.) Retrieved from https://github.com/fivethirtyeight/data/tree/master/college-majors [CC-BY-4.0 license]

The post A Visual Learner’s Guide to Explain, Implement and Interpret Principal Component Analysis appeared first on Towards Data Science.

]]>
Linear Algebra for ML | Matrix, Vector and Data Representation https://towardsdatascience.com/how-is-linear-algebra-applied-for-machine-learning-d193bdeed268/ Fri, 30 Dec 2022 04:45:57 +0000 https://towardsdatascience.com/how-is-linear-algebra-applied-for-machine-learning-d193bdeed268/ How is Linear Algebra Applied in Machine Learning Truth be told, the role of linear algebra in machine learning has been perplexing me, as mostly we learn these concepts (e.g. vector, matrix) in a math background while discarding their applications in the machine learning context. In fact, linear algebra has several foundational use cases in […]

The post Linear Algebra for ML | Matrix, Vector and Data Representation appeared first on Towards Data Science.

]]>
How is Linear Algebra Applied in Machine Learning

Truth be told, the role of linear algebra in machine learning has been perplexing me, as mostly we learn these concepts (e.g. vector, matrix) in a math background while discarding their applications in the machine learning context. In fact, linear algebra has several foundational use cases in machine learning, including data representation, dimensionality reduction and vector embedding. Starting from introducing the basic concepts in linear algebra, this article will build an elementary view of how these concepts can be applied for data representation, such as solving a linear equation system, linear regression, and neural networks. However, if you would like to know more about linear algebra for Principal Component Analysis (PCA), you may find this article more helpful.

A Visual Learner’s Guide to Explain, Implement and Interpret Principal Component Analysis


For a video walkthrough of linear algebra for machine learning, I have included my YouTube video at the bottom of this article.


Definition of Scalar, Vector, Matrix, and Tensor

Firstly, let’s address the building blocks of linear algebra – scalar, vector, matrix, and tensor.

  • Scalar: a single number
  • Vector: a one-dimensional array of numbers
  • Matrix: a two-dimensional array of numbers
  • Tensor: a multi-dimensional array of numbers

To implement them, we can use the NumPy array np.array() in Python.

import numpy as np
scalar = 1
vector = np.array([1,2])
matrix = np.array([[1,1],[2,2]])
tensor = np.array([[[1,1],[2,2]], 
                   [[3,3],[4,4]]])

Let’s look at the shape of the vector, matrix, and tensor we generated above.
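For reference, we can inspect the shapes with the shape attribute (reusing the arrays defined above); the expected output is shown in the comments.

print(vector.shape)   # (2,)
print(matrix.shape)   # (2, 2)
print(tensor.shape)   # (2, 2, 2)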

Matrix and Vector Operations

1. Addition, Subtraction, Multiplication, Division

Similar to how we perform operations on numbers, the same logic also works for matrices and vectors. However, please note that these operations require the two matrices to be the same size. This is because they are performed in an element-wise manner, which is different from the matrix dot product.
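As a small illustration (the two matrices below are made-up values of the same size), element-wise operations simply apply the operator position by position.

import numpy as np
matrix_a = np.array([[1, 1], [2, 2]])
matrix_b = np.array([[3, 3], [4, 4]])
print(matrix_a + matrix_b)   # [[4 4] [6 6]]
print(matrix_a - matrix_b)   # [[-2 -2] [-2 -2]]
print(matrix_a * matrix_b)   # [[3 3] [8 8]]  element-wise multiplication, not the dot product
print(matrix_a / matrix_b)   # [[0.33... 0.33...] [0.5 0.5]]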

2. Dot Product

The dot product is often confused with element-wise matrix multiplication (which is demonstrated above); in fact, it is a more commonly used operation on matrices and vectors.

The dot product operates by iteratively multiplying each row of the first matrix by each column of the second matrix, one element at a time; therefore the dot product between a j x k matrix and a k x i matrix is a j x i matrix. Here is an example of how the dot product works between a 3×2 matrix and a 2×3 matrix.

The dot product operation necessitates the number of columns in the first matrix matching the number of rows in the second matrix. We use dot() to execute the dot product. The order of the matrices in the operation is crucial – as indicated below, matrix2.dot(matrix1) will produce a different result from matrix1.dot(matrix2). Therefore, as opposed to element-wise multiplication, the matrix dot product is not commutative.
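A quick sketch with made-up values shows both the resulting shapes and the non-commutativity.

import numpy as np
matrix1 = np.array([[1, 2], [3, 4], [5, 6]])   # 3 x 2
matrix2 = np.array([[1, 2, 3], [4, 5, 6]])     # 2 x 3
print(matrix1.dot(matrix2))   # a 3 x 3 matrix
print(matrix2.dot(matrix1))   # a 2 x 2 matrix - a different result from above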

3. Reshape

A vector is often seen as a matrix with one column and it can be reshaped into matrix format by specifying the number of columns and rows using reshape(). We can also reshape the matrix into a different layout. For example, we can use the code below to transform the 2×2 matrix to 4 rows and 1 column.

When the size of the matrix is unknown, reshape(-1) is commonly used to reduce the matrix dimension and "flatten" the array into one row. Reshaping matrices is widely applied in neural networks in order to fit the data into the network architecture.
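Since the original code screenshot is not reproduced here, below is a minimal sketch of the two reshape operations described above, using the same 2×2 matrix defined earlier.

import numpy as np
matrix = np.array([[1, 1], [2, 2]])
# reshape the 2 x 2 matrix into 4 rows and 1 column
print(matrix.reshape(4, 1))
# "flatten" the matrix into one row when the exact size is unknown
print(matrix.reshape(-1))   # [1 1 2 2]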

4. Transpose

Transpose swaps the rows and columns of the matrix, so that a j x k matrix becomes k x j. To transpose a matrix, we use matrix.T.
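For example (values made up for illustration):

import numpy as np
matrix = np.array([[1, 2, 3], [4, 5, 6]])   # 2 x 3
print(matrix.T)                             # 3 x 2: rows and columns are swapped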

5. Identity and Inverse Matrix

The inverse is an important transformation of matrices, but to understand the inverse matrix we first need to address what an identity matrix is. An identity matrix requires the number of columns and rows to be the same, with all the diagonal elements equal to 1 and all other elements equal to 0. Additionally, a matrix or vector remains the same after being multiplied by its corresponding identity matrix.

To create a 3 x 3 identity matrix in Python, we use numpy.identity(3).

The dot product of a matrix (stated as M below) and the inverse of that matrix is the identity matrix, which follows the equation M · M⁻¹ = I.

There are two things to take into consideration with the matrix inverse: 1) the order of the matrix and its inverse in the dot product does not matter, even though most matrix dot products change when the order changes; 2) not all matrices have an inverse.

To compute the inverse of a matrix, we can use np.linalg.inv().
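Putting the two together, a short sketch (the 2×2 matrix M below is an arbitrary invertible example) creates an identity matrix, computes an inverse and verifies that their dot product returns the identity matrix.

import numpy as np
# 3 x 3 identity matrix
print(np.identity(3))
# inverse of a matrix M, verifying that M dot M^-1 gives the identity matrix
M = np.array([[3, 2], [1, -1]])
M_inverse = np.linalg.inv(M)
print(M.dot(M_inverse))   # approximately the 2 x 2 identity matrix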

At this stage, we have only covered some basic concepts in linear algebra that support the application for data representation; if you would like to go deeper into more concepts, I found the book "Mathematics for Machine Learning" from Deisenroth, Faisal and Ong particularly helpful.


Thanks for reading this far. If you would like to read more of my articles on Medium, I would really appreciate your support by signing up for a Medium membership.


Applications of Linear Algebra in ML

We will start with the most straightforward application of vectors and matrices in solving systems of linear equations, and gradually generalize it to linear regression, then neural networks.

1. Linear Algebra Application in Linear Equation System

Suppose that we have the linear equation system below. A typical way to compute the values of a and b is to eliminate one variable at a time, which can take 3 to 4 steps for two variables.

3a + 2b = 7

a – b = -1

An alternative solution is to represent it using the dot product between a matrix and a vector. We can package all the coefficients into a matrix and all the variables into a vector, hence we get the following:

Matrix representation gives us a different mindset to solve the equation in one step. As demonstrated below, we represent the coefficient matrix as M, the variable vector as x and the output vector as y, then multiply both sides of the equation Mx = y by the inverse of the matrix M. Since the dot product between the inverse of a matrix and the matrix itself is the identity matrix, the solution of the linear equation system simplifies to x = M⁻¹y, the dot product between the inverse of the coefficient matrix M and the output vector y.

We use the following code snippet to compute the value of variable a and b in one step.
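The original snippet is not reproduced in this text, so here is a minimal sketch of how that one-step computation might look for the system above.

import numpy as np
# 3a + 2b = 7
#  a -  b = -1
M = np.array([[3, 2], [1, -1]])   # coefficient matrix
y = np.array([7, -1])             # output vector
solution = np.linalg.inv(M).dot(y)
print(solution)                   # [1. 2.], i.e. a = 1 and b = 2

In practice, np.linalg.solve(M, y) produces the same result and is numerically more stable than explicitly inverting the matrix.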

Representing linear equation systems using matrices increases the computational speed significantly. Imagine using the traditional method: it requires several for-loops to eliminate one variable at a time. This may seem like a small enhancement for such a simple system, but if we expand it to machine learning or even deep learning, which consist of a massive number of systems like this, it makes a drastic difference in efficiency.

2. Linear Algebra Application in Linear Regression

The same principle shown in solving the linear equation system can be generalized to linear regression models in machine learning. If you would like to refresh your memory of linear regression, please check out my article on "A Practical Guide to Linear Regression".

A Practical Guide to Linear Regression

Suppose that we have a dataset with n features and m instances, we typically represent linear regression as the weighted sum of these features.

What if we represent the formula of an instance using the matrix form? We can store the feature values in a 1 x (n+1) matrix and the weights are stored in an (n+1) x 1 vector. Then we multiply the element with the same color and add them together to get the weighted sum.

When the number of instances increases, we naturally think of using a for-loop to iterate one item at a time, which can be time consuming. By representing the algorithm in matrix format, the linear regression optimization process boils down to solving for the coefficient vector [w0, w1, w2 … wn] through linear algebra operations.
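As a hedged illustration (the feature values and weights below are made up), the whole set of weighted sums can be computed in a single dot product instead of a for-loop.

import numpy as np
# each row is one instance; the leading 1 pairs with the intercept weight w0
X = np.array([[1, 2.0, 3.0],
              [1, 1.5, 0.5],
              [1, 4.0, 1.0]])        # shape (m, n+1)
w = np.array([0.5, 1.0, -2.0])       # shape (n+1,): [w0, w1, w2]
predictions = X.dot(w)               # one weighted sum per instance, no explicit loop
print(predictions)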

Additionally, popular Python libraries such as Numpy and Pandas build upon matrix representation and utilize "vectorization" to speed up the data processing speed. I found the article "Say Goodbye to Loops in Python, and Welcome Vectorization!" quite helpful in terms of the comparison between the computation time of for-loop and vectorization.

3. Linear Algebra Application in Neural Network

Neural network is composed of multiple layers of interconnected nodes, where the outputs of nodes from the previous layers are weighted and then aggregated to form the input of the subsequent layers. If we zoom into the interconnected layer of a neural network, we can see some components of the regression model.

Take a simple example where we visualize the inner process between hidden layer i (with nodes i1, i2, i3) and hidden layer j (with nodes j1, j2) of a neural network. w11 represents the weight of input node i1 that feeds into node j1, and w21 represents the weight of input node i2 that feeds into node j1. In this case, we can package the weights into a 3×2 matrix.

This can be generalized to thousands or even millions of instances, which form the massive training datasets of neural network models. Now this process resembles how we represent the linear regression model, except that we use a matrix to store the weights instead of a vector, but the principle remains the same.
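A tiny sketch with made-up numbers shows how the outputs of layer i are passed to layer j in one matrix operation, using the 3×2 weight matrix described above.

import numpy as np
inputs = np.array([0.2, 0.7, 0.1])        # outputs of nodes i1, i2, i3
weights = np.array([[0.5, -0.1],
                    [0.3,  0.8],
                    [-0.2, 0.4]])         # column 1 feeds node j1, column 2 feeds node j2
layer_j_input = inputs.dot(weights)       # the weighted sums received by j1 and j2
print(layer_j_input)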

To take a step further, we can expand this to deep neural networks for deep learning. This is where tensors come into play to represent data with more than two dimensions. For example, in a Convolutional Neural Network, we use a 3D tensor for image pixels, as they are often depicted through three different channels (i.e., red, green and blue color channels).

As you can see, linear algebra acts as a building block in machine learning and deep learning algorithms, and this is just one of the multiple use cases of linear algebra in data science. I hope that in future articles I can introduce more applications, such as linear algebra for dimensionality reduction. To read more of my articles on Medium, I would really appreciate your support by signing up for a Medium membership.


Take Home Message

The importance of linear algebra in machine learning may seem implicit; however, it plays a fundamental role in data representation and more. In this article we started by introducing basic concepts such as:

  • scalar, vector, matrix, tensor
  • addition, subtraction, multiplication, division, dot product
  • reshape, transpose, inverse

Additionally, we discussed how these concepts have been applied in data science and machine learning, including solving linear equation systems, linear regression and neural networks.

More Articles Like This

Practical Guides to Machine Learning

TensorFlow Template for Deep Learning Beginners

Get Started in Data Science


Originally published at https://www.visual-design.net on December 28th, 2022.

The post Linear Algebra for ML | Matrix, Vector and Data Representation appeared first on Towards Data Science.

]]>
How to Use Plotly for More Insightful and Interactive Data Explorations https://towardsdatascience.com/dynamic-eda-for-qatar-world-cup-teams-8945970f16be/ Tue, 13 Dec 2022 18:13:34 +0000 https://towardsdatascience.com/dynamic-eda-for-qatar-world-cup-teams-8945970f16be/ This article will introduce the tool, Plotly [1], that brings data visualization and exploratory data analysis (EDA) to the next level. You can use this open source graphing library to make your notebook more aesthetic and interactive, regardless if you are a Python or R user. To install Plotly, use the command !pip install - […]

The post How to Use Plotly for More Insightful and Interactive Data Explorations appeared first on Towards Data Science.

]]>

This article will introduce the tool Plotly [1], which brings data visualization and exploratory data analysis (EDA) to the next level. You can use this open source graphing library to make your notebook more aesthetic and interactive, regardless of whether you are a Python or R user. To install Plotly, use the command !pip install --upgrade plotly.

We will use the "Historical World Cup Win Loose Ratio Data [2]" to analyze the national teams that participated in the Qatar World Cup 2022. The dataset contains the win, lose and draw ratios between each "country1-country2" pair, as shown below. For example, the first row tells us that among the 7 games played between Argentina and Australia, the ratios of wins, losses and draws by Argentina were 0.714286, 0.142857 and 0.142857 respectively.

import pandas as pd
df = pd.read_csv('/kaggle/input/qatar2022worldcupschudule/historical_win-loose-draw_ratios_qatar2022_teams.csv')

In this exercise, we will utilize box plot, bar chart, choropleth map and heatmap for data visualization and exploration. Furthermore, we will also introduce advanced Pandas functions that are tied closely with these visualization techniques, including:

  • aggregation: df.groupby()
  • sorting: df.sort_values()
  • merging: df.merge()
  • pivoting: df.pivot()

Box Plot – Wins Ratio by Country

The first exercise is to visualize the wins ratio of each country when playing against other countries. To achieve this, we can use a box plot to depict the distribution of the wins ratio for each country, further colored by the continent of the country. Hover over the data points to see detailed information and zoom into the box plots to see the max, q3, median, q1 and min values.

Let’s break down how we built the box plot step-by-step.

1. Get Continent Data

From the original dataset, we can use the field "wins" grouped by "country1" to investigate how the value varies within a country compared to across countries. To further explore whether the wins ratio is impacted by continent, we need to introduce the "continent" field from the Plotly built-in dataset px.data.gapminder().

geo_df = px.data.gapminder()

(Here I am using "continent" as an example, feel free to play around with "lifeExp" and "gdpPercap" as well)

Since only continent information is needed, we drop other columns to select distinct rows using drop_duplicates().

continent_df = geo_df[['country', 'continent']].drop_duplicates()

We then merge continent_df with the original dataset df to get the continent information. If you have used SQL before, you will be familiar with table joining/merging. df.merge() works the same way by combining the common fields in df (i.e. "country1") and continent_df (i.e. "country").

continent_df = geo_df[['country', 'continent']].drop_duplicates()
merged_df = df.merge(continent_df, left_on='country1', right_on='country')

2. Create Box Plot

We apply px.box function and specify the following parameters that describe the data fed into the box plot.

fig = px.box(merged_df, 
             x='country1', 
             y='wins', 
             color='continent',
...

3. Format the Plot

Following parameters are optional but help to format the plot and display more useful information in the visuals.

fig = px.box(merged_df, 
             x='country1', 
             y='wins', 
             color='continent',
# formatting box plot
             points='all',
             hover_data=['country2'],
             color_discrete_sequence=px.colors.diverging.Earth,
             width=1300,
             height=600
            )
fig.update_traces(width=0.3)
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)')
  • points='all' means that all data points are shown beside the box plots. Hover over each data point to see the details.
  • hover_data=['country2'] adds "country2" to the hover box content.
  • color_discrete_sequence=px.colors.diverging.Earth specifies the color theme. Please note that color_discrete_sequence is applied when the field used for coloring contains discrete, categorical values. Alternatively, color_continuous_scale is applied when the field contains continuous, numeric values.
  • width=1300 and height=600 specify the width and height of the figure.
  • fig.update_traces(width=0.3) updates the width of each box plot.
  • fig.update_layout(plot_bgcolor='rgba(0,0,0,0)') updates the figure background color to transparent.

Bar Chart – Average Wins Ratio by Country

The second exercise is to visualize the average wins ratio per country and sort the countries in descending order, so that we can see the top performing countries.

Firstly, we use the code below for data manipulation.

average_score = df.groupby(['country1'])['wins'].mean().sort_values(ascending=False)
  • df.groupby(['country1']): grouped the df by field "country1".
  • ['wins'].mean(): take the mean of "wins" values.
  • sort_values(ascending=False): sort the values by descending order.

We then use pd.DataFrame() to convert average_score (which is a pandas Series) to a table-like format.

average_score_df = pd.DataFrame({'country1':average_score.index, 'average wins':average_score.values})

Feed average_score_df to the px.bar function; it follows the same syntax as px.box.

# calculate average wins per team and descending sort
fig = px.bar(average_score_df,
             x='country1',
             y='average wins',
             color='average wins',
             text_auto=True,
             labels={'country1':'country', 'value':'average wins'},
             color_continuous_scale=px.colors.sequential.Teal,
             width=1000,
             height=600
            )
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)')

To take a step further, we can also group the bars by continent to illustrate the top performing countries per continent, using the code below.

# merge average_score with geo_df to bring "continent" and "iso_alpha"
geo_df = px.data.gapminder()
geo_df = geo_df[['country', 'continent', 'iso_alpha']].drop_duplicates()
merged_df = average_score_df.merge(geo_df, left_on='country1', right_on='country')
# create bar chart using merged_df, which now includes "continent" and "iso_alpha"
fig = px.bar(merged_df,
             x='country1',
             y='average wins',
             color='average wins',
             text_auto=True,
             labels={'country1':'country', 'value':'average wins'},
             color_continuous_scale=px.colors.sequential.Teal,
             width=1000,
             height=600
            )
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)')

Choropleth Map – Average Wins Ratio by Geo Location

The next visualization we are going to explore displays the average wins ratio of each country on a map. The diagram above gives us a clearer view of which regions around the world had relatively better performance, such as South America and Europe.

The ISO code is used to identify the location of each country. In the previous code snippet for the average wins ratio colored by continent, we merged geo_df with the original dataset to create merged_df with the fields "continent" and "iso_alpha". We will keep using merged_df for this exercise (shown in the screenshot below).

We then use px.choropleth function and define the parameter locations to be "iso_alpha".

fig = px.choropleth(merged_df, 
                    locations='iso_alpha',
                    color='average wins',
                    hover_name='country',
                    color_continuous_scale=px.colors.sequential.Teal,                
                    width=1000,
                    height=500,
                   )
fig.update_layout(margin={'r':0,'t':0,'l':0,'b':0})

Heatmap – Wins Ratio Between Country Pairs

Lastly, we will introduce the heatmap to visualize the wins ratio between each country pair, where the denser (darker) areas show that the countries on the y-axis had a higher ratio of winning. Additionally, hover over the cells to see the wins ratio in a dynamic way.

We need to use the df.pivot() function to reconstruct the dataframe structure. The code below specifies "country1" as the rows of the pivot table, "country2" as the columns, and keeps "wins" as the pivoted value. As a result, the table on the left is transformed into the one on the right.

pivoted_df = df.pivot(index = 'country1', columns ='country2', values = 'wins')

We then use pivoted_df and px.imshow to create the heatmap through the code below.

# heatmap
fig = px.imshow(pivoted_df, 
                text_auto=True,
                labels={'color':'wins'},
                color_continuous_scale=px.colors.sequential.Brwnyl,
                width=1000,
                height=1000
               )
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)')

Thanks for reading to the end. If you would like to read more of my articles on Medium, I would really appreciate your support by signing up for a Medium membership.


Take-Home Message

Plotly provides us with the capability to create dynamic visualizations and generate more insights than a static figure. We have used the trending World Cup data to explore the following EDA techniques:

  • Box Plot
  • Bar Chart
  • Choropleth Map
  • Heatmap

We have also explained some advanced Pandas functions for data manipulation, which have been applied in the EDA process, including:

  • aggregation: df.groupby()
  • sorting: df.sort_values()
  • merging: df.merge()
  • pivoting: df.pivot()

More Articles Like This

Semi-Automated Exploratory Data Analysis (EDA) in Python

EDA and Feature Engineering Techniques

How to Choose the Most Appropriate Chart?

Reference

[1] Plotly. (2022). Plotly Open Source Graphing Library for Python. Retrieved from https://plotly.com/python/

[2] Kaggle. (2022). Qatar 2022 Football World Cup [CC0: Public Domain]. Retrieved from https://www.kaggle.com/datasets/amineteffal/qatar2022worldcupschudule?select=historical_win-loose-draw_ratios_qatar2022_teams.csv


Originally published at https://www.visual-design.net on December 10th, 2022.

The post How to Use Plotly for More Insightful and Interactive Data Explorations appeared first on Towards Data Science.

]]>
Time Series Analysis Introduction - A Comparison of ARMA, ARIMA, SARIMA Models https://towardsdatascience.com/time-series-analysis-introduction-a-comparison-of-arma-arima-sarima-models-eea5cbf43c73/ Fri, 18 Nov 2022 06:11:14 +0000 https://towardsdatascience.com/time-series-analysis-introduction-a-comparison-of-arma-arima-sarima-models-eea5cbf43c73/ On the Differences Between These Models, and How You Should Use Them

The post Time Series Analysis Introduction - A Comparison of ARMA, ARIMA, SARIMA Models appeared first on Towards Data Science.

]]>
Time Series Analysis Introduction – A Comparison of ARMA, ARIMA, SARIMA Models

On the differences between these models, and how you should use them

What is Time Series?

Time series is a unique type of problem in machine learning where the time component plays a critical role in the model predictions. As observations are dependent on adjacent observations, this violates the assumption, made by most conventional machine learning models, that observations are independent of each other. Common use cases of time series analysis are forecasting future numeric values, e.g. stock prices, revenue and temperature, which fall under the category of regression models. However, time series models can also be applied to classification problems; for instance, pattern recognition in brain wave monitoring or failure identification in a production process are common applications of time series classifiers.

In this article, we will mainly focus on three time series models – ARMA, ARIMA, and SARIMA – for regression problems where we forecast numeric values. Time series regression differs from other regression models because of its assumption that data is correlated over time and that the outcomes from previous periods can be used to predict the outcomes in subsequent periods.


How to Describe Time Series Data?

Firstly we can describe the time series data through a line chart visualization using sns.lineplot. As shown in the image below, the visualization of "Electric Production [1]" time series data depicts an upward trend with some repetitive patterns.

df = pd.read_csv("../input/time-series-datasets/Electric_Production.csv")
sns.lineplot(data=df,x='DATE', y='IPG2211A2N')py
Time Series Visualization (image by author)

To explain the characteristics of the time series data better, we can break it down into three components:

  • trend – T(t): a long-term upward or downward change in the average value.
  • seasonality – S(t): a periodic change to the value that follows an identifiable pattern.
  • residual – R(t): random fluctuations in the time series data that does not follow any patterns.

They can be combined typically through addition or multiplication:

  • Additive Time Series: O(t) = T(t) + S(t) + R(t)
  • Multiplicative Time Series: O(t) = T(t) × S(t) × R(t)

In Python, we decompose the three components from time series data through seasonal_decompose, and decomposition.plot() gives us the visual breakdown of trend, seasonality and residual. In this code snippet, we specify the model to be additive and period = 12 to show the seasonal patterns.

(To access the entire code snippet, please have a look at my website)

from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(x=df['IPG2211A2N'], model='additive', period = 12) 
decomposition.plot()

Stationarity

Time series data can be classified into stationary and non-stationary. Stationarity is an important property, as some models rely on the assumption that the data is stationary. However, time series data often possesses the non-stationary property. Therefore, we need to understand how to identify a non-stationary time series and how to transform it through various techniques, e.g. differencing.

Stationary data is defined as not depending on the time component and possesses the following characteristics: constant mean, constant variance over time and a constant autocorrelation structure (i.e. the pattern of autocorrelation does not change over time), without a periodic or seasonal component.

Techniques to Identify Stationarity

The most straightforward method is to examine the data visually. For example, the time series visualization above indicates that the time series follows an upward trend and its mean values increase over time, suggesting that the data is non-stationary. To quantify stationarity, we can use the following two methods.

Firstly, the ADF (Augmented Dickey-Fuller) test examines stationarity based on the null hypothesis that the data is non-stationary and the alternative hypothesis that the data is stationary. If the p-value generated from the ADF test is smaller than 0.05, it provides strong evidence to reject the null hypothesis that the data is non-stationary.

We can use adfuller from the statsmodels.tsa.stattools module to perform the ADF test, which generates the ADF statistic and the p-value. In this example, the p-value of 0.29 is greater than 0.05, thus we cannot reject that this dataset is non-stationary.
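The ADF code itself is not reproduced in this text; a minimal sketch of the test might look like this.

from statsmodels.tsa.stattools import adfuller
adf_statistic, p_value = adfuller(df['IPG2211A2N'])[:2]
print('ADF statistic:', adf_statistic)
print('p-value:', p_value)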

Secondly, the ACF (Autocorrelation Function) summarizes the correlation between the current observation and past observations at different lags. For example, when lag=1 (x-axis), the ACF value (y-axis) is roughly 0.85, meaning that the average correlation between all observations and their previous observation is 0.85. In a later section, we will also discuss using the ACF to determine the moving average parameter.

The code snippet below generates ACF plots using sm.graphics.tsa.plot_acf, showing 40 lags.

import statsmodels.api as sm
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(20, 10))
subplot1 = fig.add_subplot(211)
subplot2 = fig.add_subplot(212)
sns.lineplot(x=df['DATE'], y=df['IPG2211A2N'], ax=subplot1)
sm.graphics.tsa.plot_acf(df['IPG2211A2N'], lags=40, ax=subplot2)
fig.show()

For non-stationary data, the ACF drops to 0 relatively slowly, because non-stationary data may still appear highly correlated with previous observations, indicating that the time component still plays an important role. The diagram above shows the ACF of the original time series data, which decreases slowly, thus the data is very likely non-stationary.

Stationarity and Differencing

Differencing removes trend and seasonality by computing the difference between an observation and the observation before it. Differencing can transform some non-stationary data into stationary data.

  1. remove trend

We use shift(1) to shift the original time series data (shown on the left) down by one row (shown on the right) and take the difference to remove the trend component. dropna removes the empty row created when NaN is subtracted.

# remove trend component
diff = df['IPG2211A2N'] - df['IPG2211A2N'].shift(1)
diff = diff.dropna(inplace=False)

We can plot the time series chart as well as the ACF plot after applying trend differencing. As shown below, the trend has been removed from the data and the data appears to have a constant mean. The next step is to address the seasonal component.

# ACF after trend differencing
fig = plt.figure(figsize=(20, 10))
subplot1 = fig.add_subplot(211)
subplot2 = fig.add_subplot(212)
sns.lineplot(x=df['DATE'], y=diff, ax=subplot1)
sm.graphics.tsa.plot_acf(diff, lags=40, ax=subplot2) 
fig.show()
Trend Differencing (image by author)

2. remove seasonality

From the ACF plot above, we can see that observations are more correlated when lag is 12, 24, 36 etc, thus it may follow a lag 12 seasonal pattern. Let us apply shift(12) to remove the seasonality and retest the stationarity using ADF – which has a p-value of around 2.31e-12.

# remove seasonal component
diff = df['IPG2211A2N'] - df['IPG2211A2N'].shift(1)
seasonal_diff = diff - diff.shift(12)
seasonal_diff = seasonal_diff.dropna(inplace=False)

After removing the seasonal pattern, the time series data below becomes more random and ACF value drops to a stable range quickly.

Seasonal Differencing (image by author)

Models – ARMA, ARIMA, SARIMA

In this section, we will introduce three different models – ARMA, ARIMA and SARIMA for time series forecasting. Generally, the functionalities of these models can be summarized as follow:

  • ARMA: Autoregressive + Moving Average
  • ARIMA: Autoregressive + Moving Average + Trend Differencing
  • SARIMA: Autoregressive + Moving Average + Trend Differencing + Seasonal Differencing

ARMA – Baseline Model

ARMA stands for Autoregressive Moving Average. As the name suggests, it is a combination of two parts – Autoregressive and Moving Average.

Autoregressive Model – AR(p)

The autoregressive model makes predictions based on previously observed values and can be expressed as AR(p), where p specifies the number of previous data points to look at. In the formula below, X represents observations from previous time points and φ represents the weights.

For example, if p = 3, then the current time point is dependent on the values from previous three time points.

How to determine the p values?

The PACF (Partial Autocorrelation Function) is typically used for determining the p value. A given observation in a time series Xt may be correlated with a lagged observation Xt-3, which is itself also impacted by its own lagged values (e.g. Xt-2, Xt-1). The PACF visualizes the direct contribution of a past observation to the current observation. For example, in the PACF below, when lag = 3 the PACF is roughly -0.60, which reflects the direct impact of lag 3 on the original data point, while the compound effects of lag 1 and lag 2 through lag 3 are not included in the PACF value. The p value for the AR(p) model is then determined by the first lag at which the PACF falls within the significance threshold (blue area), i.e. p = 4 in the example below.

PACF (image by author)
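To reproduce a PACF plot like the one above, a sketch mirroring the earlier ACF snippet could be used (whether the original figure was generated from the raw or the differenced series is an assumption; the raw series is used here).

# PACF plot, analogous to the ACF plot generated earlier
sm.graphics.tsa.plot_pacf(df['IPG2211A2N'], lags=40)
plt.show()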

Moving Average Model – MA(q)

The moving average model, MA(q), adjusts the model based on the average prediction errors from the previous q observations, as stated below, where e represents the error terms and θ represents the weights. The q value determines the number of error terms to include in the moving average window.

How to determine the q value?

The ACF can be used for determining the q value. It is typically selected as the first lag at which the ACF drops to nearly 0. For example, we would choose q=4 based on the ACF plot below.

ACF (image by author)

To build an ARMA model, we can use the ARIMA function (which will be explained in the next section) in statsmodels.tsa.arima.model and specify the hyperparameter order(p, d, q). When d = 0, it operates as an ARMA model. Here we fit an ARIMA model with p=3 and q=4 to the time series data df['IPG2211A2N'].

from statsmodels.tsa.arima.model import ARIMA
ARMA_model = ARIMA(df['IPG2211A2N'], order=(3, 0, 4)).fit()

Model Evaluation

Model evaluation becomes particularly important when choosing the appropriate hyperparameters for time series modeling. We are going to introduce three methods to evaluate time series models. To estimate the model's predictions on unobserved data, I used the first 300 records in the original dataset for training and the remaining records (from index 300 to 396) for testing.

df_test = df[['DATE', 'IPG2211A2N']].loc[300:]
df = df[['DATE', 'IPG2211A2N']].loc[:299]
  1. Visualization

The first method is to plot the actual time series data and the predictions in the same chart and examine the model performance visually. This sample code firstly generates predictions from index 300 to 396 (same size as df_test) using the ARMA model, then visualizes the actual vs. predicted data. As shown in the chart below, since ARMA model fails to pick up the trend in the time series, the predictions drift away from actual values over time.

# generate predictions
df_pred = ARMA_model.predict(start=300, end=396)
# plot actual vs. predicted
fig = plt.figure(figsize=(20, 10))
plt.title('ARMA Predictions', fontsize=20)
plt.plot(df_test['IPG2211A2N'], label='actual', color='#ABD1DC')
plt.plot(df_pred, label='predicted', color='#C6A477')
plt.legend(fontsize =20, loc='upper left')
ARMA Predictions (image by author)

2. Root Mean Squared Error (RMSE)

For time series regression, we can apply general regression model evaluation methods such as RMSE or MSE. For more details, please have a look at my article on "Top 4 Linear Regression Variations in Machine Learning".

Top 4 Linear Regression Variations in Machine Learning

Larger RMSE indicates more difference between actual and predicted values. We can use the code below to calculate the RMSE for the ARMA model – which is around 6.56.

Python">from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(df['IPG2211A2N'][1:], pred_df[1:]))
print("RMSE:", round(rmse,2))

3. Akaike Information Criteria (AIC)

The third method is to use the AIC, stated as AIC = 2k - 2ln(L), to interpret the model performance. It is calculated based on the log likelihood (L) and the number of parameters (k). We would like to optimize for a model with a lower AIC, which means that:

  • log likelihood needs to be high, so that models with high predictability would be preferred.
  • the number of parameters is low, so that the model prediction is determined by fewer factors, hence it is less likely to overfit and have a higher interpretability.

We can get the AIC value through summary() function, and the summary result below tells us that the ARMA model has AIC = 1547.26.

ARMA_model.summary()
ARMA Summary (image by author)

ARIMA: Address Trend

ARIMA stands for Autoregressive Integrated Moving Average, which extends from ARMA model and incorporates the integrated component (inverse of differencing).

ARIMA builds upon the autoregressive model (AR) and the moving average model (MA) by introducing a degree of differencing component (specified as the parameter d) – ARIMA(p, d, q). This addresses cases where an obvious trend is observed in the time series data. As demonstrated in the ARMA example, the model didn't manage to pick up the trend in the data, which makes the predicted values drift away from the actual values.

In the "Stationarity and Differencing" section, we explained how differencing is applied to remove trend. Now let us explore how it makes the forecasts more accurate.

How to determine d value?

Since ARIMA incorporates differencing in its model building process, it does not strictly require the training data to be stationary. To ensure that ARIMA model works well, the appropriate degree of differencing should be selected, so that time series is transformed to stationary data after being de-trended.

We can use the ADF test first to determine if the data is already stationary; if the data is stationary, no differencing is required, hence d = 0. As mentioned previously, the ADF test before differencing gives us a p-value of 0.29.

After applying trend differencing diff = df['IPG2211A2N'] - df['IPG2211A2N'].shift(1) and running the ADF test, we found that the p-value is far below 0.05. Therefore, it is highly likely that the transformed time series data is stationary.

However, if the data is still non-stationary, a second degree of differencing might be necessary, which means applying another level of differencing to diff (e.g. diff2 = diff - diff.shift(1)).

To build the ARIMA model, we use the same function as mentioned in ARMA model and add the d parameter – in this example, d = 1.

# ARIMA (p, d, q)
from statsmodels.tsa.arima.model import ARIMA
ARIMA_model = ARIMA(df['IPG2211A2N'], order=(3, 1, 4)).fit()
ARIMA_model.summary()
ARIMA Summary (image by author)

From the summary result, we can tell that the log likelihood increases and AIC decreases as compared to ARMA model, indicating that it has better performance.

The visualization also indicates that predicted trend is more aligned with the test data – with RMSE decreased to 4.35.

ARIMA Predictions (image by author)

SARIMA: Address Seasonality

SARIMA stands for Seasonal ARIMA which addresses the periodic pattern observed in the time series. Previously we have introduced how to use seasonal differencing to remove seasonal effects. SARIMA incorporates this functionality to predict seasonally changing time series and we can implement it using SARIMAX(p, d, q) x (P, D, Q, s). The first term (p, d, q) represents the order of the ARIMA model and (P, D, Q, s) represents the seasonal components. P, D, Q are the autoregressive, differencing and moving average terms of the seasonal order respectively. s is the number of observations in each period.

How to determine the s value?

The ACF plot provides some evidence of the seasonality. As shown below, every 12th lag appears to have a higher correlation with the original observation (compared to, for example, the 6th lag).

We have also previously tested that after shifting the data with 12 lags, no seasonality has been observed in the visualization. Therefore, we specify s=12 in this example.

#SARIMAX(p, d, q) x (P, D, Q, s)
SARIMA_model = sm.tsa.statespace.SARIMAX(df['IPG2211A2N'], order=(3, 1, 4),seasonal_order=(1, 1, 1, 12)).fit()
SARIMA_model.summary()
SARIMA Summary (image by author)

From the summary result, we can see that AIC further decreases from 1528.48 for ARIMA to 1277.41 for SARIMA.

SARIMA Predictions (image by author)

The predictions now illustrate the seasonal pattern, and the RMSE further drops to 4.04.


Thanks for reading this far. If you'd like to read more articles on Medium and also support my work, I would really appreciate you signing up for a Medium Membership using this affiliate link.


Take-Home Message

This introduction of time series models explains ARMA, ARIMA and SARIMA models in a progressive order.

  • ARMA: Autoregressive + Moving Average
  • ARIMA: Autoregressive + Moving Average + Trend Differencing
  • SARIMA: Autoregressive + Moving Average + Trend Differencing + Seasonal Differencing

Furthermore, we explore concepts and techniques related to time series data, such as Stationarity, ADF test, ACF/PACF plot and AIC.

More Related Articles

Top 4 Linear Regression Variations in Machine Learning

EDA and Feature Engineering Techniques

Practical Guides to Machine Learning

Reference

[1] Original dataset reference: Board of Governors of the Federal Reserve System (US), Industrial Production: Utilities: Electric and Gas Utilities (NAICS = 2211,2) [IPUTIL], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/IPUTIL, November 17, 2022. [Public Domain: Citation Requested].

Originally published at https://www.visual-design.net on Nov 17th, 2022.

The post Time Series Analysis Introduction - A Comparison of ARMA, ARIMA, SARIMA Models appeared first on Towards Data Science.

]]>
TensorFlow Template for Deep Learning Beginners https://towardsdatascience.com/tensorflow-template-for-deep-learning-beginners-3b976d0ee084/ Fri, 03 Jun 2022 14:39:10 +0000 https://towardsdatascience.com/tensorflow-template-for-deep-learning-beginners-3b976d0ee084/ How to Build Your First Deep Neural Network

The post TensorFlow Template for Deep Learning Beginners appeared first on Towards Data Science.

]]>
What is Deep Learning?

Deep Learning is a sub-category of machine learning models that uses neural networks. In a nutshell, neural networks connect multiple layers of nodes, and each node can be considered as a mini machine learning model. The output of each node then feeds in as the input of nodes in the subsequent layer.

deep learning model (image by author)

TensorFlow is a Python library that primarily focuses on providing a deep learning framework. To install and import the TensorFlow library:

pip install tensorflow
import tensorflow as tf

How to Build a TensorFlow Deep Neural Network

The skeleton of a deep learning model generally follows the structure below, and we can use the Keras API to implement a beginner-friendly deep learning model. There are many variations we can add at each stage to make the model more complex.

  1. Prepare dataset: data preparation and feature engineering techniques
  2. Define the model: model type, number of layers and units, activation functions
  3. Compile the model: optimizers, learning rate, loss function, metrics
  4. Fit the model: batch size, epoch
  5. Evaluation and Prediction: evaluation metrics

Grab the code template from our Code Snippet section on my website.

1. Prepare Dataset

Deep learning is fundamentally a type of machine learning algorithm and covers both supervised learning and unsupervised learning. For supervised learning, it requires splitting the dataset into train and test sets (sometimes also involving a validation set) as below.

from sklearn.model_selection import train_test_split
X = df.drop(['user-definedlabeln'], axis=1)
y = df['user-definedlabeln']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

There is a lot more you can do with the raw dataset, such as preprocessing and feature engineerings, but let’s keep it simple in this article.

2. Define the Model

One of the simplest forms of a deep learning neural network is the sequential model, which is composed of a single stack of layers where each layer has only one input tensor and one output tensor. We can create a sequential model by passing multiple Dense layers.

sequential model (image by author)
model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(units=32, activation="sigmoid"),
        tf.keras.layers.Dense(units=16, activation="sigmoid"),
        tf.keras.layers.Dense(units=1, activation="sigmoid")  # sigmoid output pairs with the binary_crossentropy loss used later
    ])

number of layers and units: a deep learning model must have an input layer and an output layer. The number of hidden layers between input and output can vary. The number of units per layer is also a hyperparameter that we can experiment on. The article "How to Configure the Number of Layers and Nodes in a Neural Network" provides a guide on how to experiment and perform search to determine number of layers and nodes.

activation function: each layer of the model requires an activation function. You can think of it as a mini statistical model that transforms the node input into output. Activation functions contribute to the non-linearity of neural networks. Hidden layers usually apply the same activation function, and the output layer can have a different one depending on whether it is a classification or regression prediction.

Below are common activation functions and each has its pros and cons.

  • Sigmoid: ranging from 0 to 1, sigmoid is suitable for binary classification
  • ReLU: it preserves linear behavior and solves the issue of vanishing gradient in Sigmoid and Tanh, but it can suffer from other problems like saturated or dead units when the input is negative
  • Tanh: the stronger gradient makes it more sensitive to small difference, however it has the issue of saturation and slow learning rate at extreme values
  • Linear: it is suitable for regression problem with continuous numeric output
activation function (image by author)

Recommended Reading: How to Choose an Activation Function for Deep Learning

3. Compile the Model

Deep learning models use backpropagation to learn. Simply put, the model learns from the prediction error and adjusts the weights allocated to each node in order to minimize that error. At the stage of compiling the model, we need to specify the loss function used to measure the error and also the optimizer algorithm used to reduce the loss.

from tensorflow.keras.optimizers import RMSprop
model.compile(optimizer= RMSprop(learning_rate=0.001), loss="binary_crossentropy", metrics=['accuracy'])

optimizer: optimizer defines the optimization algorithm that is used to refine the models with the aim of reducing error. Some examples of optimizers in brief:

  • Gradient Descent: minimizes the loss function by updating the parameters in the direction of the negative gradient
  • Stochastic Gradient Descent: a popular variant of gradient descent that updates the parameters after each training example
  • RMSProp: computes an adaptive learning rate per parameter and is commonly used in recurrent neural networks
  • Momentum: borrows an idea from physics, accumulating past gradients so the update keeps moving in a consistent direction, which results in faster convergence
  • Adam: combines the advantages of both RMSProp and Momentum, and is efficient when working with large datasets
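As an illustration, swapping the optimizer is a one-line change in the compile step. A small sketch using Adam instead of RMSprop (the learning rate of 0.001 is simply a common starting value, not a tuned one):

from tensorflow.keras.optimizers import Adam

# Same compile step as above, but with the Adam optimizer instead of RMSprop
model.compile(optimizer=Adam(learning_rate=0.001),
              loss="binary_crossentropy",
              metrics=['accuracy'])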

Recommended Reading:

A Comprehensive Guide on Deep Learning Optimizers

An overview of gradient descent optimization algorithms

We also need to specify a learning rate for the optimizer, because it determines how quickly the parameters/weights are updated to minimize the loss. We can visualize the loss function in a 2D space as below, where the goal of the optimizer is to find the point of minimum error. If the learning rate is too large, we may skip over the lowest point and fail to converge. If it is too small, it may take a very long time to reach the minimum loss.

large vs. small step size (image by author)
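To make this concrete, here is a tiny sketch of gradient descent on a one-dimensional loss f(w) = w^2, comparing different learning rates (the values are arbitrary and only for illustration):

def gradient_descent(learning_rate, steps=20, w=5.0):
    # minimize f(w) = w^2 (gradient 2*w), starting from w = 5
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

print(gradient_descent(learning_rate=0.01))  # too small: after 20 steps w is still far from the minimum at 0
print(gradient_descent(learning_rate=0.1))   # reasonable: w gets close to 0
print(gradient_descent(learning_rate=1.1))   # too large: w overshoots and diverges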

loss function: the loss function measures the error and provides an evaluation of model performance. Classification and regression require different loss functions:

  • loss functions for classification problems: "binary_crossentropy", "hinge" …
  • loss functions for regression problems: "mean_squared_error", "mean_squared_logarithmic_error", "mean_absolute_error"

metrics: the evaluation metrics reported after each training iteration, passed as a list. We can reuse the loss functions above, and can also add "accuracy", "auc" for classification and "rmse", "cosine" for regression.
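For comparison, a regression model would pair a different loss with different metrics at this step. A hedged sketch, assuming a hypothetical model with a single linear output unit:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.optimizers import RMSprop

# A hypothetical regression network: linear output unit, regression loss and metric
regression_model = tf.keras.models.Sequential([
        keras.layers.Dense(units=32, activation="relu"),
        keras.layers.Dense(units=1, activation="linear")
    ])
regression_model.compile(optimizer=RMSprop(learning_rate=0.001),
                         loss="mean_squared_error",
                         metrics=["mean_absolute_error"])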

Recommended Reading: Keras Loss Functions: Everything You Need to Know

4. Fit the Model

The model.fit() function fits the training features X_train and training labels y_train to the model. The training process is further controlled by the epochs and the batch size.

model.fit(X_train, y_train, epochs = 15, batch_size = 10)
batch and epoch (image by author)

epochs: the number of complete passes through the entire training set required to finish the training.

batch_size: the number of training samples used for each update of the model parameters. If the batch_size equals the size of the training set, the model uses the entire training dataset for every parameter update; if batch_size = 1, it updates the parameters after every single data point.
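To make the arithmetic concrete: the number of parameter updates per epoch is the training-set size divided by the batch size (rounded up). A small sketch using the epoch and batch size values from this example:

import math

n_train = len(X_train)  # number of rows in the training set
batch_size = 10
epochs = 15

updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs
print(f"{updates_per_epoch} parameter updates per epoch, {total_updates} in total")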

Recommended Reading: Epoch vs Batch Size vs Iterations

5. Evaluation and Prediction

Remember that we initially split the dataset into training and test sets, and the test set was left out of the entire model-building process. This is because we need the held-out test set to evaluate performance on unseen data. Simply pass the test dataset as below, and it returns the evaluation metrics specified at the model compilation stage.

model.evaluate(X_test, y_test)

Once you are happy with the model performance, you can deploy it for making predictions.

model.predict(predict_data)  # predict_data: new feature data with the same columns as X_train
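Since the sigmoid output is a probability between 0 and 1, it is often converted to a class label by applying a threshold. A minimal sketch using the test features (the 0.5 cut-off is a common default, not a tuned value):

# model.predict returns probabilities for the sigmoid output;
# threshold them at 0.5 to obtain binary class labels
probabilities = model.predict(X_test)
predicted_labels = (probabilities > 0.5).astype(int)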

The current model can only be considered a baseline model, and there are still many improvements that can be made to enhance its accuracy. This article provides a useful guide to improving on a deep learning baseline: How To Improve Deep Learning Performance.


Thanks for reaching the end. If you would like to read more of my articles on Medium, I would really appreciate your support by signing up for a Medium membership ☕


Take Home Message

This article introduces a deep learning template for beginners. At each stage, we can add variation and complexity to the neural network model by experimenting with the hyperparameters:

  1. Prepare dataset: data preprocessing and feature engineering techniques
  2. Define the model: model type, number of layers and units, activation functions
  3. Compile the model: optimizers, learning rate, loss function, metrics
  4. Fit the model: batch size, epoch
  5. Evaluation and Prediction

More Resources Like This

Top 6 Machine Learning Algorithms for Classification

Get Started in Data Science

How to Self-Learn Data Science in 2022

Originally published at https://www.visual-design.net on June 2nd, 2022.

The post TensorFlow Template for Deep Learning Beginners appeared first on Towards Data Science.

]]>
Statistical Power in Hypothesis Testing – Visually Explained https://towardsdatascience.com/statistical-power-in-hypothesis-testing-visually-explained-1576968b587e/ Mon, 09 May 2022 14:01:42 +0000 https://towardsdatascience.com/statistical-power-in-hypothesis-testing-visually-explained-1576968b587e/ An Interactive Guide to the What/Why/How of Power

The post Statistical Power in Hypothesis Testing – Visually Explained appeared first on Towards Data Science.

]]>
What is Statistical Power?

Statistical Power is a concept in hypothesis testing: it is the probability of detecting an effect when the effect actually exists. In my previous post, we walked through the procedure of conducting a hypothesis test. In this post, we will build upon that by introducing statistical power in hypothesis testing.

Power & Type 1 Error & Type 2 Error

When talking about Power, it seems unavoidable that Type 1 and Type 2 errors will be mentioned as well. All three are well-known hypothesis testing concepts that compare the predicted results against the actual results.

Let’s continue to use the t-test example in my previous post "An Interactive Guide to Hypothesis Testing" to illustrate these concepts.

An Interactive Guide to Hypothesis Testing in Python

Recap: we used a one-tailed two-sample t-test to compare two samples of customers – customers who accepted the campaign offer and customers who rejected it.

# df is the marketing campaign data from the previous post, with 'Response' and 'Recency' columns
recency_P = df[df['Response']==1]['Recency'].sample(n=20, random_state=100)
recency_N = df[df['Response']==0]['Recency'].sample(n=20, random_state=100)
  • null hypothesis (H0): there is no difference in Recency between customers who accept the offer and those who don't – represented by the blue line below.
  • alternative hypothesis (H1): customers who accept the offer have lower Recency than customers who don't – represented by the orange line below.
Power illustration (image by author)

Type 1 error (False Positive): values falling within the blue area of the chart can occur even when the null hypothesis is true, yet we choose to reject the null hypothesis because they are below the threshold. As a result, we make a Type 1 error, or false positive. Its probability equals the significance level (usually 0.05), which means we accept a 5% risk of claiming that customers who accept the offer have lower Recency when in fact there is no difference. The consequence of a Type 1 error is that the company may send a new campaign offer to people with low Recency values but get a poor response rate.

Type 2 error (False Negative): highlighted in the orange area, it is the probability of failing to reject the null hypothesis when the alternative is actually true – that is, claiming there is no difference between the two groups when a difference actually exists. In the business context, the marketing team may miss a targeted campaign opportunity with a high return on investment.

Statistical Power (True Positive): the probability of correctly rejecting the null hypothesis when the alternative is true. It is the exact complement of the Type 2 error rate: Power = 1 – Type 2 error rate. In this example, it is the probability of correctly concluding that customers who accept the offer are more likely to have lower Recency than customers who don't.

Hover over the chart below and you will see how Power, Type 1 error and Type 2 error change when we apply different thresholds. (Check out the Code Snippet section on my website if you want to build this yourself.)

Why Use Statistical Power?

Significance level is widely used to determine how statistically significant a hypothesis test is. However, it only tells part of the story: it guards against claiming a true effect or difference when no actual difference exists, and everything rests on the assumption that the null hypothesis is true. What if we want to see the positive side of the story – the probability of making the right conclusion when the alternative hypothesis is true? That is what Power measures.

Additionally, Power plays a role in determining the sample size. A small sample might produce a small p-value by chance, suggesting the result is unlikely to be a false positive, but it does not guarantee enough evidence for a true positive. Therefore, Power is usually set before the experiment to determine the minimum sample size required to provide sufficient evidence for detecting a real effect.

How to Calculate Power?

The magnitude of Power is determined by three factors: significance level, sample size and effect size. The statsmodels function solve_power() calculates the power given the parameters effect_size, alpha and nobs1.

Let’s run a Power Analysis using the Customer Recency example above.

from statsmodels.stats.power import TTestIndPower

t_solver = TTestIndPower()
# recency_d is the Cohen's d effect size computed in the snippet further below
power = t_solver.solve_power(effect_size=recency_d, alpha=0.05, power=None, ratio=1, nobs1=20, alternative='smaller')
  • significance level: we set alpha to 0.05, which also fixes the Type 1 error rate at 5%. alternative='smaller' specifies the alternative hypothesis: the mean difference between the two groups is smaller than 0.
  • sample size: nobs1 specifies the size of sample 1 (20 customers) and ratio is the number of observations in sample 2 relative to sample 1.
  • effect size: effect_size is the mean difference between the two groups divided by the pooled standard deviation. For a two-sample t-test we use Cohen's d, which gives 0.73 here. In general, 0.20, 0.50, 0.80 and 1.3 are considered small, medium, large and very large effect sizes.
# calculate effect size using Cohen's d
import numpy as np

n1 = len(recency_N)
n2 = len(recency_P)
m1, m2 = np.mean(recency_N), np.mean(recency_P)
sd1, sd2 = np.std(recency_N), np.std(recency_P)
pooled_sd = np.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
recency_d = (m2 - m1) / pooled_sd

As shown in the interactive chart "Power, Type 1 and Type 2 error" above, when the significance level is 0.05, the power is 0.74.
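As a sanity check, power can also be estimated empirically by simulation: repeatedly draw two samples under the alternative hypothesis and count how often the one-sided t-test rejects the null at alpha = 0.05. A rough sketch, assuming normally distributed groups with the effect size of 0.73 from above (the normality assumption is for illustration only):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, effect_size, alpha = 20, 0.73, 0.05
n_simulations = 10_000

rejections = 0
for _ in range(n_simulations):
    # the accepting group has a mean that is `effect_size` standard deviations lower
    group_accept = rng.normal(loc=-effect_size, scale=1.0, size=n)
    group_reject = rng.normal(loc=0.0, scale=1.0, size=n)
    t_stat, p_value = stats.ttest_ind(group_accept, group_reject, alternative='less')
    if p_value < alpha:
        rejections += 1

print(f"simulated power: {rejections / n_simulations:.2f}")  # close to the 0.74 reported above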

How to Increase Statistical Power?

Power is positively correlated with effect size, significance level and sample size.

1. Effect Size

A larger effect size indicates a larger mean difference relative to the pooled standard deviation. As the effect size increases, the observed difference between the two samples grows, so Power increases because there is more evidence that the alternative hypothesis is true. Hover over the line to see how Power changes as the effect size changes.

2. Significance level / Type I error

There is a trade-off between Type 1 and Type 2 errors, so if we tolerate more Type 1 error we also increase Power. If you hover over the line in the first interactive chart "Power, Type 1 and Type 2 error", you will notice that when we try to reduce Type 1 error, Type 2 error increases and Power decreases. This is because minimizing false positives raises the bar for what we classify as a positive effect; when the standard is too high, we also reduce the probability of correctly identifying a real effect. We cannot make both perfect, so a common convention (Type 1 error of 0.05 and Power of 0.8) is applied to balance this trade-off.
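To see this trade-off numerically, the same solver can be re-run at a few different significance levels. A short sketch, reusing the recency_d effect size and the sample size of 20 from the example above:

from statsmodels.stats.power import TTestIndPower

t_solver = TTestIndPower()
# a larger alpha (more tolerated Type 1 error) yields higher power, and vice versa
for alpha in [0.01, 0.05, 0.10]:
    power = t_solver.solve_power(effect_size=recency_d, alpha=alpha, power=None,
                                 ratio=1, nobs1=20, alternative='smaller')
    print(f"alpha = {alpha:.2f} -> power = {power:.2f}")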

3. Sample size

Power also has a positive correlation with sample size. A larger sample size reduces the variance of the sample mean, so the sample average sits closer to the population mean. As a result, when we observe a difference in the sample data, it is less likely to have occurred by chance. As seen in the interactive chart, when the sample size is as large as 100, it is easy to reach 100% power even with a relatively small effect size.

In hypothesis testing, we often reverse the process and derive the required sample size for a desired Power, using the code below. For this example, around 24 customers are required in each sample group to run the t-test with a Power of 0.8.
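A minimal sketch of this reversed calculation, reusing the same TTestIndPower solver and the recency_d effect size computed earlier, with nobs1 left as the unknown to solve for:

from statsmodels.stats.power import TTestIndPower

t_solver = TTestIndPower()
# solve for the sample size (nobs1=None) that achieves 80% power at alpha = 0.05
required_n = t_solver.solve_power(effect_size=recency_d, alpha=0.05, power=0.8,
                                  nobs1=None, ratio=1, alternative='smaller')
print(round(required_n))  # roughly 24 customers per group, as noted above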


Hope you found this article helpful. If you would like to read more articles like this, I would really appreciate your support by signing up for a Medium membership.


Take Home Message

In this article, we introduce a statistics concept – Power, and answer some questions related to Power.

  • What is Statistical Power? – Power is related to Type 1 error and Type 2 error
  • Why use Statistical Power? – Power can be used to determine sample size
  • How to calculate Power? – Power is calculated from effect size, significance level and sample size.

More Articles Like This

An Interactive Guide to Hypothesis Testing in Python

Get Started in Data Science

Practical Guides to Machine Learning


Originally published at https://www.visual-design.net on May 8th, 2022.

The post Statistical Power in Hypothesis Testing – Visually Explained appeared first on Towards Data Science.

]]>