Content from Introduction to AI Coding Assistants


Last updated on 2024-11-24

Overview

Questions

  • How do AI coding assistants work?
  • What are the main AI coding assistants and what are their main characteristics?
  • How do you set up Codeium as a coding assistant for the lesson?

Objectives

  • Learn about AI models behind coding assistants
  • Outline the main AI coding assistants and their capabilities
  • Set up Codeium as a coding assistant for the lesson

Introduction


Artificial intelligence (AI) models for coding assistants, such as Codeium and GitHub Copilot, rely on machine learning (ML) techniques, particularly deep learning and natural language processing (NLP), to assist developers. These models are trained on vast amounts of publicly available code and documentation to understand patterns, syntax, and code logic across various programming languages. Here is a schematic process of training an AI coding assistant model.

Callout

It’s important to note that many AI coding assistants are fine-tuned from foundation models rather than trained from scratch. Fine-tuning involves adapting a pre-trained model to specific tasks using smaller, task-specific datasets. This approach is more efficient as it uses the broad capabilities of foundation models, requiring fewer resources and less time than starting from scratch.

Now let’s break down the key characteristics of how these models work:

  • Code understanding through ML
    • Description: AI coding assistants are built on models that analyze and learn from vast amounts of data, including open-source codebases, libraries, and developer behavior. These models break down code into a mathematical representation that the AI can interpret. They learn common patterns, best practices, syntax rules, and how developers approach specific tasks or problems.
    • Example: When you start typing, the AI assistant predicts what you are likely to write next based on patterns it has learned. These predictions can be simple line completions or more complex, multi-line code suggestions.
  • Context awareness
    • Description: AI coding assistants are designed to understand context at multiple levels, including code, variables, and functions. This lets them offer code suggestions that are relevant to that context.
    • Example: If you’re writing a loop to iterate over a list of items, the AI can suggest the entire loop structure based on what it recognizes in the surrounding code.
  • NLP
    • Description: AI models are trained both to understand code syntax and to interpret natural language. This is crucial for features like chat-based interaction or when a developer types a comment or command in plain English, expecting the AI to generate code.
    • Example: If you write a comment like "Create a function to fetch user data from an API", the AI can generate the function code based on that request. You can also ask questions, like "How do I format dates in Python?", and the AI will provide an answer or relevant code snippet.
  • Learning from user interaction
    • Description: AI coding assistants learn and improve from ongoing user interactions. Over time, they adjust to individual developer preferences and coding styles, offering more personalized and relevant suggestions. As users accept or reject suggestions, the assistant refines its future outputs based on this feedback (the so-called “feedback loop”).
    • Example: Many AI coding assistants can learn from private codebases (when privacy policies allow) to better align with specific project structures, libraries, or functions common to a particular team or organization (custom adaptations).
  • Code generation and refactoring
    • Description: Code generation: Based on a high-level description of what you want, the assistant can write larger code blocks. Code refactoring: AI can help optimize and refactor your code.
    • Example: You might describe a task such as, “Create a function to handle user authentication using OAuth2”, and the assistant will generate the appropriate code. You might also ask, “Refactor this code to use async/await for better performance”, and the AI will generate the necessary changes to convert synchronous functions into asynchronous ones.
  • Multilanguage and multimodal support
    • Description: AI coding assistants are designed to support many programming languages and paradigms, enabling them to work across domains (e.g., web development, data science, system programming). They are trained on the syntax, idioms, and patterns of languages like Python, JavaScript, C++, and many others.
    • Example: Codeium supports Assembly, C, C++, C#, Clojure, CMake, CoffeeScript, CSS, CUDA, Dart, Delphi, Dockerfile, Elixir, F#, Go, Groovy, Haskell, HCL, HTML, Java, JavaScript, Julia, JSON, Kotlin, LISP, Less, Lua, Makefile, MATLAB, Objective-C, pbtxt, PHP, Protobuf, Python, Perl, Powershell, R, Ruby, Rust, Sass, Scala, SCSS, shell, Solidity, SQL, Starlark, Swift, TypeScript, TSX, VBA, Vue, YAML.
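
To make the idea of pattern-based prediction more concrete, here is a deliberately tiny sketch: a bigram counter that suggests the token most often seen after the current one. It is only an illustration under strong simplifying assumptions. Real assistants rely on large neural networks, and the `corpus`, `train_bigrams`, and `suggest_next` names are made up for this example.

```python
from collections import Counter, defaultdict

# Hypothetical miniature "training corpus" of tokenized code lines.
corpus = [
    ["for", "item", "in", "items", ":"],
    ["for", "row", "in", "rows", ":"],
    ["if", "item", "in", "cache", ":"],
]


def train_bigrams(lines):
    """Count how often each token follows another token."""
    counts = defaultdict(Counter)
    for tokens in lines:
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts


def suggest_next(counts, token):
    """Return the most frequent follower of `token`, or None if unseen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


model = train_bigrams(corpus)
print(suggest_next(model, "item"))  # "in" follows "item" in both occurrences
```

The same principle, learning which continuation is most likely given what came before, underlies the far richer predictions of a modern coding assistant.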

Callout

Not all AI-powered code assistance tools rely on generative AI. Many utilize traditional machine learning techniques to enhance coding efficiency, such as identifying errors, suggesting optimizations, or automating repetitive tasks. These tools often focus on static analysis, pattern recognition, or rule-based systems to provide real-time support without generating new content. By complementing human expertise, these AI tools streamline development workflows and help maintain code quality, offering reliable solutions for various programming needs.

For instance, PMD is an open-source static code analyzer that finds code issues, with a focus on maintainability. SonarQube analyzes code to identify bugs, vulnerabilities, and code smells using static analysis techniques. Finally, Snyk Code uses AI to suggest improvements and detect security issues in code.
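
As a rough illustration of what a rule-based check looks like (this is not how PMD, SonarQube, or Snyk Code are implemented), the sketch below uses Python’s standard `ast` module to flag bare `except:` clauses, a common maintainability issue. The `source` sample and the `find_bare_excepts` helper are hypothetical names chosen for this example.

```python
import ast

# Hypothetical code sample to analyze.
source = '''
def load(path):
    try:
        return open(path).read()
    except:
        return None
'''


def find_bare_excepts(code):
    """Return the line numbers of `except:` clauses that name no exception type."""
    tree = ast.parse(code)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    ]


print(find_bare_excepts(source))  # line numbers of the flagged handlers
```

No code is generated here: the tool only inspects the syntax tree and reports locations that match a fixed rule.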

Now that we have a foundation in how AI models for coding assistants function (how they analyze code, offer suggestions, and improve efficiency), let’s explore some of the most popular AI coding assistants available today. Each tool uses AI in slightly different ways to enhance your coding experience. The comparison below gives an overview of their key features, pricing, and capabilities. This will help you understand where Codeium fits and what alternatives you might consider.


  • GitHub Copilot
    • Brief introduction: AI coding assistant developed in collaboration with OpenAI; suggests code as you type within the IDE.
    • Key features: Integration with GitHub ecosystem; multilingual and multiline support; acts as a virtual pair programmer.
    • Pricing: Free for verified students (via GitHub Student Pack); monthly subscription fee, often with a free trial month.
  • Codeium
    • Brief introduction: AI coding assistant that provides code completions and refactoring suggestions.
    • Key features: Advanced code completions; code refactoring capabilities; supports multiple programming languages.
    • Pricing: Free version with basic features; paid version for full access to advanced features and higher usage limits.
  • Amazon CodeWhisperer / Amazon Q Developer
    • Brief introduction: AI assistant from Amazon Web Services; offers real-time code recommendations and integrates into various development environments.
    • Key features: Code recommendations; security scans for vulnerabilities; real-time documentation assistance; multilingual support.
    • Pricing: Free tier with basic features; paid tiers with additional features, higher usage limits, and advanced support.
  • Tabnine
    • Brief introduction: AI code completion tool providing accurate, context-aware code suggestions across multiple languages.
    • Key features: Robust autocompletion; learns from codebase to suggest relevant snippets and APIs; extensive language support.
    • Pricing: Free version with limited features; Pro version subscription-based, available monthly or yearly; discounts for team licenses.
  • Hugging Chat
    • Brief introduction: Leverages open-source models for flexible and customizable AI code assistance.
    • Key features: Open-source model integration; high flexibility in deployment; supports diverse coding environments.
    • Pricing: Free access to open-source models; paid options for advanced support and additional features.
  • Cline
    • Brief introduction: AI-powered coding assistant available as a VSCode extension and a browser version.
    • Key features: Autonomous coding capabilities using Claude 3.5 Sonnet; terminal integration for executing commands; browser interaction for debugging and testing.
    • Pricing: A “free, open-source” extension.

Callout

To maintain relevance and functionality, AI-powered coding tools must be updated regularly. These updates are critical for addressing evolving user needs, technological advancements, and emerging challenges in security, compliance, and language support.

Now that we’ve compared some of the top AI coding assistants, you can see that each tool offers unique features and benefits. For this lesson, we’ll focus on how to use AI coding assistants with Codeium as our primary example. Codeium offers a powerful, beginner-friendly experience that helps you speed up coding through intelligent completions, refactoring suggestions, and support for multiple programming languages.

For setting up Codeium, you can follow the instructions on this lesson’s setup page.

Using Codeium as your Coding Assistant

In this lesson, we will learn about the three ways Codeium can assist with coding: Autocomplete, Chat, and Command.

Autocomplete: This feature is always working in the background of your coding tool (IDE). It gives suggestions as you type, shown in light gray text. Autocomplete is most helpful when you already have a rough idea of what you want to code and just need to finish it quickly. It helps you stay focused and keeps your work flowing smoothly.

Chat: This feature allows you to ask questions and get answers using simple, everyday language. You can access Chat from the side panel in your coding tool. It’s great for when you’re working with new or unfamiliar code and need quick guidance or explanations.

Command: With Command, you can tell Codeium what changes you want to make to your code in plain language. Codeium will then suggest the changes, which you can choose to accept, reject, or adjust as needed.

Key Points

  • AI coding assistants work by combining ML, contextual code understanding, and NLP to help developers code faster and more efficiently.
  • These tools predict, generate, and even optimize code based on the developer’s input and ongoing context, making them powerful companions in modern software development.
  • There are several AI coding assistants available on the market, and they are designed to optimize your coding experience.
  • To set up Codeium as your coding assistant, you need to download and install the extension on your local machine.

Content from Code Generation and Optimization


Last updated on 2024-11-22

Overview

Questions

  • What are the three main modes of Codeium and how do they differ?
  • How can developers effectively use the Command feature for code generation and refactoring?
  • What role does context awareness play in improving Codeium’s suggestions?
  • How does the Chat feature complement the Command and Autocomplete modes?
  • What are the best practices for writing prompts that get optimal results from Codeium?

Objectives

  • Learn how to use Codeium’s Command mode for code generation and refactoring
  • Master the Chat feature for interactive coding assistance and problem-solving
  • Understand how to leverage Autocomplete for efficient code writing
  • Practice using context awareness to get more relevant code suggestions
  • Develop skills in writing effective prompts for better AI assistance

Codeium accelerates software development through three key modes: Command, Chat, and Autocomplete. Each mode leverages Codeium’s real-time context awareness engine to deliver highly relevant and useful suggestions. After a brief section about the context awareness feature, we will explore how to use these modes to generate, optimize, and refactor code, as well as to identify and fix bugs.

Please note that Python is not required, since Codeium supports multiple programming languages; however, all exercises and solutions are provided in Python.

Code Optimization

Code optimization is the process of making your code faster (reducing runtime), more efficient (using fewer resources like memory and disk), and/or more readable (easier for developers to maintain). Some common strategies for code optimization include:

  • Algorithmic optimization: Improving the efficiency of algorithms to reduce time or space complexity. Example: Reducing time complexity from O(n²) to O(n log n) by choosing the right sorting algorithm.
  • Code refactoring: Eliminating redundancy, improving readability, and enhancing maintainability without changing the code’s functionality. Example: Replacing nested loops with more concise operations like map, filter, or comprehensions.
  • Memory optimization: Reducing memory usage by optimizing data structures, avoiding memory leaks, and minimizing unnecessary allocations. Example: Using generators instead of lists to avoid storing all elements in memory.
  • Parallelism and concurrency: Using parallel processing or multithreading to split tasks. Example: Processing chunks of data simultaneously using multiprocessing in Python.

Much more can be said about code optimization, but these are some common strategies to keep in mind as you work with Codeium.
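
Two of the strategies above, refactoring a loop into a comprehension and replacing a list with a generator, can be sketched as follows. This is a minimal illustration with hypothetical function names, not a benchmark; actual savings depend on your data and workload.

```python
import sys


def squares_of_evens_loop(numbers):
    """Original style: explicit loop with a nested condition."""
    result = []
    for n in numbers:
        if n % 2 == 0:
            result.append(n * n)
    return result


def squares_of_evens(numbers):
    """Refactored: same behavior expressed as a list comprehension."""
    return [n * n for n in numbers if n % 2 == 0]


# Refactoring must not change the result.
assert squares_of_evens_loop(range(10)) == squares_of_evens(range(10))

numbers = range(100_000)

# Memory optimization: a list stores every element, a generator yields them lazily.
as_list = [n * n for n in numbers]
as_generator = (n * n for n in numbers)
print(sys.getsizeof(as_list) > sys.getsizeof(as_generator))  # the list container is far larger

# Both give the same total when consumed.
print(sum(as_list) == sum(as_generator))
```

Note that a generator can be consumed only once, so this trade-off suits data that you iterate over a single time.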

Context Awareness


Context awareness is one of Codeium’s most powerful features, allowing it to offer personalized and highly relevant suggestions by pulling information from various sources. Traditionally, generating code required training LLMs on specific codebases, which is resource-intensive and not scalable for individual users. However, Codeium uses a more efficient method known as retrieval-augmented generation (RAG). This applies across the board to Autocomplete, Chat, and Command.

RAG
  • Default Context: Codeium automatically pulls context from multiple sources, including the current file and other open files in your IDE, which are typically highly relevant to your ongoing work. Additionally, Codeium indexes your entire local codebase, retrieving relevant snippets even from closed files to assist as you write code, ask questions, or execute commands.
  • Context Pinning: Developers can provide specific guidance by “pinning” custom context through the chat panel’s context tab. You can pin directories, files, repositories, or specific code elements (like functions or classes) for persistent reference across Autocomplete, Chat, and Command. Context Pinning is useful when your current work depends on code from other files; however, it’s best to pin only essential items to avoid slowing performance. Here are some effective uses of Context Pinning:
    • Module Definitions: Pin class/struct definitions from other modules within your repository.
    • Internal Frameworks/Libraries: Pin directories with framework/library usage examples.
    • Specific Tasks: Pin files or folders with interfaces (e.g., .proto files, config templates).
    • Current Focus Area: Pin a directory containing most of the files relevant to your coding session.
    • Testing: Pin a file containing the class you’re writing unit tests for.

For instance, if you’re working on a function and ask Codeium to help refactor it, the tool will pull in relevant context from both your active file and other parts of your codebase to improve the output. This combination of multiple context sources ensures higher-quality code generation, fewer errors, and suggestions that feel tailored to your project.
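
The general idea behind retrieval-augmented generation can be sketched in a few lines: find the snippets most related to the request, then place them in front of the request before it is sent to the model. The example below is a toy version that ranks snippets by shared words. It is not Codeium’s actual engine (production systems typically use learned embeddings and a real index), and the `codebase` dictionary and helper names are hypothetical.

```python
import re

# Hypothetical stand-in for an indexed codebase: file name -> source snippet.
codebase = {
    "stats.py": "def mean(values): return sum(values) / len(values)",
    "io_utils.py": "def read_csv_rows(path): return open(path).read().splitlines()",
    "plots.py": "def plot_histogram(values, bins): ...",
}


def tokenize(text):
    """Split text into a set of lowercase word-like tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, documents, top_k=1):
    """Rank snippets by how many tokens they share with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents.items(),
        key=lambda item: len(query_tokens & tokenize(item[1])),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, documents):
    """Prepend the retrieved snippets to the user's request."""
    context = "\n".join(f"# {name}\n{code}" for name, code in retrieve(query, documents))
    return f"{context}\n\n# Request: {query}"


print(build_prompt("compute the mean of the values", codebase))
```

Because only the retrieved snippets are added to the prompt, the model can use project-specific code without being retrained on it.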

Command


The Command feature of Codeium allows you to generate or modify code by using natural language inputs. Instead of manually coding everything, you can describe what you want in plain English, and Codeium will help you do it — whether it’s creating new functions or refactoring existing pieces of code.

  • By pressing ⌘(Command)+I on Mac or Ctrl+I on Windows/Linux, you can enter a prompt and receive code suggestions, making it easier and faster to develop code directly within your editor.

  • Codeium will then provide a multiline suggestion that you can accept or reject. You can accept a suggestion (⌥(Option)+A on Mac or Alt+A on Windows/Linux), reject it (⌥+R on Mac or Alt+R on Windows/Linux), or follow up on a generation (⌥+F on Mac or Alt+F on Windows/Linux), either with these shortcuts or by clicking the corresponding code lens above the generated diff.

If you highlight a section of code before invoking Command, then the AI will edit the selection spanned by the highlighted lines. Otherwise, it will generate code at your cursor’s location.

Refactoring, Docstrings, and More

In Codeium, code lenses appear right above your function and class definitions, providing convenient shortcuts for common tasks. These clickable labels allow you to quickly generate docstrings, explain code, or refactor code with just a click, saving time on manual edits. These features are powered by Command, which is invoked when the code lenses are clicked.

  • Refactor Functionality: Clicking the Refactor label triggers Codeium’s refactoring capabilities, providing a dropdown of pre-filled options or allowing you to write a custom instruction. This is particularly useful for improving performance or restructuring code. You can also highlight a block of code and use Command (⌘+I or Ctrl+I) to perform more targeted refactoring.

  • Docstring Generation: Clicking the Docstring label automatically creates a docstring for your function. In most languages it is placed above the function header; in Python it is placed directly beneath the function header. This AI-generated documentation describes what the function does, helping you maintain well-documented, readable code.
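
For reference, this is roughly what a docstring beneath a Python function header looks like. The function and its wording are a hand-written, hypothetical example; the text Codeium generates for your own code will differ.

```python
def monthly_mean(values):
    """
    Compute the arithmetic mean of a sequence of monthly measurements.

    Args:
        values (list[float]): Measurements for one month; must not be empty.

    Returns:
        float: The mean of the measurements.
    """
    return sum(values) / len(values)


# Because the docstring sits directly under the header, Python exposes it via __doc__.
print(monthly_mean.__doc__)
```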

Smart Paste

This feature lets you copy code and paste it into a file in your IDE that’s written in a different programming language. Use ⌘+⌥+V (Mac) or Ctrl+Alt+V (Windows/Linux) to activate Smart Paste. Codeium will automatically detect the language of the destination file and translate the code in your clipboard accordingly. With context awareness, Codeium will also adapt the pasted code to integrate seamlessly, such as by referencing relevant variable names.

Here are some potential use cases:

  • Code migration: for instance, converting JavaScript to TypeScript or translating Java into Kotlin.
  • Adapting code from online sources: you found a helpful utility function in Go, but your project is in Rust.
  • Exploring a new language: if you’re curious about Haskell, you can see how your code might look if written in it.

Best Practices

Here are a few things to remember when using Codeium’s Command function:

  • The model behind Command is more advanced than the one used for autocomplete, making it slower but more capable of following complex instructions.

  • To edit a specific selection of code, highlight that piece of code before using Command. If not, it will generate new code at the cursor’s position.

  • For effective use, try to give clear and detailed prompts. While simple requests like “Fix this” or “Refactor” can work well due to context awareness, more specific instructions like “Write a function that takes two inputs of type Diffable and implements the Myers diff algorithm” can yield even better results.

Chat


The Codeium Chat feature offers a powerful way to interact with an AI assistant directly within your coding environment, providing instant, contextual feedback on the code. Unlike the Command function, Codeium Chat is designed for a more conversational and responsive interaction, making it easy to discuss complex coding questions and solutions. The base Chat model is available to all Codeium users and it is based on Meta’s Llama 3.1 70B.

In Visual Studio Code, you can access Codeium Chat by clicking on the Codeium icon located in the left sidebar by default.

Chat

To quickly open the chat panel or toggle focus between it and your code editor, use ⌘+⇧+A on Mac or Ctrl+⇧+A on Windows/Linux. For a more flexible experience, you can pop the chat panel out of the IDE entirely by clicking the “pop-out” icon at the top of the chat window.

@-Mentions

In any chat message, you can use the @-Mentions feature to refer to context items from within the chat input by prefixing a word with @. Context items can be function names, class names, directories and files, or even the content of your terminal history. When you reference context explicitly like this, Codeium can make more relevant suggestions.

For instance, you can mention classes, as shown below:

@mentions

Prompting

Clear and efficient prompting is a vital element of both Chat and Command features. There are three key components to a good prompt:

  • Clear Objective: Be specific about what you want Chat to produce—whether it’s new code, a refactor, or an explanation.

  • Contextual Details: Use @ mentions to provide context, such as referring to specific functions, classes, or modules.

  • Constraints: If there are specific requirements, such as using certain frameworks or considering performance, include these in your prompt.

Example:

Bad: Refactor rawDataTransform

Good: Refactor @func:rawDataTransform by turning the while loop into a for loop and using the same data structure output as @func:otherDataTransformer

Best Practices for Chat

Note that these best practices apply to both Chat and Command modes, as they help Codeium understand your needs more effectively.

💡 Prompting Best Practices

The prompting strategies we’re exploring here for Codeium aren’t limited to this tool. These techniques (e.g., being clear and concise, and specifying outputs) apply to many other AI-powered tools like ChatGPT, Copilot, and beyond. Mastering these skills will make your interactions with all AI tools more effective, no matter the platform!

Other Features

  • Persistent Context: Configure the Context tab in the chat panel to enable continuous contextual awareness during and across conversations. Within this tab, you’ll find:
    • Custom Chat Instructions: Brief prompts like “Respond in Kotlin and assume I have little familiarity with it,” which guide the model’s response style.
    • Pinned Contexts: Key items from your codebase (such as files, directories, or code snippets) that you want the model to actively consider.
    • Active Document: Highlights the currently active file, giving it priority.
    • Local Indexes: Lists repositories that the Codeium context engine has locally indexed.
  • Slash Commands: Use the /explain command at the beginning of a message to request an explanation from the model. Currently, this is the only supported slash command.
  • Copy and Insert: When a response includes code blocks, you can either copy the block to your clipboard or insert it directly into your editor by using the buttons above the code block.
  • Inline Citations: The model can reference specific items from your code, often linking snippets directly in its responses.
  • Regenerate with Context: Codeium assesses whether a question requires general or codebase-specific context by default. You can ensure a response includes code context by pressing ⌘⏎, or for previously answered questions, by clicking the sparkle icon to rerun it with context.
  • Chat History: Access past conversations by selecting the history icon in the chat panel. You can start a new conversation by clicking + or export existing chats using the menu.

Here are some typical use cases of the Chat functionality:

  • Writing Boilerplate Code: Easily generate function headers or repetitive code blocks by providing simple prompts like “Write a function that takes X and Y, performs A, B, C, and returns Z.”

  • Writing Unit Tests: You can use Chat to quickly write unit tests for your functions. For instance, ask it, “Write a unit test for @function-name that tests edge cases for X and Y.”

  • Generating Docstrings and Comments: Use Chat to add useful comments or docstrings to your code. Simply prompt it by saying, “Write a docstring for @function-name,” and it will generate a detailed explanation.

  • Explaining Code: For those new to a codebase or trying to understand complex logic, Chat can explain functions. You might ask, “Explain @function,” and Chat will provide a breakdown of the function’s purpose and workings.

Autocomplete


A key feature of Codeium is its Autocomplete function: with every keystroke, Codeium actively attempts to predict and complete what you’re typing. By analyzing the current file, previous edits, and contextual snippets from your codebase, it offers relevant suggestions as “ghost text”.

Autocomplete Function

This feature can be particularly useful when you’re writing boilerplate code (which refers to repetitive code that often serves as a standard template), as it can save you time and reduce the likelihood of errors. By leveraging Codeium’s Autocomplete function, you can speed up your coding process and focus on the more creative and challenging aspects of your work.

Boilerplate, Formatting, and More

As mentioned above, one of the most powerful ways to use Codeium is for automating repetitive coding tasks. Whether you’re writing boilerplate code, standard functions, or common design patterns, Codeium’s AI assistant can help speed up the process and reduce the manual effort required.

By recognizing patterns in your codebase, Codeium can predict and suggest repetitive structures such as:

  • Boilerplate code: Generate routine code structures like class definitions, function signatures, or common initialization blocks with minimal effort.
  • Repetitive functions: Quickly replicate commonly used functions or methods that follow a similar pattern, reducing the need for retyping or copy-pasting.
  • Code formatting and styling: Maintain consistency in your code format by allowing Codeium to suggest and automate repetitive formatting tasks, saving you from manual corrections.

For example, when writing a set of functions that handle similar operations, Codeium can recognize the pattern after a few examples and suggest the next method based on prior code. This not only speeds up your workflow but also minimizes the risk of errors from copying or manually creating repetitive code.
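
As a hypothetical illustration of such a pattern, consider a small family of unit-conversion functions. After the first two are written, an assistant would plausibly propose a third of the same shape once you type its signature; the exact suggestion you receive may differ.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvin."""
    return celsius + 273.15


def kelvin_to_celsius(kelvin):
    """Convert kelvin to degrees Celsius."""
    return kelvin - 273.15


# Same naming scheme, one-line docstring, and single return statement as above:
# the kind of continuation Autocomplete can infer from the surrounding code.
def celsius_to_fahrenheit(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32
```

Always check the suggested formula or logic yourself; a completion that matches the pattern is not guaranteed to be correct.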

Autocomplete Feature for Docstrings

Note that Codeium’s Autocomplete feature may also suggest docstrings as you type, helping you create well-documented functions without writing everything manually. This is particularly useful when you want specific wording, because you can shape the documentation interactively as you type.

Fill-in-the-middle (FIM)

Codeium’s Autocomplete model also includes Fill-in-the-Middle (FIM) capabilities. This is crucial because, more often than not, you’re inserting code within an existing file, rather than just appending new lines. With FIM, Autocomplete analyzes the code both above and below the current line to provide more accurate suggestions.

Let’s compare the FIM-enabled model to the standard model.

Codeium Autocomplete (with FIM):

FIM

Competitor Autocomplete (without FIM):

No FIM

Note that the non-FIM model suggests a generic docstring based on only the preceding line with the function signature. In contrast, the FIM model provides a more contextually relevant suggestion based on the entire function definition.
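
Conceptually, a FIM-capable model receives both the text before the cursor (the prefix) and the text after it (the suffix). The sketch below shows one way such an input could be assembled. The `<PRE>`, `<SUF>`, and `<MID>` markers are placeholders chosen for this example: the actual special tokens vary by model, and Codeium’s internal format is not documented here.

```python
def split_at_cursor(source, cursor_line):
    """Split source into the text before and after a 0-based cursor line."""
    lines = source.splitlines(keepends=True)
    prefix = "".join(lines[:cursor_line])
    suffix = "".join(lines[cursor_line:])
    return prefix, suffix


def build_fim_prompt(prefix, suffix):
    """Combine both sides of the cursor; the markers are hypothetical placeholders."""
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"


source = (
    "def area(width, height):\n"
    "    return width * height\n"
)

# The cursor sits on line 1, between the signature and the return statement,
# which is where a docstring would be inserted.
prefix, suffix = split_at_cursor(source, 1)
print(build_fim_prompt(prefix, suffix))
```

A model without FIM would see only the prefix, here the function signature, which explains the more generic docstring in the comparison above.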

Inline Comments

You can guide the Autocomplete feature by using comments within your code. Codeium interprets these comments and generates code suggestions to implement what the comment describes.

Inline Comments
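
As a hand-written illustration (the function name and body are hypothetical, and the suggestion you actually receive may differ), a descriptive comment such as the first line below could lead Autocomplete to propose the function that follows it:

```python
# Return the n largest values in a list, sorted in descending order.
def n_largest(values, n):
    return sorted(values, reverse=True)[:n]


print(n_largest([3, 9, 1, 7], 2))  # [9, 7]
```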

Shortcuts

The following shortcuts can be used to speed up your workflow:

  • Tab = accept suggestion
  • Esc = clear suggestions

MacOS

  • Cmd + Left Arrow = accept next word in suggestion
  • Option + ] = next suggestion

Windows/Linux

  • Ctrl + Left Arrow = accept next word in suggestion
  • Alt + ] = next suggestion

Best Practices

  • Avoid manually triggering Autocomplete; instead, let it naturally enhance your workflow. Writing prompts as comments is not recommended, but decorating your code with quality comments and informative variable/function names yields the best results.

  • To achieve the best results with Autocomplete, write clear, declarative (not instructive) code, use descriptive function names, and add good comments with examples of the desired inputs and outputs. See the table below for examples:

    Best Practices for Autocomplete
  • If needed, you can temporarily snooze Autocomplete. This feature is available in VS Code version 1.8.66. Simply click the Codeium button in the bottom right to access it.

Hands-on Practice


In the following exercises, you will have the opportunity to practice using Codeium’s Command, Chat, and Autocomplete features to generate, optimize, and refactor code. Create a Python file (for example, exercise.py) in your IDE and follow along with the exercises.

Code Generation

During the following exercises, we will be using a dataset containing CO2 concentration measurements taken in Mauna Loa, Hawaii, from 1958 to 2024, grouped by month. The dataset is available in the file co2-mm-mlo.csv on this website, and here is an example data view of it:

CO2 Dataset
  • Date: The date of the measurement in the format YYYY-MM.
  • Decimal Date: The date in decimal format.
  • Average: The average CO2 concentration in parts per million (ppm) per month.
  • Interpolated: The interpolated CO2 concentration in ppm per month.
  • Trend: The trend of CO2 concentration in ppm per month.
  • Number of Days: The percentage number of daily averages used to compute the monthly average.

For more details about how the data was collected and processed, you can refer to the source.

Let’s start by exploring the Command mode and generating code snippets to analyze a dataset. With the Python file open, press ⌘(Command)+I on Mac or Ctrl+I on Windows/Linux to open the Command prompt. Then, copy and paste the following text into the prompt (you can also break it down into smaller pieces if you prefer):

OUTPUT

Load a [CO2 concentration dataset](https://datahub.io/core/co2-ppm/) from the file `co2-mm-mlo.csv` into a Pandas DataFrame, then generate descriptive statistics and visualize data distributions. Read the dataset using the following URL: https://edu.nl/k6v7x.

1.  Write a function that takes a DataFrame as input and calculates key descriptive statistics, including:

   - Number of rows and columns
   - Data types of each column
   - Summary statistics (e.g., mean, minimum, maximum)

   Compute the statistics only for the numeric columns.

2. Write a function that accepts a DataFrame and a specific column as inputs, and creates a new figure in which it plots its distribution. If the column is numeric (e.g., `int64`, `float64`), create a histogram to display its distribution; if categorical, create a bar plot to show category frequencies. Add the name of the column to the title.

3. Write a function that creates a new figure in which it plots the `Average` and `Interpolated` columns on a single graph, with `Date` on the x-axis, to visualize their distributions over time.

4. In the main, print nicely the information computed in 1., run the function defined in 2. on all columns, and run the function defined in 3. Use the `show()` functionality to display the figures only at the end of the main. 

Here is what you would expect to see in the generated code:

PYTHON

import matplotlib.pyplot as plt
import pandas as pd


def get_descriptive_stats(df):
    """
    Calculate key descriptive statistics for a given DataFrame.
    """
    stats = {"nrow": df.shape[0], "ncol": df.shape[1]}
    for col in df.select_dtypes(include="number").columns:
        stats[col] = {
            "dtype": df[col].dtype,
            "mean": df[col].mean(),
            "min": df[col].min(),
            "max": df[col].max(),
        }
    return stats


def plot_distribution(df, column):
    """
    Plot the distribution of a given column in a DataFrame.
    """
    fig, ax = plt.subplots()
    if df[column].dtype.kind in "bifc":
        df[column].plot.hist(ax=ax, bins=50)
    else:
        df[column].value_counts().plot.bar(ax=ax)
    ax.set_title(column)


def plot_time_series(df):
    """
    Plot the Average and Interpolated columns over time.
    """
    fig, ax = plt.subplots()
    df.plot(x="Date", y=["Average", "Interpolated"], ax=ax)


def main():
    url = "https://edu.nl/k6v7x"
    df = pd.read_csv(url)
    stats = get_descriptive_stats(df)
    print(pd.DataFrame(stats).T)
    for col in df.columns:
        plot_distribution(df, col)
    plot_time_series(df)
    plt.show()


if __name__ == "__main__":
    main()

There is something wrong here, can you spot it? We will address this issue later in the “Bug Fixing” exercise, so keep it in mind as you proceed.

Pseudo-randomness 🔍

You may obtain slightly different results due to the pseudo-randomness of the command mode generation process.

Instructions 🔍

The instructions provided in the text were clear and precise, designed to achieve the expected results accurately using the command mode. Try experimenting with removing, rearranging, or adding details to the instructions. You’ll notice that the assistant might generate slightly different code, which occasionally may not fully meet your intended goal.

This exercise highlights the importance of having a clear understanding of what you want to achieve when seeking help from an assistant. It allows you to refine or adjust the instructions to guide the tool effectively toward your objective. Relying too heavily on the assistant can lead to mistakes, a point we will emphasize repeatedly throughout this lesson.

Docstrings Generation

Now, let’s use Codeium’s Refactor lens to add further detail to the docstrings of the get_descriptive_stats() and plot_distribution() functions you created during the previous exercise (your function names may differ). Each docstring should:

  • Describe the purpose of the function
  • Document the function’s arguments and expected data types
  • Explain what the function returns (if applicable)
  • Optionally, provide a usage example

To do this, click on the Refactor lens above the function definition and select the Add docstring and comments to the code option. Codeium will add more details to the existing docstring, making it more informative and useful.

Note that if you don’t have a docstring yet in your function definition, another lens will appear to help you generate one, the Generate Docstring lens. Try experimenting with both lenses to see how they can improve your code documentation.

💡 Tip

Try experimenting with different docstring styles! For example, you could explore Google-style docstrings using the Refactor lens or the Command mode. The lenses should use NumPy style by default.

🔍 Note

While Command mode is not aware of the context of your code and doesn’t maintain a preselected docstring style across different functions, Chat mode can detect and persist a chosen docstring style across multiple functions. This feature is particularly useful when you want to maintain a consistent docstring format throughout your codebase.

Here’s an example of how the refactored docstrings might look, shown for two variants of the plotting function, plot_distribution() and plot_column_distribution():

PYTHON

def plot_distribution(df, column):
    """
    Plot the distribution of a given column in a DataFrame.

    For numerical columns, a histogram is plotted. For categorical columns,
    a bar plot of the counts is plotted.

    Parameters
    ----------
    df : DataFrame
        The DataFrame to plot the distribution for.
    column : str
        The column to plot the distribution for.

    Returns
    -------
    None
    """
    fig, ax = plt.subplots()
    if df[column].dtype.kind in "bifc":
        # Plot a histogram for numerical columns
        df[column].plot.hist(ax=ax, bins=50)
    else:
        # Plot a bar plot of the counts for categorical columns
        df[column].value_counts().plot.bar(ax=ax)
    ax.set_title(column)

def plot_column_distribution(df, column):
    """
    Plot the distribution of a given column in a DataFrame.

    Parameters
    ----------
    df : DataFrame
        The DataFrame containing the data.
    column : str
        The column name in the DataFrame for which to plot the distribution.
    """
    # Create a new figure and axis for the plot
    fig, ax = plt.subplots()
    
    # Check if the column is of a numeric type
    if df[column].dtype.kind in "bifc":
        # Plot a histogram for numeric data
        df[column].plot.hist(ax=ax, bins=50)
    else:
        # Plot a bar chart for categorical data
        df[column].value_counts().plot.bar(ax=ax)
    
    # Set the title of the plot to the column name
    ax.set_title(column)

Note that you might need to adjust the generated docstring if the function has complex logic or if the generated docstring lacks specific details about edge cases or exceptions.
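The same structure applies to get_descriptive_stats(). The following is only a sketch of what a NumPy-style docstring for it could look like, with the body taken from the Code Generation example; the wording of the docstring and its usage example are illustrative assumptions, not actual Codeium output.

```python
import pandas as pd


def get_descriptive_stats(df):
    """
    Calculate key descriptive statistics for a given DataFrame.

    Parameters
    ----------
    df : pandas.DataFrame
        The DataFrame to summarize.

    Returns
    -------
    dict
        A dictionary with the keys ``"nrow"`` and ``"ncol"`` (number of
        rows and columns), plus one entry per numeric column holding its
        ``"dtype"``, ``"mean"``, ``"min"`` and ``"max"``.

    Examples
    --------
    >>> df = pd.DataFrame({"Average": [1.0, 2.0, 3.0]})
    >>> get_descriptive_stats(df)["nrow"]
    3
    """
    # Shape information for the whole DataFrame
    stats = {"nrow": df.shape[0], "ncol": df.shape[1]}
    # Summary statistics for numeric columns only
    for col in df.select_dtypes(include="number").columns:
        stats[col] = {
            "dtype": df[col].dtype,
            "mean": df[col].mean(),
            "min": df[col].min(),
            "max": df[col].max(),
        }
    return stats
```

Whatever the assistant generates, check that the Returns section matches what the function really returns; here that is a plain dictionary, not a DataFrame.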

Bug Fixing (5 min)

Look back at the code generated during the “Code Generation” section. If you look at the head of the DataFrame, what do you notice? Use the Chat feature to discuss the issue with Codeium and ask for suggestions on how to resolve it. Then rerun the functions defined in the previous exercise to see if the issue has been resolved.

The issue is that the Date column is used as the index column, causing all the other columns to shift by one. Here’s how you might discuss the issue with Codeium in the Chat:

  1. Prompt: “The Date column is being used as the index, causing the other columns to shift by one. How can I read the file without running into this issue?”
  2. Discussion: Codeium might suggest resetting the index or using the reset_index() function to address the issue. Alternatively, it might recommend setting index_col=False when reading the CSV file to prevent the Date column from being used as the index.

Correct example of how to resolve the issue:

PYTHON

df = pd.read_csv(url, index_col=False)

  3. Verify the suggestion by running the functions again and checking the output.
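To see the shift in isolation, here is a minimal sketch on a made-up, three-line CSV. It assumes the cause is a trailing delimiter on every data row, which is one common way to trigger this behavior in pandas; the real file may differ, and the values below are invented.

```python
import io

import pandas as pd

# Hypothetical miniature of the file: the header names 3 columns, but every
# data row ends with an extra delimiter, so each row has 4 fields.
csv_text = (
    "Date,Average,Interpolated\n"
    "1958-03,315.71,315.71,\n"
    "1958-04,317.45,317.45,\n"
)

# Default behavior: the first field becomes the index and the remaining
# values slide one column to the left.
shifted = pd.read_csv(io.StringIO(csv_text))

# index_col=False tells pandas not to use the first column as the index.
fixed = pd.read_csv(io.StringIO(csv_text), index_col=False)

print(shifted.head())
print(fixed.head())
```

Comparing the two printed frames makes the problem visible: in the first one the dates sit in the index and the Date column holds CO2 values, while in the second each value sits under its own header.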

Code Optimization

Consider the following piece of code, which processes the previously read dataset to find the difference between the average and interpolated CO2 concentration for each row:

PYTHON

avg_int = []
for i in range(df.shape[0]):
    avg_int.append(df.iloc[i]['Average'] - df.iloc[i]['Interpolated'])

df['Avg-Int'] = avg_int

We can use the Command (or Chat) feature to optimize it for better performance and readability. Here’s an example of how the optimized code might look:

PYTHON

df['Avg-Int_opt'] = df.apply(lambda x: x['Average'] - x['Interpolated'], axis=1)
assert df['Avg-Int'].equals(df['Avg-Int_opt'])

Or even like this:

PYTHON

df['Avg-Int'] = df['Average'] - df['Interpolated']

This version is faster and more memory-efficient because it uses vectorized operations, which are a key feature of the pandas library.
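When accepting an optimization from the assistant, it is worth confirming that the result is unchanged. The sketch below does this on a small made-up DataFrame (the values are invented stand-ins for the CO2 data) by computing the difference in all three ways and comparing them.

```python
import pandas as pd

# Hypothetical stand-in for the CO2 dataset (values are made up).
df = pd.DataFrame(
    {
        "Average": [315.7, 317.5, 317.5, 316.0],
        "Interpolated": [315.7, 317.4, 317.5, 316.2],
    }
)

# 1. Original row-by-row loop
loop_result = pd.Series(
    [df.iloc[i]["Average"] - df.iloc[i]["Interpolated"] for i in range(df.shape[0])],
    index=df.index,
)

# 2. Row-wise apply
apply_result = df.apply(lambda row: row["Average"] - row["Interpolated"], axis=1)

# 3. Vectorized column arithmetic
vectorized_result = df["Average"] - df["Interpolated"]

# All three approaches should produce identical values.
assert loop_result.equals(apply_result)
assert loop_result.equals(vectorized_result)
```

Only once the results match does it make sense to compare speed, for example with the standard library's timeit module on the full dataset.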

Code Optimization (5 min)

Similar to the exercise above, execute the code as is to verify it works and examine the output. Then use Codeium’s Chat feature to analyze and suggest potential improvements. Look for ways to enhance performance, readability, and conciseness.

PYTHON

# Convert 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m')

# Filter data for a specific date range
filtered_df = df[(df['Date'] >= '2000-01-01') & (df['Date'] <= '2010-12-31')]

# Extract the year value from the 'Date' column
filtered_df['Year'] = filtered_df['Date'].dt.year

# Group data by year and calculate the average CO2 level for each year
avg_co2_per_year = filtered_df.groupby('Year')['Interpolated'].mean()

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(avg_co2_per_year.index, avg_co2_per_year, label='Average CO2 (ppm)', marker='o')
plt.xlabel('Year')
plt.ylabel('CO2 (ppm)')
plt.title('Average CO2 Levels by Year (2000-2010)')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

PYTHON

# Convert 'Date' column to datetime format and filter data for a specific date range
filtered_df = df[
    (pd.to_datetime(df['Date'], format='%Y-%m') >= '2000-01-01') & 
    (pd.to_datetime(df['Date'], format='%Y-%m') <= '2010-12-31')]


# Group data by year and calculate the average CO2 level for each year
avg_co2_per_year = filtered_df.groupby(pd.to_datetime(filtered_df['Date'], format='%Y-%m').dt.year)['Interpolated'].mean()


# Plot the results
avg_co2_per_year.plot(figsize=(10, 6), marker='o')
plt.xlabel('Year')
plt.ylabel('CO2 (ppm)')
plt.title('Average CO2 Levels by Year (2000-2010)')
plt.grid(True)
plt.tight_layout()
plt.show()

Comparison:

  • Combined the pd.to_datetime conversion and filtering steps into one.
  • Removed the unnecessary filtered_df['Year'] column and used the dt.year accessor to extract the year from the 'Date' column.
  • Simplified the plotting code by using the plot method of the Series object and removing the unnecessary plt.figure call.
  • Removed the label parameter from the plot function, as it is not necessary when using the plot method of the Series object.
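Suggestions like these are worth questioning too. The solution above calls pd.to_datetime on the Date column three times; a possible variant, sketched here on made-up data, converts the column once and reuses the result. The DataFrame values and the helper names `dates` and `in_range` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the CO2 dataset (values are made up).
df = pd.DataFrame(
    {
        "Date": ["1999-12", "2000-01", "2000-02", "2010-12", "2011-01"],
        "Interpolated": [368.0, 369.0, 370.0, 389.0, 391.0],
    }
)

# Convert the 'Date' column once and reuse the result
dates = pd.to_datetime(df["Date"], format="%Y-%m")

# Boolean mask for the date range (both ends inclusive)
in_range = (dates >= pd.Timestamp("2000-01-01")) & (dates <= pd.Timestamp("2010-12-31"))

# Group the filtered values by year and average them
avg_co2_per_year = (
    df.loc[in_range, "Interpolated"].groupby(dates[in_range].dt.year).mean()
)
print(avg_co2_per_year)
```

The plotting code from the solution can be applied to `avg_co2_per_year` unchanged.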

Key Points

  • Codeium offers three main modes: Command, Chat, and Autocomplete, each serving different coding needs
  • Context awareness through RAG technology helps Codeium provide more relevant and accurate suggestions
  • The Chat feature enables conversational interaction with code-aware AI assistance
  • Autocomplete includes Fill-in-the-Middle capabilities for more accurate code completion
  • Clear, declarative comments and good naming conventions improve Codeium’s effectiveness

Content from Ethical and Security Considerations


Last updated on 2024-11-24

Overview

Questions

  • What are the potential biases inherent in AI coding assistants and how can they affect code quality?
  • How can developers effectively validate and test AI-generated code to ensure security and reliability?
  • What measures can be taken to protect sensitive data when using AI coding assistants?

Objectives

  • Identify and analyze the ethical challenges posed by AI coding assistants in software development.
  • Establish best practices for testing and validating AI-generated code.
  • Outline security measures to protect sensitive information when using AI tools.
  • Promote collaborative approaches to code review among development teams.

Introduction to the Practical Exercise


Before diving into the core content on ethical and security considerations for AI coding assistants, let’s begin with a scenario-based exercise to set the stage. This exercise will help you identify potential challenges and think critically about how you might address them.

AI Coding Assistant Ethics Challenge (15 min)

You’re leading a team that’s considering adopting an AI coding assistant for a new project involving sensitive user data. Your task is to create a comprehensive plan that addresses the ethical and security concerns discussed in the lesson.

  1. List at least two potential risks or vulnerabilities that could arise from using an AI coding assistant in this project.

  2. For each risk identified, propose a specific mitigation strategy. Explain how this strategy addresses the risk and aligns with the best practices discussed in the lesson.

  3. Draft a set of at least 5 ethical guidelines for your team to follow when using the AI coding assistant. These should cover areas such as bias prevention, code review processes, and data privacy.

  4. Outline a security protocol that includes at least three specific measures to protect sensitive data and ensure the integrity of the AI-assisted development process.

  5. Design a collaborative code review process that leverages the strengths of both human developers and the AI assistant while mitigating potential risks.

Here are a few examples of sensitive situations you might think about:

  • Handling confidential participant data (e.g., writing code for analyzing participant responses in medical or psychological studies where data includes health records or personal information).
  • Incorporating third-party libraries with unverified security compliance.
  • Securing proprietary algorithms when developing code for cutting-edge research models or simulations that could be misused if exposed.

1. Potential risks or vulnerabilities

  • The AI assistant might inadvertently expose sensitive user data if it’s not properly configured to handle confidential information.
  • The AI assistant might suggest insecure coding practices or outdated libraries, introducing vulnerabilities into the project.

2. Mitigation strategies

  • Use a local, offline version of the AI coding assistant, or work with mock data whenever the AI coding assistant is enabled, to prevent exposure of real user data.
  • Integrate automated security scanning tools (e.g., Snyk) into the CI/CD pipeline to identify and address vulnerabilities in the code suggested by the AI.

3. Ethical guidelines

  • Always document when and how the AI assistant is used in code development.
  • Every piece of AI-generated code must be reviewed by at least one human developer before integration.
  • Never input sensitive user data or proprietary information into the AI assistant.
  • Take responsibility for all code in the project, regardless of whether it was human or AI-generated.
  • Use the AI assistant as a tool to enhance human capabilities, not to replace critical thinking or decision-making.

4. Security protocol

  • If using an offline version of the AI coding assistant, make sure to run it in an isolated, sandboxed environment to prevent unauthorized access to sensitive project data.
  • Implement strict role-based access controls to limit who can use the AI assistant and what parts of the codebase it can access.
  • Regularly update the AI assistant and its underlying libraries to patch security vulnerabilities.

5. Collaborative code review process

  • Always do peer code reviews of AI-generated code to catch any potential security vulnerabilities or ethical concerns.
  • Run the code through automated testing and security scanning tools to catch potential issues missed by human reviewers.
  • Hold regular team meetings to discuss complex issues, AI-suggested patterns, and potential biases or security concerns identified during the review process.

Ethical Considerations


When using AI coding assistants, it is essential to recognize the ethical challenges they pose. While powerful, these tools raise issues of bias, transparency, accountability, and privacy. Key ethical considerations include:

  • Biases in AI systems: AI coding assistants can perpetuate or amplify the biases inherent in their training data. This can lead to the generation of code that inadvertently favors specific demographic categories or fails to meet legal standards.

  • Error management: AI coding assistants can produce both random and systematic errors, which can undermine the reliability of the development process. For example, a study conducted by researchers at Bilkent University found that GitHub Copilot generated correct code only 46.3 percent of the time, highlighting the risk of depending solely on these tools for critical tasks.

  • Transparency and explainability: The “black box” nature of AI coding assistants complicates understanding of how and why specific code suggestions are made. For example, if an AI recommends a particular implementation, it may not clarify the rationale behind its choice, making it difficult for developers to validate the suggested solution.

  • Data privacy and confidentiality: Many AI coding assistants retain user input, raising concerns about the risk of inadvertently using confidential code or data in future outputs or for other purposes. This potential misuse of sensitive data can lead to data or intellectual property rights violations. Developers should exercise caution and ensure that sensitive information is not exposed to these systems.

  • Intellectual property issues: The use of AI coding assistants can blur the boundaries of authorship and ownership. Questions arise regarding rights to AI-generated code and whether developers can claim ownership or credit for results heavily influenced by AI suggestions. This poses ethical dilemmas in collaborative coding environments and can impact open-source projects.

Identifying Vulnerabilities


AI coding assistants can also introduce vulnerabilities into code bases if not used carefully. Here are the key areas to watch out for, along with some examples.

Code Vulnerabilities

AI assistants may suggest insecure or incorrect coding practices because they are trained on datasets that include both secure and insecure code. A 2023 Stanford University study found that users who rely on AI assistants write less secure code but are more confident that it is secure, underscoring the risk of blindly trusting AI-generated code.

For example, an AI assistant might suggest a code snippet that does not properly sanitize user input, making the application vulnerable to SQL injection. Developers might trust this code without noticing the missing security checks, leading to exploitable weaknesses.
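The difference can be sketched with Python's built-in sqlite3 module. The table, data, and function names below are hypothetical and exist only to illustrate the pattern: the first function builds the query by string formatting, the second passes the input as a bound parameter.

```python
import sqlite3

# Hypothetical in-memory database used only for this illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)


def find_user_unsafe(conn, name):
    # Vulnerable: user input is pasted directly into the SQL string
    query = f"SELECT name, secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Safer: the input is passed as a bound parameter, never parsed as SQL
    query = "SELECT name, secret FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


malicious_input = "nobody' OR '1'='1"
leaked = find_user_unsafe(conn, malicious_input)  # every row is returned
blocked = find_user_safe(conn, malicious_input)   # no row matches
print(len(leaked), len(blocked))
```

Both functions look equally plausible at a glance, which is why a suggestion like the first one can slip through when it is accepted without review.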

Data Privacy Issues

AI assistants often require access to codebases and data, potentially exposing sensitive information.

For example, when an AI tool processes a company’s proprietary codebase on cloud-based servers, there is a risk that sensitive business logic or user data could be intercepted during transmission or compromised in the event of a storage breach. Even if the data is anonymized, improper handling could still reveal critical details.

Adversarial Attacks

Hackers could exploit artificial intelligence systems to inject malicious code or manipulate workflows. An attacker could train an AI model to suggest backdoor vulnerabilities in widely used code patterns, spreading insecure practices into many applications. AI assistants could unintentionally recommend this malicious code, causing serious security breaches.

Unreliable Results

AI models sometimes produce errors or “hallucinate” incorrect information, leading to potentially harmful results. AI can provide outdated or fabricated information if its training data lack recent developments or if it fabricates answers in the absence of accurate data. This can be particularly problematic in fields such as software development, scientific research, or legal advice, where current knowledge is critical.

For example, an AI assistant might suggest an optimization that causes a memory leak or a system crash. If the developer does not catch it, this could lead to serious failures in production environments.

Dependence on External Code

AI assistants often draw on external libraries, which could introduce hidden vulnerabilities into the code base.

For example, an AI might suggest a function from an open-source library without verifying that it is up-to-date and secure. If the library has unpatched security flaws, these could propagate into the project, making it vulnerable to exploitation.

Safety Measures and Best Practices


To ensure ethical and safe use of AI coding assistants, developers must adopt a combination of technical safeguards and responsible practices. Many ethical issues intersect with security issues, and addressing them holistically improves the integrity of AI systems and the software development process.

Best Practices for the Ethical Use of AI Assistants in Research

  • Vigilant evaluation of results: Developers should continuously evaluate the results generated by AI coding assistants to identify and reduce potential biases. This active monitoring helps ensure that the code produced is correct and complies with ethical standards.

  • Rigorous testing and validation: Implementing robust testing and validation processes is critical to detecting errors in AI-generated code. Developers must rigorously evaluate AI suggestions to maintain code quality and reliability. Tools such as SonarQube can be integrated into the Continuous Integration/Continuous Deployment (CI/CD) pipeline to automatically assess code quality and safety before deployment.

  • Use offline AI tools: For projects involving proprietary or sensitive data, using offline AI coding assistants can significantly reduce risks. Keeping AI operations local avoids transmitting sensitive information over the Internet, which protects confidentiality and intellectual property. For example, tools such as TabNine can be run locally without Internet access. Another option is to train or fine-tune a LLaMA model locally, which allows for a customized coding assistant while keeping control over both data privacy and model performance.

  • Establish clear guidelines: Developing and adhering to clear guidelines on the use of AI in code generation helps address ethical challenges. Organizations can refer to accessible resources such as the ACM Code of Ethics, which outlines principles for ethical computing, or the European Commission’s Ethics Guidelines for Trustworthy AI. In addition, companies can create their own internal policies that define best practices and ethical considerations for the use of AI assistants in their development processes.
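As a small illustration of the testing and validation practice listed above, the sketch below writes a unit test for the get_descriptive_stats() function from the Code Generation section. The test name and the input data are assumptions made for this example; in a real project the test would typically live in its own file and be run by a test runner such as pytest.

```python
import pandas as pd


def get_descriptive_stats(df):
    """Function under test, copied from the Code Generation example."""
    stats = {"nrow": df.shape[0], "ncol": df.shape[1]}
    for col in df.select_dtypes(include="number").columns:
        stats[col] = {
            "dtype": df[col].dtype,
            "mean": df[col].mean(),
            "min": df[col].min(),
            "max": df[col].max(),
        }
    return stats


def test_get_descriptive_stats():
    # Small, hand-checkable input (made-up values)
    df = pd.DataFrame({"Average": [1.0, 2.0, 3.0], "Label": ["a", "b", "c"]})
    stats = get_descriptive_stats(df)

    # Shape is reported correctly
    assert stats["nrow"] == 3
    assert stats["ncol"] == 2

    # Numeric column statistics match values computed by hand
    assert stats["Average"]["mean"] == 2.0
    assert stats["Average"]["min"] == 1.0
    assert stats["Average"]["max"] == 3.0

    # Non-numeric columns are skipped, as the instructions required
    assert "Label" not in stats


test_get_descriptive_stats()
```

Writing the expected values by hand, rather than asking the assistant for them, keeps the test independent of the code it is meant to check.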

Security Measures

  • Code review processes: Establish rigorous code review protocols to evaluate AI-generated code for potential vulnerabilities and errors. Regular review of AI suggestions helps detect security risks, such as improper sanitization of inputs or use of outdated libraries.

  • Secure development practices: Train developers on secure coding principles and integrate security testing tools into the development pipeline. Automated testing can identify security holes early in the development process, while training sessions help developers remain vigilant against insecure practices suggested by artificial intelligence. Tools such as Snyk can automatically scan for vulnerabilities in dependencies and provide actionable corrective measures.

  • Access controls and data encryption: Protect sensitive code bases by implementing strong access controls and encrypting data. This approach prevents unauthorized access to proprietary code and ensures the security of AI training data, thereby reducing the risk of data breaches or malicious tampering.

  • Continuous monitoring and updates: Regularly monitor the performance of AI coding assistants and keep them up-to-date. Ensuring that these tools use the latest security practices and coding standards minimizes the introduction of vulnerabilities. A practice such as CI/CD can help automate updates and ensure that the latest security practices are consistently applied.

  • Collaborative oversight: Encourage a collaborative oversight approach where teams share responsibility for evaluating AI-generated code. Regular code reviews foster this collective responsibility and improve code quality, as multiple perspectives help identify potential flaws that an individual developer might overlook.

By integrating these best practices and security measures, developers can leverage the advantages of AI coding assistants while effectively mitigating the ethical and security risks associated with their use.

Key Points

  • AI coding assistants can introduce biases and errors, impacting the integrity of generated code.
  • Vigilant assessment and robust testing processes are essential for validating AI outputs.
  • Sensitive data protection requires secure development practices and offline AI tools.
  • Collaborative oversight enhances code quality by leveraging diverse perspectives in code reviews.