Intro to Assembly Language

This module builds the core foundation for Binary Exploitation by teaching Computer Architecture and Assembly language basics.

4.87

Created by 21y4d

Medium General

Summary

Binary exploitation is a core part of penetration testing, but learning it can be pretty challenging. This is mainly due to the complexity of binary files and their underlying machine code and the way binary files interact with the processor and computer memory.

Learning the basics of Computer Architecture and the Assembly Language is fundamental for understanding binary exploitation. Both can significantly enhance our understanding of how binaries work and interact with system resources.

The Intro to Assembly Language module builds the core foundation for all future Binary Exploitation modules by teaching the basics of:

  1. Computer and Processor Architecture
  2. Debugging and Disassembling
  3. x86_64 Assembly Language
  4. Shellcoding

Having a solid understanding of these topics will make learning basic binary exploitation very straightforward and facilitate learning advanced binary exploitation.

In addition to teaching the above topics, this module will also cover:

  • How High-Level code is compiled into Low-Level machine code
  • Different types and segments of computer memory
  • CPU clock and instruction cycles
  • CISC vs. RISC architectures
  • Different types of registers and memory addresses
  • CPU address endianness
  • Intro to nasm and Assembly File Structure
  • Intro to Assembling and Disassembling files
  • Basics of Binary Debugging with GDB
  • Basic Data and Arithmetic Assembly instructions
  • Intro to Assembly loops and branching
  • Assembly flags and conditional branching
  • Intro to Linux syscalls
  • Assembly procedures and functions
  • Using C and libc functions with Assembly
  • Intro to pwntools for Assembly and Shellcoding
  • Writing scripts with modern tools to extract, run, and debug shellcodes
  • Modern shellcoding techniques compliant with memory protections
  • Using tools for shellcode generation and encoding

We will also be working on a project throughout the module to apply everything we learn.

By the end of the module, we will have created a complete program that takes user input, performs advanced calculations, and outputs the results to the user, using nothing but Assembly language (i.e., instructions that map directly to the processor's 1's and 0's).

To learn more about this module, you may also watch the talk given by the module author at the HackTheBox Uni CTF 2022, titled First Steps Into Binary Exploitation. It introduces the first few sections of this module and shows how the module is beneficial for getting started in Binary Exploitation.

This module is broken down into sections with accompanying hands-on exercises to practice each of the tactics and techniques we cover.
The module ends with a practical hands-on skills assessment to gauge your understanding of the various topic areas.

As you work through the module, you will see example commands and command output for the various topics introduced. It is worth reproducing as many of these examples as possible to further reinforce the concepts presented in each section. You can do this in the PwnBox provided in the interactive sections or in your own virtual machine.

You can start and stop the module at any time and pick up where you left off. There is no time limit or "grading," but you must complete all of the exercises and the skills assessment to receive the maximum number of cubes and have this module marked as complete in any paths you have chosen.

The module is classified as "Medium" and assumes a working knowledge of the Linux command line and an understanding of information security fundamentals.

A firm grasp of the following modules can be considered prerequisites for successful completion of this module:

  • Learning Process
  • Linux Fundamentals

Assembly Language


Most of our interaction with our personal computers and smartphones is done through the operating system and other applications. These applications are usually developed using high-level languages, like C++, Java, Python, and many others. We also know that each of these devices has a core processor that runs all of the necessary processes to execute systems and applications, along with Random Access Memory (RAM), Video Memory, and other similar components.

However, these physical components cannot interpret or understand high-level languages, as they can essentially only process 1's and 0's. This is where Assembly language comes in: it is a low-level language in which we can write instructions that the processor can directly understand. Since the processor can only process binary data (i.e., 1's and 0's), it would be challenging for humans to interact with processors without referring to manuals to know which hex code runs which instruction.

This is why low-level assembly languages were built. By using Assembly, developers can write human-readable machine instructions, which are then assembled into their machine code equivalent, so that the processor can directly run them. This is why some refer to Assembly language as symbolic machine code. For example, the Assembly code 'add rax, 1' is much more intuitive and easier to remember than its equivalent hex machine code (shellcode) '4883C001', and far easier than the equivalent binary machine code '01001000 10000011 11000000 00000001'. As we can see, without Assembly language, it is very challenging to write machine instructions or directly interact with the processor.
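The relationship between the hex and binary forms quoted above can be checked with a short Python sketch. It only converts between the two notations for the same bytes and does not assemble or disassemble anything. The helper names hex_to_binary and binary_to_hex are made up for this illustration.

```python
def hex_to_binary(hex_string: str) -> str:
    """Return the bytes of a hex string as space-separated 8-bit groups."""
    raw = bytes.fromhex(hex_string)
    return " ".join(f"{byte:08b}" for byte in raw)


def binary_to_hex(binary_string: str) -> str:
    """Inverse of hex_to_binary: turn 8-bit groups back into uppercase hex."""
    raw = bytes(int(group, 2) for group in binary_string.split())
    return raw.hex().upper()


if __name__ == "__main__":
    shellcode = "4883C001"  # hex machine code for 'add rax, 1', as quoted in the text
    bits = hex_to_binary(shellcode)
    print(bits)                 # binary form of the same four bytes
    print(binary_to_hex(bits))  # round trip back to the hex form
```

Both forms describe the same four bytes. Hex is simply the more compact way to write them down.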

Machine code is often represented as Shellcode, a hex representation of machine code bytes. Shellcode can be translated back to its Assembly counterpart and can also be loaded directly into memory as binary instructions to be executed.


High-level vs. Low-level

As there are different processor designs, each processor understands a different set of machine instructions and a different Assembly language. In the past, applications had to be written in assembly for each processor, so it was not easy to develop an application for multiple processors. High-level languages, such as C in the early 1970s, were developed to make it possible to write a single, easy-to-understand program that can work on any processor without being rewritten for each one. To be more specific, this was made possible by creating compilers for each language.

When high-level code is compiled, it is translated into assembly instructions for the processor it is being compiled for, which are then assembled into machine code to run on the processor. This is why compilers are built for various languages and various processors: they convert the high-level code into assembly code and then into machine code that matches the running processor.

Later on, interpreted languages were developed, like Python, PHP, Bash, JavaScript, and others, which are usually not compiled but are interpreted during run time. These types of languages utilize pre-built libraries to run their instructions. These libraries are typically written and compiled in other high-level languages like C or C++. So when we issue a command in an interpreted language, it would use the compiled library to run that command, which uses its assembly code/machine code to perform all the instructions necessary to run this command on the processor.


Compilation Stages

Let's take a basic 'Hello World!' program that prints these words on the screen and show how it changes from high-level to machine code. In an interpreted language, like Python, it would be the following basic line:

print("Hello World!")

If we run this Python line, it would essentially be executing the following C code:

#include <unistd.h>

int main()
{
    write(1, "Hello World!", 12);
    _exit(0);
}

Note: the actual C source code is much longer, but the above is the essence of how the string 'Hello World!' is printed. If you are ever interested in knowing more, you can check out the source code of the Python3 print function at this link and this link.
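To see the same call from the Python side without going through print, the sketch below uses os.write, a standard-library function that writes bytes to a file descriptor through the operating system. It is only an illustration of the same three pieces of information (file descriptor, text, length), not a reproduction of how print is implemented. The helper name write_message is made up for this example.

```python
import os


def write_message(fd: int, message: bytes) -> int:
    """Write message to file descriptor fd and return the number of bytes written."""
    return os.write(fd, message)


if __name__ == "__main__":
    # File descriptor 1 is standard output, matching the first argument in the C code.
    written = write_message(1, b"Hello World!")
    # 'Hello World!' is 12 bytes long, matching the third argument in the C code.
    assert written == 12
    # Exit immediately with status 0, mirroring _exit(0) in the C code.
    os._exit(0)
```

Unlike the C version, the length is not passed explicitly here, since Python takes it from the bytes object.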

The above C code uses the Linux write syscall, which processes use to write to a file descriptor; file descriptor 1 is standard output, which is typically the screen. The same syscall called in Assembly looks like the following:

mov rax, 1
mov rdi, 1
mov rsi, message
mov rdx, 12
syscall

mov rax, 60
mov rdi, 0
syscall

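As we can see, whether the write syscall is called from C or from Assembly, it receives the same three arguments: 1 (the file descriptor for standard output), the text, and 12 (the length of the text in bytes). This will be covered more in-depth later in the module. From this point, Assembly code, shellcode, and binary machine code all describe the same instructions, only written in different formats. The previous Assembly code can be assembled into hex machine code (i.e., shellcode). The listing below is abbreviated, showing only the first few bytes of each instruction, and the exact bytes depend on the assembler and on where message is placed in memory: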

48 c7 c0 01
48 c7 c7 01
48 8b 34 25
48 c7 c2 0c
0f 05

48 c7 c0 3c
48 c7 c7 00
0f 05

Finally, for the processor to execute the instructions encoded in this machine code, they have to be represented in binary, which would look like the following:

01001000 11000111 11000000 00000001
01001000 11000111 11000111 00000001
01001000 10001011 00110100 00100101
01001000 11000111 11000010 00001100
00001111 00000101

01001000 11000111 11000000 00111100
01001000 11000111 11000111 00000000
00001111 00000101

A CPU represents a 1 and a 0 with different electrical charges, and hence can carry out these instructions once it receives them as binary data.

Note: With multi-platform languages, like Java, the code is compiled into Java Bytecode, which is the same for all processors/systems, and is then translated to machine code at run time by the local Java Runtime Environment. This extra step is one reason Java can be relatively slower than languages like C++ that compile directly into machine code, which is why languages like C++ are often preferred for processor-intensive applications like games.

We now see how computer languages progressed from assembly language unique for each processor to high-level languages that can work on any device without even needing to be compiled.


Value for Pentesters

Understanding assembly language instructions is critical for binary exploitation, which is an essential part of penetration testing. When it comes to exploiting compiled programs, the only way to attack them would be through their binaries. To disassemble, debug, and follow binary instructions in memory and find potential vulnerabilities, we must have a basic understanding of Assembly language and how it flows through the CPU components.

This is why once we start learning binary exploitation techniques, like buffer overflows, ROP chains, heap exploitation, and others, we will be dealing a lot with assembly instructions and following them in memory. Furthermore, to exploit these vulnerabilities, we will have to build custom exploits that use assembly instructions to manipulate the code while in memory and inject assembly shellcode to be executed.

Learning Intel x86 Assembly Language is crucial for writing exploits for binaries on modern machines. In addition to Intel x86, ARM is becoming more common, as most modern smartphones and some modern laptops like the M1 MacBook Pro feature ARM processors. Exploiting binaries in these systems requires ARM Assembly knowledge. This module will not cover ARM Assembly Language. That being said, Assembly Language basics will undoubtedly be helpful to anyone willing to learn ARM Assembly since the two languages have a lot of similarities.
