Masterarbeit, 2021
54 Seiten, Note: 2,0
Glossary
1 Introduction
1.1 Problem Statement
1.2 Thesis Structure
2 Background
2.1 Code Analysis
2.1.1 Dynamic Code Analysis
2.1.2 Static Code Analysis
2.2 Code Property Graphs
2.2.1 Components and Structure
2.2.2 Generation Process
2.3 Incremental Parsing
3 Related Work
3.1 Graph-Based Code Analysis
3.1.1 Joern
3.1.2 ProgQuery
3.1.3 Codyze: Code Property Graphs for Security Assessments
3.2 Incremental Parsing
4 Improving the Construction Performance for Incremental Changes
4.1 Approach Overview
4.2 Concurrent Source Code Parsing
4.3 Change Classification
4.4 Performing In-Place Updates
4.4.1 FieldDeclaration Changes
4.4.2 FunctionDeclaration Changes
4.5 Graph Equality Checking Algorithm
5 Evaluation
5.1 Concurrent Source Code Parsing
5.1.1 Experiment Setup
5.1.2 Results
5.2 Incremental Graph Updates
5.2.1 Experiment Setup
5.2.2 Results
6 Conclusion
7 Future Work
7.1 Using a Listener System for Reference Resolution
7.2 Incremental AST Parsing
7.3 Parallelizing the Whole CPG Generation Process
List of Figures
List of Algorithms
Bibliography
Graph-based code analysis systems are a versatile tools for reasoning about the correctness of complex software projects. One area in which they are widely used is in source code auditing: Security vulnerabilities, for example using cryptographic functions with insecure algorithms, can be introduced by coding patterns that spread over the boundaries of several methods, classes or even files in the project. This is where graph-based analysis makes finding these vulnerabilities easier, by creating a framework where the source code can be represented as a graph and vulnerable patterns can be found by performing efficient graph-based pattern matching queries.
There are several graph representations for source code, but this thesis is based on the Code Property Graph (CPG) approach, as it conveniently combines a program's Abstract Syntax Tree (AST), Control Flow Graph (CFG) and Data Flow Graph (DFG) in a single data structure. One drawback of this representation is that its generation is a complex task that can be very time consuming for larger projects. Considering that security analyses are desirably carried out repeatedly, to find out if recent changes introduced new vulnerabilities, long graph generation times may slow down the feedback cycle.
This is why this thesis presents an improved CPG generation process that can take advantage of the fact that most of the changes to a project will not affect the whole program, but will rather have a limited scope: Portions of the CPG generated for the previous code version can be reused, if the corresponding code did not change. The actual file changes are then modeled as incremental updates to the graph, which avoids having to re-generate the CPG for all files in case only a few have been modified.
This technique has been evaluated by analyzing the source code of the CPG implementation that serves as the basis for this thesis, an actively maintained project containing over 200 Java files: While the previous implementation took 60 minutes to sequentially parse 50 commits in this repository, the new incremental approach showed a performance improvement of nearly 80%, taking only 13 minutes for the same commit range.
Graphenbasierte Codeanalysesysteme sind ein vielseitiges Werkzeug, um Rückschlüsse auf die Korrektheit komplexer Softwareprojekte zu ziehen. Ein Bereich, in dem sie häufig eingesetzt werden, ist das Auditieren von Quellcode: Sicherheitsschwachstellen, z. B. die Verwendung kryptographischer Funktionen mit unsicheren Algorithmen, können durch Programmiermuster entstehen, die sich über die Grenzen mehrerer Methoden, Klassen oder sogar Dateien im Projekt erstrecken. Hier erleichtert die graphenbasierte Analyse das Auffinden dieser Schwachstellen, indem sie einen Rahmen bietet, in dem der Quellcode als Graph dargestellt werden kann und unsichere Muster durch effizienten graphenbasierten Musterabgleich gefunden werden können.
Es gibt verschiedene Graphdarstellungen für Quellcode, aber diese Arbeit basiert auf dem Code Property Graph (CPG) Ansatz, da er den abstrakten Syntaxbaum (AST), den Kontrollflussgraphen (CFG) und den Datenflussgraphen (DFG) eines Programms bequem in einer einzigen Datenstruktur kombiniert. Ein Nachteil dieser Darstellung ist, dass ihre Erstellung eine komplexe Aufgabe ist, die bei größeren Projekten sehr zeitaufwendig sein kann. In Anbetracht der Tatsache, dass Sicherheitsanalysen wünschenswerterweise wiederholt durchgeführt werden, um herauszufinden, ob die letzten Änderungen neue Schwachstellen eingeführt haben, kann eine lange Generierungszeit des Graphen den Feedbackzyklus verlangsamen.
Aus diesem Grund wird in dieser Arbeit ein verbesserter CPG-Generierungsprozess vorgestellt, der sich die Tatsache zunutze macht, dass die meisten Änderungen an einem Projekt nicht das gesamte Programm betreffen, sondern eher einen begrenzten Umfang haben werden: Teile des für die vorherige Codeversion generierten CPG können wiederverwendet werden, wenn sich ihr entsprechender Code nicht geändert hat. Die tatsächlichen Datei-Änderungen werden dann als inkrementelle Graphaktualisierungen modelliert, wodurch vermieden wird, dass der CPG für alle Dateien neu generiert werden muss, obwohl nur einige wenige geändert worden sind.
Diese Technik wurde durch die Analyse des Quellcodes der CPG-Implementierung bewertet, die als Grundlage für diese Arbeit dient, ein aktiv gepflegtes Projekt mit über 200 Java-Dateien: Während die vorherige Implementierung 60 Minuten benötigte, um 50 Commits in diesem Repository sequentiell zu analysieren, zeigte der neue inkrementelle Ansatz eine Leistungsverbesserung von fast 80% und benötigte nur 13 Minuten für denselben Commit-Bereich.
Abbildung in dieser Leseprobe nicht enthalten
Automated code analysis tools accompany programmers throughout all of their software development lifecycle: Parsing code on the fly and providing completion hints or other insights has been a core feature for Integrated Development Environments (IDEs) for a long time already. These days more sophisticated analyses give instant feedback about more nuanced types of issues, like erroneous data flows or potential crashes.
Then, after the code has been written and tested locally, most companies and open source projects have Continuous Integration (CI) systems in place, which perform further quality assurance checks. These may only comprise a dynamic test suite, ensuring that all features behave as expected, or contain additional layers that perform static code analysis. One widely used product in the field of static code analysis in a CI context is SonarQube1, a tool that detects several kinds of bugs and vulnerabilities in different programming languages.
Due to its high importance for the whole software industry, improving existing analysis techniques and developing new ones has been a prominent field of research in the past years. One approach for modeling the syntactic and semantic structure of a program needed for further analysis has been to create a graph-based representation, such as the Code Property Graph (CPG) proposed by Yamaguchi et al. in [Yam+14]. This graph enriches the syntactic structure of a program with additional semantic properties that can then be taken into account in subsequent analyses.
Graph-based representations have the benefit that not only the source code is modeled this way but the same also applies to error types and vulnerabilities that are to be detected. In consequence, identifying problematic code happens by finding patterns in the graph that correspond to one of the predefined vulnerability graphs. This can be done by saving it in a graph database, like Neo4j2, and then formulating queries that correspond to these vulnerability patterns. It is also possible to find vulnerabilities that originate from data flows across several methods, which may even be located in different files.
This thesis extends a modified CPG approach that is able to operate on multiple programming languages, i.e. C/C++, Java, Python and Golang, available on GitHub3 [Fra21a]. More in-depth information about the structure of this concrete CPG implementation, as well as an example of how it can be integrated into a system for automated detection of selected vulnerabilities, can be found in [Hop19]. Furthermore, Banse et al. describe how it can be used in the context of cloud security assessments [Ban+21].
In a CI environment, in-depth vulnerability scans and other checks are usually only invoked once per commit. This makes it acceptable for these analyses to take several minutes to complete, as the developer can continue working and check the results asynchronously. But the earlier security analyses are integrated into the development process, the easier it is to fix potential problems. One example for this is the usage of cryptography: If the developer gets instant feedback that they are about to use an API in an insecure way, they can react immediately and fix it before this change would require several other modifications all over the program code.
Thus, finding ways to reduce the runtime of analyses from several minutes down to one minute or even only a few seconds is a crucial requirement when trying to make the resolution of problems as seamless as possible. When trying to achieve this with graph-based analyses, we have two possibilities: On the one hand, we can improve the performance of the query process, where the graph is scanned for patterns that resemble vulnerabilities. But on the other hand, we can also look at the process of how this graph is generated from the source code and reduce its runtime, which is what will be the goal of this thesis:
1. After a program has been analyzed for the first time, any subsequent changed version should not require a full re-generation of the CPG but rather be modeled as an incremental update to the previously generated graph.
2. Unchanged files must not be parsed a second time, and, if possible without much overhead, small source code changes should be integrated as in-place updates that reuse as much as possible of the previous sub-graph.
3. All enhancements that will be proposed need to ensure that they produce the exact same CPG as a full analysis would, in order to provide the same modeling accuracy as before. This is especially important in a security context, as the ability to find vulnerabilities depends on the accuracy of the underlying graph. Performance improvements for incremental changes thus must not produce a different graph as conventional methods.
Chapter 2 provides background information that is relevant for the rest of the thesis, in particular an introduction to CPGs: What components do they contain, how are they generated from a set of source files and how exactly are they integrated into a security analysis workflow? Following that, Chapter 3 highlights some related work in the fields of graph-based and incremental code analysis. Chapter 4 explains the approach of how the current CPG generation process can be extended to incorporate capabilities for incremental graph construction, followed by an evaluation of the practical implementation in Chapter 5. Here, we will see how the proposed changes to the CPG generation process affect its performance in real-world scenarios. Finally, Chapter 6 concludes the thesis and gives a verdict whether the presented approach is in fact suitable for practical use. Chapter 7 provides an outlook to future research possibilities that can further improve the CPG and its new incremental parsing features, especially focusing on ideas that promise substantial speed-ups for the construction process but would entail rather significant changes to the overall architecture of [Fra21a].
Ensuring that a program performs its intended tasks correctly and in a secure way is very important, especially for software that processes sensitive personal data. This is why developers need to perform regular analyses that detect potential bugs and vulnerabilities in their projects. Numerous techniques have been developed that aim to support time-consuming manual code audits, or even fully automate the process of finding code problems as much as possible. These can be classified into two high level categories: Static and dynamic code analysis.
One prominent example of dynamic code analysis is using a suite of tests that are executed to verify key features of a program. Most software projects have some kind of test suite, often containing tests on different abstraction levels, starting at unit tests for individual methods, integration tests that ensure correct interaction between separate program modules and going up to full end-to-end tests that simulate executions of the final product. There are also other ways to dynamically test code, e.g. using fuzzing which automatically creates random inputs and monitors if they crash the program. This is a technique employed at a big scale by Google and other companies to test their own software and selected open source projects [Ary+19].
What all the different kinds of dynamic code analysis have in common, though, is that they rely on executing the program. Static analysis, on the other hand, operates directly on the source code, which typically leads to much faster detection times than dynamic testing. This enables static analysis tools to alert the developer about potential issues even before compiling the code, which makes them an integral part of modern IDEs.
According to the OWASP Foundation, there are several types of techniques that are usually implemented in static analysis tools:
- Lexical Analysis: The raw source code is converted into a tokenized form, where individual words are grouped together in order to form a single token. Analysis is then carried out on the token level.
- Control Flow Analysis: The statements are analyzed in their execution order.
- Data Flow Analysis: Here, we look at the way data is exchanged inside the program, as well as interaction with external data providers.
- Taint Analysis: At which places does the program work with data that has been provided by the user, leading to potential security issues?
[OWA20]
One way of approaching static code analysis is to model the source code as a graph, with a prominent example being the Code Property Graph proposed by Yamaguchi et al. in 2014 [Yam+14]. This concept combines a program's Abstract Syntax Tree with edges that model data and control flow, giving the analyst rich information about the semantic behavior of the source code they are currently auditing.
The following sections will provide an overview of a CPG's structure and how [Fra21a] currently produces the graph from a piece of source code and. Section 3.1.3 will show how a CPG can be used for software security analysis.
A CPG is a labeled directed graph Gcpg = (Vcpg, E cpg , A cpg , ^ cpg). It is a multigraph, meaning that any two nodes p, q G Vcpg can be connected by multiple edges (p, q) C E cpg , each marked with a label provided by the labeling function A cpg . Additionally, its name already implies that a CPG is a property graph. As such, function h cpg assigns an arbitrary amount of properties to any node or edge, with a property consisting of a key and a value, e.g. name BinaryOperator .
Abstract Syntax Tree
The Abstract Syntax Tree forms the basis of a CPG: The source code of a file is broken down into a tree that represents the program's syntactical structure, grouping the simple text form into tokens defined by the language, e.g. statements or declarations. Figure 2.1 shows an example, visualizing the Abstract Syntax Tree (AST) of a small Java program (leaf nodes are depicted in green, nodes with children in blue):
A TranslationUnit represents a whole file, forming the root of an AST. It then has one or more declarations as child nodes. In the case of Java, those are always classes (represented by RecordDeclarations), C and C++ may also have FunctionDeclarations on this level. RecordDeclarations contain further children, in this example one FieldDeclaration and one
Abbildung in dieser Leseprobe nicht enthalten
Figure 2.1: Abstract Syntax Tree of a “Hello World!” program
FunctionDeclaration1. sayHello also has a child node that contains a list of statements and this breakdown goes on until we reach the leaf nodes [SBT19, pp. 5-8].
Formally2, an AST is a labeled directed property graph G ast = ( V ast, E ast, A ast, H ast), with its relationship to the overall CPG being Vast = Vcpg , E ast C E cpg, A ast C A cpg and p ast = H cpg.
Abbildung in dieser Leseprobe nicht enthalten
Figure 2.2: Control Flow Graph of a “Hello World!” program
Control Flow Graph
While an AST allows us to explore the structure of source code, we still do not see what possible execution paths will be during runtime. When we want to find out the order in which the different statements of a function are being executed, for example, the CPG needs to contain further edges that model this piece of information. This sub-graph is called the Control Flow Graph (CFG), formally G cfg = ( V ast , E cfg , ^ cfg , H ast), with E cfg C E cpg and X cfg C X cpg. Consider Figure 2.2 as an example: Here, we added red control flow edges between the statements in the AST of Figure 2.1.
Abbildung in dieser Leseprobe nicht enthalten
Figure 2.3: Evaluation Order Graph of a “Hello World!” program
A CFG is the simplest representation for control flow, as it only shows the ordering of complete statements. It is also the representation that has been described in the original definition of a CPG in [Yam+14], but in practice, more fine-grained representations are often needed. One example for such a more detailed representation of runtime behavior is the Evaluation Order Graph (EOG) used by [Fra21a] (see the red EOG edges in Figure 2.3): This graph not only shows that greeting is first declared before being used in a MemberCallExpression, but also that more complex nodes with multiple children are evaluated in a specific order that might be relevant for more nuanced vulnerability checks.
For incremental construction of CPGs however, it makes no difference for us whether we are dealing with an EOG or a more general control flow representation, which is why the term CFG will be used as a collective term for control flow graphs of arbitrary granularity for the rest of this thesis.
Data Flow Graph
Abbildung in dieser Leseprobe nicht enthalten
Figure 2.4: Data Flow Graph of a “Hello World!” program
Arguably the most important aspect when analyzing a program is its data flow: Techniques like taint analysis rely on the fact that we are able to track how user data moves throughout the program at runtime, which is not yet possible in the combination of AST and CFG. This is where the Data Flow Graph (DFG) comes into play, visualized in green in Figure 2.4: Each DFG edge describes that there is data flowing between its source and sink node, for example here you see that “Hello world” flows from the initial String literal into the greeting variable and later on into the argument for System.out.printin(greeting) , leading to this phrase being printed to the console.
As mentioned before, DFG edges are needed for taint analysis, and they are also useful for detecting hard-coded credentials, insecure cipher configurations being passed to encryption functions and many more situations, all by simply tracking DFG paths of arbitrary length throughout the whole program.
In mathematical notation, the DFG is defined as Gdfg = ( V ast,E dfg,^ dfg,H ast), with E dfg C E cpg and A-dfg C X cpg. Combining all sub-graphs yields the final definition of a CPG: G cpg = ( V AST ( E AST u e cfg u e DFG)z ( ^ AST U A-cfg u A- DFG)z H AST).
Figure 2.5 shows the current process how a CPG is built for a set of source files: First, the AST is built for each file, followed by several passes that add the necessary information needed to transform an AST into a CPG.
Abbildung in dieser Leseprobe nicht enthalten
Figure 2.5: Process of generating a CPG for multiple source files from scratch
AST Parsing
In order to get the AST representation of a piece of source code, a so-called language fronted is used: Depending on the programming language, the target file is passed to an adequate third-party library that is able to parse it and produce the correct AST (Java code is handled by the JavaParser [SBT19], C and C++ by the Eclipse CDT parser [SK11] and there are also experimental frontends for Python and Golang).
A frontend's purpose is to translate the custom AST formats provided by the different parser into a uniform and language-agnostic internal representation. This mapping step makes it possible to efficiently share many of the following graph enrichment steps, e.g. variable resolution, between different languages with only slight modifications.
Pass System
As we have seen in previous sections, a CPG needs further components than just the AST that we just retrieved from external parsers. In order to add CFG and DFG edges, [Fra21a] uses a number of passes that are executed one after another, each iterating over the current graph while performing necessary enhancements.
[...]
1 https://www.sonarqube.org/
2 https://neo4j.com/
3 https://github.com/Fraunhofer-AISEC/cpg
1 In this case we are actually dealing with a special case of a FunctionDeclaration, a so-called MethodDeclaration, as it is contained within a class. In the context of this thesis we will still mostly refer to it as a FunctionDeclaration for the sake of simplicity.
2 Note that while a CPG formally only has AST nodes, actual implementations might add additional implicit or inferred nodes. E.g. [Fra21a] also models type information as nodes or includes implicit return statements in the graph, even if the AST does not contain them.
Der GRIN Verlag hat sich seit 1998 auf die Veröffentlichung akademischer eBooks und Bücher spezialisiert. Der GRIN Verlag steht damit als erstes Unternehmen für User Generated Quality Content. Die Verlagsseiten GRIN.com, Hausarbeiten.de und Diplomarbeiten24 bieten für Hochschullehrer, Absolventen und Studenten die ideale Plattform, wissenschaftliche Texte wie Hausarbeiten, Referate, Bachelorarbeiten, Masterarbeiten, Diplomarbeiten, Dissertationen und wissenschaftliche Aufsätze einem breiten Publikum zu präsentieren.
Kostenfreie Veröffentlichung: Hausarbeit, Bachelorarbeit, Diplomarbeit, Dissertation, Masterarbeit, Interpretation oder Referat jetzt veröffentlichen!
Kommentare