Mining big annual statement datasets to predict highly lucrative companies using classification trees and forests

Master's Thesis, 2014
98 pages, grade: 1.0

Economics - Statistics and Methods

Excerpt

1. INTRODUCTION AND PROBLEM DESCRIPTION

1.1 INTENTION OF THIS THESIS

1.2 PROCEEDING

2. INTRODUCTION TO KEY FIGURE ANALYSIS

2.1 THE PRINCIPLE OF KEY FIGURES

2.2 THE CLASSICAL KEY FIGURE ANALYSIS APPROACH

2.3 MODERN KEY FIGURE ANALYSIS APPROACHES

2.4 LIMITATIONS OF ANNUAL REPORT ANALYSIS

3. THE AVAILABLE DATASET

3.1 DESCRIPTION OF THE DATASET

3.2 DATA CLEAN-UP

4. KEY FIGURE SELECTION

4.1 SIGNIFICANT KEY FIGURE REQUIREMENTS

4.2 THE SELECTED KEY FIGURES OF THIS ANALYSIS

4.2.1 Selected class variable

4.2.2 Selected qualitative key figures

4.2.3 Selected absolute key figures

4.2.4 Selected relative key figures

4.3 CLASS ANALYSIS

5. CLASSIFICATION TREES AND FORESTS

5.1 PRECONSIDERATIONS

5.2 CLASSIFICATION TREES

5.2.1 A simple example

5.2.2 Generation of classification trees

5.2.3 Pruning an existing tree

5.2.4 Relevant properties of CART trees

5.3 RANDOM FOREST

5.3.1 Classification process of a random forest

5.3.2 Generation of random forest

5.3.3 Relevant properties of random forests

6. CLASSIFICATION RESULTS

6.1 CLASSIFICATION TREE RESULTS

6.1.1 Examination of the most precise tree

6.1.2 Key indicator importance ranking

6.1.3 Transfer to data from 2011

6.2 CLASSIFICATION FOREST RESULTS

6.2.1 Transfer to data from 2011

6.2.2 Key indicator importance ranking

7. CONCLUSION

7.1 CRITICAL ASSESSMENT

7.2 OUTLOOK

Research Objectives & Core Topics

The primary objective of this thesis is to evaluate whether stakeholders can utilize classification trees and random forests to predict exceptionally growing German firms at the beginning of a calendar year, based on annual report key figures from previous years. The research addresses the challenge of analyzing large, complex datasets by implementing a data mining approach based on the CRISP-DM reference model.

Predicting corporate lucrativeness using classification trees and forests.
Comparative analysis of different data mining models on a large-scale financial dataset.
Evaluation of classification performance based on key figures from annual reports.
Investigation of methodologies to handle imbalanced datasets and missing data.
Assessment of the importance of specific financial indicators for predicting growth.

Excerpt from the Book

2.4 Limitations of annual report analysis

To close this chapter, some general limitations of analysing annual statement data need to be pointed out, because they directly influence the quality of the resulting model.

First of all, annual reports are not designed as a foundation for predicting growth; they look at the past, reporting how wealthy the company is and why its assets have changed. Using them for prediction therefore diverts the annual report from its intended use (Franken 2007, 3).

Another problem, especially in the context of small and medium-sized companies, is that their success strongly depends on the manager of the company. Unfortunately, most commonly used datasets contain no information such as the age, gender, or education of this person (Anders and Szczesny 1999, 1-2).

Furthermore, there is often no information about the enterprise’s strategic goals, its capacity for innovation, the professionalism of the manager and his staff, or its customer focus. All these aspects influence whether a company is going to be successful, but they cannot be used because they are either not available at all or very hard to operationalize and would therefore require controversial generalisations (Moro and Schäfer 2004, Fritz 1993, 1, Feldo 2011, 8).

Summary of Chapters

1. INTRODUCTION AND PROBLEM DESCRIPTION: Introduces the research context, the importance of predictive models for finance, and the application of the CRISP-DM methodology for data mining.

2. INTRODUCTION TO KEY FIGURE ANALYSIS: Examines the principles, advantages, and shortcomings of traditional key figure analysis, contrasting them with modern data mining approaches.

3. THE AVAILABLE DATASET: Details the structure and content of the "Amadeus" database used for the analysis, including necessary steps for data cleaning and preparation.

4. KEY FIGURE SELECTION: Discusses the criteria for selecting meaningful financial indicators and defines the class variables used for identifying lucrative firms.

5. CLASSIFICATION TREES AND FORESTS: Provides a detailed explanation of the CART algorithm, the generation of classification trees, random forests, and methods for pruning and variable importance estimation.

6. CLASSIFICATION RESULTS: Presents the findings of the classification tasks, comparing the performance of different models on 2010 data and their predictive capability when transferred to 2011.

7. CONCLUSION: Summarizes the key insights, evaluates the effectiveness of the chosen DM approach, and provides an outlook on future potential improvements.

Keywords

Data Mining, Classification Trees, Random Forest, Financial Statements, Annual Reports, Predictive Modeling, Lucrativeness, Key Figures, CRISP-DM, Corporate Growth, German Firms, Big Data, R, CART, Model Performance

Frequently Asked Questions

What is the core focus of this research?

The research focuses on utilizing data mining techniques—specifically classification trees and random forests—to predict which companies will achieve exceptional growth based on historical financial statement data.

What are the central thematic areas?

The central themes include financial key figure analysis, classification algorithms, data processing of large financial datasets, and the practical application of the CRISP-DM methodology.

What is the primary research question?

The main question is whether a stakeholder can effectively use classification trees or random forests to predict the lucrativeness of German firms at the start of a year, using data from previous years.

Which scientific methodology is employed?

The study employs a supervised learning data mining approach, specifically using the CART algorithm and Random Forests, structured within the CRISP-DM lifecycle.
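As a rough illustration of the CART idea mentioned in this answer, the following Python sketch searches for the best binary split of a single numeric key figure. It assumes Gini impurity as the split criterion, which is the usual CART choice for classification; the excerpt does not state which criterion the thesis uses. The function names and the toy values (an equity ratio and a "lucrative" flag) are hypothetical and are not taken from the thesis or its dataset.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_impurity) of the best split 'x <= t'.

    If no split improves on the parent node, the threshold is None.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between identical values
        threshold = (lo + hi) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Hypothetical toy data: equity ratio in percent and a made-up class flag.
equity_ratio = [5, 12, 18, 25, 40, 55]
lucrative = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(equity_ratio, lucrative)
print(threshold, impurity)
```

A full CART implementation would repeat this search over all key figures, apply it recursively to the resulting child nodes, and prune the grown tree afterwards.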

What topics are discussed in the main section?

The main section covers the selection and justification of financial key figures, the technical generation and pruning of classification trees, the creation of random forests, and a rigorous performance evaluation of these models on real-world datasets.

What characterizes the key terms of this study?

Key terms center around predictive analytics in a corporate finance context, focusing on quantitative metrics, model accuracy, and the ability to interpret model results for decision-makers.

Why are Random Forests used in addition to simple trees?

Random Forests are used because they often provide better classification performance and variance reduction compared to individual "weak learner" trees, though they offer less transparency.
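The ensemble idea behind this answer can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the method used in the thesis: each "tree" is only a one-split stump, trained on a bootstrap sample and on one randomly chosen feature, and the forest classifies by majority vote. A real random forest grows deep trees and draws a random feature subset at every split. All names and the toy data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Fit a one-split tree on one feature, minimising training errors."""
    best = None
    for t in sorted({row[feature] for row in rows}):
        left = [lab for row, lab in zip(rows, labels) if row[feature] <= t]
        right = [lab for row, lab in zip(rows, labels) if row[feature] > t]
        left_class = Counter(left).most_common(1)[0][0]
        # If nothing falls to the right, reuse the left majority class.
        right_class = Counter(right).most_common(1)[0][0] if right else left_class
        errors = sum(lab != left_class for lab in left)
        errors += sum(lab != right_class for lab in right)
        if best is None or errors < best[0]:
            best = (errors, feature, t, left_class, right_class)
    return best[1:]


def predict_stump(stump, row):
    feature, t, left_class, right_class = stump
    return left_class if row[feature] <= t else right_class


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feature = rng.randrange(n_features)
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        forest.append(fit_stump(sample_rows, sample_labels, feature))
    return forest


def predict_forest(forest, row):
    """Classify a row by majority vote over all stumps."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: [equity ratio in %, sales growth in %] per firm.
rows = [[5, 1], [10, 2], [12, 0], [15, 3], [30, 8], [35, 12], [40, 9], [50, 15]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict_forest(forest, [4, 0]), predict_forest(forest, [60, 20]))
```

Because every stump sees a different resample of the data, the errors of individual stumps tend to average out in the vote. The price is the loss of transparency noted in the answer: there is no longer one tree that can be read off directly.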

How does the research address the issue of missing data?

The research leverages the built-in capabilities of classification trees for handling missing data and discusses specific imputation methods for random forests to ensure the large "Amadeus" dataset remains usable.
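The excerpt does not name the imputation methods the thesis discusses. As one common and very simple option, the Python sketch below replaces each missing numeric value with the median of the observed values in the same column before the data is passed to a forest. The function name and the toy values are hypothetical, and median imputation is an assumption chosen for illustration only.

```python
from statistics import median


def impute_median(rows):
    """Replace None in each column with the median of that column's observed values.

    A column without any observed value is left as None.
    """
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        medians.append(median(observed) if observed else None)
    return [
        [medians[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in rows
    ]


# Hypothetical key figures with gaps (None marks a missing value).
rows = [[10.0, None], [20.0, 3.0], [None, 5.0], [40.0, 7.0]]
filled = impute_median(rows)
print(filled)
```

Imputing with a single central value keeps every row usable, but it shrinks the variance of the affected key figure, so it is best treated as a baseline rather than a final choice.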

End of the excerpt from 98 pages

Details

Title: Mining big annual statement datasets to predict highly lucrative companies using classification trees and forests
University: Universität Duisburg-Essen (Economics)
Course: Master's Thesis
Grade: 1.0
Author: B. Sc. Jurij Weinblat
Year of publication: 2014
Pages: 98
Catalog number: V273792
ISBN (eBook): 9783656656258
ISBN (Book): 9783656658870
File size: 1294 KB
Language: English
Keywords: Data Mining, Big Data, Classification Tree, Random Forest, balance sheet, Classification
Product safety: GRIN Publishing GmbH
Price (eBook): US$ 42.99
Price (Book): US$ 55.99

Cite this work: B. Sc. Jurij Weinblat (author), 2014, Mining big annual statement datasets to predict highly lucrative companies using classification trees and forests, Munich, GRIN Verlag, https://www.diplomarbeiten24.de/document/273792