Practical DevOps for Big Data/Fraud Detection
This section presents the purpose of fraud detection and its challenges, then explains why and how Big Data can be part of the solution, and finally gives an example of a Big Data application that supports fraud detection: the BigBlu platform.
In the European Union, tax fraud represents over €1,000,000,000,000 (1 trillion) and “it is more than the total spent on health across the EU” according to Jose-Manuel Barroso, president of the European Commission between 2004 and 2014. Countries also have to spend significant sums on fraud detection and countermeasures, which considerably reduces the amounts available for public services such as hospitals, schools and public transport. Tax fraud is therefore a huge problem for governments, costing them large amounts of money each year. For the more than 145 governments affected by this phenomenon, most of them facing recurring economic crises, the issue is protecting billions of euros in revenue. However, when we step back to see the overall picture, we realize that tax fraud is not only about money. It is just the tip of an iceberg that hides many threats. Governments need more efficient control over how money circulates and, to a greater extent, how it is used.
Governments are increasingly using Big Data in multiple sectors to help their agencies manage their operations and to improve the services they provide to citizens and businesses. In this case, Big Data has the potential to make tax agencies faster and more efficient. However, detecting fraud is a difficult task because of the high number of tax operations performed each year and the differences inherent in the way taxes are calculated. The EU's main response is to develop cooperation and information exchange among governments, in particular through IT tools. An application providing instantaneous and secure access to all the tax information would therefore speed up the fraud detection process. However, some issues remain:
- Volume: more and more taxpayers file more tax declarations each year, so the data volume keeps growing;
- Variety: data come from multiple sources (electricity bills, local tax information, etc.);
- Adaptability: new fraud types emerge regularly, so the authorities have to adapt their detection methods continuously;
- Security: the application handles sensitive data, so security is mandatory, not optional;
- Velocity, variability, volatility: detection has to be fast, and data arrive and change rapidly;
- Veracity, validity and value: because the data are sensitive, their quality and correctness matter.
Big Data technologies are designed to address the issues above. Indeed “the appellation ‘Big Data’ is used to name the inability of traditional data systems to efficiently handle new datasets, which are too big (volume), too diverse (variety), arrive too fast (velocity), change too rapidly (variability and volatility), and/or contain too much noise (veracity, validity, and value).” (ISO, Big data – Preliminary Report 2014).
Such an application would not replace human investigation; it would flag potential fraudsters for the authorities to review.
Several technologies for processing Big Data already exist. This section presents two free, open source ones: Apache Spark and Apache Cassandra.
Apache Spark is an open source cluster-computing framework that performs complex analyses on a large scale, such as Big Data processing. Spark uses a simple architecture with master and worker nodes. Together they form a cluster, which is described as horizontal if the nodes run on separate machines or as vertical if they all run on the same machine. Fundamentally, a Spark application configures a job by specifying the master information (port, IP address) and the work to execute. It then submits the job to the master node, which connects to the cluster of workers to manage the resources and execute the job. The job is usually implemented in a language such as Java or Scala (Python and R are also supported). Because the workers execute the job, its code has to be available on every machine that runs a worker. This constraint calls for a DevOps approach, because the developers need to know the infrastructure to set the configuration properly in the Spark application.
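The master/worker flow described above can be imitated in plain Python, without Spark itself. The sketch below is only an analogy under stated assumptions: `JobConfig`, `split`, `run_job` and `count_high_incomes` are hypothetical names, a thread pool stands in for the worker nodes, and the host and port values are purely illustrative. A real application would use the Spark API and a cluster manager instead.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class JobConfig:
    """Hypothetical job description: where the master is and how many workers exist."""
    master_host: str
    master_port: int
    num_workers: int


def split(data: Sequence[int], parts: int) -> List[Sequence[int]]:
    """Cut the input into roughly equal partitions, at most one per worker."""
    size = max(1, -(-len(data) // parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_job(config: JobConfig, data: Sequence[int],
            task: Callable[[Sequence[int]], int]) -> int:
    """Play the master's role: hand each partition to a worker, then merge results."""
    partitions = split(data, config.num_workers)
    with ThreadPoolExecutor(max_workers=config.num_workers) as workers:
        partial_results = list(workers.map(task, partitions))
    return sum(partial_results)


def count_high_incomes(partition: Sequence[int]) -> int:
    """The 'job' every worker runs on its own partition (threshold is illustrative)."""
    return sum(1 for income in partition if income > 50_000)


config = JobConfig(master_host="127.0.0.1", master_port=7077, num_workers=3)
incomes = [12_000, 75_000, 48_000, 51_000, 90_000, 30_000, 50_000]
total = run_job(config, incomes, count_high_incomes)
print(total)
```

The point of the analogy is that the task function must be reachable by every worker, which mirrors the requirement that the job's code be present on each worker machine.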
Apache Cassandra is a free, open source, distributed NoSQL database management system. Its nodes are organized in a ring; each node has an identical role and communicates with the others through a peer-to-peer protocol. Data is stored as key/value pairs and is distributed across the nodes according to a hash of the key. Each node is therefore responsible for a range of hash values, which makes it easy to locate data in the cluster.
When a client sends a request, one node becomes the coordinator and the client communicates only with this node. The coordinator then works with the other nodes and sends the result to the client once the work is finished. Thus, while one node is the coordinator for a client, another node can be the coordinator for another client.
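The ring placement and the coordinator role can be illustrated with a toy model in plain Python. This is a simplified sketch, not Cassandra's implementation: `ToyRing` and `token_for` are hypothetical names, MD5 is used only as a convenient hash, and replication and virtual nodes are ignored.

```python
import bisect
import hashlib
from typing import Dict, List, Optional, Tuple


def token_for(key: str, ring_size: int = 2**32) -> int:
    """Hash a key to a position on the ring (MD5 chosen only for illustration)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % ring_size


class ToyRing:
    """Peer nodes on a ring; any node may coordinate a request."""

    def __init__(self, node_names: List[str]) -> None:
        # Each node sits on the ring at the token derived from its name.
        self._tokens: List[Tuple[int, str]] = sorted(
            (token_for(name), name) for name in node_names
        )
        self._data: Dict[str, Dict[str, object]] = {n: {} for n in node_names}

    def owner_of(self, key: str) -> str:
        """First node whose token is >= the key's token, wrapping around the ring."""
        positions = [token for token, _ in self._tokens]
        index = bisect.bisect_left(positions, token_for(key)) % len(self._tokens)
        return self._tokens[index][1]

    def _check(self, coordinator: str) -> None:
        if coordinator not in self._data:
            raise ValueError(f"unknown node: {coordinator}")

    def write(self, coordinator: str, key: str, value: object) -> str:
        """The coordinator receives the request and forwards it to the owner."""
        self._check(coordinator)
        owner = self.owner_of(key)
        self._data[owner][key] = value
        return owner

    def read(self, coordinator: str, key: str) -> Optional[object]:
        """Any coordinator can serve the read by asking the owning node."""
        self._check(coordinator)
        return self._data[self.owner_of(key)].get(key)


ring = ToyRing(["node-a", "node-b", "node-c"])
owner = ring.write("node-a", "taxpayer:42", {"income": 30_000})
# A different coordinator locates the same row through the key's hash.
row = ring.read("node-c", "taxpayer:42")
print(owner, row)
```

Because the owner is derived from the key alone, two clients using two different coordinators still reach the same node for the same key.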
BigBlu: a Big Data platform to support fraud detection
BigBlu is a Big Data platform, developed as part of the DICE project, that explores how Big Data technologies can support fraud detection. Its main purpose is to explore tax declarations in order to identify potential fraudsters. BigBlu also provides a scalable module for creating and launching customized queries against the database, so that the authorities can adapt to new types of fraud without further development.
BigBlu is divided into three main modules: the dashboard, the detection module and the administration module. This section presents each module.
After authenticating through the login interface, the user reaches the dashboard, which contains the table of launched jobs with their statuses and information about them, such as the detection criteria used or the launch date.
Finished detections have a Show Results button (in the Actions column) that displays a table of all the people who meet the detection criteria, together with statistics. For example, there is a pie chart of the gender distribution. In addition, statistics panels give a quick overview of the number of potential fraudsters and of their average age and income.
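The figures shown on the results page are simple aggregates. As a minimal sketch, assuming a result set held in memory, they could be computed as follows; `FlaggedTaxpayer` and `summarize` are hypothetical names, and the sample values are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class FlaggedTaxpayer:
    """One row of a detection result (fields assumed for this sketch)."""
    name: str
    gender: str
    age: int
    income: float


def summarize(results: List[FlaggedTaxpayer]) -> Dict[str, object]:
    """Compute the count, the averages and the gender shares of a result set."""
    if not results:
        return {"count": 0, "average_age": None,
                "average_income": None, "gender_share": {}}
    genders = Counter(person.gender for person in results)
    return {
        "count": len(results),
        "average_age": mean(person.age for person in results),
        "average_income": mean(person.income for person in results),
        # Fractions that a pie chart could display directly.
        "gender_share": {g: n / len(results) for g, n in genders.items()},
    }


sample = [
    FlaggedTaxpayer("A", "F", 30, 10_000.0),
    FlaggedTaxpayer("B", "M", 40, 20_000.0),
    FlaggedTaxpayer("C", "F", 50, 30_000.0),
]
stats = summarize(sample)
print(stats)
```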
It is important to distinguish between detections and jobs. A detection is a user-created query with several criteria, a name and a description. A job corresponds to a detection launched on a database at a given time. All jobs are therefore launched from the detection table: the user clicks the Launch button of the detection he/she wants to run, which displays a pop-up for selecting the database on which the detection will be launched and then redirects to the dashboard.
From the detection table, it is possible to create, edit and delete a detection. Only detections created by the logged-in user are available. For example, the user lparant cannot access the detections of the user yridene, and vice versa. In addition, the detection criteria can be consulted simply by clicking on the detection line; they then appear below the table as a list. The latest version offers four criteria:
- Income decrease (percentage): select taxpayers whose income decreased by more than the percentage chosen by the user compared with the previous year;
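The difference between a detection and a job can be made concrete with two small data types. This is an illustrative model only: the class and field names, the status value and the database names are assumptions, not BigBlu's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass(frozen=True)
class Detection:
    """A saved query: reusable, and not tied to any database."""
    name: str
    description: str
    owner: str
    criteria: Dict[str, object]


@dataclass
class Job:
    """One execution of a detection on one database at one point in time."""
    detection: Detection
    database: str
    launched_at: datetime
    status: str = "RUNNING"  # hypothetical status label


def launch(detection: Detection, database: str, dashboard: List[Job]) -> Job:
    """Create a job for the chosen database and list it on the dashboard."""
    job = Job(detection, database, datetime.now(timezone.utc))
    dashboard.append(job)
    return job


dashboard: List[Job] = []
detection = Detection(
    name="Sharp income drop",
    description="Income fell by more than 40% in one year",
    owner="analyst",
    criteria={"income_decrease_pct": 40},
)
# The same detection launched on two databases yields two distinct jobs.
launch(detection, "declarations_2016", dashboard)
launch(detection, "declarations_2017", dashboard)
print([(job.database, job.status) for job in dashboard])
```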
- Income interval (two values);
- Gender (Male or Female);
- Age interval (two values).
The user can combine these four criteria to create custom queries. For example, he/she can make a query selecting women under 30 with an income between 5k and 10k.
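A combined query of this kind amounts to a conjunction of optional filters. The sketch below shows one way to express it, with assumptions stated: `Declaration` and `Criteria` are hypothetical names, interval bounds are treated as inclusive, and a criterion left unset is simply ignored.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Declaration:
    """Minimal taxpayer record needed by the four criteria (assumed fields)."""
    gender: str
    age: int
    income: float
    previous_income: float


@dataclass(frozen=True)
class Criteria:
    """Each criterion is optional; the ones that are set are combined with AND."""
    min_income_decrease_pct: Optional[float] = None
    income_range: Optional[Tuple[float, float]] = None
    gender: Optional[str] = None
    age_range: Optional[Tuple[int, int]] = None

    def matches(self, d: Declaration) -> bool:
        if self.min_income_decrease_pct is not None:
            if d.previous_income <= 0:
                return False  # no meaningful percentage can be computed
            drop = (d.previous_income - d.income) / d.previous_income * 100
            if drop <= self.min_income_decrease_pct:
                return False
        if self.income_range is not None:
            low, high = self.income_range
            if not low <= d.income <= high:
                return False
        if self.gender is not None and d.gender != self.gender:
            return False
        if self.age_range is not None:
            youngest, oldest = self.age_range
            if not youngest <= d.age <= oldest:
                return False
        return True


declarations: List[Declaration] = [
    Declaration("F", 25, 8_000, 8_000),
    Declaration("F", 35, 8_000, 8_000),
    Declaration("M", 25, 8_000, 8_000),
    Declaration("F", 28, 12_000, 12_000),
]
# Women under 30 with an income between 5k and 10k.
query = Criteria(gender="F", age_range=(0, 29), income_range=(5_000, 10_000))
selected = [d for d in declarations if query.matches(d)]
print(len(selected))
```

Keeping every criterion optional is what lets new combinations be created without changing the matching code.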
Finally, the website administrator has access to pages for managing databases and users, and to information about the servers BigBlu uses to handle Big Data queries. Because the administration pages do not use Big Data technologies, this section will be concise. In brief, there are two CRUD services: one for the databases and another for the users. Information about the servers is static because the servers are fixed during application development; they correspond to the Spark and Cassandra clusters.
In short, BigBlu is composed of three main parts:
- The user interface or Web client: a web application developed with HTML, CSS, jQuery and Bootstrap.
- The Big Data application: a Spark application developed in Java, using the Spark API and a Cassandra database.
- The Web service: the link between the Web client and the Big Data application. It also manages users, databases and launched detections.
Process: evolution of the platform
To explore fraud detection with a Big Data application, we followed an approach that first establishes feasibility and then adds features incrementally. We started with a POC (Proof of Concept), then built an MVP (Minimum Viable Product), and finally delivered Version 1.0. We highly recommend this process: the POC lets developers discover the technologies safely, the MVP builds on what they learnt during the POC, and the final product is then tested, corrected and completed incrementally.
POC: proof of concept
A POC consists of implementing a demonstrator to prove that an idea is feasible. We therefore started by exploring Big Data technologies such as Apache Spark and Apache Cassandra. We also carried out a technology watch to better understand Big Data and its issues. We recommend the book Big Data: Principles and best practices of scalable real-time data systems by Nathan Marz and James Warren, which explains the issues of processing Big Data and how to solve them.
MVP: minimum viable product
Then we built an MVP with two types of indicators. The first one concerns income decrease: a taxpayer whose income drops sharply may be committing fraud by not declaring part of it. The second one concerns taxpayers who pay low taxes despite a high salary, which is also a potential sign of fraud. During this step, we implemented the whole architecture with a basic interface. This tool could already perform Big Data queries and save their results in Cassandra.
The last version developed was an incremental enhancement of the MVP. First, the user interface (HMI) was redesigned, and we added a module to meet the adaptability requirement: the new detection module allows the creation of custom queries that combine four criteria. A CRUD service was implemented so that users can create, edit and delete their queries.
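The two MVP indicators can be written as simple rules over a tax record. The following is a sketch only: the source does not state the thresholds BigBlu used, so `INCOME_DROP_RATIO`, `HIGH_INCOME` and `LOW_TAX_RATE` are invented values, and `TaxRecord` and `indicators` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TaxRecord:
    taxpayer_id: str
    income: float
    previous_income: float
    tax_paid: float


# Hypothetical thresholds, chosen only to make the example concrete.
INCOME_DROP_RATIO = 0.5   # income fell by more than half
HIGH_INCOME = 100_000.0   # what counts as a "high salary"
LOW_TAX_RATE = 0.05       # tax below 5% of income counts as "low"


def indicators(record: TaxRecord) -> List[str]:
    """Return the names of the indicators this record triggers."""
    flags: List[str] = []
    if (record.previous_income > 0
            and record.income < record.previous_income * (1 - INCOME_DROP_RATIO)):
        flags.append("income_decrease")
    if (record.income >= HIGH_INCOME
            and record.tax_paid < record.income * LOW_TAX_RATE):
        flags.append("low_tax_high_income")
    return flags


records = [
    TaxRecord("t1", 20_000, 60_000, 3_000),     # sharp income drop
    TaxRecord("t2", 150_000, 140_000, 2_000),   # high income, very low tax
    TaxRecord("t3", 40_000, 42_000, 6_000),     # nothing unusual
]
flagged: Dict[str, List[str]] = {}
for record in records:
    triggered = indicators(record)
    if triggered:
        flagged[record.taxpayer_id] = triggered
print(flagged)
```

A triggered indicator marks a record for review; it is a signal for investigators, not proof of fraud.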
|DICE Tools||Benefit for the use case|
|DICE IDE||Useful hub for the other tools. Development activities and tools validation were made inside this IDE.|
|DICER||Rapid design of an execution environment by using concrete concepts. The various concepts specific to Cassandra and Spark are easier to understand with this graphical modelling language.|
|Deployment Service||Increased productivity. This tool spares us the burden of learning how to install and configure Cassandra and Spark on the cloud infrastructure.|
|Configuration Optimisation||By looking at the logs of Spark, this tool was able to find a better configuration.|
|Optimisation Tool||This tool is useful to evaluate the cost impact of the implementation of privacy mechanisms.|
|Simulation Tool||Given a set of requirements, a tax agency can use this tool to evaluate alternative architectures and measure the impact of business logic changes. The simulated models can also be used for documentation purposes.|
- NIST Big Data Public Working Group (2015-09). "NIST Big Data Interoperability Framework: Volume 1, Definitions. Final Version 1". NIST. doi:10.6028/NIST.SP.1500-1. https://bigdatawg.nist.gov/_uploadfiles/NIST.SP.1500-1.pdf.