Architecture and Approach

Having understood the various kinds of data involved in this project, it's time to look at how to approach the task of building a solution to this problem statement. The following diagram will help you understand what the entire architecture should look like. Read ahead to understand more.
The data from the several POS systems will flow into the architecture through a queuing system such as Kafka. The POS data from Kafka will be consumed by the stream processing framework to determine the authenticity of each transaction.

Note: One of the SLAs of the company is to complete each transaction within 1 second. Hence, the framework should be chosen accordingly to meet this SLA.

Once the POS data enters the stream processing layer, it is assessed against some parameters defined by the rules. Only when the results are positive for these rules is the transaction allowed to complete. Now, what are these rules? How can one obtain these parameters, and where are they stored? These questions will be answered shortly. Once the status of a transaction is determined as "Genuine" or "Fraudulent", the details of the transaction, along with the status, are stored in the card_transactions table.

Now, let's understand the various parameters defined by the rules required to determine the authenticity of the transactions. Here are the three parameters that we will use to detect whether a transaction is fraudulent or not.

1. Upper Control Limit: Every card user has an upper limit on the amount per transaction, which is different from the maximum transaction limit on each card. This parameter is basically an indicator of the transaction pattern associated with a particular customer. This upper bound, also known as the "Upper Control Limit" or UCL, can be used as a parameter to authenticate a transaction. Suppose you have a past record of making transactions with an average amount of $20,000, and one day the system observes a transaction of $200,000 through your card. This can be a possible case of fraud. In such cases, the cardholder receives a call from the credit card company's executives to validate the transaction.
UCL is derived using the following formula:

UCL = (Moving Average) + 3 × (Standard Deviation)

The above formula is used to derive the UCL value for each card_id. The moving average and the standard deviation for each card_id are calculated based on the amounts of the last 10 transactions with a 'Genuine' status.

2. Credit score of each member: This is a straightforward rule, where we have a member_score table in which member ids and their respective scores are available. These scores are updated from a third-party service. If the score is less than 200, that member's transaction is rejected, as he/she could be a defaulter. This rule simply reflects the financial reputation of each customer.

3. Zip code distance: The whole purpose of this rule is to keep a check on the distance between the locations of the card owner's current and last transactions with respect to time. If the distance between the current and the last transaction location, relative to the time elapsed, is greater than a particular threshold, this raises suspicion about the authenticity of the transaction. Suppose at time t = t0 minutes, a transaction is recorded in Mumbai, and at time t = (t0 + 10) minutes, a transaction from the same card_id is recorded in New York. A plane flies at a cruising speed of about 900 km/hr, which means that someone travelling by air covers roughly 1 km every 4 seconds; a Mumbai-to-New York journey in 10 minutes is therefore impossible, and this can be a possible case of fraud. Such cases happen very often when someone acquires your credit card details and makes transactions online using those details. In such cases, the cardholder receives a call from the credit card company executive to validate the transaction.
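The UCL formula can be illustrated with a minimal Python sketch (the function and sample amounts here are illustrative, not part of the project code; the population standard deviation is used, though the sample standard deviation would be an equally valid reading of the formula):

```python
from statistics import mean, pstdev

def compute_ucl(last_amounts):
    """UCL = moving average + 3 * standard deviation,
    computed over the last (up to 10) 'Genuine' transaction amounts."""
    window = last_amounts[-10:]  # keep only the last 10 amounts
    return mean(window) + 3 * pstdev(window)

# Example: steady spending around $20,000 keeps the UCL near that level,
# so a sudden $200,000 transaction would breach it.
amounts = [19000, 21000, 20000, 20500, 19500, 20200, 19800, 21000, 20000, 19000]
ucl = compute_ucl(amounts)
```

A transaction is then flagged when its amount exceeds the card's stored UCL value.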
Use the postcode library (which will be provided ahead) to get the distance between two zip codes. Now that you know each of the parameters, let's understand the approach to calculating them. Let's start with the Upper Control Limit (UCL). The historical transaction data is stored in the card_transactions table, as defined earlier in the table description. The UCL value has to be calculated for each card_id over its last 10 transactions. One approach could be to trigger the computation of this parameter for a card_id every time a transaction occurs. However, considering the 1-second SLA, this may not be good practice, as batch jobs are always associated with large time delays.
Another approach could be to have a look-up table that stores the UCL value based on the moving average and standard deviation of the last 10 transactions of each card_id. Whenever a transaction occurs, the record corresponding to the card_id can simply be fetched from this look-up table rather than calculating the UCL value at the time of the transaction. This look-up table needs to be updated at regular intervals by running queries on data stored in the AWS RDS and the NoSQL database. Note: You need to use a NoSQL distributed database to implement the look-up table. The database must be scalable and consistent. Use a NoSQL database that supports schema evolution, schema versioning, row-level lookups (efficient reads), and tunable consistency. For every 'card_id', this database must store the UCL value. Use appropriate ingestion methods to bring card_member and member_score data from AWS RDS, and card_transactions data from the NoSQL database, into the Hadoop platform. This data is then processed by running batch jobs to fill the look-up table. After the initial load, there will be incremental loads that should be considered before designing the architecture of the solution.
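To make the read path concrete, here is a minimal sketch of the lookup flow, with an in-memory dict standing in for the NoSQL look-up table (in the actual solution this would be a single-row read against the distributed database, keyed by card_id; all names and values here are illustrative):

```python
# In-memory stand-in for the NoSQL look-up table, keyed by card_id.
# Each row holds the precomputed parameters needed at transaction time.
lookup_table = {
    "card_001": {"ucl": 22000.0, "score": 340,
                 "postcode": "400001", "transaction_dt": "11-02-2018 00:00:00"},
}

def fetch_card_params(card_id):
    """Fetch precomputed parameters for a card at transaction time.
    A key-value read like this is what keeps the check inside the
    1-second SLA, instead of recomputing UCL from history per transaction."""
    return lookup_table.get(card_id)

params = fetch_card_params("card_001")
```

The streaming job only ever performs this cheap read; the expensive aggregation that fills the table runs separately as a batch job.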
Once the transaction data is imported into Hadoop, batch jobs are run on the stored data to calculate the moving average and standard deviation of the last 10 transactions for each card_id. Once the moving average and standard deviation are obtained, the UCL value for each card_id is calculated, and the relevant data is entered into the look-up table.
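The batch step above can be sketched as follows, assuming the transaction history is available as time-ordered (card_id, amount, status) records (a minimal sketch; in the project this aggregation would run as a Hive or Spark job, and all names here are illustrative):

```python
from collections import defaultdict
from statistics import mean, pstdev

def build_ucl_lookup(transactions, window=10):
    """For each card_id, collect its 'Genuine' amounts in time order,
    then compute UCL = moving average + 3 * stddev over the last `window`."""
    genuine = defaultdict(list)
    for card_id, amount, status in transactions:  # assumed time-ordered
        if status == "Genuine":
            genuine[card_id].append(amount)
    return {card: mean(a[-window:]) + 3 * pstdev(a[-window:])
            for card, a in genuine.items()}

txns = [("card_001", 100.0, "Genuine"),
        ("card_001", 300.0, "Genuine"),
        ("card_001", 999.0, "Fraudulent"),  # excluded from the window
        ("card_002", 50.0, "Genuine")]
ucl_by_card = build_ucl_lookup(txns)
```

The resulting per-card values are then written into the look-up table, replacing the previous UCL for each card_id.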
Note: While wrangling data, keep the stages in place for better backtracking of your transformations (keeping track of steps) as a best practice. Suppose you are calculating the UCL in Hive. You might first need a raw table; then, in the next staging table, you might derive the moving average and standard deviation; and finally build the table with the UCL value. This helps manage the layers of your wrangled data. This UCL value is then updated in the look-up table. Since the card_transactions table will be continuously appended with new transactions, it's important that the UCL value gets continuously updated at regular intervals. For this problem statement, the UCL value will be updated in the look-up table every 4 hours. This implies the following:

- The newly added data needs to be moved every 4 hours from the card_transactions table in the NoSQL database to Hadoop
- A batch job needs to be run to calculate the UCL value every time this data is processed
- The new UCL value calculated for each card_id needs to be updated into the look-up table

The second parameter, i.e., the credit score, also gets updated daily by a service in the RDBMS for each card_id. You need to update the credit score in the same look-up table where the UCL value corresponding to each card_id is stored. This job needs to be triggered every 4 hours (the interval after which the credit score might also get updated in the RDBMS by the service).

The third parameter is based on the zip code analysis. Store the 'postcode' and 'transaction_dt' attributes of the last transaction of each card_id in the look-up table. Whenever a new transaction occurs, retrieve the 'postcode' and 'transaction_dt' attributes from the look-up table and compare them with the current transaction's 'postcode' and 'transaction_dt'. Use the API to calculate the speed at which the user would have had to travel from the origin; if it is beyond any imaginable speed, this can be a possible case of fraud. In such cases, the cardholder receives a call from the credit card company executive to validate the transaction. This has to be done in two steps: first, compare the incoming transaction against the stored values; then, once the comparison is complete, overwrite the last-transaction data in the look-up table with the current transaction's data.
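Putting the three rules together, the per-transaction check can be sketched as follows. This is a minimal sketch: `distance_km` stands in for the postcode-distance library mentioned above (assumed here to be a plain function taking two zip codes), and all other names and thresholds besides the score cut-off of 200 are illustrative.

```python
from datetime import datetime

MAX_SPEED_KMH = 900.0  # roughly the cruising speed of a passenger jet
FMT = "%d-%m-%Y %H:%M:%S"  # matches transaction_dt, e.g. "11-02-2018 00:00:00"

def is_genuine(txn, params, distance_km):
    """Apply the three rules. `txn` is the incoming transaction,
    `params` is this card_id's row from the look-up table, and
    `distance_km(zip_a, zip_b)` is the assumed distance helper."""
    if txn["amount"] > params["ucl"]:
        return False  # rule 1: amount breaches the Upper Control Limit
    if params["score"] < 200:
        return False  # rule 2: member's credit score too low
    hours = (datetime.strptime(txn["transaction_dt"], FMT)
             - datetime.strptime(params["transaction_dt"], FMT)
             ).total_seconds() / 3600
    dist = distance_km(txn["postcode"], params["postcode"])
    if hours <= 0 or dist / hours > MAX_SPEED_KMH:
        return False  # rule 3: physically impossible travel speed
    return True
```

After this check, regardless of the verdict, the look-up table row for the card is overwritten with the current transaction's 'postcode' and 'transaction_dt' so the next transaction compares against it.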
Copyright © 2019 | Truelancer.com