DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES HANDBOOK

PART 01: DATA METHODS AND APPLICATIONS
PART 02: DATA PROJECT FRAMEWORKS

ACKNOWLEDGEMENTS

IFC and The MasterCard Foundation's Partnership for Financial Inclusion would like to acknowledge the generous support of the institutions who participated in the case studies for this handbook: Airtel Uganda, Commercial Bank of Africa, FINCA Democratic Republic of Congo, First Access, Juntos, Lenddo, MicroCred, M-Kopa, Safaricom, Tiaxa, Tigo Ghana, and Zoona. Without the participation of these institutions, this handbook would not have been possible. IFC and The MasterCard Foundation would like to extend special thanks to the authors Dean Caire, Leonardo Camiciotti, Soren Heitmann, Susie Lonie, Christian Racca, Minakshi Ramji, and Qiuyan Xu, as well as to the reviewers and contributors: Sinja Buri, Tiphaine Crenn, Ruth Dueck-Mbeba, Nicolais Guevara, Joseck Mudiri, Riadh Naouar, Laura Pippinato, Max Roussinov, Anca Bogdana Rusu, Matthew Saal, and Aksinya Sorokina. Lastly, the authors would like to extend a special thank you to Anna Koblanck and Lesley Denyes for their extensive editing support.

ISBN Number: 978-0-620-76146-8
First Edition 2017

Foreword

This is the third handbook on digital financial services (DFS) produced and published by the Partnership for Financial Inclusion, a joint initiative of IFC and The MasterCard Foundation to expand microfinance and advance DFS in Sub-Saharan Africa. The first handbook in the series, the Alternative Delivery Channels and Technology Handbook, provides a comprehensive guide to the components of digital financial technology, with particular focus on the hardware and software building blocks for successful deployment. The second handbook, Digital Financial Services and Risk Management, is a guide to the risks associated with mobile money and agent banking, and offers a framework for managing these risks. This handbook is intended to provide useful guidance and support on how to apply data analytics to expand and improve the quality of financial services.

This handbook is designed for any type of financial services provider offering or intending to offer digital financial services. DFS providers include all types of institutions, such as microfinance institutions, banks, mobile network operators, fintechs and payment service providers. Technology-enabled channels, products and processes generate hugely valuable data on customer interactions; at the same time, linkages to the increasingly available pools of external data can be enabled. The handbook offers an overview of the basic concepts, identifies usage trends in the market, and also illustrates a range of practical applications and cases of DFS providers that are translating their own or external data into business insights. It also offers a framework to guide data projects for DFS providers that wish to leverage data insights to better meet customer needs and to improve operations, services and products. The handbook is meant as a primer on data and data analytics, and does not assume any previous knowledge of either. However, it is expected that the reader understands DFS, and is familiar with the products, the function of agents, aspects of operational management, and the role of technology.

The handbook is organized as follows:

Introduction: Introduces the handbook and establishes the broad platform and definitions for DFS and data analytics.

Part 1: Data Methods and Applications

Chapter 1.1: Discusses data science in the context of DFS and provides an overview of the data types, sources and methodologies and tools used to derive insights from data.

Chapter 1.2: Describes how to apply data analytics to DFS. The chapter summarizes techniques used to derive market insights from data, and describes the role data can play in improving the operational management of DFS. The chapter includes seminal, real-life examples and case studies of lessons learned by practitioners in the field. It ends with an outline of how practitioners can use data to develop algorithm-based credit scoring models for financial inclusion.

Part 2: Data Project Frameworks

Chapter 2.1: Offers a framework for data project implementation and a step-by-step guide to solve practical business problems by applying this framework to derive value from existing and potential data sources.

Chapter 2.2: Provides a directory of data sources and technology resources as well as a list of performance metrics for assessing data projects. It also includes a glossary that provides descriptions of terms used in the handbook and in industry practice.

Conclusion: Includes lessons learned from data projects thus far, drawing on IFC's experience in Sub-Saharan Africa with the MasterCard Foundation's Partnership for Financial Inclusion program.

[Figure: The Data Ring diagram – Data analytics & methods; Data applications; Managing a data project; Resources]

CONTENTS

FOREWORD
ACRONYMS
EXECUTIVE SUMMARY
INTRODUCTION

PART 1: DATA METHODS AND APPLICATIONS
Chapter 1.1: Data, Analytics and Methods
  Defining Data
  Sources of Data
  Data Privacy and Customer Protection
  Data Science: Introduction
  Methods
  Tools
Chapter 1.2: Data Applications for DFS Providers
  1.2.1 Analytics and Applications: Market Insights
  1.2.2 Analytics and Applications: Operations and Performance Management
  1.2.3 Analytics and Applications: Credit Scoring

PART 2: DATA PROJECT FRAMEWORKS
Chapter 2.1: Managing a Data Project
  The Data Ring
  Structures and Design
  GOAL(S)
  Quadrant 1: TOOLS
  Quadrant 2: SKILLS
  Quadrant 3: PROCESS
  Quadrant 4: VALUE
  APPLICATION: Using the Data Ring
Chapter 2.2: Resources
  2.2.1 Summary of Analytical Use Case Classifications
  2.2.2 Data Sources Directory
  2.2.3 Metrics for Assessing Data Models
  2.2.4 The Data Ring and the Data Ring Canvas
CONCLUSIONS AND LESSONS LEARNED
GLOSSARY
AUTHOR BIOS

ACRONYMS

ADC Alternative Delivery Channel
AI Artificial Intelligence
AML Anti-Money Laundering
API Application Programming Interface
ARPU Average Revenue Per User
ATM Automated Teller Machine
BI Business Intelligence
CBA Commercial Bank of Africa
CBS Core Banking System
CDO Chief Data Officer
CDR Call Detail Records
CFT Countering Financing of Terrorism
CGAP Consultative Group to Assist the Poor
COT Commission on Transaction
CRISP-DM Cross Industry Standard Process for Data Mining
CRM Customer Relationship Management
CSV Comma-separated Values
DB Database
DFS Digital Financial Services
DOB Date of Birth
DRC Democratic Republic of Congo
ETL Extraction-Transformation-Loading
EU European Union
FI Financial Institution
FSD Financial Sector Deepening
FSP Financial Services Provider
FTC Federal Trade Commission
GLM Generalized Linear Model
GPS Global Positioning System
GSM Global System for Mobile Communications
GSMA Global System for Mobile Communications Association
ICT Information and Communication Technology
ID Identification Document
IFC International Finance Corporation
IP Intellectual Property
IT Information Technology
JSON JavaScript Object Notation
KCB Kenya Commercial Bank
KPI Key Performance Indicator
KRI Key Risk Indicator
KYC Know Your Customer
LOS Loan Origination System
MEL Monitoring, Evaluation and Learning
MFI Microfinance Institution
MIS Management Information System
MNO Mobile Network Operator
MSME Micro, Small and Medium Enterprise
MVP Minimum Viable Product
NDA Non-Disclosure Agreement
NLP Natural Language Processing
NPL Non-Performing Loan
OLA Operating Level Agreement
OTC Over the Counter
P2P Person to Person
PAR Portfolio at Risk
PBAX Private Branch Automatic Exchange
PIN Personal Identification Number
POS Point of Sale
PSP Payment Service Provider
QA Quality Assurance
RCT Randomized Control Trial
RFP Request for Proposal
SIM Subscriber Identity Module
SLA Service Level Agreement
SME Small and Medium Enterprise
SMS Short Message Service
SNA Social Network Analysis
SQL Structured Query Language
SVM Support Vector Machine
SVN Support Vector Network
TCP Transmission Control Protocol
TPS Transactions Per Second
UN United Nations
USSD Unstructured Supplementary Service Data

Executive Summary

International Finance Corporation (IFC) supports institutions seeking to develop digital financial services (DFS) for the expansion of financial inclusion and is engaged in multiple projects across a range of markets through its portfolio of investments and advisory projects.
As of 2017, through its work with The MasterCard Foundation and other partners, IFC works with DFS providers across Sub-Saharan Africa on expanding financial inclusion through digital products and services.

"Let the dataset change your mindset." – Hans Rosling

Interactions with clients as well as the broader industry in the region and beyond have identified the need for a handbook on how to use the emerging field of data science to unlock value from the data emerging from these implementations. Even though data analytics offers an opportunity for DFS providers to know their customers at a granular level and to use this knowledge to offer higher-quality services, many practitioners are yet to implement a systematic, data-driven approach in their operations and organizations. There are a few examples that have received a lot of attention due to their success in certain markets, such as the incorporation of alternative data in order to evaluate the credit risk of new types of customers. However, the promise of data goes beyond one or two specific applications. Common barriers to the application of data insights for DFS include a lack of knowledge, scarcity of skills and discomfort with an unfamiliar approach. This handbook seeks to provide an overview of the opportunity for data to drive financial inclusion, along with steps that practitioners can take to begin to adopt a data-driven approach into their businesses and to design data-driven projects to solve practical business problems.

In the past decade, DFS have transformed the customer offering and business model of the financial sector, especially in developing countries.
Large numbers of low-income people, micro-entrepreneurs, small-scale businesses, and rural populations that previously did not have access to formal financial services are now digitally banked by a range of old and new financial services providers (FSPs), including non-traditional providers such as mobile network operators (MNOs) and emerging fintechs. This has proven to impact quality of life, as illustrated in Kenya, where a study conducted by researchers at the Massachusetts Institute of Technology (MIT) has demonstrated that the introduction of technology-enabled financial services can help reduce poverty.1 The study estimates that since 2008, access to mobile money services that allow users to store and exchange money increased daily per capita consumption levels for 194,000 people, or roughly two percent of Kenyan households, in effect lifting them out of extreme poverty. The impact was most prominent among households headed by women, often considered particularly economically marginalized. This is a good argument for broader and deeper financial inclusion in Sub-Saharan Africa and other emerging economies. Data and data analytics can help achieve this.

It is estimated that approximately 2.5 quintillion bytes of data are produced in the world every day.2 To get a sense of the quantity, this amount of data exceeds 10 billion high-definition DVDs. Most of these data are young – 90 percent of the world's existing data were created in the last two years.3 The recent digital data revolution extends as much to the developing world as to the developed world. In 2016, there were 7.8 billion mobile phone subscriptions in the world, of which 74 percent were in developing nations.4 The future is expected to be even richer in data. As the costs of smartphones fall, mobile internet access is set to rise from 44 percent in 2015 to 60 percent in 2020. In Sub-Saharan Africa, smartphone usage is predicted to rise from 25 percent in 2015 to 50 percent of all connections by 2020.5 Everyday objects are also increasingly being enabled to send and receive data, connecting and communicating directly with one another and through user interfaces in smartphone applications, known as the Internet of Things.6 While this is primarily a developed country phenomenon, there are also examples from the developing world. In East Africa, for example, there are solar devices that produce information about the unit's usage and DFS repayments made by the owner. Data are then used to perform instant credit assessments that can ultimately drive new business. For DFS providers, data can be drawn from an ever-expanding array of sources: transactional data, mobile call records, call center recordings, customer and agent registrations, airtime purchase patterns, credit bureau information, social media posts, geospatial data, and more.

These emerging sources of data have the capacity to positively impact financial inclusion. Analytics can improve the business processes of institutions that serve low-income households by allowing them to identify and engage new customers more efficiently. Thus, data can help financial institutions (FIs) acquire new and previously excluded people. It also deepens financial inclusion as existing customers increase their use of financial products. At the same time, policymakers and other public stakeholders can now obtain a detailed view of financial inclusion by looking at access, usage and other trends. This evidence can play a role in developing future policies and strategies to improve financial inclusion.

The increased availability of data presents challenges as well as opportunities. The major challenge is how to leverage the utility of data while also ensuring people's privacy. A large proportion of newly available data are passively produced as a result of our interactions with digital services such as mobile phones, internet searches, online purchases, and electronically stored transactions. Characteristics about individuals can be inferred from complex algorithms that make use of these data, made possible due to advances in analytical capability. Thus, privacy is further compromised by the fact that primary generators of data are unaware of the data they are generating and the ways in which they can be used. As such, companies and public sector stakeholders must put in place the appropriate safeguards to protect privacy. There must be clear policies and legal frameworks, both at national and international levels, that protect the producers of data from attacks by hackers and demands from governments, while also stimulating innovation in the use of data to improve products and services. At the institutional level as well, there should be clear policies that govern customer opt-in and opt-out for data usage, data mining, re-use of data by third parties, transfer, and dissemination.

The usage of data is relevant across the life cycle of a customer in order to gain a deeper understanding of their needs and preferences. There are three broad applications for data in DFS: developing market insights, improving operational management, and credit scoring. The handbook makes extensive use of case studies in order to demonstrate the use of data analytics for practitioners. Notably, the universe of data is ever-expanding and analytical capabilities are also improving with gains in technological capacity. As such, the potential for the use of data extends far beyond the applications described in this handbook.

Developing data-driven market insights is key to developing a customer-centric business. Understanding markets and clients at a granular level will allow practitioners to improve client services and resolve their most important needs, thereby unlocking economic value. A customer-centric business understands customer needs and wants, ensuring that internal and customer-facing processes, marketing initiatives and product strategy are the result of data science that promotes customer loyalty. From an operations perspective, data play an important role in automating processes and decision-making, allowing institutions to become scalable quickly and efficiently. Here data also play an important role in monitoring performance and providing insights into how it can be improved. Finally, widespread internet and mobile phone usage are sources of new data, which allow DFS providers to make a more accurate risk assessment of previously excluded people who do not have formal financial histories to support their loan applications.

The handbook describes the steps that practitioners may take to understand the essential elements required to design a data project and implement it in their own institutions. Two tools are introduced to guide project managers through these steps: the Data Ring and the complementary Data Ring Canvas. The Data Ring is a visual checklist whose circular form centers the 'heart' of any data project as a strategic business goal. The goal-setting process is discussed, followed by a description of the core resource categories and design structures needed to implement the project. These elements include hard resources, such as the data itself, along with software tools, processing and storage hardware; as well as soft resources, including skills, domain expertise and human resources needed for execution. This section also describes how these resources are applied during project execution to tune results and deliver value according to a defined implementation strategy.

The complementary tool incorporates these structural design elements into a Canvas, a space where project managers can articulate and lay out the key resources and definitions in an organized and interconnected way. The tools help to define the interconnected relationships across project design structures – to visually see how the pieces link together, to identify where gaps may exist, or where resource requirements need adjustment. The Canvas approach also serves as a communications tool, providing a high-level project design schematic on one sheet of paper that may be updated and discussed throughout project implementation.

Finally, resource tables are provided. The data directory enumerates prominent sources of data available to DFS practitioners and a brief overview of their potential application in a data project. The technology database lists essential tools in the data science industry and prominent commercial products for data management, analysis, visualization and dashboard reporting. There is also a list of metrics for assessing data models that would be commonly discussed by external consultants or analytic vendors. Copies of the Data Ring tools may be downloaded for reference or use.

The handbook makes extensive use of case studies in order to illustrate the experiences of a diverse set of DFS providers in implementing data projects within their organizations. While these practitioners are primarily based in Africa and are offering DFS to their customers in the form of mobile money or agent banking, this is not to say that data-driven insights cannot be used by any type of FSP using different business models. A common thread seen in all of these cases is that institutions can systematically develop their data capabilities starting with small steps. Becoming a data-led organization with competitive data-driven activities is a journey that requires long-term vision and commitment. It may require changes to organizational culture and upgrades to existing internal capacities. Importantly, institutions must ensure that processes through which data are collected, stored and analyzed respect individual privacy.

The handbook is intended to provide useful guidance and support to DFS providers to expand financial inclusion and to improve institutional performance. Data science offers a unique opportunity for DFS providers to know their customers, agents and merchants, as well as improve their internal operational and credit processes, using this knowledge to offer higher-quality services. Data science requires firms to embrace new skills and ways of thinking, which may be unfamiliar to them. However, these skills are acquirable and will allow DFS practitioners to optimize both institutional performance and financial inclusion.

1 Suri and Jack, 'The Long Run Poverty and Gender Impacts of Mobile Money', Science Vol. 354, Issue 6317 (2016): 1288-1292.
2 'The 4 Vs of Big Data', IBM Big Data Hub, accessed April 3, 2017, https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
3 'The 4 Vs of Big Data', IBM Big Data Hub, accessed April 3, 2017, https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
4 'The Mobile Economy 2017', GSMA Intelligence.
5 'Global Mobile Trends', GSMA Intelligence.
6 'Internet of Things', Wikipedia, The Free Encyclopedia, accessed April 3, 2017, https://en.wikipedia.org/w/index.php?title=Internet_of_things&oldid=773435744
Introduction

Previously unbanked individuals in emerging markets are increasingly accessing formal financial services through digital channels. Ubiquitous computing power, pervasive connectivity, mass data storage, and advanced analytical technologies are being harnessed to deliver tailored financial products and services more efficiently and more directly to a broader range of customers; collectively, these products and services are referred to as digital financial services (DFS). DFS providers, i.e., institutions that leverage digital channels to provide financial services, comprise a diverse set of institutions including traditional FSPs, such as banks and microfinance institutions (MFIs), as well as emerging FSPs such as MNOs, fintechs and payment service providers (PSPs).

Data is a term used to describe pieces of information, facts or statistics that have been gathered for any kind of analysis or reference purpose. Data exist in many forms, such as numbers, images, text, audio, and video. Having access to data is a competitive asset. However, data are meaningless without the ability to interpret them and use them to improve customer centricity, drive market insights and extract economic value. Analytics are the tools that bridge the gap between data and insights. Data science is the term given to the analysis of data, which is a creative and exploratory process that borrows skills from many disciplines including business, statistics and computing. It has been defined as 'an encompassing and multidimensional field that uses mathematics, statistics, and other advanced techniques to find meaningful patterns and knowledge in recorded data'.7 Traditional business intelligence (BI) tools have been descriptive in nature, while advanced analytics can use existing data to predict future customer behavior.
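The difference between descriptive BI and predictive analytics can be made concrete with a small sketch. The monthly transaction volumes and the simple least-squares trend line below are invented for illustration and are not drawn from the handbook; real deployments would use richer models and real data.

```python
# Illustrative sketch: descriptive BI reports what happened;
# predictive analytics extrapolates what may happen next.
# The monthly figures below are hypothetical.

months = [1, 2, 3, 4, 5, 6]
transactions = [1200, 1350, 1510, 1640, 1820, 1960]  # hypothetical volumes

# Descriptive: summarize the past.
average = sum(transactions) / len(transactions)

# Predictive: fit a simple least-squares trend line y = a + b*x
# and project the volume for month 7.
n = len(months)
mean_x = sum(months) / n
mean_y = average
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, transactions)) \
    / sum((x - mean_x) ** 2 for x in months)
a = mean_y - b * mean_x
forecast_month_7 = a + b * 7

print(f"Average monthly volume (descriptive): {average:.0f}")
print(f"Forecast for month 7 (predictive):    {forecast_month_7:.0f}")
```

The descriptive figure answers "how did we do?"; the forecast answers "what should we plan for?", which is the shift from reporting to prediction described above.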
The interdisciplinary nature of data science implies that any data project needs to be delivered through a team that can rely on multiple skill sets. It requires input from the technical side. However, it also requires involvement from the business team. As Figure 1 illustrates, the translation of data into value for firms and financial inclusion is a journey. Understanding the sources of data and the analytical tools is only one part of the process. This process is incomplete without contextualizing the data firmly within the business realities of the DFS provider. Furthermore, the provider must embed the insights from analytics into its decision-making processes.

7 'Analytics: What is it and why it matters?', SAS, accessed April 3, 2017, https://www.sas.com/en_za/insights/analytics/what-is-analytics.html

Figure 1: The Data Value Chain: From Data to Decision-Making (Data → Analytics → Applications → Decision-Making)

For DFS providers, data analytics presents a unique opportunity. DFS providers are particularly active in emerging markets and increasingly serve customers who may not have formal financial histories such as credit records. Serving such new markets can be particularly challenging. Uncovering the preferences and awareness levels of new types of customers may take extra time and effort. As the use of digital technology and smartphones expands in emerging markets, DFS providers are particularly well-positioned to take advantage of data and analytics to expand customer base and provide a higher-quality service. Data analytics can be used for a specific purpose such as credit scoring, but can also be employed more generally to increase operational efficiency. Whatever the goal, a data-driven DFS provider has the ability to act based on evidence, rather than anecdotal observation or in reaction to what competitors are doing in the market.

At the same time, it is important to raise the issue of consumer protection and privacy, as the primary producers of data may often be unaware of the fact that data are being collected, analyzed and used for specific purposes. Inadequate data privacy can result in identity theft and irresponsible lending practices. In the context of digital credit, policies are required to ensure that people understand the implications of the data they are sharing with DFS providers and to ensure that they have access to the same data that the provider can access. In order to develop policies, stakeholders such as providers, policymakers, regulators, and others will need to come together to discuss the implications of privacy concerns, possible solutions and a way forward. For those in the financial inclusion sector, providers can proactively educate customers about how information is being collected and how it will be used, and pledge to only collect data that are necessary without sharing this information with third parties.

PART 1: Data Methods and Applications

Chapter 1.1: Data, Analytics and Methods

The increasing complexity and variety of data being produced has led to the development of new analytic tools and methods to exploit these data for insights. The intersection of data and their analytic toolset falls broadly under the emerging field of data science. For digital FSPs who seek to apply data-driven approaches to their operations, this section provides the background to identify resources and interpret operational opportunities through the lens of the data, the scientific method and the analytical toolkit.
Defining Data

Data are samples of reality, recorded as measurements and stored as values. The manner in which the data are classified, their format, structure and source determine which types of tools can be used to analyze them.

Data can be either quantitative or qualitative. Quantitative data are generally bits of information that can be objectively measured, for example, transactional records. Qualitative data are bits of information about qualities and are generally more subjective. Common sources of qualitative data are interviews, observations or opinions, and these types of data are often used to judge customer sentiment or behavior.

Data are also classified by their format. In the most basic sense, this describes the nature of the data: number, image, text, voice, or biometric, for example. Digitizing data is the process of taking these bits of measured or observed 'reality' and representing them as numbers that computers understand. The format of digitized data describes how a given measurement is digitally encoded. There are many ways to encode information, but any piece of digitized information converts things into numbers that can drive an analysis, thus serving as a source of potential insight for operational value. The format classification is critical because that format describes how to turn the digital information back into a representation of reality and how to use the right data science tools to obtain analytic insights.

To be available for analysis, data must be stored. They can be stored in either a structured or unstructured way. Structured data have a set of attributes and relationships that are defined during the database design process; these data fit into a predetermined organization, also known as a schema. In a structured database, all elements in the database will have the same number of attributes in a specific sequence. Transactional data are generally structured; they have the same characteristics and are saved in the same way. Structured data are more easily queried and analyzed. Unstructured data are not organized according to predetermined schemas. They are flexible to grow in form and shape, where reliable attributes may or may not exist. This makes them more difficult to analyze, but their flexibility is an advantage as more data are quickly generated from new sources such as social media, emails, mobile applications, and personal devices. Unstructured data have the advantage of being able to be saved as-is, without the need to check if they satisfy any organizational rules. This makes storing them fast and flexible. There are also data that are considered semi-structured. Consider a Twitter tweet, for example, which is limited to 140 characters. This is a predetermined organizational structure, and the service is programmed to check that each and every tweet satisfies this requirement. However, the content of what is written in a tweet is neither predefined nor enforced; this practically infinite combination of words and letters exemplifies unstructured data. As a whole, the tweet is therefore semi-structured data.

Data are also classified by their source. FSPs tend to categorize data sources as either traditional or non-traditional, where traditional data sources refer to internal data sources such as core account management system transactions, client surveys, registration forms, or demographic information. Traditional data sources also include external sources such as credit bureaus. They are typically structured data. Non-traditional data, or alternative data, can be structured, semi-structured or unstructured, and they may not always be related to financial services usage. Examples of these kinds of data include voice and short message service (SMS) usage data from MNOs, satellite imagery, geospatial data, social media data, emails, or other proxy data. These types of data sources are increasingly used by FSPs to extend or deepen customer understanding, or are used in combination with traditional data for operational insights. For example, an MFI that wishes to partner with a dairy cooperative to extend loans to dairy farmers might use milk yields as a proxy for salary in order to assess the ability to provide credit to farmers who lack any formal credit history.8

8 Transcript of the session 'Deploying Data to Understand Clients Better', The MasterCard Foundation Symposium on Financial Inclusion 2016, accessed April 3, 2017, http://mastercardfdnsymposium.org/resources/

What is Big Data?

Big data is typically the umbrella term used to describe the vast scale and unprecedented nature of the data that are being produced. Big data has five characteristics. Early big data specialists identified the first three characteristics listed below and still refer to 'the three Vs' today. Since then, big data characteristics have grown to the longer list of five:

1. Volume: The sheer quantity of data currently produced is mindboggling. These data are also increasingly young, meaning that the amount of data that are less than a minute old is rising consistently. It is expected that the amount of data in the world will increase 44 times between 2009 and 2020.

2. Velocity: A large proportion of the data available are produced and made available on a real-time basis. Every minute, 204 million emails are sent. As a consequence, these data are processed and stored at very high speeds.

3. Variety: The digital age has diversified the kinds of data available. Today, 80 percent of the data that are generated are unstructured, in the form of images, documents and videos.

4. Veracity: Veracity refers to the credibility of the data.
Business managers need to know that the data they use in the decision-making process are representative of their customers' needs and desires. It is therefore important to ensure a rigorous and ongoing data cleaning process.

5. Complexity: Combining the four attributes above requires complex and advanced analytical processes, and such processes have emerged to deal with these large datasets.

Sources of Data

This section focuses on the key sources of information that DFS providers might consider for possible operational or market insights. Importantly, a data source should not be considered in isolation; combining multiple sources of data will often lead to an increasingly nuanced understanding of the realities that the data encode. Chapter 2.2 on DFS data collection and storage provides an overview of the most common traditional and alternative sources of data available to DFS providers.

Traditional Sources of Data

As mentioned above, FSPs have traditionally sourced data from customer records, transactional data and primary market research. Much of the credit-relevant data have been stored as documents (hard or soft paper copies), and only basic customer registration and banking activity data were kept in centralized databases. A challenge for FSPs today is to ensure that these types of traditional data are also stored in a digital format that facilitates data analysis. This may require a change in how the data are collected, or the introduction of technology that converts data to a digital format. Although new technology is available to digitize traditional data, digitization may be too big a task for legacy data.

Client and Agent Data

Practitioners collect a vast amount of information about their customers during registration and loan application processes, both for business reasons and to comply with regulation. Similarly, they also collect information about their agents as part of the application process and during monitoring visits. For both categories, this may include variables such as gender, location and income. Some of these data are verified by official documents, while some are discussed and captured during interviews. In the case of borrowers, much of this client information is captured digitally in a loan origination system (LOS) or an origination module in the core banking system (CBS). It is surprisingly common for such information to remain only on paper or in scanned files.

Primary Market Research

Market research is generally used to better understand customers and market segments, track market trends, develop products, and seek customer feedback. It can be either qualitative or quantitative, and it may be helpful to understand both how and why customers use products. Mystery shopping is a common market research method to test whether agents provide good customer service, while some DFS providers seek direct customer feedback with surveys that create a Net Promoter Score gauging how willing customers are to recommend a product or service.

Call Center Data

Call center data are a good source for understanding what issues customers face and how they feel about a provider's products and customer service. Call center data can be analyzed by categorizing call types and resolution times and by using speech analytics to examine the audio logs. Call center data are particularly useful to understand issues that customers, agents or merchants are having with products or new technology that has just been launched.

Third Parties

Credit bureaus and registries are excellent sources of objective and verifiable data. They provide a credibility check on the information reported by loan applicants and can often reveal information that the applicant may not willingly disclose. Most credit bureau reports and public registries can now be queried online with relevant data accessed digitally. However, a challenge is that not all emerging markets have fully functioning credit reporting infrastructure.
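The Net Promoter Score mentioned under Primary Market Research can be computed directly from 0-10 survey responses: the percentage of promoters (ratings of 9 or 10) minus the percentage of detractors (ratings of 0 to 6). The sketch below is a minimal illustration with invented ratings; the function name and data are not from this handbook.

```python
from collections import Counter

def net_promoter_score(ratings):
    """Compute a Net Promoter Score from 0-10 survey ratings.

    Promoters rate 9-10, detractors 0-6; the score is the
    percentage of promoters minus the percentage of detractors.
    """
    counts = Counter(ratings)
    total = sum(counts.values())
    promoters = sum(n for rating, n in counts.items() if rating >= 9)
    detractors = sum(n for rating, n in counts.items() if rating <= 6)
    return 100.0 * (promoters - detractors) / total

# Hypothetical survey responses from ten customers:
# 4 promoters, 3 passives (7-8), 3 detractors.
ratings = [10, 9, 9, 8, 7, 7, 6, 5, 3, 10]
print(net_promoter_score(ratings))  # 10.0
```

A score above zero means promoters outnumber detractors; tracking the score over time, or by agent or branch, is what turns the survey into an operational signal.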
Figure 2: Examples of Prominent Data Formats Used in Data Analytics (number, image, text, voice, biometric)

Transactional Databases

Transactional data offer information on activity levels and product usage trends. Simple comparisons of transactions by value versus volume may offer very different insights into consumer behavior. For FIs such as banks or MFIs, data on customers' usage of bank accounts (deposits, debits and credits) and other services (cards, loans, payments, and insurance) are normally captured in the CBS. Use of bank accounts and services leaves objective data trails that can be analyzed for patterns signaling different levels of financial capacity and sophistication. Different usage patterns may also signal different levels of risk. To process loan applications, FIs may require documentation from other institutions such as credit bureaus; however, these tend to be on paper and are difficult to digitize.

Alternative Sources of Data

As more of our communication and business is done via mobile phones, tablets and computers, there are more sources of digitized data that may provide insight into the financial capacity and character of customers. These sources can tell us how people spend their time and money, and where and with whom they spend it.

MNO Call Detail Records (CDRs)

From their core operations, MNOs have access to CDRs and the coordinates of cell towers. MNOs analyze CDRs to conduct targeted marketing campaigns and promotions and to adjust pricing, for example. At a minimum, a CDR includes 1) voice calls, talk time, data services usage and SMS data on sender, receiver, time, and duration, and 2) airtime and data top-up information including time, location and denomination. In addition, this information can be matched to cell tower signals to generate locations of customer activity. MNOs that offer mobile money services have access to both CDR data and the DFS transactional database, and when combined for analysis, this information is more likely to help predict customer activity and usage than simple demographic data. In some markets, MNOs and FSPs partner with each other to benefit from the combined data. Airtime top-ups can, for example, be a good indicator of discretionary income. Customers who run their airtime down to zero and routinely and frequently make small top-ups are likely to have less discretionary income than those who top up less frequently but in larger installments.

Agent-assisted Transaction Data

Understanding which locations and agents are the most active can provide insights to help improve agent network performance. For many DFS providers, agents are the primary face to the customer, and tracking the pattern of agent usage and activity may reveal insights about both customer preferences and agent performance. Such information may be directly recorded from mobile phones, point of sale (POS) devices or transaction-point computers. Alternatively, it could be indirectly associated, such as agent registration forms needing to be merged into the transactional data pipeline for an analysis to be conducted.

Geospatial Data

Geospatial data refers to data that contain locational information, such as global positioning system (GPS) coordinates, addresses, cities, and other geographic or proximity identifiers. In recent years, very granular geospatial data have allowed DFS providers to examine and cross-reference demand-side factors, such as level of financial inclusion, customer location, levels of poverty, and mobile voice and data usage, with supply-related factors, such as agent activity, rural or urban characteristics, presence of infrastructure, and similar. This can offer insights that may be helpful to customer acquisition and marketing strategies, agent or branch expansion, and competitor or general market analysis. Geospatial data can offer more granular insights than typical socio-economic indicators, which are generally only available in aggregate format.

Social Media Profiles

Increasingly, potential and existing customer markets are developing online and maintain a presence on social media sites such as Facebook, Twitter and LinkedIn. Online behavior data may offer information on customer feedback, attitudes, lifestyles, goals, and how financial services can play a role in customer lives. Social media network data include data on social connectedness, traffic initiated, and online web behavior, including the timing, location, frequency, and sequence of visits to a website or a series of websites. Social media may also be indicative of an individual's socio-economic status. For example, people with a LinkedIn profile that has many connections may, on average, be lower-risk than those without. That is not because signing up for a LinkedIn account indicates an ability to service debt per se, but rather because LinkedIn targets professionals and, on average, professionals earn higher wages than laborers. Public profiles from social media can also be useful to verify contact details and basic personal customer information. Social media as a data source has its limitations, though. FSPs can generally only gain access to the social media accounts of customers who opt in, and it may be difficult to get enough customers to agree to this to build a large enough database for meaningful analysis. Some customers may also not be active on social media, because of choice or circumstances. Profile data, even when available, may also be biased.
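The airtime top-up pattern described under MNO Call Detail Records lends itself to a simple feature computation: per customer, how often do top-ups occur and how large are they on average? The sketch below uses invented (customer, amount) records; the field layout and function name are illustrative assumptions, not a real MNO schema.

```python
from collections import defaultdict
from statistics import mean

def topup_features(topups):
    """Summarize airtime top-up behavior per customer.

    `topups` is an iterable of (customer_id, amount) records.
    Returns {customer_id: (count, average_amount)}. Per the pattern
    described in the text, many small top-ups may indicate less
    discretionary income than a few large ones.
    """
    amounts = defaultdict(list)
    for customer_id, amount in topups:
        amounts[customer_id].append(amount)
    return {cid: (len(vals), mean(vals)) for cid, vals in amounts.items()}

# Hypothetical records: customer A tops up often in small amounts,
# customer B rarely but in a large amount.
records = [("A", 0.5), ("A", 0.5), ("A", 1.0), ("A", 0.5), ("B", 10.0)]
print(topup_features(records))  # {'A': (4, 0.625), 'B': (1, 10.0)}
```

In practice these per-customer features would be joined with DFS transactional data before feeding a scoring model, as the text suggests.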
Profile data, even when providers to examine and cross-reference on social connectedness, traffic initiated, available, may also be biased. DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES 21 1.1_DATA ANALYTICS AND METHODS Sources of Operational Data Business Intelligence (BI) Peripheral Internal Data There are many business processes System Reports Private Branch Automatic required to run a DFS operation, with each When DFS products are new and there is a Exchange (PBAX) Data department working towards completing relatively low volume of data, it is common The PBAX controls the calls coming into a tasks and meeting performance targets for businesses to create customized call center, and it can provide data on the while relying on data from multiple reports from raw data using simple tools volume of incoming calls, number of calls sources. Possible external and internal data such as Excel. As the business and data dropped before they are answered and sources are illustrated in the figure below grow, and the analysis required becomes the amount of time spent on calls. These and listed in fuller detail in Chapter 2.2. more complex, this soon becomes data are vital for the efficient planning of Each department both generates and shift patterns and size, as well as overall consumes data across this ecosystem. Some unmanageable. Most large DFS systems team performance measurement and of the most important data sources are: will put in place a data warehouse that uses improvement. BI systems to draw on multiple sources of Core System Data data, which come with some basic reports Ticketing Systems The core system provides the bulk of as well as the ability to customize. The ticketing system tracks the process the data. 
The transactional engine is responsible for managing the workflow of resolving business problems, and Technical Log Files of transactions and interactions, sending provides a wealth of information, from as much granular data and metadata A rich source of data can be found in the the types of problems that occur, to issue as feasible to the relevant databases. technical log files. More advanced DFS resolution times. This includes the movement of funds providers proactively use dashboards to plus fees and commissions, as well as any continuously assure system health and business rules around commission splits provide early fault detection. It is also and tax rules. It should also provide fully common to have performance monitors auditable workflow trails of non-financial and alerts built into the monitoring system activities such as Personal Identification Number (PIN) changes, balance enquiries, that can provide valuable information. mini-statements, and data downloads, as Providers that only access these data when well as internal functions such as transfers specific forensic analysis is required miss of funds between accounts. out on available and useful data. 22 DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES Data Privacy and In Kenya, many digital credit providers have emerged to meet the demand for credit, Customer Protection but operate outside the regulatory purview The new analytical and data collection of the central bank.9 One such provider methodologies raise several questions included in their terms and conditions that related to customer privacy rights and the provider was free to post the names of consumer protection. First, as discussed defaulters on their website and post directly earlier, much of the data produced and to the social media walls of defaulters. collected are done so passively, that In cases such as this one, customers may not is to say, without the knowledge of be aware that they are agreeing to suspend the producer of the data. 
Sometimes, their privacy rights until it is too late. these data can be shared with third This can be particularly true in developing parties without the knowledge of the country contexts where both literacy and data producer. This can have negative awareness of the issues are low. implications on the individual’s ability to obtain loans or insurance. The problem Notably, even in countries where user is compounded when the individual is consent is prevalent, consumers may unaware of this negative information or not understand the permissions they does not have recourse to dispute the are granting. As an example, users in negative information. There are currently sophisticated markets may not be aware of no standard opt-in policies for data sharing. all of the applications in their smartphone Some DFS providers with apps that are that make use of location data. Research installed on the mobile phones of their shows that 80 percent of mobile users customers may be able to sweep customer have concerns over sharing their personal internet usage information and other data information while using the mobile including SMS messages, contacts and internet or apps.10 Nevertheless, 82 percent location data, among others. of users agree to privacy notices without reading them because they tend to be too Figure 3: Example of Request to Save With the diversity of DFS providers, not all long or use terminology that is unfamiliar. and Access User Location History Data providers fall under the same supervisory Due to security concerns and the stated via Google Maps App regime, thereby leading to differing data willingness of customers to stop using apps privacy policies for each. Some of the they find too intrusive or lacking in security, breaches to individual rights to privacy most apps nowadays offer simple ways to could have negative reputational impacts. opt in and opt out. 
9 Ombija and Chege, 'Time to Take Data Privacy Concerns Seriously in Digital Lending', Consultative Group to Assist the Poor (CGAP) Blog, October 24, 2016, accessed April 3, 2017, https://www.cgap.org/blog/time-take-data-privacy-concerns-seriously-digital-lending
10 'Mobile Privacy: Consumer research insights and considerations for policymakers', GSMA

Figure 4: Examples of Smartphone Application Permissions Settings

Privacy laws, where they exist, vary widely by jurisdiction and even more so by degree of enforcement. In the context of developed markets, in the European Union (EU) the right to privacy and data protection is heavily regulated and actively enforced,11 while in the United States no comprehensive federal data protection law exists. The EU issued data protection regulations in 2016, which mandate that all data producers should be able to receive back the information they provide to companies, to send the information to other companies, and to allow companies to exchange the information with each other where technically possible.12 This kind of regulation empowers the consumer while enhancing competition, as consumers can now move between providers with their transaction history intact.
In the United States, the Federal Trade Commission (FTC) is the regulating body on data privacy. However, the FTC Code of Fair Information Principles is only a set of recommendations for maintaining privacy-friendly, consumer-oriented data collection practices – it is not enforceable by law. In the absence of any overarching federal privacy rule, the United States has developed federal and state statutes and regulations to address personal information privacy and data security, both in a general sense and on an industry-sector basis, to which every relevant business must adhere.

11 Regulation governing data protection in the EU includes the EU Data Protection Directive 95/46 EC and the EU Directive on Privacy and Electronic Communications 02/58 EC (as amended by Directive 2009/136)
12 Regulation (EU) 2016/679 of the European Parliament and of the Council (2016), accessed April 3, 2017, http://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679&from=EN

Cross-border flows of data constitute a delicate issue, especially as they can affect national security matters. Regulation in countries such as Angola, South Africa and Tanzania specifically stipulates that data can only be transferred to countries where the law provides the same or higher standards of protection for the personal data in question. Zambia goes even further by forbidding any off-shore transfers of data that are not anonymized.13 At the other end of the spectrum, the proposed Kenya Bill on Data Protection of 2016 has been harshly criticized by experts for including no provision for extraterritorial jurisdiction.14

Nevertheless, customer data privacy is a new policy area, and countries such as Mozambique and Zimbabwe still rely on the Constitution to interpret privacy rights as a result of not having dedicated regulatory bills. In this context, emerging markets frequently look to more established markets and regulators for cues on how to address the issues at hand.

When it comes to Sub-Saharan Africa, Ghana, South Africa and Uganda seem to stand out as having the best regional practices. What sets these three countries apart is the fact that regulation is guided by a customer centricity principle and, as such, regulation focuses on:

• Empowering the consumer to make pertinent decisions about their personal data usage, especially in relation to automated decision-making
• Stipulating clear mechanisms through which the consumer can seek compensation
• Giving the customer the 'right to be forgotten'

Given this context, but aware of the differences between technology usage in emerging and developed markets, the United Nations (UN) has offered some general guidance in terms of policy development. The UN emphasizes the need to accelerate the development and adoption of legal, technical, geospatial, and statistical standards in regard to:

• Openness and the exchange of metadata
• Protection of human data rights15

Thus, at the moment, no uniform policy exists to govern data privacy issues. The first step to understanding privacy's implications is to ensure a sector-wide discussion involving DFS providers, regulators, policymakers, other public sector stakeholders, investors, and development FIs in order to devise solutions and standards. At the same time, in the financial inclusion sector, DFS providers must acknowledge that while data represent an opportunity to improve the bottom line, they also underscore an obligation to add value. This can be achieved by using the data to improve access to financial services. DFS providers can attempt to educate the people about how their personal information will be used while only collecting information that is necessary.

13 'Global Data Privacy Directory', Norton Rose Fulbright
14 Francis Monyango, 'Consumer Privacy and data protection in E-commerce in Kenya', Nairobi Business Monthly, April 1, 2016, accessed April 3, 2017, http://www.nairobibusinessmonthly.com/politics/consumer-privacy-and-data-protection-in-e-commerce-in-kenya/
15 'A World That Counts: Mobilizing the Data Revolution for Sustainable Development', United Nations Secretary-General's Independent Expert Advisory Group on a Data Revolution for Sustainable Development

Data Science: Introduction

Data science is the interdisciplinary use of scientific methods, processes and systems to extract insights and knowledge from various forms of data to solve specific problems. It combines numerical science, such as statistics and applied mathematics, with computer science and business and sector expertise. It is an exploratory and creative discipline, driven to find innovative solutions to complex issues through an analytical approach. The science of data refers to the scientific method of analysis: data scientists engage in problem solving by setting a testable hypothesis and assiduously testing and refining that hypothesis to obtain reliable and validated results.

Figure 5: The Scientific Method, the Analytic Process that is Similarly Used for 'Data Science'. The cycle: 01 Make observations (What do I see in nature? This can be from one's own experiences, thoughts or reading.); 02 Think of interesting questions (Why does that pattern occur?); 03 Formulate hypotheses (What are the general causes of the phenomenon I am wondering about?); 04 Develop testable predictions (If my hypothesis is correct, then I expect a, b, c.); 05 Gather data to test predictions (Relevant data found from literature, new observations or formal experiments; thorough testing requires replication to verify results.); 06 Communicate results (Draw conclusions and report findings for others to understand and replicate.); then refine, alter, expand or reject hypotheses, and repeat.

Data Science

The term data scientist was coined in 2008 by DJ Patil and Jeff Hammerbacher to describe their job functions at LinkedIn and Facebook. They emphasized that their roles were not just about crunching numbers and finding patterns in those numbers; they applied a creative and exploratory process to build connections across those patterns. "Data science is about using complex data to tell stories," said Patil, adding that it drew as much from journalism as from computer science. For this reason, Patil and Hammerbacher considered an alternative title for their jobs: Data Artist.

Figure 6: Data Science, the Intersection of Several Disciplines (statistics/mathematics, computer science, and business expertise)

In order to deliver BI, all data-related analysis must start by defining business goals and identifying the right business questions, or hypotheses. The scientific method provides helpful guidance (see Figure 5). Importantly, it is not a linear process. Instead, there is always a learning and feedback loop to ensure incremental improvement. This is key to obtaining insights that enable evidence-based and reliable decision-making. Chapter 2.1 of this handbook provides a step-by-step process for implementing data projects for DFS providers, utilizing the Data Ring methodology.

Data science facilitates the use of new methods and technologies for BI, and useful insights can be derived from data large and small, traditional and alternative. Faster computers and complex algorithms augment analytic possibilities, but neither replace nor displace time-tested tools and approaches to deliver data-driven insights to solve business problems. Rather, it is important to understand the strengths that different tools offer and to augment them appropriately to obtain the desired results in a timely and cost-efficient manner.

Figure 7 provides a high-level description of BI analytical methods, classified by their operational use and relative sophistication. Many categories and their associated techniques and implementations overlap, but it is still useful to break them into four principal use cases: descriptive, diagnostic, predictive, and prescriptive. The least complex methodologies are often descriptive in nature, providing historical descriptions of institutional performance, aggregated figures and summary statistics. They are also least likely to offer a competitive advantage, but are nevertheless critical for operational performance monitoring and regulatory compliance. On the opposite end, the most innovative and complex analytics are prescriptive, optimized for decision-making and offering insights into future expectations. This progression also helps to classify the deliverables and implementation strategy for a data project, which is discussed further in Chapter 2.1.
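The progression from descriptive to predictive analytics can be made concrete with a toy illustration: the same series supports both a descriptive summary (what happened) and a naive forecast (what might happen next). The monthly figures below are invented, and a real deployment would use far richer models; the sketch only shows how the two use cases differ on identical data.

```python
from statistics import mean

# Hypothetical monthly transaction counts for a DFS deployment
monthly = [120, 135, 150, 170, 185]

# Descriptive analytics: summarize what happened
total = sum(monthly)       # aggregate figure
average = mean(monthly)    # summary statistic

# Predictive analytics (naive sketch): extrapolate the
# average month-on-month change one period ahead
changes = [b - a for a, b in zip(monthly, monthly[1:])]
forecast_next = monthly[-1] + mean(changes)

print(total, average, forecast_next)  # 760 152 201.25
```

Descriptive output is backward-looking and audit-friendly; the forecast, even this crude one, is the first step toward the forward-looking decision-making the text describes.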
Data Science Analytic Framework for Business Intelligence

Figure 7: The Four Categories of Business Analytics. Complexity of analytics increases from traditional BI reports (information) through modeling to integrated systems (optimization), and with it the competitive advantage:
• Descriptive Analytics: What happened? What is happening now? Techniques: alerts, querying, searches, reporting, static visualizations, dashboards, tables, charts, narratives, correlations, simple statistical analysis.
• Diagnostic Analytics: Why did it happen? Techniques: regression analysis, A|B testing, data mining, forecasting, segmentation.
• Predictive Analytics: What will happen in the future? Techniques: machine learning, pattern matching, geospatial pattern recognition, interactive visualizations.
• Prescriptive Analytics: How can we make it happen? Techniques: SNA, graph analysis, neural networks, machine and deep learning, AI.

Methods

The analytical use cases outlined in Figure 7 help determine the method, time, cost, and complexity of data projects. The following methods are generally included in the data scientist's toolbox, and help to match broad methods with analytical purposes. These methods are especially relevant for discussions with external consultants or solutions providers, to help frame what they are delivering or to evaluate a proposal.

Descriptive Analytics

Descriptive analysis offers high-level aggregate reports of historical records and answers questions about what occurred. Key Performance Indicators (KPIs) are also within this category.

• Descriptive Statistics: Also known as summary statistics, descriptive statistics include averages, summations, counts, and aggregations. Correlation statistics that show relationships between variables also help to describe data.

• Tabulation: The process of arranging data in a table format is known as tabulation. Cross-tabulation summarizes data from one or more sources into a concise format for analysis or reporting, often aggregating values. It is a method for segmentation, allowing aggregates to be tabulated by gender or location, for example, or other segments of interest. Excel uses the term 'pivot table' to describe this type of analysis.
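The cross-tabulation described above can be sketched in a few lines of code: a minimal pivot-table-style count over two attributes. The customer records and field names below are hypothetical, invented purely for illustration.

```python
from collections import Counter

def cross_tab(records, row_key, col_key):
    """Cross-tabulate record counts by two attributes,
    similar to an Excel pivot table."""
    counts = Counter((r[row_key], r[col_key]) for r in records)
    rows = sorted({row for row, _ in counts})
    cols = sorted({col for _, col in counts})
    # Fill missing cells with 0 so every row has every column
    return {row: {col: counts.get((row, col), 0) for col in cols}
            for row in rows}

# Hypothetical customer records segmented by gender and product
customers = [
    {"gender": "F", "product": "savings"},
    {"gender": "F", "product": "loan"},
    {"gender": "M", "product": "savings"},
    {"gender": "F", "product": "savings"},
]
print(cross_tab(customers, "gender", "product"))
# {'F': {'loan': 1, 'savings': 2}, 'M': {'loan': 0, 'savings': 1}}
```

The same function could aggregate by location, agent, or any other segment of interest simply by changing the two key arguments.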
Diagnostic Analytics

Finding key drivers or understanding changing data patterns is diagnostic analysis. It is about asking why something happened; for example, asking why transaction patterns changed to determine if there is not only correlation, but causation. Diagnostic analysis usually requires more sophisticated methods and research designs, as described below.

• A|B Testing: This is a statistical method where two or more variants of an experiment are shown to users at random to determine which performs better for a given conversion goal. A|B testing allows businesses to test two different scenarios and compare the results. It is a very useful method for identifying the better promotional or marketing strategy among tested options.

• Regression: Statistical regression is one of the most basic types of modeling, and is very powerful. It enables multi-variable analysis to estimate relationships between a dependent variable, usually a metric of business interest, and a set of independent variables with which it correlates. Identifying statistically significant16 variables can guide strategy, focus goals and estimate outcomes.

• Segmentation: Segmentation is a method of classifying groups into sub-groups based on defined criteria, behavior or characteristics. Segmentation can help to identify customer demographic or product usage categories, with quantified and statistically meaningful thresholds. This is often used in conjunction with regression analysis or more sophisticated modeling techniques to predict to which segment an as-yet-unidentified prospective customer could belong.

• Geospatial: This method groups data according to their location on a map, or in relationship to place and proximity. This can also help to identify customer and behavioral segments, such as from where and to where they send money, or which branches they tend to visit. Combined with more advanced techniques, it can also enable location-based services to proactively engage customers who are near people or places of interest.

16 Statistically significant is the likelihood that a relationship between two or more variables is caused by something other than random chance

Predictive Analytics

Predictions enable forward-looking decision-making and data-driven strategies. From a data science point of view, this is arguably the most central category of methods, as complex algorithms and computational power are often used to drive models. From a business perspective, predictive models can deliver operational efficiencies by identifying high-propensity customer segments and expanding reach at lower costs via targeted marketing campaigns. They can also help enhance customer support by proactively anticipating service needs.

• Machine Learning: This is a field of study that builds algorithms to learn from and make predictions about data. Notably, this method enables an analytical process to identify patterns in the data without explicit instruction from the analyst, and enables modeling methods to identify variables of interest and drivers of even unintuitive patterns. It is a technique rather than a method in itself. Approaches based on machine learning are categorized in terms of 'supervised learning' or 'unsupervised learning' depending on whether there is ground truth to train the learning algorithm, where supervised methodologies have the ground truth.

• Modeling: There are two primary modeling methods: regression and classification. Both can be used to make predictions. Regression models help to determine a change in an output variable with given input variables; for example, how do credit scores rise with levels of education? Classification models put data into groups or sometimes multi-groups, answering questions such as whether a customer is active or inactive, or which income bracket he or she falls within. There are numerous types of modeling techniques for either, with nuanced technical detail. Modeling approaches tend to generate a lot of attention, but it is important to note that the modeling method is likely not an important analysis design specification. Typically, many model types are tried and the best one is then selected in response to pre-defined performance metrics. Or sometimes they are combined, creating an ensemble approach. A consultant should describe why a recommended approach is selected, and not simply state, for example, that the solution builds on a specific method such as the much-publicized 'random forest' method. Deciding which method to use for modeling should consider the importance of being able to interpret why results have been rendered versus the accuracy of the prediction. Regression models tend to be very transparent and easily interpretable, for example, while the random forest method is at the other end of the spectrum, providing good predictions but insufficient understanding of what drives them.

Prescriptive Analytics

Methods in this category tend to be categorized by predicting or classifying behavioral aspects in complex relationships, and it includes an advanced set of methods, which are described below. Artificial intelligence (AI) and deep learning models fall into this group. However, this classification is better framed by the expected infrastructure needed to use the results of an analysis, ensuring it offers operational value. For example, this could take the form of a set of dashboard tools needed to run an interactive visualization on a website, or the Information Technology (IT) infrastructure to put a credit scoring model into automation. Integrating an algorithm or data-driven process into a broader operational system, or as a gatekeeper in an automated process relying on it to provide a service, is what defines a data product.
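The regression example used earlier, how credit scores rise with levels of education, maps onto a one-variable least-squares fit. The sketch below uses invented, perfectly linear data purely to show the mechanics and the interpretability that the text attributes to regression models; real data would of course be noisy.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x.

    Returns (a, b): the intercept and the slope, which is
    directly interpretable as the change in y per unit of x.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data: years of education vs. credit score
years = [8, 10, 12, 14, 16]
score = [520, 560, 600, 640, 680]
a, b = fit_line(years, score)
print(a, b)  # 360.0 20.0 -> each extra year adds about 20 points
```

The slope is the kind of transparent, easily explained coefficient the text contrasts with opaque methods such as random forests.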
in itself. Approaches based on machine and not simply state, for example, that Integrating an algorithm or data-driven learning are categorized in terms of the solution builds on a specific method process into a broader operational system, ‘supervised learning’ or ‘unsupervised such as the much publicized ‘random or as a gatekeeper in an automated process learning’ depending on whether forest’ method. Deciding which method relying on it to provide a service, is what there is ground truth to train the to use for modeling should consider the defines a data product. learning algorithm, where supervised importance of being able to interpret methodologies have the ground truth. why results have been rendered 30 DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES Industry Lessons: Google’s Got the Flu Predictive Modeling and Model Tuning: Reliability Risks of Unsupervised Models Researchers at the search engine benefits are obvious. The model was the model, identified as statistically Google wondered if there could be a a success and was released publicly powerful correlations in 2008. correlation between people searching as Google Flu Trends. Google’s But many of these search terms were for words such as ‘coughing,’ impressive big data modeling was actually predictors of seasons, and ‘sneezing’ or ‘runny nose’ – symptoms prominently featured in the scientific seasons in turn correlated with the of flu – and the actual prevalence of journal Nature in 2008. Six years flu. When flu patterns shifted earlier influenza. In the United States, the later, however, the failure of the same or later than had been the case in spread of influenza has lagging data; model was prominently described in 2008, those search terms were no people fall sick and visit the doctor, the journal Science. What happened longer correlating as strongly with then the doctor reports the statistics, between 2008 and 2014? the flu. 
Combined with changing user and so the data capture what has demographics, the model became already happened. Could models The number of internet users grew unreliable. Google Flu Trends was driven by search words provide substantially over these six years and left on autopilot, using unsupervised real-time data as influenza was the search patterns of 2008 did not learning methods, and the statistical actually spreading? This approach remain constant. The core issue was correlations weakened over time, to reducing time lags in data is that Google Flu Trends was developed unable to keep up with shifting known as nowcasting. For issues using unsupervised machine learning patterns. such as seasonal flu, the public health techniques: 45 search phrases drove When using similar methods for business decisions or for public health matters, it is important to keep in mind that loss of reliability over time can present significant risks. DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES 31 1.1_DATA ANALYTICS AND METHODS The Random Forest Method • Text Mining (Natural Language Tools Processing): Text mining is the process Data science and its methods are developed of deriving high-quality information The random forest with computer programming languages, from text. Text may help to identify method has generated or the algorithms run on computational customer opinions and sentiments a lot of excitement in platforms. The data that feed these about products using social media data science because algorithms is drawn from databases. posts, twitter or customer relationship it tends to drive highly accurate The data scientist’s toolkit also includes management (CRM) messages. Natural models. It is a form of classification hard knowledge about technical computing Language Processing (NLP) combines model that uses a tree-type or and the soft skills required to develop and computational linguistics and AI flowchart-type decision structure deploy data algorithms. 
The technical methods to help computers understand combined with randomized selection specifications of these tools are beyond the text information for processing and approaches to identify an optimal scope of DFS data analytics. Nevertheless, analysis. path between the desired result and some prominent technologies are a ‘forest’ set of input variables. It is • Social Network Analysis (SNA): highlighted to note a few tools that data important to understand that some This is the process of quantitative and scientists are likely to use. Successful data science modeling methods qualitative analysis of a social network. data products require a combination of are easily understood in a business For business purposes, SNA can be methods, tools and skills, as will be further context, while others are not. The employed to avoid churn, detect fraud discussed in Chapter 2.1: Managing a random forest method may, for and abuse, or to infer attributes, such as Data Project. example, generate highly accurate credit worthiness based on peer groups. models, but its complexity yields a • Image Processing: This approach uses Hard Tools ‘black box’ that makes it very difficult computer algorithms to perform analysis • Databases: The structure of the data will to interpret. This could potentially for the purpose of classification, feature guide the appropriate database solution. be problematic for a credit scoring extraction, signal analysis, or pattern Structured data are typically served by model; it might identify the most recognition. 
Businesses can use this to relational databases with fixed schemas credit-worthy people given the recognize people in pictures to help with that can support integral data reliability, input data, but may not help to fraud detection, or to detect geographic which can help analysts identify data describe what makes these people features relevant for agent placement value anomalies – or prevent them credit-worthy or what determines using satellite images. from saving erroneous data in the first the credit recommendation. 32 DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES place. Relational databases organize • Frameworks: These are sets of Soft Tools datasets into tables that are related to software packages that combine a data • Languages: ‘R’ and Python are two each other by a key, that is, a metadata storage solution with an application programming languages that have attribute shared across the tables. programming interface (API) that become essential to data science. Enterprise data warehouse solutions integrate management or analytical Both offer the benefits of fast prototyping and transaction data storage commonly tools into the database. In other words, and exploratory analysis that can get use relational databases. Prominent data projects quickly up and running. these are single-source solutions to products include: Oracle, SQL Server and Both also include add-on libraries built manage and analyze data. Prominent MySQL. Unstructured data are typically for data science, enabling sophisticated products include Spark and Hive. Hadoop, machine learning or modeling served by non-relational databases that mentioned above, is something between techniques with relative programming lack rigid schemas, commonly referred a NoSQL database and a framework. simplicity. Frameworks and databases to as NoSQL databases. 
They provide advantages in scale and distribution, It is used to manage and scale distributed also have their own sets of programming data using a search approach known as languages. SQL is needed for relational and are often relied on for big data and MapReduce, a method developed by database systems, while other solutions interactive online applications. As big may require Java, Scala, Python, or for datasets get bigger, hard disk space Google to store and query data across Hadoop, Pig. becomes limited and the computational their vast data networks. • Design and Visualization: Core data time it takes to search takes longer. • Cloud Computing: Third-party vendors science languages usually include The advantage of NoSQL databases is offer hosting solutions that provide visualization libraries to help explore that they are designed to be horizontally access to computational power, data data patterns and to visualize final scalable, meaning that another computer, storage and frameworks. This is an results. As many data projects produce or two, or a hundred, can be seamlessly excellent solution for firms that want interactive dashboards or data-driven added to grow the storage space and to engage in more sophisticated data monitoring tools, a number of vendors computer power to search them. While offer turnkey solutions. Some product analytics, especially big data, but do not relational solutions can also be scaled providers include: IBM, Microsoft, and distributed, they’re often more have the ability to invest in computer Tableau, Qlik, Salesforce, DataWatch, complex to manage and tune when data servers and hire technicians to manage Platfora, Pyramid, and BIME, among are saved across multiple computers. them. Prominent products include: others, some of which are exemplified Prominent NoSQL products include: Amazon Web Services (AWS), Cloudera, in the operational case studies in Hadoop, MongoDB and BigTable. Microsoft Azure and IBM SmartCloud. Chapter 1.2. 
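The chapter's contrast between transparent regression models and black-box methods such as random forests can be made concrete in a few lines of Python, one of the languages named above. The sketch below fits a one-variable least-squares regression in pure Python (no libraries), echoing the chapter's example of how credit scores might rise with years of education. The data points are invented for illustration; the point is that the fitted slope is directly interpretable.

```python
# Illustrative only: a tiny ordinary-least-squares fit in pure Python.
# The data are made up; a real analysis would use R or Python libraries
# on actual customer records.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    slope = num / den
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations: years of education vs. credit score.
education = [8, 10, 12, 14, 16]
score = [500, 520, 540, 560, 580]

slope, intercept = fit_line(education, score)
print(f"score = {intercept:.0f} + {slope:.1f} * years_of_education")
# prints: score = 420 + 10.0 * years_of_education
```

Here each extra year of education adds ten points to the predicted score, a statement a black-box model cannot offer so directly; that interpretability is exactly what the accuracy-versus-transparency trade-off discussed above weighs.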
PART 1
Chapter 1.2: Data Applications for DFS Providers

This chapter covers the three main areas in which data analytics allows firms to be customer-centric, thus building a better value proposition for the customer and generating business value for the DFS provider. It looks first at the role data insights can play in improving the DFS provider's understanding of its customers. Second, it illustrates how data can play a greater role in the day-to-day operations of a typical DFS provider. Finally, it discusses the use of alternative data in credit assessments and decisions.

These sections present a number of use cases to demonstrate the potential data science holds for DFS providers, but they are by no means exhaustive. The business possibilities that data science offers are limited only by the availability of the data, methods and skills required to make use of data. Presented below are a number of examples to encourage DFS providers to begin to think about ways in which data can help their existing operations reach the next level of performance and impact.

Figure 8 illustrates how data analytics can play a role in supporting decision-making throughout a DFS business, along the customer lifecycle and corresponding operational tasks. As such, data play a key role in helping DFS providers become more customer-centric. It goes without saying that all organizations depend on customer loyalty. Customer centricity is about establishing a positive relationship with customers at every stage of the interaction, with a view to driving customer loyalty, profits and business. Essentially, customer-centric services provide products that are based on the needs, preferences and aspirations of their segment, embedding this understanding into the operational processes and culture.
The Customer Life Cycle

Figure 8: Opportunities for Data Applications Exist Throughout the Customer Life Cycle. The cycle for a customer-centric DFS provider runs Inspire, Acquire, Develop, Retain, with data applications at each stage: target the customers most likely to take up DFS, measure marketing impact, predict customer behavior, improve customer activity, build loyalty and develop programs, build closer relationships with valuable customers, examine customer feedback, set pricing strategy, identify needs for product or process enhancements, and reduce customer attrition.

1.2_DATA APPLICATIONS

Being responsive to customers is key to customer centricity. It is useful to understand why customers leave, and when they are most likely to leave, so that appropriate action can be taken. Some customers will inevitably leave and become former customers. Using data analytics to understand how these customers behaved throughout the customer lifecycle can help providers develop indicators that will alert the business when customers are likely to lapse. It may also offer insights into which of these customers the provider may be able to win back, and how to win them back.

DFS providers often cater to people who previously lacked access to banks or other financial services, as well as other underserved customers. This poses special challenges for providers, as they must first establish trust and faith in a new system for their customers. Such customers may have irregular incomes, be more susceptible to economic shocks and may have different expenditure trends. Finally, the need for consumer protection for this segment is higher, because these customers may have less access to information, lower levels of literacy and higher exposure to fraud when compared to other segments. DFS providers will need to understand the particular needs of these customers and then design operational processes that reflect this understanding. Thus, understanding customers and delivering customer value is crucial for DFS providers, and data can help them become more customer-centric.

1.2.1 Analytics and Applications: Market Insights

This section demonstrates how to use data to develop a more precise and nuanced understanding of clients and markets, which in turn can help a provider to develop products and services that are aligned with customer needs. As described in the previous chapter, DFS providers have access to valuable customer data in a variety of forms. These data can be manipulated and analyzed to offer granular market insights. Such analysis usually involves a diverse set of methods, and both quantitative and qualitative data. This section starts with a case study to illustrate how small steps to incorporate a data-driven approach can bring greater precision to understanding customer preferences. It is followed by a discussion of how data can be used to understand customer engagement with a DFS product in order to improve customer activity and reduce customer attrition. Next, it explains how to use customer segmentation to identify specific groups within the customer base, and how to use this knowledge to improve targeting efforts. This is followed by a discussion of how DFS providers can harness new technologies to predict financial behavior and improve customer acquisition. Finally, the section examines ways to interpret customer feedback to improve existing products and services.

CASE 1
Zoona: Testing Marketing Strategies for Optimal Impact
Developing Hypotheses for Successful Marketing Messages and Testing Them

Zoona is a PSP with operations in Zambia, Malawi and Mozambique, where it aims to become the primary provider of money transfers and savings accounts for the masses. Marketing is often a time-consuming and resource-intensive activity, and it can be difficult to measure impact. Zoona dealt with some of these challenges by using a customer-centric approach to test three different marketing strategies for a new deposit product called Sunga. First, it ran a three-month pilot of the Sunga product in one area, later extending the pilot to another three towns to test three different marketing strategies, all in order to identify the most impactful approach for the nationwide launch.

The first strategy was called 'Instant Gratification', and it awarded all customers opening an account a free bracelet as well as a high chance of receiving a small cashback reward each time they made a deposit. In the second strategy, called 'Lottery', customers had a low chance of winning a large prize, with only four winners selected over two months. The third approach involved account-opening ambassadors who went to high-activity areas, such as markets, to encourage people to open accounts.

Statistics from the first month of this extended pilot are presented below. The numbers have been indexed against the initial pilot town, so 1.3 indicates results 30 percent better than the baseline pilot. The analysis shows that the lottery methodology was the least popular, while the highest number of opened accounts was credited to the ambassador strategy. These accounts also had high deposit values. Zoona also looked at customer activity rates, measured as the number of deposits per account; here, the instant gratification approach was the clear winner. In Figure 9, November 24 is the date depositors began winning small cashback rewards every time they deposited into their accounts: the blue line shows deposits rising significantly.

Comparing Marketing Strategies, Results Table (indexed against the initial pilot town, first 30 days)

Strategy | # Registrations | Deposit Value
Pilot | 1.0 | 1.0
P1: Instant Gratification | 1.4 | 1.9
P2: Lottery | 1.1 | 1.8
P3: Ambassador | 3.0 | 3.8

Table 1: Comparing Results. The 'Ambassador' strategy increases account openings 300 percent over baseline.

Figure 9: Results of the Customer Incentives Marketing Campaign Testing Trials. The chart plots the number of deposits per account by registration town (Pilot, P1: IG, P2: Lottery, P3: Ambassador) from 1 November to 28 December, with a marked rise after 24 November 2016.

The outcome of the analysis was further supported by follow-up calls to customers. The feedback revealed that instant gratification also drove word-of-mouth marketing, as 88 percent of those in the instant gratification group told a family member or friend about the product. As a result, the nationwide marketing strategy now combines both the 'Ambassador' and 'Instant Gratification' strategies: the first to drive account openings, and the second to drive customer activity levels.

This case study illustrates that a rigorous approach to testing marketing strategies does not need to involve complicated methodologies. Rather, a systematic approach and planning, using quick iterations of techniques measured by customer response rates, can create measurable insights. It also highlights the benefit of combining methodologies to arrive at the desired customer behavior.
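Zoona's comparison relies on a simple technique: indexing each town's first-30-day results against the baseline pilot town, so that a value of 1.3 reads as '30 percent better than the pilot'. A minimal sketch of that calculation follows; the raw counts are invented for illustration (only the resulting ratios mirror Table 1).

```python
# Sketch of the indexing behind Table 1. Raw counts are hypothetical;
# only the indexed ratios match the published table.

def index_against_baseline(results, baseline="Pilot"):
    """Divide each town's metrics by the baseline town's metrics."""
    base = results[baseline]
    return {
        town: {metric: round(value / base[metric], 1)
               for metric, value in metrics.items()}
        for town, metrics in results.items()
    }

# Hypothetical first-30-day raw numbers per strategy.
raw = {
    "Pilot":                     {"registrations": 100, "deposit_value": 1000},
    "P1: Instant Gratification": {"registrations": 140, "deposit_value": 1900},
    "P2: Lottery":               {"registrations": 110, "deposit_value": 1800},
    "P3: Ambassador":            {"registrations": 300, "deposit_value": 3800},
}

indexed = index_against_baseline(raw)
print(indexed["P3: Ambassador"])
# prints: {'registrations': 3.0, 'deposit_value': 3.8}
```

Indexing of this kind is deliberately lightweight: it needs only counts per town and a clearly designated baseline, which is what makes the approach feasible for a small pilot.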
Use Case: Understanding Product Engagement for DFS Offerings

Understanding how a customer uses or does not use a product or service is important for making improvements to the appropriate area of operations in order to extend reach and increase adoption. Transactional data and customer profiling data provide valuable information on how customers engage with a product over time. This feedback can be used to develop effective messaging for the product, or to develop actions to manage customer interaction with the product. High levels of registration but low levels of activity usually imply that the cost of acquiring and maintaining active customers is unnecessarily high. Transactional data, as well as geospatial data, can offer the provider insights into activity levels by both customers and agents. These insights can help the provider effect changes throughout the business to align with customer behavior and needs. This type of analysis can help inform marketing strategies, agent recruitment strategies or the adoption of best-practice agent processes, for example. Figure 10 provides a simple illustration of how transactional data can be interpreted. The data analytic process is also explored in more detail in Chapter 2.1.

Build Hypothesis (What happened? Why did it happen? What is happening now?) → Gather Data (transactional data; usage levels; comparison of behaviors across groups; KYC data; CDR data) → Analyze Data (simple statistical analysis; tables; correlations) → Data-driven Actions (change strategy based on findings; more primary research)

Figure 10: The Process of Analyzing and Interpreting Data

Improving Customer Activity

A simple transactional analysis as seen above may, for example, reveal that highly active customers are associated with specific agents. To be able to act on this information, it will be necessary to find out why this is the case. Could it be because of best practices adopted by the agents, because of geographical location, or because of some other variable? As an example, interviews could be conducted to better understand agent techniques, and geospatial data could be used to better understand the impact of location on agent and customer activity. Very high or very low activity groups often indicate the need for deeper research and focus group discussions to understand the reasons behind them.

Reducing Customer Attrition

Looking closely at transactional data can provide clues as to why customers are leaving the service and how to retain them. The frequency with which customers interact with a service can indicate whether they have just been acquired, are active customers of the service, or need to be won back into the service. Different messages and channels are relevant to customers in each of these stages. Generally, keeping existing customers is far less expensive than acquiring new ones. Large numbers of never-transacted customers indicate inadequate targeting at the recruitment stage. A high number of lapsed customers may indicate other limitations in the service offering, which can be improved by small product or process enhancements.

Use Case: Segmentation

Segments can be delineated by demographic markers, behavioral markers such as DFS usage patterns, geographic data, or other external data from MNOs such as usage and purchase of airtime and data. Understanding segments is necessary to uncover the needs and wants of specific groups, as well as to design well-targeted sales and marketing strategies. Insights from segmentation, intended to expand revenue-generating prospects in each unique segment, are critical inputs for an institution's strategic roadmap. Customer segmentation is a crucial aspect of becoming a customer-centric organization that serves customers well, makes smart investment decisions and maintains a healthy business.

In principle, many DFS providers recognize the importance of segmentation. However, in practice, most DFS providers either serve the mass market in developing-country contexts as one single segment, or use basic demographic segmentation to understand customers. The reason for the limited incorporation of segmentation into customer insight generation is twofold. First, beleaguered DFS providers in highly competitive markets may be encouraged by the success of certain products and may feel compelled to adopt a product-centric approach to their businesses, rather than a customer-centric focus. Thus, DFS providers may neglect to think about the different possible uses for their offerings depending on customer needs and concerns. Rather, they may choose to highlight very particular use cases and messages for a product. For example, while M-Pesa's mobile money transfer product was very successful in Kenya, MNOs in other markets have not had the same success, emphasizing the need to look at market and customer behavior and needs market-by-market before rolling out products. Second, there is a lack of awareness about how to effectively segment a client base and how to use this segmentation analysis. Segmentation does not need to be complicated or expensive. Practitioners should clearly define business goals, which can lead the segmentation exercise.

Figure 11: Examples of DFS Customer Segments, by Product Activity. Super User: sends money every two weeks to children studying in the city; needs a loyalty program. Passive User: receives money from an employer and withdraws it immediately; needs information on the product. Lapsed User: changed SIM, does not understand the product, used it once and never again; needs information on the product and additional help from an agent.

The following framework presented by the Consultative Group to Assist the Poor (CGAP) illustrates how different types of segmentation can be employed by a practitioner depending on their needs:17

Demographic (e.g., rural vs. urban; male vs. female; old vs. young). Data needs: registration and Know Your Customer (KYC) information. Advantages: simple; data are easy to find. Disadvantages: lack of uniformity within groups; less insightful.

Behavioral (e.g., never transacted vs. dormant vs. active users; savers vs. withdrawers). Data needs: transactional database. Advantages: data are easy to find; easy to ascribe value to the customer. Disadvantages: lack of insight into the customer's life, needs and aspirations; less useful for marketing messages.

Demographic and Behavioral (e.g., students; migrant workers sending money home). Data needs: registration and KYC information; transactional database; primary market research. Advantages: ascribes value to a customer and provides insights on their life and needs; easier to develop marketing messages. Disadvantages: data are relatively harder to find; might have overlapping segments.

Psychographic (e.g., women who want a safe place to save; customers who believe access to mobile money implies higher status; the budget conscious). Data needs: deep and rich historical transactional data; primary research. Advantages: strongly responsive to customer aspirations; strong value proposition; easier to develop marketing messages. Disadvantages: difficult to find data; might have overlapping segments; could be a very dynamic segment, i.e., wants could change.

Table 2: CGAP Customer Segmentation Framework

17 CGAP (2016). Customer Segmentation Toolkit.

CASE 2
Tigo Cash Ghana Increases Active Mobile Wallet Usage
Customer Segmentation Models Improve Customer Acquisition and Activation

Tigo Cash launched in Ghana in April 2011, and is the country's second-largest mobile money provider in terms of registered users. Despite high registration rates, getting customers to do various transactions through mobile money remains a key challenge and focus. Client registration rates, and maintaining activity rates, remained a key goal after launching the service. An actively transacting client base is not only a challenge in Ghana; the GSMA estimates global activity rates are as low as 30 percent.

In 2014, Tigo Cash Ghana partnered with IFC on a predictive analysis to identify mobile voice and data users with a high probability of becoming active mobile money users. To do this, six months and nearly two terabytes of CDRs and transactional data were analyzed by a team of data scientists.

Results from the analysis suggest that differences exist between customers across a large number of metrics of mobile phone use, social network structure, and individual and group mobility. There are strong differences between voice and data-only subscribers, inactive mobile money subscribers and active mobile money subscribers. A strong correlation can be observed between heavy use of traditional telecoms services and the likelihood of those users also becoming active, regular mobile money users.

New Customers
With the help of machine learning algorithms, the research team identified matching profiles among voice and data-only customers who are not yet mobile money subscribers, but who are likely to become active users. The team also geo-mapped the data (see Figure 12) for further analysis. Moreover, the analysis of CDRs and transactional data was complemented by surveys to understand not only what happened, but why.

Figure 12: Current, Predicted and Top Target Districts of Mobile Money Usage. Three maps of Ghana show the district-level adoption rate for Tigo Cash, predicted adoption (based on CDRs), and the top target districts.

Determinants of Mobile Money Adoption
The need for further customer education and product adaptation is something that came out clearly through the individual surveys. Only a small proportion of mobile money users reported that agent non-availability prevented them from using mobile money services. Low levels of usage were more closely linked to people's lack of awareness of the mobile money value proposition, or to perceptions that they did not have enough money to use the services.

Improved Activity Rates
SMS usage and high-volume voice and mobile data usage are key factors that were used to identify potential active mobile money users. What started as an analysis of historical CDRs delivered proof-of-concept value and led to a data-driven approach that allowed Tigo Cash to exceed the 65 percent activity mark among its mobile money clients. The active customer base grew from 200,000 prior to the exercise to over 1 million active customers within 90 days.

Predictive modeling resulted in 70,000 new active mobile money users from the one-time use of the model. The results mapped out the pool of likely mobile money adopters, and identified locations where below-the-line marketing activities were achieving the highest impact. Having an ex-ante idea of marketing potential in different areas avoids the overprovision of sales personnel and increases marketing efficiency. The data-driven approach delivered a smarter and more informed way to target existing telephone subscribers to adopt mobile money.

Institutional Mindset Shift
As a mobile money provider, Tigo Cash has become a top performer in Ghana. The output of the collaboration became the foundation of all of Tigo Cash Ghana's customer acquisition work. Above all, the data analysis showed the value of knowing customers. Tigo Cash Ghana plans to increase its internal data science capacity, as well as to further improve its customer understanding with additional primary research. The goal has now shifted from registering new customers who are expected to be active, to thinking ahead about ways to keep activity levels high in a sustainable way.

An institutional approach to customer acquisition and retention can be fundamentally changed and improved, simply by making use of existing data to make more informed operational decisions.
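The Tigo Cash analysis ranked voice and data subscribers by their likelihood of becoming active mobile money users, drawing on CDR-derived features such as SMS volume and voice and data usage. The actual IFC models are not published; the sketch below shows only the general shape of such a propensity score, a weighted combination of usage features passed through a logistic function, with weights invented for illustration.

```python
# Illustrative propensity scoring over CDR-style features.
# The feature weights and bias are assumptions for this sketch,
# not the parameters used in the Tigo Cash study.
import math

WEIGHTS = {"sms_per_week": 0.03, "voice_min_per_week": 0.01,
           "data_mb_per_week": 0.002}
BIAS = -2.0

def adoption_score(features):
    """Logistic score in (0, 1): higher = more likely to adopt."""
    z = BIAS + sum(WEIGHTS[k] * v for k, v in features.items())
    return 1 / (1 + math.exp(-z))

# Hypothetical subscribers with weekly usage features.
subscribers = {
    "A": {"sms_per_week": 60, "voice_min_per_week": 120, "data_mb_per_week": 500},
    "B": {"sms_per_week": 5,  "voice_min_per_week": 10,  "data_mb_per_week": 0},
    "C": {"sms_per_week": 30, "voice_min_per_week": 200, "data_mb_per_week": 100},
}

# Rank subscribers so marketing can target the most likely adopters first.
ranked = sorted(subscribers, key=lambda s: adoption_score(subscribers[s]),
                reverse=True)
print(ranked)
# prints: ['A', 'C', 'B']
```

In practice such weights would be learned from labeled data (subscribers who did or did not become active users), but even this toy version captures the operational idea: score everyone, then concentrate below-the-line marketing on the top of the ranking.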
1.2 DATA APPLICATIONS

Targeted Marketing Programs
Targeting the right market groups, with the right advertising and marketing campaigns, can greatly increase the effectiveness of a campaign in terms of uptake and usage. Using a combination of data sources, DFS providers can segment transactional data by demographic parameters in order to identify strategic groups within their customer base. Marketing programs can be customized to target these groups, often with greater efficiency and effectiveness than standard approaches. DFS providers have been known to combine segment knowledge with data on profitability in order to focus marketing efforts on the segments that are most likely to optimize profits. Similarly, other DFS providers have used customer life cycles to make the right product offers to the right customers.

The main challenge here is to find out what customer groups care about in order to design an appropriate marketing campaign. While the universe of data available to DFS providers is growing every day, the data alone may not shed light on this; once the customer groups are identified, DFS providers can use primary research to find out what each segment cares about. All customer data can be used to develop targeted marketing programs. However, results are likely to be sharper if the analysis is done on the members of specific customer segments. This is equally applicable to identifying high-performing agents based on segmentation. Working with FINCA in the Democratic Republic of Congo (DRC), IFC analyzed agent transaction data and registration forms to show that being a woman and being involved in a service-oriented business are highly correlated with being a higher-performing agent.18

Loyalty and Promotional Campaigns
There may be customer segments that conduct a very high number of transactions on the DFS channel. These segments may desire loyalty rewards for specific transactions, such as payments at certain kinds of merchants. Alternatively, the DFS provider may be able to nudge other segments towards certain kinds of transactions by offering promotional campaigns. Specific transactions in the database, combined with customer profiles, help identify which groups would benefit from such campaigns.

Product or Process Enhancements
Classifying customers into segments also allows DFS providers to pay greater attention to the specific needs of a representative cohort. In a bigger group these needs may get lost, but paying attention to smaller segments allows DFS providers to sharpen their focus and explore underserved or ignored needs and wants. For example, within a group of people not using a service, there might be those who are lapsed customers, or those who transacted a few times but then stopped using the service. Talking to these users might reveal a need to make small changes in the product or process. Alternatively, customers in one segment may use the full suite of products offered by a DFS provider, while another segment may use only one or two of these products. In such cases, segmentation provides insight for targeted market research and product development, with the objective of unlocking customer demand.

High-value Customer Relationships
Segmenting customers based on profitability is a common application of the segmentation process. Additionally, one can assess the groups that are likely to become important in the future. DFS providers can use this information to increase their market share of these groups and to decrease resource allocation to less profitable groups. The data needed for this kind of analysis are customer demographics, transactional data and data around customer profitability.

18 Harten and Rusu, 'Women Make the Best DFS Agents', IFC Field Note 5, The Partnership for Financial Inclusion.
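As a concrete illustration of profitability-based segmentation, the sketch below assigns customers to low-, mid- and high-value groups. It is not from the handbook; the revenue figures and thresholds are invented for demonstration.

```python
# Illustrative profitability segmentation: band customers by monthly
# revenue. Figures and thresholds are hypothetical.
customers = [
    {"id": "C1", "monthly_revenue": 0.50},
    {"id": "C2", "monthly_revenue": 4.20},
    {"id": "C3", "monthly_revenue": 12.75},
    {"id": "C4", "monthly_revenue": 1.10},
    {"id": "C5", "monthly_revenue": 8.00},
    {"id": "C6", "monthly_revenue": 0.00},
]

def segment_by_profitability(customers, low=1.0, high=5.0):
    """Assign each customer to a low/mid/high profitability segment."""
    segments = {"low": [], "mid": [], "high": []}
    for c in customers:
        r = c["monthly_revenue"]
        band = "low" if r < low else "mid" if r < high else "high"
        segments[band].append(c["id"])
    return segments

segments = segment_by_profitability(customers)
```

In practice the bands would be derived from the provider's own revenue distribution (for example, quantiles) rather than fixed cutoffs, and combined with demographic and transactional attributes as described above.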
Market Opportunity and Priority Products
Once the segmentation exercise is complete, DFS providers can assess the extent to which their product offering meets the needs and wants of each segment. They can estimate which segments represent the greatest opportunity over time and how competitive their offering is within these crucial growth segments. Thus, an analysis based on segmentation can play a powerful role in the strategic roadmap of a DFS provider.

Traditional demographic segmentation – which can be age-based, income-based or geography-based – is useful, but experience shows that demographic segmentation is less predictive of an institution's future relationship with a customer than segmentation based on behavioral characteristics. Grouping customers based on demographics tends to treat all customers in a group as the same, irrespective of their level of activity on the channel. Demographics are also static in nature, whereas – particularly in the world of tech-enabled financial access – customer behavior is dynamic and ever-changing.

Access to transactional databases can transform traditional segmentation into a powerful tool for generating customer insights. With the increased availability of data, new data analysis tools and the multiple channels available to customers, DFS providers now have the option of using individual behavioral information. This information better predicts people's financial needs and usage. Furthermore, it reflects the changing needs and activities of the customer. However, behavioral data may not carry much information about customer needs and aspirations, making it difficult to develop insightful messaging around these segments.

Conducting a customer database segmentation exercise requires dedicated resources and a detailed plan. Notably, segmentation strategies that make use of multiple sources of data are the most successful in usefully and accurately describing customer groups, so the process to develop customer segmentation must incorporate this approach. Data analysis plays an important role in this process, as it allows DFS providers to segment exactly by the variables that drive usage and uptake. This report only discusses the role of data analysis in facilitating this process, but it is important to note that these segments can be created through multiple kinds of research and analysis.

CASE 3: Airtel Money – Increasing Activity with Predictive Customer Segmentation Models
Machine Learning Segmentation Model Delivers Operational Value and Strategic Insight

Airtel Money, Airtel Uganda's DFS offering, was launched in 2012. Initial uptake was low, with only a fraction of its 7.5 million GSM subscribers registering for the service. Activity levels were also low, with around 12.5 percent active users. IFC and Airtel Uganda collaborated on a research study to use big data analytics and predictive modeling to identify existing GSM customers who were likely to become active users of Airtel Money.

The project analyzed six months of CDR and Airtel Money transactions. The analysis sought to segment highly active, active and non-active mobile money users. The study identified three differentiating categories: GSM activity levels, monthly mobile spending and user connectedness. Using machine learning methods, a predictive model was able to identify potential active users with 85 percent accuracy. This yielded 250,000 'high-probability', new and active Airtel Money customers from the GSM subscriber base for Airtel to reach with targeted marketing. Geospatial and customer network analysis helped to identify new areas of strategic interest, mapped against new uptake potential.

The machine learning model identified some variables with high statistical reliability, but they made little business sense, like 'voice duration entropy'. As a result, a supplementary analysis delivered business rules metrics, or indicators that correlated well with potential activity and also related closely to business KPIs. Each metric had a numeric cutoff point to target customers above or below a given cutoff. While not as accurate as the sophisticated model, this provided a solid 'quick cut' that could be used against KPIs to rapidly assess expectations.

Finally, the study analyzed the corridors of mobile money movement within the region. It found that 60 percent of all transfers happen within a 19 kilometer radius in and around Kampala. Understanding this need for short-distance remittances also informed Airtel Money's marketing efforts for P2P transfers. Moreover, this network analysis of P2P transactions identified other towns and rural areas with activity corridors that could drive strategic engagements beyond Kampala for Airtel to focus on growing.

Figure 13: Network analysis (left) of P2P flows between cities and robustness of channel. Also pictured, geospatial density of Airtel Money P2P transactions (center), compared with GSM use distribution (right). Data as of 2014.

Advanced data analytics can provide insights into active and highly active customer segments that can drive propensity models to identify potential customers with high accuracy. Network and geospatial analysis can deliver insights to prioritize strategic growth planning.
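The 'business rules' quick cut described in the Airtel case can be sketched as a set of simple threshold checks. The metric names and cutoff values below are hypothetical, chosen only to illustrate the idea, not Airtel's actual indicators.

```python
# Hypothetical 'business rules' quick cut: flag GSM subscribers whose
# metrics clear simple numeric cutoffs. Names and thresholds are invented.
CUTOFFS = {"gsm_active_days": 20, "monthly_spend": 5.0, "network_contacts": 15}

def likely_adopter(subscriber):
    """True when every business-rule metric is at or above its cutoff."""
    return all(subscriber[metric] >= cutoff for metric, cutoff in CUTOFFS.items())

subscribers = [
    {"id": "A", "gsm_active_days": 25, "monthly_spend": 7.5, "network_contacts": 30},
    {"id": "B", "gsm_active_days": 10, "monthly_spend": 9.0, "network_contacts": 40},
]
flags = {s["id"]: likely_adopter(s) for s in subscribers}
```

The appeal of such rules is exactly what the case describes: they are less accurate than a fitted model, but each indicator is legible to the business and can be checked directly against KPIs.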
Use Case: Forecasting Customer Behavior
Predictive modeling is a decision-making tool that uses past customer data to determine the probability of future outcomes. DFS providers evaluate multidimensional customer information in order to pinpoint the customer characteristics that are correlated with desired outcomes. As part of modeling, each customer is assigned a score or ranking that reflects the customer's likelihood to take a certain action.

For a customer-centric institution, predictive modeling can inform how it understands and responds to client needs. However, a few impediments prevent it from being more widely used. There has been a perception – now gradually changing among DFS providers – that providers already know their client base well enough to understand which products and marketing campaigns work. Alternatively, some DFS providers look at what has worked elsewhere and try to replicate similar products and services in their own markets. Many providers are also unsure about exactly how and where to start the process.

Predictive analysis can help practitioners achieve the following goals:

• New customer acquisition
• Developing an optimal product offering
• Identifying customer targets and predicting customer behavior
• Preventing churn
• Estimating the impact of marketing

New Acquisition and Identifying Targets
As evidenced by research and practitioner experience, providers have successfully registered large numbers of new clients for their DFS services. However, transforming these registered customers into active customers remains a difficult task that only a few DFS providers have been able to master. On average, only about one third of registered customers have conducted a transaction in the last 90 days.19 One of the reasons identified for these low levels of activity is inadequate targeting at the recruitment stage. Most DFS offerings target the vast mass market. As such, they are able to sign up a large number of customers, but have had limited success converting these clients into an active and profit-generating customer base.

Predictive analysis could help identify, at the acquisition stage, the customers who are much more likely to become active users in the future, through a statistical technique known as response modeling. Response modeling uses existing knowledge of a potential customer base to assign a propensity score to each potential customer. The higher the score, the more likely the customer is to become an active user. MNOs that are DFS providers have used this kind of modeling to predict which members of their voice and data customer base are likely to become active users of their DFS service. The model is predicated on the hypothesis that customers who are likely to spend more on voice and data are also likely to adopt DFS. Using CDR data, the model is able to predict with a high degree of accuracy how likely a customer is to become an active user of DFS.

Developing Optimal Product Offerings
Predictive models can also be used to discover which bundles of products are likely to be used together by customers. Such a model will identify segments that tend to use only a single product, such as P2P transfers, and others who make use of multiple products, such as deposit services, airtime purchase and P2P transfers. However, the second group may never use the service for microloans. This is information that the DFS provider can use for marketing purposes and product development.

19 'State of the Industry Report on Mobile Money', Decade Edition 2006–2016, GSMA.
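A minimal sketch of response modeling is shown below, assuming a logistic score over a few usage features. The feature names and weights are invented for illustration; a real model would be fitted on historical CDR and transaction data.

```python
import math

# Minimal response-modeling sketch: a logistic function turns a few usage
# features into an adoption propensity between 0 and 1. Weights are
# hypothetical, not fitted values.
WEIGHTS = {"voice_minutes": 0.02, "data_mb": 0.001, "sms_count": 0.01}
BIAS = -2.0

def propensity(features):
    """Higher scores mean the subscriber is more likely to become active."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

score = propensity({"voice_minutes": 120, "data_mb": 500, "sms_count": 80})
```

Ranking subscribers by this score and marketing to those above a chosen cutoff is the essence of the approach described above: heavier voice and data users receive higher scores, reflecting the hypothesis that they are more likely to adopt DFS.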
Predicting Customer Behavior
This analysis can also be used to understand the future value potential of each customer. This includes lifetime customer value, customer loyalty, expected purchase and usage behavior, and expected response to campaigns and programs. Similarly, DFS providers can increase their up-sell and cross-sell opportunities by predicting future usage from the current basket of products and patterns of use. Determining which bundles of products work together through transactional data analysis also presents an opportunity for cross-selling. For example, a PSP may find that users are using the wallet as a storage account, an indication that these customers may be served more effectively through a savings account. This information can be used across several operational functions: campaign and marketing design, financial projections, customer investment allocation, and future product development. This kind of prediction can be made at the individual customer level or at the aggregate level for a segment as a whole.

Notably, a comprehensive predictive analysis of lifetime customer value requires a high level of active customers across product and channel areas. This may not yet be realistic for many DFS providers. However, as organizations grow, being able to forecast future customer patterns and trends will become not only possible but imperative for growing a healthy business. Being aware of this functionality can thus help DFS providers incorporate it into their decision-making process as and when relevant.

Preventing Churn
Customer churn happens when a customer leaves the service of a DFS provider. The cost of churn includes both the lost future revenue from the customer and the marketing and acquisition costs related to replacing the lost customer. Additionally, at the time of churn, the revenues earned from the customer may not yet have covered the cost of acquiring that customer. Analytics around customer churn therefore have two objectives: predicting which customers are going to churn, and understanding which marketing steps are likely to convert a customer at high risk of churning into a retained customer.
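A very simple churn-risk signal, often used as a starting point before building a full predictive model, is transaction recency: customers with no activity inside a chosen window are flagged for retention outreach. The sketch below is illustrative only; the window and data are hypothetical.

```python
from datetime import date, timedelta

# Illustrative churn-risk flag: customers whose last transaction is older
# than the inactivity window. Dates and window length are hypothetical.
last_txn = {
    "C1": date(2017, 3, 29),
    "C2": date(2017, 1, 5),
    "C3": date(2017, 2, 20),
}

def at_risk(last_txn, as_of, window_days=60):
    """Return customers with no transaction inside the window."""
    cutoff = as_of - timedelta(days=window_days)
    return sorted(cust for cust, last in last_txn.items() if last < cutoff)

risky = at_risk(last_txn, as_of=date(2017, 3, 31))
```

A fitted churn model would replace the single recency rule with many behavioral features, but the output is used the same way: a ranked list of customers to target with retention marketing.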
Estimating Marketing Impact
Marketing for DFS tends to be resource-intensive, given the relative newness of these services in many markets. This is furthered by the realization that a product requires awareness-building before achieving customer acceptance. Without a tool to measure success, managers are forced to rely on gut feeling and high-level sales data to assess the value of their marketing efforts. Given that customers now interact with DFS providers on multiple channels, digital and otherwise, it is also challenging to isolate the effects of specific campaigns, as customers are exposed to multiple messages at any given point in time.

Predictive modeling allows for the measurement of marketing impact on customer behavior. Depending on the data available, the analysis can allow DFS providers to estimate 'lift', or the increase in sales that can be attributed to marketing. Predictive modeling will identify how specific marketing measures impact customer behavior across segments. It may demonstrate, for instance, that a certain marketing action, or advertising on a certain channel, has a much higher response from certain segments compared with the average response from the population.
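In its simplest form, lift compares conversion in a targeted group against a held-out control group that received no campaign. The sketch below is a minimal illustration; the counts are invented.

```python
# Minimal 'lift' estimate: relative increase in conversion attributable
# to a campaign, measured against a held-out control group.
def lift(treated_converted, treated_total, control_converted, control_total):
    """Relative increase in conversion rate over the control group."""
    treated_rate = treated_converted / treated_total
    control_rate = control_converted / control_total
    return (treated_rate - control_rate) / control_rate

campaign_lift = lift(180, 1000, 120, 1000)  # 18% vs 12% conversion
```

Here the campaign would be credited with a 50 percent relative lift. Segment-level lift is computed the same way, restricting both groups to the segment of interest.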
Personalized Marketing Messages
The previous sections have already discussed how targeted marketing can use a deeper understanding of customer segments. Personalized marketing takes targeted marketing to an extremely individualized level, where an individual customer's wants and needs are anticipated using their past behavior and other reported information. Many potential customers have limited experience with financial services and are often suspicious of their relevance to their lives. Personalized messaging allows DFS providers to 'speak' to their customers as if they know them, enabling DFS providers to win customer trust. Additionally, customers are able to have a highly tailored relationship with their provider. In competitive markets, personalized messages help build an affinity for one service over another. Customers are much more likely to respond to messaging that speaks to their interests than to impersonalized messaging that refers to a very high-level, non-specific value proposition for DFS. Finally, the right marketing message will pull the customer to take action based on the messages they receive, presumably because those messages speak to the underlying pain points of the customer.

Some personalized messages may fail in their targeted objectives, as unsolicited messages can easily be ignored or, worse, may cause negative associations with the DFS provider. Thus, personalized messages need to be carefully crafted and targeted in order to ensure they reach the customers who need the information.

How can DFS providers personalize marketing messages?

1. Collect Data and Identify Customers: First, DFS providers need to collect data about their customers. The sources for these data include customer transactions, demographic data, preferences, and social media inputs.
2. Understand Customers: Then, DFS providers need to examine these data and consider segmentation into groups based on common characteristics.
3. Develop Messages and Interact with Customers: DFS providers should then develop messages for customers and identify the appropriate channels to deliver those messages to their customer base. The next step is to engage with the customer base through the messaging.
4. Test the Efficacy of Messaging: The impact of the message can be measured using A/B testing. Personalization must be accompanied by testing so that it is possible to assess its impact.
5. Refine the Message: Customer feedback and the measurement of impact must feed into further message refinement.

CASE 4: Juntos Delivers Scalable and Personalized Customer Engagement Messages
Data Sources: Qualitative and Quantitative Data Improve Segmentation and Outreach

Juntos, a Silicon Valley technology company, partnered with DFS providers to build trusting relationships with end users and improve overall customer activity rates. Globally, many DFS providers experience high inactivity and low engagement. This discourages providers, whose investments may not be seeing sufficient financial return and whose customers may have access to services of which they are not making sufficient use. Juntos offers a solution to this problem by using personalized customer engagement messages based on data-driven segmentation strategies that deliver quantified results.

Good data underpin this approach. First, Juntos conducts ethnographic research to better understand customers in the market. Engagements are always informed by quantitative data provided by the DFS partner, qualitative behavioral research done in-country, and learnings drawn from global experience. Having developed an initial understanding of the end user, Juntos conducts a series of randomized control trials (RCTs) prior to full product launch. These controlled experiments are designed to test content, message timing or delivery patterns, and to identify the most effective approach to customer engagement.

To begin, messages are delivered to users, and users can reply to those messages. This develops the required trust relationship. More importantly, those responses are received by an automated Juntos 'chatbot' that analyzes the results according to three KPIs:

• Engagement Rates: What percent of users replied to the chatbot? How often did they reply?
• Content of Replies: What did the responses say? What information did they share or request?
• Transactional Behavior: Did transactional behavior change after receiving messages for one week? One month? Two months?

These experiments enable Juntos to understand which inactive clients became active because of Juntos message outreach, and which messages enabled higher, more consistent activity. For example, a control message is sent to a randomly selected group of users: "You can use your account to send money home!" Others might draw from service data to include the customer's name: "Hi John, did you know that you can use your account to send money home?" Perhaps other data will be incorporated within the message: "You last used your account 20 days ago, where would you like to send money today?" These are merely qualitative examples, but they show how a generic message compares with a personalized message with a time-sensitive prompt. Juntos' baseline ethnographic data improve qualitative understanding of customers, helping build the hypothesis around which messages are likely to resonate, then putting those messages to statistical test.

The first question is whether the test messages yield statistically better results compared with the generic control message. When the answer is "yes," it is important to dive one step deeper and ask about the respondent, surveying across segments such as rural or urban; male or female; income range; and usage patterns, merging this information with ethnographic data on consumer sentiment.

By testing a wide variety of messages, Juntos is able to segment user groups according to the messages that show statistical improvement in usage over time. This means that high-engagement messages can be crafted for everyone from rural women, to young men, to high-income urbanites.
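The statistical comparison between a test message and the generic control can be approximated with a standard two-proportion z-test on reply rates. The sample sizes and reply counts below are hypothetical, not Juntos data.

```python
import math

# Two-proportion z-test sketch: does the personalized test message get a
# significantly higher reply rate than the generic control? Counts are
# hypothetical.
def two_proportion_z(x1, n1, x2, n2):
    """z-statistic for the difference between two reply rates."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z = two_proportion_z(260, 1000, 200, 1000)  # 26% vs 20% reply rate
significant = z > 1.96                      # roughly the 5% two-sided level
```

When the difference is significant, the analysis moves on, as described above, to asking which segments (rural or urban, male or female, income range) drive the improvement, by running the same comparison within each segment.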
The Juntos approach is tailored for each context and is continuously tuned to nimbly accommodate customers who change their interactions over time.

Collecting qualitative customer sentiment and market data improves understanding of customer behavior, which helps providers craft messages that people like to see. Statistical hypothesis testing identifies which messages resonate best with specific groups, enabling personalized messaging for targeted audiences.

Use Case: Understanding Customer Feedback and Text Analytics
DFS providers can also extract usable insights about customer preferences and attitudes through new algorithm-based techniques called text mining, or text analytics. Today, many companies can access information about customer likes and dislikes through social media, emails, websites, and call center conversation transcripts. Notably, these methods have mostly been applied in developed-country contexts in Europe and North America. However, DFS providers in emerging markets may also want to analyze these data to help grow their business. Text analysis may also be done manually. With advances in technology, these methods are likely to become cheaper and more adaptable to developing-country contexts and languages.

The most common applications of text analytics fall under two methods:

1. Text Summarization Methods: These methods provide a summary of all of the key information in a text. The summary can be created either using only the original text (extractive approach) or using text that is not present in the original (abstractive approach).
2. Sentiment Analysis: Sentiment analysis, or 'opinion mining', is an algorithm-based tool used to evaluate language, both spoken and written, to determine whether the opinion expressed is positive, negative or neutral, and to what extent. Through this analysis, DFS providers understand how customers feel about their products, how they relate to the brand, and how these attitudes change over time. Of particular interest are any peaks or troughs in the sentiment analysis.

Currently, evaluations from text analytics can be applied across three areas:

Product and Service Enhancement
DFS providers could make quick improvements to products and services if they could hear directly from customers. Social media, emails and other direct feedback mechanisms are a great way to hear immediately and directly from customers; market research can be a limited source of customer feedback in this respect.

Word-of-mouth Marketing
Word-of-mouth marketing remains the most trusted form of advertising for many customers. For DFS providers with large existing customer bases, motivating satisfied customers to boost word-of-mouth marketing is not difficult. However, for new products like DFS, providers need to find ways to raise education levels among potential customers, and especially to identify the customers who can build enthusiasm and momentum for the product within the target customer base. Typically, customers are more motivated to spread the word about one or two specific use cases; they will rarely spread a generic message about the brand. Social media feeds and other web-based information can be used to identify influencers by their connectedness, the level and nature of their interactions, and their potential reach. This kind of analysis depends on unstructured social network data, data from review sites and data from blogs.

Marketing Impact and Monitoring Feedback
Opinion mining allows DFS providers to understand the thinking of huge numbers of customers. Through sentiment analysis, it is possible to track what customers are saying about new products, commercials, services, branding, and other aspects of marketing. This analysis can also be used to understand how the market perceives competitor products and services. These data from social media, blogs, review websites, and other websites in the social sphere are also unstructured.
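At its simplest, sentiment analysis can be sketched as a lexicon lookup: count positive and negative words and compare. This toy version is illustrative only; production systems use trained models and far richer lexicons, especially for local languages.

```python
# Toy lexicon-based sentiment sketch (illustrative only).
POSITIVE = {"good", "great", "easy", "fast", "love"}
NEGATIVE = {"bad", "slow", "failed", "expensive", "hate"}

def sentiment(text):
    """Classify a short piece of feedback as positive/negative/neutral."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Running such a classifier over call-center transcripts or social media feeds, and plotting the daily share of negative messages, is one way to surface the peaks and troughs mentioned above.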
1.2.2 Analytics and Applications: Operations and Performance Management

The operations team is responsible for running 'the engine room,' which is core to the DFS business because it performs a myriad of tasks, including: collecting and storing data and ensuring its fluid connectivity among the various systems and applications in the DFS provider's entire IT environment; constantly monitoring data quality; onboarding agents and managing agent performance; ensuring that the technology operates as designed; providing customer support; delivering the information and tools needed by the commercial team, including performance measurement, risk monitoring and regulatory reporting; resolving issues; efficiently monitoring indicators, exceptions and anomalies; managing risk; and ensuring that the business meets its regulatory obligations. This cannot be done efficiently without access to accurate data, presented in a form that is relevant, easily digestible and timely.

The operations team occupies an important place in the organizational structure, being independent from other core functions while also integrated into major business activities. The nature of the team's responsibilities requires technical skills as well as business knowledge. This combination enables meaningful data interpretations that can ultimately support the decision-making processes of key business stakeholders.
This section describes the role that data can play in optimizing the day-to-day operations of a typical DFS provider. It starts by describing how data can be turned into useful information, giving real-life examples of data analysis in action, including some tips on best practice in DFS data usage. As the use of data dashboards becomes increasingly common, it also provides insights into dashboard creation and content.

Figure 14: Operations Tasks (diagram covering the agent, business partner and customer lifecycles; risk and compliance; product development and management; billing, revenue and commission; technical operations; and e-money reconciliation)

Use Case: Visualizing Performance with Dashboards
It is often said that a picture is worth a thousand words. Finding a graphical way to represent data is a powerful way to communicate information and trends quickly, which is critical for the constant monitoring of business performance and key for identifying risks before they develop. Well-structured dashboards, tailored towards various groups of users, should reflect demand from the business units and help them make more informed decisions.

Turning data into graphs and other forms of visualization makes it easier to communicate the information revealed, and also helps spot trends and anomalies in the data. Many people in the organization do not have the time or the resources to analyze the data themselves; they simply want the answers to questions that will help them do their job more effectively.

A dashboard gives a snapshot of the KPIs relevant to a department or to the overall business. If there is rarely a need to take action based on the reported data, the dashboard metrics are probably incorrect. In order to design robust dashboards, it is important to incorporate feedback from the ultimate users so as to meet their specific needs. Without this feedback, the dashboards might become obsolete and all efforts to develop them would be wasted. Therefore, dashboard development is a joint venture between the operations and business teams, which might go through several iterations to close the feedback loop of the various stakeholders.

Some dashboards need to be real-time. For example, a technical operations team needs to act on alerts raised in real time: customer care managers actively assess call volumes to assign team work and manage incidents, risk management teams are constantly informed about missed repayments, and sales teams can take early action on low-activity accounts to activate the customer and not let the account become dormant. Some of these dashboards allow end users to manipulate the data to visualize various data cuts and segments. Often, these kinds of dashboards are presented live on a large screen on the team floor for everyone to see. For field staff, where internet access may be of variable quality, online dashboards can be downloaded and cached locally for use in the field.

Other management dashboards provide insights by analyzing data from the previous day, week, month, or year, and hence can be delivered in multiple ways, including reports, presentations or via an online portal. Consequently, each department and project team needs dashboards personalized to the department's goals and initiatives. Typically, as a minimum, DFS solutions should have multiple operations dashboards covering the following areas, each providing role-based access to specific audiences:

• Risk: Revenue leakage; non-performing loans (NPLs); anti-money laundering (AML) insights; capital adequacy; fraud detection
• Finance: Profit and loss insights; e-money oversight
• Marketing: Customer insights and trends for various offerings
• Sales: Agent performance; merchant and biller performance; sales team performance
• Operations: Agent liquidity management
• Customer Care: Call center statistics and insights
• Technical Operations: Technical operations insights

Off-the-shelf data management tools have advanced enormously over the last few years, and standard dashboards are likely to be available as part of the technology vendor package. In order to gain the deeper insights required, and to do so in a reproducible manner, there are two standard approaches:

1. Return to the Vendor: There is often budget available for vendors to make changes to the dashboards, but multiple department requests and multiple vendor clients vying for attention can lead to capacity issues and delays.
2. Use Excel to Manipulate Raw Reports Downloaded from System 'Data Cubes': When a question is given to the business decision support team, it will create a custom dashboard and deliver a report or PowerPoint presentation to attempt an answer. This is another, ad hoc, form of dashboard creation.

The latest generation of data management tools allows the freedom to investigate areas of interest without needing expertise in data manipulation. However, the underlying databases need to be designed and optimized to successfully deploy and use these types of tools. Whatever the data management process or system being used, these are the points to consider when creating a dashboard:

1. Think About Answering "So What?": The results should be actionable, not just 'nice to know.' Many dashboards only show the current status of the business and do not give the context of previous results or time-based trends.
2. Decide What Question is Being Answered Before Starting: Often, reports are a dumping ground for all the data that are available, whether useful or not. These types of reports do not contain the motivational metrics and measures that increase performance.
3. Design the Report to Tell a Story: Once the right data are measured and collected, the report should contain eye-catching information to lead the reader to the most important points. Make it visual, interesting and helpful.

Standard Operations Reports
In order to improve their businesses, DFS providers are trying to find answers to questions such as:

• What were the transaction volume and value?
• How many customers and agents were active?
• What revenue did we make?
• How does this compare with last month and with the budget?
• Are any risk indicators outside of acceptable ranges?
• Are there any recurring unusual transactions, any spikes in activity or any anomalies that signal unusual activity?

The starting point is to focus on the KPIs, or metrics with quantifiable targets, that the operational strategy is working to achieve and against which performance is judged. The overall business KPIs should directly relate to the strategic goals of the organization and, as a result, determine the specific KPIs of each department. The most useful data are those that can be turned into the information needed to make decisions. Before creating a report, one should identify exactly what one wants to know and confirm that action will be taken as a result of obtaining the data.

Well-structured departmental KPIs provide the operations teams with insights against which they can measure performance versus targets. They help teams understand what is happening on the ground and where there is potential for improvement. The standard KPI reports about the main business drivers are usually segmented by operational area. The focus KPIs of each respective operational area are shown in Table 3.
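The standard questions listed above can be answered directly from a transaction log. The sketch below is a minimal illustration; the log format, figures and activity window are hypothetical.

```python
from datetime import date, timedelta

# Minimal operations report from a transaction log: volume, value and
# 30-day active customers. Log format and figures are hypothetical.
transactions = [
    {"customer": "C1", "amount": 10.0, "date": date(2017, 3, 28)},
    {"customer": "C2", "amount": 25.0, "date": date(2017, 3, 30)},
    {"customer": "C1", "amount": 5.0,  "date": date(2017, 2, 10)},
]

def kpis(transactions, as_of, active_window_days=30):
    """Compute headline KPIs over the trailing activity window."""
    cutoff = as_of - timedelta(days=active_window_days)
    recent = [t for t in transactions if t["date"] >= cutoff]
    return {
        "volume": len(recent),
        "value": sum(t["amount"] for t in recent),
        "active_customers": len({t["customer"] for t in recent}),
    }

report = kpis(transactions, as_of=date(2017, 3, 31))
```

The same computation run per agent, per product or per region yields the departmental cuts discussed above, and comparing the result against last month's figure or the budget turns the raw numbers into an actionable KPI.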
catching information to lead the reader The most useful data are those that can to the most important points. Make it be turned into the information needed to visual, interesting and helpful. make decisions. Before creating a report, 56 DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES Department Topics of Focus for KPIs Finance and Treasury Revenue, interest income and expenses, fees and commissions, amount held on deposit, transaction volume and value, customer and agent volume (active), indirect costs, and issuing e-money for non-banks, bank statement reconciliation Business Partner Lifecycle Recruitment, activity levels, issue resolution, performance management, reconciliation and settlement (merchants, billers, switches, partner banks, other PSPs) Customer Lifecycle Management KYC management, activity levels, transactional behavior, issue resolution (customer services), and account management Technical Operations Monitoring product performance, monitoring partner service levels, change management, partner integration, fault resolution, incident management, and user access management Credit Risk Portfolio risk structure, non-performing loans, write-offs and risk losses, loan provisioning Operational Risk and Compliance Operational risk management, suspicious activity monitoring and follow up, regulatory compliance, due diligence, and ad hoc investigations Agent Network (DFS specific) Recruitment, activity levels, float management, issue resolution, performance management, reconciliation and Lifecycle settlement, and audit Other Depending on the nature of the DFS, other reports may be required, for example, organizations extending credit will perform credit rating, debt recovery and related tasks Table 3: Focus KPIs by Operational Area Depending on the business strategy and always a temptation to include peripheral improved, but they generally do not need departmental objectives, a selection of data, which are not strictly needed to to be reported to a wider 
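To make the headline questions above concrete, the basic monthly KPIs can be derived directly from a raw transaction log. The sketch below is a minimal illustration, assuming a hypothetical list-of-dicts log with `amount`, `fee`, `customer_id` and `agent_id` fields; the field names and budget structure are placeholders, not a standard schema.

```python
def monthly_kpis(transactions, budget):
    """Compute headline operations KPIs from one month's transaction log.

    `transactions` is a list of dicts with assumed fields: amount, fee,
    customer_id, agent_id. `budget` holds the month's targets.
    """
    value = sum(t["amount"] for t in transactions)   # total value moved
    revenue = sum(t["fee"] for t in transactions)    # fee income
    return {
        "volume": len(transactions),                 # transaction count
        "value": value,
        "revenue": revenue,
        "active_customers": len({t["customer_id"] for t in transactions}),
        "active_agents": len({t["agent_id"] for t in transactions}),
        # variance versus budget answers "how does this compare?"
        "revenue_vs_budget": revenue - budget["revenue"],
    }

txns = [
    {"amount": 100, "fee": 2, "customer_id": "c1", "agent_id": "a1"},
    {"amount": 250, "fee": 5, "customer_id": "c2", "agent_id": "a1"},
    {"amount": 80, "fee": 2, "customer_id": "c1", "agent_id": "a2"},
]
kpis = monthly_kpis(txns, budget={"revenue": 10})
```

A real deployment would segment these figures by product and operational area, as in Table 3, and compute them in the data warehouse rather than in application code.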
CASE 5
MicroCred Uses Data Dashboards for Better Management Systems
Data Visualizations and Dashboards for Daily Performance and Fraud Monitoring

MicroCred is a microfinance network focused on financial inclusion across Africa and Asia. In Senegal, it operates a growing microfinance business offering financial services to people who lack access to banks or other financial services. Reach has been extended across the country by creating a network of over 500 DFS agents. The agents' POS devices can perform both over-the-counter (OTC) transactions for bill payments and remittances, and also facilitate deposits and withdrawals to MicroCred accounts. Transaction confirmation is provided through SMS receipt. By late 2016, nearly one third of customers had registered their account to use the agent channel, and over one quarter were actively using agent outlets to conduct transactions. This generated significant operational and channel performance data.

Figure 15: Example of MicroCred Dashboard Data

MicroCred was an early adopter of next-generation data management systems, acquiring and implementing BIME, a visualization tool to help optimize operations. It enabled MicroCred to develop interactive dashboards, tailored to answer specific operational questions. MicroCred most frequently uses two dashboards:

Daily Operations Dashboard
This gives a daily perspective on the savings and loan portfolios, highlighting any issues. It presents data over a three-month period, but can be adjusted according to user needs. This dashboard uses automated alerts to warn the operations team of potential problems. The reports, customized for operational teams, include measures such as:
• Tracking KPIs, including transaction volumes, commissions and fees
• Agent activity, with alerts to show non-transacting and under-performing agents
• Suspicious activity and potential fraud alerts, such as unusual agent or customer activity
• Monitoring of the DFS enrollment process, with a focus on unsuccessful enrollments
• Geographical spread of transactions

Monthly Strategic Dashboard
This gives a longer-term, more strategic view and is mainly used by the management team to visualize more complex business-critical measures. It was developed to consider behavior over the customer lifecycle, including how usage of the service evolves as customers become more familiar with both the technology and the services on offer. It is also possible to easily perform ad hoc analyses to follow up on any questions raised by the data presented in the dashboards. It focuses on:
• Usage of MicroCred branches versus agents
• Customer adoption and usage of DFS
• Deployment of the DFS channel
• Evolution of fundamental KPIs versus long-term goals

With visualization tools like BIME, it is simple to create graphs to illustrate operational data, making it easier to spot trends and anomalies, and to communicate them effectively. Implementing the data management system also presented some challenges, both technical and cultural. MicroCred recommends adopting a step-by-step approach, starting with some basic dashboards and building up over time to more sophisticated ones.

Visualization tools and interactive dashboards can be integrated into data management systems and provide dynamic, tailored reports that serve operations, management and strategic performance monitoring.

Data Used in Dashboards

There are two main levels of data recording required to develop the dashboards: transaction level and customer level. They serve different goals, but both are important.

Transaction Data
Transaction data are characterized by high frequency and heterogeneity. However, DFS providers should aim to standardize transaction typology in order to track product profitability, monitor and analyze customer (and agent) behavior, and raise early warning signals of account underperformance or low activity. Transaction types should be clearly differentiated and easily identifiable in the database, even when the transactions look technically similar. For example, a common cause of confusion occurs when there are multiple ways of getting funds into a customer account, such as incoming P2P transfers, bulk payments or cash-ins, but all are combined and simply reported as 'deposits.' These three transaction types should be treated separately because of their very different impact on revenue (one is a direct cost, one a source of revenue and one potentially cost neutral) and because of their implications for the marketing strategy.

Customer Data
Having a unique customer identifier is crucial, especially when the dashboard is sourcing data from multiple applications. Through data integration, providers can control data integrity to ensure quality data recording, which is necessary for tracking portfolio concentration, calculating product penetration, cross-selling and sales staff coverage, and analyzing other important metrics. There are generally two large groups of data that need to be recorded on a customer level: demographic and financial. Full lists of data metrics can be found in Chapter 1.2. The combination of transaction-level and customer-level data can provide useful insights about the behavior of certain customer segments and can lead to optimal performance management.
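The 'deposits' disambiguation described under Transaction Data can be sketched as a small classification step at recording time. The routing rules, field names (`channel`, `originator`) and type codes below are illustrative assumptions, not an industry standard.

```python
# Assumed revenue treatment of each funds-in type, per the discussion above.
REVENUE_IMPACT = {
    "CASH_IN": "direct cost",        # agent commission is paid on cash-in
    "INCOMING_P2P": "revenue",       # the sender pays a transfer fee
    "BULK_PAYMENT": "cost neutral",  # priced under the disbursement contract
}

def classify_funds_in(txn):
    """Map a raw funds-in record to an explicit type instead of 'deposit'.

    The fields inspected here are placeholders for whatever the platform
    actually records about a movement into a customer account.
    """
    if txn.get("channel") == "agent":
        return "CASH_IN"
    if txn.get("originator") == "corporate":
        return "BULK_PAYMENT"
    return "INCOMING_P2P"

kind = classify_funds_in({"channel": "agent", "amount": 50})
```

Once every funds-in record carries an explicit type, product profitability and marketing analyses can be run per type rather than on an undifferentiated 'deposits' bucket.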
Use Case: Agent Performance Management

Agent management is probably the most challenging aspect of providing successful digital financial services, as it requires regular hands-on intervention by a field sales team as well as back-office operations support. It can be problematic to disseminate information, because the team and the agents are geographically dispersed, have varying levels of connectivity, and are often equipped with fairly basic technology. Nevertheless, their data needs are many. Relationship managers, aggregators and agents with multiple outlets in multiple locations need performance and float management information. Field sales force workers who infrequently return to the office need to access information remotely. The agent needs information on their own performance in terms of transaction and customer count, volume of business, efficiency of sales (conversion), and profitability. Potentially, information on the cash replenishment services available, particularly in markets where agents can provide e-money float and cash management services to each other, will be useful. In markets with independent cash management partners, agents also need to be armed with data on float levels.

Agent performance management needs granular data, linked directly to the teams responsible for managing the outlets. Agent performance data need to be easily segmented in the same way that the sales team is structured, so that each section and individual can see their own performance. This is the basis for setting performance targets that can be accurately assessed and rewarded. In the example below, both the teams and the people responsible for each level of the agent hierarchy, from sales director to district sales representatives, need accurate, timely data relating directly to their responsibilities. The most useful information the sales team can be given relates to the agents for which they are responsible.

Agent Coverage Gaps
There are no definitive answers for the optimal number of agents needed for each customer to have reasonably easy access to an agent, and for each agent to have enough customers to generate an acceptable income. Research points to somewhere between 200 and 600 active customers per active agent as optimal for DFS providers, depending on market conditions. A key sales task is to monitor the agent and customer data, controlling the growth and location of agent outlets to ensure that they are in line with customer activity.

Identifying the Strongest Agents
Quality agents should be rewarded for their efforts. Incentives, including marketing activities and over-riders or performance-related bonuses, can be based on these data. Having personalized agent targets based on local market conditions, and having a way to clearly show agents how they are performing against their own targets and their peers, can be very powerful. Targets include liquidity and customer activity. A key characteristic of a good agent is that they rarely run out of e-money or cash float. Agent aggregator targets should be based on the liquidity management activity they are contracted to support, as well as their agent team's performance.

Identifying the Weakest Agents
In most markets, around 80 percent of agents are active. This means that customers wishing to transact with the other 20 percent of agents will probably be unable to do so, because there is insufficient float or an absent agent. Underperforming agents need either to be brought to an acceptable standard or, if this proves impossible, retired from service. Because lack of e-money liquidity has a strong correlation with non-performance, a key metric often used for agent performance analysis is the number of days 'out of stock' per month (that is, days with float levels below a threshold value).

This kind of agent data analysis is very effective, but quite detailed and often performed manually, which can be slow and labor intensive. Providing the sales team with automated data management tools that they can use in the field, as well as personalized performance metrics, can be powerful. The Zoona case study demonstrates these points well.
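The 'days out of stock' metric mentioned above is straightforward to automate once end-of-day float balances are recorded per agent. A minimal sketch follows; the 200-unit threshold and five-day tolerance are illustrative placeholders, not benchmarks from the text.

```python
def days_out_of_stock(daily_floats, threshold):
    """Count the days an agent's float closed below the threshold."""
    return sum(1 for f in daily_floats if f < threshold)

def flag_weak_agents(float_history, threshold, max_days=5):
    """Return agents whose out-of-stock days exceed the tolerance.

    `float_history` maps agent id -> list of end-of-day float balances.
    """
    return sorted(
        agent for agent, floats in float_history.items()
        if days_out_of_stock(floats, threshold) > max_days
    )

history = {
    "a1": [900, 100, 80, 40, 20, 10, 5, 950, 30, 25],  # often below 200
    "a2": [900, 800, 750, 900, 820, 880, 790, 940, 860, 900],
}
weak = flag_weak_agents(history, threshold=200)
```

Running the same computation per month, and per level of the agent hierarchy, gives the sales team the ranked list of agents to bring up to standard or retire.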
CASE 6
Zoona Zambia - Optimizing Agent Performance Management
Data Culture: An Integrated Data-driven Approach to Products, Services and Reporting

Zoona is the leading DFS provider in Zambia, offering OTC transactions through a network of dedicated Zoona agents. Agent services include customer registration, sending and receiving remittance payments, providing cash in and cash out for accounts, and disbursing bulk payments from third parties, such as salaries and G2P payments. Zoona has a data-driven company culture and tasks a centralized team of data analysts to constantly refine the sophistication and effectiveness of its services and operations.

Agent Location
Zoona has developed an in-house simulator to determine the optimum location for agent kiosks. The approach uses Monte Carlo simulations20 to test millions of possible agent location scenarios to identify which configurations maximize business growth. Factors such as the number of customers served per day by existing agents and queue lengths are used to determine local demand and potential for growth until saturation is reached. To ensure reliability, modeled scenarios are cross-referenced with input from the field sales team, which has local knowledge of the area and of the outlets under the most pressure.

In key locations, the team also uses Google Maps and physically walks along the streets, observing how busy they are and where the potential hot spots may be. For example, thousands of people may arrive at a bus depot, then disperse in various directions; Zoona maps the more popular routes, creating corridors where potential customers are likely to be found. Zoona also maps the location of competitors on these routes.

Agent Lifecycle
A relatively new agent on a main road may not be as productive as a mature agent in a busy marketplace, due to location and the mature agent having developed a loyal customer base. However, a robust DFS service needs agents in both locations, and the targets set for each agent should be realistic and achievable. Zoona analyzes agent data to project future performance expectations for agent segments, such as urban and rural, producing 'performance over time' curves for each agent, down to the suburb level. These support good agent management KPIs.

Liquidity Management
Agents require a convenient source of liquidity to serve transactions, so proximity to nearby banks or Automated Teller Machines (ATMs) is included in placement scenarios. Difficulty replenishing float can also be due to an overconcentration of agents, who collectively strain nearby float sources and undermine value for the local agent network. The Zoona simulations look at both scenarios as part of optimization. Furthermore, understanding that agent float is a key driver of agent performance, Zoona is piloting an innovative solution for collecting both an agent's cash and electronic float balances to help agents manage their float more effectively. This provides agents with access to performance management tools, developed using the QlikView data management and visualization toolkit, and provides Zoona with data that agents might otherwise not wish to report.

20 Monte Carlo simulations take samples from a probability distribution for each variable to produce thousands of possible outcomes. The results are analyzed to get the probabilities of different outcomes occurring.

Analytics can support many aspects of operations and product development: optimized agent placement, performance management, and tools that create incentives for voluntary data reporting. A data-driven company culture drives integration.
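The Monte Carlo placement idea described in the case (and in footnote 20) can be illustrated with a toy simulation: draw daily footfall for a candidate site from an assumed distribution, convert it to customers served, and average over many trials. All numbers, the normal footfall model and the capture-rate parameter are fabricated for illustration; this is not Zoona's simulator.

```python
import random

def simulate_site(mean_footfall, capture_rate, capacity, trials=10_000, seed=7):
    """Monte Carlo sketch: sample daily footfall, convert it to customers
    served (capped by kiosk capacity), and average over many trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        footfall = max(0.0, rng.gauss(mean_footfall, mean_footfall * 0.3))
        demand = footfall * capture_rate   # passers-by who actually transact
        total += min(demand, capacity)     # queue cap at the kiosk
    return total / trials

# Compare two hypothetical sites: a bus-depot corridor and a side street.
depot = simulate_site(mean_footfall=2000, capture_rate=0.03, capacity=120)
side_street = simulate_site(mean_footfall=300, capture_rate=0.03, capacity=120)
```

A real placement model would sample many variables at once (footfall, competitor presence, float availability) and score whole network configurations, as the case describes, rather than single sites in isolation.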
Agent Back Office Management
The agent back office team is responsible for all of the tasks required to set up new agents and then manage their ongoing DFS interactions. Often, this also includes sourcing the data needed by the sales team (above). To be effective, they need a lot of data, including both standard reports and access to data to run ad hoc reports focused on specific queries. As well as providing the sales team data, they also need to measure how long their many business processes take, in order to ensure their team has capacity to deliver against internal service levels. This is achieved by measuring issues raised by type and volume, and measuring issue resolution time, often via a ticketing system.

Business Partner Back Office
For the purposes of back office management, various types of non-agent business partners can be combined. These include billers and other PSPs; merchants; organizations using the DFS for business management purposes, including payroll and other bulk payments; and other FIs, including banks and DFS providers. The business partner management back office team is responsible for similar tasks as agent management, but with different regulatory requirements (and no need for float management). Consequently, the key metrics they need are similar to those for agents, but with some different business processes and targets.

Agent Efficiency Optimization
Data can be used more effectively by agent management teams when they have mobile and online access to these data. Some of these tasks include:
• Planning the workload
• Checking in and out of agent outlets on field visits
• Updating or verifying location and other demographic information for the outlet
• Showing customized performance statistics to the agent directly upon arrival
• Showing commission earned both to date and for the month
• Showing revenue earned on the customers that the agent is serving
• Allowing agents to add photos to the database
• Filling in basic Quality Assurance (QA) survey measures directly
• Notifying that KYC information is in transit
• Setting new performance targets and incentives
• Submitting agent service requests and queries directly to the operations team
• Capturing prospects for new agent outlet locations

Access to this kind of data can result in more motivated and successful agents, as well as improved overall DFS business performance. Important questions can be addressed, like: "How much e-money float do agents need?" In order to manage cash and digital floats, it is useful to understand the busiest times of day, week and month, and to provide guidance on agents' expected float requirements. It is also helpful to have flags on the system such that, if an agent's float falls below a minimum level, an automated alert is received by the person responsible for the agent's float management. In more sophisticated operations, algorithms can be used to proactively predict how much float each agent will need each day, and to advise them of the optimal starting balance either before trading commences or after agent trading closes. This can also be done for the amount of cash that the agent is likely to need to service cash-outs.
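The float prediction described above can be sketched simply: recommend the next opening e-money balance from the distribution of past daily outflows, and raise the automated low-float alert when the balance drops too far. The 95th-percentile rule, 10 percent buffer and alert ratio are illustrative assumptions, not recommendations from the text.

```python
def recommended_opening_float(daily_outflows, service_level=0.95, buffer=1.1):
    """Suggest an opening e-money float that covers all but the most
    extreme historical days, plus a safety buffer."""
    ordered = sorted(daily_outflows)
    # nearest-rank percentile over the sorted history
    idx = min(len(ordered) - 1, int(service_level * len(ordered)))
    return round(ordered[idx] * buffer, 2)

def low_float_alert(balance, expected_outflow, min_ratio=0.25):
    """True when the balance falls below a fraction of the expected daily
    outflow -- the trigger for the automated alert described above."""
    return balance < expected_outflow * min_ratio

history = [400, 520, 610, 450, 700, 480, 530, 620, 590, 1500]
opening = recommended_opening_float(history)
```

Note that a nearest-rank percentile over a short history is sensitive to outliers (the single 1500 day dominates here); a production model would smooth the history and condition on day of week and month.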
This can also be done for the office team is responsible for similar tasks survey measures directly amount of cash that the agent is likely to as agent management, but with different • Notify that KYC information is in transit need to service cash-out. 64 DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES CASE 7 FINCA DRC - What a Successful Agent Looks Like and Putting Results in Action Data Collection: Tuning the Process for Better Insights and Successful Implementation With a banking penetration rate of repay loans supports FINCA DRC Data availability and data quality were just below 11 percent, DRC has one to reduce its portfolio risk. the main challenges in developing the of the lowest rates of financial access agent performance model. Digitized in Africa. In 2011, microfinance The predictive model defined data are required for sources usually institution FINCA DRC introduced ‘successful agents’ both in terms only collected on paper, like agent its agent network, employing small of higher transaction numbers and application and monitoring forms. business owners to offer FINCA DRC volumes. Data for the Generalized Missing data must be minimized, banking services. The agent network Linear Model (GLM) came from both to make datasets more robust grew quickly, and by the time the three principle sources: and to enable the merging of datasets agent data collection began in 2014, by matching metadata fields. This • Agent Application Forms: These hosted more than 60 percent of requires standardizing data collected provide information on the FINCA DRC’s total transactions. By by different people, who may be using business and socio-demographic 2017, agent transactions had grown different collection methods. Lack of data on the owner. to 76 percent of total transactions. 
consistent data can lead to significant However, growth was mostly • Agent Monitoring Forms: FINCA sample reduction, undermining the concentrated in the country’s capital, DRC officers regularly monitor model’s prediction accuracy and Kinshasa and in one of the country’s agents, collecting information on performance. commercial hubs, Katanga. FINCA the agent’s cash and e-float, the DRC sought to expand the network shop condition, sentiment data on Successful agents in DRC are into rural areas and so they built a the agent’s customer interaction, identified by the following statistically predictive model to identify criteria and the FINCA DRC product significant criteria: geographic that define a successful agent. The branding displayed. This is then location, sector of an agent’s main results were incorporated into agent compiled into a monitoring score. business, gender of the agent, recruitment surveys, helping FINCA • Agent Transaction Data: These and whether they reinvest profits. DRC select good agents in expansion data include information about Women-owned agents are found, for areas. Moreover, the availability the volume and number of cash in, example, to make 16 percent more of a successful agent network that cash out and transfer transactions profit with their agent businesses customers can use to conveniently performed by individual agents. than their male counterparts; DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES 65 1.2_DATA APPLICATIONS the value of their business inventory is 42 percent higher. They were also found to put more money back into their business inventory, rather than keeping it in a bank account that yields little interest. This resulted in about 5 percent higher total average transaction value per month. These results were implemented to improve and streamline the agent selection process, which ultimately helped to expand the network into rural areas by incorporating factors into agent surveys and roll-out strategy. 
By 2016, the agent network had grown to host 70 percent of total transactions. The model identified location as a key criterion, revealing another research opportunity. As a follow-on study, FINCA DRC and IFC will use a RCT methodology to identify optimal agent placement location. Comparing data on agent’s profiles against agent metrics can highlight key characteristics that lead to enhanced agent performance. Integrating these learnings with agent targeting and management processes ensures the full leveraging of data for performance management. 66 DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES Use Case: Back Office Management through a several-step verification process, automated notifications can be sent either key information is recorded in the system to the front-office staff or to customers Process Automation manually by the front or middle-office, directly. For example, for churn prevention, Even though DFS providers are putting creating additional burden on staff and customers who are approaching dormancy a lot of effort in to developing front end causing inefficient time allocation. These status can receive reactivation text automation (mobile, online banking), some forms then have to be stored in a physical messages or emails. Borrowers can receive still struggle to develop highly automated warehouse and maintained for a certain notifications about upcoming payments back end functions. Automated tasks that period of time. Streamlining and simplifying or better-priced products available for can assist back-office operations – such the data collection process through the refinancing. Some functions requiring as loan underwriting and origination, front end interface and through a system human interventions, such as financial and transaction processing and automated of built-in data checks increases efficiency business analysis and personal relationship reconciliation – have tremendous value. and reduces labor costs. 
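FINCA DRC's model class can be illustrated with the simplest GLM for a binary outcome: a logistic regression. The sketch below fits one by stochastic gradient descent on fabricated toy data. The four 0/1 features mirror the significant criteria named in the case (location, business sector, gender, profit reinvestment), but the values, labels and fitted weights are invented for illustration, not FINCA DRC's dataset or results.

```python
import math

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit a minimal logistic regression (a GLM with a logit link)
    by stochastic gradient descent on the cross-entropy loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))   # predicted success probability
            err = p - yi
            b -= lr * err
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
    return w, b

def predict(w, b, xi):
    z = b + sum(wj * xj for wj, xj in zip(w, xi))
    return 1.0 / (1.0 + math.exp(-z))

# Toy 0/1 features per agent: [urban_location, trade_sector, female_owner,
# reinvests_profits] -- mirroring the criteria named in the case.
X = [[1, 1, 1, 1], [1, 0, 1, 1], [0, 1, 0, 0],
     [0, 0, 0, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
y = [1, 1, 0, 0, 1, 0]   # 1 = 'successful agent'
w, b = train_logistic(X, y)
```

In practice such a model would be fitted with a statistics package, validated out of sample, and only then built into recruitment surveys as the case describes.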
Use Case: Back Office Management

Process Automation
Even though DFS providers are putting a lot of effort into developing front-end automation (mobile and online banking), some still struggle to develop highly automated back-end functions. Automated tasks that can assist back-office operations, such as loan underwriting and origination, transaction processing and automated reconciliation, have tremendous value. Providers are now moving towards robotic automation of simple and repetitive processes, which can be carried out much more cheaply and accurately by machines than by humans. According to A.T. Kearney, Robotic Process Automation (RPA) makes operations 20 times faster than the average human, with cost savings of 25 percent to 50 percent for those who adopt it.21 The various areas of automation can generally be grouped into automation of data recording and automation of data processing.

The primary focus of data recording lies in digitizing paper-based workflows. Many providers still use paper-based application forms to collect account opening information. The errors that occur during manual entry force these forms through multiple loops of rework. Eventually, after going through a several-step verification process, key information is recorded in the system manually by the front or middle office, creating an additional burden on staff and causing inefficient time allocation. These forms then have to be stored in a physical warehouse and maintained for a certain period of time. Streamlining and simplifying the data collection process, through the front-end interface and a system of built-in data checks, increases efficiency and reduces labor costs. Of course, in order to record the data in a robust manner, the IT architecture must be strong enough to correctly classify, check and store data.

Data processing can be automated at almost all stages of the customer relationship. Establishing standard verification steps can speed up account opening and account changes, and credit decisions for certain segments can be triggered by well-structured, tested scoring models. Furthermore, action heat maps can automate disbursements, and automated request and feedback forms can digitize account closures. Advanced analytics, described in the previous chapter, which can include lead generation for sales campaigns or multichannel management, may be used to uncover untapped opportunities and risks within the portfolio. Once these are identified, automated notifications can be sent either to front-office staff or to customers directly. For example, for churn prevention, customers who are approaching dormancy status can receive reactivation text messages or emails. Borrowers can receive notifications about upcoming payments or better-priced products available for refinancing. Functions requiring human intervention, such as financial and business analysis and personal relationship management, will complement and benefit from the automated processes.

21 'Robotic Process Automation: Fast, Accurate, Efficient', A.T. Kearney, accessed April 3, 2017, https://www.atkearney.com/financial-institutions/ideas-insights/robotic-process-automation

Risk Monitoring and Regulatory Compliance
In the aftermath of the 2008 financial crisis, national regulators have continuously tightened regulation of the financial industry to protect both customers and the industry in general. Increased capital, liquidity and transparency requirements put a heavy burden on the regulated financial industry, while creating a competitive advantage for non-regulated players such as financial technology providers. Consequently, banks have to budget higher compliance costs for adhering to regulatory requirements. Regulatory reporting requires pooling data from various systems, including the financial ledger, accounting system, treasury, asset quality monitoring, and collections databases, among others. Regular stress tests require a strong IT infrastructure with a high capacity to store and process large amounts of data. Moreover, KYC compliance requires real-life data feeds for timely and safe decision-making. The data necessary for measuring and monitoring market, credit, AML and liquidity risks are ideally housed in a unified repository, enabling a DFS provider to have a complete picture of risk across its entire portfolio. This unified repository also enables the DFS provider to run the scenario analyses and stress tests needed to meet regulatory requirements. Regulatory compliance incurs direct costs through the higher cost of capital, as well as indirect costs, such as establishing reporting processes, allocating staff time and, in some cases, investing in new technology.

Fraud Prevention
With global trends moving towards cloud computing, data governance and protection become increasingly important. DFS providers have to pay closer attention to customer transaction behavior. They must also perform KYC compliance in order to detect potentially fraudulent activities, such as money laundering and false identities, while avoiding or reducing operational and financial risks. New cybersecurity interventions and regulations will require DFS providers to develop and maintain tools aimed at protecting against external threats and potential criminal activities. Maintaining and aggregating the data necessary to build fraud prevention and operational risk models can reduce a DFS provider's exposure. Real-time data streaming and processing enables providers to detect fraud faster and more precisely, thus reducing potential losses. For example, if a customer's credit or debit cards are being used from an unusual geographical location or at an unusual frequency, DFS providers can alert the customer and potentially block the processing of these suspicious transactions.

Data Tracking for Fraud Detection
For DFS providers that offer P2P services, a variety of tools can determine whether transactions are fraudulently being deposited directly into someone else's account in order to bypass fees: instead of the sender using their own account and paying fees, a deposit is made (from an agent account) directly into the recipient's account. Transaction speed can give a basic indication; if money is deposited into an account and then withdrawn again in a very short period of time, there is a fairly good chance that it was a direct deposit. Transaction location gives an even better indication, because if the locations of the agents handling the deposit and the withdrawal are some distance apart, it is unlikely, or even impossible, that the customer could have traveled between those points in the interval between the transactions. It should be possible to create alerts for this kind of behavior, and agents who process unusually high numbers of direct deposits can be followed up. This will not catch transactions between customers living in close proximity, so many DFS providers also perform mystery shopper research to better understand direct deposit levels.
Use Case: User Interaction Management

Managing customers through the lifecycle, encouraging increased usage, and managing new behavior fall within the remit of the marketing team. However, there is also an operational aspect to customer management that is predominantly a concern for the customer service, risk and technical teams. These teams are responsible for ensuring that the user interaction works as designed, and for detecting and fixing any issues. They are also responsible for managing the user interaction for business customers and internal users.

In this regard, it is important to define the 'normal' expected usage and behavior of the system so forecasts can be made for both technical and commercial planning. Measures are usually set from the top down, such as monthly business targets and strategic goals. That said, some outcome metrics need to be gathered from the bottom up, such as measurements of the average usage of a service. As previously discussed, using averages can be misleading, and behavior may need to be broken into sectors and then aggregated into an 'average view' of activity against which plans can be made. For example, the technical team needs to know both the expected number of transactions per day and the likely busy periods, so they can ensure the system can cope with the peaks.

Defining 'normal behavior' patterns is fundamental to risk management. Activity patterns that stray from the agreed norms, particularly transactional and service use data, should be flagged. These patterns should be reviewed to determine whether the unusual behavior was legitimate or a potential case of fraud. As well as customer and agent behavior, it is also wise to profile 'normal activity' for employee interactions with the system. For example, is one employee looking at significantly more customer records than a 'normal' employee in the same role, or accessing the system outside of their normal shift patterns? Such abnormal activity could point to potential fraud.

Customer Service Efficiency Improvements
Customer service teams in the call centers are the employees closest to the DFS customer on a day-to-day basis. Because of this, they can provide early warning of any major issues that may arise. Often, they will be the first to learn of a system fault or fraudulent agent behavior, so a process is needed to alert the appropriate team of any potential issues based on the (sense-checked) information received from customers. These teams are also likely to hear about minor service-affecting problems that prevent customers from transacting optimally, such as a lack of agents, restrictive transaction limits and short transaction timeouts. It is therefore important to collect statistical data on the calls received, including complaints and suggestions. Leveraging this type of data is exemplified in Case 8.

Monitoring the number of calls as the service grows helps to determine how many call center representatives are needed. For some busy services, only a proportion of the calls attempted actually make it to a customer care line. In this case, calls attempted versus calls presented is an important figure, as it indicates either a major issue or inadequate staffing. The most frequently reported call center issues are forgotten PINs, lost phones or cards, transactions sent to the wrong recipients, and lost voucher codes. The number of calls that can be taken depends on the speed of the back-office system and how quickly it can respond in resolving the issue. As call center costs are generally high, the data they provide should be used to speed up the issue resolution process and to increase the number of calls each representative can take. These data can also be used to improve the user experience so that the customer makes fewer mistakes.
For example, appropriate team of any potential issues These data can also be used to improve the technical team needs to know both based on the (sense-checked) information the user experience so that the customer the expected number of transactions per received from customers. These teams are makes fewer mistakes. DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES 69 1.2_DATA APPLICATIONS CASE 8 Safaricom M-Pesa - Using KPIs to Improve Customer Service and Products Using Data Analytics to Identify Operational Bottlenecks and Prioritize Solutions M-Pesa in Kenya was the pioneer problems in both the technology and pace with the increase in customer of DFS at scale, with 20.7 million business processes, as a bad customer numbers. To identify bottlenecks customers, a thirty-day active base of experience could quickly erode and prioritize solutions, the team 16.6 million,22 and revenue reported customer trust. Data-driven metrics analyzed their data. PABX call data in 2016 of $4.5 billion.23 When supported the team to plan and guide and issue resolution records were Safaricom launched the service in operations appropriately. examined and found the following: 2007, there were no templates or best As service uptake was unexpectedly • Length of Call Time: The average practices; everything was designed high from the start, the number of call was taking 4.5 minutes, around from scratch. Continuous operational calls to the customer service call double the length of time budgeted improvement was essential as the center was correspondingly much for each call. service scaled. higher than anticipated, resulting in • Key Issues for Quick Resolution: a high volume of unanswered calls. Uptake for the service was The two key call types to be tackled This problem established a KPI that unexpectedly high from the start, for optimization were customers the customer care team needed to with over 2 million customers in its forgetting PINs and customers resolve to acceptable levels. 
first year, beating forecasts by 500 sending money to the wrong phone percent. This growing demand forced The problem was first tackled by number; this covered 85 percent rapid scale, and required operations recruiting additional staff, but to 90 percent of long calls coming to proactively anticipate scaling recruitment alone could not keep into the call center. 22 Richard Mureithi, ‘Safaricom announces results for the financial year 2016’. Hapa Kenya, May 12, 2017, accessed April 3, 2017, http://www.hapakenya.com/2016/05/12/safaricom-announces-results-for-the-financial-year-2016/ 23 Chris Donkin, ‘M-Pesa continues to dominate Kenyan market’. Mobile World Live, January 25, 2017, accessed April 3, 2017, https://www.mobileworldlive.com/money/news-money/m-pesa-continues-to-dominate-kenyan-market/ 70 DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES The analysis accomplished two things. First, bottlenecks were successfully identified, passing key insights into operations. Second, other operational issues were uncovered, mainly, the extent to which customers erroneously sent money and forgot their pins. Managing against the Unanswered Calls KPI therefore delivered broader operational benefits. Using the analytic results, operations implemented a resolution strategy. First, by understanding lengthy versus short problem types, difficult issues could be rapidly identified and passed quickly to a back-office team. This reduced customer wait times and bottlenecks, allowing more customers to be processed per day. Second, operations and product development teams worked to reduce times across all call types. This was achieved by improving technical infrastructure and user interface, mitigating the problems that caused lengthy calls. The combination of initiatives reduced the Call Length KPI and number of Unanswered Calls KPI, shifting both to acceptable levels despite customer numbers continuing to grow beyond forecasted levels. Managing by KPIs is a critical element of operations. 
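The kind of call-center analysis described in Case 8 can be sketched in a few lines. The records, field names and thresholds below are hypothetical illustrations, not Safaricom's actual data or systems:

```python
from statistics import mean

# Hypothetical call-center records: (issue_type, duration_minutes, answered).
calls = [
    ("forgot_pin", 5.1, True),
    ("wrong_recipient", 6.0, True),
    ("balance_query", 1.5, True),
    ("forgot_pin", 4.8, True),
    ("wrong_recipient", 5.5, False),   # abandoned before an agent answered
    ("lost_phone", 3.9, True),
]

BUDGETED_MINUTES = 2.25   # assumed per-call budget (half the observed 4.5)
LONG_CALL = 4.0           # assumed threshold for a 'long' call

avg_length = mean(d for _, d, _ in calls)
unanswered_rate = sum(1 for _, _, ok in calls if not ok) / len(calls)

# Which issue types dominate the long calls? These are the candidates for
# quick-resolution fixes, such as self-service PIN resets.
long_calls = [t for t, d, _ in calls if d >= LONG_CALL]
share_by_issue = {t: long_calls.count(t) / len(long_calls) for t in set(long_calls)}

print(f"average call {avg_length:.1f} min (budget {BUDGETED_MINUTES} min); "
      f"unanswered {unanswered_rate:.0%}")
```

Tracked over time, the same few aggregates become the Call Length and Unanswered Calls KPIs that the case describes.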
Analyzing the data behind KPIs in detail can help to identify operational bottlenecks, and may even reveal other operational factors that push metrics beyond thresholds. Understanding the data that drive a KPI can make it more useful.

Use Case: Technical Operations Data

By its very nature, a DFS service needs to be available 24 hours a day, seven days a week, and is normally designed to process large volumes of system interactions, both financial and non-financial. For this reason, the service needs to be proactively monitored, with preventative action taken to ensure continuous service availability. Data from service diagnostics are typically used to perform this analysis. Technical performance dashboards need to be updated in real time to show system health. They should be automatically monitored and engineered to alert the responsible functions and people if a potential problem is spotted. The concept of using data to 'understand normal' is used to proactively detect faults in the various layers of the service, and automatic monitoring solutions are set up to detect when threshold settings are breached. For example, if a DFS system normally processes a given number of transactions per second (TPS) every Thursday evening, but one Thursday the figure is much lower, it signals that there is likely a problem that requires action.

Trends can be used to predict performance issues while also identifying specific incidents; because of this, the team must also consider performance over time. Trend analysis is vital in capacity planning, and system usage and growth patterns give important clues as to when extra system capacity will be needed. Whether the system is outsourced or an internal development, it is important that the technical team monitor service levels and capacity trends, planning remedial actions. The key data normally required include system availability, planned and unplanned downtime, transaction volume, and peak and sustained capacity.

Transactions and Interactions

A transaction is a financial money movement, usually the act of debiting one account and crediting another. In order to make that happen, the user has to interact with the system. Those interactions can themselves offer insights, and are frequently used in the digital product development of smartphone and web services to help understand the customer better.

DFS interactions, even those made using basic phones, can be measured and can provide useful data about the customer experience of a service. For example, it is possible to measure interactions such as 'abandoned attempts to perform a financial transaction', then diagnose what prevented the customers from completing these transactions. Another example is when customer services interact with the system on a customer's behalf, for example, resetting a forgotten PIN. These interactions are rarely measured, but can also provide useful insights to improve service operations.

Successful DFS services have good communication between the commercial and technical teams. The commercial team should proactively discuss their marketing plans and forecasts, as well as any competitive activity, in order to prepare the technical team for potential volume changes. Regular meetings (at least quarterly) are needed to review the latest volume forecasts based on the previous quarter's results and planned marketing activity. This enables the technical team to plan accordingly. The technical team must, in turn, advise any partners that may be affected by a change in forecast. This is particularly relevant to MNO partners, as there have been several instances of unmanageable SMS volume requirements during unusually successful promotions. Similarly, if technical changes or overhauls are planned, marketing needs to be aware and should avoid activities that might put additional strain on the system at that time.

Lessons Learned from Operations and Performance Management

Record the Business Benefit of Airtime Sales: Reports can be misleading when customers use DFS to buy airtime. Depending on the core business of the DFS provider, selling prepaid airtime can either be a source of revenue or a cost savings. For non-MNOs, each airtime sale will attract a small commission, as they are acting as an airtime distributor. This income should be considered part of the DFS revenue. For MNOs, rather than revenue, this transaction is a cost savings with significant impact, because it eliminates the (typically) 2 percent to 3 percent commission fees and distribution cost. However, many MNOs do not attribute this cost savings to the DFS business because it has been accounted for within the prepaid airtime budget line. While this may be correct in accounting terms, to accurately gauge the value of the DFS to the business, this cost savings should be included in DFS internal management accounts.

Beware of Averages: By their nature, DFS offerings tend to attract both people with limited resources who lack access to banks and the better-off people (and businesses) that interact with them. This leads to very high volumes of low-value transactions alongside small numbers of relatively high-value transactions. Data visualization can be very effective in identifying where the use of averages is inappropriate. For example, Figure 16 shows a typical distribution frequency curve of transaction values for a DFS provider, with the majority of transactions (the mode) being $20. The average transaction value is $86, though, because a relatively small number of high-value transactions skew the average. These averages can lead to a mistaken and inflated view of the 'average' customer's wealth and financial activity.

Figure 16: Transaction Value Frequency Chart Demonstrating that Averages can Lead to the Wrong Conclusions (transaction value mode = $20; average transaction value = $86)

Look at Longer-term Trends and Short-term Results: Trends provide much richer insights than a data point in isolation. Changes need to be understood in the context of time, as there may be a seasonal effect, like a public holiday, that is responsible for a leap in activity. This peak may be followed by a dip, then a return to the status quo, which is common around Christmas. There can also be a seasonal impact; for example, during harvest time, farmers with cash crops make the majority of their annual income and are much more financially active as compared with other times of the year. Other causes of short-term changes in performance may be competitive activity, extreme weather and political uncertainty.

Beware of Vanity Metrics: Vanity metrics might look good on paper, but they may give a false view of business performance. They are easily manipulated and do not necessarily correlate to the data that really matter, such as engagement, acquisition cost, and, ultimately, revenues and profits. A typical example of DFS vanity metrics is reporting registered, rather than active, customers; another is reporting total agents instead of active agents. Only by focusing on the real KPIs and critical metrics is it possible to properly understand the company's health. If a business focuses on the vanity metrics, it can get a false sense of success.

Technical Service Level Data Must Be Relevant to the Business Objectives: Each MNO operations team collects a wealth of data about how its system is performing. However, in complex, multi-partner DFS, they may not consider the end-to-end service performance and its effect on user experience. For the customer, the performance indicator that is of relevance is the end-to-end transaction performance: did the transaction complete, and how long did it take? It is surprising how few DFS measure this end-to-end transaction performance, given its pivotal role in establishing and maintaining customer trust, establishing acceptance of the DFS and maintaining the reputation of the business. Figure 17 illustrates the issue for a customer using their phone to pay a bill. In this case, there are three 'system owners' involved: an MNO providing connectivity, the DFS provider processing the transaction, and the biller being paid.

Each system returns its own efficiency data, but the customer experience may be quite different if there are hand-off delays between systems. Another common example is when MNOs provide Unstructured Supplementary Service Data (USSD) sessions with either too short a timeout or a USSD dropout fault, so some customers physically cannot complete a transaction in the time allocated. It should be straightforward in a supplier-vendor relationship to ask for data that will show relevant information, for example, USSD dropouts or transaction queues. However, it is often a critical issue in DFS provision that there are no direct or comprehensive service level agreements (SLAs), which can sometimes make it impossible to understand information in this detail.

Figure 17: Transaction Time: System Measures versus Customer Experience. The customer's end-to-end time is the sum of the individual legs (Time = t1 + t2 + t3 + t4 + t5): the MNO delivers the transaction request (t1), the DFS provider confirms the transaction can proceed (t2), the utility billing system confirms details and forwards transaction information (t3), the DFS provider completes the transaction (t4), and the MNO delivers the transaction confirmation (t5).

Filtering the Data Deluge: Every interaction with a DFS system can generate a large number of data points. Some of these will be financial, and some will record what interface is being used, or even how long it takes the user to navigate the user experience. The intensity of information gathered rises vastly as systems make increasing use of more advanced user interfaces, such as smartphones. This can lead to information overload and 'filter failure' – essentially, an inability to see the wood for the trees. This, along with constraints around securing the necessary resources to manage these new data feeds, is the reason why so little of this information is used by the business for decision-making. Failing to collate and correlate external information with in-house data can lead to a loss of key insights.

CASE 9
M-Kopa Kenya – Innovative Business Models and Data-driven Strategies
Data-driven Business Culture Incorporates Analytics Across Operations, Products and Services

Established in Kenya in 2011, M-Kopa started out as a provider of solar-powered home energy systems, principally for lighting, while also charging small items like mobile phones and radios. The business combines machine-to-machine technology, using embedded SIM cards, with a DFS micro-payment solution, meaning the technology can be monitored and made available only when advance payment is received. Customers buy M-Kopa systems using 'credits' via the M-Pesa mobile money service, then pay for the systems using M-Pesa until the balance is paid off and the product is owned. In recent years, the business has expanded into other areas, including the provision of home appliances and loans, using customer-owned solar units as refinancing collateral. These products are offered to customers who have built an 'ability-to-pay' credit score metric, as assessed by their initial system purchase and subsequent repayment. M-Kopa is now also available in Uganda, Tanzania and Ghana.

M-Kopa uses data proactively across the business to improve operational efficiency. Its databases amass information about customer demographics, customer dependence on the device, and repayment behavior. Each solar unit automatically transmits usage data and system diagnostic information to M-Kopa, informing them when, for example, the lights are on. All of this can be analyzed to improve quality of service, operational efficiency and understanding of customer behavior.

Technical Capacity Management
An analysis of customer usage and repayment behavior shows that users prefer to buy credits in advance in order to secure reliable power for the days ahead. By knowing when customers are likely to pay (and how far in advance), M-Kopa can forecast expectations and plan accordingly, ensuring their customers will not be affected by announced M-Pesa outages that might prevent payments from posting.

Customer Service
M-Kopa devices communicate battery data when they check in, and data analysis allows customer service to check whether the units are operating as intended and allows proactive and preventative maintenance that can be performed remotely:

• If a customer complains that they are not receiving the expected amount of power, battery dashboards are used to diagnose the problem, for example, the battery not being charged fully during daylight hours.
• Despite good manufacturing quality controls, there are always variations in battery performance when units are in the field, determined by factors such as usage patterns or environmental conditions. M-Kopa has created predictive maintenance algorithms to detect sub-optimal battery performance, allowing it to intervene and arrange for a free replacement before battery 'failure' occurs.

Sales Team Management
The field sales team sells M-Kopa products and services directly to customers. Sales representatives use a smartphone app to log all of their activities digitally, in real time. This allows a detailed understanding of their performance and fast turnaround when dealing with issues. Dynamic online performance measures and league tables can be broken down by individual and are available to the sales management team and team leaders to encourage performance improvements through gamification.24 The app also allows team members to track their commission and any additional bonuses and incentives.

24 Gamification is the application of game-design elements and game principles in non-game contexts.

Targeting Likely Customers for Additional Sales
Customer repayment behavior can provide a lot of information about financial health and credit-worthiness. Battery data show a customer's dependence on the device for lighting, which adds a deeper level of understanding. This information is used to identify and actively target existing customers for upgrades and additional services. M-Kopa also shares this information with credit bureaus to help provide customers with a credit rating.

A data-driven corporate culture is necessary to integrate analytics and reporting throughout the entire enterprise. This helps to leverage data sources and analytics across multiple areas to engage new customers, manage sales teams, provide better customer service, and develop new products.
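Case 9's predictive battery maintenance can be illustrated with a minimal sketch: flag units whose reported daily charge gain has fallen well below the fleet average. The data, names and threshold below are hypothetical; M-Kopa's actual algorithms are not public:

```python
from statistics import mean

# Hypothetical daily charge gain (percent of battery capacity) reported
# by each unit when it checks in over the network.
fleet = {
    "unit-001": [92, 95, 90, 93],
    "unit-002": [88, 91, 89, 90],
    "unit-003": [60, 55, 58, 52],   # a degrading battery
    "unit-004": [94, 96, 93, 95],
}

def flag_underperformers(fleet, frac=0.8):
    """Flag units whose average daily charge gain has fallen well below the
    fleet-wide average; the 80 percent threshold is purely illustrative."""
    per_unit = {uid: mean(days) for uid, days in fleet.items()}
    fleet_avg = mean(per_unit.values())
    return sorted(uid for uid, avg in per_unit.items() if avg < frac * fleet_avg)

print(flag_underperformers(fleet))   # ['unit-003']
```

A production version would account for weather and usage patterns, but the shape is the same: define 'normal', then act on units that drift away from it before they fail.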
More examples of gamification within DFS can be found in studies on the CGAP website: https://www.cgap.org/blog/series/gamification-and-financial-services-poor/

Storing System Interactions: Even a few years ago, when many DFS offerings were being launched, data capture and storage was relatively expensive and cumbersome, and so data that were not immediately needed to run the business were not retained. New technology allows cheap and plentiful data storage. Server logfiles are normally ignored, but there are now tools for analyzing the data they contain, making it possible to correlate multiple sources of data and provide richer information about services. It is strongly recommended that DFS providers collect and store every bit of data they can about every system interaction, even those that were declined. Whilst it may not seem useful or relevant to current operations, it may well be of value at a future date for advanced data analytics or fraud forensics. Non-repudiation principles require that changes be recorded as additional events, rather than attempting to edit previously finalized records. For example, if commission needs to be clawed back from an agent, this should be recorded explicitly as a separate (but linked) activity, rather than silently paying a smaller amount or simply adjusting the commission payable file.

Failed Attempts: It is common for DFS providers to retain the data associated with successful transactions, where the requested activity was completed. However, failed transactions can also provide insights. The reasons why particular transactions were declined can point to very specific needs, such as the need to provide targeted information and education, a technical fault, or a shortcoming in the service design that needs to be amended to provide a more intuitive user experience. In order to perform these advanced analytics, every bit of information about every system interaction should be collected and stored, even if its relevance is not immediately obvious.

Combining Data to Add Context: Combining DFS provider data with data from partners can have many operational benefits. For example, where there is collaboration with an MNO, there is also information on where the sender and recipient were physically located, the SIM card used, the kind of phone used, potential call records, and customer recharge patterns. As many markets have a strict SIM card registration mandate, the customer KYC information can also be used to complete and cross-reference records. While some of these parameters are not of primary importance to transactions, these data are useful in determining system anomalies; for example, if a customer normally transacts from a particular phone, and that phone has changed, it may be that the transaction is fraudulent. Further evidence may be gathered by cross-referencing the location where the transaction took place with the customer's normal location log.

There can be challenges in trying to correlate data from different sources, which require consideration during the database design process. For example, even when the MNO is part of the same organization as the DFS provider, data sharing can be an issue because the two systems have not been designed to provide information services to one another. Retrospectively trying to link the telecoms data from a customer system interaction with the DFS financial transaction information is not simple. This is usually because there is no common piece of data linking the two records, and even the clocks time-stamping the event on the two systems are unlikely to be perfectly synchronized. Because of this, many systems only perform data combining activity by exception, usually for fraud investigations on a case-by-case basis. However, the additional context provided by combined data can add layers of value, particularly in the case of proactive fraud monitoring. Making it easier to combine data so that it can be used in 'business-as-usual' operational activities is worth considering, particularly for more mature DFS operations.

Single Source of Truth: When there are multiple systems, it is common to have the same data duplicated in multiple places. This is often because current infrastructure makes it hard to combine data sources any other way. This data duplication can lead to issues regarding 'source of truth', in other words, questions around which source of data to trust when there is conflicting information. All systems are occasionally subject to errors, and when there is a dispute over transaction details or a debate whether funds were transferred, there has to be clear agreement about whose data should be believed. Working through these details is part of any project that combines and compares sources of information; it is also important to clearly understand whether a record is final or can still be updated. Incorrectly treating a non-final record as final can lead to havoc in data analysis, creating mistrust in the platform integrity.

1.2.3 Analytics and Applications: Credit Scoring

Credit scoring may be broadly described as the study of past borrower behavior and characteristics to predict the future behavior of new and existing borrowers.25 The emergence of big data, and the sources and formats of these data, has presented additional approaches to the credit scoring process. Incorporating these alternative data sources drives alternative credit scoring models. This section looks at how data drives credit scoring, and which types of data work best for various needs. The fundamental credit scoring relationships are represented as a timeline in the figure below.

Figure 18: Timeline Definition of Credit Scoring. Past: borrower characteristics and loan repayment behavior; Present: borrower characteristics; Future: loan repayment behavior.

25 Schreiner, 'Credit scoring for microfinance: Can it work?', Journal of Microfinance/ESR Review, Vol. 2.2 (2009): 105-118

Below are the key points illustrated in Figure 18:

1. Past: Data (or, in their absence, experience) is studied to understand which borrower characteristics are most significantly related to repayment risk. This study of the past informs the choice of factors and point weights in the scorecard.

2. Present: The scorecard (built on past borrower characteristic data) is used to evaluate the same characteristics in new loan applicants. The result is a numeric score that is used to place the applicant in a 'risk group', or range of scores with similar observed repayment rates.

3. Future: The model assumes that new applicants with the same characteristics as past borrowers will exhibit the same repayment behavior as those past borrowers. Therefore, the past observed delinquency rate for a given risk group is the predicted delinquency rate for new borrowers in that same risk group. For each new loan applicant, the scoring model will calculate and report what percentage of past borrowers with the same combination of borrower characteristics were 'bad'.

An entire handbook could be written on credit scoring, and indeed several thorough and accessible texts have been published on the topic over the past decade.26 In addition, CGAP recently published an introduction to credit scoring in the context of digital financial services.27 For the purpose of this handbook, the remainder of this credit section focuses on:

1. How data are turned into credit scores
2. How data are being used to meet credit assessment challenges in developing markets

Scorecard Development

Credit scorecards are developed by looking at a sample of data on past loans that have been classified as either 'good' or 'bad'. A common definition of 'bad' (or 'substandard') loans is '90 or more consecutive days in arrears',28 but for scorecard development, a bad loan should be described as one that (given hindsight) the FIs would choose not to make again in the future.

It is important to conduct analysis on both the good and the bad loans. Studying the risk relationships in credit data is as simple as looking at the numbers of good and bad loans for different borrower characteristics. The more bad loans as a share of total loans for a given borrower characteristic, the more risk.

The cross-tabulation, or contingency table, is a simple analytical tool that can be used to build and manage credit scorecards. Table 4 shows the number of good and bad loans across ranges of values for an example MNO data field, in this case, time since registration on the mobile network. Suppose the expectation is that applicants with a longer track record on the mobile network will be lower risk (usually longer track records, whether in employment, in business, in residence, or as a bank customer, are linked to lower risk).

26 See for example: Siddiqi, 'Credit risk scorecards: developing and implementing intelligent credit scoring', John Wiley and Sons, Vol. 3 (2012).
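The cross-tabulation just described takes only a few lines to compute. The sketch below uses the good/bad counts from Table 4; any statistical package, including the open-source R software mentioned later in this section, produces the same figures:

```python
# Good and bad loan counts by time since registration on the mobile
# network (the counts behind Table 4; group labels abbreviated).
goods = {"<=2m": 115, "2m-1y": 161, "1y-2y": 205, "2y-3y": 116, ">3y": 203}
bads  = {"<=2m": 48, "2m-1y": 48, "1y-2y": 50, "2y-3y": 24, ">3y": 30}

def bad_rates(goods, bads):
    """Bad rate per group: bads / (goods + bads), as in row C of Table 4."""
    return {g: bads[g] / (goods[g] + bads[g]) for g in goods}

rates = bad_rates(goods, bads)
overall = sum(bads.values()) / (sum(goods.values()) + sum(bads.values()))  # 200/1,000 = 20%

# The analyst looks for a sensible, intuitive pattern: here the bad rate
# falls steadily as time on the network rises.
assert rates["<=2m"] > rates["1y-2y"] > rates[">3y"]
```

Rates computed this way can differ from a printed table by small rounding amounts; the steadily falling pattern, not the decimals, is what matters for scorecard building.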
Anderson, ‘The Credit Scoring Toolkit: Theory and Practice for Retail Credit Risk Management and Decision Automation’, Oxford University Press, 2007

27 ‘An Introduction to Digital Credit: Resources to Plan a Deployment’, Consultative Group to Assist the Poor (CGAP) via SlideShare, June 3, 2016, accessed April 3, 2017, http://www.slideshare.net/CGAP/an-introduction-to-digital-credit-resources-to-plan-a-deployment

28 For DFS and micro lenders, the ‘bad’ loan definition can often be a much shorter delinquency period, such as 30 or 60 days in consecutive arrears. Product design (including penalties and late fees) and the labor involved in collection processes will influence the point at which a client is better avoided, or ‘bad’.

80 DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES

| Row | <= 2 Months | > 2 Months and <= 1 Year | > 1 Year and <= 2 Years | > 2 Years and <= 3 Years | > 3 Years | Row Total |
| A Goods | 115 | 161 | 205 | 116 | 203 | 800 |
| B Bads | 48 | 48 | 50 | 24 | 30 | 200 |
| C Bad Rate | 29.4% | 23.0% | 19.8% | 17.3% | 12.7% | 20.0% |
| D Total | 163 | 210 | 255 | 140 | 233 | 1,000 |
| E % Total Loans | 16.3% | 21.0% | 25.5% | 14.0% | 23.3% | |

Table 4: Loan Cross-tabulation

Table 4 can be read as follows:
Row A: Number of good contracts in group (column)
Row B: Number of bad contracts in group (column)
Row C: Number of bad contracts (row B) / Number of total contracts (row D)
Row D: Number of total contracts (row A + row B)
Row E: Total contracts in the group (column) divided by all contracts (1,000)

To conduct analysis, the next step is to look for sensible and intuitive patterns. For example, the bad rate in row C of Table 4 clearly decreases as the time passed since network registration increases. This matches the initial expectation.

An easy way to think about each group’s risk is to look at its bad rate relative to the 20 percent (average) bad rate by time since registration:
• Less than 2 months, the bad rate is 29 percent, one and a half times the average.
• Between 1 year and 2 years, the bad rate is 19.8 percent, or average risk.
• More than 3 years, the bad rate is 12.7 percent, a little over half the average risk.

In traditional credit scorecard development, analysts look for simple patterns – including steadily rising or falling bad rates – that make business (and common) sense. Credit scorecards developed in this way translate nicely to operational use as business tools that are both transparent and well understood by management. An alternative approach to scorecard development is data mining, or using more complex machine-learning algorithms to find any relationships in a data set, whether understood by a human analyst or not. Although a purely machine-learning approach might result in improved prediction in some situations, there are also difficult-to-measure but practical advantages to business and risk management fully understanding how scores are calculated.

Cross-tabulation or similar analysis of single predictors is the core building block of credit scoring models.29 Creating cross-tabulations like those in the example above is easy using any commercial statistical software or the free open-source ‘R’ software.

29 In fact, logistic regression coefficients can be calculated directly from a cross-tabulation for a single variable.

1.2_DATA APPLICATIONS

Use Case: Developing Scorecards

Scorecard points are transformations of the bad rate patterns observed in cross-tabulations. Although there are many mathematical methods that can be used to build scorecards (see Chapter 1.2.3), the different methods give similar results. This is because a statistical scoring model’s predictive power comes not from the math, but from the strength of the data themselves. Given adequate data on relevant borrower characteristics, simple methods will yield a good model and complex methods may yield a slightly better model. If there are not good data (or too few data), no method will yield good results. The truth is that scorecard development not only favors simple models, but also means that a data-driven DFS provider should initially focus on capturing, cleaning and storing more and better data.

Table 5 below is another cross-tabulation, this time for the factor ‘age’. Like the previous table, the bad rates in row C show the risk (the ‘bad rate’), which decreases as age increases.

Bad Rate Differences
A very simple way to turn bad rates into scorecard points is to calculate the differences in bad rates. As shown in row G, the bad rate for each group is subtracted from the highest bad rate for all groups (here it is 30.9 percent for ‘23 or younger’), which is then multiplied by 100 (to get whole numbers, rather than decimals). The results (shown in row F) could be used as points in a statistical scorecard. In such a point scheme, the riskiest group will always receive 0 points and the lowest-risk group (i.e., the group with the lowest bad rate) will receive the most points.

For scorecards developed using regression (see Chapter 1.1), the transformation of regression coefficients to positive points involves a few additional steps. The calculations are not shown here, but the ranking results are very similar, as shown in row H.

| Row | 23 or Younger | 24 to 30 Years | 31 to 47 Years | 48 or Older | Total |
| A Goods | 46 | 238 | 374 | 142 | 800 |
| B Bads | 20 | 74 | 82 | 23 | 200 |
| C Bad Rate | 30.9% | 23.8% | 18.0% | 14.0% | 20.0% |
| D Column Total | 66 | 312 | 456 | 166 | 1,000 |
| E Percent of Total Loans | 6.6% | 31.2% | 45.6% | 16.6% | |
| F POINTS | 0 | 7 | 13 | 17 | |
| G Calculation [multiplied by 100] | (.309 - .309) = 0 | (.309 - .238) = 7 | (.309 - .18) = 13 | (.309 - .14) = 17 | |
| H LOGIT POINTS | 0 | 10 | 21 | 29 | |

Table 5: Cross-tabulation for Age

Factors that get the Most Points in Credit Scorecards
The larger the differences in bad rates across groups, the more points a risk factor receives in a scorecard.
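The cross-tabulation and ‘bad rate differences’ calculations described above can be sketched in a few lines of Python. The chapter itself points to ‘R’ or commercial statistical packages; this standalone sketch is purely illustrative, and the group labels are taken from Table 5.

```python
from collections import Counter

# Build a cross-tabulation like Table 4/5: each loan record carries a group
# label (e.g. an age band) and a good/bad outcome flag.
def cross_tab(loans):
    """Return {group: (goods, bads, bad_rate)} for (group, is_bad) records."""
    goods, bads = Counter(), Counter()
    for group, is_bad in loans:
        (bads if is_bad else goods)[group] += 1
    return {g: (goods[g], bads[g], bads[g] / (goods[g] + bads[g]))
            for g in set(goods) | set(bads)}

def bad_rate_points(bad_rates):
    """'Bad rate differences' method: subtract each group's bad rate from the
    highest bad rate, then multiply by 100 to get whole-number points."""
    worst = max(bad_rates.values())
    return {g: round((worst - rate) * 100) for g, rate in bad_rates.items()}

# The bad rates in row C of Table 5 reproduce the points in row F:
table5 = {"23 or younger": 0.309, "24 to 30": 0.238,
          "31 to 47": 0.18, "48 or older": 0.14}
print(bad_rate_points(table5))
# {'23 or younger': 0, '24 to 30': 7, '31 to 47': 13, '48 or older': 17}
```

The same function applied to the bureau-score and marital-status bad rates in Table 6 below yields the point spreads shown there, which is why factors with bigger bad-rate differences carry more weight in the final scorecard.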
Using the simple method of ‘bad rate differences’ (described above), we can see in Table 6 below that ‘bureau credit score’ takes a maximum of 39 points, while ‘marital status’ takes a maximum of only eight points. This is because there are much larger differences between the highest and lowest bad rates for credit history than there are for marital status.

Bureau Credit Score
| Group | < 590 Points | 590 - 670 Points | 671 - 720 Points | > 720 Points | Sample Bad Rate |
| Bad Rate | 39% | 23% | 13% | 0% | 20% |
| POINTS | 0 | 16 | 26 | 39 | |

Marital Status
| Group | Divorced | Unmarried | Married | Widowed | Sample Bad Rate |
| Bad Rate | 25% | 24% | 19% | 17% | 20% |
| POINTS | 0 | 1 | 6 | 8 | |

Table 6: Examples of Scorecard Factor Importance

Since risk-ranking across algorithms is often very similar, many professionals prefer to use simpler methods in practice. Leading credit scoring author David Hand has pointed out that: “Simple methods typically yield performance almost as good as more sophisticated methods, to the extent that the difference in performance may be swamped by other sources of uncertainty that generally are not considered.”30

The long-standing, widespread practice of using logistic regression for credit scoring speaks to the ease with which such models are presented as scorecards. These scorecards are well understood by management and can be used to proactively manage the risks and rewards of lending.

30 David Hand, ‘Classifier Technology and the Illusion of Progress’, Statistical Science, Vol. 21.1 (2006): 1-14

Expert Scorecards
When there are no historic data, but the provider has a good understanding of the borrower characteristics driving risk in the segment, an expert scorecard can do a reasonably good job of risk-ranking borrowers. An expert scorecard uses points to rank borrowers by risk, just as a statistical scorecard does.
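Scoring an applicant against an expert scorecard is mechanically identical to scoring against a statistical one: the point tables are simply set by judgment rather than estimated from data. A minimal sketch, in which the age points mirror the ‘expert’ age weighting discussed in this section and the second factor is entirely hypothetical:

```python
# Expert scorecard sketch: each factor has a judgment-assigned point table
# (riskiest group gets 0 points, lowest-risk group gets the maximum).
# The 'home_ownership' factor and its points are hypothetical examples.
EXPERT_POINTS = {
    "age": {"23 or younger": 0, "24 to 30": 7, "31 to 47": 15, "48 or older": 20},
    "home_ownership": {"renter": 0, "owner": 10},  # hypothetical second factor
}

def expert_score(applicant):
    """Sum the expert points for each of the applicant's factor values."""
    return sum(EXPERT_POINTS[f][v] for f, v in applicant.items()
               if f in EXPERT_POINTS)

print(expert_score({"age": "31 to 47", "home_ownership": "owner"}))  # 25
```

Once repayment data accumulate, the same structure can be refitted statistically, replacing the judgment-assigned points with data-derived ones.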
The main difference (and an important one) is that without past data, including data on delinquencies, there is no way for the FI to know with certainty if its understanding (or expectation) of risk relationships is correct.

For example, if we know age is a relevant risk driver for consumer loans and we have seen in practice that risk generally decreases with age, we could create age groups similar to those in Table 5. In this scenario, we assign points using a simple scheme where the group perceived as riskiest always gets zero points and the lowest-risk group always gets 20 points. In this case, an expert scorecard weighting of the ‘age’ variable might look like Table 7 below. These points are not so different from the statistical points for age shown in rows F and H of Table 5.

| | 23 or Younger | 24 to 30 Years | 31 to 47 Years | 48 or Older |
| POINTS | 0 | 7 | 15 | 20 |

Table 7: ‘Expert’ Points for ‘Age’

As long as the risk-ranking is correct for each individual risk factor in an expert scorecard, the score from an expert scorecard will rank borrowers similarly to how a statistical scorecard ranks them.31 This means expert scorecards can be a useful tool to launch a new product for which there are no historic data. They are also a good way for DFS providers that are intent on being data-driven to reap some benefits of scoring – including improved efficiency and consistency – while building a better database.

31 Using expert judgment alone, providers usually specify the risk-ranking relationship of one or more factors incorrectly. Once performance (loan repayment) data are collected, they can be used to correct any misspecified relationships, leading to improved risk-ranking in the resulting statistical model.

Choosing a Set of Risk Factors
The ‘best’ single-variable predictors are combined into a multivariate model.
While this can be done algorithmically to maximize prediction, an appealing approach for DFS providers is to choose a set of factors that together create a comprehensive risk profile for the borrower,32 along the lines of the popular five Cs of credit: capacity, capital, collateral, conditions, and character. Such a model is easy for bankers and bank management to understand, and is consistent with risk management frameworks such as the Basel Capital Accords.

As each individually strong predictor is added to a multi-factor model, its risk-ranking improves. However, after a relatively small number of good individual predictors (typically 10 to 20), the incremental improvement from each additional factor drops rather sharply. Even if we purposefully select factors that do not seem highly correlated with one another, in reality many of the factors will be correlated to some degree, leading to the diminishing returns of additional factors.

While the specific data fields available for credit scoring will vary greatly by product, segment and provider, scoring model data should generally be:
• Highly relevant
• Easy to collect consistently
• Objective, not self-reported

Some types of data tend to be good predictors of loan repayment across segments and markets. Table 8 presents some of these along with their commonly observed risk patterns.

| Type of Data | Factor | Risk Relationship |
| Behavioral | Purchases | Risk decreases as disposable income increases |
| Behavioral | Deposits and account turnover | Risk decreases as deposits and turnover increase |
| Behavioral | Credit history | Risk decreases as positive credit history increases |
| Behavioral | Bill payment | Risk decreases in line with timeliness of bill payments |
| Track Record | Time in residence, job, business | Stability reduces risk |
| Track Record | Time as client | Clients with longer relationships are lower risk |
| Demographics | Age | Risk decreases with age and increases again around retirement age (mainly due to health risks) |
| Demographics | Marital status | Married people are more often settled and stable, which lowers risk |
| Demographics | Number of dependents | An increasing number of dependents can increase risk (particularly for single people), but in some cultures it instead lowers risk (greater safety net) |
| Demographics | Home ownership | Home owners are less risky than renters |

Table 8: Data that are Often Effective for Credit Scoring

32 Siddiqi, ‘Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring’, John Wiley and Sons, Vol. 3 (2012)

When a FI has enough data, it should give preference to data points that:
• Are objective and can be observed directly, rather than being elicited from the applicant
• Evidence relationships to credit risk that confirm expert or intuitive judgment
• Cost less to collect
• Can be collected from most, if not all, applicants
• Do not discriminate based on factors the borrower cannot control (i.e., age, gender, race) or that are potentially divisive (i.e., religion, ethnicity, language)

This section looks at how data are being used to overcome some of the challenges that have long been barriers to financial inclusion. In particular, it is the digital data generated by mobile phones, mobile money and the internet that are helping put millions who have never had bank accounts or bank loans on the radar of formal FIs.

The case studies that follow investigate how MNO, social media and traditional banking data have been used to launch new products, to help more borrowers become eligible for formal loans and to evaluate small businesses, which are less homogeneous than individual consumers.

Credit Challenge 1: Verifying Income and Expenses
A significant retail lending challenge in developing markets is obtaining trustworthy data on new customers’ cash flow, for people and businesses alike. Cash flow, or income left after expenses, is the primary source of loan repayment and therefore a focus of retail lending models. Income levels are also used to determine how much financing an individual can afford.

“When you know how much money a person or company is dealing with on a daily, weekly and monthly basis, you can better estimate what loan size they will be able to afford.”

The growth in mobile telephony and mobile money usage – particularly in Africa and Asia – has created verifiable third-party digital records of actual payment patterns, such as top-ups and mobile money payments. These data, held by MNOs, provide a sketch of a SIM-user’s cash flows. POS terminals and mobile money tills can also paint a somewhat more complete picture of cash flows for merchants.

The following two cases look at how digital data have helped open huge markets for consumer nano-loans.

Use Case: Nano-Loans
Since banks must report nano-loan repayments to bureaus and central banks, nano-lending has brought millions of people who previously lacked access to banks into the formal financial sector across the world, establishing credit history that is a stepping stone to unlocking access to other types of loan products. However, some are concerned that nano-loans create a cycle of debt for low-income individuals.
Several million people with bad nano-lending experiences could become blacklisted at local credit bureaus, which further underscores the need for consumer protection.

CASE 10: M-Shwari Launches a Market for Nano-Loans
Data Solutions to Assess the Creditworthiness of Borrowers with no Formal Credit History

Commercial Bank of Africa (CBA) and mobile operator Safaricom were early to recognize the power of mobile phone and mobile money data.

M-Shwari, the first highly successful digital savings and loan product, is well known to followers of ‘fintech’ and financial inclusion. It has given small credit limits over mobile phones, called nano-loans, to millions of borrowers, bringing them into the formal financial sector. Similar products have since been launched in other parts of Africa, and new competition has crowded the market in Kenya. M-Shwari’s story is also an excellent study in using data creatively to bring a new product to market.

Modeling the Unknown
Credit scoring technology looks at past borrower characteristics and repayment behavior to predict future loan repayment. What about the case where there is no past repayment behavior? MNOs have extensive data on their clients’ mobile phone and, in many cases, mobile money usage, but it is less clear how that data can be used to predict the ability and willingness to repay a loan without data on the payment of past obligations.

By definition, there is no product-specific past data for a new product. One way to still use credit scoring with a new product is to use expert judgment and domain knowledge to build an ‘expert scorecard’, a tool that guides lending decisions based on borrower risk-rankings. See call-out box on page 84.

Another way to use credit scoring with a new product is to study a set of relevant client data, such as MNO data, in relation to loan repayment information, such as:
• General Credit History or a Bureau Report: This only works for clients with a file in the bureau.
• Similar Credit Products: Another credit product similar enough to be relevant to the new product can be used as a gauge. While past repayment of that product may or may not be representative of future repayment of the new product, it may be an acceptable approximation, or ‘proxy’, for initial modeling purposes.

The first M-Shwari scorecard was developed using Safaricom data and the repayment history of clients that had used its Okoa Jahazi airtime credit product.33 The two products were clearly different, as shown in Table 9 below.

| Product | Okoa Jahazi | M-Shwari |
| Amount | The lower of airtime spend over the last 7 days or 100 Kenyan shillings | 100 to 10,000 Kenyan shillings |
| Purpose | Used for airtime only | Used for any purpose |
| Repayment Term | 72 hours | 30 days |

Table 9: Okoa Jahazi and M-Shwari Product Comparison

The M-Shwari product offered borrowers more money, flexibility of use and time to repay. The assumption was that those who had successfully used the very small Okoa Jahazi loans would be better risks for the larger loan product.

The first M-Shwari credit scoring model developed with the Okoa Jahazi data,34 together with conservative limit policies and well-designed business processes, enabled the launch of the product, which quickly became massively successful.

CBA expected the scorecard based on Okoa Jahazi data to be redeveloped as soon as possible using the repayment behavior of the M-Shwari product itself. Some behaviors predictive of airtime credit usage did not translate directly to M-Shwari usage, and appropriate changes to the model based on actual M-Shwari product usage data reduced non-performing loans by 2 percent. M-Shwari continues to update its scorecard periodically, based on new information.

M-Shwari’s successful launch and development illustrates that there are ways to use data-driven scoring solutions for completely new segments. It also reinforces a general truth about credit scoring: a scorecard is always a work in progress. No matter how well a scorecard performs on development data, it should be monitored and managed using standard reports and be fine-tuned whenever there are material changes in market risks or in the types of customers applying for the product.

33 Cook and McKay, ‘How M-Shwari Works: The Story So Far’, Consultative Group to Assist the Poor and Financial Sector Deepening Kenya
34 Mathias, ‘What You Might Not Know’, Abacus, September 18, 2012, accessed April 3, 2017, https://abacus.co.ke/okoa-jahazi-what-you-might-not-know/

The M-Shwari nano-loan product succeeded thanks to the timely confluence of:
• Access to MNO Data: CBA had a first-mover advantage due to its strong partnership with Safaricom. Today, Safaricom sells its MNO data to all banks in Kenya.
• A Well-designed Product: Small, short-term products are better fits for credit scoring, particularly for new products. Rapid feedback on the target population’s repayment performance enables timely model redevelopment and controls risk.
• Good Systems and People: The M-Shwari management team is lean and flexible, bringing together a unique combination of management and technical skills as well as the systems to ensure smooth implementation.
• Leveraging Outside Resources: Financial Sector Deepening (FSD) Kenya supported CBA with risk modeling expertise crucial to developing the first scoring model and transferring skills to M-Shwari’s team.
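The case above stresses that a scorecard is never finished: it should be monitored with standard reports and fine-tuned as the market or customer mix changes. One common monitoring report compares the bad rate observed on recent loans against the bad rate expected at development, band by band. The band labels, rates and 5-point tolerance below are illustrative assumptions, not figures from the case.

```python
# Minimal scorecard-monitoring sketch: flag score bands where the observed
# bad rate drifts from the development expectation by more than `tolerance`
# (absolute difference). All numbers here are invented for illustration.
def drift_report(expected, observed, tolerance=0.05):
    """Return {band: True if drift exceeds tolerance} for each score band."""
    return {band: abs(observed[band] - rate) > tolerance
            for band, rate in expected.items()}

expected = {"low score": 0.30, "mid score": 0.20, "high score": 0.10}
observed = {"low score": 0.28, "mid score": 0.31, "high score": 0.09}
print(drift_report(expected, observed))
# {'low score': False, 'mid score': True, 'high score': False}
```

A flagged band (here the mid-score band) is a prompt to investigate and, if the shift persists, to redevelop the model, much as CBA did when Okoa Jahazi behaviors failed to translate to M-Shwari usage.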
While M-Shwari’s success story is inspiring, there are many DFS providers that would like to get into the nano-lending space but may find it difficult. These DFS providers may not have relationships with MNOs or may lack the in-house ability to design digital savings and loan products and scoring models. The next case describes how vendors are facilitating the entry of DFS providers into mass-market nano-lending.

CASE 11: Tiaxa’s Turn-key Nano-lending Approach
Developing Data Products and Services Through Outsourced Subscription Services

Recognizing that many FIs in developing markets lack the resources to approach the DFS market using only internal resources, Tiaxa is offering its patented NanoCredits™ within a ‘turn-key’ solution that includes:
• Product design
• Customer acquisition (based on proprietary scoring models)
• Portfolio credit risk management
• Hardware and software deployment
• Around-the-clock managed service
• Funding facility for the portfolio (in some African markets)

Tiaxa brings together FIs and MNOs and forms three-way partnerships whereby:
• MNOs provide the data that drive the credit decision models
• FIs provide the necessary lending licenses (and formal financial sector regulation) and funding
• Tiaxa provides the end-to-end nano-loan product solution

In addition to providing the nano-loan product design and scoring models based on MNO data, in most cases Tiaxa assumes and manages portfolio credit risk. Loss risk is managed by directly debiting borrower MNO accounts to work out delinquencies, which are disclosed to borrowers in the product terms and conditions. Their long-term partnership business model works on terms that vary from profit-sharing to fee-per-transaction models.

Data Driving Tiaxa’s Scoring Models
While MNO datasets vary across countries and markets, the datasets that inform Tiaxa’s proprietary models typically include some combination of the following types of data:

GSM Usage: Top-up frequency, amounts; GSM consumption information
Payroll, Utility Payments: Payroll, subsidies; Cash flow, credit needs
Regular Money Transfers: Frequency and value; Receiving or sending?; Register date
KYC Information: Full name; Account type; KYC status; Date of birth (DOB), region
Cash In Payments: Cash flow indicator; Financial sophistication information

Table 10: Types of Data Informing Tiaxa’s Proprietary Models

Tiaxa uses a range of machine learning methods to reduce hundreds of potential predictors into an optimal model. Custom models are designed for each engagement. Tiaxa now has more than 60 installations, with 28 clients, in 20 countries, in 11 MNO groups, who have over 1.5 billion end users among them. Currently, the company processes more than 12 million nano-loans per day worldwide, mostly in airtime lending.

As the data analytics landscape evolves, third-party vendors are expected to develop turn-key solutions that plug into internal data sources and deliver value to existing products. Firms that are unable to invest in tailored data analytics, or that prefer a ‘wait-and-see’ approach, may be able to take advantage of subscription services in the future by pushing data to external vendors.
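Reducing hundreds of potential predictors to a compact model, as described above, is typically done with some form of stepwise selection. A hedged sketch of the general idea follows; the scoring rule (predictive strength minus a penalty for correlation with factors already chosen) is a simplified stand-in for whatever proprietary method a vendor actually uses, and every factor name and number below is invented for illustration.

```python
# Greedy forward selection sketch: repeatedly add the factor with the best
# trade-off between standalone strength and overlap with factors already
# chosen. 'strength' and 'correlation' would be measured from data in
# practice; here they are hypothetical.
def forward_select(strength, correlation, k, penalty=0.5):
    """Pick k factors, maximizing strength minus penalty * max correlation
    with the factors selected so far."""
    chosen = []
    while len(chosen) < k:
        def gain(f):
            overlap = max((correlation[frozenset((f, c))] for c in chosen),
                          default=0.0)
            return strength[f] - penalty * overlap
        best = max((f for f in strength if f not in chosen), key=gain)
        chosen.append(best)
    return chosen

strength = {"top_up_frequency": 0.9, "top_up_amount": 0.8, "kyc_age": 0.5}
correlation = {frozenset(("top_up_frequency", "top_up_amount")): 0.9,
               frozenset(("top_up_frequency", "kyc_age")): 0.1,
               frozenset(("top_up_amount", "kyc_age")): 0.1}
print(forward_select(strength, correlation, 2))
# ['top_up_frequency', 'kyc_age']
```

Note how the weaker but uncorrelated factor wins the second slot: this is the diminishing-returns effect of correlated predictors discussed earlier in the chapter.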
For FIs, the choice between working with vendors or working directly with MNOs to reach the nano-loan segment can only be made by considering market conditions and available resources. Some of the pros and cons of each approach are presented below.

Working with MNO Data
- Opportunities: Full control of products; Potentially more profitable
- Challenges: Need in-house skills in product development and risk modeling; Need systems and software to manage DFS products

Working with Vendor
- Opportunities: Provides product, modeling and systems know-how; Makes lending decisions; Ready software solutions
- Challenges: Dependence on vendor; Model details may not be shared; Technical skills not transferred

Table 11: Working with MNOs or Vendors: Opportunities and Challenges

Use Case: Alternative Data
Alternative data sources are showing promise for identity verification and basic risk assessment. Another way DFS providers can collect data from new applicants is to ask them directly to provide information. These requests can take the form of:
• Application Forms
• Surveys
• ‘Permissions’ to Access Device Data: This can include permissions to access media content, call logs, contacts, personal communications, location information, or online social media profiles

These non-traditional online data sources can be, and are being, used to offer identity verification services and credit scores. The story of social network analytics firm Lenddo provides more background and some insight into how social media data can add value in the credit process.

CASE 12: Lenddo Mines Social Media Data for Identity Verification and Risk Profiling
Using Advanced Analytic Techniques and Alternative Data Sources for New Products

Lenddo co-founders Jeffrey Stewart and Richard Eldridge initially conceived the idea while working in the business process outsourcing industry in the Philippines in 2010. They were surprised by the number of their employees regularly asking them for salary advances and wondered why these bright, young people with stable employment could not get loans from formal FIs.

The particular challenge in the Philippines was that the country had neither credit bureaus nor national identification numbers. If people did not use bank accounts or services – and less than 10 percent did – they were ‘invisible’ to formal FIs and unable to get credit. In developing their idea, Lenddo’s founders were early to recognize that their employees were active users of technology and present on social networks. These platforms generate large amounts of data, the statistical analysis of which they expected might help predict an individual’s creditworthiness.

Lenddo loan applicants give permission to access data stored on their mobile phones. The applicant’s raw data are accessed, extracted and scored, but then destroyed (rather than stored) by Lenddo. For a typical applicant, their phone holds thousands of data points that speak to personal behavior:
• Three Degrees of Social Connections
• Activity (photos and videos posted)
• Group Memberships
• Interests and Communications
(messages, emails and tweets)

More than 50 elements across all social media profiles provide 12,000 data points per average user:

Across All Five Social Networks:
• 250+ first-degree connections
• 800+ second-degree connections
• 2,700+ third-degree connections
• 372 photos, 18 videos, 13 groups, 27 interests, 88 links, 18 tweets

7,900+ Total Message Communications:
• 5,200+ Facebook messages, 1,100+ Facebook likes
• 400+ Facebook status updates, 600+ Facebook comments
• 250+ emails

Table 12: Social Media Data Point Averages Per Average User

Data Usage
Confirming a borrower’s identity is an important component of extending credit to applicants with no past credit history. Lenddo’s tablet-format app asks loan applicants to complete a short digital form asking their name, DOB, primary contact number, primary email address, school and employer. Applicants are then asked to onboard Lenddo by signing in and granting permissions to Facebook. Lenddo’s models use this information to verify customer identity in under 15 seconds. Identity verification can significantly reduce fraud risk, which is much higher for digital loan products, where there is no personal contact during the underwriting process. An example from Lenddo’s work with the largest MNO in the Philippines is presented below.

Lenddo worked with a large MNO to increase the share of postpaid plans it could offer its 40 million prepaid subscribers (90 percent of total subscribers). Postpaid plan eligibility depended on successful identity verification, and the telco’s existing verification process required customers to visit stores and present their identification document (ID) cards, which were then scanned and sent to a central office for verification. The average time to complete the verification process was 11 days. Lenddo’s SNA platform was used to provide real-time identity verification in seconds based on name, DOB and employer. This improved the customer experience, reduced potential fraud and errors caused by human intervention, and reduced the total cost of the verification process.

In addition to its identity verification models, Lenddo uses a range of machine learning techniques to map social networks and cluster applicants in terms of behavior (usage) patterns. The end result is a LenddoScore™ that can be used immediately by FIs to pre-screen applicants or to feed into and complement a FI’s own credit scorecards. These algorithms turn an initially large number of raw data points per client into a manageable number of borrower characteristics and behaviors with known relationships to loan repayment.

Use Case: Credit Scoring for Small Business
The examples discussed so far have focused on digital products aimed at mass-market consumers and merchants.
The stream of behavioral data created in digital channels has understandably generated the most excitement about data analytics opportunities. However, most FIs also have ample opportunity to make better use of data in credit analysis and risk management of traditional and offline products, including but not limited to:
• Consumer Loans
• Credit Cards
• Micro, Small and Medium Enterprise (MSME) Loans and Leases
• Small Agriculture Loans and Leases
• Value-chain and Supply Chain Finance

For these products, FIs have traditionally collected a wealth of data, but not necessarily digitized or systemized its capture, analysis and storage. In the best cases, LOS software facilitates digital capture of traditional data in a way conducive to data analysis, including credit scorecard development. As value chain and supply chain payments become digitized, there is an opportunity to leverage these data to project cash flows and build credit scores.

Credit Scoring Methodologies
FIs have several options for using the data they already collect for credit risk modeling. The two most common solutions are to develop proprietary credit scorecards, either through internal expertise or by working with outside consultants, or to outsource credit scoring to a third-party vendor.

Develop Proprietary Credit Scorecards
Banks in leading financial markets (for example, South Africa, North America, Continental Europe, and Singapore) employ large teams that develop and maintain models, including separate models for application decision support, ongoing portfolio management (behavioral) and provisioning. As a first step to developing models in-house, FIs may opt to use external consultants to do initial developments and to build capacity with internal staff to take it forward.

Many DFS providers have data, data analysts, and in-house IT specialists capable of managing their own scoring systems. What those teams tend to lack is experience in credit scorecard development. Good data analytics projects require expert knowledge to succeed. Outsourced assistance can help transfer knowledge and build in-house expertise as part of project support. When working with external consultants, DFS providers must ensure that the necessary tools and skills are transferred to the internal teams so that the scorecards can be managed and monitored going forward.

Outsource Credit Scoring to a Vendor
Most vendors offer custom model development using bureau data (where available), the bank’s own data, as well as third-party data such as CDRs. Vendors normally also provide scorecard deployment software and maintain the models for the FI. Working with credit scoring vendors outsources the scoring expertise and software platforms, often bringing new data that would otherwise be unattainable. It also brings international experience and immediate credibility to the scoring solution.

Following is an example of First Access’ work with a bank in East Africa in the small business lending segment, a segment for which MNO data alone is not enough to comprehensively assess the applicant’s credit risk.

A Closer Look at Proprietary Scorecards
A recent IFC project with a bank in Asia exemplifies how the process can work:
1. The bank shared its past portfolio data with the consultant.
2. The consultant prepared the data for analysis using the open-source ‘R’ statistical software.
3. The bank convened a credit scoring working group to work with the consultant. In a workshop setting, the consultant and working group analyzed and selected risk factors for consumer and micro-business lending scorecards.
4. The bank recruited a new analyst to take primary responsibility for the scorecards (the analyst also participated in the ‘R’ workshops).
5. The credit scoring working group and consultant reviewed the resulting models’ strengths and weaknesses to align usage strategies with the bank’s business targets and risk appetite.
6. With initial guidance from the consultant, the bank and its local software provider developed a software platform to deploy the scorecards.
7. The consultant provided remote support in scorecard monitoring and management.

The pros and cons of such arrangements include:
Pros:
• Bank learns the necessary skills to take ownership of the models
• Bank has complete control over its scorecards
• The scorecards are fully transparent

Cons:
• Requires active engagement of senior and junior managers
• Requires staff training or the onboarding of data analytics and risk modeling specialists
• Requires additional deployment software, such as an LOS with scoring functionality
• In-house development brings long-term maintenance requirements

Table 13: The Pros and Cons of Proprietary Scorecards

CASE 13: First Access: Credit Scoring with a Full-service Vendor
Outsourcing Data Expertise and Working with External Partners

Many FIs are interested in using credit scoring to increase the consistency and efficiency of credit assessment for small loans. However, fewer FIs in developing markets have the in-house skills to develop and deploy scorecards efficiently without some outside help.

As mentioned above, working with external credit scoring vendors outsources the scoring expertise and software platforms, and also often brings international experience and immediate credibility to the scoring solution.

First Access is one of many credit scoring vendors, but one of the relatively few that focuses on the particular challenges facing frontier markets. Founded in July 2012, the company initially worked extensively with Vodacom Tanzania, leveraging its MNO data to develop an auto-decision tool for DFS providers that serves low-income customers with no formal credit history. Since then, it has expanded its presence to the DRC, Malawi, Nigeria, Uganda, and Zambia, working more extensively on scoring solutions for the micro and small business segment.

First Access worked with a bank in East Africa to develop a scorecard for its small business (micro) lending, focused on loans of up to $3,000. The bank took an average of six days to assess loan applications, and in addition to lengthy wait times, its NPLs had been increasing. Like many banks in emerging markets, it had no tools for screening or scoring clients, and thus used one process for all applicants coming in the door.

First Access studied the bank’s historic portfolio data for the segment and built a scoring algorithm using only the information available at the time of each loan application – without including additional data normally gathered in time-consuming visits to the site of the applicant’s business, a common feature of a microloan underwriting process. At the wish of the bank, the model ranked applicants into five risk segments.

A ‘blind test’ of all matured microloans disbursed over the previous six months indicated that the scores ranked borrowers by risk, as shown by the bad rates in Table 14 below.

| Risk Segment | A | B | C | D | E |
| PAR (Portfolio at Risk) | 1.00% | 3.53% | 9.97% | 22.42% | 26.78% |

Table 14: Microloan Borrower Rankings by Risk

Using the scoring algorithm, each applicant could be immediately scored and assigned to one of the risk segments. The bank adjusted its credit assessment process to offer same-day approval for its repeat customers in segments A and B, which made up 22 percent of loan applicants. The time of approval for this client group was reduced from an average of six days to one day, which improved customer experience and the efficiency and satisfaction of the bank’s staff.

Since the algorithm’s results in practice have validated the original blind test, the bank is expanding the use of the algorithm to conduct more same-day loan approvals and rejections for repeat and new customers. Fast-tracking groups A and B has increased the institution’s efficiency in underwriting micro loans by 18 percent, and both groups have outperformed their blind test results, with a combined PAR1 of 1.26 percent instead of the expected 3 percent.

The First Access software platform enables FIs to configure and manage their own custom scoring algorithms and use their own data on their customer base and loan products. First Access is currently developing new tools for its platform to give FIs more control and transparency to manage their decision rules, scoring calculation and risk thresholds, with ongoing monitoring of the algorithm’s performance. Such performance analytics dashboards can help FIs better manage risk in response to changes in the market.

Pros:
• Access to world-class modeling skills and international experience
• Provide deployment software
• Potentially shorten time needed to develop and implement scorecard
• Manage and monitor the scorecard and software

Cons:
• Bank does not own the model and usually does not know the scoring calculation
• Ongoing costs of model usage and intermittent model development

Table 15: Pros and Cons of Outsourcing Credit Scoring to a Vendor

An outsourced approach to developing data products provides fast solutions and skilled know-how, but may also bring longer-term maintenance risks, intellectual property (IP) issues and a requirement that project designs are scoped in detail up front to ensure useful deliverables.

Accessibility and Privacy
There are two core challenges to using new forms of digital data: accessibility and privacy. To benefit from new sources of digital data, FSPs must gain access to these data in a format that can be analyzed. Privacy concerns have limited the availability of some data, and there is no guarantee that, for example, social media data will remain an accessible data source for credit models in the future. Facebook has already taken
Two steps to limit the amount of data third- of the main ways to access such data are to party services can pull from user profiles,35 either purchase the data or to collaborate and the data it makes accessible through with the vendor. Some MNOs, such as its API can legally only be used for identity Kenya’s Safaricom, sell pre-processed verification. In the United States, the aggregate data fields – such as monthly FTC, which monitors rules on credit and average spend or call usage – directly to FSPs. Some vendors also process large consumer data, has indicated that social raw data sets drawn from MNOs, social networks risk being subject to regulation media and device data, and turn these into as consumer reporting agencies if their usable, sellable customer profiles. Privacy data are used as loan criteria.36 35 Seetharaman and Dwoskin, ‘Facebook’s Restrictions on User Data Cast a Long Shadow’, Wall Street Journal, September 21 2015 36 ‘Facebook Settles FTC Charges That It Deceived Consumers By Failing To Keep Privacy Promises’, Federal Trade Commission News Site, November 29, 2011, accessed April 3, 2017, https://www.ftc.gov/news-events/press-releases/2011/11/facebook-settles-ftc-charges-it-deceived-consumers-failing-keep/ DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES 99 ics yt app Da al o d s li c th & m an ta ions PART 2 at a e Dat Data Project Frameworks Ma a p s da ce na gi t ro ng a ur jec t Re so Chapter 2.1: Managing a Data Project The Data Ring Managing any project is complex and requires the right ingredients; business intuition, experience, technical skills, teamwork, and capacity to handle unforeseen events will determine success. There is no recipe for success. With that said, there are ways to mitigate risks and maximize results by leveraging organizational frameworks for planning and by applying good, established practices. This also holds true for a data project. 
This section introduces the core components necessary to plan a well-managed data project using a visual framework called the Data Ring. The framework's organizational components draw from industry best practices, recognizing general resource requirements and process steps that are common across most data projects. It shares commonalities with the Cross Industry Standard Process for Data Mining (CRISP-DM), a data analytics process approach that rose to prominence after its release in 1996 and was widely used in the early 2000s.37 Its emphasis on data mining and the computational tools prevalent two decades ago has seen the method's use diminish considerably with the rise of big data and contemporary data science techniques. CRISP-DM's original website went offline around 2014, leaving an absence of a specific industry standard for today's data projects. The Data Ring framework leverages concepts from established industry methods, with a modernized approach for today's technologies and the needs of data science teams.

37 Cross Industry Standard Process for Data Mining. In Wikipedia, The Free Encyclopedia, accessed April 3, 2017, https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining/

The Data Ring was developed by Christian Racca and Leonardo Camiciotti38 as a planning tool to help recognize core project elements and think through data project resource requirements and their relationships in a structured way. In collaboration with the original authors and Soren Heitmann, the Data Ring and the associated tool, the Data Ring Canvas, were further adapted for this handbook. The key idea is to provide a tool that supports project managers through the complete process. Below is a list of ways the tool should be used:

• Checklist: A checklist or 'shopping list', through which one analyzes the presence (and the related gaps) of the necessary ingredients to undertake a data-driven process
• Descriptive Tool: The Data Ring is a powerful framework to explain the data-driven process (be it in an internal report, a public presentation or a scientific publication)
• Continuous Feedback Mirror: Starting from the definition of the objectives and ending at the results, each iteration cycle provides feedback to refine the process and reassess design
• Focus Tool: To keep the project's focus on the goals while monitoring clear targets

38 The Data Ring is adapted for this Handbook from Camiciotti and Racca, 'Creare Valore con i BIG DATA'. Edizioni LSWR (2015): http://dataring.eu/

The Data Ring approach is designed around risk mitigation and continuous improvement; it is designed to prevent faulty starts, to ensure goal-driven focus and to avoid worst-case scenarios. It may be used as a continuous guide to define and refine goals. This helps keep the execution phase under control and delivers results the best way possible. The thought process is circular, asking managers to re-examine core planning questions with each iteration, refining, tuning and delivering. When problems arise, the idea is to prompt managers to go full circle, considering each ring quadrant as a potential solution source.

The Data Ring diagram is quite complex, as it depicts the core set of considerations necessary to plan a full project. Project managers may consider printing the diagram as a singular visual reference for designing a data project. In the following sections, each of these detailed structures will be broken down step-by-step and discussed. The section concludes with a use case walk-through to exemplify how the Data Ring may additionally be used as a planning tool.

Structures and Design

Five Structural Blocks

The Data Ring illustrates the goal in the center, encircled by four quadrants. It has five structural blocks: Goal, Tools, Skills, Process, and Value. The four quadrants sub-divide into 10 components: Data, Infrastructure, Computer Science, Data Science, Business, Planning, Execution, Interpretation, Tuning, and Implementation. A project plan should aim to encapsulate these components and to deeply understand their interconnected relationships. The Ring's organizational approach helps project managers define resources and articulate these relationships; each component is provided with a set of guiding framework questions, which are visually aligned perpendicular to the component. These guiding framework questions serve as a graphical resource planning checklist.

Tools and Skills

The upper blocks of the Ring are focused on assessing the 'hard' and 'soft' resources required to implement a data project:

• Hard Resources: Including the data themselves, software tools, processing, and storage hardware
• Soft Resources: Including skills, domain expertise and human resources for execution

Process and Value

The lower blocks of the Ring are focused on implementation and delivery. These consist of three concrete activities:

1. Planning the project execution
2. Generating and handling the data – the execution phase
3. Interpreting and tuning the results to implement the project goal and extract value

Goal: Central Block

Setting clear objectives is the foundation of every project. For a data-driven solution to a problem, without quantitative and measurable goals, the entire data analysis process is at high risk of failure. This translates into little knowledge value added and can cause misleading interpretations.

Circular Design

A central element of the Data Ring is its circular design. This emphasizes the idea of continuous improvement and iterative optimization. These concepts are especially critical for data projects, forming established elements of good-practice project design and planning. This is because the result of any data project is, simply put, more data. Take a credit scoring model, for example. Numeric data are inputted: age, income, and default rate history, for example. The outputs are credit scores, or more numeric data. The process is data in, data out.

In fact, this principle of data in, data out is continuously applicable throughout the data project. It can be applied to every intermediate analytic exploration and hypothesis test, beyond mere descriptions of starting and ending conditions. The Data Ring's circular process similarly illustrates an iterative approach that aims at refining, through cycles, the understanding of phenomena through the lens of data analysis. This allows a description of causes (data in) and effects (data out), and the identification of non-obvious emergent behaviors and patterns. The Data Ring's five core organizational blocks are designed to plan and achieve balance between specificity and flexibility throughout the data project's lifecycle.

Practically speaking, project planning should consider each ring block in sequence, iterating toward the overall plan. The circular approach aims at laying out what steps are needed to achieve a minimum viable process. That is, where data can be put into the system, analyzed and satisfactory results obtained – and then repeated without breaking the system; for example, with a refreshed dataset a few months later that includes new customers. Once established, the project can then iterate to the next level to deliver a minimum viable product (MVP). This is the most basic data product.

A data product is a model, algorithm or procedure that takes data and reliably feeds the results back into the environment through an automated process. In other words, its output results are integrated into a broader operational context without manual computation. This is what sets a data product apart from a singular analysis. A data product might be simple – like an interactive dashboard visualization – but there are also highly complex data products, where credit scores feed into semi-automated loan decision-making processes, influencing new client generation with data fed back into the credit scoring model to guide new lending decisions. The fact that data products are consumers of their own results affirms their circular principle. The stock of data grows with each iteration. This also emphasizes the Data Ring's organizational focus with the goal positioned at the center, guiding which data to analyze and whether or not the time has come to stop iterating and judge the goal achieved.

Figure 19: The Data Ring, a Visual Planning Tool for Data Projects. [The diagram places GOAL(S) at the center of a ring of four quadrants – 1 TOOLS, 2 SKILLS, 3 PROCESS and 4 VALUE – with each component's guiding framework questions arranged around the ring.]
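To make the circular, data-in-data-out principle concrete, the feedback loop of a simple scoring data product can be sketched in a few lines of code. The sketch below is purely illustrative – the field names, the cutoff rule and the outcome function are hypothetical stand-ins, not a reference implementation of any model discussed in this handbook:

```python
# Illustrative sketch of a data product's feedback loop: a decision rule
# consumes data, its decisions generate new outcome records, and those
# records grow the data stock used to refit the rule on the next cycle.
# Field names ("score", "repaid") and the rule itself are hypothetical.

def fit_cutoff(history):
    """Refit the decision rule: approve scores at or above the mean
    score of past borrowers who repaid (deliberately simplistic)."""
    repaid_scores = [r["score"] for r in history if r["repaid"]]
    return sum(repaid_scores) / len(repaid_scores) if repaid_scores else 0.5

def run_cycle(history, applications, observe_outcome):
    """One iteration: data in (new applications), data out (new outcome
    records appended to history). The stock of data grows each cycle."""
    cutoff = fit_cutoff(history)          # model refit on the current data stock
    for app in applications:
        if app["score"] >= cutoff:        # automated decision, no manual step
            history.append({"score": app["score"],
                            "repaid": observe_outcome(app)})
    return history, cutoff
```

Each call to run_cycle consumes the records produced by earlier cycles, which is the sense in which a data product is a consumer of its own results: the stock of data grows with every iteration.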
It helps to break down larger hypothesis testing, by emulating the problems into more discrete issues, for a scientific method (See Chapter 1.1, The clear goal to resolve a clear problem. Scientific Method). Start Small. For new data projects, Strategic Problem Statement Reaching the goal signals project The idea of, ‘pitch the problem before a Minimum Viable Product completion. With an iterative approach, the solution’ helps drive this focus and (MVP) is the recommended goal. it is especially important to know how a helps communicate to stakeholders what This is a basic and modest goal, completed project looks in order to avoid the pain is and who has this problem. created to test if a data-driven getting stuck in the refinement loop. Once the problem is discussed, explaining Setting satisfactory metrics and definitions the solution becomes simple. Below are product concept has merit. Once helps guide the project’s path and will warn two DFS strategic problem examples: achieved, project managers may of risks if the project starts to go astray. consider the same Data Ring As with operational management, the • Sample Problem: Existing customers project should both monitor and assess have low mobile money activity rates concepts to scale up the MVP to its KPIs throughout the iterative process, • Sample Problem: Potential customers a prototype. ensuring these reference points continue are excluded from accessing microcredit to serve the project the best way possible. products Goal Setting Goal Statement GOAL(S) The goal is a proposed data-driven solution In the context of a data project, the goal Goal setting is the first step of project to a strategic problem in order to produce is to deliver a data-driven process and planning. The project needs to know value. The operational needs of the project product of some specification. This sets the where it is going in order to know are reflected by the structural blocks project’s path. 
It is also important to know when it has arrived. To some extent, a and guiding questions of the Data Ring. if the path is a good one; in other words, fate-based approach to data analysis, This translates into clear resource needs, if the product is based on a reasonable especially when dealing with complex human skills and concrete processes, which hypothesis about why it works and why structures, processes and organizations, are all oriented by the problem statements results are reliable. A goal statement might lead to unexpected discoveries and that the project seeks to solve. It is likely has two parts: product specification unplanned trajectories. Discovery is indeed the goal statement and problem statement and its strategic hypothesis. Here are an important factor for data projects, will be defined vis-à-vis the other: consider two proposed solutions to the previous permitting exploration and allowing if the intended goal will deliver the sought problem statements: 104 DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES • Proposed Solution: A minimum viable Framing the goal in terms of scale helps this goal-driven hypothesis gives the customer segmentation prediction to define both resource requirements and data product credibility and reliability. model to identify high-propensity active how overarching project components need A similar hypothesis might be constructed users to increase activity rates to fit together. A MVP proof of concept for a credit scoring model to test the • Proposed Solution: A production-level might be delivered on a single laptop in a hypothesis, for example: customers with customer credit scoring algorithm for few weeks. In comparison, production- small social networks have higher loan automated microloan issuance level scale might require special data default rates. 
Hypothesis setting is by no servers, experts to maintain them and means limited to algorithm-based data Process and Product Specification legal oversight to ensure data security. projects. A visualization dashboard also As detailed above, the two data products Nevertheless, producing a MVP requires has a hypothesis, with respect to the exemplified are a customer segmentation hard and soft resources (i.e., infrastructure relationships between the data that aim prediction model and a customer credit and people), organized according to a to be visualized. Such a hypothesis may scoring algorithm. These are specified by minimum viable process. This means not be statistically tested by algorithms, their scale, which helps describe how ‘big’ defining clear organizational roles, but the reliability of the visualization is the project is, or how it integrates into management and reporting relationships. predicated on these relationships being broader systems. This is how a data-driven solution to consistent and valid over time. Because a strategic problem is operationalized, of this, the visualization will continue to Scale may be considered along the how technical challenges are identified tell a meaningful story or guide useful following progression: and solved, and how to ensure that the decision-making. • Process: input data that reliably yield concrete product delivers strategic value. results data through an automated The principle of ‘reproducible research’ has process Hypothesis become prominent among data scientists. What these data products do is driven by an Reproducible research describes transparent, • MVP: a product concept and process underlying hypothesis, which is only implicit repeatable approaches to analysis and whose results evidence essential value in these two examples. Identifying high- how results are obtained in the first scale • Prototype: a product concept with propensity active users has an operational step of ‘process’. 
In principle, this is to basic implementation, usability and hypothesis; there is a correlation between enable independent results validation, reliability the variables that define these customer which may be relevant for regulatory or • Product: a proved concept with reliable segments and activity rates. For example, audit purposes. This is why the first step implementation and demonstrated customers with high voice talk time have in iteration when using the Data Ring is value proposition higher activity rates. This is a statistically to articulate a minimum viable process; • Production: a product systematically testable hypothesis and ultimately the onus it sets the project to achieve reliable results implemented and delivered to users or of the data science team to demonstrate. upon which the product’s essential value is customers If the correlation is strong and reliable, based. This process equally supports data DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES 105 2.1_MANAGING A DATA PROJECT products to immediately see if and when Mitigation: Know what the project aims solution has a logical inconsistency, such hypotheses become unreliable, which to accomplish. If the team wants to do as a weak business or strategic relationship may prompt re-fitting models to ensure something but is unsure where to start, with the problem it is intended to resolve. ongoing reliability. they should engage a data operations specialists to review the data and help Mitigation: Set clear, precise goals with Goal Risks and Mitigations business relevance incorporated into shed light on what types of relevant Setting project goals in terms of insights they could provide the business. each of the problem-product-hypothesis hypotheses that are formulated, tested and The goal of the project is generally components. Ensure they can be refined refined helps to mitigate common risks in proved by the measurability of the through an iterative approach and revisit data projects. 
The risks of inadequate goal results, but it is important to note that these as the project progresses. Further, setting are: hypothesis testing often proves false. be sure there is ongoing goal relevance This is a good thing. Either iterate and as business strategy independently Risk: Not Goal-driven succeed, or accept that the idea does not evolves. Plan for exploration and The main risk is the absence of a strategic flexibility within the project execution. work and go back to the drawing board. project motivation and goal, or non-goals. Setting exploratory boundaries is key, This is superior to a good or interesting In other words, this risk encapsulates result based on bad data. as they ensure projects do not go off motivations to do something meaningful course, while still permitting opportunity with the data because of the appeal, in order Risk: Lack of Focus for discovery. This is also supported by to engage popular buzzwords, because the Equally related to non-goal project risks the specific measurement units and competitors are doing it, or just because are projects whose goals are too general, associated targets, or KPIs, for both it is scientifically or technologically sound ill-defined or overly flexible and changing. intermediate objectives and overall – yet the motivations lack a value-driven The goal sets the direction and outlines goal achievement. counterpart. This approach could lead to unusable results or squandered budgets what will be achieved. Lack of clarity may lead to teams getting distracted Risk: Not Data-driven while it presents a missed opportunity to or analyzing ancillary questions, thus Renowned economist Roland Coase leverage the analysis to deliver goal-driven results that are relevant to the organization. delivering ancillary results. 
Taking this stated: “If you torture the data long For those particularly motivated to do into consideration, some flexibility must enough, it will confess.” The risk is forcing something, it is not uncommon to bring exist for iterative goal refinement, and data to reveal what one expects in an aboard external resources who are simply to allow for exploring and capitalizing on attempt to validate desired knowledge, tasked to discover something interesting. serendipitous discovery. Lack of focus can behavior or organization. Turning to a This can risk results that are not only also be the result of a problem-solution data-driven approach means being ready unusable, but wrong, as open-ended mismatch. This is when the underlying to observe evidence as it emerges from exploration may permit biased analysis or strategic problem may not be precisely data analysis. In other words, analyzing forced results in the drive to deliver. defined, or where the proposed goal projects, processes or procedures through 106 DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES data might lead to results that are not Quadrant 1: TOOLS and their interrelated relations. To yield aligned with current beliefs, thoughts or knowledge and value from their analysis, strategy, forcing an organization to make data must be stored, described in a proper a deep change. way and made accessible. This requires a suitable technical infrastructure to be put in Mitigation: Emulate the scientific place to manage the data, their accessibility method to set time-bound project and computation. This also permits access to whole system analysis and the objectives supported by hypotheses tantalizing patterns that can drive value. that are rigorously tested. 
Ensure the The first quadrant of the Data Ring asks execution strategy uses the concept of project managers to consider their data reproducible research to better enable and the technical infrastructure needed to repeatability and independent validation analyze it through two components: data of results. Also, ensure project sponsors and infrastructure. fully understand that finding valuable Tools: Data patterns is not guaranteed. Figure 20: Data Ring Quadrant 1: Data are the fundamental input (and Risk: Not Pragmatic TOOLS output) of a data project. The Data Ring’s guiding questions are grouped by two Goals should be realistic with respect The world and its dynamic phenomena principles: accessibility and format. These to the project resources and sponsor can be observed and fragmented into data. are critical elements that deeply affect expectations, for example, appropriate In other words, data are just samples of resource needs and process decisions. competency, infrastructure or budget. reality, recorded as measurements and stored as values. In addition, complex systems belie First, it is necessary to know how the data Mitigation: Ensure that product scale is further knowledge, which is embedded in are described, their properties, and if they considered as part of the goal statement. the collective behavior of different system represent numbers, text, images, or sound. This helps bound the project and push components. Individual components may Also, if they are structured or unstructured. project managers to match resources reveal nothing, but patterns emerge from The data must also be understandable and requirements. Additionally, ensure observing the whole system. to humans and must exist in a digitized, machine-usable format. These basic an information and communication The data revolution has provided an parameters are relevant for data of technology (ICT) specialist performs a exponential increase in the volume, all sizes and shapes. 
These are critical technical IT assessment of the project velocity and variety of digital data. factors for determining the best technical design to ensure pragmatism between This increased availability of digital data infrastructure to use for the project. See the project goal and the technical tools allows higher granularity and precision in Chapter 1 for additional discussion on sourced to deliver it. the comprehension of processes, activities data formats. DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES 107 2.1_MANAGING A DATA PROJECT Recently, the concept of big data became The following framing questions help longer or shorter, which means higher or prominent. This is a useful concept, identify sources of data and scope them lower project costs. Inadequate upfront but its prominence has also created in terms of project resource requirements. data planning can result in ballooning misconceptions. Particularly, that the If internal data systems do not capture costs down the line; revisions could mean simple availability of a large or ‘big’ amount what is assumed, this forces project needing to select different computational of data can increase knowledge or provide resource planning to shift by identifying infrastructure or different team capacities. better solutions to a problem. Sometimes new required data resources: Data Accessibility this is true. However, sometimes it is not. Though big data can provide results, it is • What data are produced or collected Data must be accessed in order to be also true that ‘small’ data can successfully through core activities? used. It may sound trivial, but this issue deliver project goals. It is important for the • How are those data produced (e.g., is complex and needs to be considered at which products, services, touch points)? 
the very beginning of each data-driven project manager to ensure that the right process to ensure results are on time and (and sufficient) data are available for the • Are the data stored and organized or do on budget – or if results are even possible. job and that the right tools are in place. they pass through the process? Customer privacy, requesting and granting The definition of ‘big’ is constantly shifting, • Are the data in machine-readable form, data-use permissions and establishing so dwelling on the term itself rarely benefits ready for analysis? who has both ownership and legal interest a project. What is most useful about the • Are the data clean, or are there once data access permissions are granted big data concept is understanding that the irregularities, missing or corrupt values are factors that make data accessibility bigger a dataset is, the more time it will or errors? complex, inconsistent across regulatory take to analyze. With that in mind, a bigger environments, and subject to ethical • Are the available data statistically dataset also requires more specific technical concerns. Data accessibility may be judged representative, to permit hypothesis team capacities and the more complex, according to three factors: testing? sophisticated or expensive technical • What is the relation between data size Legal infrastructure to manage it. Data ‘bigness’ and performance needs? Regulations might prevent an excellent and can also relate to a goal’s scale; a MVP may well-designed data-driven analysis from be attainable with only a snapshot of data, These questions are exemplary of the being carried out in its entirety. This would but production may expect continuous effort necessary in the initial phase in interrupt the process at an intermediate high-velocity transactional data. 
Ownership of data must be established, identifying who has permission to analyze the data for insights. If IP agreements are in place, they need to cover both existing and derivative works. If the analysis is a research collaboration, publication agreements should be in place, including clarity on what constitutes proprietary information and what may be made public.

Ethical use of the information may also carry legal constraints. Data regarding people, groups or organizations must be treated carefully, putting safety as the first consideration. Data privacy regulations may also influence how data may or may not be transferred from owner to analyst, such as whether they can be sent electronically or by physical storage. Additionally, regulations may outline procedures for data leaving national borders, being routed via third parties, or being stored on servers located in specific countries.

Technological
Barriers can exist if the data format is misaligned with the technology selected for data processing and analysis. As a simple example, an NLP algorithm cannot be meaningfully applied to image data. More practically, databases are generally optimized for specific types of data, and some technologies are not designed to work together – much as a workflow that mixes Apple and Microsoft products may result in costs and inefficiencies, and may create extra problems to solve through forced alignments.

Digital data are required in order to analyze them at machine scale and speed. Even if datasets are digitized, they might be isolated and inaccessible due to incompatible technological choices made by different departments of the same company, government or organization. Sometimes obsolete systems are in place, which can also prevent interaction with modern solutions, languages and protocols. The effort required to harmonize the technological infrastructure can be a non-trivial barrier from a time-cost perspective.

Strategic
Actors might seek to preserve a competitive advantage by intermediating access to their data assets. This usually takes shape in one of three ways: by requiring special hardware or software to read proprietary data formats; by controlling how the data can be used; or by requiring special licensing fees. Whereas technological factors might offer a work-around – albeit sometimes a complex or inefficient one – strategic factors are often established to deliberately ensure that access is only possible according to the data owner’s specification, or perhaps denied entirely. There may be some nuanced exceptions to the rule, and AI is pushing these boundaries.

Data Format
Compatibility is needed between the data format and the technology used to manage it. Digital data can be represented in many different forms, and a data format describes data’s human-understood parameters (i.e., text, image, video, biometric). Often, the format is referred to by the three- or four-letter suffix at the end of a computer file. Format may also refer to data storage structures and databases more generally, for example: Oracle, MongoDB and JSON. (See Chapter 1.1, Defining Data)

There are numerous data formats, especially including storage and processing approaches. Data format is determined strongly by business or organizational context and, in particular, by the people responsible for managing the data creation, storage and processing. For project managers, recognizing format fragmentation and incompatibility issues is key to establishing the data alignment required for well-designed projects.

Understanding the values recorded in a dataset, as well as more general dataset metadata, helps project managers to plan properly.

A data point’s value refers to the intrinsic content of a data record. This content may be expressed in numerical, time or textual form, called the data type. For data analysis, the crucial factor is that these underlying values are not affected by systematic errors or biases due to infrastructure or human-related glitches. Generally, project managers do not consider how data are collected or whether instrumentation is well-tuned. It is relevant to understand how these underlying measurements are made and to ensure there is proper knowledge transfer between data owners and data analysts about key measurement issues. As a practical example, if a system went down during an IT upgrade, the outage will be reflected as a dramatic drop in transactions. Analysts need to be aware of this information to interpret the anomaly correctly. Anomalies in data values greatly influence the process of data cleaning and related project planning.
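The IT-outage example can be made concrete: a scripted scan of daily transaction counts can flag days that deviate sharply from the norm, so analysts can investigate before cleaning. The daily counts below are invented, and a simple z-score rule stands in for whatever anomaly test a project actually adopts.

```python
from statistics import mean, stdev

# Invented daily transaction counts; the collapse at index 7 mimics the
# IT-upgrade outage described in the text.
daily_counts = [980, 1010, 995, 1005, 990, 1000, 1012, 120, 998, 1003]

def flag_anomalies(counts, z_threshold=2.5):
    """Flag indices whose count deviates from the mean by more than
    z_threshold standard deviations. Note that a single extreme outlier
    inflates the standard deviation itself, which is why robust
    (median-based) statistics are often preferred in practice."""
    m, s = mean(counts), stdev(counts)
    return [i for i, c in enumerate(counts) if abs(c - m) > z_threshold * s]
```

Flagged days would then be cross-checked against operational records (such as the upgrade log) before any correction is applied.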
Metadata are ‘data about the data’: all of the additional background information that enriches a dataset and makes it more understandable. The header titles of columns in an Excel sheet are metadata (the titles are themselves text data that describe the values in the rows below). For example, imagine a dataset with the labels ‘agent name’ and ‘transaction volume’, followed by a column of numbers with no header. Are those numbers related to transaction values, or perhaps the times when the transactions took place? If the project seeks to visualize volumes on a map, agent location also becomes a data requirement; the computational process must be able to ask the dataset to provide all location values. If the location category is not described by defined metadata, then the process will not be able to find any GPS coordinates to plot. The solution could be simple – say, adding a ‘location’ title to the unnamed column. In this way, project teams can add contextualized information to datasets and provide more detailed descriptions of the data (i.e., metadata) that the analytic process can then query and use.

In this sense, metadata are just another dataset. Metadata are special because they are inherently connected to the underlying dataset, which enables this question-and-answer process to take place. This is just one example; metadata are more than column headers. Even in Excel, metadata exist about the spreadsheet being worked on: file size, date created and author are all examples of metadata. Such underlying metadata enable file searching and sorting; for example, the operating system can ask for all the files modified in the last week, and the answers are obtained through the files’ metadata.

Understanding how datasets are connected via metadata is a key element of project design and key to identifying gaps and opportunities for analysis. Metadata help identify where additional data may be required to deliver project goals, and how to link in new datasets when required. They help to identify efficiencies where supplementary datasets may already exist; licensing third-party data may fill gaps, and derivative or synthetic metadata could be created to help contextualize project datasets. For project managers, it is important to know when and where metadata are likely to exist. If they are not part of initial datasets, it may be best to ask the data owners for this information, rather than reconstruct it as part of the project work.
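The unnamed-column fix described above takes only a few lines. This sketch assumes a small CSV extract (all values invented) whose third column lost its ‘location’ header; supplying the missing field name lets downstream code query all location values.

```python
import csv
import io

# Invented extract: the third column (GPS coordinates) arrived with no
# header, so downstream code cannot ask for 'location' values.
RAW = """agent_name,transaction_volume,
Shop A,412,"0.3476,32.5825"
Shop B,227,"0.3163,32.5822"
"""

def read_with_repaired_header(csv_text,
                              repaired=("agent_name", "transaction_volume", "location")):
    """Re-read the file, supplying the missing column title so the
    analytic process can query all location values."""
    buf = io.StringIO(csv_text)
    next(buf)  # discard the broken header row
    rows = list(csv.DictReader(buf, fieldnames=repaired))
    return [r["location"] for r in rows]
```

The repaired header is exactly the kind of contextualizing metadata the text describes: it changes nothing in the underlying values, but makes them answerable to questions.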
Tools: Infrastructure
As previously explained, data are the fundamental input (and output) of a data project. The infrastructure is where data physically go in and come out from. Data are digital information that need to be acquired, stored, processed and computed using informatics tools running on virtual or physical computers.

The technological infrastructure has to be appropriate for the objectives at hand as far as the volume, variety and velocity of the data are concerned. The infrastructure resources enable the usability of the data and strongly affect the ‘power’ and effectiveness of the scientific algorithms and mathematical models applied. A generic data-driven infrastructure is built from three core blocks: the data pipeline, storage and frameworks.

Data Pipeline
The data pipeline is a functional chain of hardware or software in which each element receives input data, processes it, then forwards it to the next item. It is how data are moved through the analytic process; the pipeline includes the upload process, the tools that crunch the numbers, how the results are downloaded, and how they are then fed into an operational process. For example, the pipeline delivers the technical integration of a data product into broader corporate systems. The pipeline must be planned to ensure a reliable process that takes in raw data and delivers usable results, and the project should ensure that a schematic or flow diagram is written to describe the pipeline’s functional implementation.
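The functional chain described above can be sketched as a few composable steps, each taking the previous element’s output. Everything here is illustrative: the record fields, the cent-to-unit conversion and the in-memory ‘warehouse’ are stand-ins for real source and target systems.

```python
# A minimal pipeline sketch: each element receives input data,
# processes it, then forwards it to the next item.

def extract():
    # Raw records as they leave the source system (values invented).
    return [{"id": 1, "amount_cents": "15000"}, {"id": 2, "amount_cents": "nil"}]

def transform(records):
    # Clean and normalize: drop unparseable amounts, convert cents to units.
    out = []
    for r in records:
        try:
            out.append({"id": r["id"], "amount": int(r["amount_cents"]) / 100})
        except ValueError:
            continue  # in a real pipeline, rejected records are logged and reviewed
    return out

def load(records, store):
    # Forward cleaned records to the next element in the chain.
    store.extend(records)
    return store

warehouse = []
load(transform(extract()), warehouse)
```

A flow diagram of a production pipeline would name each of these stages, the systems they touch, and where rejected records go.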
The initial upload into the pipeline generally marks the operational start of a data project, beginning with the data Extraction-Transformation-Loading (ETL) process. The ETL is a procedural plan, set as part of the project’s data governance, which is discussed in more depth later.

Storage
Storage – a database or file system – is the infrastructure element for storing data. Storage affects how data are saved and retrieved, and these input-output processes are critical for designing a well-performing system. It takes time to write data to a disk, and when a query arrives, it takes time to search for the answer and send it to the next step in the data pipeline. The right database tools are often guided by the nature of the data themselves, their format and their structure. Additionally, how the data are used plays a role in storage: an archiving system aims to compress as much data into a volume as cheaply as possible, while a transactional database ensures speed and reliability so customers are not kept waiting. Frameworks also guide database choice by providing built-in tools optimized for specific storage solutions and designs.

Frameworks
A framework is a solution set designed for a group of problems. Technically, it is a set of predefined libraries and common tools that enable writing code and programs more quickly and easily. In the area of big data, these include platforms that collect tools, libraries and features in order to simplify data management and manipulation (e.g., Apache Spark, Apache Hadoop, Hortonworks, Cloudera; see Chapter 2.2.3, Technology Database). It is worth noting that a project may integrate multiple frameworks. Using an established framework is recommended because it avoids the need to program common tools from scratch, which can be an enormous time and cost saving. The trade-off is that the project approach must adapt to the framework’s way of solving the set of problems it was designed to address, which may or may not perfectly fit the precise needs of the project. Selecting the wrong framework risks mismatching its solutions approach with the project’s problems, introducing inefficiencies.

Frameworks are typically designed around hardware specifications, and they ultimately run on the computers that crunch the numbers for the data project. While raw computing power is equally a critical element of the project’s infrastructure, it is best to first plan the data pipeline, storage requirements and frameworks necessary to accomplish the project’s needs; adequate computing specifications tend to fall into place afterward. Infrastructure design and management is usually not the role of project managers, but they do need to ensure capacities and resources are available to meet project needs. This is why an IT assessment is specifically indicated as part of managing risks and setting pragmatic goals. Relying on internal IT teams, or ensuring relevant capacity on the data project team, is critical to help assess infrastructure requirements and technical needs, including scalability, fault tolerance, data distribution, or environment isolation. These technical terms are relevant for large-scale enterprise computational infrastructure; MVP goals can be achieved with much less. Even small data projects, however, are likely to engage enterprise architecture around the data pipeline: the project will almost certainly need to feed in data from corporate systems, and this needs to be well-scoped, planned and coordinated with IT teams.
Quadrant 2: SKILLS
The second quadrant of the Data Ring asks project managers to consider the human resources needed to deliver the project through three components: computer science, data science and business.

Figure 21: Data Ring Quadrant 2: SKILLS

The Team
Assembling the right mix of skill sets is a challenge for data project managers because of the dynamic evolution of technology, ever-increasing dataset sizes and the skills required to derive value from these resources.

Data-driven projects need data scientists. With that said, ‘data scientist’ is a relatively vague and broad title, one that is still being defined. Meanwhile, industry and media have generated hype about big data, machine learning and a host of related technologies, while also creating a broader awareness of data’s tremendous potential value. This has created pressure to invest in these resources in order to keep up with the competition. It is critical for the data-driven project manager to be aware that very specific sets of skills and technical experience are needed to deliver a data project’s requirements. Equally critical, they must be aware that many of these fields of expertise are forming dynamically, in lockstep with technology’s rapid change.

A ‘data scientist’ is usually a team of people dealing with data. Beyond any single competency, the work usually requires an interdisciplinary team of technical experts that interacts strongly with all the units – whether a single person or a group – that manage data from acquisition to visualization.

Teams are dynamic and collaborative, and it is difficult to keep pace with innovation and the development of new skillsets, emergent expertise and a growing hyper-specialization. Outsourcing capacities can achieve the required dynamism and fit-for-purpose skillsets. Alternatively, retaining or building core in-house data science generalists can help ensure successful collaboration across a team of multidisciplinary data specialists and business operations.

An open, scientific and data-driven culture is required. A proper scientific approach and a data culture must exist within the team and, ideally, within the entire company. Because good goal setting is predicated on emulating the scientific method and exploratory hypothesis testing, the data science team must be driven by a sense of curiosity and exploration. The project manager must ensure that curiosity is directed and kept on target.

The following framing questions will help project managers identify resources and needs:
• Who is responsible for managing the data in the enterprise? How?
• Are there any ongoing collaborations with research institutions or qualified organizations to perform the data science activities?
• Which recruiting channels exist for data-driven professionals?
• How is data culture fostered inside the company, and who is involved?
• How is multidisciplinary collaboration facilitated in project planning and execution?
• How is scientific validity ensured in choosing algorithms and mathematical representations (modeling)? Is a qualified person ensuring the results are valid?
• Who ensures good practices are in place and algorithms are programmed efficiently?
• Is there an open collaboration between the data-driven team and other business units?

A complete, highly interdisciplinary team is difficult to achieve, and most firms are unlikely to have the full breadth of relevant skill sets to draw on on demand. Understanding these gaps is usually the first step to understanding the full potential for, and planning of, outsourcing investments, which is considered part of process planning.

Skills: Computer Science
Data are digital pieces of information that need to be acquired, stored, processed and managed through computing tools, programming and scripting languages, and databases. Therefore, skills should include knowledge of the following areas.

Cloud Computing
When data sources are big or huge, normal programming tools and local computational resources, such as personal computers, rapidly become insufficient. ‘In-cloud’ solutions are a practical and effective answer to this problem, but they require mastering essential knowledge about virtualization systems, scaling paradigms and framework programming. (See Chapter 2.2.3, Technology Database)

Scripting Languages
Working with computing infrastructure means coding. Python and ‘R’ are often the best options to fast-prototype and explore data patterns, and are likely choices for an MVP goal and early-stage project development. Both scripting languages have become deeply established as necessary data science tools, and the team should ideally ‘speak’ both. (See Chapter 2.2.3, Technology Database)

Certain corporate infrastructures and certification requirements might require different coding choices, such as Scala, Java or C++. This can be an issue for a goal’s scale: beyond prototyping, implementation in production will invariably require enterprise-level programming solutions, as well as the skills to implement them. This also likely means that code refactoring, or translating between computer languages, may be required, along with strong interaction between the data team and the IT and engineering staff.

Databases and Data Storage
Chapter 1 discusses structured versus unstructured data. A data project may draw on both, which are respectively handled by relational and non-relational databases, and using these tools requires different skillsets. Data sourced from enterprise transactional systems is likely to come from relational databases. Increasingly, even internal data, such as KYC or biometric information, may be stored in either solution, depending on the collection method. However, a credit scoring algorithm that seeks to use social network data is likely to draw on unstructured data from non-relational sources.

Version Control and Collaboration
Versioning tools are essential for organized code evolution, maintenance and teamwork, and are thus essential for good project planning.
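The fast-prototyping role of scripting languages described above can be illustrated: a first exploratory summary of a dataset takes only a few lines of standard-library Python. The transaction amounts below are invented.

```python
from statistics import mean, median, stdev

# Invented transaction amounts -- the kind of quick exploration an
# MVP-stage project runs before committing to enterprise tooling.
amounts = [12.5, 40.0, 7.25, 90.0, 33.5, 18.0, 55.75, 22.0]

summary = {
    "n": len(amounts),
    "mean": round(mean(amounts), 2),
    "median": round(median(amounts), 2),
    "stdev": round(stdev(amounts), 2),
}
```

The same five lines in an enterprise language would require considerably more scaffolding, which is precisely why Python and R dominate the exploration phase.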
Skills: Data Science

Scientific Tools
Different contexts will require a specific mix according to project needs, but the following are broad academic areas that data projects are likely to draw from:
• Solid Foundation of Statistics: used for hypothesis testing and model validation
• Network Science: a discipline that uses nodes and edges to mathematically represent complex networks; critical for any social network data or P2P-type transaction mapping
• Machine Learning: a discipline that uses algorithms to learn from data behaviors without an explicit pre-defined model; used by most projects that deliver a model or algorithm
• Social science, NLP, complexity science and deep learning are also desirable skills that could play a key role in specific areas of interest

Curiosity and Scientific Mind
Attitude and behavioral competencies are critical factors for a successful data science team. People who seek to explore, mine, aggregate and integrate – and thus identify patterns and connections – will drive superior results. In other words, some general ‘hacking skills’ are an added value for the data science team; simply put, the team should possess a mental approach to problem solving and an internal drive to find patterns through methodical analysis.

Furthermore, scientific validation is essential for a data project, and data scientists should have a scientific mind: a methodical approach to asking and answering questions, and a drive to test and validate results. Importantly, team members should find motivation in the results and be open to whatever interpretation a sound analysis of the data yields, even if the findings contradict initial expectations. In line with the scientific method, this approach should be embodied in behavioral competencies, for example: making observations; thinking of interesting questions; formulating hypotheses; and developing testable predictions.

Design and Visualization
This requires a multidisciplinary skillset in terms of both technical and business needs. On the technical side, ‘DataViz’ should not be considered exclusively as the final part of the project, aimed at beautifying the results. It is relevant throughout exploration and prototyping, and is well-incorporated at periodic project stages, which makes it a core skillset for data scientists to identify patterns.

Skills: Business
Goal setting is essentially related to delivering business-relevant results and benchmarking against appropriate metrics and KPIs. Knowing how to connect these metrics to project execution is the very purpose of doing the project, and it requires the project team to have sound business knowledge. A clear business perspective is also essential for interpreting results – and ultimately for using and implementing the project to deliver value. With respect to skills, the key message is that a ‘junction person’ is needed to intermediate between data, technical specialists, business management and strategy in order to translate data insights for non-technical people; this intermediary also articulates business needs back to the team in terms of algorithms and technical solutions. There is a growing field of expertise, called data operations, that encapsulates this role.

Privacy and Legal
Except for cases in which datasets are released with an open license – explicitly enabling usage, remixing and modification – such as through open data initiatives, the issues related to privacy, data ownership and rights of use for a specific purpose are not negligible (see the legal barriers to data in Data Accessibility, above).
Corporate legal specialists should be consulted to ensure all stakeholder concerns are properly addressed. That said, big data and privacy issues are pushing into new territory, and legislation aimed at regulating the data approach is still developing. Many companies today are building their data-driven businesses by leveraging gaps in local laws. This can present risks if laws change, while also presenting opportunities to work toward building an enabling environment.

In terms of skillsets, project team members should each have some basic legal awareness. This allows potential problems to be identified and enables constructive dialogue with the legal professionals in charge. Legal awareness is particularly relevant when securing external consultants and when ensuring Non-Disclosure Agreements (NDAs) are thorough, follow regulation and can be upheld. From both an internal and an external perspective, data can also be a source of fraud, and fraud cases are increasingly technically sophisticated and data-driven. Though a data science team does want hacker skills as part of a balanced skillset, it does not want actual hackers. It is critical that the full team is well-versed in legal considerations, and held both legally and morally accountable to adhering to them.

Industry Lessons: De-anonymizing Data
Data Privacy and Consumer Protection: Anonymizing User Data is Necessary, and Difficult

In 2006, America Online (AOL), an internet service provider, made 20 million search queries publicly available for research. Users were anonymized by a random number. In a New York Times article, journalists Michael Barbaro and Tom Zeller describe how customer number 4417749 was identified and subsequently interviewed for their article. While user 4417749 was anonymous, her searches were not. She was an avid internet user, looking up identifying search terms: ‘numb fingers’; ‘60 single men’; ‘dog that urinates on everything’. Searches included people’s names and other specific information, including ‘landscapers in Lilburn, Georgia, United States of America’. No individual search is identifying, but for a sleuth – or a journalist – it is easy to identify the sixty-something woman with a misbehaving dog and a nice yard in Lilburn, Georgia. Thelma Arnold was found, and she affirmed the searches were hers. It was a public relations debacle for AOL.
Netflix, an online movie and media company, sponsored a crowdsourced competition challenging data scientists to improve its internal algorithm for predicting customer movie ratings by 10 percent. One of the teams de-anonymized the movie-watching habits of the anonymized users in the competition dataset. By cross-referencing the public Internet Movie Database (IMDB), which provides a social media platform for users to rate movies and write their own reviews, users were identified by the patterns of identically rated sets of movies in the public IMDB and anonymized Netflix datasets. Netflix settled lawsuits filed by identified users and faced consumer privacy inquiries brought by the United States government.

Another data breach made headlines in 2014, when Vijay Pandurangan, a software engineer, de-anonymized 173 million taxi records released by the city of New York for an Open Data initiative. The data were encrypted using a technique that makes it mathematically impossible to reverse-engineer the encrypted value. The dataset had no identifying search information like Arnold’s, but the encrypted taxi registration numbers had a publicly known structure: number, letter, number, number (e.g., 5H32). Pandurangan calculated that there were only 23 million combinations, so he simply fed every possible input into the encryption algorithm until it yielded matching outputs. Given today’s computing power, he was able to de-anonymize millions of taxi drivers in only two hours.

Properly anonymizing data is very difficult, and there are many ways to reconstruct information. In these examples, cross-referencing public resources (Netflix), brute force and powerful computers (New York taxis), and old-fashioned sleuthing (AOL) led to privacy breaches. If data are released for open data projects, research or other purposes, great care is needed to avoid de-anonymization risks and the serious legal and public relations consequences that follow.
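A sketch of the brute-force technique in the taxi case: when pseudonyms are produced by an unsalted one-way function over a small, known input space, every candidate can simply be tried. The four-character pattern below yields only 26,000 combinations (the real medallion formats totaled roughly 23 million), and MD5 stands in for whichever function was actually used.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits

def pseudonymize(medallion):
    # An unsalted one-way hash: irreversible, but not unguessable.
    return hashlib.md5(medallion.encode()).hexdigest()

def crack(target_hash):
    """Enumerate the whole number-letter-number-number input space
    until a candidate's hash matches the target."""
    for n1, letter, n2, n3 in product(digits, ascii_uppercase, digits, digits):
        candidate = n1 + letter + n2 + n3
        if pseudonymize(candidate) == target_hash:
            return candidate
    return None
```

The defense is to enlarge the effective input space, for example by salting with a secret key or replacing identifiers with random tokens that have no relationship to the original values.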
Social Science and Data
The intersection of data savvy and the social sciences is a new area of scholarly activity and a key skill set for project teams. The business motivation for a data project generally comes down to customers, whether it relates to increased activity, new products or new demographics. To engage customers, one needs to know something about them. Data social science skills help interpret results through a lens that seeks to understand what users are or are not doing, and why; thus, teams are better able to identify useful data patterns and tune models around variables that represent customer social norms and activities.

Sector Expertise
Domain experience, market knowledge and sector expertise all describe the critical relationship between project results and business value. Absent sector expertise, the wrong data can be analyzed, highly accurate models may test the wrong hypothesis, or statistically significant variables might get selected that have no relationship to business KPIs. With many machine learning models delivering ‘black boxes’, or infrastructure frameworks that use automated approaches, there are significant risks that a data project can deliver results that appear to look great but are unknowingly driven without true BI. Therefore, constant dialogue with sector experts must be part of project design.

Communications
Data tell a story. In fact, precise figures can tell some of the most powerful stories in a concise way. Linkages between business communications and project teams are an important element of using project results – as is being able to implement them in the right way, aligned with communications strategy. There is also a strong communications relationship with data visualization and design, especially for public-facing projects. Data visualization is important for communicating intermediate and final results. Ensuring visual design skills is as important as the technical skills to plot charts, make results interactive or serve them to the public through websites. For many data projects, the visualization is a core deliverable, as is the case for dashboards and for many project goals specifically aimed at driving business communications.

Quadrant 3: PROCESS

Figure 22: Data Ring Quadrant 3: PROCESS

The previous sections looked at the upper half of the Data Ring, focused on hard requirements (infrastructure, data and tools) and soft requirements (skills and competences). This section now shifts to the lower half of the Data Ring, which looks at the process for designing and executing a data project.

Acknowledging that corporations and institutions have their own approaches, based on a mix of organizational history, corporate culture, KPI standards and data governance regulations, the following are considered general good practices for enabling data-driven projects and their deliverables.

Data projects must define their deliverables: the results of project Planning and Execution. These results intermediate between Process and the subsequent block that aims at turning them into business Value. The following list specifies eight elements common to many data projects. Where applicable, these should be in a project’s deliverables timeline, or specified within the terms of reference for outsourced capacity.
Dataset(s)
Datasets are all the data that were collected or analyzed. Depending on the size, collection method and nature of the data, the format of the dataset or datasets can vary. These should all be documented, with information on where they are located – such as on a network or in a cloud – and how to access them. Raw input data will need to be ‘cleaned’, a process discussed in the execution section below. Cleaned datasets should be considered specific deliverables, along with the scripted methods or methodological steps applied to clean the data. Finally, aggregated datasets and methods might also be considered specific deliverables. These are needed to help project sponsors see what was done to the data and possibly to detect errors. Additionally, they support follow-on projects or derivative analyses that build on cleaned, pre-aggregated data.

Planning Questionnaires and Collection Tools
Projects that require primary data collection, whether quantitative or qualitative, may need to use or develop data collection tools, such as survey instruments, questionnaires, location check-in data, photographic reports, or focus group discussions or interviews. These instruments should be delivered along with the data collected, including all languages, translations and transcripts. They are needed to permit follow-on surveys or consistent time-series questions, and they also provide necessary audit or verification documents if questions arise at a later stage about the data collection methods.

Data Inventory Report
This is a report summarizing the data that were used for analysis, including the type, size and date of files. It should include discussion of major anomalies or gaps in the data, as well as an assessment of whether anomalies may be statistically biased or present risks to interpretation. It may include charts that plot principal data points for core segments, such as transactions over time disaggregated by product type to show trends, spikes, dips and gaps. Delivered early in the execution process, the data inventory report is an opportunity to discuss potential project risks due to the underlying data, as well as strategies for course-correction and any need for data refinement or re-acquisition. It is especially helpful for scoping data cleaning requirements and striving to adjust for anomalies in a statistically unbiased way.

Data Dictionary
The data dictionary consolidates information from all data sources. It is a collection of descriptions of all data items, for example, tables. A description usually includes the name of the data field, its type, format and size, the field’s definition and, if possible, an example of the data. Data fields that constitute a set should list all possible values. For example, if a transaction dataset has a column called ‘product’ that lists whether a transaction was a top-up, a peer-to-peer transfer or a cash-out, then the dictionary would list all product values and describe their respective codes observed in the data, such as TUP, P2P and COT, respectively. For data that are not in a discrete set, like monetary amounts, a min-max range is usually provided, along with the unit of measure, such as the currency type. Relationships with other datasets should also be specified where possible. For example, a customer’s account number field might be present in product transaction datasets and also in KYC datasets. Specifying this connection helps to understand how data can be merged, or to identify where additional metadata may be needed to facilitate such a merge. The data dictionary is typically delivered in conjunction with the data inventory report, supporting a project’s strategic design discussion, risk assessment or additional data requirements in its early stages.
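A data dictionary can itself be machine-readable, so code can validate records against it. The entries below mirror the ‘product’ example from the text; the sizes, currency and min-max range are invented for illustration.

```python
# A data dictionary as a plain structure: field definitions, coded
# value sets, and ranges with units (all illustrative values).
DATA_DICTIONARY = {
    "product": {
        "type": "text", "format": "3-letter code",
        "definition": "Transaction product type",
        "values": {"TUP": "top-up", "P2P": "peer-to-peer", "COT": "cash-out"},
        "example": "P2P",
    },
    "amount": {
        "type": "decimal", "format": "currency",
        "definition": "Transaction value",
        "range": (0.01, 1_000_000.00), "unit": "UGX",
        "example": "1500.00",
    },
}

def validate(record, dictionary=DATA_DICTIONARY):
    """Check a record's coded and ranged fields against the dictionary,
    returning the names of any fields that fail."""
    problems = []
    if record.get("product") not in dictionary["product"]["values"]:
        problems.append("product")
    lo, hi = dictionary["amount"]["range"]
    if not (lo <= float(record.get("amount", -1)) <= hi):
        problems.append("amount")
    return problems
```

Keeping the dictionary in a structured form like this also makes the cross-dataset relationships it documents (such as a shared account-number field) available to merge logic.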
Each a list of charts with the most relevant or additional data requirements in its project will have its own set of nuanced performance metrics of the predictive early stages. deliverables. These must be defined as part model. See the Chapter 2.2.3: Metrics for Assessing Data Models for a list of the of the project’s process design. Exploratory Analyses and Logbook top-10 model performance metrics and This is a set of plots, charts, or table data Final Analysis Report and definitions. These charts and metrics Implementation Cost-benefit summarizing the main characteristics will be used to evaluate the efficacy and Discussion of a specific enquiry or hypothesis test. All the descriptive statistics of the data reliability of the model. Validation charts This is the final report presenting analysis could also be included, for example, may include the gain and lift charts, and the results, answering the questions and averages, medians or standard deviations. performance metrics will depend on the referring to the goals that were set and The exploratory analysis part of identifying particular project. These may include, for agreed on at the beginning of the project. trends and discovered patterns within example, Kolmogorov-Smirnov test (KS), This should be delivered in conjunction the data is necessary for refining analytic Receiver Operating Characteristic (ROC) with the analytic deliverables. In addition to hypotheses, contextualizing metadata curve, or Gini coefficient. This information discussing methodology, process, findings, or identifying ‘features’ that are used in a is necessary to assess goal-completion and solutions to key challenges, the final model. Exploratory analysis is performed as milestones. The model’s approval for report should articulate the core value part of initial project execution, and it often production use or next-step iteration proposition of the analytic deliverables. continues through to project completion. 
should be made in terms of these metrics. This may include: efficiency gains and DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES 119 2.1_MANAGING A DATA PROJECT cost savings from improved data-driven Metrics and KPIs about a continuous re-modulation on the marketing; forecasting increased lending Metrics are the parameters that drive basis of improving problem awareness and opportunities; or productivity benefits project execution and determine if the definition. Some may believe that if they re- from dashboards. The final report should project is successful. For example: rejecting tune it differently, next time they can hit 85 be considered with respect to the project’s null hypothesis at a 90 percent confidence percent. Some others may think they could implementation strategy, to reflect on the target; achieving a model accuracy rate of add new customer data to improve the cost-benefit of the value proposition in 85 percent; or response time on a credit model. This fluid situation does not help in the analytic deliverables and the resource score decision below two seconds. Ex-ante estimating budgets, but budget parameters requirements to implement them at the metrics setting avoids the risks related should be used by project managers as scale expected by the project. to post-validation when, due to vague a dial to tune efforts, commitment and thresholds, project owners deliver ‘good space in order to test different hypotheses. Process: Planning enough’ results. This is often in an effort Upfront investments should understand The following considerations are particularly to justify the investment, or even worse, this exploratory and iterative process and relevant for planning data projects and affirm results against belief, insisting they its risks. The concept of product scale also helping to specify the scope of intermediate should work. See Chapter 2.2.3: Metrics for helps mitigate this risk; start small, iterate Assessing Data Models, which provides a up. 
It may risk inefficiencies to scale and and final deliverables. list of top-10 metrics used in data modeling refactor, but it also mitigates budgetary Benchmarks projects. Metrics related to user experience risks such as buying new computers only Understanding who else had a similar are also important, but must be specific to later find that the hypothesis does to project context. For example, when not hold. problem and how it was approached assessing how long is acceptable for a user and solved is crucial in the planning the Timeline planning has similar considerations to wait for an automated credit scoring execution phase. Scientific literature is to budget planning. Again, the trade-off decision, faster is better. Still though, an immense source of information and is between giving space to exploration it needs to be a defined KPI ex-ante to the boundaries between research and and research by keeping an alignment to enable the project team to deliver a well- operational application often overlap in the goals and metrics. A project management tuned product. data field. From the project management technique from the software industry perspective, benchmarking means Budget and Timing known as the ‘agile approach’ is useful for analyzing business competitors and their The planning and management control data projects. This approach looks at project activities in the data field, ensuring that must take into consideration the almost- progression through self-sustainable cycles the project is aligned with the company’s permanent open state of data projects. where output is something measurable and practices and internal operations. In lay Goals and targets show an end point, but testable. This helps to frame an exploration terms, don’t reinvent the wheel. until it is reached, a data project is often in a specific cycle. 
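The data dictionary deliverable described above can also be kept in a lightweight, machine-readable form so that scripts can validate incoming files against it. The sketch below is illustrative only: the product codes (TUP, P2P, COT) follow the example in the text, while the field names, ranges and currency are assumptions for the example.

```python
# Illustrative data dictionary for a mobile-money transaction table.
# Field names, codes and ranges are examples, not a prescribed schema.
DATA_DICTIONARY = {
    "account_number": {
        "type": "str",
        "definition": "Customer account ID; also present in KYC datasets",
        "example": "UG-00012345",
    },
    "product": {
        "type": "str",
        "definition": "Transaction type code",
        "allowed_values": {"TUP": "top-up", "P2P": "peer-to-peer transfer", "COT": "cash-out"},
    },
    "amount": {
        "type": "float",
        "definition": "Transaction amount",
        "unit": "UGX",          # unit of measure for a non-discrete field
        "min": 0.0,             # min-max range, as the text suggests
        "max": 5_000_000.0,
    },
}

def check_record(record: dict) -> list:
    """Return a list of data-dictionary violations for one record."""
    problems = []
    for field, spec in DATA_DICTIONARY.items():
        if field not in record or record[field] in ("", None):
            problems.append(f"{field}: missing or empty")
            continue
        value = record[field]
        if "allowed_values" in spec and value not in spec["allowed_values"]:
            problems.append(f"{field}: unknown code {value!r}")
        if "min" in spec and not (spec["min"] <= value <= spec["max"]):
            problems.append(f"{field}: {value} outside min-max range")
    return problems
```

A record such as `{"account_number": "UG-00012345", "product": "P2P", "amount": 12000.0}` passes with no violations, while an unknown product code or a negative amount is flagged; the same structure can feed the data inventory report.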
Partnerships, Outsourcing and Crowdsourcing
This point is particularly important from the project resource perspective. Asking the project design questions about requirements and their sufficiency helps to identify the gaps for project managers to fill. Notably, this is not limited to human resources. Cloud computing is outsourced computational hardware. Even data can be externally sourced, whether by licensing it from vendors or by establishing partnerships that enable access. Crowdsourcing is an emerging technique to solicit entire data teams with very wide exploratory bounds, usually with the goal of delivering pure creativity and innovative solutions to a fixed problem for a fixed incentive. As examples, Kaggle is a prominent pioneer for crowd-sourced data science expertise, and Amazon's 'Mechanical Turk' service supports crowd-sourced small tasks or surveys.

An important element to consider is Intellectual Property (IP). Rights should be specified in contractual agreements. This includes both existing IP as well as IP created through the project. Consider the process and execution phase along the data pipeline: IP encompasses more than final deliverable results; it includes scripts and computer code written to perform the analysis, and even intermediate datasets, aggregates and segmentations that feed into other processes.

Data Governance
This is how and when the data get used and who has access to them. Data governance planning should consult broader corporate policy, legal requirements and communications policies. The purpose of the plan is to permit data access to the project team and delivery stakeholders, while balancing against data privacy and security needs. The data governance plan is usually affected by the project's scale, where bigger projects may carry much more risk than smaller projects. A main challenge is that the data science approach benefits from access to as much data as is available in order to bridge datasets and explore patterns. Meanwhile, more data and more access also pose more risk. Project data governance should also specify the ETL plan. This encompasses transportation, or planning for the physical or digital movement of data, which must consider the full transit through policy or regulatory environments, such as from a company in Africa to an outsourced analytics provider in Europe. The plan should consider the following principles:

• Encryption: Sensitive or identifying information should be encrypted, obfuscated, or anonymized, and maintained so through the full data project pipeline.
• Permissions: Access to datasets should be defined on a granular basis by team roles, or by access point (i.e., from within corporate firewalls, versus from external networks).
• Security: Datasets placed into the project's 'sandbox' environment should have their own security apparatus or firewall, and the ability to authenticate privileged access.
• Logging: Access and use should be logged and auditable, enabled for analysis and reporting.
• Regulation: The plan should ensure regulatory requirements are met, and NDAs or legal contracts should be in place to cover all project stakeholders. Customer rights and privacy must also be considered.

Process: Execution
Exactly as the Data Ring depicts a cyclical process, the Execution phase in many data projects tends to reflect a sort of loop within the loop. What is usually called a 'data analysis' is actually more a collection of progressive and iterative steps. It is a path of hypothesis exploration and validation until a result achieves the defined target metrics.

The Execution phase most closely resembles established frameworks for data analysis, such as CRISP-DM or other adaptations.39 Project managers who prefer to use a specific analytic process framework, or whose projects may be better served by a given approach, can easily incorporate these frameworks into the Data Ring's project design specification here in the execution phase. The following steps are otherwise provided as a general good-practice data analytic execution process. Figure 23 depicts this process as a loop: hypothesis setting; cleaning, exploring and enriching the data; running data science tools; results understanding; and hypothesis validation.

Cleaning, Exploring and Enriching the Data
This step is where the data science team really starts. The chance that a dataset is perfectly responsive to the study needs is rare. The data will need to be cleaned, which has come to mean:

a. Processing: Convert the data into a common format, compatible with the processing tools.
b. Understand: Know what the data are by checking the metadata and available documentation.
c. Validate: Identify errors, empty fields and abnormal measurements.
d. Merge: Integrate numeric (machine-readable) descriptions of events performed manually by people during the data collection process in order to provide a clear explanation of all events.
e. Combine: Enrich the data with other data, whether from the same company, from the public domain, or elsewhere.
f. Exploratory Analysis: Use data visualization techniques to partially explore data and patterns.
g. Iterate: Iterate until errors are accounted for and a process is in place to go reliably from raw data to project-ready data. This is the minimum viable process.
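The cleaning steps above can be sketched as a small, repeatable pass over raw data. The function below is a minimal illustration using only the Python standard library; the column names and amount thresholds are assumptions for the example, not values from the handbook.

```python
import csv
from io import StringIO

def clean_transactions(raw_csv: str,
                       min_amount: float = 0.0,
                       max_amount: float = 5_000_000.0):
    """Minimal cleaning pass: parse into a common format (step a),
    validate empty fields and abnormal measurements (step c), and
    separate project-ready rows from rejected ones."""
    ready, rejected = [], []
    for row in csv.DictReader(StringIO(raw_csv)):
        # c. Validate: flag empty fields.
        if not row.get("account") or not row.get("amount"):
            rejected.append((row, "empty field"))
            continue
        try:
            # a. Processing: convert to a common numeric format.
            amount = float(row["amount"])
        except ValueError:
            rejected.append((row, "non-numeric amount"))
            continue
        # c. Validate: abnormal measurements fall outside the expected range.
        if not (min_amount <= amount <= max_amount):
            rejected.append((row, "amount out of range"))
            continue
        ready.append({"account": row["account"].strip(), "amount": amount})
    return ready, rejected
```

Running this pass, inspecting the rejected share, and adjusting the rules until the raw-to-ready conversion is stable corresponds to the iteration called for in step g.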
Figure 23: The Data Ring Execution Process

39 Related data analytic process methods include, for example: 'Knowledge Discovery in Databases Process' (KDD Process) by Usama Fayyad; 'Sample, Explore, Modify, Model, Assess' (SEMMA) by SAS Institute; 'Analytics Solutions Unified Method for Data Mining/Predictive Analytics' (ASUM-DM) by IBM; and 'Data Science Team Process' (DSTP) by Microsoft.

Running Data Science Tools
This is where data scientists apply their expertise. Machine learning, data mining, deep learning, NLP, network science, statistics, or (usually) a mix of the aforementioned are applied. When developing data projects that include predictive models, it is necessary to have a model validation strategy in place before the model is run. This enables the project hypothesis to be statistically tested. Practically, the dataset that drives the model must be segmented into a 'control' set and a 'treatment' set using randomized selection. A 20 percent to 80 percent split is a common, basic approach. The model is trained on the treatment set. Then, the model can run on the control set, and the model's predicted values can be compared to the control set's known values. This is how accuracy rates are calculated and how a hypothesis may be tested.

Results Understanding, Interpretation and Representation
The results interpretation will be discussed in more detail in the following section. From the process perspective, results understanding focuses on ensuring an alignment between the results obtained and the expected output of the process execution, and on ensuring that they are computationally valid (i.e., controlling for arithmetic errors or coding bugs). The output of any analytic calculation or process, whether big or small, will yield:

• Unusable (or incorrect) results
• Trivial or already-known results
• Usable results that feed into next steps
• Unexpected results (to be investigated with a new pipeline, new data or a new approach)

The project design should recognize these possible outcomes and be prepared to deal with each case. Barring unusable results, all other outcome categories are likely to merit a presentation or reporting task in order to make them comprehensible to others, including internal team members, managers, customers, and a general audience. This usually means a written summary, table, graph, or animation, which are mediums to present and explain results. Data visualization experts play a key role in this process, as it is not just a matter of beautifying results. The difficult task is to create compelling, interactive and visual layers that succinctly add to the broader project narrative, which should constitute a project problem statement unto itself.

Metrics Assessment and Next Steps
Only through a quantitative and precise initial definition of project goals and metrics can project efficacy be judged. If the results are not satisfactory, the process has to start again. This evaluate-and-iterate step is always critical, but it has additional considerations when external firms are sourced. Deliverables may be judged inadequate despite the quality of the work. Accountability for delivered results must be agreed up front, as should the terms of delivering business value and the degree of leeway to continue iterating in pursuit of satisfactory results. Exactly as with their part in the hypothesis-setting first step of this execution loop, data project managers again play a key role in keeping the scientists focused on the main goals and empowering future iterations.
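The validation strategy described above can be sketched in a few lines: randomly split labeled records 20/80 into control (hold-out) and treatment (training) sets, fit the model on the treatment set, and compare predictions against the control set's known values. The toy threshold 'model' and field names below are invented for illustration; a real project would substitute its own model and add metrics such as KS or Gini.

```python
import random

def split_control_treatment(records, control_share=0.2, seed=42):
    """Randomly split records into a control (hold-out) set and a
    treatment (training) set — a 20/80 split by default."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * control_share)
    return shuffled[:cut], shuffled[cut:]   # (control, treatment)

def train_threshold_model(treatment):
    """Toy 'model': predict active=1 when usage exceeds the treatment-set mean."""
    mean_usage = sum(r["usage"] for r in treatment) / len(treatment)
    return lambda r: 1 if r["usage"] > mean_usage else 0

def accuracy(model, control):
    """Share of control records whose prediction matches the known label."""
    hits = sum(1 for r in control if model(r) == r["active"])
    return hits / len(control)
```

The accuracy computed on the control set is then compared against the ex-ante target (for example, the 85 percent accuracy metric mentioned earlier) to decide between production approval and another iteration.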
The execution phase is also the opportunity to reassess project plans, again noting that data projects are best delivered using an iterative approach. The execution phase of a project is what will test the project's design process and approach, pushing for revision when the unexpected arises. The Data Ring framework can also help think through execution problems to identify solutions; its concepts are not restricted to upfront planning. The associated Data Ring Canvas (discussed in 2.1: Application) is designed with this intention, to provide a template that can be updated continuously to reflect the project's status throughout project execution.

Quadrant 4: VALUE
Value is the last part of the Data Ring or, by design, the starting point for future iterations to add or implement components or scale up the design. This step articulates how the results of process execution are ultimately transformed into 'information', and then into 'knowledge and value' that can be implemented.

This value-creation component of the results is usually one of the substantial differences between a traditional data analysis or BI project and an advanced analytics process, particularly in the big data space. This is because project deliverables are rarely defined in terms of written reports, at least not exclusively. Data project deliverables are usually characterized by dashboards, predictive models or data-driven decision-making levers, automatization tools and, ideally, powerful business insights. In other words, a data project rarely ends with recommendations. Instead, it delivers modules to be operationalized.

Figure 24: Data Ring Quadrant 4: VALUE

Value: Interpretation
The first step following an execution stage focuses on understanding the value proposition inherent in the results and what may be needed to refine these outputs or their underlying processes to deliver the Goal. A number could mean nothing or everything, depending on interpretation. Understanding results is not a simple explanation of phenomena. Instead, it means placing results in business context and embracing the complexity of real operations. This also requires a transparent, collaborative approach, discussing the results with all project stakeholders to determine what they mean from all angles. Keeping in mind the role of data operations (see Business Skills), it is not uncommon for data scientists to have difficulty explaining the operational relevance of results to managers. If an important finding is made, its value must be successfully communicated to management, who can drive it into action.

Value: Tuning
Understanding results is just the initial task. Data-derived knowledge must be turned into concrete actions that are manifested in tools, models and algorithms. Because of the iterative, exploratory approach of a data project, the first time a final outcome is successfully reached, it will invariably have rough edges that need to be tuned into a smooth operating tool. Tuning focuses on three areas:

Data Input
The choice and the quality of input data can decisively determine the effectiveness of the algorithms used to perform the analysis. Consider machine learning, where the algorithms develop a learning attitude following a training phase that uses a subset of data. Therefore, by working with data, operations progressively learn to collect better data. Improving the raw data – minimizing anomalies, collection errors and error-prone manual inputs – will result in more finely tuned results over time.

Infrastructure, Skills and Process
After the first execution iterations, there will be a better understanding of the effectiveness of the team allocated to the project and of the data governance processes, as well as of the available software and hardware tools.
Also, there will be increased understanding of how the overall project organization works together. Inefficiencies will be revealed and, as discussed previously, all areas of the project can serve as potential solution sources. Generally, tuning strives for all components to work increasingly well together. This is done through: better team organization; stronger communication; increased team competencies; and technology, whether better methods, increased computational power, or all of the above.

Data Output
Finally, the output data should be reviewed. It is important that output results are not biased or affected by errors (human or otherwise), bad integration between different steps of the process, or even common coding bugs. Often, this means reviewing and fixing the input data, although the analytic process is quite capable of introducing its own anomalies. This is both a validation check and a tuning opportunity. Ultimately, reviewing the output supports overall organizational reliability, such as ensuring that a final visualization displays the correct results 100 percent of the time and under all conditions, for example.

Value: Implementation

Implementation Strategy
To generate a real impact, the implementation strategy must be designed from the beginning, as part of goal setting. This issue must be kept in mind throughout the process. Avoid the risk of obtaining brilliant data that cannot be used in practice. A key aspect of the implementation strategy is to ensure management buy-in. Presumably, allocating resources provides a certain level of commitment. With that said, because stakeholders have been assured there are no guaranteed results from exploratory processes, the implementation strategy needs to ensure continuous support and strong communication around intermediate findings.

Analytic types, as discussed in Chapter 1.1, can also be relevant for thinking about how results get used:

• Descriptive: Summarizing or aggregating information
• Diagnostic: Identifying sub-sets of information based on specific criteria
• Predictive: Usually building on predictive sub-sets, combined with decision-levers
• Prescriptive: Fully integrated into automated systems; a piece of operations

These descriptors can guide implementation strategy, formulating what the use case looks like. This is also an important component of generating buy-in from management. For example, if the use case envisions full automation, the project design questions must ask whether infrastructure and resources are sufficient to implement a fully automated algorithm. If investing in a new data center is needed to run the algorithm and deliver just-in-time credit decisions, buy-in to ensure that the project results are used could be difficult, whereas a use case strategy based on a small-scale pilot implemented with existing resources might make an easier case.

Cost-benefit
The anticipated value proposition should be articulated in the initial design. At the outset, this may be in general terms, for example: an efficiency gain, a cost reduction or customer retention. As the project develops and results are obtained and tuned, the value proposition may become quantified. Once the goal is achieved, this will help define what has actually been obtained and the value that it represents. The same process should be considered for using the results. In the beginning, some general infrastructure or system requirements may be envisioned. Once the project is mature, the value must be weighed against the cost of implementing the solution.
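The Data Output review described above lends itself to automated invariant checks that run every time results are produced, before a dashboard or whitelist is published. The specific invariants below are illustrative assumptions (propensity scores expressed as probabilities), not requirements from the text.

```python
def review_output(scores: dict) -> list:
    """Run simple invariant checks on a model's output before it is published.
    Returns a list of failed checks; an empty list means the output passed."""
    failures = []
    if not scores:
        failures.append("output is empty")
        return failures
    values = list(scores.values())
    # Assumed invariant: propensity scores are probabilities in [0, 1].
    if any(v is None for v in values):
        failures.append("null scores present")
    elif any(not (0.0 <= v <= 1.0) for v in values):
        failures.append("scores outside [0, 1]")
    # A single repeated value often signals a broken pipeline step.
    if len(set(values)) == 1 and len(values) > 1:
        failures.append("all scores identical")
    return failures
```

Wiring such checks into the pipeline is one way to approach the goal stated above: that a final visualization displays correct results every time and under all conditions.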
APPLICATION: Using the Data Ring

A Canvas Approach
As a planning tool, the Data Ring adopts a canvas approach. A 'canvas' is a tool used to ask structured questions and lay out the answers in an organized way, all in one place. Answers are simple and descriptive; even a few words will suffice. Developing a strong canvas to drive project planning can still take weeks to achieve, as the interplay of guiding questions challenges deep understanding of the problems, the envisioned solutions and the tools to deliver them. Below is a list of the four main reasons to adopt a canvas approach:

1. To force the project owner to state a crystal-clear project value proposition
2. To provide self-diagnosis and to define and respect an internal governance strategy
3. To communicate a complete representation of the process 'on one page'
4. To flexibly plan with a tool that can redefine components as the project evolves

The canvas concept was introduced by Alex Osterwalder, who developed the Business Model Canvas. In recent years, it has become unusual to attend a startup competition, pitch contest, hackathon, or innovation brainstorming event without encountering the Business Model Canvas, and without observing people attaching colored sticky notes to canvas poster boards, committed to the hard task of providing a concise, comprehensive schematic vision of their business model. The framework's widespread application among innovators and technology startups provides a solid basis to support the project management needs of innovative, technology-driven data projects. There are many excellent resources providing additional information on the Business Model Canvas, but it is not a prerequisite for understanding or applying the Data Ring.

The Data Ring Canvas takes inspiration from this approach, applied to the specific requirements of data project management, while also emphasizing the need to set clear objectives and apply the right tools and skillsets for successful project implementation. Here, a step-by-step overview refines the five Data Ring structures in terms of their interconnected relationships. The point is that each of the ring's core blocks represents a component of a dynamic, interconnected system. The iterative approach and canvas application allow laying these out in a singular diagram to visualize the pieces of the holistic plan, to identify resource needs and gaps, and to build a harmonious system.

This is done by iterative planning, where a goal must first be set. Once the goal is set, the approach goes step-by-step around the ring to articulate the resources, relationships and process needed to achieve the goal. This is done by sequentially asking four key project design questions for each of the core blocks. The project design questions are:

The Four Project Design Questions
Defining Resources:
1. What resources do I have?
2. What resources do I need?
Defining Relationships:
3. Is the plan sufficient to deliver the project?
4. Is the plan sufficient to use the results?

Figure 25: The Four Project Design Questions asked by the Data Ring Canvas

Before closing this section, it is important to remember the most common mistake made when using these types of business tools: do not focus too much on canvas completion. Simply put, the Data Ring Canvas – like the Business Model Canvas – is only a means, not the objective itself.

Defining and Linking Resources

Defining Resources
The first two questions identify project resource requirements. These are identified by sequentially asking the first guiding question: "What data do I have?... What skills are available to the project?... What internal processes are already in place?..." The guiding questions for each component should be considered in order to detail the planning process. This includes asking, "What value do I have?" Perhaps not in terms of results already achieved, but at the outset this may be a useful, relevant question. There might be tuning methods to draw on from related projects, or perhaps there are pre-existing commitments from management to drive implementation. These should be considered among the initial Value resources that drive overall planning.

Once resources are scoped across each block, the questions iterate:

• What data do I need?
• What skills do I need?
• What budget, benchmark, data governance, or ETL plan do I need?

This is especially critical for value, as exploring required value underlies the project motivation. Also, value ties in with the resources that are acquired through the project's own analytic results. Planning project needs in terms of value also helps to define both intermediate and final project deliverables, including the development of reports or knowledge products. This sequential, iterative approach helps to identify gaps and acquisition requirements as they arise in steps, building the overall plan incrementally.

Linking Resources
With resources specified for each structural block, a project plan should aim to deeply understand their interconnected relationships. The last two project design questions reflect on these relationships; that is, given the resources envisioned in one category block, the need to explore whether the resources in the other categories are sufficiently linked together. If not, requirements and linkages may need to be adjusted vis-à-vis one another. These four linkages – Fit, Ops, Results and Use – are specified in Figure 26. Each linkage should be specified to complete the Data Ring Canvas and articulate a holistic project plan. They are described below.

Figure 26: Highlighting Resource Linkages in the Data Ring Canvas (Data Ring Relationships)

FIT: Tools and Skills
All of the project's hard and soft resources must be able to work together, a relationship described by Fit. It might seem obvious, but practical experience shows that the resources assessment phase is often underestimated. Different pieces of hardware and software need to fit, to 'speak' to one another. People must also speak, not only to communicate with each other within the team, but also to use the technical infrastructure. The canvas should specify the primary scripting and database languages, as well as the specific framework methods needed to deliver the project. Notably, these languages must be common across teams and tools.

The tools and skills should also fit the project's goal scope. The main risk related to an incorrect assessment of the resources is pushing advanced hardware components, fully developed software solutions or human skills (e.g., data scientists) to tackle the project without proper integration with existing infrastructures and domain experts. The recommended starting goal of a minimum viable process and product helps mitigate this risk by goal-setting around smaller resources; the idea is to explore ideas and test product concepts. Once proved, one can incrementally scale up the process and the product with the hard and soft resources needed to go to the next level.

OPS: Skills and Process
Project operations, or Ops, is the process where people tackle the actual computations and data exploration necessary to deliver the project. These activities are driven by the specific analytic questions and operational problems that the project team is working to resolve. For example, a credit scoring project would likely have a specific operational problem to calculate variables that correlate with loan default rates. Similarly, a visualization might have the technical problem of how to plot an agent network on a map. Ops looks at what people are doing. The Process block articulates how people take action in terms of time, budget, procedural or definitional requirements. The project operations link to Skills in that identifying viable solutions to the operational problems requires relevant know-how about the topic. The canvas Ops should specify the project's core operational problems that must be tackled, linked by the skills needed to tackle them and the process to get them done.

RESULTS: Process and Value
The computational Results of the process execution will be turned into value. The canvas should list the specific results that are expected, whether an algorithm, a model, a visualization dashboard, or an analytic report. Value is achieved through the process of how results are interpreted, tuned and implemented. Model validation approaches link with the selected model's type of data results. The model choice is linked by the definitions and metric targets established in Process and by the business interpretability and use implementations that create Value. Numeric results and their interpretation carry the risk of not being able to correctly understand the results obtained. There is also a risk when turning these results into decisions or business levers that deliver value. To ensure results are interpretable for business needs, the canvas must consider its key deliverables and may include additional resources that facilitate value interpretation, such as a final analytic report. Additional data results or supplementary models may also need to be specified to ensure a strong relationship between the Process and Value blocks.

USE: Value and Tools
The fourth project design question looks past delivery, toward achieving value from the project's Use. The project's design must be sufficient to use the output of the data product. A visualization dashboard will run on a computer, for example, that is connected to an internal intranet or the broader web. A web server will put it online so people can use it. The data it visualizes will be stored somewhere, to which the dashboard must connect and access the data. IT staff will maintain these servers. These resources may or may not be identified in terms of what is needed to deliver the project itself. The fourth project design question helps to identify implementation gaps that could emerge upon project completion, ensuring these considerations are made as part of up-front project planning. Use links the Value the project delivers with the Tools needed to feed the project's output data into the implementation system. This is especially important for projects drawing on outsourced solutions, where implementation support needs must be scoped within initial procurement. The canvas Use should specify how the implementation strategy connects to implementation tools.

CASE 14: Managing the Airtel Money Big Data Project

This project management case draws on the Airtel Money Uganda case presented in Chapter 1.2, Case 3. The project was designed and managed by IFC's Financial Inclusion research team based in Africa. The use case below walks through each of the Data Ring's project design questions and considers the specifics of this project. A completed Data Ring Canvas reflects this process, articulating the key project resources and design relationships in a single visualization. While this canvas is for a completed project, the process of using a canvas approach is dynamic, writing and erasing components as misalignments force new design and requirement considerations. In addition, using sticky notes is a good approach, as they permit easy additions and new design elements while also allowing for movement on the canvas until a satisfactory plan is achieved.

Goal Setting: Where the Data Ring Starts
A goal is a solution for a strategic problem, and the project's purpose is to deliver that solution. In this example, the problem was low Airtel Money activity rates. IFC proposed a solution: a model to define the statistical profile of an active user and to match that profile against non-users within the existing GSM subscriber base. Once identified, these customers could be efficiently targeted as high-propensity Airtel Money users. Because it was unknown if this profile match was possible, it was important to set a modest scope aimed at a proof of concept:

• The Goal: To develop a minimum viable customer segmentation prediction model to identify high-propensity active users that would increase activity rates
• The Hypothesis: There is a correlation between GSM activity and Airtel Money activity behavior (i.e., statistical profiles can be created and matched)

Resource Identification
IFC was not in possession of Airtel data ex-ante, having only a commitment from the Airtel partnership to provide access to CDR and Airtel Money transaction data. While both IFC and Airtel have substantial IT infrastructure for their operations, these were not available for project requisition. The IFC team tasked a data operations specialist to manage the project, bringing relevant skills across computer science, data science and the DFS business. IFC DFS specialists, financial inclusion research specialists and regional experts familiar with the local market and customer behaviors supported the project. During process planning, the operational problem was known ex-ante: low Airtel Money activity. The team also had existing benchmark data from a similar data project delivered for Tigo Ghana (see Chapter 1.2, Case 2: Tigo Cash Ghana, Segmentation), which helped to set project management metrics, like an 85 percent accuracy target for the envisioned model. The model's definitions also specified '30-day activity' as its dependent variable. Finally, budget was allocated through the IFC advisory project, funded by the Bill and Melinda Gates Foundation; a six-month timeline was set.

Resource Exploration
Through the IFC-Airtel project […] elements, for which Cignifi, Inc. was selected. Cignifi brought: additional infrastructure resources, with their big data Hadoop-Hive clusters; sector experience working with MNO CDR data; skills in 'R' and Python; statistics and machine learning; and resources for data visualization. The IFC-Airtel-Cignifi team then set a data governance and ETL plan that was advised by legal and privacy requirements. This plan sent the Cignifi team to Kampala, Uganda to work with Airtel's IT team to: understand their internal databases; define the data extract requirements; encrypt and anonymize sensitive data; and then transfer these data to a physical, secured hard drive to be loaded onto Cignifi's servers.

[…] resources, processes and results. Importantly, it helps to pre-identify points that anticipate refinement during the implementation process. It also helps to reassess key process areas when issues are uncovered during the analytic execution and require adjustments to the plan. The data governance plan expected refinement; the project's analytic and execution phase was 10 weeks, but was planned relative to the data acquisition start date, meaning project timing would be affected by the actual date and any ETL issues. The data pipeline also had uncertain sufficiency; planning the pipeline and allocating technical resources was not possible until the final data could …
partnership, the team negotiated be examined and their structure The project’s value expectations were access to six-months of historical known. This is a common bottleneck. specified in the RFP for a data output CDR and Airtel Money data, listing user propensity scores, known Anticipating these uncertainties, approximately one terabyte, to be as a ‘whitelist’. Additional analytics the value add specified an inception extracted from Airtel relational were also specified, including a social deliverable: a ‘data dictionary’ databases and delivered in CSV network mapping and geospatial that discussed all acquired data format. This necessitated a big data analysis. descriptions and relationships, and technical infrastructure and the data that would be used to refine project science skills to analyze it. IFC issued Plan Sufficiency: Delivery sufficiency once these details were a competitive Request for Proposal Sufficiency review helps to ensure known. The execution phase of any (RFP) to outsource these technical alignment across all the planned data project is where surprises test 130 DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES the project plans. As this is to be information in marketing campaigns, revealed a more significant error. The expected, the project also specified if the analysis proved successful. first month’s dataset did have serious an early deliverable in the form The delivery strategy was agreed gaps, and this issue required revising of an interim data report, which with Airtel management: a final the data governance ETL plan and would provide high-level descriptive meeting would allow presentation overall project design. The original statistics and findings of initial and discussion of the analytic report, project plan specified October exploratory analysis, anomalies or and Airtel’s IT team would take 2014 through March 2015 data. gaps in the data. 
The interim data the whitelist to base next steps on The solution was to discard October report would also include anything the findings. data entirely and work with Airtel unexpected that might require a to extract data for April in order to Project Execution: Planning maintain the six-month time series strategic adjustment. Adjustments necessary to ensure a statistically Plan Sufficiency: Implementation Realities on the ground require reliable model. It was also discovered The project’s MVP goal sought to project plan adjustment. The that, according to plan, the data test whether the modeling approach following challenges were discovered themselves were insufficient. The was relevant for Airtel and the during project execution and geospatial and network analysis Uganda DFS market. In this sense, required revising the plan to ensure required tower location data. It was the plan in place was sufficient. all project areas were sufficiently discovered that the Airtel Money The project would deliver (a) a final working toward goal achievement. datasets did not record the location report, with key findings and analysis of where transactions were made, (b) a whitelist: a dataset of Airtel’s After the initial dataset was secured, only the time they took place. The millions of GSM clients – by an the data pipeline process found Cignifi team contextualized these encrypted identifier – each with an irregularities. The extraction process metadata by creatively matching associated propensity score of how somehow inserted empty lines into timestamps in the Airtel Money likely they were predicted to actively the raw datasets. While the data data with timestamps of voice calls use Airtel Money. could be loaded successfully, it for matched users in the GSM data. 
interpreted incorrectly; numerous The team used a 30-minute window, The plan in place was not sufficient data gaps existed even though that which provided a location coordinate in the sense that resources were was not the case. This required that was reliable within a 30-minute pre-allocated to use the whitelist changes to the ETL process. The fix time-distance from the location of DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES 131 2.1_MANAGING A DATA PROJECT the Airtel Money transaction. In transaction within any 30-day period model aimed to identify these high- discussion with the IFC team, it was over the entire dataset. This required value customers. agreed that this was acceptable for the the model design to be redone. This analysis to proceed, although it relied was ultimately a benefit, as the initial Finally, the results interpretation on the assumption that most people, analysis also revealed that cash in led to an additional project results on average, were not traveling great and cash out transactions were not deliverable: business rules. As distances in the 30-minute period providing the desired statistical discussed in the related Airtel case, the between making an Airtel Money robustness to achieve the project’s model’s machine learning algorithms transaction and making a phone call. accuracy metrics. The IFC-Cignifi established a number of significant team agreed to redo the models variables that were difficult to The tuning phase required a using the redefined active users and interpret in a business sense. The IFC number of significant changes. 
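The timestamp-matching workaround described above can be sketched with pandas. This is a minimal illustration only, not project code: all identifiers, times and coordinates are invented, and `merge_asof` stands in for whatever matching logic the Cignifi team actually implemented.

```python
import pandas as pd

# Airtel Money transactions carry no coordinates, so each transaction is
# matched to the nearest-in-time voice call (which carries a tower location)
# for the same user, within a 30-minute tolerance. All data are invented.
mm = pd.DataFrame({
    "user_id": ["a1", "a1", "b2"],
    "mm_time": pd.to_datetime(
        ["2015-01-03 10:05", "2015-01-03 18:40", "2015-01-04 09:10"]),
})
calls = pd.DataFrame({
    "user_id": ["a1", "a1", "b2"],
    "call_time": pd.to_datetime(
        ["2015-01-03 10:20", "2015-01-03 17:55", "2015-01-04 11:00"]),
    "tower_lat": [0.315, 0.320, 0.347],
    "tower_lon": [32.58, 32.59, 32.60],
})

# merge_asof requires both frames to be sorted on their time keys
mm = mm.sort_values("mm_time")
calls = calls.sort_values("call_time")

located = pd.merge_asof(
    mm, calls,
    left_on="mm_time", right_on="call_time",
    by="user_id",                     # only match calls of the same user
    direction="nearest",              # nearest call before or after
    tolerance=pd.Timedelta("30min"),  # the 30-minute window
)
print(located[["user_id", "mm_time", "tower_lat", "tower_lon"]])
```

Transactions with no voice call inside the window simply get no location, which mirrors the project's acceptance that the proxy is only reliable within the 30-minute time-distance.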
The tuning phase required a number of significant changes. The summary statistics of the first-round results appeared unusual to the DFS specialists; they did not match behavior patterns the social science experts were familiar with. It was discovered that the original project definitions had ambiguously specified 'active user' in such a way that the analysis team modeled an output in terms of a DFS transaction within 30 days of the Airtel Money account opening date, rather than a transaction within any 30-day period over the entire dataset. This required the model design to be redone. This was ultimately a benefit, as the initial analysis also revealed that cash-in and cash-out transactions were not providing the desired statistical robustness to achieve the project's accuracy metrics. The IFC-Cignifi team agreed to redo the models using the redefined active users and to refocus on P2P transactions, as they were deemed to provide the greatest accuracy and, importantly, to define propensity scores for the highest revenue-generating customer segment. Moreover, an additional model was added for 'highly active users,' or those who transacted at least once per 30 days over a consecutive three-month period. Although a small group, these users generated nearly 70 percent of total Airtel Money revenue; the additional model aimed to identify these high-value customers.

Finally, the results interpretation led to an additional project results deliverable: business rules. As discussed in the related Airtel case, the model's machine learning algorithms established a number of significant variables that were difficult to interpret in a business sense. The IFC team considered that the deliverable to Airtel management could be enhanced by ensuring the model and associated whitelist propensity scores articulate the statistical profile of active users in business terms that align with business-relevant KPIs. Cignifi delivered three quick segmentation metrics with 'cut points' to profile users by: number of voice calls per month; total voice revenue per month; and total monthly voice call duration.
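The redefined activity metrics can be illustrated in a few lines of pandas. This is a hedged sketch: the column names are invented, and calendar months stand in for rolling 30-day periods, which is an assumption rather than the project's actual implementation.

```python
import pandas as pd

# 'Active': at least one transaction in some 30-day period of the dataset.
# 'Highly active': at least one transaction in each of three consecutive
# 30-day periods. Calendar months approximate the 30-day periods here.
tx = pd.DataFrame({
    "user_id": ["a1", "a1", "a1", "b2", "c3", "c3"],
    "tx_time": pd.to_datetime([
        "2014-11-03", "2014-12-15", "2015-01-20",  # a1: 3 consecutive months
        "2014-12-01",                              # b2: a single transaction
        "2014-11-10", "2015-02-05",                # c3: two isolated months
    ]),
})

# Per-user transaction counts per monthly period, over the study window
monthly = (tx.assign(period=tx["tx_time"].dt.to_period("M"))
             .groupby(["user_id", "period"]).size()
             .unstack(fill_value=0)
             .reindex(columns=pd.period_range("2014-10", "2015-03", freq="M"),
                      fill_value=0))

# Active: any period with at least one transaction
active = (monthly > 0).any(axis=1)

# Highly active: some window of 3 consecutive periods, each with >= 1 tx
had_tx = (monthly > 0).astype(int)
highly_active = had_tx.T.rolling(3).sum().max() == 3

print(pd.DataFrame({"active": active, "highly_active": highly_active}))
```

With the toy data above, all three users count as active, but only `a1` (who transacts in three consecutive months) qualifies as highly active.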
A Completed Canvas: The Airtel Big Data Project Design, Using the Data Ring Canvas

[Figure 27: A Completed Data Ring Canvas for the Airtel Big Data Phase I Project. The canvas (project: Airtel Big Data; designed by IFC; Dec 2015; version 7) records, among other elements: Goal: a customer segmentation model to identify users with high propensity to increase activity rates. Fit: 1 TB of anonymized CDR and Airtel Money transaction data over 6 months; skills spanning IFC (data ops, DFS), Airtel (ICT, ETL) and Cignifi (big data, encryption, statistics, data science, visualization); tools including PL-SQL, R, Python, Pig and ggplot; infrastructure including Airtel's Oracle systems and Cignifi's Hadoop, Spark, AWS and proprietary methods. Ops: definitions of 'active' and 'highly active' users; tuning across different models (GLM, Random Forest, Ensemble); execution of a machine learning model with an 85 percent accuracy target; interpretation via validation out of time and out of sample; a 6-month timeline funded by the Bill & Melinda Gates Foundation; and the IFC-Airtel-Cignifi partnership with three-way communication. Results: customer whitelist propensity scores, an analytic report and 'business rules'. Use: a marketing campaign system using the whitelist, a decision meeting, geospatial mapping of P2P flows and tower-location proxies, and targeted marketing campaigns.]

©2017 International Finance Corporation. Data Analytics and Digital Financial Services Handbook (ISBN: 978-0-620-76146-8). This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. The Data Ring Canvas is a derivative of the Data Ring from this Handbook, adapted by Heitmann, Camiciotti and Racca under (CC BY-NC-SA 4.0) License.
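The canvas's Tuning element, comparing a GLM against a Random Forest on held-out data, can be sketched with scikit-learn. The data below are synthetic and the features invented; this only illustrates the comparison workflow, not the project's actual models, data or results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for GSM usage features (calls, revenue, duration, ...)
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 4))
signal = 1.2 * X[:, 0] - 0.8 * X[:, 1] + 0.3 * rng.normal(size=n)
y = (signal > 0).astype(int)            # 1 = active mobile money user

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for name, model in [("GLM", LogisticRegression()),
                    ("Random Forest", RandomForestClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)                        # fit on the training split
    scores = model.predict_proba(X_te)[:, 1]     # score the held-out split
    results[name] = roc_auc_score(y_te, scores)  # out-of-sample AUC
    print(f"{name}: out-of-sample AUC = {results[name]:.3f}")
```

An out-of-time check, as listed on the canvas, would additionally hold out the most recent period of data rather than a random split, testing whether the model's ranking survives into a later time window.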
View more here: https://creativecommons.org/licenses/by-nc-sa/4.0/

Project Delivery
The model whitelist identified approximately 250,000 highest-propensity users to target as expected active mobile money users. Across the full whitelist of several million GSM users, the top 30 percent of propensity scores predicted uptake for 'highly active' P2P users to generate an estimated 1.45 billion Ugandan shillings from P2P transactions and 4.68 billion Ugandan shillings from cash-out, or approximately $1.7 million in additional annual revenue. The project findings were strong and compelling. However, the implementation strategy was only defined as a decision point. The delivery date coincided with an existing marketing campaign, putting the whitelist results on hold. Airtel Money subscribers grew significantly over the following several months, which diminished the value of the whitelist since many new customers were onboarded through business-as-usual marketing. Over this time, GSM subscribers also grew, which provided millions of new potential Airtel Money users. IFC and Airtel agreed to a Phase II analysis in late 2016. The project goal is similar, with an added analytic component built on Phase I, designed to examine uptake and distribution patterns of Airtel Money across time and geography.

PART 2: DATA PROJECT FRAMEWORKS
Chapter 2.2: Resources

2.2.1 Summary of Analytical Use Case Classifications

Classification | Question Addressed | Techniques | Implementation
Descriptive | What happened? What is happening now? | Alerts, querying, searches, reporting, static visualizations, dashboards, tables, charts, narratives, correlations, simple statistical analysis | Reports
Diagnostic | Why did it happen? | Regression analysis, A/B testing, pattern matching, data mining, forecasting, segmentation | Traditional BI
Predictive | What will happen in the future? | Machine learning, SNA, geospatial, pattern recognition, interactive visualizations | Modeling
Prescriptive | What should be done to make a certain outcome happen? | Graph analysis, neural networks, machine and deep learning, AI | Integrated Solutions, Automated Decisions

2.2.2 Data Sources Directory

Source: Core Banking and MNO Systems
Structure: Typically structured data, using relational databases.
Format: Digital data, which may be extracted in various formats for reporting or analysis. Legacy data might include paper-based registrations, or scanned registration forms.

Name | Data Examples | Uses
Biller Data About Clients | Duration of contract; payment history; purchase types | Enhanced marketing insights; potential to create credit score using biller data
Client Registration Status | Registration status (e.g., active, dormant, never used) | Marketing insights; business performance monitoring; regulatory compliance
Customer KYC | Name; address; DOB; sex; income | Marketing insights; regulatory compliance
Account Status | Account type; activity status (active, dormant, aging of activity, dormant with balance) | Marketing insights; business performance monitoring; regulatory compliance
Account Activity | Account balance; monthly velocity; average daily balance | Marketing insights; credit scoring; regulatory compliance
Financial Transaction Data (direct) | Volume and value of deposits; withdrawals; bill payments; transfers; or other financial transactions | Business and financial performance monitoring; regulatory compliance; marketing insights; credit scoring
Financial Transaction Data (indirect) | Failed transactions; declined transactions; channel used; time of day | Product performance and product design issues; training and communications needs
E-money Data | E-money floats; reconciliations; float transfers between agents | Agent performance management; fraud and risk management
Non-financial Activities | PIN change; balance request; statement request | Marketing insights; efficiency improvements; product development
Loan Origination | Loan type; loan amount; collateral used; length; interest rate | Marketing insights; portfolio performance monitoring; credit scoring; new loan assessment
Loan Activity | Loan balance; loan status; source of loan repayment transaction | Marketing insights; portfolio performance monitoring; credit scoring; new loan assessment

Source: Mobile Money System
Structure: Typically structured data, using relational databases.
Format: Digital data, which may be extracted in various formats for reporting or analysis. Legacy data might include paper-based registrations, or scanned registration forms.

Name | Data Examples | Uses
Customer KYC | Name; address; DOB; sex; income | Marketing insights; regulatory compliance
Registration Status | Activity status (active, dormant, aging of activity, dormant with balance) | Marketing insights; business performance monitoring; regulatory compliance
Wallet Activity | Wallet balance; monthly velocity; average daily balance | Marketing insights; credit scoring; regulatory compliance
Transaction Data | Volume and value of cash in; cash out; bill payments; P2P; transfers; airtime top-up or other financial transactions | Business and financial performance monitoring; regulatory compliance; marketing insights; credit scoring
E-money Data | E-money floats; reconciliations; float transfers between agents | Agent performance management; fraud and risk management

Source: Agent Management System
Structure: Typically structured data, using relational databases.
Format: Digital data, which may be extracted in various formats for reporting or analysis. Legacy data might include paper-based registrations, scanned registration forms, or agent monitoring or performance reports.

Name | Data Examples | Uses
Agent Activities (direct) | Agent transaction volume and value; float transfer; float deposit and withdrawal; float balance; days with no float | Sales and marketing insights; credit scoring; agent performance management
Agent Activities (indirect) | PIN change; balance request; statement request; create new assistant | Sales and marketing insights; agent performance management
Merchant Activities (direct) | Merchant transaction volume and value; number of unique customers | Sales and marketing insights; credit scoring; merchant performance management
Merchant Activities (indirect) | PIN change; balance request; statement request; create new assistant | Sales and marketing insights; merchant performance management
Technical System Data | Number of TPS; transaction queues; processing time versus SLA | Capacity planning; performance monitoring; identify technical performance issues
Agent and Merchant Visit Reports by Sales Personnel | Presence of merchandising materials; assistants' knowledge; cash float size; may more commonly include semi-structured or unstructured data, such as paper-based monitoring reports | Customer insights; agent performance management

Source: Customer Relationship Management (CRM) System
Structure: Often incorporating both structured and semi-structured data that uses relational database or file-based storage systems, such as voice recordings or issue summaries tagged by structured categories.
Format: Digital data, commonly, although semi-structured and unstructured data may not be available for reporting (such as for voice recordings).

Name | Data Examples | Uses
Call Center Records | Issues log; type of issues; time to resolution (may include semi-structured data in reports) | Customer insights; operational and performance management; system improvements
PABX | Number of call center calls; length of calls; queue wait times; dropped calls | Operational and performance management
Customer Care Feedback Data | Number of calls; call type statistics; issue resolution statistics | Identify: technical performance and product design issues; training and communications needs; third party (e.g., agent, biller) issues
Agent and Merchant Feedback Data | Number of agent or merchant calls; call type statistics; issue resolution statistics | Identify: technical performance and product design issues; agent training and communications needs; client issues
Communication Channel Interactions | Volume of website hits; call center volumes; social media inquiries; live chat requests | Customer insights; operational and performance management; system improvements
Qualitative Communication Data | Type of inquiries; customer satisfaction; social media reviews | Customer insights

Source: Customer Records
Structure: Often incorporating structured, semi-structured and unstructured data, ranging from KYC documents that may include a variety of personal information depending on document type, to market or customer surveys, to focus group notes.
Format: A wide variety of formats may be used to store customer record data, including relational databases, file storage systems or scanned or paper documents.

Name | Data Examples | Uses
KYC Documents | ID; proof of salary; proof of address | Regulatory compliance; demographic and geographic segmentation
Registration and Application Forms | Open DFS account; loan application | Regulatory compliance; demographic and geographic segmentation
Qualitative Research | Client interviews; focus groups | Marketing and product insights
Quantitative Research | Awareness and usage studies; pricing sensitivity studies; pilot tests | Marketing and product insights

Source: Agent and Merchant Records
Structure: Often incorporating structured, semi-structured and unstructured data, ranging from KYC documents that may include a variety of personal information depending on document type, to market or merchant surveys, to focus group notes.
Format: A wide variety of formats may be used to store agent or merchant record data, including relational databases, file storage systems or scanned or paper documents.

Name | Data Examples | Uses
KYC Documents | Articles of incorporation; tax returns; KYC documents; bank statements | Regulatory compliance; demographic and geographic segmentation
Registration Forms | Register as DFS agent or merchant | Regulatory compliance; demographic and geographic segmentation
Qualitative Research | Agent interviews; focus groups | Sales, marketing and product insights
Quantitative Research | Mystery shopper research | Sales, marketing and product insights

Source: Third Party Partners
Structure: Third party data may take any form or structure, depending on the content, source and vendor providing it.
Format: Formats may range from common .CSV formats to proprietary access APIs and delivery methods.

Name | Data Examples | Uses
Biller Data About Clients (utilities) | Duration of contract; payment history; purchase types | Enhanced marketing insights; potential to create credit score using biller data
Payer Data About Clients (employer, government) | Payroll history; duration of regular payments | Enhanced marketing insights; credit scoring
Client Information Repositories (e.g., credit bureau, watch-lists, police records) | KYC data; credit rating; previous fraudulent activity | Credit scoring; fraud investigations; risk management
Geospatial Data (satellite data) | Regional demographics; population density; topography; infrastructure such as roads and electricity; financial access points | Market insights; agent management
Social Media and Social Networks | Type and frequency of network activities; personal information; number of connections; type of connections | Market insights; credit scoring

2.2.3 Metrics for Assessing Data Models

TOP-10 LIST OF PERFORMANCE METRICS FOR ASSESSING DATA MODELS

Receiver Operating Characteristic (ROC) Curve: The ROC curve is defined as the plot of the true positive rate against the false positive rate. It illustrates the performance of the model as its discrimination threshold is varied. The greater the area between the ROC curve and the baseline, the better the model.

AUC: Area Under the Curve (AUC) measures the area under the ROC curve. It provides an estimate of the probability that the population is correctly ranked, and represents the ability of the model to produce good relative instance ranking. A value equal to one is a perfect model.

KS: The Kolmogorov-Smirnov (KS) statistic measures the maximum vertical separation between the cumulative distributions of 'goods' and 'bads.' It represents the ability of the model to separate the 'good' population of interest from the 'bad' population.
Lift Chart: Measures the effectiveness of a predictive model, calculated as the ratio of positive predicted values to the number of positives in the sample at each threshold. The greater the area between the lift curve and the baseline, the better the model.

Cumulative Gains: Measures the effectiveness of a predictive model, calculated as the percentage of positive predicted value at each threshold. The greater the area between the cumulative gains curve and the baseline, the better the model.

Gini coefficient: The Gini coefficient is related to the AUC: Gini = 2 x AUC - 1. It also provides an estimate of the probability that the population is correctly ranked. A value equal to one is a perfect model. This is the statistical definition that underlies the economic Gini index for income distribution.

Accuracy: The ability of the model to make predictions correctly, defined as the number of correct predictions over all predictions made. This measure works well only when the data are balanced (i.e., a similar distribution of good and bad cases).

Precision: The probability that a randomly selected predicted-positive instance is truly positive, or good. It is defined as the ratio of true positive predictions to all predicted positive instances.

Recall: The probability that a randomly selected truly positive instance is predicted positive. It is defined as the ratio of true positive predictions to all actual positive instances.

Root-Mean-Square Error (RMSE): A measure of the difference between values predicted by a model and the values actually observed. The metric is used in numerical predictions. A good model should have a small RMSE.
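Several of these metrics can be computed directly from a model's scores. A toy example with NumPy and scikit-learn (labels and scores invented for illustration):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, mean_squared_error,
                             precision_score, recall_score, roc_auc_score)

# Toy ground truth (1 = 'good') and model scores; purely illustrative
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9])
y_pred = (y_score >= 0.5).astype(int)    # classify at a 0.5 threshold

auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1                       # Gini follows directly from AUC

# KS: maximum vertical separation between the cumulative 'good' and 'bad'
# score distributions, equivalently max |TPR - FPR| over all thresholds
order = np.argsort(-y_score)
tpr = np.cumsum(y_true[order]) / y_true.sum()
fpr = np.cumsum(1 - y_true[order]) / (1 - y_true).sum()
ks = np.max(np.abs(tpr - fpr))

rmse = mean_squared_error(y_true, y_score) ** 0.5

print(f"AUC={auc:.4f}  Gini={gini:.4f}  KS={ks:.2f}  RMSE={rmse:.3f}")
print(f"accuracy={accuracy_score(y_true, y_pred):.2f}  "
      f"precision={precision_score(y_true, y_pred):.2f}  "
      f"recall={recall_score(y_true, y_pred):.2f}")
```

Note that AUC, Gini and KS assess the ranking produced by the raw scores, while accuracy, precision and recall depend on the chosen classification threshold.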
2.2.4 The Data Ring and the Data Ring Canvas

The Data Ring and the Data Ring Canvas tools are also available for download from the website of the Partnership for Financial Inclusion here: www.ifc.org/financialinclusionafrica. The following tear-out page provides a copy of the Data Ring and Data Ring Canvas to use.

[Graphic: The Data Ring, a circular canvas with GOAL(S) at its center, surrounded by the areas FIT (Tools, Skills), OPS, USE and RESULTS (Value), ringed by design elements including data formats, data science, storage, privacy, pipeline, compute, visualization, structure, business, communication, benchmarks, metrics and definitions, planning, tuning, budget, input and output, partnership, execution, interpretation, timing, and data sourcing and governance.]

©2017 International Finance Corporation. Data Analytics and Digital Financial Services Handbook (ISBN: 978-0-620-76146-8). This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. The Data Ring is adapted from Camiciotti and Racca, 'Creare Valore con i BIG DATA'. Edizioni LSWR (2015) under (CC BY-NC-SA 4.0) License. View more here: https://creativecommons.org/licenses/by-nc-sa/4.0/

[Graphic: The Data Ring Canvas, a blank tear-out version of the canvas with fields for project name, designer, date and version, and areas for GOAL(S), FIT, OPS, USE and RESULTS.]

©2017 International Finance Corporation. Data Analytics and Digital Financial Services Handbook (ISBN: 978-0-620-76146-8). This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License.
The Data Ring Canvas is a derivative of the Data Ring from this Handbook, adapted by Heitmann, Camiciotti and Racca under (CC BY-NC-SA 4.0) License. View more here: https://creativecommons.org/licenses/by-nc-sa/4.0/

Conclusions and Lessons Learned

The universe of data is expanding on an hourly basis. The analytical capacity of computing is becoming more and more advanced, and the cost of data storage is falling. The data analytics potential described in this handbook, and in these cases, highlights how DFS providers can leverage data large and small to build new services and achieve greater efficiencies in their current operations by incorporating data-driven approaches. Practitioners should strive to adopt a data-driven approach across their business. This will bring greater precision to their activities and an evidence-based approach to decision-making.

Building a Data-driven Culture
Organizational culture is crucial. Organizations need to foster a data-friendly environment where the power of data is celebrated, and where people are empowered and encouraged to explore in order to find ways to improve outcomes. As a result, there is a need to invest in operational team skills, tools and ideas in order to do data justice. Organizational leadership must clearly articulate the vision and the fundamental standards that will form the foundation of its data management program. Leadership must also form a strong commitment to developing the company's data capacities, both in terms of vision and budget.

Additionally, it is essential that there is a clearly defined department or individual with influence within the organization driving the process. Some organizations that are further along the maturity curve have chosen to create a senior-level position called Chief Data Officer (CDO); this person works closely with senior leadership to manage all data-related strategy and management.

The organization should look at its current capacities and experience in order to clearly articulate the future. Important considerations include the size of the organization as well as existing IT resources such as skills and experience. Additionally, moving to a data-driven approach will involve big changes for organizational culture, specifically around how data are shared and how decisions are made. The organization will need to be prepared to provide ongoing support during the change and should be prepared to manage expectations from staff and management. Current levels of data management maturity are also important. The DFS provider may wish to look at current data sources, the reporting framework and the usage of data in decision-making to place themselves on the maturity curve. Understanding where one sits on the data management maturity scale will help the provider develop a roadmap leading toward the desired goal.

Becoming data-driven also includes reviewing the existing staff skillset and assessing team members' levels of comfort with technology and computing. Existing staff can be trained to handle new technologies. They are ideally placed to apply new technologies to old problems because they already know the organization, its market and its challenges. Typically, staff will require classroom and ongoing on-the-job training in data management. The DFS provider may wish to identify staff members who have an aptitude and the right attitude for adopting new technology-enabled practices, then prepare a plan for intensive skills development.

No matter where an organization is in its adoption of data-driven analytics, there is scope to systematically incorporate data into its processes and decision-making. Practitioners can take small steps to begin to rigorously test their clients' needs and preferences, to monitor performance internally and to understand the impact of their business activities. Most crucially, the goals an organization sets for tracking business performance must be quantifiable and measurable.

All Data Are Good Data
Data analytics offers an opportunity for DFS providers to gain a much more granular understanding of their customers. These insights can be used to design better processes and procedures that align with customer needs and preferences. Data analytics is about understanding customers, with the aim of the customer deriving greater value from the product. Notably, combining insights from different methodologies and data sources can enrich understanding. As an example, while quantitative data can provide insights into what is happening, qualitative data and research will elucidate why it is happening. Similarly, several DFS providers have used a combination of predictive modeling and geolocation analysis to identify the target areas where they must focus their marketing efforts.

For the vast mass market that DFS providers serve, in many cases there may not be a formal financial history or repayment data history to use as a base. In these situations, alternative data can allow DFS providers to verify cash flows through proxy information, such as MNO data. Here, DFS providers have the choice of working directly with an MNO or with a vendor. The decision depends on the respective markets as well as the institution's preparedness.

Using Data Visualization
A picture is worth a thousand words, or perhaps a thousand numbers. Using visualizations to graphically illustrate the results from standard data management reports can help decision-making and monitoring. Graphical representations allow the audience to identify trends and outliers quickly. This holds true with respect to internal data science teams who are exploring the data, and also for broader communications, when data trends and results can have more impact than tables by visualizing relationships or data-driven conclusions.

A chart or a plot is a data visualization, in its most basic sense. With that said, 'visualization' as a concept and an emerging discipline is much broader, both with respect to the tools available and the results possible. For example, an infographic may be a data visualization in many contexts, but it is not necessarily a plot. In some cases, this breadth may also include mixed media. A pioneer in this area, for example, is Hans Rosling, whose work to combine data visualization with interactive mixed-media storytelling earned him a place on Time's 100 most ...

Data visualization is related to but separate from data dashboards. A dashboard would likely include one or more discrete visualizations. Dashboards are go-to reference points, often serving as entry points to more detailed data or reporting tools. This is where KPIs are visualized to provide at-a-glance information, typically for managers who need a concise snapshot of operational status. Simple dashboards can be implemented in Excel, for example. Usually the dashboard concept refers to more sophisticated data representations, incorporating the ideas of interactivity and dynamism that the broader concept of data visualization encompasses. Additionally, more sophisticated dashboards are likely to include real-time data and responsiveness to user queries. While data visualization and data dashboards are inherently related and often overlapping, it is also important to recognize that they are conceptually different and judged by different criteria. Doing this helps certify the right tools are applied for the right job, and ensures vendors and products are procured for their intended purposes.

Data Science is Data Art
Chapter 1 noted the history of 'data ...
Many providers influential people list.40 These elements of science’ as a term. Interestingly, those who may not have the technical know-how to dynamism and interactivity have elevated coined it vacillated between calling the design scoring models based on MNO data the field of data visualization far above discipline’s practitioners ‘data scientists’ – in this case, partnering with a vendor charts and plots, even though the field also and ‘data artists’. While data science who provides this service is a good option. encompasses these more traditional tools. won the official title, it is important to 40 Hans Rosling. In Wikipedia, the Free Encyclopedia, accessed April 3, 2017, https://en.wikipedia.org/wiki/Hans_Rosling 146 DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES recognize that creativity, design and even coming to prominence in 2008 (see Figure segments, often drawing from social media artistic sensibility remain critical to the 6 in Part 1). Since then, smartphones have technology. Marketing strategies are field. Following the above discussion of become ubiquitous, computing power has tuned by rigorous statistical A|B testing, data visualization, the process of turning grown substantially and storage costs have which was promulgated by companies like bits of data into informative, interactive, plummeted. Technology companies have Amazon or Yahoo! to refine their website aesthetically pleasing and visually engaging introduced new products that have been designs. Additionally, geographic customer tools require both technical skills and rapidly assimilated into daily life, such as segmentation analysis, mapping P2P flows, creative insights. In reference to Rosling, Google Maps, Apple’s FaceTime video chat and identifying optimal agent placement, the process of making data visualization the and Amazon’s at-home AI, Alexa. 
Data- are all aided by geospatial analysis and leading character in what can most rightly driven products are rapidly taking hold the tools that deliver Google Maps and be described as a theatrical performance in all sectors, as large datasets and data OpenStreetMap technology. As technology further underlines the interplay between science tools deliver innovative value in continues to evolve, DFS providers can data science and data art. The role of the established markets. The mid-2000s saw anticipate new solutions will emerge to data scientists, regardless of functional title, the emergence of data analytics grow help better understand customers, reach is to draw on technical skill and creative prominently beyond the tech industry, larger markets and deliver products and intuition to explore patterns, extract value particularly making early strides in the services tuned to customer needs. from those relationships and communicate Fast Moving Consumer Goods sector, their importance. such as among grocery and department Data for Financial Inclusion stores. Global industry has changed in In the financial inclusion sector, data are This dualism of structured organization a few short years, summarized by the important because the target customer and emergent patterns describes one of widely publicized observation by Tom base often lacks access to banks or other the overarching complexities of many data Goodwin: “Uber, the world’s largest taxi financial services or has limited exposure projects. On the one hand, there is the need company, owns no vehicles. Facebook, and is unfamiliar with financial services. for clear goals, defined architecture and the world’s most popular media owner, Their needs and expenditure patterns are precise expertise to ensure project delivery creates no content. Alibaba, the most diverse and different. Data allows DFS is on time and on budget. On the other valuable retailer, has no inventory. 
And providers to create products and services hand, there is the very important need for Airbnb, the world’s largest accommodation that better reflect customer preferences open-ended flexibility to enable discovering provider, owns no real estate. Something and aspirations. DFS has changed access patterns, exploring new ideas, mining data interesting is happening.” Data-driven and affordability of financial services in to uncover possible anomalies, testing solutions have enabled new entrants to emerging markets by serving the needs hypotheses, and creatively designing disrupt established sectors, and technology of low-income clients, thereby increasing visualizations to tell the data’s story. companies continue to push the envelope. financial inclusion. Global Industry Alternative credit scoring methods are Data brings with it the opportunity to The field of data science has existed for less finding new data sources that enable improve financial inclusion. However, this than a decade, with the term itself only products to reach new customer must be done while ensuring consumer DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES 147 2.2_RESOURCES protection and data privacy are not compromised. Data are being produced and collected passively through digital devices such as cell phones and computers, among others. Many stakeholders have expressed concern that low-income households, the primary producers of these data in the financial inclusion context, may not be aware that these data are being collected, analyzed and monetized. In the absence of a uniform policy, there are differing standards applied across provider types and some instances where consumer rights have been violated. With the proliferation of data analytics, it is critical that all stakeholders – DFS providers, regulators, policymakers, development finance institutions, and investors – discuss the issues associated to data privacy and consumer protection in order to find solutions. 
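The statistical A|B testing mentioned above can be made concrete with a minimal sketch. The two-proportion z-test below is one common way to compare uptake between two product variants; the conversion figures are hypothetical and the snippet uses only the Python standard library — a sketch, not a prescribed methodology.

```python
import math

def ab_test_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test comparing uptake between variants A and B.

    Returns (z statistic, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled conversion rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF, via math.erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical example: variant B converts 260/2,000 users vs. A's 200/2,000
z, p = ab_test_z(200, 2000, 260, 2000)
# p below the usual 0.05 threshold suggests B's higher uptake is not chance
```

A p-value under the chosen significance threshold (commonly 0.05) is the usual basis for rolling out the winning variant to all customers.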
Some practitioners may feel pressured to adopt new technology or methodologies to keep up with prevailing trends or because of actions taken by their competitors. Needless to say, such efforts could be nullified if the organization does not have the technical skill to manage the project or the ability to act on the basis of the insights. Thus, practitioners should identify the business problems they are trying to resolve, assess what data and analytical capability they currently possess, and then make decisions about how to implement the data project. The business goal must be at the heart of any data management project.

Glossary

A|B Testing: A method of checking two different versions of a product or service to assess how a small change in product attributes can impact customer behavior. This kind of experimentation allows DFS providers to choose multiple variations of a product or service, statistically test the resulting uptake with customers and compare results across target groups.

Active Account: An account that is active has been used for at least one transaction in the previous period, usually reported as 30-day active or 90-day active. This does not include non-financial transactions such as changing a PIN code.

Agent: A person or business contracted to process transactions for users. The most important of these are cash-in and cash-out (that is, loading value into the mobile money system, and then converting it back out again). In many instances, agents also register new customers. Agents usually earn commissions for performing these services. They also often provide front-line customer service, such as teaching new users how to complete transactions on their phones. Typically, agents will conduct other kinds of business in addition to mobile money.
Agents will sometimes be limited by regulation, but small-scale traders, MFIs, chain stores, and bank branches serve as agents in certain markets. Some industry participants prefer the terms 'merchant' or 'retailer' to avoid certain legal connotations of the term 'agent' as it is used in other industries. (GSMA, 2014)

Alternate Delivery Channel: Channels that expand the reach of financial services beyond the traditional branch. These include ATMs, Internet banking, mobile banking, e-wallets, some cards, POS device services, and extension services.

Anti-Money Laundering and Combating the Financing of Terrorism (AML/CFT): Legal controls applied to the financial sector to help prevent, detect and report money-laundering activities. AML/CFT controls include maximum amounts that can be held in an account or transferred between accounts in any one transaction, or in any given day. They also include mandatory financial reporting of KYC for all transactions in excess of $10,000, including declaring the source of funds as well as the reason for transfer.

Algorithm: In mathematics and computer science, an algorithm is a self-contained sequence of actions to be performed. Algorithms perform calculations, data processing or automated reasoning tasks.

Alternative Data: Non-financial data from MNOs, social media, and their transactional DBs. Access to other alternative data, such as payment history and utility bills, can also enable the creation of credit scores for clients who may be otherwise unserviceable.

Application Program Interface (API): A method of specifying a software component in terms of its operations by outlining a set of functionalities that are independent of their respective implementation. APIs are used for real-time integration to the CBS or management information system (MIS), and specify how two different systems can communicate with each other through the exchange of 'messages'. Several different types of APIs exist, including those based on the web, Transmission Control Protocol (TCP) communication, direct integration to a DB, or proprietary APIs written for specific systems.

Artificial Intelligence (AI): An area of computer science that emphasizes the creation of intelligent machines that work and react like humans.

Average: The sum of a list of numbers divided by the number of numbers in the list. In mathematics and statistics, this would be called the arithmetic mean.

Average Revenue Per User (ARPU): A measure used primarily by MNOs, defined as the total revenue divided by the number of subscribers.

Big Data: Large datasets whose size is measured by five distinct characteristics: volume, velocity, variety, veracity, and complexity.

Byte: A unit of digital information, considered a unit of memory size. It consists of 8 bits, and 1,024 bytes equal 1 kilobyte.

Call Center: A centralized office used for the purpose of receiving or transmitting a large volume of requests by telephone. As well as handling customer complaints and queries, it can also be used as an alternative delivery channel (ADC) to improve outreach and attract new customers via various promotional campaigns.

Call Detail Records (CDR): The MNO record of a voice call or an SMS, with details such as origin, destination, duration, time of day, or amount charged for each call or SMS.

Channel: The customer's access point to an FSP, namely who or what the customer interacts with to access a financial service or product.

Complexity: Combining the four big data attributes (volume, velocity, variety, and veracity) requires advanced analytical processes. A variety of analytical processes have emerged to deal with these large datasets, targeting specific types of data such as text, audio, web, and social media. Another methodology that has received extensive attention is machine learning, where an algorithm is created and fed to a computer along with historical data. This allows the algorithm to predict relationships between seemingly unconnected variables.

Credit History: A record of a borrower's repayment of debts; responsible repayment is interpreted as a favorable credit history, while delinquency or defaults are factors that create a negative credit history. A credit report is a record of the borrower's credit history from a number of sources, traditionally including banks, credit card companies, collection agencies, and governments.

Credit Scoring: A statistical analysis performed by lenders and FIs to assess a person's creditworthiness. Lenders use credit scoring, among other things, to arrive at a decision on whether to extend credit. A person's credit score is a number between 300 and 850, with 850 being the highest credit rating possible.

Digital Financial Services (DFS): The use of digital means to offer financial services. DFS encompasses all mobile, card, POS, and e-commerce offerings, including services delivered to customers via agent networks.

Dashboard: A BI dashboard is a data visualization tool that displays the current status of metrics and KPIs for an enterprise. Dashboards consolidate and arrange numbers, metrics and sometimes performance scorecards on a single screen.

Data: An umbrella term used to describe any piece of information, fact or statistic that has been gathered for any kind of analysis or for reference purposes. There are many different kinds of data from a variety of different sources. Data are generally processed, aggregated, manipulated, or consolidated to produce information that provides meaning.

Data Analytics: Qualitative and quantitative techniques and processes used to generate information, enhance productivity and create business gains. Data are extracted and categorized to identify and analyze behavioral data and patterns, and data analytics techniques vary according to organizational requirements.

Data Architecture: A set of rules, policies, standards, and models that govern and define the type of data collected and how it is used, stored, managed, and integrated within an organization and its DB systems. It provides a formal approach to creating and managing the flow of data and how it is processed across an organization's IT systems and applications.

Data Cleansing: The process of altering data in a given storage resource to make sure it is accurate and correct.

Data Cube: In computing, multi-dimensional data, often with time as a third dimension alongside columns and rows. In business operations, this is a generic term that refers to corporate systems that enable users to specify and download raw data reports. Many include drag-and-drop fields to design a reporting request or simple data aggregations.

Data Lake: A massive, easily accessible, centralized repository of large volumes of structured and unstructured data.

Data Management: The development, execution and supervision of plans, policies, programs, and practices that control, protect, deliver, and enhance the value of data and information assets.

Data Mining: The computational process of discovering patterns in large datasets. It is an interdisciplinary subfield of computer science. The overall goal of the data mining process is to extract information from a dataset and transform it into an understandable structure for further use.

Data Privacy: Also called information privacy, the aspect of IT that deals with the ability an organization or individual has to determine what data in a computer system can be shared with third parties.
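The data cleansing entry above lends itself to a small illustration. The sketch below shows a typical cleansing pass over customer records — trimming whitespace, normalizing names and phone formats, and dropping duplicates; the record layout and field names are invented for the example.

```python
def cleanse(records):
    """Illustrative data cleansing: trim whitespace, normalize case and
    phone formats, and drop duplicate customer IDs (first record wins)."""
    seen, clean = set(), []
    for rec in records:
        cust_id = rec["id"].strip()
        if cust_id in seen:          # duplicate record: skip it
            continue
        seen.add(cust_id)
        clean.append({
            "id": cust_id,
            "name": rec["name"].strip().title(),
            # keep digits only so '+256 700-123' and '256700123' match
            "phone": "".join(ch for ch in rec["phone"] if ch.isdigit()),
        })
    return clean

# invented sample records with the kinds of inconsistencies cleansing targets
raw = [
    {"id": "C01 ", "name": " amina k.", "phone": "+256 700-123"},
    {"id": "C01", "name": "Amina K.", "phone": "256700123"},  # duplicate
    {"id": "C02", "name": "JOSEPH O.", "phone": "256-701-456"},
]
clean = cleanse(raw)  # two records remain, with normalized fields
```

In practice the same pass would be expressed with a DB or dataframe tool, but the steps — normalize, deduplicate, validate — are the same.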
Data Processing: Generally, the collection and manipulation of items of data to produce meaningful information. In this sense, it can be considered a subset of information processing: the change (processing) of information in any manner detectable by an observer.

Data Scraping: A technique in which a computer program extracts data from human-readable output coming from another digital source, such as a website, reports or computer screens.

Data Scientist: An individual, organization or team that performs statistical analysis, data mining and retrieval processes on a large amount of data to identify trends, figures and other relevant information.

Data Security: Protective digital privacy measures that are applied to prevent unauthorized access to computers, DBs, websites, and any other place where data are stored. Data security also protects data from corruption, and is an essential aspect of IT for organizations of every size and type.

Data Storage: A general term for archiving data in electromagnetic or other forms for use by a computer or device. Different types of data storage play different roles in a computing environment. In addition to forms of hard data storage, there are now new options for remote data storage, such as cloud computing, that can revolutionize the ways users access data.

Data Warehouse: A collection of corporate information and data derived from operational systems and external data sources. A data warehouse is designed to support business decisions by allowing data consolidation, analysis and reporting at different aggregate levels.

Descriptive Analytics Methodologies: The least complex analytical methodologies are descriptive in nature; they provide historical descriptions of institutional performance, analysis of the reasons for this performance and information on current institutional performance. Techniques include alerts, querying, searches, reporting, visualization, dashboards, tables, charts, narratives, correlations, as well as simple statistical analysis.

Electronic Banking: The provision of banking products and services through digital delivery channels.

E-money: Short for 'electronic money', stored value held on cards or in accounts such as e-wallets. Typically, the total value of e-money issued is matched by funds held in one or more bank accounts. It is usually held in trust, so that even if the provider of the e-wallet service were to fail, users could recover the full value stored in their accounts.

E-wallets: An e-money account belonging to a DFS customer and accessed via mobile phone.

Exabyte (EB): A multiple of the unit byte for digital information. In the International System of Units, the prefix exa indicates multiplication by the sixth power of 1,000 (10^18). Therefore, one EB is one quintillion bytes (short scale). The symbol for the exabyte is EB.

Financial Institution (FI): A provider of financial services including credit unions, banks, non-banking FIs, MFIs, and mobile FSPs.

File Transfer Protocol (FTP): A client-server protocol used for transferring files to, or exchanging files with, a host computer. FTP is the Internet standard for moving or transferring files from one computer to another using TCP or IP networks.

Float (Agent Float): The balance of e-money, physical cash, or money in a bank account that an agent can immediately access to meet customer demands to purchase (cash-in) or sell (cash-out) electronic money.

Geospatial Data: Information about a physical object that can be represented by numerical values in a geographic coordinate system.

Global System for Mobile Communications Association (GSMA): The GSM Association (commonly referred to as 'the GSMA') is a trade body that represents the interests of mobile operators worldwide. Approximately 800 mobile operators are full GSMA members, and a further 300 companies in the broader mobile ecosystem are associate members.

Hypothesis: An educated prediction that can be tested.

Image Processing: A somewhat broad term that refers to using analytic tools as a means to process or enhance images. Many definitions of this term specify mathematical operations or algorithms as tools for the processing of an image.

Key Performance Indicator (KPI): A measurable value that demonstrates how effectively a company is achieving key business objectives. Organizations use KPIs at multiple levels to evaluate their success at reaching targets. High-level KPIs may focus on the overall performance of the enterprise, while low-level KPIs may focus on processes in departments such as sales, marketing or a call center.

Key Risk Indicator (KRI): A measure used to indicate how risky an activity is. It differs from a KPI in that the latter is meant as a measure of how well something is being done, while the former indicates how damaging something may be if it occurs and how likely it is to occur.

Know Your Customer (KYC): Rules related to AML/CFT that compel providers to carry out procedures to identify a customer and assess the value of the information for detecting, monitoring and reporting suspicious activities.

Linear Regression: A mathematical technique for finding the straight line that best fits the values of a linear function, plotted on a scatter graph as data points.

Machine Learning: A type of AI that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can change when exposed to new data.

Market Segmentation: The process of defining and subdividing a large homogeneous market into clearly identifiable segments having similar needs, wants or demand characteristics. Its objective is to design a marketing mix that precisely matches the expectations of customers in the targeted segment.

Master Agent: A person or business that purchases e-money from a DFS provider wholesale and then resells it to agents, who in turn sell it to users. Unlike a super agent, master agents are responsible for managing the cash and electronic-value liquidity requirements of a particular group of agents.

Merchant: A person or business that provides goods or services to a customer in exchange for payment.

Metadata: Data that describe other data. They provide information about a certain item's content. For example, an image may include metadata that describe how large the picture is, the color depth, the image resolution, when the image was created, and other data.

Microfinance Institution (MFI): An FI specializing in banking services for low-income groups, individuals and small-scale businesses.

Mobile Banking: The use of a mobile phone to access conventional banking services. This covers both transactional and non-transactional services, such as viewing financial information and executing financial transactions. Sometimes called 'm-banking'.

Mobile Money Service, Mobile Financial Service: A DFS that is provided by issuing virtual accounts, against a single pooled bank account, as e-wallets that are accessed using a mobile phone. Most mobile money providers are MNOs or PSPs.

Mobile Network Operator (MNO): A company that has a government-issued license to provide telecommunications services through mobile devices.

Mobile Phone Type – Feature Phone: A type of mobile phone that has more features than a standard mobile phone but is not equivalent to a smartphone. Feature phones can provide some of the advanced features found on a smartphone, such as a portable media player, digital camera, personal organizer, and Internet access, but do not usually support add-on applications.
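The linear regression entry above can be made concrete with a short sketch. The function below computes the ordinary least-squares line for a set of points; the sample x and y values are illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x (simple linear regression)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x   # intercept: line passes through the means
    return a, b

# e.g. hypothetical monthly transaction counts (x) vs. balance growth (y)
a, b = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.1, 7.9])
```

For this sample data the fitted line is approximately y = 0.1 + 1.96x; the same formulas underlie the scatter-graph picture in the definition above.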
Mobile Phone Type – Smartphone: A mobile phone that has the processing capacity to perform many of the functions of a computer, typically having a relatively large screen and an operating system capable of running a complex set of applications, with Internet access. In addition to digital voice service, modern smartphones provide text messaging, e-mail, web browsing, still and video cameras, MP3 players, and video playback, with embedded data transfer and GPS capabilities.

Mobile Phone Type – Standard Phone: A basic mobile phone that can make and receive calls, send text messages and access the USSD channel, but has very limited additional functionality.

Monte Carlo Methods: Models that use randomized approaches to model complex systems by setting a probabilistic weight at various decision points in the model. The results show a statistical distribution pattern that may be used to predict the likelihood of certain results given the inputs into the system being modeled. These models are typically used for optimization problems or probability analysis.

Natural Language Processing (NLP): The field of study that focuses on the interactions between human language and computers. It sits at the intersection of computer science, AI and computational linguistics, and covers a computer's understanding and manipulation of human language.

Non-parametric Methodology: A commonly used method in statistics where small sample sizes are used to analyze nominal data. A non-parametric method is used when the researcher does not know anything about the parameters of the sample chosen from the population.

Open Data: Data that anyone can access, use or share.

Point of Sale (POS): An electronic device used to process card payments at the point at which a customer makes a payment to the merchant in exchange for goods and services. The POS device is a hardware (fixed or mobile) device that runs software to facilitate the transaction. Originally these were customized devices or personal computers, but increasingly they include mobile phones, smartphones and tablets.

Person to Person (P2P): Person-to-person funds transfer.

Parametric Statistics: A branch of statistics that assumes sample data come from a population that follows a probability distribution based on a fixed set of parameters. Most well-known elementary statistical methods are parametric.

Pattern Recognition: In IT, pattern recognition is a branch of machine learning that emphasizes the recognition of data patterns or data regularities in a given scenario. It is a subdivision of machine learning and should not be confused with machine learning as a whole. Pattern recognition can be either 'supervised', where previously known patterns can be found in given data, or 'unsupervised', where entirely new patterns are discovered.

Peripheral Data: Typically, the most useful peripheral data sources are call center data, data from CRM (ticketing) systems, information from the knowledge base of frequently asked questions, approval e-mails, blacklist and whitelist trackers, or shared Excel trackers.

Predictive Analytics Methodologies: Predictive analytics provide much more complex analysis of existing data to provide a forecast for the future. Techniques include regression analysis, multivariate statistics, pattern matching, data mining, predictive modeling, and forecasting.

Predictive Modeling: A process that uses data mining and probability to forecast outcomes. Each model is made up of a number of predictors, which are variables that are likely to influence future results. Once data have been collected for the relevant predictors, a statistical model is formulated.
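The Monte Carlo methods entry above can be sketched in a few lines. The example below estimates the probability that two dice sum to 10 or more by repeated random sampling — a toy stand-in for the weighted decision points of a real model; the seed and trial count are arbitrary choices for the illustration.

```python
import random

def monte_carlo_dice(trials, seed=42):
    """Monte Carlo estimate of P(sum of two dice >= 10).

    The exact value is 6/36 (about 0.1667); the randomized estimate
    converges toward it as the number of trials grows."""
    rng = random.Random(seed)   # fixed seed makes the sketch reproducible
    hits = sum(1 for _ in range(trials)
               if rng.randint(1, 6) + rng.randint(1, 6) >= 10)
    return hits / trials

estimate = monte_carlo_dice(100_000)  # close to 1/6 at this trial count
```

Real applications replace the dice with a model of the system being studied — for example, simulated repayment paths under random income shocks — but the estimate-by-sampling mechanism is identical.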
Prescriptive Analysis, Methodologies: Prescriptive analysis goes a step further: it provides information to feed into optimal decisions for a set of predicted future outcomes. Techniques include graph analysis, neural networks, machine learning, and deep learning.

Primary and Secondary Research: Primary research is original data collected through one's own approach, often a study or survey. Secondary research uses existing results from previously conducted studies and data collection.

Probability: Probability is the measure of the likelihood that an event will occur, quantified as a number between zero and one, where '0' indicates impossibility and '1' indicates certainty. The higher the probability of an event, the more certain it is that the event will occur.

Psychographic Segmentation: Psychographic segmentation involves dividing the market into segments based on different personality traits, values, attitudes, interests, and consumer lifestyles.

Psychometric Scoring Model: Psychometrics refers to the measurement of knowledge, abilities, attitudes, and personality traits. In psychometric scoring models, psychometric principles are applied to credit scoring by using advanced statistical techniques to forecast an applicant's probability of default.

Qualitative Data: Data that approximate or characterize, but do not measure, the attributes, characteristics, or properties of a thing or phenomenon. Qualitative data describes, whereas quantitative data defines.

Quantitative Data: Data that can be quantified and verified, and are amenable to statistical manipulation. Qualitative data describes, whereas quantitative data defines.

Randomized Controlled Trial (RCT): A randomized controlled trial is a scientific experiment in which the people participating in the trial are randomly allocated to different intervention contexts and then compared to each other. Randomization minimizes selection bias during the design of the experiment. The comparison groups allow the researchers to determine any effects of the intervention when compared with the no-intervention (control) group, while other variables are kept constant.

Scientific Method: Problem solving using a step-by-step approach consisting of (1) identifying and defining a problem, (2) accumulating relevant data, (3) formulating a hypothesis, (4) conducting experiments to test the hypothesis, (5) interpreting the results objectively, and (6) repeating the steps until an acceptable solution is found.

Semi-structured Data: Semi-structured data are a form of structured data that do not conform to the formal structure of data models associated with relational DBs or other forms of data tables. Nonetheless, they contain tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.

Service Level Agreements (SLAs): An SLA is the service contract component between a service provider and a customer. SLAs specify measurable aspects of the service offering; for example, SLAs are often included in signed agreements between internet service providers and customers. An SLA is known as an Operating Level Agreement (OLA) when used in an organization without an established or formal provider-customer relationship.

Short Message Service (SMS): A 'store and forward' communication channel that uses the telecom network and the short message peer-to-peer (SMPP) protocol to send a limited amount of text between phones, or between phones and servers.

Small and Medium Enterprises (SMEs): Small and medium-sized enterprises, or SMEs, are non-subsidiary, independent firms that employ fewer than a given number of employees. This number varies across countries.

Social Network Analysis (SNA): Social network analysis, or SNA, is the process of investigating social structures through the use of network and graph theories.
It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties, edges, or links (relationships or interactions) that connect them.

Standard Deviation: In statistics, the standard deviation is a measure used to quantify the amount of variation or dispersion in a set of data values. A low standard deviation indicates that the data points tend to be close to the mean (or average) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.

Statistical Distribution: The distribution of a variable is a description of the relative number of times each possible outcome will occur in a number of trials.

Structured Data: Structured data refers to any data that reside in a fixed field within a record or file. This includes data contained in relational DBs.

Super Agent: A business, sometimes a bank, which purchases electronic money from a DFS provider wholesale and then resells it to agents, who in turn sell it to users.

Supervised Learning: Supervised learning is a method used to enable machines to classify objects, problems, or situations based on related data fed into the machines. Machines are fed data such as characteristics, patterns, dimensions, color, and height of objects, people, or situations repetitively until they are able to perform accurate classifications. Supervised learning is widely applied to real-life scenarios: it is used to provide product recommendations, segment customers based on customer data, diagnose disease based on previous symptoms, and perform many other tasks.

Support Vector Machines (SVM): A support vector machine, or SVM, is a machine learning algorithm that analyzes data for classification and regression analysis. SVM is a supervised learning method that looks at data and sorts it into one of two categories. An SVM outputs a map of the sorted data with the margins between the two categories as far apart as possible. SVMs are used in text categorization, image classification, handwriting recognition, and in the sciences. A support vector machine is also known as a support vector network (SVN).

Text Mining Analytics: Text mining, also referred to as text data mining and roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived by devising patterns and trends through means such as statistical pattern learning. Text mining usually involves: structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a DB); deriving patterns within the structured data; and evaluating and interpreting the output.

Traditional Data: Traditional data refers to commonly used structured internal data (such as transactional data) and external data (such as information from credit bureaus) that are used in the decision-making process. It may include data generated from interaction with clients, such as surveys, registration forms, salary, and demographic information.

Unstructured Data: Usually refers to information that does not reside in a traditional row-column DB. Unstructured data files often include text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages, and many other kinds of business documents.

Unsupervised Learning: Unsupervised learning is a method used to enable machines to classify both tangible and intangible objects without providing the machines with any prior information about the objects. The things machines need to classify are varied, such as customer purchasing habits, behavioral patterns of bacteria, or hacker attacks.
The main idea behind unsupervised learning is to expose the machines to large volumes of varied data and allow them to learn and infer from the data. However, the machines must first be programmed to learn from data.

Unstructured Supplementary Service Data (USSD): A protocol used by GSM mobile devices to communicate with the service provider's computers or network. This channel is supported by all GSM handsets, enabling an interactive session consisting of a two-way exchange of messages based on a defined application menu.

Variety: The digital age has diversified the kinds of data available. Traditional, structured data fit into existing DBs that are meant for well-defined information that follows a set of rules. For example, a banking transaction has a time stamp, amounts, and a location. Today, however, 90 percent of the data being generated are 'unstructured,' meaning they come in the form of tweets, images, documents, audio files, customer purchase histories, and videos.

Velocity: A large proportion of data are produced and made available in real time. By 2018, it is estimated that 50,000 gigabytes of data will be uploaded and downloaded on the internet every second. Every 60 seconds, 204 million emails are sent. As a consequence, these data have to be stored, processed, and analyzed at very high speeds, sometimes at the rate of tens of thousands of bytes every second.

Veracity: Veracity refers to the trustworthiness of the data. Business managers need to know that the data they use in the decision-making process are representative of their customer segment's needs and desires. Data management practices in businesses must therefore ensure that the data cleaning process is ongoing and rigorous. This safeguards against the inclusion of misleading or incorrect data in the analysis.

Volume: The sheer quantity of data being produced is mind-boggling. It is estimated that approximately 2.5 quintillion bytes of data are produced every day; this amount would fill approximately 10 million Blu-ray discs. These data are also getting younger, which is to say that the amount of data less than a minute old has been rising consistently; in fact, 90 percent of these data were produced in the last two years. The amount of data in the world is expected to rise 44-fold between 2009 and 2020.

Author Bios

DEAN CAIRE, Credit Scoring Specialist, IFC
Dean has worked for the past 15 years as a credit scoring consultant, 12 of them with the company DAI Europe and thereafter as an independent consultant. Over this time, he has helped clients from 77 financial institutions in 45 countries develop more than 100 custom credit scoring models for the following segments: consumer loans (including DFS), standard asset leases, micro enterprise loans, small business loans (including digital financial merchant services), agriculture loans and equipment leases (including DFS), microloans to solidarity groups, and large loans to unlisted companies. Dean strives to transfer model development and management skills to FI counterparts so that they can take full ownership of the models and manage them into the future.

LEONARDO CAMICIOTTI, Executive Director, TOP-IX Consortium
Reporting to the Board of Directors, Leonardo is responsible for the strategic, administrative, and operational activities of the TOP-IX Consortium. He manages the TOP-IX Development Program, which fosters new business creation by providing infrastructural support (i.e. internet bandwidth, cloud computing, and software prototyping) to startups and promotes innovation projects in different sectors, such as big data and high-performance computing, open manufacturing, and civic technologies. Previously, he was Research Scientist, Strategy and Business Development Officer, and Business Owner at Philips Corporate Research.
He graduated in Electronic Engineering from the University of Florence and holds an MBA from the University of Turin.

SOREN HEITMANN, Operations Officer, IFC
Soren leads the IFC-MasterCard Foundation partnership's applied research and integrated Monitoring, Evaluation and Learning (MEL) program. He works at the nexus of data-driven research and technology to help drive learning and innovation for IFC's DFS projects in Sub-Saharan Africa. Previously, Soren led results measurement for IFC's Risk VPU and the Regional Monitoring and Evaluation Portfolio Management team for Europe and Central Asia. He has a background in database management, software engineering, and web technology, which he now incorporates into his work providing data operations support to IFC clients. Soren holds a degree in Cultural Anthropology from Boston University and an MA in Development Economics from Johns Hopkins SAIS.

SUSIE LONIE, Digital Financial Services Specialist, IFC
Susie spent three years in Kenya creating and operationalizing the M-PESA mobile payments service, after which she facilitated its launch in several other markets, including India, South Africa, and Tanzania. In 2010, Susie was co-winner of The Economist Innovation Award for Social and Economic Innovation for her work on M-PESA. She became an independent DFS consultant in 2011 and works with banks, MNOs, and other clients on all aspects of providing financial services to people in emerging markets who lack access to banks or other financial services, including mobile money, agent banking, international money transfers, and interoperability. Susie works on DFS strategy, financial evaluation, product design and functional requirements, operations, agent management, risk assessment, research evaluation, and sales and marketing. Her degrees are in Chemical Engineering from Edinburgh and Manchester, United Kingdom.
CHRISTIAN RACCA, Design Engineer, TOP-IX Consortium
Christian manages the TOP-IX BIG DIVE program, which provides training courses for data scientists, data-driven education initiatives for companies and organizations, and consultancy projects in the (big) data exploitation field. After graduating in telecommunication engineering at Politecnico di Torino, Christian joined the TOP-IX Consortium, working on data streaming and cloud computing, and later on web startups. He has mentored several projects on business models, product development, and infrastructure architecture, and has cultivated relationships with investors, incubators, accelerators, and the innovation ecosystem in Italy and Europe.

MINAKSHI RAMJI, Associate Operations Officer, IFC
Minakshi leads projects on DFS and financial inclusion within IFC's Financial Institutions Group in Sub-Saharan Africa. Prior to this, she was a consultant at MicroSave, a financial inclusion consulting firm based in India, where she was a Senior Analyst in its Digital Financial Services practice. She also worked at the Centre for Microfinance at IFMR Trust in India, focusing on policy related to access-to-finance issues in India. She holds a master's degree in Economic Development from the London School of Economics and a BA in Mathematics from Bryn Mawr College in the United States.

QIUYAN XU, Chief Data Scientist, Cignifi
Qiuyan Xu is the Chief Data Scientist at Cignifi Inc., leading the Big Data Analytics team. Cignifi is a fast-growing financial technology start-up company in Boston, United States, that has developed the first proven analytic platform to deliver credit and marketing scores for consumers using mobile phone behavior data. Dr. Xu has expertise in big data analysis, cloud computing, statistical modeling, machine learning, operations optimization, and risk management.
She previously served as Director of Advanced Analytics at Liberty Mutual and Manager of Enterprise Risk Management at Travelers Insurance. Dr. Xu holds a PhD in Statistics from the University of California, Davis, and a Financial Risk Manager certification from the Global Association of Risk Professionals.

CONTACT DETAILS
Anna Koblanck
IFC, Sub-Saharan Africa
akoblanck@ifc.org
www.ifc.org/financialinclusionafrica

2017