Enormous amounts of data have been generated and stored over the past few years. The McKinsey Global Institute reports this huge volume of data, which is generated, stored, and mined to support both strategic and operational decisions, is increasingly relevant to businesses, government, and consumers alike,7 as they extract useful knowledge from it.11
There is no globally accepted definition of "big data," although the Vs concept introduced by Gartner analyst Doug Laney in 2001 has emerged as a common structure to describe it. Initially, 3Vs were used, and another 3Vs were added later.13 The 6Vs that characterize big data today are volume, or very large amounts of data; velocity, or data generated and processed quickly; variety, or a large number of structured and unstructured data types processed; value, or aiming to generate significant value for the organization; veracity, or reliability of the processed data; and variability, or the flexibility to adapt to new data formats through collecting, storing, and processing.
Big data sources can include the company itself (such as through log files, email messages, sensor data, internal Web 2.0 tools, transaction records, and machine-generated data), as well as external applications (such as data published on websites, GPS signals, open data, and messages posted in public social networks).
This data cannot be managed efficiently through traditional methods17 (such as relational databases) since big data requires balancing data integrity and access efficiency, building indices for unstructured data, and storing data with flexible and variable structures. Aiming to address these challenges, the NoSQL and NewSQL database systems provide solutions for different scenarios.
Big data analytics can be used to extract useful knowledge and analyze large-scale, complex data from applications to acquire intelligence and extract unknown, hidden, valid, and useful relationships, patterns, and information.1 Various methods are used to deal with such data, including text analytics, audio analytics, video analytics, social media analytics, and predictive analytics; see the online appendix "Main Methods for Big Data Analytics," dl.acm.org/citation.cfm?doid=3210752&picked=formats.
Big data reflects a complex, interconnected, multilayered ecosystem of high-capacity networks, users, and the applications and services needed to store, process, visualize, and deliver results to destination applications from multiple data sources.26 The main components in that ecosystem include properties, infrastructure, life cycle, models and structures, and security infrastructure.10
Big data and business management. In order to succeed in today's complex business world, organizations have to find ways to differentiate themselves from their competitors. With the rise of cloud computing, social media, and mobile devices, the quantity and quality of data generated every moment of every day are constantly increasing, and organizations need to take advantage of it. If they use data properly, they can become more collaborative, accurate, virtual, agile, adaptive, and synchronous. Data and information are thus primary assets for organizations, with most trying to collect, process, and manage the potential offered by big data.5 To take advantage, organizations need to generate or obtain a large amount of data, use advanced analytical tools, and employ staff with appropriate skills to manage the tools and the data.3
Big data is a key factor for organizations looking to gain a competitive advantage,4 as it can help develop new products and services, make automated strategic and operational decisions more quickly, identify what has happened and predict what will happen in the immediate future, identify customer behavior, guide targeted marketing, produce greater return on investments, recognize sales and market opportunities, plan and forecast, increase production performance, guide customer-based segmentation, calculate risk and market trends, generate business insight more directly, identify consumer behavior from click sequences, understand business disruption, implement product changes that prevent future problems, obtain feedback from customers, calculate price comparisons proactively, recommend future purchases or discounts, and refine internal processes.25
Big data analytics can be seen as a more advanced form of business intelligence. Business intelligence works with structured company databases, focusing on obtaining reports and indicators to measure and assess business performance. Conversely, big data works with semi-structured and unstructured data from multiple sources, focusing on extracting value related to exploration, discovery, and prediction.9
Big data frameworks. Developing and implementing a big data ecosystem in an organization involves not only technology but management of the organization's policies and people.28 A number of frameworks have thus been proposed in the literature.8,10,12,14,18,27,28 A framework might describe concepts, features, processes, data flows, and relationships among components (such as software development), with the aim of creating a better understanding (such as descriptions of components or design) or guidance toward achieving a specific objective.23 Frameworks consist of (usually interrelated) dimensions or their component parts.
Big data frameworks focus on assisting organizations to take advantage of big data technology for decision making. Each has its good points, although each also has weaknesses that must be addressed, including that none includes all dimensions (such as data architecture, organization, data sources, data quality, support tools, and privacy/security). Moreover, they lack a methodology that would make the process easier by guiding the steps to be followed in developing and implementing a big data ecosystem. They fail to provide strong case studies in which they are evaluated, so their validity has not been proved. They do not consider the impact of the implementation of big data on human resources or organizational and business processes. They do not consider previous feasibility studies of big data ecosystem projects. They lack systems monitoring and a definition of indicators. They fail to study or identify the type of knowledge they need to manage. Finally, they fail to define the type of data analysis required to address organizational goals; see the online appendix for more on the frameworks and their features and weaknesses.
In addition to big data frameworks, system developers should also consider big data maturity models that define the states, or levels, where an enterprise or system can be situated, a set of good practices, goals, and quantifiable parameters that make it possible to determine on which of the levels the enterprise stands, and a series of proposals with which to evolve from one level of maturity to a higher level.2 Several such models have been proposed,15,16,24 all focused on assessing big data maturity (the "as is") and building a vision for what the organization's future big data state should be and why (the "to be"). There is thus a need for a new framework for managing big data ecosystems that can be applied effectively and simply, accounting for the main features of big data technology and avoiding the weaknesses so identified.
In this context, the IRIS (the Spanish acronym for Systems Integration and Re-Engineering) research group at the Universitat Jaume I of Castellón, Spain, has proposed the Big Data IRIS (BD-IRIS) framework to deal with big data ecosystems, reflecting the literature dealing with this line of research. The BD-IRIS framework focuses on data and the tasks of collecting, storing, processing, analyzing, and visualizing necessary to make use of it. However, unlike other frameworks, it focuses not only on operations affecting data but also other aspects of management like human and material resources, economic feasibility, profit estimation, type of data analysis, business processes re-engineering, definition of indicators, and system monitoring.
The BD-IRIS framework includes seven interrelated dimensions (see Figure 1): methodology, data architecture, organization, data sources, data quality, support tools, and privacy/security. The core is the methodology dimension, which serves as a guide for the steps involved in implementing an ecosystem with big data technology and includes phases, activities, and tasks supported by the six other dimensions. These other dimensions include various techniques, tools, and good practices that support each phase, activity, and task of the methodology. Additionally, they include properties and characteristics that must be fulfilled in certain stages of such development. With the exception of methodology, the other six dimensions are included in some of the seven frameworks outlined earlier, though none includes all dimensions.
Methodology dimension. This is the main axis of the framework; the other dimensions are techniques, tools, and good practices that support each phase, and the activities and tasks within it. The methodology provides practical guidance for managing an entire project life cycle by indicating the steps needed to execute development and implementation of big data ecosystems. The methodology consists of phases that in turn consist of activities that in turn consist of tasks, whereby each one must be completed before the next one can begin. Table 2 (see the online appendix) lists the phases and activities that constitute the methodology, along with the main dimensions that support execution of the activities and tasks. The support-tools dimension is not included in Table 2 because it is present or can be present in all tasks of the methodology, as different information technology tools are available to support each of them.
The methodology can be applied in waterfall mode, or sequentially, for each phase, activity, and task. It can also be applied iteratively, whereby the project is divided into subprojects executed in waterfall mode, with each subproject begun when the previous one has finished; for example, each subproject can cover an individual knowledge block or a tool.
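The two modes of application can be illustrated with a minimal Python sketch. It is not part of the BD-IRIS framework itself: the class names, the functions `run_waterfall` and `run_iterative`, and the placeholder plan returned by `example_phases` are all assumptions made for illustration, since the actual phases and activities appear only in Table 2 of the online appendix.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    done: bool = False


@dataclass
class Activity:
    name: str
    tasks: list[Task] = field(default_factory=list)


@dataclass
class Phase:
    name: str
    activities: list[Activity] = field(default_factory=list)


def example_phases() -> list[Phase]:
    """Hypothetical placeholder plan, not the phases of the real methodology."""
    return [
        Phase("Phase A", [Activity("Activity A1", [Task("Task A1.1"), Task("Task A1.2")])]),
        Phase("Phase B", [Activity("Activity B1", [Task("Task B1.1")])]),
    ]


def run_waterfall(phases: list[Phase]) -> list[str]:
    """Complete every task in strict sequence and return the execution log."""
    log = []
    for phase in phases:
        for activity in phase.activities:
            for task in activity.tasks:
                task.done = True  # a real project would perform the work here
                log.append(f"{phase.name} > {activity.name} > {task.name}")
    return log


def run_iterative(subprojects: dict[str, list[Phase]]) -> list[str]:
    """Run each subproject in waterfall mode, starting one when the previous ends."""
    log = []
    for label, phases in subprojects.items():
        log.extend(f"[{label}] {entry}" for entry in run_waterfall(phases))
    return log
```

In this sketch, a call such as `run_iterative({"block-1": example_phases(), "block-2": example_phases()})` would correspond to splitting a project into two knowledge blocks, each carried through the full sequence of phases before the next one begins.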
Data architecture dimension. This dimension identifies the proposed steps the software engineer performs during data analysis. The order in which each task is executed in each of the steps and its relationship with the other dimensions of the framework are specified in the methodology dimension. The data architecture dimension is divided into levels ranging from identifying the location and structure of the data to the display of the results requested by the organization. Figure 2 outlines the levels that make up the data architecture, including:
Content. Here, the location and characteristics of the data are identified (such as format and source of required data, both structured and unstructured). In addition, the software engineer performs a verification process to check that data location and characteristics are valid for the next level. Data can be generated offline, through the traditional ways of entering data (such as open data sources and relational databases in enterprise resource planning, customer relationship management systems, and other management information systems). In addition, data can also be obtained online through social media (such as LinkedIn, Facebook, Google+, and Twitter).
Acquisition. Here, filters and patterns are applied by software engineers to ensure only valuable data is collected. Traditional data sources are easier to link to because they consist of structured data. But social software poses a greater technological challenge, as it contains human information that is complex, unstructured, ubiquitous, multi-format, and multi-channel.
Enhancement. The main objectives here are to endow the collected data with value, identify and extract information, and discover otherwise unknown relationships and patterns. To add such value, various advanced data-analysis techniques are applied; they can be divided into two main groups: research and modeling. Valuable information is obtained as a result of applying these techniques to the collected data. Metadata is also generated, reducing the complexity and processing of queries or operations that must be performed while endowing the data with meaning. Data and metadata are stored in a database for future queries, processing, generation of new metadata, and/or training and validation of the models.
Inquiry. Here, the system can access the data and metadata stored in the system database generated at the enhancement level. The main mode of access is through queries, usually based on the Structured Query Language, that extract the required information as needed.
Visualization. This level addresses presentation and visualization of the results, as well as interpretation of the meaning of the discovered information. Due to the nature of big data and the large amount of data to be processed, clarity and precision are important in the presentation and visualization of the results.
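The five levels can be read as stages of a pipeline in which the output of one level feeds the next. The following Python sketch shows that flow under strong simplifying assumptions: the sample records, the function names, the `deliver` filter pattern, and the word-count metadata are hypothetical stand-ins (a real enhancement level would apply the advanced analysis techniques described above), and an in-memory SQLite database plays the role of the system database.

```python
import re
import sqlite3

# Hypothetical raw records standing in for data located at the content level.
RAW_RECORDS = [
    {"source": "crm", "text": "Order 1042 delivered late"},
    {"source": "social", "text": "Great service, fast delivery!"},
    {"source": "social", "text": "lol"},
]


def content(records):
    """Content: keep records whose source and text have been identified."""
    return [r for r in records if r.get("source") and r.get("text")]


def acquisition(records, pattern=r"deliver"):
    """Acquisition: apply a filter pattern so only valuable data is collected."""
    return [r for r in records if re.search(pattern, r["text"], re.IGNORECASE)]


def enhancement(records, conn):
    """Enhancement: derive simple metadata and store data plus metadata."""
    conn.execute("CREATE TABLE items (source TEXT, text TEXT, n_words INTEGER)")
    rows = [(r["source"], r["text"], len(r["text"].split())) for r in records]
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)", rows)


def inquiry(conn):
    """Inquiry: extract the required information with an SQL query."""
    query = "SELECT source, COUNT(*) FROM items GROUP BY source ORDER BY source"
    return conn.execute(query).fetchall()


def visualization(rows):
    """Visualization: present the results clearly (here, as plain text)."""
    return "\n".join(f"{source}: {count}" for source, count in rows)


def run_pipeline(records):
    """Chain the five levels from content to visualization."""
    conn = sqlite3.connect(":memory:")
    try:
        enhancement(acquisition(content(records)), conn)
        return visualization(inquiry(conn))
    finally:
        conn.close()
```

Keeping each level as a separate function mirrors the separation of concerns in the data architecture: the filter used at acquisition or the query used at inquiry can change without touching the other levels.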
Organizational dimension. This dimension relates to the characteristics and needs of the organization with respect to providing data, processing it, and making use of it. It is also related to all the decisions the organization has to make to adapt the system to its needs.
On the one hand, the organization's strategy must be analyzed, since big data projects must align with the organization's business strategy. If not aligned, the results obtained may not be as valuable as they could be for the organization's decision making. To achieve such alignment, the organization must determine the objectives the project is intended to achieve, as well as the organizational challenges involved and the project's target users, including customers, suppliers, and employees. It is also necessary to define the overall corporate transformation it is willing to make and the new business roles required to exploit big data technology. For example, a big data project could aim to use the knowledge extracted from customer data, products, and operations through the organization's processes to change its business model and create value, optimize business management, and identify new business opportunities. These projects are thus potentially able to increase customer acquisition and satisfaction, as well as increase loyalty and reduce the rate of customer abandonment. They can also improve business efficiency by, say, eliminating overproduction and reducing the launch time of new products or services. In addition, they can help negotiate better prices with suppliers and improve customer service. The project will thus be defined by the organization's business strategy. On the other hand, the resources offered and the knowledge acquired through big data technology allow optimization of existing business processes by improving them as much as possible.
To integrate enterprise strategy, business process, and human resources, the BD-IRIS framework uses the ARDIN (the Spanish acronym for Reference Architecture for INtegrated Development) enterprise reference architecture, allowing project managers to redefine the conceptual aspects of the enterprise (such as mission, vision, strategy, policies, and enterprise values), redesign and implement the new business process map, and reorganize and manage human resources in light of the new information and communication technologies (big data in this case) to improve them.6
In addition, models of the business processes must be developed so weak points and areas in need of improvement are detected. BD-IRIS uses several modeling languages:
I*. I* makes it possible for project engineers to gain a better understanding of organizational environments and business processes, understand the motivations, intentions, goals, and rationales of organizational management, and illustrate the various characteristics seen in the early phases of requirement specification.30
Business Process Model and Notation (BPMN). BPMN,20 designed to model an overall map of an enterprise's business processes, includes 11 graphical, or modeling, elements (the BPD core element set) classified into four categories: flow objects, connecting objects, "swimlanes," and artifacts. BPMN 2.0 extends BPMN.
Unified Modeling Language (UML). UML 2.0 is also used to model interactions among users and the technological platform in greater detail without ambiguity.19
In selecting these modeling languages, we took into account that they are intuitive, well-known by academics and practitioners alike, useful for process modeling and information-system modeling, and proven in real-world enterprise-scale settings.
Support-tools dimension. This dimension consists of information-technology tools that support all dimensions in the framework, facilitating execution of the tasks to be performed in each dimension. Each such task can be supported by tools with certain characteristics; for example, some tools support only certain tasks, and some tasks can be carried out with and without the help of tools.
The tools that can be used in each dimension, except for data architecture, are standard tools that can be used in any software-engineering project. Types of tools include business management, office, CASE, project management, indicator management, software testing, and quality management. The data architecture dimension requires specific tools for each of its levels; see Table 3 in the online appendix for examples of tools that can be used at each level in the data architecture dimension.
Several tools are able to perform the same tasks, and the choice of appropriate tool for each project depends on the scenario in which it is used. The table here lists criteria to help prompt the questions that project engineers must address when choosing the appropriate tools for the particular needs of each project.
Data sources dimension. Considering that the foundation of big data ecosystems is data, it is essential that such data is reliable and provides value. This dimension refers to the sources of the data processed in big data ecosystems. Big data technology is able to process both structured data (such as from relational databases, ERPs, CRMs, and open data) and semi-structured and unstructured data (such as from log files, machine-generated data, social media, transaction records, sensor data, and GPS signals). Objectives depend on the data that is available to the organization. To ensure optimal performance, the organization must define what data is of interest, identify its sources and formats, and perform, as needed, the pre-processing of raw data. Data is transformed into a format that is more readily "processable" by the system. Methods for preprocessing raw data include feature extraction (selecting the most significant specific data for certain contexts), transformation (modifying it to fit a particular type of input), sampling (selecting a representative subset from a large dataset), normalization (organizing it with the aim of allowing more efficient access to it), and "de-noising" (eliminating existing noise in it). Once such operations are performed, data is available to the system for processing.
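The preprocessing methods listed above can be sketched in a few lines of Python. The sensor readings, field names, and function names below are hypothetical, and each function is only one simple interpretation of the corresponding method; in particular, normalization is rendered here as indexing records by identifier, following the description of organizing data for more efficient access.

```python
import random

# Hypothetical sensor readings; "n/a" and None stand in for noise.
RAW = [
    {"id": 1, "temp": "21.5", "unit": "C", "note": "ok"},
    {"id": 2, "temp": "n/a", "unit": "C", "note": "sensor fault"},
    {"id": 3, "temp": "22.0", "unit": "C", "note": "ok"},
    {"id": 4, "temp": None, "unit": "C", "note": ""},
    {"id": 5, "temp": "19.5", "unit": "C", "note": "ok"},
]


def _is_number(value):
    """Return True when the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def denoise(records):
    """De-noising: drop readings whose value cannot be parsed."""
    return [r for r in records if _is_number(r["temp"])]


def transform(records):
    """Transformation: convert the text value into a numeric input."""
    return [{**r, "temp": float(r["temp"])} for r in records]


def extract_features(records, keys=("id", "temp")):
    """Feature extraction: keep only the fields relevant to the context."""
    return [{k: r[k] for k in keys} for r in records]


def sample(records, k, seed=0):
    """Sampling: draw a reproducible random subset of at most k records."""
    return random.Random(seed).sample(records, min(k, len(records)))


def index_by_id(records):
    """Normalization (as interpreted here): organize for efficient access."""
    return {r["id"]: r for r in records}


def preprocess(records, k=2):
    """Apply the five steps in one plausible order."""
    cleaned = extract_features(transform(denoise(records)))
    return index_by_id(sample(cleaned, k))
```

The order of the steps is a design choice rather than a rule: de-noising comes first in this sketch so that later steps never see unparsable values, and sampling comes late so the subset is drawn only from clean records.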
Data-quality dimension. The aim here is to ensure quality in the acquisition, transformation, manipulation, and analysis of data, as well as in the validity of the results. Quality is the consequence of multiple factors, including complexity (lack of simplicity and uniformity in the data), usability (how readily data can be processed and integrated with existing standards and systems), time (timeliness and frequency of data), accuracy (degree of accuracy describing the measured phenomenon), coherence (how the data meets standard conventions and is internally consistent, over time, with other data sources), linkability (how readily the data can be linked or joined with other data), validity (the data reflects what it is supposed to measure), accessibility (ease of access to information), clarity (availability of clear and unambiguous descriptions, together with the data), and relevance (the degree of fidelity of the results with regard to user needs, in terms of measured concepts and represented populations).29
The United Nations Economic Commission for Europe29 has identified the actions software engineers should perform to ensure quality in data input and output results, thereby minimizing the risk in each of the various factors; see Table 4 in the online appendix.
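Some of these factors lend themselves to simple quantitative indicators. The Python sketch below computes three of them (validity, linkability, and time) as the share of records passing a basic check. The records, thresholds, and function names are assumptions chosen for illustration; they are not the actions defined by the UNECE task team, which are listed in Table 4 of the online appendix.

```python
from datetime import date

# Hypothetical customer records with deliberately planted quality problems.
RECORDS = [
    {"customer_id": "C1", "age": 34, "updated": date(2018, 11, 2)},
    {"customer_id": "C2", "age": -5, "updated": date(2018, 12, 20)},
    {"customer_id": None, "age": 51, "updated": date(2016, 1, 15)},
]


def validity(records):
    """Share of records whose age lies in a plausible range (assumed 0 to 120)."""
    ok = sum(1 for r in records if 0 <= r["age"] <= 120)
    return ok / len(records)


def linkability(records):
    """Share of records carrying the key needed to join other sources."""
    ok = sum(1 for r in records if r["customer_id"])
    return ok / len(records)


def timeliness(records, today, max_age_days=365):
    """Share of records updated within the allowed window."""
    ok = sum(1 for r in records if (today - r["updated"]).days <= max_age_days)
    return ok / len(records)


def quality_report(records, today):
    """Collect the three indicators into one dictionary."""
    return {
        "validity": validity(records),
        "linkability": linkability(records),
        "timeliness": timeliness(records, today),
    }
```

Indicators of this kind could also feed the system monitoring and definition of indicators that the framework calls for, although which factors to measure and which thresholds to apply would depend on each project.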
Privacy/security dimension. Big data ecosystems usually deal with sensitive data and with knowledge obtained from data that may be scattered and lacking in value by itself. Due to such scattering, the customers and users who generate the data are often unaware of its value, disclosing it without reflection or compensation. Meanwhile, lack of awareness can lead to unexpected situations where the generated information is personally identifiable and metadata is more important than the data itself. Moreover, big data involves the real-time collection, storage, processing, and analysis of large amounts of data in different formats. Organizations that want to use big data must consider the risks, as well as their legal and ethical obligations, when processing and circulating it.
This dimension considers the privacy and security aspects of data management and communications, included in the same dimension because they are strongly related to each other, as explained in the online appendix.
Once the framework is developed, the next task is to validate and improve it, a process consisting of two phases: expert assessment and case studies. The aims are to validate the framework by verifying and confirming its usefulness, accuracy, and quality and improve the framework with the feedback obtained from the organizations involved and the conclusions drawn from the case studies. In such a case study, the framework is applied to a single organization. For example, we applied it to a Spanish "small and medium-size enterprise" from the metal fabrication market with 250 employees, using it to guide development and implementation of a social CRM system supported by a big data ecosystem.21 In another case study, we applied it to the Spanish division of a large oil and gas company, using it to guide development and implementation of a knowledge management system 2.0 as supported by a big data ecosystem;22 see the online appendix for results.
Big data helps companies increase their competitiveness by improving their business models or their business processes. Big data has emerged over the past five years in companies, forcing them to deal with multiple business, management, technological, processing, and human resources challenges. Seven big data frameworks have been proposed in the IT literature, as outlined here, to deal with them in a satisfactory way. A framework can be defined as a structure consisting of several dimensions that are fitted and joined together to support or enclose something, in this case development and implementation of a big data ecosystem.
Big data frameworks also have weaknesses. First, none includes a methodology, understood as a documented approach for performing all activities in a big data project life cycle in a coherent, consistent, accountable, repeatable manner. This lack of a methodology is a big handicap because big data is still a novel area, and only a limited supply of well-trained professionals know what steps to take, in what order to take them, and how they should be performed.13 It is thus difficult for IT professionals, even those well trained in big data projects, to successfully take on a project employing the existing frameworks. In addition, in large-scale big data projects employing multiple teams of people, decisions regarding procedures, technologies, methods, and techniques can produce a lack of consistency and poor monitoring procedures. Second, each of the six dimensions of the big data framework (data architecture, organization, sources, quality, support tools, and privacy/security) addresses a different aspect of a project. However, although existing frameworks consider several dimensions, none of the seven frameworks proposed in the IT literature considers all six dimensions. Using only one of these frameworks means some important questions are ignored. Third, the approaches in each dimension are not fitted and joined together and are sometimes too vague and general or do not cover all the activities of the whole project life cycle. For example, although proper integration of big data in a company is recognized as a key success factor in all big data projects,3 only two existing frameworks provide any guidance about the need to consider corporate management implications. Neither do they explain when and how to improve business strategy or when and how to carry out reengineering of a business process using big data. As a result, opportunities for improving business performance can be lost.
For this reason, the BD-IRIS framework needs to be structured in all seven dimensions. The main innovation is the BD-IRIS methodology dimension, along with the fact that it takes into account all the dimensions a big data framework should have within a single framework. The BD-IRIS methodology represents a guide to producing a big data ecosystem according to a process, covering the big data project life cycle and identifying when and how to use the approaches proposed in the other six dimensions. The utility of the framework and its completeness, level of detail, and accuracy of the relations among the methodology tasks and the approaches to other dimensions were validated in 2016 by five expert professionals from a Spanish consulting company with experience in big data projects, and by managers of the two organizations (not experts in big data projects) participating in our case studies. Lack of validation is a notable weakness of the existing frameworks.
This article has explored a framework for guiding development and implementation of big data ecosystems. We developed its initial design from the existing literature while providing additional knowledge. We then debugged, refined, improved, and validated this initial design through two methods (expert assessment and case studies) in a Spanish metal fabrication company and the Spanish division of an international oil and gas company. The results show the framework is considered valuable by corporate management where the case studies were applied.
The framework is useful for guiding organizations that wish to implement a big data ecosystem, as it includes a methodology that indicates in a clear and detailed way each activity and task that should be carried out in each of its phases. It also offers a comprehensive understanding of the system. Moreover, it provides control over a project and its scope, consequences, opportunities, and needs.
Although the framework has been validated through two different methods (expert evaluation and case studies), it also involves some notable limitations. For example, the methods we used for the analysis and validation in the two case studies are qualitative, not as precise as quantitative ones, and based on the perceptions of the people involved in the application of the framework in the case studies and the consultants who evaluated it. Moreover, the evaluation experts were chosen from the same consulting company to avoid potential bias. Finally, we applied the framework in two companies in two different industrial sectors but have not yet tested its validity in other types of organization.
Regarding the scope of future work, we are exploring four areas: applying and assessing the framework in companies from different industrial sectors; evaluating the ethical implications of big data systems; refining techniques for converting different input data formats into a common format to optimize the processing and analysis of data in big data systems; and finally, refining the automatic identification of people in different social networks, allowing companies to gather information entered by the same person in a given social network.
7. Chui, M., Manyika, J., and Bughin, J. Big data's potential for businesses. Financial Times (May 13, 2011); https://www.ft.com/content/64095dba-7cd5-11e0-994d-00144feabdc0
9. Debortoli, S., Müller, O., and Vom Brocke, J. Comparing business intelligence and big data skills: A text mining study using job advertisements. Business & Information Systems Engineering 6, 5 (Oct. 2014), 289–300.
10. Demchenko, Y., de Laat, C., and Membrey, P. Defining architecture components of the big data ecosystem. In Proceedings of the International Conference on Collaboration Technologies and Systems (Minneapolis, MN, May 19–23). IEEE Press, 2014, 104–112.
11. Elgendy, N. and Elragal, A. Big data analytics: A literature review. In Proceedings of the 14th Industrial Conference on Data Mining (St. Petersburg, Russia, July 16–20). Lecture Notes in Computer Science, 8557. Springer International Publishing, Switzerland, 2014, 214–227.
12. Ferguson, M. Architecting a Big Data Platform for Analytics. IBM White Paper, Oct. 2012; http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=IML14333USEN
13. Flouris, I., Giatrakos, N., Deligiannakis, A., Garofalakis, M., Kamp, M., and Mock, M. Issues in complex event processing: Status and prospects in the big data era. Journal of Systems and Software 127 (May 2017), 217–236.
15. Halper, F. and Krishnan, K. TDWI Big Data Maturity Model Guide. TDWI Research, Renton, WA, 2013; https://tdwi.org/whitepapers/2013/10/tdwi-big-data-maturity-model-guide.aspx
16. Hortonworks. Hortonworks Big Data Maturity Model, 2016; http://hortonworks.com/wp-content/uploads/2016/04/Hortonworks-Big-Data-Maturity-Assessment.pdf
19. Object Management Group. Unified Modeling Language. OMG, 2000; http://www.uml.org/
20. Object Management Group. Business Process Model and Notation. OMG, 2011; http://www.omg.org/spec/BPMN/2.0
22. Orenga-Roglá, S. and Chalmeta, R. Methodology for the implementation of knowledge management systems 2.0: A case study in an oil and gas company. Business & Information Systems Engineering (Dec. 2017), 1–19; https://doi.org/10.1007/s12599-017-0513-1
23. Pawlowski, J. and Bick, M. The global knowledge management framework: Towards a theory for knowledge management in globally distributed settings. Electronic Journal of Knowledge Management 10, 1 (Jan. 2012), 92–108.
27. Sun, H. and Heller, P. Oracle Information Architecture: An Architect's Guide to Big Data. Oracle White Paper, Aug. 2012; https://d2jt48ltdp5cjc.cloudfront.net/uploads/test1_3021.pdf
29. United Nations Economic Commission for Europe. A Suggested Framework for the Quality of Big Data. Deliverables of the UNECE Big Data Quality Task Team. UNECE, Dec. 2014; http://www.unece.org/unece/search?q=A+Suggested+Framework+for+the+Quality+of+Big+Data.+&op=Search
30. Yu, E. Why agent-oriented requirements engineering. In Proceedings of the Third International Workshop on Requirements Engineering: Foundation of Software Quality (Barcelona, Spain, June 16–17). Presses Universitaires de Namur, Namur, Belgium, 1997, 171–183.
©2019 ACM 0001-0782/19/01