
Big Data

What is big data?

Originally from Wikipedia (http://en.wikipedia.org/wiki/Big_data); I have only reorganized the material.

Big data is a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.[1] Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set.

In a 2001 research report[2] and related lectures, Gartner analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this “3Vs” model for describing big data.[3] In 2012, Gartner updated its definition as follows: “Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”[4] Additionally, some organizations add a fourth V, “Veracity”, to describe it.[5]

Although Gartner’s definition (the 3Vs) is still widely used, the growing maturity of the concept supports a sharper distinction between big data and Business Intelligence with regard to data and their use:[6]

  • Business Intelligence applies descriptive statistics to data with high information density in order to measure things, detect trends, etc.;
  • Big data uses inductive statistics and concepts from nonlinear system identification[7] to infer laws (regressions, nonlinear relationships, and causal effects) from large data sets,[8] to reveal relationships and dependencies, and to predict outcomes and behaviors.[9]
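The contrast between the two bullets above can be made concrete with a minimal sketch. It is not taken from the cited sources: the data points, the function names (`describe`, `fit_line`, `predict`) and the choice of a straight-line model are all assumptions made for illustration. The first function only summarizes what was observed (descriptive); the second infers a law from the observations and uses it to predict an unseen case (inductive).

```python
from statistics import mean, stdev

# Hypothetical toy observations (x, y), invented for illustration only.
observations = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]


def describe(values):
    """Descriptive statistics: summarize the values that were observed."""
    return {"mean": mean(values), "stdev": stdev(values)}


def fit_line(points):
    """Inductive statistics: infer a law y = a + b*x by ordinary least squares."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = y_bar - b * x_bar      # intercept
    return a, b


def predict(model, x):
    """Apply the inferred law to a value of x that was never observed."""
    a, b = model
    return a + b * x


summary = describe([y for _, y in observations])
model = fit_line(observations)
forecast = predict(model, 6.0)
print(summary, model, forecast)
```

In practice the "laws" inferred from large data sets are rarely a single straight line; the linear fit here merely stands in for the regressions and nonlinear relationships mentioned above.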

Big data can also be defined as “a large volume of unstructured data which cannot be handled by standard database management systems like DBMS, RDBMS or ORDBMS”.

Market

Big data has increased the demand for information management specialists: Software AG, Oracle Corporation, IBM, FICO, Microsoft, SAP, EMC, HP and Dell have spent more than $15 billion on software firms specializing only in data management and analytics. In 2010, this industry on its own was worth more than $100 billion and was growing at almost 10 percent a year, about twice as fast as the software business as a whole.[10]

Developed economies make increasing use of data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people access the internet.[10] Between 1990 and 2005, more than 1 billion people worldwide entered the middle class; as incomes rise, more people become literate, which in turn leads to information growth. The world’s effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000 and 65 exabytes in 2007,[11] and it is predicted that the amount of traffic flowing over the internet will reach 667 exabytes annually by 2014.[10] It is estimated that one third of the globally stored information is in the form of alphanumeric text and still image data,[12] which is the format most useful for most big data applications. This also shows the potential of as yet unused data (i.e. in the form of video and audio content).
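To get a feel for how fast the telecommunication capacity figures quoted above grew, the implied compound annual growth rate can be worked out directly. The sketch below is only an illustration: it assumes decimal units (1 exabyte = 1,000 petabytes), and the names `capacity_pb`, `cagr` and `growth_between` are made up for this example.

```python
# Capacity figures quoted in the text, converted to petabytes.
# Assumption: decimal units, i.e. 1 exabyte = 1,000 petabytes.
capacity_pb = {1986: 281, 1993: 471, 2000: 2_200, 2007: 65_000}


def cagr(start_value, end_value, years):
    """Compound annual growth rate between two observations."""
    return (end_value / start_value) ** (1 / years) - 1


def growth_between(series, start_year, end_year):
    """CAGR between two years that are both present in the series."""
    return cagr(series[start_year], series[end_year], end_year - start_year)


overall = growth_between(capacity_pb, 1986, 2007)

for start, end in [(1986, 1993), (1993, 2000), (2000, 2007)]:
    rate = growth_between(capacity_pb, start, end)
    print(f"{start}-{end}: {rate:.1%} per year")
print(f"1986-2007: {overall:.1%} per year")
```

Over the whole 1986 to 2007 span this works out to a little under 30 percent per year, with most of the acceleration falling in the final seven-year interval.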

While many vendors offer off-the-shelf solutions for Big Data, experts recommend developing in-house solutions custom-tailored to the company’s problem at hand, provided the company has sufficient technical capabilities.[13]

Technologies

Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. A 2011 McKinsey report[14] suggests suitable technologies include A/B testing, crowdsourcing, data fusion and integration, genetic algorithms, machine learning, natural language processing, signal processing, simulation, time series analysis and visualization. Multidimensional big data can also be represented as tensors, which can be more efficiently handled by tensor-based computation,[15] such as multilinear subspace learning.[16] Additional technologies being applied to big data include massively parallel-processing (MPP) databases, search-based applications, data-mining grids, distributed file systems, distributed databases, cloud-based infrastructure (applications, storage and computing resources) and the Internet.
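The common idea behind several of the technologies listed above (MPP databases, distributed databases, data-mining grids) is to split the data into partitions, let each node work on its own partition, and then combine the partial results. The following is a rough single-machine sketch of that pattern, not a description of any particular product: threads stand in for nodes, and the names `partition`, `partial_sum_count` and `parallel_mean` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split records into n_parts shards by round-robin assignment."""
    return [records[i::n_parts] for i in range(n_parts)]


def partial_sum_count(shard):
    """Work done independently on each 'node': a local sum and a local count."""
    return sum(shard), len(shard)


def parallel_mean(records, n_parts=4):
    """Compute a global mean by combining per-shard partial aggregates."""
    shards = partition(records, n_parts)
    # Threads are only a stand-in here; a real system would ship each
    # shard to a separate machine with its own local storage.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(partial_sum_count, shards))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


values = list(range(1, 1001))
result = parallel_mean(values)
print(result)
```

The mean is computed from per-shard sums and counts rather than per-shard means, because sums and counts can be combined exactly regardless of how unevenly the shards are sized.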

Some but not all MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS.[17]

DARPA’s Topological Data Analysis program seeks the fundamental structure of massive data sets, and in 2008 the technology went public with the launch of a company called Ayasdi.[18]

Practitioners of big data analytics are generally hostile to slower shared storage,[19] preferring direct-attached storage (DAS) in its various forms, from solid state drives (SSD) to high-capacity SATA disks buried inside parallel processing nodes. Shared storage architectures, namely storage area networks (SAN) and network-attached storage (NAS), are perceived as relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems, which thrive on system performance, commodity infrastructure, and low cost.

Real or near-real time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in memory is good; data on spinning disk at the other end of an FC SAN connection is not. The cost of a SAN at the scale needed for analytics applications is much higher than that of other storage techniques.

There are advantages as well as disadvantages to shared storage in big data analytics, but big data analytics practitioners as of 2011 did not favor it.[20]

  1. Matzat, U., & Reips, U.-D. (2012). ‘Big Data’: Big gaps of knowledge in the field of Internet. International Journal of Internet Science, 7, 1-5
  2. Laney, Douglas. “3D Data Management: Controlling Data Volume, Velocity and Variety”. Gartner. Retrieved 6 February 2001
  3. Beyer, Mark. “Gartner Says Solving ‘Big Data’ Challenge Involves More Than Just Managing Volumes of Data”. Gartner. Archived from the original on 10 July 2011. Retrieved 13 July 2011
  4. Laney, Douglas. “The Importance of ‘Big Data’: A Definition”. Gartner. Retrieved 21 June 2012
  5. “What is Big Data?”. Villanova University
  6. http://www.bigdataparis.com/presentation/mercredi/PDelort.pdf?PHPSESSID=tv7k70pcr3egpi2r6fi3qbjtj6#page=4
  7. Billings S.A. “Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and Spatio-Temporal Domains”. Wiley, 2013
  8. Delort P., Big data Paris 2013 http://www.andsi.fr/tag/dsi-big-data/
  9. Delort P., Big Data car Low-Density Data ? La faible densité en information comme facteur discriminant http://lecercle.lesechos.fr/entrepreneur/tendances-innovation/221169222/big-data-low-density-data-faible-densite-information-com
  10. “Data, data everywhere”. The Economist. 25 February 2010. Retrieved 9 December 2012
  11. Hilbert & López 2011
  12. Hilbert, Martin (2014). “What is the content of the World’s Technologically Mediated Information and Communication Capacity: How Much Text, Image, Audio, and Video?”. The Information Society. Free access to the article: martinhilbert.net/WhatsTheContent_Hilbert.pdf
  13. Rajpurohit, Anmol (2014-07-11). “Interview: Amy Gershkoff, Director of Customer Analytics & Insights, eBay on How to Design Custom In-House BI Tools”. KDnuggets. Retrieved 2014-07-14. “Dr. Amy Gershkoff: “Generally, I find that off-the-shelf business intelligence tools do not meet the needs of clients who want to derive custom insights from their data. Therefore, for medium-to-large organizations with access to strong technical talent, I usually recommend building custom, in-house solutions””
  14. Manyika, James; Chui, Michael; Bughin, Jacques; Brown, Brad; Dobbs, Richard; Roxburgh, Charles; Byers, Angela Hung (May 2011). Big Data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute
  15. “Future Directions in Tensor-Based Computation and Modeling”. May 2009
  16. Lu, Haiping; Plataniotis, K.N.; Venetsanopoulos, A.N. (2011). “A Survey of Multilinear Subspace Learning for Tensor Data”. Pattern Recognition 44 (7): 1540–1551. doi:10.1016/j.patcog.2011.01.004
  17. Monash, Curt (30 April 2009). “eBay’s two enormous data warehouses”
  18. “Resources on how Topological Data Analysis is used to analyze big data”. Ayasdi
  19. CNET News (April 1, 2011). “Storage area networks need not apply”
  20. “How New Analytic Systems will Impact Storage”. September 2011

 

 
