Even the largest computers struggle with complex problems that have a lot of variables and large data sets. Imagine if one person had to sort through 26,000 boxes of large balls containing sets of 1,000 balls each with one letter of the alphabet: the task would take days. But if you separated the contents o the 1,000 unit boxes into 10 smaller equal boxes and asked 10 separate people to work on these smaller tasks, the job would be completed 10 times faster. This notion of parallel processing is one of the cornerstones of many Big Data projects.
Apache Hadoop (named after the creator Doug Cutting’s child’s toy elephant) is a free programming framework that supports the processing of large data sets in a distributed computing environment. Hadoop is part of the Apache project sponsored by the Apache Software Foundation and although it originally used Java, any programming language can be used to implement many parts of the system.
Hadoop was inspired by Google’s Map-Reduce, a software framework in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any computer connected in an organised group called a cluster. Hadoop makes it possible to run applications on thousands of individual computers involving thousands of terabytes of data. Its distributed file system facilitates rapid data transfer rates among nodes and enables the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of computers become inoperative.