A Basic Understanding of Data Lakes
A data lake is exactly what the name implies: a large repository or “pool” where data is stored. The difference between a data lake and the more familiar data warehouse is the form of the data being stored. Generally, data in a warehouse is stored in an enriched or appended state, ready for review and use. A data lake is specifically for storing raw data that has yet to be changed or analyzed. In short, a data lake is for data in its native format and a data warehouse is usually for data in a usable state.
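To make the distinction concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the sample export, the column names, and the two helper functions are hypothetical and do not correspond to any particular product. The lake path keeps the payload exactly as it arrived, while the warehouse path cleans and types it first.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw export: stray spaces, a currency suffix, and a missing value.
RAW_EXPORT = "id,amount,date\n 17 ,12.50 USD,03/01/2016\n18,n/a,03/02/2016\n"


def store_in_lake(payload: str) -> str:
    """Lake side: keep the data in its native format, unchanged."""
    return payload


def load_for_warehouse(payload: str) -> list:
    """Warehouse side: parse, clean, and type each row so it is ready to use."""
    rows = []
    for rec in csv.DictReader(io.StringIO(payload)):
        amount_text = rec["amount"].replace("USD", "").strip()
        try:
            amount = float(amount_text)
        except ValueError:
            continue  # rows that cannot be made usable are left out
        rows.append({
            "id": int(rec["id"].strip()),
            "amount": amount,
            "date": datetime.strptime(rec["date"].strip(), "%m/%d/%Y").date(),
        })
    return rows


lake_copy = store_in_lake(RAW_EXPORT)
warehouse_rows = load_for_warehouse(RAW_EXPORT)
```

In this sketch the lake copy still contains the messy second row, while the warehouse rows contain only records that could be cleaned and typed.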
Over time, data lakes have evolved to include a Hadoop Distributed File System (HDFS) structure. Per TechTarget, this structure involves “a data management platform comprising one or more Hadoop clusters used principally to process and store non-relational data.” In short, Hadoop is a way to store and process non-relational data across one or more clusters.
The term “data lake” has evolved since it was coined by James Dixon, Pentaho’s chief technology officer and founder. Dixon explains, “when people ask what a data lake is, I tell them it’s what you used to have on tape. Take what you have on tape and pour it into a data lake and start exploring that data.”
The original idea of a data lake was a safe repository in which to dump data and explore what you had. Basically, a data lake was much like a pirate’s secret stash: a place to dump, sort, remove, and keep what was necessary. While the Hadoop platform has allowed data lakes to take on more diverse uses, Dixon notes, “our story was always only put into Hadoop what you need to; if you want to combine information from the data lake with information in your CRM system, well just do a join, do that blending of data only when you need to.”
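The "just do a join" idea in Dixon's quote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw event lines, the CRM records, the field names, and the `join_events_with_crm` helper are all hypothetical, and a real deployment would more likely run the join in a query engine than in a script. The point is that the blending happens at the moment it is needed, and the raw data in the lake is never modified.

```python
import json

# Hypothetical raw events as they might sit in a lake: one unmodified JSON string per line.
RAW_LAKE_LINES = [
    '{"customer_id": "c1", "event": "page_view"}',
    '{"customer_id": "c2", "event": "purchase"}',
    '{"customer_id": "c1", "event": "purchase"}',
    '{"customer_id": "c9", "event": "page_view"}',
]

# Hypothetical CRM records, keyed by customer id.
CRM = {
    "c1": {"name": "Acme Corp", "segment": "enterprise"},
    "c2": {"name": "Bolt Ltd", "segment": "smb"},
}


def join_events_with_crm(raw_lines, crm):
    """Inner-join raw lake events with CRM records, only when asked."""
    joined = []
    for line in raw_lines:
        event = json.loads(line)
        customer = crm.get(event["customer_id"])
        if customer is None:
            continue  # no CRM match; the raw event simply stays in the lake
        joined.append({**event, **customer})
    return joined


blended = join_events_with_crm(RAW_LAKE_LINES, CRM)
```

Here the event for the unknown customer `c9` is left out of the blended result but remains in the raw lines, so it is still available for later exploration.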
Returning to Exploration
While the definition has evolved, the platform and original uses of data lakes could not keep pace. Instead of implementing data management systems, businesses were attempting to mold data lakes into organizational structures. The reality is that data in its raw form needs professional analysis before it becomes useful. Hiring a data scientist or buying data analysis software is among the best options for enriching raw data.
Yet it’s important to remember that data lakes are excellent tools for pure exploration of data. To get the best experience with a data lake, it’s imperative to understand and implement best practices.
Per Enterprise Apps Today, the best practices for data lakes include:
- Becoming familiar with data lake use cases
- Establishing data management practices
- Researching the best architecture
- Being aware of metadata
While it all starts with the data lake and the raw data within its repository, it’s up to the specific organization or business to make that data lake work for them. Per Nick Heudecker with Gartner’s IT Leaders Data and Analytics group, “I think once you get beyond that discovery phase, you need to do more.”