Innovation Africa: Creating the Future Today
The problem of managing schemas

    November 9, 2014 Editor


    When a team first starts to consider using Hadoop for data storage and processing, one of the first questions that comes up is: which file format should we use?

    This is a reasonable question. HDFS, Hadoop’s data storage, is different from relational databases in that it does not impose any data format or schema. You can write any type of file to HDFS, and it’s up to you to process it later.

    The usual first choice of file format is either comma-delimited text, since it is easy to dump from many databases, or JSON, which is often used for event data or data arriving from a REST API.

    There are many benefits to this approach — text files are readable by humans and therefore easy to debug and troubleshoot. In addition, it is very easy to generate them from existing data sources and all applications in the Hadoop ecosystem will be able to process them.

    But there are also significant drawbacks to this approach, and often these drawbacks only become apparent over time, when it can be challenging to modify the file formats across the entire system.

    Part of the problem is performance — text formats have to be parsed every time they are processed. Data is typically written once but processed many times; text formats add a significant overhead to every data query or analysis.

    But the worst problem by far is that CSV and JSON data has a schema, but the schema isn’t stored with the data. For example, CSV files have columns, and those columns have meaning. They represent IDs, names, phone numbers, etc. Each of these columns also has a data type: they can represent integers, strings, or dates. There are also some constraints involved: you can dictate that some of those columns contain unique values or that others will never contain nulls. All this information exists in the heads of the people managing the data, but it doesn’t exist in the data itself.
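
    To make the missing schema concrete, here is a minimal sketch using Python's standard csv module. The column names and values are invented for illustration; the point is only that every value comes back as an untyped string.

```python
import csv
import io

# Hypothetical CSV dump; the column names and values are made up for illustration.
CSV_TEXT = "id,name,phone,signup_date\n42,Ada,5551234,2014-11-09\n"


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(CSV_TEXT)
first = rows[0]

# Every value is a plain string: nothing in the file says that "id" is an
# integer, that "signup_date" is a date, or that "id" must be unique.
print({column: type(value).__name__ for column, value in first.items()})
```

    Any script that wants an integer ID or a real date has to apply that knowledge itself, which is exactly the hidden schema described above.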

    The people who work with the data don’t just know about the schema; they need to use this knowledge when processing and analyzing the data. So the schema we never admitted to having is now coded in Python and Pig, Java and R, and every other application or script written to access the data.

    And eventually, the schema changes. Someone refactors the code generating the JSON and moves fields around, perhaps renaming a few fields. The DBA adds new columns to a MySQL table, and this is reflected in the CSVs dumped from the table. Now all those applications and scripts must be modified to handle both file formats. And since schema changes happen frequently, and often without warning, the result is ugly, unmaintainable code and grumpy developers who are tired of having to modify their scripts again and again.
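
    The kind of breakage described here can be sketched in a few lines. The two dumps below are hypothetical: the second one assumes an "email" column was added in front of the phone number, and the script reads columns by position, as quick scripts often do.

```python
import csv
import io

# Hypothetical dumps of the same table before and after a column was added.
OLD_DUMP = "42,Ada,5551234\n"
NEW_DUMP = "42,Ada,ada@example.com,5551234\n"  # "email" inserted before "phone"


def phone_numbers(text):
    """Read the phone number by position (third column)."""
    return [row[2] for row in csv.reader(io.StringIO(text))]


print(phone_numbers(OLD_DUMP))  # the phone number, as intended
print(phone_numbers(NEW_DUMP))  # the email address, with no error raised
```

    Nothing fails loudly; the script simply starts returning the wrong column, which is why such changes tend to be noticed late.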

    There is a better way of doing things.

    Apache Avro is a data serialization project that provides schemas with rich data structures, compressible file formats, and simple integration with many programming languages. The integration even supports code generation — using the schema to automatically generate classes that can read and write Avro data.
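
    Avro schemas are written in JSON. As a rough illustration, the snippet below loads a small record schema with Python's standard json module; the record name, namespace, and fields are invented for this example and do not come from any real system.

```python
import json

# Hypothetical record schema; "User", "example.avro", "id", "name" and
# "phone" are names invented for this illustration.
USER_SCHEMA_JSON = """
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "phone", "type": ["null", "string"], "default": null}
  ]
}
"""

schema = json.loads(USER_SCHEMA_JSON)

# Unlike a CSV header, the schema states each field's name and type, and
# the union with "null" marks "phone" as optional.
for field in schema["fields"]:
    print(field["name"], field["type"])
```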

    Avro data files store the schema alongside the data, so programs don’t need to know the schema in advance in order to process the data. Humans who encounter the file can also easily extract the schema and better understand the data they have.
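
    The sketch below illustrates, with the standard library only, how a schema can travel in a file header. It follows the object container file layout as described in the Avro specification (magic bytes, a metadata map with an "avro.schema" entry, a 16-byte sync marker), but it writes no data blocks and is not a substitute for a real Avro library. The function names and the example schema are assumptions made for this illustration.

```python
import io
import json
import os

MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def write_long(out, n):
    """Write an integer as an Avro long: zig-zag, then variable-length bytes."""
    n = (n << 1) ^ (n >> 63)
    while n & ~0x7F:
        out.write(bytes([(n & 0x7F) | 0x80]))
        n >>= 7
    out.write(bytes([n]))


def read_long(src):
    """Read an Avro long (variable-length, zig-zag encoded)."""
    shift = result = 0
    while True:
        byte = src.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (result >> 1) ^ -(result & 1)
        shift += 7


def write_header(out, schema):
    """Write magic bytes, a metadata map holding the schema, and a sync marker."""
    meta = {"avro.schema": json.dumps(schema).encode("utf-8"),
            "avro.codec": b"null"}
    out.write(MAGIC)
    write_long(out, len(meta))           # one map block with len(meta) entries
    for key, value in meta.items():
        key_bytes = key.encode("utf-8")
        write_long(out, len(key_bytes))
        out.write(key_bytes)
        write_long(out, len(value))
        out.write(value)
    write_long(out, 0)                   # a zero count ends the map
    out.write(os.urandom(16))            # 16-byte sync marker


def read_schema(src):
    """Recover the writer's schema from the header alone."""
    if src.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(src)
        if count == 0:
            break
        if count < 0:                    # negative count: a byte size follows
            count = -count
            read_long(src)
        for _ in range(count):
            key = src.read(read_long(src)).decode("utf-8")
            meta[key] = src.read(read_long(src))
    return json.loads(meta["avro.schema"])


# Hypothetical schema, round-tripped through an in-memory "file".
schema = {"type": "record", "name": "User",
          "fields": [{"name": "id", "type": "long"}]}
buffer = io.BytesIO()
write_header(buffer, schema)
buffer.seek(0)
print(read_schema(buffer))
```

    The reader needs no prior knowledge of the record layout: everything it learns about the fields comes from the header it has just read.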

    When the schema inevitably changes, Avro uses schema evolution rules to make it easy to interact with files written using both older and newer versions of the schema — default values get substituted for missing fields, unexpected fields are ignored until they are needed, and data processing can proceed uninterrupted through upgrades. When starting a data analysis project, most developers don’t think about how they’ll be able to handle gradual application upgrades through a large organization. The ability to independently upgrade the applications that are writing the data and the applications reading the data makes development and deployment significantly easier.
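
    The two rules mentioned above (defaults for missing fields, unknown fields ignored) can be illustrated with plain dictionaries. This is a deliberately simplified sketch, not Avro's actual resolution code: it ignores type promotion, aliases, unions, and nested records, and the schema and records are hypothetical.

```python
# Hypothetical reader schema: "email" was added in a newer version with a default.
READER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}


def resolve(record, reader_schema):
    """Project a decoded record onto the reader's schema.

    Simplified illustration of two resolution rules only.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]  # missing field: use the default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved  # fields the reader does not declare are dropped


# Written before "email" existed.
old_record = {"id": 42, "name": "Ada"}
# Written by a producer that already emits an extra "fax" field.
newer_record = {"id": 7, "name": "Grace", "email": None, "fax": "555-0100"}

print(resolve(old_record, READER_SCHEMA))
print(resolve(newer_record, READER_SCHEMA))
```

    Because the reader tolerates both older and newer records, the writing and reading applications can be upgraded on separate schedules.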

    The problem of managing schemas across diverse teams in a large organization was mostly solved when a single relational database contained all the data and enforced the schema on all users. These days, data is not nearly as unified — it moves between many different data stores, structured, unstructured or semi-structured. Avro is a very versatile and convenient way of bringing order to chaos. Avro formatted data can be stored in files, in unstructured stores like HBase or Cassandra, and can be sent through messaging systems like Kafka. All the while, applications can use the same schemas to read the data, process it, and analyze it — regardless of where and how it is stored.

    Decisions made early in a project can come back to bite later. Hadoop offers a rich ecosystem of tools and solutions to choose from, making the decision process more challenging than it was back when data was always stored and processed in relational databases. File formats are no exception: there are probably 10 different file types supported throughout the Hadoop ecosystem. Some of the formats are easy for beginners to use; some offer special performance optimizations for specific use cases. But for general-purpose data storage and processing, I always tell my customers: just use Avro.

    Gwen Shapira will talk more about architectural considerations for Hadoop applications at Strata + Hadoop World Barcelona. For more information and to register, visit the Strata + Hadoop World website.

    Cropped image on article and category pages by foam on Flickr, used under a Creative Commons license.

    This post is part of our on-going investigation into the evolving, maturing marketplace of big data components.

    Related:

    • Hadoop Application Architectures — early release book by authors Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira.



