
Database Analysis

Introduction, database analysis life cycle, three-level database model, relationships, degree of a relationship, replacing ternary relationships, cardinality, optionality, entity sets, confirming correctness, deriving the relationship parameters, redundant relationships, redundant relationships example, splitting n:m relationships, splitting n:m relationships - example, constructing an ER model.

This unit is concerned with the process of taking a database specification from a customer and implementing the underlying database structure necessary to support that specification.

Data analysis is concerned with the NATURE and USE of data. It involves the identification of the data elements which are needed to support the data processing system of the organization, the placing of these elements into logical groups and the definition of the relationships between the resulting groups.

Other approaches, e.g. DFDs and flowcharts, have been concerned with the flow of data (dataflow methodologies). Data analysis is one of several data-structure-based methodologies; Jackson SP/D is another.

Systems analysts often, in practice, go directly from fact finding to implementation-dependent data analysis. Their assumptions about the usage of, properties of, and relationships between data elements are embodied directly in record and file designs and computer procedure specifications. The introduction of Database Management Systems (DBMS) has encouraged a higher level of analysis, where the data elements are defined by a logical model or `schema' (conceptual schema). When discussing the schema in the context of a DBMS, the effects of alternative designs on the efficiency or ease of implementation are considered, i.e. the analysis is still somewhat implementation dependent. If we consider the data relationships, usages and properties that are important to the business without regard to their representation in a particular computerised system using particular software, we have what we are concerned with here: implementation-independent data analysis.

It is fair to ask why data analysis should be done if it is possible, in practice, to go straight to a computerised system design. Data analysis is time consuming; it throws up a lot of questions. Implementation may be slowed down while the answers are sought. It is more expedient to have an experienced analyst `get on with the job' and come up with a design straight away. The main difference is that data analysis is more likely to result in a design which meets both present and future requirements, being more easily adapted to changes in the business or in the computing equipment. It can also be argued that it tends to ensure that policy questions concerning the organisation's data are answered by the managers of the organisation, not by the systems analysts. Data analysis may be thought of as the `slow and careful' approach, whereas omitting this step is `quick and dirty'.

From another viewpoint, data analysis provides useful insights into general design principles which will benefit the trainee analyst even if he finally settles for a `quick and dirty' solution.

The development of techniques of data analysis has helped to understand the structure and meaning of data in organisations. Data analysis techniques can be used as the first step of extrapolating the complexities of the real world into a model that can be held on a computer and be accessed by many users. The data can be gathered by conventional methods such as interviewing people in the organisation and studying documents. The facts can be represented as objects of interest. There are a number of documentation tools available for data analysis, such as entity-relationship diagrams. These are useful aids to communication, help to ensure that the work is carried out in a thorough manner, and ease the mapping processes that follow data analysis. Some of the documents can be used as source documents for the data dictionary.

In data analysis we analyse the data and build a systems representation in the form of a data model (conceptual). A conceptual data model specifies the structure of the data and the processes which use that data.

Data Analysis = establishing the nature of data.

Functional Analysis = establishing the use of data.

However, since Data and Functional Analysis are so intermixed, we shall use the term Data Analysis to cover both.

Building a model of an organisation is not easy. The whole organisation is too large, as there will be too many things to be modelled. It takes too long and does not achieve anything concrete like an information system, and managers want tangible results fairly quickly. It is therefore the task of the data analyst to model a particular view of the organisation, one which proves reasonable and accurate for most applications and uses. Data has an intrinsic structure of its own, independent of processing, report formats, etc. The data model seeks to make that structure explicit.

Data analysis was described as establishing the nature and use of data.

When a database designer is approaching the problem of constructing a database system, the logical steps to follow are those of the database analysis life cycle:

Often referred to as the three-level model, this is where the design moves from a written specification taken from the real-world requirements to a physically-implementable design for a specific DBMS. The three levels commonly referred to are `Conceptual Design', `Data Model Mapping', and `Physical Design'.

The specification is usually in the form of a written document containing customer requirements, mock reports, screen drawings and the like, written by the client to indicate the requirements which the final system is to have. Often such data has to be collected together from a variety of sources internal to the company and then analysed to see if the requirements are necessary, correct, and efficient.

Once the Database requirements have been collated, the Conceptual Design phase takes the requirements and produces a high-level data model of the database structure. In this module, we use ER modelling to represent high-level data models, but there are other techniques. This model is independent of the final DBMS which the database will be installed in.

Next, in the Data Model Mapping phase, the high-level data model is converted into a conceptual schema, which is specific to a particular DBMS class (e.g. relational). For a relational system, such as Oracle, an appropriate conceptual schema would be a set of relations.

Finally, in the Physical Design phase the conceptual schema is converted into database internal structures. This is specific to a particular DBMS product.

Entity Relationship (ER) modelling

When ternary relationships occur in an ER model, they should always be removed before finishing the model. Sometimes the relationships can be replaced by a series of binary relationships that link pairs of the original ternary relationship.

Relationships are usually verbs, so name the new entity type by the relationship verb rewritten as a noun.

A relationship can be optional or mandatory.

Sometimes it is useful to try out various examples of entities from an ER model. One reason for this is to confirm the correct cardinality and optionality of a relationship. We use an `entity set diagram' to show entity examples graphically. Consider the example of `course is_studied_by student'.

To check we have the correct parameters (sometimes also known as the degree) of a relationship, ask two questions:

Some ER diagrams end up with a relationship loop.

A many to many relationship in an ER model is not necessarily incorrect. It can be replaced using an intermediate entity. This should only be done where:

Consider the case of a car hire company. Customers hire cars: one customer hires many cars, and a car is hired by many customers.

The many to many relationship can be broken down to reveal a `hire' entity, which contains an attribute `date of hire'.
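
To make this concrete, here is a minimal sketch of how the resulting tables might be declared once the model is mapped to relations, using Python's built-in sqlite3 module. The table and column names are assumptions for illustration, not part of the original specification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the sketch

# The original n:m relationship "customer hires car" is split by the
# intermediate 'hire' entity, which carries the 'date of hire' attribute.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);

CREATE TABLE car (
    registration TEXT PRIMARY KEY,
    model        TEXT
);

CREATE TABLE hire (
    customer_id  INTEGER NOT NULL REFERENCES customer(customer_id),
    registration TEXT    NOT NULL REFERENCES car(registration),
    date_of_hire TEXT    NOT NULL,
    PRIMARY KEY (customer_id, registration, date_of_hire)
);
""")
```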

Before beginning to draw the ER model, read the requirements specification carefully. Document any assumptions you need to make.

ER modelling is an iterative process, so draw several versions, refining each one until you are happy with it. Note that there is no one right answer to the problem, but some solutions are better than others!

in-database analytics


In-database analytics is a technology that allows data processing to be conducted within the database by building analytic logic into the database itself. Doing so eliminates the time and effort required to transform data and move it back and forth between a database and a separate analytics application.
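
As a toy illustration of the principle (with SQLite standing in for an analytic database, and an assumed sales table holding region and amount columns), the difference between computing inside the database and shipping rows out to the application looks like this:

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect("warehouse.db")  # stand-in for an analytic database / EDW

# In-database approach: the aggregation runs where the data lives, and only
# the small result set leaves the database.
avg_by_region = conn.execute(
    "SELECT region, AVG(amount) FROM sales GROUP BY region"
).fetchall()

# The data movement that in-database analytics avoids: pull every row into
# the application and aggregate there.
totals, counts = defaultdict(float), defaultdict(int)
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals[region] += amount
    counts[region] += 1
avg_in_app = {r: totals[r] / counts[r] for r in totals}
```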

An in-database analytics system consists of an enterprise data warehouse (EDW) built on an analytic database platform. Such platforms provide parallel processing, partitioning, scalability and optimization features geared toward analytic functionality.

In-database analytics allows analytical data marts to be consolidated in the enterprise data warehouse. Data retrieval and analysis are much faster and corporate information is more secure because it doesn’t leave the EDW. This approach is useful for helping companies make better predictions about future business risks and opportunities, identify trends, and spot anomalies to make informed decisions more efficiently and affordably.

Companies use in-database analytics for applications requiring intensive processing – for example, fraud detection, credit scoring, risk management, trend and pattern recognition, and balanced scorecard analysis. In-database analytics also facilitates ad hoc analysis, allowing business users to create reports that do not already exist or drill deeper into a static report to get details about accounts, transactions, or records.


What is Data Analysis? Methods, Process and Types Explained


Businesses today need every edge and advantage they can get. Thanks to obstacles like rapidly changing markets, economic uncertainty, shifting political landscapes, finicky consumer attitudes, and even global pandemics, businesses are working with slimmer margins for error.

Companies that want to stay in business and thrive can improve their odds of success by making smart choices while answering the question: “What is data analysis?” And how does an individual or organization make these choices? They collect as much useful, actionable information as possible and then use it to make better-informed decisions!

This strategy is common sense, and it applies to personal life as well as business. No one makes important decisions without first finding out what’s at stake, the pros and cons, and the possible outcomes. Similarly, no company that wants to succeed should make decisions based on bad data. Organizations need information; they need data. This is where data analysis or data analytics enters the picture.

Understanding data is one of the fastest-growing professions today, in an age where data is considered the 'new oil' of the market. Our Data Analytics Program can help you learn how to make sense of data and extract trends from it.

Now, before getting into the details about the data analysis methods, let us first answer the question, what is data analysis? 


What Is Data Analysis?

Although many groups, organizations, and experts have different ways of approaching data analysis, most of them can be distilled into a one-size-fits-all definition. Data analysis is the process of cleaning, changing, and processing raw data and extracting actionable, relevant information that helps businesses make informed decisions. The procedure helps reduce the risks inherent in decision-making by providing useful insights and statistics, often presented in charts, images, tables, and graphs.
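
As a minimal illustration of that definition, the sketch below uses pandas to clean and transform a handful of raw records and pull out one piece of decision-ready information; the column names and values are invented for the example.

```python
import pandas as pd

raw = pd.DataFrame({
    "region":  ["North", "North", "South", None, "South"],
    "revenue": [1200.0, None, 950.0, 400.0, 1100.0],
})

clean = raw.dropna(subset=["region"])                     # cleaning: drop unusable rows
clean = clean.assign(revenue=clean["revenue"].fillna(0))  # changing: impute missing values
insight = clean.groupby("region")["revenue"].sum()        # processing: summarise for a decision
print(insight)
```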

A simple example of data analysis can be seen whenever we make a decision in our daily lives by evaluating what has happened in the past or what will happen if we make that decision. Basically, this is the process of analyzing the past or future and making a decision based on that analysis.

It’s not uncommon to hear the term “big data” brought up in discussions about data analysis. Data analysis plays a crucial role in processing big data into useful information. Neophyte data analysts who want to dig deeper by revisiting big data fundamentals should go back to the basic question, “What is data?”

Why is Data Analysis Important?

Here is a list of reasons why data analysis is crucial to doing business today.

What Is the Data Analysis Process?

Answering the question “what is data analysis” is only the first step. Now we will look at how it’s performed. The process of data analysis, or alternately, data analysis steps, involves gathering all the information, processing it, exploring the data, and using it to find patterns and other insights. The process of data analysis consists of:


What Is the Importance of Data Analysis in Research?

A huge part of a researcher’s job is to sift through data. That is literally the definition of “research.” However, today’s Information Age routinely produces a tidal wave of data, enough to overwhelm even the most dedicated researcher.

Data analysis, therefore, plays a key role in distilling this information into a more accurate and relevant form, making it easier for researchers to do their job.

Data analysis also provides researchers with a vast selection of different tools, such as descriptive statistics, inferential analysis, and quantitative analysis.

So, to sum it up, data analysis offers researchers better data and better ways to analyze and study said data.

What is Data Analysis: Types of Data Analysis

A half-dozen popular types of data analysis are available today, commonly employed in the worlds of technology and business. They are: 

Next, we will get into the depths to understand about the data analysis methods.


Data Analysis Methods

Some professionals use the terms “data analysis methods” and “data analysis techniques” interchangeably. To further complicate matters, sometimes people throw in the previously discussed “data analysis types” into the fray as well! Our hope here is to establish a distinction between what kinds of data analysis exist, and the various ways it’s used.

Although there are many data analysis methods available, they all fall into one of two primary types: qualitative analysis and quantitative analysis.

We can further expand our discussion of data analysis by showing various techniques, broken down by different concepts and tools. 

Top 7 Data Analysis Tools

So, here's a list of the top seven data analysis tools in terms of popularity, learning, and performance.

Tableau Public

It is a free data visualization application that links to any data source you can think of, whether it's a corporate Data Warehouse, Microsoft Excel, or web-based information. It also generates data visualizations, maps, dashboards, and so on, all with real-time changes that are shown on the web. These may also be shared on social media or with your customer, and you can download the files in several formats.

However, it truly shines when you have an excellent data source. That's when you realize Tableau's ultimate potential. Tableau's Big Data features make it indispensable. Its approach to data analysis and visualization is considerably better than that of any other data visualization software on the market.

R Programming

Well, R is the industry's premier analytics tool, and it's extensively used for statistics and data modeling. It can readily alter data and show it in a variety of formats. It has outperformed SAS in several aspects, including data capacity, performance, and results. 

R may be compiled and run on a broad range of systems, including Windows, UNIX, and macOS. It offers 11,556 packages and lets you explore them by category. Also, R has tools for installing all packages automatically based on user needs, which may be used with Big Data.

Python

It's a scripting language that is simple to understand, write, and maintain. Furthermore, it's a free, open-source tool. Guido van Rossum developed it in the late 1980s, and it supports both structured and functional programming methodologies. Python is simple to learn since it is related to Ruby, JavaScript, and PHP.

Python also contains excellent machine learning packages such as TensorFlow, Theano, Scikit-learn, and Keras. Another useful characteristic of Python is that it can be built on any platform, such as a MongoDB database, SQL browser, or JSON. It also excels at handling text data.

Apache Spark

Apache Spark was created in 2009 by the AMP Lab at the University of California, Berkeley. It is a large-scale data processing engine that runs applications up to a hundred times faster in memory and ten times faster on disk than Hadoop MapReduce.

It is based on data science, and its design makes data science simple. Spark is also popular for developing data pipelines and machine learning models. Spark also contains the MLlib package, which provides a progressive collection of machine algorithms for recurring data science procedures like classification, collaborative filtering, regression, clustering, and so on.

SAS

SAS is basically a data manipulation programming ecosystem and language that is a market leader in analytics. Development of SAS began in 1966, and it was expanded during the 1980s and 1990s. It is simple to use and administer, and it can analyze data from any source.

In 2011, SAS released a significant collection of solutions for customer intelligence, as well as numerous SAS modules for social media, online, and marketing analytics. These are now often used to profile clients and prospects. It can also forecast their actions and manage and improve communications.

Excel

Excel is a popular, basic, and frequently leveraged analytical tool in practically all industries. Whether you are a SAS, R, or Tableau specialist, you will still need to utilize Excel. When analytics on the client's internal data is required, Excel comes in handy.

It eases the hard work of summarizing the data with pivot-table previews, which help in filtering the data according to the client's needs. Excel includes a sophisticated business analytics feature that aids in modeling skills. It has prebuilt tools such as automated relationship recognition, DAX measure generation, and time grouping.

RapidMiner

RapidMiner is an extremely capable, comprehensive data analysis tool. The same platform covers predictive analysis as well as other advanced analytics such as machine learning, text analysis, visual analytics, and data mining, without the use of programming.

RapidMiner supports all data source types, including Microsoft SQL, Excel, Access, Oracle, Teradata, Dbase, IBM SPSS, MySQL, Ingres, IBM DB2, Sybase, and others. This tool is quite powerful, as it can provide analytics based on real-world data transformation settings, allowing you to customize the data sets and formats for predictive analysis.


Artificial Intelligence and Machine Learning

AI is on the rise and has proven a valuable tool in the world of data analysis. Related analysis techniques include:

Mathematics and Statistics

This is the area where you find the number-crunching side of data analytics. The techniques include:


Graphs and Visualization

We are visually oriented creatures. Images and displays attract our attention and stay in our memory longer. The techniques include:


How to Become a Data Analyst

Now that we have answered the question “what is data analysis”, if you want to pursue a career in data analytics, you should start by first researching what it takes to become a data analyst. You should follow this up by taking selected data analytics courses, such as the Data Analyst Master’s certification training course offered by Simplilearn.

This seven-course Data Analyst Master’s Program is run in collaboration with IBM and will make you an expert in data analysis. You will learn about data analysis tools and techniques, working with SQL databases, the R and Python languages, creating data visualizations, and how to apply statistics and predictive analytics in a commercial environment. 

You can even check out the PG Program in Data Analytics in partnership with Purdue University and in collaboration with IBM. This program provides a hands-on approach with case studies and industry-aligned projects to bring the relevant concepts live. You will get broad exposure to key technologies and skills currently used in data analytics.

According to Forbes, the data analytics profession is exploding. The United States Bureau of Labor Statistics forecasts impressively robust growth for data science job skills and predicts that the data science field will grow about 28 percent through 2026. Amstat.org backs up these predictions, reporting that, by the end of 2021, almost 70 percent of business leaders surveyed will look for prospective job candidates that have data skills.

Payscale reports that Data Analysts can earn a yearly average of USD 62,559. Payscale also shows Data Analysts in India making an annual average of ₹456,667.

So, if you want a career that pays handsomely and will always be in demand, then check out Simplilearn and get started on your new, brighter future!


1. What is the role of data analytics?

Data Analytics is the process of collecting, cleaning, sorting, and processing raw data to extract relevant and valuable information to help businesses. An in-depth understanding of data can improve customer experience, retention and targeting, reduce operational costs, and sharpen problem-solving methods.

2. What are the types of data analytics?

Diagnostic Analysis, Predictive Analysis, Prescriptive Analysis, Text Analysis, and Statistical Analysis are the most commonly used data analytics types. Statistical analysis can be further broken down into Descriptive Analytics and Inferential Analysis.

3. What are the analytical tools used in data analytics?

The top 10 data analytical tools are Sequentum Enterprise, Datapine, Looker, KNIME, Lexalytics, SAS Forecasting, RapidMiner, OpenRefine, Talend, and NodeXL. The tools aid different data analysis processes, from data gathering to data sorting and analysis. 

4. What is the career growth in data analytics?

Starting off as a Data Analyst, you can quickly move up to Senior Analyst, then Analytics Manager, Director of Analytics, or even Chief Data Officer (CDO).

5. Why Is Data Analytics Important?

Data Analysis is essential as it helps businesses understand their customers better, improves sales, improves customer targeting, reduces costs, and allows for the creation of better problem-solving strategies. 

6. Who Is Using Data Analytics?

Data Analytics has now been adopted almost across every industry. Regardless of company size or industry popularity, data analytics plays a huge part in helping businesses understand their customer’s needs and then use it to better tweak their products or services. Data Analytics is prominently used across industries such as Healthcare, Travel, Hospitality, and even FMCG products.


Database analysis and Big Data


Increasingly, the phrase Market Intelligence is being used to describe the use of data science to link and analyse databases of information held within an organisation. Although we see market intelligence as broader than database analysis alone, database analysis is now a central business function for market insight.

The aim of data science and database analysis is to build predictive statistical models that increase a customer's interest in purchasing, increase the amount they spend, or otherwise influence their purchase behaviour.

Data is typically found in transaction or sales databases, contact and customer service databases, loyalty programs, vast web or internet app-based databases of online behaviour and purchasing, and can be combined with external data.

For an analyst, the basic procedures for analysing database information, whether for a simple contact database or Big Data, are:

It is not uncommon for there to be many separate databases in an organisation, each holding different information. Newer companies are more likely to have unified database systems, but it is more common to have operational databases that are only lightly linked (e.g. by customer ID) and then need pulling together and unifying for analysis.

For on-going database analysis, automating as many of these tasks as possible becomes vital with a large dataset, both to ensure that the data is of the same quality for each run of analysis and to save time and effort repeating the same work with each data snapshot. While smaller data snapshots can be handled by hand, anything over a few tens of thousands of records needs to be properly automated and documented.

Our data science analysts start by building scripts to extract, pool and link the data sources before then taking the data to statistical analysis, modeling or into machine learning/AI.

Extracting information

Many internal databases grow and develop through use and contingency, and consequently identifying and extracting the data can be complicated. For long-standing or legacy systems, particularly where an operational database has evolved over time, databases and tables can be poorly documented, with data that is missing or has been moved, or where table schema or data fields have changed in definition over time.

The data from live systems needs to be pulled for analysis, and fields matched and checked for content quality and table relationships confirmed and validated.

For external data, such as social media feeds, data may be brought in from data brokers, or obtained directly by scraping (subject to privacy rules). These data feeds also need to be cleaned and matched and may bring additional complications such as de-translating.

Once data has been obtained, it has to be cleaned. Many databases build up inaccuracies and duplications over time: addresses change, postcodes are entered incorrectly, and records get duplicated, sometimes through mistaken data entry but more often because customers have changed and duplicate records have been created (in a typical business-to-business database, 20-25% of the data will be out of date after a year simply because people change jobs). Similarly, text feeds need a level of processing to standardise the data and to screen for potential problems.

Within an internal database, or when merging datasets, deduping is an important, but sometimes challenging task. Automated systems exist, but some level of 'eyeballing' has to be done to check the quality of the dedupe.
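
A minimal sketch of the kind of automated dedupe pass that precedes the manual check, written in pandas; the column names and the normalisation rule are assumptions for illustration.

```python
import pandas as pd

customers = pd.DataFrame({
    "name":     ["Acme Building Contractors", "ACME BUILDING CONTRACTORS ", "J. Smith"],
    "postcode": ["EH1 2AB", "EH1 2AB", "G12 8QQ"],
})

# Normalise the fields the duplicate check keys on.
customers["name_key"] = customers["name"].str.strip().str.lower()

# Automated pass: drop exact duplicates on the normalised key plus postcode.
deduped = customers.drop_duplicates(subset=["name_key", "postcode"])

# Rows sharing a postcode but not an exact name match are flagged for 'eyeballing'.
suspects = customers[customers.duplicated(subset=["postcode"], keep=False)]
```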

Next data may need to be recoded, and missing or erroneous values identified. When looking at aspects such as purchase histories, it is often the case that the data has to be grouped up and reclassified. For instance each product on a database will have a separate product code, but for analysis several individual products may need to be grouped together.

The process of cleaning eventually leads to automated scripts including de-duplication and cleaning up missing or bad data, but often there is an element of verification that needs doing by hand - often by examining smaller samples of data.

Once the individual data sources have been cleaned, they can be merged with other data sources. Merging again is not entirely straightforward as some allowance may be necessary for the same customer having a different name on different databases. For instance Acme Building Contractors might also be known as ABC. Consequently, there may also be a second period of cleaning necessary once the data has been merged.
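
One way to handle such name variants, sketched in pandas with a made-up alias table, is to map them onto a canonical name before joining:

```python
import pandas as pd

sales = pd.DataFrame({"customer": ["Acme Building Contractors", "ABC"], "value": [5000, 1200]})
contacts = pd.DataFrame({"customer": ["Acme Building Contractors"], "contact": ["j.doe@example.com"]})

# Alias table built up during cleaning: variant name -> canonical name.
aliases = {"ABC": "Acme Building Contractors"}
sales["customer"] = sales["customer"].replace(aliases)

# Merge on the canonical customer name; unmatched rows surface for a
# second round of cleaning.
merged = sales.merge(contacts, on="customer", how="left")
```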

A common merge for consumer data sets is to add geographically based classification data from external agencies such as ACORN or MOSAIC, or to link in external data from consumer data companies such as Experian. These provide an additional layer of classification or segmentation data on top of the existing data that can add fine detail for modeling.

There are many different types of analysis that can be carried out on the data from the database. The first part of any analysis is usually an exploratory stage just to see what's there. A very common simple approach is called Pareto Analysis which involves ranking customers by value and then breaking them into quintiles or deciles to see who the most valuable customers are and what their purchasing characteristics are. In text analysis it might be a simple word frequency count prior to any attempt at sentiment or concept analysis.
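
A simple Pareto-style exploration of customer value might be sketched in pandas as follows; the file and column names are assumed for the example.

```python
import pandas as pd

orders = pd.read_csv("orders.csv")   # assumed columns: customer_id, order_value

# Total value per customer, ranked from most to least valuable.
value = orders.groupby("customer_id")["order_value"].sum().sort_values(ascending=False)

# Break customers into deciles (decile 1 = the most valuable 10%).
deciles = pd.qcut(value.rank(method="first", ascending=False), 10, labels=range(1, 11))

# Share of total value accounted for by each decile.
print(value.groupby(deciles).sum() / value.sum())
```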

Standard transactional database measures are recency, frequency and value. So who bought in the last month, 3 months, 6 months? Who has bought once a year, twice a year etc? How much did they spend? What was the difference between those spending a lot and those spending in the next category down (and so can we get uplift).
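
Those recency, frequency and value measures can be computed from a transaction table along these lines (again with assumed column names):

```python
import pandas as pd

tx = pd.read_csv("transactions.csv", parse_dates=["date"])  # assumed: customer_id, date, amount
today = tx["date"].max()

rfm = tx.groupby("customer_id").agg(
    recency_days=("date", lambda d: (today - d.max()).days),  # days since last purchase
    frequency=("date", "count"),                               # number of purchases
    value=("amount", "sum"),                                   # total spend
)

# e.g. who bought in the last 3 months?
recent_buyers = rfm[rfm["recency_days"] <= 90]
```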

Increasingly businesses look to track customers and then look at customer journeys - particularly for web-based analytics - what transactions happened when and how did a customer move from one transaction to the next; what did the customer journey look like?

The core aim for many types of analysis is to build a 'propensity model'. That is, a model to identify customers who are most likely to act in a certain way: for instance, those who are most likely to buy, those who would be most likely to respond to a particular communication, or those who are likely to leave or stop buying.

Various types of statistical tools and analysis can be used to build propensity models, from classifying, grouping and labelling customers to various forms of regression. Much large-scale database analysis is done via machine learning using automated statistical investigation, or artificial intelligence using deep neural networks. Data is typically analysed and then validated against hold-out samples to reduce the likelihood of overfitting.
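
As a small sketch of a propensity model validated against a hold-out sample, here is a logistic regression in scikit-learn; the feature names and the 'responded' label are assumptions for the example, not a prescription.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

data = pd.read_csv("customers.csv")             # assumed: recency_days, frequency, value, responded
X = data[["recency_days", "frequency", "value"]]
y = data["responded"]                           # 1 = responded to the last campaign

# Hold out a sample to check for overfitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]      # propensity to respond
print("hold-out AUC:", roc_auc_score(y_test, scores))
```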

The classification and grouping means database data can be used for segmentation. A major difference between database segmentation and market research segmentation is that the results can be marked back onto the database - each customer is labelled with their segment. This means that if you need to contact or track a particular segment from the database this is entirely possible, whereas for market research you are typically taking a second-level guess.

Implement, and learn

Once the analysis is creating marketing insights, the next stage is implementation - that is to use the data to affect customer behaviour. For instance, to apply a segmentation with tailored communication, specifically targeted offers and a system of response measurement and management.

Implementation means tracking how well the analysis performs compared to the modeling, and so reflects back onto the databases.

This need for multi-faceted implementation leads to the development of algorithmic and experimental marketing and the importance of bringing the analysis back to websites.

Blending Big Data and research

A recurring view of Big Data is the idea that all the information you need is sitting in the databases and just needs proper analysis, and then the business will be able to predict exactly what the customer wants and will do. Unfortunately, that is far from the truth.

Big Data analysis can find relationships and correlations in the data and therefore help improve and optimise products and services, but the main problem with database data is that it is backwards looking - that is, it tells you what customers have done. If a new competitor enters the market, or you launch a new product, there is no data about what will happen next. There is also an 'analytical delay' - that is, analysis, and finding useful insights, takes time. By the time the analysis is finished the market may have moved on to new things.

For this reason, research and experimentation are also still required. Big Data can be combined with small-scale live experimentation to test how people react to changes, offers and communications, or blended with research to understand the why of behaviour, for instance tracking e-commerce journeys and then following up with research into purchase motivations and objectives.

For help and advice on the effective use of database analysis and Big Data contact [email protected]


What is Data Analysis and Data Mining?

The exponentially increasing amounts of data being generated each year make getting useful information from that data more and more critical. The information frequently is stored in a data warehouse, a repository of data gathered from various sources, including corporate databases, summarized information from internal systems, and data from external sources. Analysis of the data includes simple query and reporting, statistical analysis, more complex multidimensional analysis, and data mining.

Data analysis and data mining are a subset of business intelligence (BI), which also incorporates data warehousing, database management systems, and Online Analytical Processing (OLAP). 

The technologies are frequently used in customer relationship management (CRM) to analyze patterns and query customer databases. Large quantities of data are searched and analyzed to discover useful patterns or relationships, which are then used to predict future behavior.

Some estimates indicate that the amount of new information doubles every three years. To deal with the mountains of data, the information is stored in a repository of data gathered from various sources, including corporate databases, summarized information from internal systems, and data from external sources. Properly designed and implemented, and regularly updated, these repositories, called data warehouses, allow managers at all levels to extract and examine information about their company, such as its products, operations, and customers' buying habits.

With a central repository to keep the massive amounts of data, organizations need tools that can help them extract the most useful information from the data. A data warehouse can bring together data in a single format, supplemented by metadata through use of a set of input mechanisms known as extraction, transformation, and loading (ETL) tools. These and other BI tools enable organizations to quickly make knowledgeable business decisions based on good information analysis from the data.
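
As a toy illustration of the ETL idea (not any particular vendor's tool), an extract-transform-load pass might look like the following; the file, table, and column names are invented for the example.

```python
import csv
import sqlite3

warehouse = sqlite3.connect("warehouse.db")
warehouse.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")

with open("regional_sales.csv", newline="") as f:       # extract from a source-system export
    for row in csv.DictReader(f):
        region = row["region"].strip().title()          # transform into the warehouse format
        amount = float(row["amount"] or 0)
        warehouse.execute("INSERT INTO sales VALUES (?, ?)", (region, amount))  # load

warehouse.commit()
```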

Analysis of the data includes simple query and reporting functions, statistical analysis, more complex multidimensional analysis, and data mining (also known as knowledge discovery in databases, or KDD). Online analytical processing (OLAP) is most often associated with multidimensional analysis, which requires powerful data manipulation and computational capabilities.

With the increasing amount of data being produced each year, BI has become a hot topic. The increasing focus on BI has led a number of large organizations to increase their presence in the space, leading to a consolidation around some of the largest software vendors in the world. Among the notable purchases in the BI market were Oracle's purchase of Hyperion Solutions; Open Text's acquisition of Hummingbird; IBM's acquisition of Cognos; and SAP's acquisition of Business Objects.

The purpose of gathering corporate information together in a single structure, typically an organization's data warehouse, is to facilitate analysis so that information that has been collected from a variety of different business activities may be used to enhance the understanding of underlying trends in the business. Analysis of the data can include simple query and reporting functions, statistical analysis, more complex multidimensional analysis, and data mining. OLAP, one of the fastest growing areas, is most often associated with multidimensional analysis. According to The BI Verdict (formerly The OLAP Report), the defining characteristic of an OLAP application is "fast analysis of shared multidimensional information."

Data warehouses are usually separate from production systems, as the production data is added to the data warehouse at intervals that vary, according to business needs and system constraints. Raw production data must be cleaned and qualified, so it often differs from the operational data from which it was extracted. The cleaning process may actually change field names and data characters in the data record to make the revised record compatible with the warehouse data rule set. This is the province of ETL.

A data warehouse also contains metadata (structure and sources of the raw data, essentially, data about data), the data model, rules for data aggregation, replication, distribution and exception handling, and any other information necessary to map the data warehouse, its inputs, and its outputs. As the complexity of data analysis grows, so does the amount of data being stored and analyzed; ever more powerful and faster analysis tools and hardware platforms are required to maintain the data warehouse.

A successful data warehousing strategy requires a powerful, fast, and easy way to develop useful information from raw data. Data analysis and data mining tools use quantitative analysis, cluster analysis, pattern recognition, correlation discovery, and associations to analyze data with little or no IT intervention. The resulting information is then presented to the user in an understandable form, processes collectively known as BI. Managers can choose between several types of analysis tools, including queries and reports, managed query environments, and OLAP and its variants (ROLAP, MOLAP, and HOLAP). These are supported by data mining, which develops patterns that may be used for later analysis, and completes the BI process.

Business Intelligence Components

The ultimate goal of Data Warehousing is BI production, and analytic tools represent only part of this process. Three basic components are used together to prepare a data warehouse for use and to develop information from it, including:

Analytic tools continue to grow within this framework, with the overall goal of improving BI, improving decision analysis, and, more recently, promoting linkages with business process management (BPM), also known as workflow.

Data Mining

Data mining can be defined as the process of extracting data, analyzing it from many dimensions or perspectives, then producing a summary of the information in a useful form that identifies relationships within the data. There are two types of data mining: descriptive, which gives information about existing data; and predictive, which makes forecasts based on the data.

Basic Requirements

A corporate data warehouse or departmental data mart is useless if that data cannot be put to work. One of the primary goals of all analytic tools is to develop processes that can be used by ordinary individuals in their jobs, rather than requiring advanced statistical knowledge. At the same time, the data warehouse and information gained from data mining and data analysis needs to be compatible across a wide variety of systems. For this reason, products within this arena are evolving toward ease of use and interoperability, though these have become major challenges.

For all analytic tools, it is important to keep business goals in mind, both in selecting and deploying tools and in using them. In putting these tools to use, it is helpful to look at where they fit into the decision-making processes. The five steps in decision-making can be identified as follows:

Standard reports are the results of normal database queries that tell how the business is performing and provide details of key business factors. When exceptions occur, the details of the situation must be easily obtainable. This can be done by data mining, or by developing hypotheses and testing them using analytic tools such as OLAP. The conclusions can then be tested using "what-if" scenarios with simple tools such as spreadsheet applications. When a decision is made, and action is taken, the results must then be traced so that the decision-making process can be improved.

Although sophisticated data analysis may require the help of specialized data analysts and IT staff, the true value of these tools lies in the fact that they are coming closer to the user. The "dashboard" is becoming the leading user interface, with products such as Informatica's PowerCenter, Oracle's Hyperion Essbase, SAS Enterprise Miner and Arcplan Enterprise server tools designed to provide easily customizable personal dashboards.

One of the recurring challenges for data analysis managers is to disabuse executives and senior managers of the notion that data analysis and data mining are business panaceas. Even when the technology might promise valuable information, the cost and the time required to implement it might be prohibitive.

The 12 Rules

In 1993, E.F. Codd, S.B. Codd, and C.T. Salley presented a paper entitled "Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate" that offered 12 rules for evaluating analytical processing tools. These rules are essentially a list of "must haves" in data analysis, focusing on usability, and they continue to be relevant in evaluating analytic tools:

Since analytic tools are designed to be used by, or at the very least, their output understood by, ordinary employees, these rules are likely to remain valid for some time to come.

Current View

The analytic sector of BI can be broken down into two general areas: query and analysis and data mining. It is important to bear in mind the distinction, although these areas are often confused. Data analysis looks at existing data and applies statistical methods and visualization to test hypotheses about the data and discover exceptions. Data mining seeks trends within the data, which may be used for later analysis. It is, therefore, capable of providing new insights into the data, which are independent of preconceptions.

Data Analysis

Data analysis is concerned with a variety of different tools and methods that have been developed to query existing data, discover exceptions, and verify hypotheses. These include:

Queries and Reports. A query is simply a question put to a database management system, which then generates a subset of data in response. Queries can be basic (e.g., show me Q3 sales in Western Europe) or extremely complex, encompassing information from a number of data sources, or even a number of databases stored within dissimilar programs (e.g., a product catalog stored in an Oracle database, and the product sales stored under Sybase). A well-written query can exact a precise piece of information; a sloppy one may produce huge quantities of worthless or even misleading data.

Queries are often written in structured query language (SQL), a product-independent command set developed to allow cross-platform access to relational databases. Queries may be saved and reused to generate reports, such as monthly sales summaries, through automatic processes, or simply to assist users in finding what they need. Some products build dictionaries of queries that allow users to bypass knowledge of both database structure and SQL by presenting a drag-and-drop query-building interface. Query results may be aggregated, sorted, or summarized in many ways. For example, SAP's Business Objects unit offers a number of built-in business formulas for queries.
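
For a concrete, if deliberately tiny, example of the kind of saved, aggregated query described above, run here from Python against SQLite with assumed table and column names:

```python
import sqlite3

conn = sqlite3.connect("sales.db")

# A saved, parameterised query: quarterly sales for one region, aggregated
# and sorted, ready to feed a regularly scheduled report.
QUARTERLY_SALES = """
    SELECT product, SUM(amount) AS total
    FROM sales
    WHERE region = ? AND quarter = ?
    GROUP BY product
    ORDER BY total DESC
"""

for product, total in conn.execute(QUARTERLY_SALES, ("Western Europe", "Q3")):
    print(f"{product}: {total:,.2f}")
```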

The presentation of the data retrieved by the query is the task of the report. Presentations may encompass tabular or spreadsheet-formatted information, graphics, cross tabulations, or any combination of these forms. A rudimentary reporting product might simply show the results in a comprehensible fashion; more elegant output is usually advanced enough to be suitable for inclusion in a glossy annual report. Some products can run queries on a scheduled basis and configure those queries to distribute the resulting reports to designated users through email. Reporting products routinely produce HTML output and are often accessible through a user's Web browser.

Managed Query Environments. The term managed query environment has been adopted by the industry to describe a query and reporting package that allows IT control over users' access to data and application facilities in accordance with each user's level of expertise and business needs. For example, in some organizations, IT may build a set of queries and report structures and require that employees use only the IT-created structures; in other organizations, and perhaps within other areas of the same organization, employees are permitted to define their own queries and create custom reports.

A managed report environment (MRE) is a type of managed query environment. It is a report design, generation, and processing environment that permits the centralized control of reporting. To users, an MRE provides an intelligent report viewer that may contain hyperlinks between relevant parts of a document or allow embedded OLE objects such as Excel spreadsheets within the report. MREs have familiar desktop interfaces; for example, SAP's Business Objects tabbed interface allows employees to handle multiple reports in the same way they would handle multiple spreadsheets in an Excel workbook.

Some MREs, such as Information Builders' FOCUS Report Server, can handle the scheduling and distribution of reports, as well as their processing. For example, SAP Business Object's Crystal Reports can develop reports about previously created reports.

Online Analytical Processing (OLAP). The most popular technology in data analysis is OLAP. OLAP servers organize data into multidimensional hierarchies, called cubes, for high-speed data analysis. Data mining algorithms scan databases to uncover relationships or patterns. OLAP and data mining are complementary, with OLAP providing top-down data analysis and data mining offering bottom-up discovery.

OLAP tools allow users to drill down through multiple dimensions to isolate specific data items. For example, a hypercube (the multidimensional data structure) may contain sales information categorized by product, region, salesperson, retail outlet, and time period, in both units and dollars. Using an OLAP tool, a user need only click on a dimension to see a breakdown of dollar sales by region; an analysis of units by product, salesperson, and region; or to examine a particular salesperson's performance over time.

Information can be presented in tabular or graphical format and manipulated extensively. Since the information is derived from summarized data, it is not as flexible as information obtained from an ad hoc query; most tools offer a way to drill down to the underlying raw data. For example, PowerPlay provides the automatic launch of its sister product, Impromptu, to query the database for the records in question.
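
A rough feel for this kind of cube-and-drill-down analysis can be had with pivot tables in pandas, standing in here for an OLAP tool; the columns mirror the hypercube example above and are assumptions.

```python
import pandas as pd

sales = pd.read_csv("sales.csv")   # assumed: product, region, salesperson, period, units, dollars

# "Cube"-style summary: dollar sales by region.
by_region = sales.pivot_table(values="dollars", index="region", aggfunc="sum")

# Drill down: units by product and salesperson within each region.
drill = sales.pivot_table(values="units",
                          index=["region", "salesperson"],
                          columns="product",
                          aggfunc="sum")

# A particular salesperson's performance over time.
over_time = sales[sales["salesperson"] == "J. Smith"].groupby("period")["dollars"].sum()
```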

Although each OLAP product handles data structures and manipulation in its own way, an OLAP API, developed by a group of vendors who form the OLAP Council, standardizes many important functions and allows IT to offer the appropriate tool to each of its user groups. The MD-API specifies how an OLAP server and client connect, and it defines metadata, data fetch functions, and methods for handling status messages. It also standardizes filter, sort, and cube functions; compliant clients are able to communicate with any vendor's compliant server.

OLAP Variants: MOLAP, ROLAP, and HOLAP. OLAP is divided into multidimensional OLAP (MOLAP), relational OLAP (ROLAP), and hybrid OLAP (HOLAP).

ROLAP can be applied both as a powerful DSS product in its own right and as a way to aggregate and pre-stage multi-dimensional data for MOLAP environments. ROLAP products optimize data for multi-dimensional analysis using standard relational structures. The advantage of the MOLAP paradigm is that it can natively incorporate algebraic expressions to handle complex, matrix-based analysis. ROLAP, on the other hand, excels at manipulating large data sets and data acquisition, but is limited to SQL-based functions. Since all organizations will require both complex analysis and analysis of large data sets, it could be necessary to develop an architecture and set of user guidelines that will enable implementation of both ROLAP and MOLAP where each is appropriate.

HOLAP is the newest step in the ongoing evolution of OLAP. HOLAP combines the benefits of both ROLAP and MOLAP by storing only the most often used data in multidimensional cube format and processing the rest of the relational data in the standard on-the-fly method. This provides good performance in browsing aggregate data, but slower performance in "drilling down" to further detail.

Databases are growing in size to a stage where traditional techniques for analysis and visualization of the data are breaking down. Data mining and KDD are concerned with extracting models and patterns of interest from large databases. Data mining can be regarded as a collection of methods for drawing inferences from data. The aims of data mining and some of its methods overlap with those of classical statistics. It should be kept in mind that both data mining and statistics are not business solutions; they are just technologies. Additionally, there are still some philosophical and methodological differences between them.

This field is growing rapidly, due in large part to the increasing awareness of the potential competitive business advantage of using such information. Important knowledge has been extracted from massive scientific data, as well. What is useful information depends on the application. Each record in a data warehouse full of data is useful for daily operations, as in online transaction business and traditional database queries. Data mining is concerned with extracting more global information that is generally the property of the data as a whole. Thus, the diverse goals of data mining algorithms include: clustering the data items into groups of similar items, finding an explanatory or predictive model for a target attribute in terms of other attributes, and finding frequent patterns and sub-patterns, as well as finding trends, deviations, and interesting correlations between the attributes.

A problem is first defined, then data source and analytic tool selection are undertaken to decide the best way to approach the data. This involves a wide variety of choices.

Decision trees and decision rules are frequently the basis for data mining. They utilize symbolic and interpretable representations when developing methods for classification and regression. These methods have been developed in the fields of pattern recognition, statistics, and machine learning. Symbolic solutions can provide a high degree of insight into the decision boundaries that exist in the data and the logic underlying them. This aspect makes these predictive mining techniques particularly attractive in commercial and industrial data mining applications.
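
An interpretable decision-tree classifier of the sort described can be sketched with scikit-learn; the feature and label names are invented for the example, and the human-readable rule listing is the point.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.read_csv("accounts.csv")                   # assumed: spend, tenure_months, churned
X, y = data[["spend", "tenure_months"]], data["churned"]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

# The symbolic rules expose the decision boundaries the model has found.
print(export_text(tree, feature_names=["spend", "tenure_months"]))
```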

Applying machine-learning methods to inductively construct models of the data at hand has also proven successful. Neural networks have been successfully applied in a wide range of supervised and unsupervised learning applications. Neural-network methods are not commonly used for data mining tasks because they are the most likely to produce incomprehensible results and to require long training times. Some neural-network learning algorithms exist, however, that are able to produce good models without excessive training times.

In recent years, significant interest has developed in adapting numerical and analytic techniques from statistical physics to provide algorithms and estimates for good approximate solutions to hard optimization problems. Cluster analysis is an important technique in exploratory data analysis, because there is no prior knowledge of the distribution of the observed data. Partitional clustering methods, which divide the data according to natural classes present in it, have been used in a large variety of scientific disciplines and engineering applications. The goal is to find a partition of a given data set into several compact groups. Each group indicates the presence of a distinct category in the measurements.
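For a concrete, if simplified, sketch of partitional clustering, the snippet below runs k-means (one common partitional method) over a handful of two-dimensional points; the points and the choice of two clusters are assumptions made only for illustration.

    # A minimal partitional-clustering sketch using k-means (scikit-learn).
    # The data points and the choice of k=2 are illustrative assumptions.
    from sklearn.cluster import KMeans

    points = [
        [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one compact group
        [8.0, 8.2], [7.9, 7.8], [8.3, 8.1],   # another compact group
    ]

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

    # Each label marks which compact group (category) a point was assigned to.
    print(km.labels_)
    print(km.cluster_centers_)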

In all data mining applications, results are considerably subject to interpretation, since it is a search for trends and correlation rather than an examination of hypotheses based on known real-world information. The possibility for spurious results is large, and there are many cases where the information developed will be of little real value for business purposes. Nonetheless, when pay dirt is struck, the results can be extremely useful.

Interest in data mining is growing, and it has recently been spotlighted by attempts to root out terrorist profiles from data stored in government computers. In a more mundane, but lucrative application, SAS uses data mining and analytics to glean insight about influencers on various topics from postings on social networks such as Twitter, Facebook, and user forums.

Data Mining and CRM

CRM is a technology that relies heavily on data mining. Comprising sales, marketing, and service, CRM applications use data mining techniques to support their functionality. Combining the two technology segments is sometimes referred to as "customer data mining." Proponents claim that positive results of customer data mining include improvements in prospecting and market segmentation; increases in customer loyalty, as well as in cross-selling and up-selling; a reduction in risk management need; and the optimization of media spending on advertising.

Recommendations

Since data analysis is such a key method for developing knowledge from the huge amounts of business data collected and stored each day, enterprises need to select the data analysis tools with care. This will help ensure that the tools' strengths match the needs of their business. Organizations must be aware of how the tools are to be used and their intended audience. It is also important to consider the Internet, as well as the needs of mobile users and power users, and to assess the skills and knowledge of the users and the amount of training that will be needed to get the most productivity from the tools.

Visual tools are very helpful in representing complex relationships in formats that are easier to understand than columns of numbers spread across a screen. Key areas of discovery found with visual tools can then be highlighted for more detailed analysis to extract useful information. Visual tools also offer a more natural way for people to analyze information than does mental interpretation of a spreadsheet.

Organizations should also closely consider the tool interface presented to users, because an overly complex or cluttered interface will lead to higher training costs, increased user frustration, and errors. Vendors are trying to make their tools as friendly as possible, but decision-makers should also consider user customization issues, because a push-button interface may not provide the flexibility their business needs. When considering their OLAP processes, companies need to determine which approach is best. The choices include a multi-dimensional approach, a relational analysis one, or a hybrid of the two. The use of a personalized "dashboard" style interface is growing, and ease of use has emerged as a key criterion in corporate purchasing decisions.

While data analysis tools are becoming simpler, more sophisticated techniques will still require specialized staff. Data mining, in particular, can require added expertise, because results can be difficult to interpret and may need to be verified using other methods.

Data analysis and data mining are part of BI, and require a strong data warehouse strategy in order to function. This means that attention needs to be paid to the more mundane aspects of ETL, as well as to advanced analytic capacity. The final result can only be as good as the data that feeds the system.

Arcplan: http://www.arcplan.com/
IBM Cognos: http://www.cognos.com/
Informatica: http://www.informatica.com/
Information Builders: http://www.informationbuilders.com/
The BI Verdict: http://www.bi-verdict.com/
Open Text: http://www.opentext.com/
OLAP Council: http://www.olapcouncil.org/
Oracle: http://www.oracle.com/
SAP BusinessObjects: http://www.sap.com/solutions/sapbusinessobjects/index.epx
SAS: http://www.sas.com/
SmartDrill: http://www.smartdrill.com/
Sybase: http://www.sybase.com/

This article was adapted from the Faulkner Information Services library of reports covering computing and telecommunications. For more information contact www.faulkner.com. To subscribe to the Faulkner Information Services visit http://www.faulkner.com/showcase/subscription.asp.


What is Data Analysis? Research, Types & Example

What is Data Analysis?

Data analysis is defined as a process of cleaning, transforming, and modeling data to discover useful information for business decision-making. The purpose of Data Analysis is to extract useful information from data and to make decisions based upon that analysis.

A simple example of data analysis is the way we make decisions in day-to-day life: we think about what happened the last time we made a similar choice, or what is likely to happen if we choose a particular option. This is nothing but analyzing our past or anticipating our future and making decisions based on it. To do so we gather memories of the past or dreams of the future; that, in essence, is data analysis. An analyst does the same thing for business purposes, and that is called Data Analysis.


Why Data Analysis?


To grow your business, or even to grow in your life, sometimes all you need to do is analysis!

If your business is not growing, you have to look back, acknowledge your mistakes, and make a plan again without repeating them. And even if your business is growing, you have to look forward to making it grow even more. All you need to do is analyze your business data and business processes.

Types of Data Analysis: Techniques and Methods

There are several types of Data Analysis techniques, depending on the business and the technology involved. The major Data Analysis methods are:

Text Analysis

Statistical Analysis

Diagnostic Analysis

Predictive Analysis

Prescriptive Analysis

Text Analysis is also referred to as Data Mining. It is one of the methods of data analysis used to discover patterns in large data sets using databases or data mining tools, and it is used to transform raw data into business information. Business Intelligence tools available on the market are used to take strategic business decisions. Overall, it offers a way to extract and examine data, derive patterns, and finally interpret the data.

Statistical Analysis shows “What happened?” by using past data in the form of dashboards. Statistical Analysis includes the collection, analysis, interpretation, presentation, and modeling of data. It analyses a complete set of data or a sample of it. There are two categories of this type of Analysis: Descriptive Analysis and Inferential Analysis.

Descriptive Analysis

Descriptive Analysis analyses complete data or a summarized sample of numerical data. It shows the mean and standard deviation for continuous data, and percentages and frequencies for categorical data.

Inferential Analysis

Inferential Analysis analyses a sample drawn from the complete data; selecting different samples can lead to different conclusions from the same data.
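To make the descriptive side concrete, here is a minimal sketch, assuming pandas is installed and using an invented mini-dataset, that computes a mean and standard deviation for a continuous column and frequencies and percentages for a categorical one.

    # A minimal descriptive-analysis sketch; the data frame is invented.
    import pandas as pd

    df = pd.DataFrame({
        "order_value": [120.0, 80.5, 200.0, 150.0, 95.5],   # continuous
        "channel": ["web", "store", "web", "web", "store"],  # categorical
    })

    # Continuous data: mean and standard deviation.
    print(df["order_value"].mean(), df["order_value"].std())

    # Categorical data: frequency and percentage.
    print(df["channel"].value_counts())
    print(df["channel"].value_counts(normalize=True) * 100)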

Diagnostic Analysis shows “Why did it happen?” by finding the cause from the insights uncovered in Statistical Analysis. This Analysis is useful for identifying behavior patterns in data. If a new problem arises in your business process, you can look into this Analysis to find similar patterns for that problem, and there is a chance that similar prescriptions can be used for the new problem.

Predictive Analysis shows “What is likely to happen?” by using previous data. The simplest example: if last year I bought two dresses based on my savings, and this year my salary has doubled, then I can buy four dresses. Of course it is not that easy, because you have to consider other circumstances, such as the chance that clothes prices will rise this year, or that instead of dresses you may want to buy a new bike, or need to buy a house!

So this Analysis makes predictions about future outcomes based on current or past data. Forecasting is just an estimate; its accuracy depends on how much detailed information you have and how deeply you dig into it.
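To make this slightly more concrete, here is a minimal forecasting sketch, assuming NumPy and a few invented yearly sales figures: it fits a straight-line trend to the past values and extrapolates one year ahead. Real forecasting would, as noted above, have to account for many more circumstances.

    # A minimal forecasting sketch: fit a linear trend to past values
    # and extrapolate one step ahead. The figures are invented.
    import numpy as np

    years = np.array([2019, 2020, 2021, 2022, 2023])
    sales = np.array([100.0, 110.0, 125.0, 135.0, 150.0])

    slope, intercept = np.polyfit(years, sales, deg=1)  # least-squares line
    forecast_2024 = slope * 2024 + intercept

    print(round(forecast_2024, 1))  # only an estimate, as the text warns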

Prescriptive Analysis combines the insights from all of the previous Analyses to determine which action to take on a current problem or decision. Most data-driven companies use Prescriptive Analysis because predictive and descriptive Analysis alone are not enough to improve performance; based on the current situation and problems, they analyze the data and make decisions.

The Data Analysis Process is simply gathering information by using an appropriate application or tool which allows you to explore the data and find patterns in it. Based on that information and data, you can make decisions or draw final conclusions.

Data Analysis consists of the following phases:

Data Requirement Gathering

Data Collection

Data Cleaning

Data Analysis

Data Interpretation

Data Visualization

First of all, you have to think about why you want to do this data analysis: you need to find out the purpose or aim of doing it, and decide which type of data analysis you want to perform. In this phase, you have to decide what to analyze and how to measure it; you have to understand why you are investigating and what measures you will use to carry out this Analysis.

After requirement gathering, you will have a clear idea of what you have to measure and what your findings should be. Now it is time to collect your data based on those requirements. Once you collect the data, remember that it must be processed or organized for Analysis. Since you collect data from various sources, you must keep a log of the collection date and source of each data set.
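One lightweight way to keep such a log, sketched below under the assumption that the sources are CSV files readable with pandas, is to record the source and collection date alongside each dataset as it is loaded; the file names are hypothetical.

    # A minimal sketch of logging the source and collection date while
    # gathering data. The file names are hypothetical.
    from datetime import date
    import pandas as pd

    collection_log = []

    def collect(path):
        df = pd.read_csv(path)
        collection_log.append({
            "source": path,
            "collected_on": date.today().isoformat(),
            "rows": len(df),
        })
        return df

    # sales = collect("sales_export.csv")      # hypothetical source
    # surveys = collect("survey_results.csv")  # hypothetical source
    print(collection_log)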

Whatever data is collected may not all be useful, or may be irrelevant to the aim of your Analysis, so it should be cleaned. The collected data may contain duplicate records, white space, or errors; it should be cleaned and made error-free. This phase must be done before Analysis, because the output of your Analysis will be closer to your expected outcome when it is based on clean data.
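A minimal cleaning pass along those lines, assuming pandas and an invented raw table, might drop duplicate records, strip stray white space, and discard rows with obvious errors:

    # A minimal data-cleaning sketch; the raw records are invented.
    import pandas as pd

    raw = pd.DataFrame({
        "customer": ["  Alice", "Bob ", "Bob ", "Carol", None],
        "amount":   [100.0,     55.0,   55.0,   -1.0,   80.0],
    })

    clean = (
        raw.drop_duplicates()                                     # duplicate records
           .dropna(subset=["customer"])                           # missing values
           .assign(customer=lambda d: d["customer"].str.strip())  # white space
    )
    clean = clean[clean["amount"] >= 0]                           # an obviously bad value

    print(clean)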

Once the data is collected, cleaned, and processed, it is ready for Analysis. As you manipulate data, you may find you have the exact information you need, or you might need to collect more data. During this phase, you can use data analysis tools and software which will help you to understand, interpret, and derive conclusions based on the requirements.

Data visualization is very common in day-to-day life; it often appears in the form of charts and graphs. In other words, data is shown graphically so that it is easier for the human brain to understand and process. Data visualization is often used to discover unknown facts and trends; by observing relationships and comparing datasets, you can find meaningful information.
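As a small, hedged example of this phase, the snippet below draws a simple bar chart from invented monthly totals using Matplotlib (assumed to be installed):

    # A minimal visualization sketch; the monthly totals are invented.
    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr"]
    orders = [120, 135, 128, 160]

    plt.bar(months, orders)
    plt.title("Orders per month")
    plt.xlabel("Month")
    plt.ylabel("Orders")
    plt.show()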


The SQL Tutorial for Data Analysis


This tutorial is designed for people who want to answer questions with data. For many, SQL is the "meat and potatoes" of data analysis—it's used for accessing, cleaning, and analyzing data that's stored in databases. It's very easy to learn, yet it's employed by the world's largest companies to solve incredibly challenging problems.

In particular, this tutorial is meant for aspiring analysts who have used Excel a little bit but have no coding experience.

In this lesson we'll cover:

How the SQL Tutorial for Data Analysis works

What is SQL?

How do I pronounce SQL?

What's a database?

Though some of the lessons may be useful for software developers using SQL in their applications, this tutorial doesn't cover how to set up SQL databases or how to use them in software applications—it is not a comprehensive resource for aspiring software developers.

The entire tutorial is meant to be completed using Mode, an analytics platform that brings together a SQL editor, Python notebook, and data visualization builder. You should open another browser window to Mode. You'll retain the most information if you run the example queries, try to understand the results, and complete the practice exercises.

Note: You will need to have a Mode user account in order to start the tutorial. You can sign up for one at mode.com .

What is SQL?

SQL (Structured Query Language) is a programming language designed for managing data in a relational database. It's been around since the 1970s and is the most common method of accessing data in databases today. SQL has a variety of functions that allow its users to read, manipulate, and change data. Though SQL is commonly used by engineers in software development, it's also popular with data analysts for a few reasons:

SQL is great for performing the types of aggregations that you might normally do in an Excel pivot table—sums, counts, minimums and maximums, etc.—but over much larger datasets and on multiple tables at the same time.
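As a hedged illustration of that kind of pivot-table-style aggregation, the sketch below builds a small throwaway table in an in-memory SQLite database from Python and runs a GROUP BY query over it; the table and figures are invented for illustration and are not part of Mode's tutorial data.

    # A minimal sketch of a pivot-table-style aggregation in SQL,
    # run against an in-memory SQLite database. The data is invented.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [("east", 120.0), ("east", 80.0),
                      ("west", 200.0), ("west", 50.0)])

    # Sum, count, min and max per region -- the kind of summary an
    # Excel pivot table would produce, but expressed as a query.
    for row in conn.execute("""
        SELECT region, SUM(amount), COUNT(*), MIN(amount), MAX(amount)
        FROM orders
        GROUP BY region
    """):
        print(row)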

How do I pronounce SQL?

We have no idea.

What's a database?

From Wikipedia: A database is an organized collection of data.

There are many ways to organize a database and many different types of databases designed for different purposes. Mode's structure is fairly simple:

If you've used Excel, you should already be familiar with tables—they're similar to spreadsheets. Tables have rows and columns just like Excel, but are a little more rigid. Database tables, for instance, are always organized by column, and each column must have a unique name. To get a sense of this organization, the image below shows a sample table containing data from the 2010 Academy Awards:

[Image: a sample database table of data from the 2010 Academy Awards]

Broadly, within databases, tables are organized in schemas . At Mode, we organize tables around the users who upload them, so each person has his or her own schema. Schemas are defined by usernames, so if your username is databass3000, all of the tables you upload will be stored under the databass3000 schema. For example, if databass3000 uploads a table on fish food sales called fish_food_sales , that table would be referenced as databass3000.fish_food_sales . You'll notice that all of the tables used in this tutorial series are prefixed with "tutorial." That's because they were uploaded by an account with that username.
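To make the naming convention concrete, a query against that hypothetical table would reference it with the schema prefix. The sketch below assumes a Postgres-compatible warehouse reachable through the psycopg2 driver; the connection string is a placeholder.

    # A minimal sketch of querying a schema-qualified table.
    # The DSN is a placeholder and the table is the hypothetical
    # databass3000 example described above.
    import psycopg2

    conn = psycopg2.connect("dbname=warehouse user=analyst")  # placeholder DSN
    cur = conn.cursor()

    # Tables are referenced as schema.table.
    cur.execute("SELECT * FROM databass3000.fish_food_sales LIMIT 5")
    for row in cur.fetchall():
        print(row)

    cur.close()
    conn.close()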

You're on your way!

Now that you're familiar with the basics, it's time to dive in and learn some SQL.


Analytical Database Guide: A Criteria for Choosing the Right One

Nov 23, 2015

By Stephen Levin

When your analytics questions run into the edges of out-of-the-box tools, it’s probably time for you to choose a database for analytics. It’s not a good idea to write scripts to query your production database, because you could reorder the data and likely slow down your app. You might also accidentally delete important info if you have data analysts or engineers poking around in there.

You need a separate kind of database for analysis. But which one is right?

In this post, we’ll go over suggestions and best practices for the average company that’s just getting started. Whichever setup you choose, you can make tradeoffs along the way to improve performance beyond what we discuss here.

Working with lots of customers to get their DB up and running, we’ve found that the most important criteria to consider are:

the type of data you’re analyzing

how much of that data you have

your engineering team focus

how quickly you need it

What is an analytics database?

An analytics database, also called an analytical database, is a data management platform that stores and organizes data for the purpose of business intelligence and analytics. Analytics databases are read-only systems that specialize in quickly returning queries and are more easily scalable. They are typically part of a broader data warehouse.

What types of data are you analyzing?

Think about the data you want to analyze. Does it fit nicely into rows and columns, like a ginormous Excel spreadsheet? Or would it make more sense if you dumped it into a Word Doc?

If you answered Excel, a relational database like Postgres, MySQL, Amazon Redshift or BigQuery will fit your needs. These structured, relational databases are great when you know exactly what kind of data you’re going to receive and how it links together — basically how rows and columns relate. For most types of analytics for customer engagement , relational databases work well. User traits like names, emails, and billing plans fit nicely into a table as do  user events and their properties .

On the other hand, if your data fits better on a sheet of paper, you should look into a non-relational (NoSQL) database like Hadoop or Mongo.

Non-relational databases excel with extremely large amounts of data points (think millions) of semi-structured data. Classic examples of semi-structured data are texts like email, books, and social media, audio/visual data, and geographical data. If you’re doing a large amount of text mining, language processing, or image processing, you will likely need to use non-relational data stores.


How much data are you dealing with?

The next question to ask yourself is how much data you’re dealing with. If you're dealing with large volumes of data, then it's more helpful to have a non-relational database, because it won’t impose constraints on incoming data, allowing you to write faster and with scalability in mind.

Here’s a handy chart to help you figure out which option is right for you.

[Chart: recommended database type by data volume]

These aren’t strict limitations and each can handle more or less data depending on various factors — but we’ve found each to excel within these bounds.

If you’re under 1 TB of data, Postgres will give you a good price to performance ratio. But, it slows down around 6 TB. If you like MySQL but need a little more scale,  Aurora  (Amazon’s proprietary version) can go up to 64 TB. For petabyte scale, Amazon Redshift is usually a good bet since it’s optimized for running analytics up to 2PB. For parallel processing or even MOAR data, it’s likely time to look into Hadoop.

That said, AWS has told us they run Amazon.com on Redshift, so if you’ve got a top-notch team of DBAs you may be able to scale beyond the 2PB “limit.”

What is your engineering team focused on?

This is another important question to ask yourself in the database discussion. The smaller your overall team, the more likely it is that you’ll need your engineers focusing mostly on building product rather than database pipelines and management. The number of folks you can devote to these projects will greatly affect your options.

With some engineering resources you have more choices — you can go either to a relational or non-relational database. Relational DBs take less time to manage than NoSQL.

If you have some engineers to work on the setup, but can’t put anyone on maintenance, choosing something like  Postgres ,  Google SQL  (a hosted MySQL option) or  Segment Warehouses  (a hosted Redshift) is likely a better option than Redshift, Aurora or BigQuery, since those require occasional data pipeline fixes. With more time for maintenance, choosing Redshift or BigQuery will give you faster queries at scale.

Side bar: You can use Segment to collect customer data from anywhere and send it to your data warehouse of choice.

Relational databases come with another advantage: you can use SQL to query them . SQL is well-known among analysts and engineers alike, and it’s easier to learn than most programming languages.

On the other hand, running analytics on semi-structured data generally requires, at a minimum, an object-oriented programming background, or better, a code-heavy data science background. Even with the very recent emergence of analytics tools like  Hunk  for Hadoop, or  Slamdata  for MongoDB, analyzing these types of data sets will require an advanced analyst or data scientist.

How quickly do you need that data?

While  “real-time analytics”  is all the rage for use cases like fraud detection and system monitoring, most analyses don’t require real-time data or immediate insights.

When you’re answering questions like what is causing users to churn or how people are moving from your app to your website, accessing your data sources with a slight lag (hourly or daily intervals) is fine. Your data doesn’t change THAT much minute-by-minute.

Therefore, if you’re mostly working on after-the-fact analysis, you should go for a database that is optimized for analytics, like Redshift or BigQuery. These kinds of databases are designed under the hood to accommodate a large amount of data and to quickly read and join data, making queries fast. They can also load data reasonably fast (hourly) as long as you have someone vacuuming, resizing, and monitoring the cluster.

If you absolutely need real-time data, you should look at an unstructured database like Hadoop. You can design your Hadoop database to load very quickly, though queries may take longer at scale depending on RAM usage, available disk space, and how you structure the data.

Postgres vs. Amazon Redshift vs. Google BigQuery

You’ve probably figured out by now that for most types of user behavior analysis, a relational database is going to be your best bet. Information about how your users interact with your site and apps can easily fit into a structured format.

analytics.track('Completed Order')  —  select * from ios.completed_order


So now the question is, which SQL database should you use? There are a few criteria to consider.

Scale vs. Speed

When you need speed, consider Postgres: under 1TB, Postgres is quite fast for loading and querying. Plus, it’s affordable. As you get closer to its 6TB limit (inherited from Amazon RDS), your queries will slow down.

That’s why when you need  scale , we usually recommend you check out Redshift. In our experience we’ve found Redshift to have the best cost to value ratio.

Flavor of SQL

Redshift is built on a variation of Postgres, and both support good ol’ SQL. Redshift doesn’t support every single data type and function that Postgres does, but it’s much closer to the industry standard than BigQuery, which has its own flavor of SQL.

Unlike many other SQL-based systems, BigQuery uses comma syntax to indicate table unions rather than joins, according to its docs. This means that, without care, regular SQL queries might error out or produce unexpected results. As a result, many teams we’ve met have trouble convincing their analysts to learn BigQuery’s SQL.

Third-party Ecosystem

Rarely does your data warehouse live on its own. You need to get the data into the database, and you need to use some sort of software on top for data analysis. (Unless you’re a-run-SQL-from-the-command-line kind of gal.)

That’s why folks often like that Redshift has a very large ecosystem of third-party tools. AWS has options like  Segment Data Warehouses  to load data into Redshift from an analytics API, and they also work with nearly every data visualization tool on the market. Fewer third-party services connect with Google, so pushing the same data into BigQuery may require more engineering time, and you won’t have as many options for BI software.

You can see Amazon’s partners  here , and Google’s  here .

That said, if you already use Google Cloud Storage instead of Amazon S3, you may benefit from staying in the Google ecosystem. Both services make loading data easiest if it already exists in their respective cloud storage repository, so while it won’t be a deal breaker either way, it’s definitely easier to stay with the provider whose storage you already use.

Getting Set Up

Now that you have a better idea of what database to use, the next step is figuring out how you’re going to get your data into the database in the first place.

Many people that are new to database design underestimate just how hard it is to build a scalable data pipeline. You have to write your own extraction layer, data collection API, queuing and transformation layers. Each has to scale. Plus, you need to figure out the right schema down to the size and type of each column. The MVP is replicating your production database in a new instance, but that usually means going with a database that’s not optimized for analytics.

Luckily, there are a few options on the market that can help bypass some of these hurdles and automatically do the ETL for you.

But whether you build or buy, getting data into SQL is worth it.

Only with your raw user data in a flexible, SQL format can you answer granular questions about what your customers are doing, accurately measure attribution, understand cross-platform behavior, build company-specific dashboards, and more.


