
BDA Viva Questions: Cheat Sheet of Data Analysis & Statistical Methods

I have written up some questions that might be asked during the BDA practical viva.

Module 1

Q1. Give the difference between the traditional data management and analytics approach versus the big data approach.

Q2. What is Hadoop? Explain the Hadoop ecosystem with its core components. Explain the physical architecture of Hadoop. State its limitations.

The Hadoop ecosystem is a platform or framework that helps in solving big data problems. It comprises different components and services (for ingesting, storing, analyzing, and maintaining data). The four core components of Hadoop are HDFS, YARN, MapReduce and Hadoop Common; the wider ecosystem also includes tools such as Spark and ZooKeeper.

Components of Hadoop:
● HDFS is the primary component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data.
● Yet Another Resource Negotiator (YARN), as the name implies, manages the resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
● MapReduce: by making use of distributed and parallel algorithms, MapReduce carries the processing logic to the data and helps developers write applications that transform big data sets into manageable ones.
  ○ Map() performs sorting and filtering of data, thereby organizing it into groups.
  ○ Reduce(), as the name suggests, does the summarization by aggregating the mapped data.

Physical architecture of Hadoop:

Name Node
● It is the master of HDFS (the Hadoop Distributed File System).
● Runs the Job Tracker, which keeps track of files distributed to different data nodes.
● Failure of the Name Node leads to failure of the whole Hadoop system.

Data Node
● The data node is the slave of HDFS.
● Data nodes can communicate with each other through the name node to avoid duplicating the assigned task.
● Data nodes report changes back to the name node.

Job Tracker
● Determines which files to process.
● There can be only one job tracker per Hadoop cluster.

Task Tracker
● Only a single task tracker is present per slave node.
● Performs the tasks given by the job tracker and continuously communicates with it.

SSN (Secondary Name Node)
● Its main purpose is to take periodic checkpoints of the Name Node's metadata (it is not a hot standby).
● One SSN is present per cluster.

Limitations of Hadoop
● Issues with small files.
● Slow processing speed.

Q7. What are the technical challenges of big data?

Challenge #1: Storing huge quantities of data.
● No single storage machine is big enough to store the relentlessly growing quantity of data, so it needs to be stored across a large number of smaller, inexpensive machines.
● There is the inevitable challenge of machine failure: failure of a machine could entail the loss of the data stored on it.
Solution: Distribute the data across a large, scalable cluster of inexpensive commodity machines.
● Ensure that every piece of data is systematically replicated on multiple machines so that at least one copy is always available.
● Add more machines as needed.
● The Hadoop Distributed File System (HDFS) is popular for managing large volumes of data in this way.

Challenge #2: Ingesting and processing streams at a fast pace.
● Unpredictable, torrential streams of data may be too large to store in full, but must still be monitored.
Solution: Create fast, scalable ingestion and processing systems.
● Such systems can open a very large number of channels for receiving data. Data can be held in queues, from which business applications read and process it at their own pace and convenience. Apache Kafka is a popular dedicated ingestion system.
● A stream-processing engine can do its work while batch processing does its own. Apache Spark is the most popular system for streaming applications.

Challenge #3: Handling a variety of forms and functions of data.
● Storing such data in traditional flat or relational structures would be impractical, wasteful and slow.
● Accessing and analyzing it requires different capabilities.
Solution: Use non-relational systems that relax many of the stringent conditions of the relational model.
● These are called NoSQL (Not only SQL) databases. They are optimized for particular tasks such as query processing, graph processing, document processing, etc.
● HBase and Cassandra are two of the better-known NoSQL databases.

Challenge #4: Processing the huge quantities of data.
● Moving large amounts of data from storage to the processor would consume enormous network capacity and choke the network.
Solution: Move the processing to where the data is stored.
● Distribute the task logic throughout the cluster of machines where the data is stored; the machines work in parallel on the data assigned to them.
● A follow-up process consolidates the intermediate outputs and delivers the final results.

Module 2

Q1. What is the role of the Job Tracker and Task Tracker in MapReduce?

Job Tracker: The job tracker is a daemon that runs on the namenode for submitting and tracking MapReduce jobs in Hadoop. It assigns tasks to the different task trackers. In a Hadoop cluster there is only one job tracker but many task trackers. It is the single point of failure for the Hadoop MapReduce service: if the job tracker goes down, all running jobs are halted. It receives heartbeats from the task trackers, based on which it decides whether an assigned task has completed or not.

Task Tracker: The task tracker is also a daemon, and it runs on the datanodes. Task trackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the job tracker initializes the job, divides the work and assigns the pieces to different task trackers to perform the MapReduce tasks. While performing this work, each task tracker communicates with the job tracker by sending heartbeats. If the job tracker does not receive a heartbeat from a task tracker within the expected interval, it assumes the task tracker has failed and reschedules its tasks on other nodes.

Q2. Give MapReduce algorithms to perform the relational algebra operations selection, projection, union and intersection of two sets.

Relational-algebra notation:
• R, S – relations
• t, t′ – tuples
• C – a condition of selection
• A, B, C – subsets of attributes
• a, b, c – attribute values for a given subset of attributes

Selection
• Map: For each tuple t in R, test whether it satisfies C. If so, produce the key-value pair (t, t); that is, both the key and the value are t.
• Reduce: The Reduce function is the identity. It simply passes each key-value pair to the output.

Projection
• Map: For each tuple t in R, construct a tuple t′ by eliminating from t those components whose attributes are not in A. Output the key-value pair (t′, t′).
• Reduce: For each key t′ produced by any of the Map tasks, there will be one or more key-value pairs (t′, t′); the Reduce function outputs a single pair (t′, t′) for that key, thereby eliminating duplicates.

Union
• Map: Turn each input tuple t, from either relation R or S, into a key-value pair (t, t).
• Reduce: Associated with each key t there will be either one or two values. Produce the output (t, t) in either case.

Intersection
• Map: Turn each input tuple t, from either relation R or S, into a key-value pair (t, t).
• Reduce: If key t has the value list [t, t], produce (t, t); otherwise, produce nothing.

Q3. Explain the concept of MapReduce.

MapReduce is a Hadoop framework and programming model for processing big data using automatic parallelization and distribution in the Hadoop ecosystem. MapReduce consists of two essential tasks, Map and Reduce, and the reduce task always follows the map task. In the map task, data are divided into chunks and processed in parallel. The output of the map tasks is used as input to the reduce tasks, where the data is shuffled and reduced.
● Map() performs sorting and filtering of data, thereby organizing it into groups.
● Reduce(), as the name suggests, does the summarization by aggregating the mapped data.
MapReduce carries the processing logic to the data and helps developers write applications that transform big data sets into manageable ones.

The MapReduce process has the following phases:
● Input splits: MapReduce splits the input into smaller chunks called input splits.
● Mapping: the input data is processed and divided into smaller segments in the mapper phase, where the number of mappers is equal to the number of input splits.
● Shuffling: in the shuffling phase, the output of the mapper phase is passed to the reducer phase, with the values grouped by key.
● Sorting: the sorting phase involves merging and sorting the output generated by the mappers.
● Reducing: in the reducer phase, the intermediate values from the shuffling phase are reduced to a single output value per key, summarizing the data set.

Module 3

Q1. Why is NoSQL better than SQL? OR Explain the different ways in which big data problems are handled by NoSQL. OR Differentiate between an RDBMS and a NoSQL database.

NoSQL databases offer many benefits over relational databases: they have flexible data models, scale horizontally, support very fast queries, and are easy for developers to work with.

Flexible data models: NoSQL databases typically have very flexible schemas. A flexible schema allows you to easily make changes to your database as requirements change, so you can iterate quickly and continuously integrate new application features to provide value to your users faster.

Horizontal scaling: Most SQL databases require you to scale up vertically (migrate to a larger, more expensive server) when you exceed the capacity requirements of your current server. Conversely, most NoSQL databases allow you to scale out horizontally, meaning you can add cheaper commodity servers whenever you need to.

Q4. What do you understand by the BASE properties of NoSQL?

BASE stands for:
● Basically Available – rather than enforcing immediate consistency, BASE-modelled NoSQL databases ensure availability of data by spreading and replicating it across the nodes of the database cluster.
● Soft State – due to the lack of immediate consistency, data values may change over time. The BASE model breaks with the concept of a database that enforces its own consistency, delegating that responsibility to developers.
● Eventually Consistent – the fact that BASE does not enforce immediate consistency does not mean that it never achieves it. However, until it does, data reads are still possible (even though they might not reflect the latest writes).
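To make the BASE idea concrete, here is a minimal, self-contained Python sketch (not from the original notes) that simulates two replicas of a key-value store: a write is accepted by one replica and propagated asynchronously, so a read from the other replica may briefly return stale data before the replicas converge. The `Replica` class and the propagation delay are illustrative assumptions, not the behaviour of any specific NoSQL product.

```python
import threading
import time

class Replica:
    """A toy key-value replica; real systems add versioning and conflict resolution."""
    def __init__(self, name):
        self.name = name
        self.store = {}

    def read(self, key):
        # May return stale data: this is the "soft state" part of BASE.
        return self.store.get(key)

    def apply(self, key, value):
        self.store[key] = value

def write_with_async_replication(primary, secondary, key, value, delay=0.5):
    """Accept the write locally, then propagate it to the other replica after a delay."""
    primary.apply(key, value)          # basically available: the write is accepted immediately
    def propagate():
        time.sleep(delay)              # simulated network / anti-entropy lag
        secondary.apply(key, value)    # eventually consistent: replicas converge later
    threading.Thread(target=propagate).start()

if __name__ == "__main__":
    a, b = Replica("A"), Replica("B")
    write_with_async_replication(a, b, "user:1", "alice")
    print("read from A:", a.read("user:1"))              # 'alice'
    print("read from B:", b.read("user:1"))              # None -> a stale read is still answered
    time.sleep(1)                                        # wait for replication to finish
    print("read from B after sync:", b.read("user:1"))   # 'alice'
```

The read from replica B immediately after the write returns a stale value, yet the system keeps answering; this is exactly the trade-off BASE accepts in exchange for availability.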
Module 4

Q1. Explain the data-stream management architecture.

1. Input streams: the input streams have the following characteristics:
● There can be one or more input streams entering the system.
● The streams can have different data types.
● The rate of data flow of each stream may be different.
● Within a stream, the time interval between the arrival of data items may vary. For example, if the second item arrives 2 ms after the first, it is not necessary that the third item arrives 2 ms after the second; it may arrive earlier or later.

2. Stream processor: all types of processing on the input stream data, such as sampling, cleaning, filtering and querying, are done here. Two types of queries are supported: standing queries and ad-hoc queries.

3. Working store: a limited store, in main memory or on disk, is used as working storage for holding parts of the streams or summaries of them so that queries can be executed. If faster processing is needed, main memory is used; otherwise a secondary storage disk is used. As the working store is limited in size, it is not possible to store all the data received from all the streams.

4. Archival store: the archival store is a large storage area in which the streams may be archived, but execution of queries directly on the archival store is not supported. Fetching data from this store also takes much longer than fetching data from the working store.

5. Output streams: the output consists of the fully processed streams and the results of executing queries on the streams.

The difference between a conventional database-management system and a data-stream management system is that in a database-management system all of the data is available on disk and the system can control the rate of data reads. In a data-stream management system, on the other hand, the rate of arrival of data is not under the system's control, and the system has to allow for the possibility of data being lost and take the necessary precautionary measures.

Q2. What are the types of streams? Give examples.

Transactional data streams – logs of interactions between entities:
● Credit card – purchases by consumers from merchants.
● Telecommunications – phone calls by callers to the dialled parties.
● Web – accesses by clients to information at servers.

Measurement data streams:
● Sensor networks – readings of a physical natural phenomenon, road traffic.
● IP networks – traffic at router interfaces.
● Earth climate – temperature and humidity levels at weather stations.

Sources:

Sensor data – sensor data is used, for example, in navigation systems. Imagine a temperature sensor floating in the ocean, sending back to the base station a reading of the surface temperature each hour. The data generated by this sensor is a stream of real numbers. With 3.5 terabytes arriving every day, we certainly need to think about what can be processed as it arrives and what can only be archived.

Image data – satellites frequently send down to earth streams containing many terabytes of images per day. Surveillance cameras produce images with lower resolution than satellites, but there can be very many of them, each producing a stream of images at intervals of about one second.
Internet and web traffic – a switching node in the middle of the internet receives streams of IP packets from many inputs and routes them to its outputs. Websites receive streams of heterogeneous types; for example, Google receives on the order of a hundred million search queries per day.

Q5. What do you mean by counting distinct elements in a stream? Illustrate with an example the working of the Flajolet-Martin algorithm used to count the number of distinct elements. Which algorithm is used to count frequent elements?

Suppose stream elements are chosen from some universal set. We would like to know how many different elements have appeared in the stream, counting either from the beginning of the stream or from some known time in the past.

The Flajolet-Martin algorithm is used for estimating the number of unique elements in a stream in a single pass. The time complexity of this algorithm is O(n) and the space complexity is O(log m), where n is the number of elements in the stream and m is the number of unique elements.

Its idea is that the more different elements we see in the stream, the more different hash values we shall see. As we see more different hash values, it becomes more likely that one of these values will be "unusual". The particular unusual property we exploit is that the value ends in many 0's, although many other options exist.

The major components of this algorithm are:
○ a collection of hash functions, and
○ a bit string of length L such that 2^L > n; a 64-bit string is sufficient for most cases.

Each incoming element is hashed using all the hash functions. The higher the number of distinct elements in the stream, the higher the number of different hash values. On applying a hash function h to a stream element e, the hash value h(e) is produced, which we convert into an equivalent binary bit string. This bit string will end in some number of zeroes; for instance, the 5-bit string 11010 ends with one zero and 10001 ends with no zeroes. This count of trailing zeroes is known as the tail length. If R denotes the maximum tail length of any element e encountered so far in the stream, then the estimate of the number of unique elements in the stream is 2^R.

To see that this estimate makes sense, we use the following arguments from probability theory:
- If m >> 2^r, the probability of finding a tail of length at least r approaches 1.
- If m << 2^r, the probability of finding a tail of length at least r approaches 0.
So we can conclude that the estimate 2^R is neither going to be much too low nor much too high.

As an example, consider the stream 5, 3, 9, 2, 7, 11: hash each element, record the number of trailing zeroes (the tail length) of each hash value, take the maximum tail length R over the whole stream, and report 2^R as the estimate of the number of distinct elements.

Q6. What are the challenges when querying large data sets and data streams?
● The data model and query semantics must allow order-based and time-based operations (e.g. queries over a five-minute moving window).
● The inability to store a complete stream suggests the use of approximate summary structures; as a result, queries over the summaries may not return exact answers.
● Streaming query plans may not use blocking operators that must consume the entire input before any results are produced.
● Due to performance and storage constraints, backtracking over a data stream is not feasible; on-line stream algorithms are restricted to making only one pass over the data.
● Applications that monitor streams in real time must react quickly to unusual data values.
● Long-running queries may encounter changes in system conditions throughout their execution lifetimes (e.g. variable stream rates).
● Shared execution of many continuous queries is needed to ensure scalability.

Module 5

Q1. Define recommender system and its types. / Explain content-based recommendation. / Explain collaborative filtering.

A recommendation system is an artificial intelligence (AI) algorithm, usually associated with machine learning, that uses big data to suggest or recommend additional products to consumers. Recommendations can be based on various criteria, including past purchases, search history, demographic information and other factors. Recommender systems are highly useful as they help users discover products and services they might otherwise not have found on their own.

Recommender systems are trained to understand the preferences, previous decisions and characteristics of people and products using data gathered about their interactions, including impressions, clicks, likes and purchases. Because of their capability to predict consumer interests and desires on a highly personalized level, recommender systems are a favourite with content and product providers. They can drive consumers to just about any product or service that interests them, from books to videos to health classes to clothing.

Types / What is the category for recommendation of a list of favourites? What is the category for recommendation of the Top 10 or most popular videos on YouTube? / Explain hybrid recommendation.

Collaborative filtering algorithms recommend items (this is the filtering part) based on preference information from many users (this is the collaborative part). This approach uses the similarity of user preference behaviour: given previous interactions between users and items, the recommender algorithm learns to predict future interactions. These recommender systems build a model from a user's past behaviour, such as items purchased previously or ratings given to those items, and from similar decisions by other users. The idea is that if some people have made similar decisions and purchases in the past, such as a movie choice, then there is a high probability they will agree on additional future selections. For example, if a collaborative filtering recommender knows you and another user share similar tastes in movies, it might recommend a movie to you that it knows this other user already likes.

Content filtering, by contrast, uses the attributes or features of an item (this is the content part) to recommend other items similar to the user's preferences. This approach is based on the similarity of item and user features: given information about a user and the items they have interacted with (e.g. a user's age, the category of a restaurant's cuisine, the average review for a movie), it models the likelihood of a new interaction. For example, if a content filtering recommender sees that you liked the movies You've Got Mail and Sleepless in Seattle, it might recommend another movie to you with the same genres and/or cast, such as Joe Versus the Volcano.

Hybrid recommender systems combine the advantages of the types above to create a more comprehensive recommending system.

Q6. Explain the Girvan-Newman algorithm to mine social graphs.

Girvan-Newman is a divisive method: it starts with the full graph and breaks it up to find communities. It is too slow for many large networks (unless they are very sparse), and it tends to give relatively poor results for dense networks.

The edge betweenness of an edge e is

    betweenness(e) = Σ_{s≠t} σ_st(e) / σ_st

where σ_st is the total number of shortest paths from node s to node t and σ_st(e) is the number of those paths that pass through e.
The summation runs over n(n − 1) ordered pairs of nodes for directed graphs and n(n − 1)/2 pairs for undirected graphs.

Steps (BFS -> DAG -> recalculate -> split/graph partitioning):

Step 1. Find the edge of highest betweenness (or multiple edges of highest betweenness, if there is a tie) and remove these edges from the graph. This may cause the graph to separate into multiple components. If so, this is the first level of regions in the partitioning of the graph.

Step 2. Recalculate all betweenness values, and again remove the edge or edges of highest betweenness. This may break some of the existing components into smaller components; if so, these are regions nested within the larger regions.

Step 3. Proceed in this way as long as edges remain in the graph, in each step recalculating all betweenness values and removing the edge or edges of highest betweenness.

Q7. Give applications of social network mining.

A social graph is a diagram that illustrates the interconnections among people, groups and organizations in a social network. The term is also used to describe an individual's social network. When portrayed as a map, a social graph appears as a set of network nodes that are connected by lines.

Applications: social media applications; information networks (documents, web graphs, patents); infrastructure networks (roads, planes, water pipes, power grids); biological networks (genes, proteins, food webs of animals eating each other); product co-purchasing networks (e.g. Groupon).

Q8. How are classification algorithms used in recommendation systems?

Classification is used to model the user's interest: by applying a classifier to a new item we get a probability that the user will like it, and numeric scores also indicate the degree of interest in a particular item. A few of the techniques are:
(1) Decision trees and rule induction
(2) Nearest-neighbour methods
(3) Euclidean distance metric
(4) Cosine similarity function
Some other classification approaches are:
1) Relevance feedback and Rocchio's algorithm
2) Linear classification
3) Probabilistic methods
4) Naive Bayes
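As an illustration of the nearest-neighbour and cosine-similarity techniques listed above, here is a minimal, self-contained Python sketch (not from the original notes). It represents items as simple feature vectors, builds a user profile as the average of the vectors of liked items, and scores unseen items by cosine similarity to that profile; the item names and feature values are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def user_profile(liked_vectors):
    """Average the feature vectors of items the user liked (content-based profile)."""
    n = len(liked_vectors)
    return [sum(col) / n for col in zip(*liked_vectors)]

# Hypothetical item features: [action, romance, comedy]
items = {
    "You've Got Mail":        [0.0, 0.9, 0.6],
    "Sleepless in Seattle":   [0.0, 0.8, 0.5],
    "Joe Versus the Volcano": [0.1, 0.6, 0.7],
    "Die Hard":               [0.9, 0.1, 0.2],
}

liked = ["You've Got Mail", "Sleepless in Seattle"]
profile = user_profile([items[name] for name in liked])

# Score and rank the items the user has not seen yet
candidates = [name for name in items if name not in liked]
ranked = sorted(candidates,
                key=lambda name: cosine_similarity(profile, items[name]),
                reverse=True)
for name in ranked:
    print(f"{name}: {cosine_similarity(profile, items[name]):.3f}")
```

The same similarity function also underlies user-user collaborative filtering, with rating vectors taking the place of content features.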