Web Mining

Web Mining Introduction

 What is Web Mining?   
 


          Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services. Web mining deals with three main areas: web content mining, web usage mining and web structure mining. In web usage mining the goal is to find the habits of a website's users and the relations between the pages they are looking for. To identify the actual users, some filtering has to be done to remove the bots that index the structure of a website.
          Robots view all pages and links on a website to find relevant content. This creates many calls to the web server and thereby paints a false image of the actual web usage. The paper we have chosen to start with [Tang et al. 2002] does not discuss web content and web structure mining in depth, but instead looks more closely at web usage mining. This field aims to describe relations between web pages based on the interests of users, i.e. finding links that are often clicked in a specific order and are therefore of greater relevance to the user. The patterns revealed can then be used to create a more visitor-customized website by highlighting or otherwise exposing web pages to increase commerce. This is often demonstrated as a price cut on one product that increases sales of another. On the other hand, it is also important not to misclassify actual users who make thorough searches of a website and label them as robots.
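          A minimal sketch of that bot-filtering step is shown below. It is illustrative only: it assumes a hypothetical access.log file in the common "combined" log format, and a real system would also look at robots.txt requests, known robot IP ranges and browsing speed rather than relying on the user-agent string alone.

```python
import re

# Common "combined" log format: ip - - [time] "request" status bytes "referrer" "user-agent"
LINE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
                  r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')
BOT_HINTS = ("bot", "crawler", "spider", "slurp")   # typical robot user-agent markers

human_requests = []
with open("access.log") as log:                     # hypothetical log file
    for raw in log:
        m = LINE.match(raw)
        if not m:
            continue
        agent = m.group("agent").lower()
        if any(hint in agent for hint in BOT_HINTS):
            continue                                # drop robot traffic before usage mining
        human_requests.append((m.group("ip"), m.group("request")))

print(f"{len(human_requests)} requests kept for usage mining")
```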

 1. Motivation / Opportunity
 The WWW is a huge, widely distributed, global information service centre and therefore constitutes a rich source for data mining.

 01 Personalization, Recommendation Engines.
 02 Web-commerce applications. 
 03 Building the Semantic Web. 
 04 Intelligent Web Search.
 05 Hypertext classification and Categorization. 
 06 Information / trend monitoring. 
 07 Analysis of online communities. 

 2. The Web 
 01. Over 1 billion HTML pages, 15 terabytes
 02. Wealth of information a. Bookstores, restaurants, travel, malls, dictionaries, news, stock quotes, yellow & white pages, maps, markets, ...
 b. Diverse media types: text, images, audio, video c. Heterogeneous formats: HTML, XML, postscript, pdf, JPEG, MPEG, MP3 
03. Highly Dynamic a. 1 million new pages each day b. Average page changes in a few weeks 
04. Graph structure with links between pages a. Average page has 7-10 links b. in-links and out-links follow power-law distribution 
05. Hundreds of millions of queries per day 

 3. Abundance and authority crisis. 
 01. Liberal and informal culture of content generation and dissemination 
 02. Redundancy and non-standard form and content 
 03. Millions of qualifying pages for most broad queries. Example: “java” or “kayaking”
 04. No authoritative information about the reliability of a site 
 05. Little support for adapting to the background of specific users 

 4. One Interesting Approach 
 01. The number of web servers was estimated by sampling and testing random IP address numbers and determining the fraction of such tests that successfully located a web server 
 02. The estimate of the average number of pages per server was obtained by crawling a sample of the servers identified in the first experiment 
 03. Lawrence, S. and Giles, C. L. (1999). Accessibility of information on the web. Nature, 400(6740): 107–109. 
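 A back-of-the-envelope version of that estimate, with entirely made-up sample numbers (only the method follows the description above), might look like:

```python
# Illustrative only: the counts below are invented, not the paper's data.
ip_space = 2 ** 32                 # size of the IPv4 address space
sampled_ips = 50_000               # random IP addresses probed
responding = 33                    # hypothetical probes that answered on port 80

servers = ip_space * responding / sampled_ips      # estimated number of web servers
pages_per_server = 289                             # hypothetical average from a sample crawl
print(f"~{servers:,.0f} servers, ~{servers * pages_per_server:,.0f} pages")
```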

 5. Applications of web mining
 a. E-commerce (Infrastructure) 
 b. Generate user profiles -> improve customization and provide users with pages and advertisements of interest 
 c. Targeted advertising -> Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and E-commerce sites. 
Internet advertising is probably the “hottest” web mining application today 
 d. Fraud -> Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought). If the buying pattern changes significantly, signal possible fraud  
 e. Network Management 
 f. Performance management -> Annual bandwidth demand is increasing ten-fold on average, while annual bandwidth supply is rising only by a factor of three. The result is frequent congestion. During a major event (World Cup), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world 
 g. Fault management -> analyze alarm and traffic data to carry out root cause analysis of faults 
 h. Information retrieval (Search) on the Web 
i. Automated generation of topic hierarchies 
j. Web knowledge bases 

 6. Why is Web Information Retrieval Important? 
 a. According to most predictions, the majority of human information will be available on the Web in ten years
 b. Effective information retrieval can aid in: 
 c. Research: Find all papers about web mining 
 d. Health/Medicine: What could be the reason for symptoms of “yellow eyes”, high fever and frequent vomiting?
 e. Travel: Find information on the tropical island of St. Lucia
 f. Business: Find companies that manufacture digital signal processors
 g. Entertainment: Find all movies starring Marilyn Monroe between 1960 and 1970
 h. Arts: Find all short stories written by Jhumpa Lahiri 

 7. Why is Web Information Retrieval Difficult? 
 a. The Abundance Problem (99% of information of no interest to 99% of people)
 b. Hundreds of irrelevant documents returned in response to a search query
 c. Limited Coverage of the Web (Internet sources hidden behind search interfaces)
 d. Largest crawlers cover less than 18% of Web pages
 e. The Web is extremely dynamic 
f. Lots of pages added, removed and changed every day 
g. Very high dimensionality (thousands of dimensions) 
h. Limited query interface based on keyword-oriented search 
i. Limited customization to individual users 

 8. Web Mining Taxonomy 
 01. Web content mining: focuses on techniques for assisting a user in finding documents that meet a certain criterion (text mining) 
 02. Web structure mining: aims at developing techniques to take advantage of the collective judgement of web page quality which is available in the form of hyperlinks 
 03. Web usage mining: focuses on techniques to study the user behaviour when navigating the web (also known as Web log mining and clickstream analysis) 

 9. Web Content Mining 

 01. Can be thought of as extending the work performed by basic search engines 
 02. Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to the users 
 03. Web Content Mining is: the process of extracting knowledge from web contents 

 10. Structuring Textual Information
 01. Many methods designed to analyze structured data
 02. If we can represent documents by a set of attributes we will be able to use existing data mining methods
 03. How to represent a document?
 04. Vector based representation (referred to as “bag of words” as it is invariant to permutations)
 05. Use statistics to add a numerical dimension to unstructured text. 
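 A minimal sketch of such a vector representation, using two toy documents and plain Python, is shown below; real systems would normally add weighting (e.g. TF-IDF) on top of the raw counts.

```python
from collections import Counter

docs = ["web mining uses data mining techniques",
        "web usage mining studies user behaviour"]

tokenised = [d.lower().split() for d in docs]
vocabulary = sorted({term for doc in tokenised for term in doc})

# Each document becomes a vector of term counts over the shared vocabulary;
# word order is lost, which is why it is called a "bag" of words.
vectors = [[Counter(doc)[term] for term in vocabulary] for doc in tokenised]

print(vocabulary)
for vec in vectors:
    print(vec)
```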

 11. Text Mining 
 01. Document classification
 02. Document clustering
 03. Key-word based association rules
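 As an illustration of the clustering task, the sketch below groups a few made-up documents using TF-IDF vectors and k-means; it assumes scikit-learn is installed and is not tied to any particular method from the references above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "cheap flights and hotel deals for your holiday",
    "last minute holiday packages and flight offers",
    "mining association rules from keyword data",
    "clustering and classification of text documents",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)   # bag-of-words with TF-IDF weights
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for label, doc in zip(labels, docs):
    print(label, doc)
```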

 12. Web Search 
 1. Domain-specific search engines 
 a. www.buildingonline.com 
 b. www.lawcrawler.com
 c. www.drkoop.com (medical) 
 2. Meta-searching 
 a. Connects to multiple search engines and combines the search results
 b. www.metacrawler.com
 c. www.dogpile.com
 d. www.37.com 

 3. Post-retrieval analysis and visualization
 a. www.vivisimo.com 
 b. www.tumba.pt 
 c. www.kartoo.com 
 4. Natural language processing 
 a. www.askjeeves.com 
 5. Search agents 
 a. Instead of storing a search index, search agents perform real-time searches on the Web
 b. Fresher data, but slower response time and lower coverage 

13. Web Structure Mining
 First generation of search engines
 01. Early days: keyword based searches
      a. Keywords: “web mining”
      b. Retrieves documents with “web” and “mining”
02. Later on: cope with
a. synonymy problem
b. polysemy problem
c. stop words
03. Common characteristic: Only information on the pages is used.
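In essence, a first-generation engine answered queries from an inverted index over page text alone; a toy version of that idea (with made-up documents) looks like this:

```python
docs = {
    1: "web mining with data mining techniques",
    2: "gold mining equipment for sale",
    3: "introduction to web usage mining",
}

index = {}                                   # term -> set of document ids
for doc_id, text in docs.items():
    for term in text.lower().split():
        index.setdefault(term, set()).add(doc_id)

def search(query):
    """Return ids of documents containing every query keyword."""
    results = set(docs)
    for term in query.lower().split():
        results &= index.get(term, set())
    return sorted(results)

print(search("web mining"))   # -> [1, 3]; note that doc 2 still matches "mining" alone
```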

14. Modern search engines

 01. Link structure is very important a. Adding a link: deliberate act b. Harder to fool systems using in-links c. Link is a “quality mark”
 02. Modern search engines use link structure as an important source of information, as sketched below.
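 The sketch below shows the idea behind link-based scoring with a plain PageRank-style iteration over a tiny, invented link graph; it is illustrative only and not the algorithm of any particular engine.

```python
# Hypothetical link graph: page -> pages it links to
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
d = 0.85                                   # damping factor
pr = {p: 1.0 / len(pages) for p in pages}  # start with a uniform score

for _ in range(50):                        # power iteration
    new_pr = {p: (1 - d) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = d * pr[page] / len(outlinks)
        for target in outlinks:            # each out-link passes on a share of the score
            new_pr[target] += share
    pr = new_pr

for page, score in sorted(pr.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

 Pages with many in-links from well-scored pages (here "C") end up with the highest score, which is exactly the "link as a quality mark" intuition.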
 1. The Web Structure
      a. If the web is treated as an undirected graph, 90% of the pages form a single connected component    
      b. If the web is treated as a directed graph, four distinct components are identified, all four of similar size.

 15. Some statistics
 01. Only between about 25% of page pairs is there a connecting path, BUT
 02. If there is a path:
        a. Directed: average length < 17
        b. Undirected: average length < 7
 03. It's a small world -> between any two people there is an acquaintance chain of length only 6!
        a. High number of relatively small cliques
        b. Small diameter 

04. The Web (its SCC) is a small-world graph.
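These properties are easy to play with on a toy graph; the sketch below uses the networkx package (an assumption, not something from the original text) to compare directed and undirected path lengths on an invented five-page web.

```python
import networkx as nx

# Toy directed "web graph": nodes are pages, edges are hyperlinks
G = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "A"),
                ("C", "D"), ("D", "B"), ("E", "A")])

# Largest strongly connected component (the SCC mentioned above)
scc = max(nx.strongly_connected_components(G), key=len)
core = G.subgraph(scc)
print("pages in the SCC:", sorted(scc))
print("average directed path length in the SCC:",
      round(nx.average_shortest_path_length(core), 2))

# Ignoring link direction makes the graph far better connected
U = G.to_undirected()
print("average undirected path length:",
      round(nx.average_shortest_path_length(U), 2))
```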
16. Applications
Web mining is an important tool for gathering knowledge about the behaviour of a Website's visitors, and thereby allows for appropriate adjustments and decisions with respect to the Website's actual users and traffic patterns. Along with a description of the processes involved in Web mining, [Srivastava, 1999] states that Website Modification, System Improvement, Web Personalization and Business Intelligence are four major application areas for Web mining. These are briefly described in the following sections.

17. Website Modification
The content and structure of a Website are important to the user's experience and impression of the site and to the site's usability. The problem is that different types of users have different preferences, backgrounds, knowledge etc., making it difficult (if not impossible) to find a design that is optimal for all users. Web usage mining can be used to detect which types of users are accessing the website and how they behave, knowledge that can then be used to manually design or re-design the website, or to automatically change its structure and content based on the profile of the user visiting it. Adaptive Websites are described in more detail in [Perkowitz & Etzioni, 1998].

18. System Improvement
The performance and service of Websites can be improved using knowledge of the Web traffic in order to predict the navigation path of the current user. This may be used, e.g., for caching, load balancing or data distribution to improve performance. The path prediction can also be used to detect fraud, break-ins, intrusion etc. [Srivastava, 1999].

19. Web Personalization
Web Personalization is an attractive application area for Web-based companies, allowing recommendations, marketing campaigns etc. to be specifically customized for different categories of users and, more importantly, allowing this to be done in real time, automatically, as the user accesses the Website. For example, [Mobasher et al. 1999] and [Yan et al. 1996] use association rules and clustering to group users and to discover the type of user currently accessing the Website (based on the user's path through the Website), in real time, in order to dynamically adapt the hyperlinks and content of the Website.

20. Business Intelligence
For Web-based companies, Web mining is a powerful tool for collecting business intelligence to gain competitive advantages. Patterns of the customers' activities on the Website can be used as important knowledge in the decision-making process, e.g. for predicting customers' future behaviour, recruiting new customers and developing new products. There are many companies providing (among other things) services in the field of Web mining and Web traffic analysis for extracting business intelligence, e.g. [BizInetl, 2011] and [WebTrends, 2011].

 21. Summary

01. Web is huge and dynamic
02. Web mining makes use of data mining techniques to automatically discover and extract information from Web documents/services 
03. Web content mining
04. Web structure mining
05. Web usage mining
06. Semantic web: "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." -- Tim Berners-Lee, James Hendler, Ora Lassila.


Introduction to Search Engine Optimization



01. What is SEO? 
      Search engine optimization (SEO) refers to techniques that help your website rank higher in organic (or "natural") search results, thus making your website more visible to people who are looking for your product or service via search engines. SEO is part of the broader topic of Search Engine Marketing (SEM), a term used to describe all marketing strategies for search. SEM entails both organic and paid search. With paid search, you can pay to list your website on a search engine so that your website shows up when someone types in a specific keyword or phrase. Organic and paid listings both appear on the search engine, but they are displayed in different locations on the page. So, why is it important for your business's website to be listed on search engines? On Google alone, there are over 694,000 searches conducted every second.
      Think about that. Every second that your website is not indexed on Google, you are potentially missing out on hundreds, if not thousands, of opportunities for someone to visit your website, read your content, and potentially buy your product or service. Practicing SEO basics, as well as more advanced techniques after those, can drastically improve your website's ability to rank in the search engines and get found by your potential customers. What about paid search? Yes, you can pay to have your website listed on the search engines.
      However, running paid search campaigns can be quite costly if you don't know what you're doing. Not to mention, about 88% of search engine users never click on paid search ads anyway. Because the sole purpose of a search engine is to provide you with relevant and useful information, it is in everyone's best interest (for the search engine, the searcher, and you) to ensure that your website is listed in the organic search listings. In fact, it is probably best to stay away from paid search altogether until you feel you have a firm grasp on SEO and what it takes to rank organically.

02. How Search Engines Work
       Search engines have one objective – to provide you with the most relevant results possible in relation to your search query.
      If the search engine is successful in providing you with information that meets your needs, then you are a happy searcher. And happy searchers are more likely to come back to the same search engine time and time again because they are getting the results they need. In order for a search engine to be able to display results when a user types in a query, they need to have an archive of available information to choose from. Every search engine has proprietary methods for gathering and prioritizing website content. Regardless of the specific tactics or methods used, this process is called indexing. Search engines actually attempt to scan the entire online universe and index all the information so they can show it to you when you enter a search query.
      How do they do it? Every search engine has what are referred to as bots, or crawlers, that constantly scan the web, indexing websites for content and following links on each webpage to other webpages. If your website has not been indexed, it is impossible for your website to appear in the search results. Unless you are running a shady online business or trying to cheat your way to the top of the search engine results page (SERP), chances are your website has already been indexed. So, big search engines like Google, Bing, and Yahoo are constantly indexing hundreds of millions, if not billions, of webpages. How do they know what to show on the SERP when you enter a search query? The search engines consider two main areas when determining what your website is about and how to prioritize it.
      1. Content on your website: When indexing pages, the search engine bots scan each page of your website, looking for clues about what topics your website covers and scanning your website's back-end code for certain tags, descriptions, and instructions.
     
      2. Who's linking to you: As the search engine bots scan webpages for indexing, they also look for links from other websites. The more inbound links a website has, the more influence or authority it has. Essentially, every inbound link counts as a vote for that website's content. Also, each inbound link holds a different weight. For instance, a link from a highly authoritative website like The New York Times (nytimes.com) will give a website a bigger boost than a link from a small blog site. This boost is sometimes referred to as link juice. When a search query is entered, the search engine looks in its index for the most relevant information and displays the results on the SERP. The results are then listed in order of most relevant and authoritative. If you conduct the same search on different search engines, chances are you will see different results on the SERP. This is because each search engine uses a proprietary algorithm that considers multiple factors in order to determine what results to show in the SERP when a search query is entered.
     
      3. A few factors that a search engine algorithm may consider when deciding what information to show in the SERP include:
      a. Geographic location of the searcher
      b. Historical performance of a listing (clicks, bounce rates, etc.)
      c. Link quality (reciprocal vs. one-way)
      d. Webpage content (keywords, tags, pictures)
      e. Back-end code or HTML of the webpage
      f. Link type (social media sharing, link from media outlet, blog, etc.)
      With a $200B market cap, Google dominates the search engine market. Google became the leader by fundamentally revolutionizing the way search engines work and giving searchers better results with its advanced algorithm. With 64% market share, according to Compete, Inc., Google is still viewed as the primary innovator and master in the space. Before the days of Google (circa 1997), search engines relied solely on indexing web page content and considering factors like keyword density in order to determine what results to put at the top of the SERP. This approach gave rise to what are referred to as black-hat SEO tactics, as website engineers began intentionally stuffing their webpages with keywords so they would rank at the top of the search engines, even if their webpages were completely irrelevant to the search result.
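      The crawling and link-counting behaviour described above can be sketched in a few lines. The snippet below is illustrative only: it uses the requests and beautifulsoup4 packages, starts from a made-up URL, and ignores details a real crawler must handle (robots.txt, politeness delays, duplicate content detection).

```python
from collections import Counter, deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup

start_url = "https://example.com/"    # hypothetical starting point
max_pages = 20

inbound_links = Counter()             # crude "vote counting" for each linked URL
seen, queue = {start_url}, deque([start_url])

while queue and len(seen) <= max_pages:
    url = queue.popleft()
    try:
        html = requests.get(url, timeout=5).text
    except requests.RequestException:
        continue
    for tag in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link, _ = urldefrag(urljoin(url, tag["href"]))   # absolute URL without #fragment
        inbound_links[link] += 1                         # every inbound link counts as a vote
        if link.startswith(start_url) and link not in seen:
            seen.add(link)                               # only follow links within the start site
            queue.append(link)

print(inbound_links.most_common(5))
```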
     
       4. What it Takes to Rank
      It is not difficult to get your website to index and even rank on the search engines. However, getting your website to rank for specific keywords can be tricky. There are essentially three elements that a search engine considers when determining where to list a website on the SERP: rank, authority, and relevance.
      Rank
      Rank is the position that your website physically falls in on the SERP when a specific search query is entered. If you are the first website in the organic section of the SERP (don't be confused by the paid ads at the very top), then your rank is 1. If your website is in the second position, your rank is 2, and so on.
      As discussed previously in How Search Engines Work, your rank is an indicator of how relevant and authoritative your website is in the eyes of the search engine, as it relates to the search query entered. Tracking how your website ranks for a specific keyword over time is a good way to determine if your SEO techniques are having an impact. However, since there are so many other factors beyond your control when it comes to ranking, do not obsess over it. If your website jumps 1-5 spots from time to time, that's to be expected. It's when you jump 10, 20, 30 spots up in the rankings that it makes sense to pat yourself on the back.

       5. Authority
      As previously discussed in the How Search Engines Work section, search engines determine how authoritative and credible a website's content is by calculating how many inbound links (links from other websites) it has. However, the number of inbound links does not necessarily correlate with higher rankings. The search engines also look at how authoritative the websites that link to you are, what anchor text is used to link to your website, and other factors such as the age of your domain. You can track over time how authoritative your website is by monitoring a few different metrics. There are a variety of tools to help you keep track. HubSpot offers a free tool called Website Grader that will show you how many domains are linking to your website, and will also provide your website's MozRank. MozRank is SEOmoz's general, logarithmically scaled 10-point measure of global link authority or popularity. It is very similar in purpose to the measures of link importance used by the search engines (e.g., Google's PageRank).
      6. Relevance
      Relevance is one of the most critical factors of SEO. The search engines are not only looking to see that you are using certain keywords, but they are also looking for clues to determine how relevant your content is to a specific search query. Besides the actual text on your webpages, the search engines will review your website's structure, use of keywords in your URLs, page formatting (such as bolded text), and which keywords are in the headline of the webpage versus those in the body text. While there is no way to track how relevant your website is, there are some SEO basics you can practice to cover your bases and make sure you are giving the search engines every possible opportunity to consider your website.
      We'll get to that in just a bit. Search engines are extremely complex. Bottom line: the search engines are trying to think like human beings. It is very easy to get caught up in modifying your website's content just so you rank on the search engines. When in doubt, always err on the side of providing relevant and coherent content that your website's audience (your prospects) can digest. If you find yourself doing something solely for the search engines, you should take a moment to ask yourself why.
      7. Content is King
      We've all heard it - when it comes to SEO, content is king. Without rich content, you will find it difficult to rank for specific keywords and drive traffic to your website. Additionally, if your content does not provide value or engage users, you will be far less likely to drive leads and customers. It is impossible to predict how people will search for content and exactly what keywords they are going to use. The only way to combat this is to generate content and lots of it. The more content and webpages you publish, the more chances you have at ranking on the search engines. Lottery tickets are a good analogy here. The more lottery tickets you have, the higher the odds are that you will win.
      Imagine that every webpage you create is a lottery ticket. The more webpages you have, the higher your chances are of ranking in the search engines. As you already know, the search engines are smart. If you create multiple webpages about the same exact topic, you are wasting your time. You need to create lots of content that covers lots of topics. There are multiple ways you can use content to expand your online presence and increase your chances of ranking without being repetitive. Here are a few examples: Homepage: Use your homepage to cover your overall value proposition and high-level messaging.
      If there was ever a place to optimize for more generic keywords, it is your homepage. Product/Service Pages: If you offer products and/or services, create a unique webpage for each one of them. Resource Center: Provide a webpage that offers links to other places on your website that cover education, advice, and tips.
     
      Blog: Blogging is an incredible way to stay current and fresh while making it easy to generate tons of content. Blogging on a regular basis (once per week is ideal) can have a dramatic impact on SEO because every blog post is a new webpage. While conducting SEO research, you may come across articles that discuss being mindful of keyword density (how often you mention a keyword on a page). Although following an approach like this may seem technically sound, it is not recommended. Remember: do not write content for the search engines. Write content for your audience and everything else will follow.
      Make sure each webpage has a clear objective and remains focused on one topic, and you will do just fine.
      How to Approach Your SEO Strategy
      When developing an SEO strategy, it is best to split your initiatives into two buckets: on-page SEO and off-page SEO. On-page SEO covers everything you can control on each specific webpage and across your website to make it easy for the search engines to find, index, and understand the topical nature of your content. Off-page SEO covers all aspects of SEO that happen off your website to garner quality inbound links. Let's dive into on-page SEO first, and then we'll tackle off-page SEO in the next section.
     

Web Hosting

An Introduction: What is Web Hosting?



1. What is web hosting?
Web hosting simply means internet hosting that enables businesses and individuals to make their online presence, in the form of a website, accessible to the public via the internet. A website needs two things to be hosted, or to become accessible to everyone:
Web space: Website files, HTML code, images and everything else are stored in this space. The heavier your website, the more space you require to store its content.
Bandwidth: In the web hosting industry, bandwidth refers to the amount of data that can be transferred to and from a server or a website.
It is the allotted internet bandwidth that makes a website accessible to everyone online. The more bandwidth you have, the better and faster your network, connection and system. The bandwidth requirement is directly proportional to the number of visitors who visit a site: the more visitors, the more bandwidth is required. The requirement for space and bandwidth is fulfilled by a web hosting provider. However, in addition to these two, the hosting provider also maintains the server, ensures website uptime and provides data security.

2. Types of hosting: Depending upon your web space and bandwidth requirements, you can purchase hosting from three basic types available:
Shared Hosting: In this hosting, multiple accounts are hosted on the same server and the resources are shared among them. It is best for small businesses whose websites have low to moderate traffic and CPU needs. It is like an apartment building, where you pay less but share the space with many other people. You are also affected if your neighbour decides to party, as he and his visitors may create a disturbance or use up your parking space. Similarly, in shared hosting, if another user's site gets more traffic and uses more bandwidth, your site may go down.
3. Dedicated Server: In this hosting, the owner has the complete server and all its resources exclusively for himself. However, the convenience and solitude make it the costliest option in hosting. It is ideal for large enterprises and organizations whose websites have heavy traffic and CPU needs. It is like having your own home, where you can live according to your convenience but will also have to bear the costs of its purchase and maintenance alone. Similarly, in dedicated hosting you will pay more than for any other type of hosting, because the server, bandwidth and resources are all yours alone.
4. VPS Hosting: In this hosting, virtualization technology is used to partition a computer virtually into multiple servers. There is no physical partition, but because of the virtual (software) partition, each user gets much more privacy and security than in shared hosting.
It is like living in a comfortable condo, where you get your own privacy, fewer neighbours and better space, but you still share your walls and plot with others. The price for the added space is obviously more than apartment housing, but not as exorbitant as the cost of having your own home. Similarly, with VPS you are less affected by busy neighbouring websites, and the cost is more than shared hosting but less than dedicated hosting.
VPS hosting also gives you the freedom to install whatever you want on the server. You can try a new programming language or deploy a custom Apache module. As you have root access, you can make any changes you need using the command line or a remote desktop. VPS hosting offers many more advantages than shared hosting and almost the same advantages as dedicated hosting, while in terms of price it sits between the two.

How do I choose the best hosting company?
When choosing a hosting company, there are certain ingredients that can make the difference between online success and an internet disaster. Do you want to be able to contact your hosting company at 10pm if your site and email develop problems?

Do you know how secure and stable the provider's hosting setup is? Do you want to deal with someone who speaks plain English instead of technical jargon? There are a number of small hosting businesses set up in garages and basements or with limited staff resources. They may be cheaper, but the smaller the company, the fewer resources it has available to provide adequate customer support or servicing when you need it. It is not unknown for websites to go down, only for the owner to discover that the person responsible for fixing the server is on holiday or otherwise unavailable. Every time your website is down, your business loses money, so compromises on hosting prices can cost you in other ways. The most reliable hosting companies provide dedicated customer support and technical services, capable of dealing with your issues day and night.
If your website goes offline, you don't want to wait a couple of days before the problem is even discovered; you want to be sure your provider will identify and correct the issue before you even know about it. Also, although it is possible to host your website anywhere in the world, choosing a local hosting company can have real advantages. Dealing with a hosting company in the United States can be frustrating.
Long distance calls are expensive and email support can often be slow and unhelpful outside of US business hours. If your website is suffering costly delays during Australian business hours, you don’t want to wait for New York to wake up before it can be fixed. A reputable hosting provider should have enough back-up safeguards to ensure your website never goes down (well, at least 99.99% of the time).
The last thing you want is the server with your website on it floating through a flooded building with no appropriate back-up stored high-and-dry elsewhere. The same goes for the connections to the web. If the local road works chop through a Telstra cable, wouldn’t you feel better knowing your server was also linked to the net through at least one or two other connections, providing uninterrupted service? Make sure you know how your provider can guarantee your online store will remain open for business 24x7.
Netregistry houses servers in the largest data centre in the southern hemisphere, Global Switch in Pyrmont, as well as the E3 data centre in Alexandria, providing reliable, secure and, above all, local hosting.
How much uptime is good enough?
The last thing your website needs is downtime, which is why many hosting providers offer uptime guarantees. But how much is good enough? The difference between 99.5% and 99.9% uptime might not seem like much, but 99.5% uptime equates to roughly 44 hours of your website being unavailable to customers each year, compared with under 9 hours at 99.9%.
That is nearly two days of forced closure with no sales and no money coming in. A 0.4% difference can therefore cost your business a lot more than you may think.
What do I need to consider when choosing a hosting package?
There are a number of factors to consider when deciding on the best value hosting solution for your website. It is important to not only understand your current goals, but also to have an idea of how the website is likely to evolve.
In choosing a hosting plan, there are a few key considerations: data storage, data transfer, bandwidth, databases and scripting technology. Most hosting accounts come equipped with features such as email. Even so, some providers consider email an added extra for a fee. Check for the features you need beforehand so you have a clear idea of the total cost of the package.
How much data storage is necessary?
Most hosting plans contain more than enough data storage for the majority of websites.
But certain files take up more space than others. Lots of image, video or audio files can chew up storage space quickly. It is possible to reduce the size of many large files, so it is worth working with someone with the technical ability to compress the amount of data without compromising quality. If your website is likely to grow over time, this needs to be accounted for.
An extreme example would be a site like YouTube, needing to find storage for thousands of large video files every day. You will probably never be faced with that kind of growth, but even a few large files added regularly over time can soon eat up your storage space. Alternatively, simple text pages with only a few small images take relatively little storage space. You may be able to add hundreds of these pages before filling the space taken by a handful of audio or video files.
Where in the world is your web server?
Do you know where your hosting company is storing your data? Most Australian hosting is actually stored offshore, although you wouldn't know it. Storing your website overseas can also make it slower and less responsive.
Netregistry is 100% Australian owned and run, with the entire hosting infrastructure located and maintained in Sydney.
What is bandwidth?
Bandwidth is the 'pipe' that connects your website to the internet. It is a measure of the amount of digital information that can be accessed within a given time period.
For example, if your main webpage is 300 kilobytes in size, then every time a web browser accesses it, 300kb of data travels down the ‘pipe’. How fast this data is transferred depends on how many people are accessing data down the pipe at the same time and how large your pipe is. Larger files, such as audio and video, eat up bandwidth just as they eat up data storage.
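A rough, back-of-the-envelope estimate of what that means for a hosting plan (using invented traffic figures) can be worked out like this:

```python
# Illustrative figures only
page_size_kb = 300            # size of the main page, as in the example above
monthly_page_views = 50_000   # hypothetical traffic

monthly_transfer_gb = page_size_kb * monthly_page_views / 1024 / 1024
print(f"~{monthly_transfer_gb:.1f} GB of data transfer per month")

# Rough peak bandwidth if 100 visitors request that page within the same second
peak_mbps = page_size_kb * 8 * 100 / 1000
print(f"~{peak_mbps:.0f} Mbit/s needed at that peak")
```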
Putting a lot of large files on your website can clog the flow of data through the pipe and make your website much slower to load in a visitor's browser if a lot of people access that content at once. Even with small files, your bandwidth can still become bottlenecked. If your data connection can cope with a certain amount of data transfer per second, but the number of people trying to access your website exceeds this, it can also cause a traffic jam. With the data not getting through fast enough, website visitors can receive 'timed out' error messages instead of your wonderful webpage. If too much traffic tries to access data through this connection at the same time, it is possible to crash the server, putting your website offline until the problem can be fixed. This can make some websites a victim of their own success, becoming inaccessible just when they are at their most popular.
If you are expecting a large spike in traffic or an increase in large files, it is worth talking to your web host about bandwidth solutions. This could mean a transfer to a different plan or short-term strategies such as website mirrors (a duplicate of the website on a separate hosting server).
What is clustered hosting?
Hosting usually requires websites to share server space on the one unit, but imagine how upset you would be if the actions of another website impacted the performance of your business simply because it shares the same server.
If a website receives too much traffic or overloads (crashes) the server through increased activity, it affects everything stored in the same place. Clustered hosting spreads the load across multiple machines so that no one website can affect any other. This allows for a far more secure and stable hosting environment with fewer risks. Always choose a hosting provider that offers clustered hosting to be sure that your website remains operating at its best.

Should I worry about data transfer limits?
Hosting accounts usually have a data transfer limit. This is the amount of data the hosting package can provide to the internet within a given time period (usually a month). This is similar to your home internet account,
for example, where you pay so much for your connection with a fixed download limit. If you go over your monthly limit, depending on your provider, you may find your internet speed throttled to a crawl or be charged extra for every additional megabyte of data. Neither outcome is ideal if it happens to your website: it could become slow to load, or you could receive a larger bill to pay for the extra data transfer. Thankfully, you don't need to worry. All Netregistry hosting accounts have unlimited data transfer, no throttling and no additional data fees.
What is database technology?
More and more websites now allow visitors to interact with and manipulate information on the webpage. This may be by entering and registering their information, performing searches to be presented with a page of specific results, or entering comments into a blog or forum, to mention just three examples. These websites are called 'dynamic', and use database technology to store information in sections that can be reassembled into fresh webpages in answer to the visitor's request. Classic examples of dynamic websites are blogs, eBay or any site that allows a user to sign in and create a profile. Also, any website that uses a content management system or shopping cart works in the same way. If you plan to include any of these features on your website, you will need to check for database technology on your hosting server.
An example is Netregistry's Business Hosting, providing the most commonly used database features for small business. The database is also bounded by storage limits, and it is worth knowing what these are before planning a complex website. If you don't plan to use any dynamic features, you may only need basic 'static' hosting, such as Netregistry's Economy Hosting package, saving you money. Of course, if at any time you decide to develop the features on your website, it is possible to upgrade your hosting plan to the correct configuration.
What is Scripting?
The most common language used to code web pages is called HyperText Markup Language (HTML).
This code tells a web browser how to display the webpage. But when creating dynamic webpages, some additional scripting languages are needed to tell the web browser where to access the information needed to construct the elements on the page. Because your hosting server will need to interpret these scripts to provide the correct responses from the database, you need to be sure it has been configured with the relevant languages.

Common scripting languages are PHP, Perl, ASP, etc. Unless you are building the website yourself, it is unlikely you will need to know anything about these languages. Your web designer will be able to tell you which particular scripts to look for when choosing a hosting package, but most hosting servers with database technology support the majority of commonly used scripts.
Where can I get more advice?
You may now understand some of the basics, but relating them to your specific situation can sometimes require experience and further knowledge. If you are still unsure how to choose the best hosting plan for your website, the Netregistry sales team is trained in determining your specific needs. The best hosting plan is one that doesn't require regular attention or additional monthly fees, but can cope with the daily demands of your website without complaint. By addressing the key principles, you can ask the right questions of your hosting provider and give your website the best platform to reach your audience.
Netregistry has been providing strong hosting advice and solutions to Australian businesses since 1997. With static, dynamic and e-commerce hosting solutions available, Netregistry can provide the reliability and features you need to form the foundations for your new online empire.

SSL CERTIFICATE

WHY YOU NEED AN SSL CERTIFICATE 


                 Introduction Recent numbers from the U.S. Department of Commerce show that online retail is continuing its rapid growth. However, malicious phishing and pharming schemes and fear of inadequate online security cause online retailers to lose out on business as potential customers balk at doing business online, worrying that sensitive data will be abused or compromised. For e-businesses, the key is to build trust: Running a successful online business requires that your customers trust that your business effectively protects their sensitive information from intrusion and tampering. Installing an SSL Certificate from Starfield Technologies on your e-commerce Web site allows you to secure your online business and build customer confidence by securing all online transactions with up to 256-bit encryption.
                An SSL Certificate on your business’ Web site will ensure that sensitive data is kept safe from prying eyes. With a Starfield Technologies SSL Certificate, customers can trust your site. Before issuing a certificate, Starfield Technologies rigorously authenticates the requestor’s domain control and, in the case of High Assurance SSL Certificates, the identity and, if applicable, the business records of the certificate-requesting entity. The authentication process ensures that customers and business partners can rest assured that a Web site protected with a Starfield Technologies certificate can be trusted.
               A Starfield Technologies SSL Certificate provides the security your business needs and the protection your customers deserve. With a Starfield Technologies SSL Certificate, customers will know that your site is secure.
               Why You Need a Starfield Technologies SSL Certificate
               In the rapidly expanding world of electronic commerce, security is paramount. Despite booming Internet sales, widespread consumer fear that Internet shopping is not secure still keeps millions of potential shoppers from buying online.
               Only if your customers trust that their credit card numbers and personal information will be kept safe from tampering can you run a successful online business. For online retailers, securing their shopping sites is paramount. If consumers perceive that their credit card information might be compromised online, they are unlikely to do their shopping on the Internet. A Starfield Technologies SSL Certificate provides an easy, cost-effective and secure means to protect customer information and build trust.
               An SSL Certificate enables Secure Sockets Layer (SSL) encryption of your business’ online transactions, allowing you to build an impenetrable fortress around your customers’ credit card information.
 

Starfield Technologies SSL Certificates offer industry-leading security and versatility: 

1. Fully validated
2. Up to 256-bit encryption
3. One-, two- or three-year validity (Turbo SSL Certificates valid up to 10 years)
4. 99% browser recognition
5. Stringent authentication
6. Around-the-clock customer support
A Starfield Technologies SSL Certificate helps you build an impenetrable fortress around your customers’ credit card information.

What is an SSL Certificate? 

An SSL certificate is a digital certificate that authenticates the identity of a Web site to visiting browsers and encrypts information for the server via Secure Sockets Layer (SSL) technology. A certificate serves as an electronic “passport” that establishes an online entity’s credentials when doing business on the Web. When an Internet user attempts to send confidential information to a Web server, the user’s browser will access the server’s digital certificate and establish a secure connection.

          Information contained in the certificate includes:
          1. The certificate holder’s name (individual or company)*
          2. The certificate’s serial number and expiration date
          3. Copy of the certificate holder’s public key
          4. The digital signature of the certificate-issuing authority
          To obtain an SSL certificate, one must generate and submit a Certificate Signing Request (CSR) to a trusted Certification Authority, such as Starfield Technologies, which will authenticate the requestor’s identity, existence and domain registration ownership before issuing a certificate.
          Public and Private Keys
          When you create a CSR, the Web server software with which the request is being generated creates two unique cryptographic keys: a public key, which is used to encrypt messages to your (i.e., the certificate holder’s) server and is contained in your certificate, and a private key, which is stored on your local computer and “decrypts” the secure messages so they can be read by your server.
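          For illustration, the sketch below generates a key pair and a CSR with the Python cryptography package; the domain and organization names are placeholders, and the exact procedure and tooling your hosting environment expects may differ.

```python
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID

# Private key: stays on your server and is never sent to the Certification Authority
key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

# CSR: carries the public key plus the identifying details the CA will verify
csr = (
    x509.CertificateSigningRequestBuilder()
    .subject_name(x509.Name([
        x509.NameAttribute(NameOID.COMMON_NAME, "www.example.com"),        # placeholder
        x509.NameAttribute(NameOID.ORGANIZATION_NAME, "Example Pty Ltd"),  # placeholder
    ]))
    .sign(key, hashes.SHA256())
)

with open("example.key", "wb") as f:   # keep this file private
    f.write(key.private_bytes(serialization.Encoding.PEM,
                              serialization.PrivateFormat.TraditionalOpenSSL,
                              serialization.NoEncryption()))
with open("example.csr", "wb") as f:   # this file is submitted to the CA
    f.write(csr.public_bytes(serialization.Encoding.PEM))
```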
          In order to establish an encrypted link between your Web site and your customer’s Web browser, your Web server will match your issued SSL certificate to your private key. Because only the Web server has access to its private key, only the server can decrypt SSL-encrypted data.
          *High Assurance Certificates only. Turbo SSL Certificates only contain the domain name and no information on the individual or company that purchased the certificate.
          Enabling Safe and Convenient Online Shopping
           A Starfield Technologies SSL Certificate secures safe, easy and convenient Internet shopping. Once an Internet user enters a secure area — by entering credit card information, e-mail address or other personal data, for example — the shopping site’s SSL certificate enables the browser and Web server to build a secure, encrypted connection. The SSL “handshake” process, which establishes the secure session, takes place discreetly behind the scenes, ensuring an uninterrupted shopping experience for the consumer.
           A “padlock” icon in the browser’s status bar and the “https://” prefix in the URL are the only visible indications of a secure session in progress. By contrast, if a user attempts to submit personal information to an unsecured Web site (i.e., a site that is not protected with a valid SSL certificate), the browser’s built-in security mechanism will trigger a warning to the user, reminding him/her that the site is not secure and that sensitive data might be intercepted by third parties. Faced with such a warning, most Internet users likely will look elsewhere to make a purchase.
           Up to 256-Bit Encryption
           Starfield Technologies SSL certificates support both industry-standard 128-bit (used by all banking infrastructures to safeguard sensitive data) and high-grade 256-bit SSL encryption to secure online transactions. The actual encryption strength on a secure connection using a digital certificate is determined by the level of encryption supported by the user's browser and the server that the Web site resides on. For example, the combination of a Firefox browser and an Apache 2.x Web server enables up to 256-bit AES encryption with Starfield Technologies certificates.
          Encryption strength is measured in key length, i.e. the number of bits in the key. To decipher an SSL communication, one needs to generate the correct decoding key. Mathematically speaking, 2^n possible values exist for an n-bit key. Thus, 40-bit encryption involves 2^40 possible values, while 128- and 256-bit keys involve a staggering 2^128 and 2^256 possible combinations, respectively, rendering the encrypted data de facto impervious to intrusion. Even with a brute-force attack (the process of systematically trying all possible combinations until the right one is found), cracking 128- or 256-bit encryption is computationally infeasible.
          Stringent Authentication — A Matter of Trust
          Before Starfield Technologies issues an SSL Certificate, the applicant’s company or personal information undergoes a rigorous authentication procedure that serves to pre-empt online theft and to verify the domain control and, if applicable, the existence and identity of the requesting entity.
         Only through thorough validation of submitted data can the online customer rest assured that online businesses that utilize SSL certificates from Starfield Technologies indeed are to be trusted. SSL Certificates are only issued to entities whose domain control and, depending on certificate type, business credentials and contact information have been verified. Thus, a Starfield Technologies SSL certificate guarantees that the entity that owns the certificate is who it claims to be and has a legal right to use the domain from which it operates.

Starfield Technologies issues three types of SSL Certificates, each of which relies on authentication of a number of elements:
High Assurance Certificate — Corporate: Starfield Technologies will authenticate that:
1. The certificate is being issued to an organization that is currently registered with a government authority.
2. The requesting entity controls the domain in the request.
3. The requesting entity is associated with the organization named in the certificate.
High Assurance Certificate — Small Business/Sole Proprietor: Starfield Technologies will authenticate that:
1. The individual named in the certificate is the individual who requested the certificate.
2. The requesting individual controls the domain in the request.
Turbo SSL Certificate: Starfield Technologies will authenticate that:
1. The requesting entity controls the domain in the request.
Phishing and Pharming — How SSL Can Help
Phishing and, more recently, pharming pose constant threats to Internet users, whose sensitive information is under siege by crackers and other cyber crooks.
         
             An SSL certificate from Starfield Technologies can clip the wings of Internet criminals and help prevent Internet users from being victimized by phishing and pharming schemes when attempting to visit your Web site. Phishing schemes – attempts to steal and exploit sensitive personal information – typically try to trick victims into accessing fraudulent sites that pose as legitimate, trusted entities, such as online businesses and banks. Because perpetrators of such attacks will be using and registering domains that resemble those of the spoofed sites, Starfield Technologies, through its stringent fraud-prevention measures, will detect the schemes and deny certificate requests for suspicious domains.
           More sophisticated than phishing, pharming revolves around the concept of hijacking an Internet Service Provider’s (ISP) domain name server (DNS) entries. When a “pharmer” succeeds in such DNS “poisoning” every computer using that ISP for Internet access is directed to the wrong site when the user types in a URL (e.g., www.ebay.com). SSL certificate technology can help prevent pharming attacks, as well. In essence, a “pharmer” simply will not be able to obtain an SSL certificate from Starfield Technologies, as he/she does not control the domain for which the certificate is requested.
            If you protect your Web site with a Starfield Technologies SSL certificate, Internet users who attempt to access a site that poses as yours will be instantly alerted that there is a problem with the supposedly secure connection:
1. No lock icon: Because CAs usually won’t issue a certificate to fraudulent phishing or pharming sites, such sites usually do not use SSL encryption. Internet users, therefore, are alerted by the absence of a padlock icon in their browser’s status bar.
2. Name mismatch error: A pharming site could try to use a certificate issued by a CA for a domain owned by the attacker, but the user’s browser will warn the user that the visited URL does not match the certificate presented by the fake Web server.
3. Untrusted CA: A pharming site might attempt to use a certificate issued by an untrusted CA. In this case, the user’s browser will generate the following warning: “the security certificate was issued by a company you have not chosen to trust.” The alert Internet user will instantly abandon his/her activities/transactions when presented with such warnings. Thus, a Starfield Technologies SSL certificate provides business owners and wary, savvy Internet users with an effective weapon against phishing, pharming and similar cyber swindles.
Establishing a Secure Connection — How SSL Works
An SSL-encrypted connection is established via the SSL “handshake” process, which transpires within seconds — transparently to the end user.

In essence, the SSL “handshake” works thus:
1. When accessing an SSL-secured Web site area, the visitor’s browser requests a secure session from the Web server.
2. The server responds by sending the visitor’s browser its server certificate.
3. The browser verifies that the server’s certificate is valid, is being used by the Web site for which it has been issued, and has been issued by a Certificate Authority that the browser trusts.
4. If the certificate is validated, the browser generates a one-time “session” key and encrypts it with the server’s public key.
5. The visitor’s browser sends the encrypted session key to the server so that both server and browser have a copy.
6. The server decrypts the session key using its private key.
7. The SSL “handshake” process is complete, and an SSL connection has been established.
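From the client side, most of these steps are handled by the SSL/TLS library. The sketch below, using Python's standard ssl module and a placeholder host, performs the handshake, verifies the server certificate against trusted CAs, checks the hostname and prints a few certificate details.

```python
import socket
import ssl

hostname = "example.com"                 # placeholder host
context = ssl.create_default_context()   # loads trusted CA certificates, enables hostname checking

with socket.create_connection((hostname, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as tls:
        print("protocol:", tls.version())        # e.g. TLSv1.3
        print("cipher:", tls.cipher())
        cert = tls.getpeercert()                 # details of the validated server certificate
        print("subject:", cert["subject"])
        print("expires:", cert["notAfter"])
```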
           Once the handshake completes, a padlock icon appears in the browser’s status bar, indicating that a secure session is under way.
           Conclusion — The Key to Online Security
           Demand for reliable online security is increasing. Despite booming online sales, many consumers continue to believe that shopping online is less safe than doing so at old-fashioned brick-and-mortar stores.
          The key to establishing a successful online business is to build customer trust. Only when potential customers trust that their credit card information and personal data is safe with your business, will they consider making purchases on the Internet.
         

Data Warehousing




1. Data Warehousing

          The question most asked now is, "How do I build a data warehouse?" This is a question that is not so easy to answer. As you will see in this article, there are many approaches to building one. However, at the end of all the research, planning, and architecting, you will come to realize that it all starts with a firm foundation. Whether you are building a large centralized data warehouse, one or more smaller distributed data warehouses (sometimes called data marts), or some combination of the two, you will always come to the point where you must decide how the data is to be structured.
          This is, after all, one of the key concepts in data warehousing and what differentiates it from more typical operational database and decision support application building. That is, you structure the data and build applications around it, rather than structuring applications and bringing data to them.
           Data warehouse modeling is a process that produces abstract data models for one or more database components of the data warehouse. It is one part of the overall data warehouse development process, which comprises other major processes such as data warehouse architecture, design, and construction. We consider the data warehouse modeling process to consist of all tasks related to requirements gathering, analysis, validation, and modeling. Typically for data warehouse development, these tasks are difficult to separate. This may suggest a rather broad gap between modeling and design activities, which in reality certainly is not the case.
           The separation between modeling and design is done for practical reasons: it is our intention to cover the modeling activities and techniques quite extensively. Some trend-setting authors and data warehouse consultants have taken this point to what we consider to be the extreme. That is, they are presenting what they are calling a totally new approach to data modeling. It is called dimensional data modeling, or fact/dimension modeling. Fancy names have been invented to refer to different types of dimensional models, such as star models and snowflake models. Numerous arguments have been presented against traditional entity-relationship (ER) modeling, when used for modeling data in the data warehouse.
            Rather than taking this more extreme position, we believe that every technique has its area of usability. For example, we do support the many criticisms of ER modeling when considered in a specific context of data warehouse data modeling, and there are also criticisms of dimensional modeling. There are many types of data warehouse applications for which ER modeling is not well suited, especially those that address the needs of a well-identified community of data analysts interested primarily in analyzing their business measures in their business context.
             Likewise, there are data warehouse applications that are not well supported at all by star or snowflake models alone. For example, dimensional modeling is not very suitable for building large, corporatewide data models for a data warehouse. With the changing data warehouse landscape and the need for data warehouse modeling, the new modeling approaches and the controversies surrounding traditional modeling and the dimensional modeling approach all merit investigation. And that is another purpose of this post. Because it presents details of data warehouse modeling processes and techniques, the post can also be used as an introduction for those who want to learn data warehouse modeling.
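Because star and snowflake models come up repeatedly in the sections that follow, here is a minimal illustrative sketch of the star idea: a central fact table that holds measures and foreign keys pointing at small dimension tables. The Sales, Date, Product, and Store names are hypothetical examples of ours, not part of any model discussed above, and the sketch assumes Java 16+ for record types.

import java.util.List;
import java.util.Map;

public class StarSchemaSketch {
    // Dimension "tables": descriptive attributes keyed by a surrogate key.
    record DateDim(int dateKey, int year, int month, int day) {}
    record ProductDim(int productKey, String name, String category) {}
    record StoreDim(int storeKey, String city, String region) {}

    // Fact "table": only dimension keys plus numeric measures.
    record SalesFact(int dateKey, int productKey, int storeKey,
                     int quantitySold, double salesAmount) {}

    public static void main(String[] args) {
        Map<Integer, ProductDim> products =
            Map.of(1, new ProductDim(1, "Espresso Machine", "Appliances"));
        Map<Integer, StoreDim> stores =
            Map.of(10, new StoreDim(10, "Oslo", "Nordics"));
        Map<Integer, DateDim> dates =
            Map.of(20240105, new DateDim(20240105, 2024, 1, 5));

        List<SalesFact> facts = List.of(
            new SalesFact(20240105, 1, 10, 3, 899.97));

        // A typical dimensional question: sales amount by product category.
        for (SalesFact f : facts) {
            String category = products.get(f.productKey()).category();
            System.out.println(category + " -> " + f.salesAmount());
        }
    }
}

A snowflake model would simply normalize the dimensions further, for example splitting the product category out into its own table keyed from ProductDim.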


 2. Data Warehousing Architecture and Implementation Choices 

                 In this post we discuss the architecture and implementation choices available for data warehousing. During the discussions we may use the term data mart. Data marts, simply defined, are smaller data warehouses that can function independently or can be interconnected to form a global integrated data warehouse. However, in this post, unless noted otherwise, use of the term data warehouse also implies data mart. Although it is not always the case, choosing an architecture should be done prior to beginning implementation. The architecture can be determined, or modified, after implementation begins. However, a longer delay typically means an increased volume of rework. And, everyone knows that it is more time consuming and difficult to do rework after the fact than to do it right, or very close to right, the first time. The architecture choice selected is a management decision that will be based on such factors as the current infrastructure, business environment, desired management and control structure, commitment to and scope of the implementation effort, capability of the technical environment the organization employs, and resources available.
            The implementation approach selected is also a management decision, and one that can have a dramatic impact on the success of a data warehousing project. The variables affected by that choice are time to completion, return-on-investment, speed of benefit realization, user satisfaction, potential implementation rework, resource requirements needed at any point-in-time, and the data warehouse architecture selected.

 3. Architecture Choices

                    Selection of an architecture will determine, or be determined by, where the data warehouses and/or data marts themselves will reside and where the control resides. For example, the data can reside in a central location that is managed centrally. Or, the data can reside in distributed local and/or remote locations that are either managed centrally or independently. The architecture choices we consider in this post are global, independent, interconnected, or some combination of all three. The implementation choices to be considered are top down, bottom up, or a combination of both. It should be understood that the architecture choices and the implementation choices can also be used in combinations.
      For example, a data warehouse architecture could be physically distributed, managed centrally, and implemented from the bottom up starting with data marts that service a particular workgroup, department, or line of business.

4. Global Warehouse Architecture 

             A global data warehouse is considered one that will support all, or a large part, of the corporation that has the requirement for a more fully integrated data warehouse with a high degree of data access and usage across departments or lines of business. That is, it is designed and constructed based on the needs of the enterprise as a whole. It could be considered to be a common repository for decision support data that is available across the entire organization, or a large subset thereof.

Top Down Implementation

             A top down implementation requires more planning and design work to be completed at the beginning of the project. This brings with it the need to involve people from each of the workgroups, departments, or lines of business that will be participating in the data warehouse implementation. Decisions concerning data sources to be used, security, data structure, data quality, data standards, and an overall data model will typically need to be completed before actual implementation begins. The top down implementation can also imply more of a need for an enterprisewide or corporatewide data warehouse with a higher degree of cross-workgroup, cross-department, or cross-line-of-business access to the data. With this approach, it is more typical to structure a global data warehouse. If data marts are included in the configuration, they are typically built afterward, and they are more typically populated from the global data warehouse rather than directly from the operational or external data sources.

Bottom Up Implementation

             A bottom up implementation involves the planning and designing of data marts without waiting for a more global infrastructure to be put in place. This does not mean that a more global infrastructure will not be developed; it will be built incrementally as initial data mart implementations expand. This approach is more widely accepted today than the top down approach because immediate results from the data marts can be realized and used as justification for expanding to a more global implementation. In contrast to the top down approach, data marts can be built before, or in parallel with, a global data warehouse, and they can be populated either from a global data warehouse or directly from the operational or external data sources.

 5. Architecting the Data

             A data warehouse is, by definition, a subject-oriented, integrated, time-variant collection of data to enable decision making across a disparate group of users. One of the most basic concepts of data warehousing is to clean, filter, transform, summarize, and aggregate the data, and then put it in a structure for easy access and analysis by those users. But that structure must first be defined, and that is the task of the data warehouse model. In modeling a data warehouse, we begin by architecting the data. By architecting the data, we structure and locate it according to its characteristics. In this section, we review the types of data used in data warehousing and provide some basic hints and tips for architecting that data. We then discuss approaches to developing a data warehouse data model along with some of the considerations. Having an enterprise data model (EDM) available would be very helpful, but not required, in developing the data warehouse data model. For example, from the EDM you can derive the general scope and understanding of the business requirements. The EDM would also let you relate the data elements and the physical design to a specific area of interest. Data granularity is one of the most important criteria in architecting the data. On one hand, keeping the data at a fine grain (high granularity) can support any query, but the large volume of data that must be manipulated and managed can hurt response times. On the other hand, keeping the data at a coarse grain (low granularity) supports only specific, pre-summarized queries, but the reduced data volume brings significant improvements in performance.

6. Structuring the Data

         In structuring the data for data warehousing, we can distinguish three basic types of data that can be used to satisfy the requirements of an organization:
· Real-time data
· Derived data
· Reconciled data
         In this section, we describe these three types of data according to usage, scope, and currency. You can configure an appropriate data warehouse based on these three data types, with consideration for the requirements of any particular implementation effort. Depending on the nature of the operational systems, the type of business, and the number of users that access the data warehouse, you can combine the three types of data to create the most appropriate architecture for the data warehouse.

7. Data Modeling

              This section provides you with a basic understanding of data modeling, specifically for the purpose of implementing a data warehouse. Data warehousing has become generally accepted as the best approach for providing an integrated, consistent source of data for use in data analysis and business decision making. However, data warehousing can present complex issues and require significant time and resources to implement. This is especially true when implementing on a corporatewide basis. To realize benefits faster, the implementation approach of choice has become bottom up with data marts. Implementing in these small increments of limited scope provides a larger return on investment in a shorter amount of time. Implementing data marts does not preclude the implementation of a global data warehouse.
             It has been shown that data marts can scale up or be integrated to provide a global data warehouse solution for an organization. Whether you approach data warehousing from a global perspective or begin by implementing data marts, the benefits from data warehousing are significant. The question then becomes, How should the data warehouse databases be designed to best support the needs of the data warehouse users? Answering that question is the task of the data modeler. Data modeling is, by necessity, part of every data processing task, and data warehousing is no exception. As we discuss this topic, unless otherwise specified, the term data warehouse also implies data mart. We consider two basic data modeling techniques in this post: ER modeling and dimensional modeling. In the operational environment, the ER modeling technique has been the technique of choice. With the advent of data warehousing, the requirement has emerged for a technique that supports a data analysis environment. Although ER models can be used to support a data warehouse environment, there is now an increased interest in dimensional modeling for that task. In this section, we review why data modeling is important for data warehousing. Then we describe the basic concepts and characteristics of ER modeling and dimensional modeling.

8. The Process of Data Warehousing

             This section presents a basic methodology for developing a data warehouse. The ideas presented generally apply equally to a data warehouse or a data mart. Therefore, when we use the term data warehouse you can infer data mart. If something applies only to one or the other, that will be explicitly stated. We focus on the process of data modeling for the data warehouse and provide an extended section on the subject, but we discuss it in the larger context of data warehouse development. The process of developing a data warehouse is similar in many respects to any other development project and therefore follows a similar path. What follows is a typical, and likely familiar, development cycle with emphasis on how the different components of the cycle affect your data warehouse modeling efforts.
              It is certainly true that there is no one correct or definitive life cycle for developing a data warehouse. We have chosen one simply because it seems to work well for us. Because our focus is really on modeling, the specific life cycle is not an issue here. What is essential is that we identify what you need to know to create an effective model for your data warehouse environment. There are a number of considerations that must be taken into account as we discuss the data warehouse development life cycle. We need not dwell on them, but be aware of how they affect the development effort and understand how they will affect the overall data warehouse design and model.
· The life cycle seems to imply a single instance of a data warehouse. Clearly, this should be considered a logical view. That is, there could be multiple physical instances of a data warehouse involved in the environment. As an example, consider an implementation where there are multiple data marts. In this case you would iterate through the tasks in the life cycle for each data mart. This approach, however, brings with it an additional consideration, namely, the integration of the data marts. This integration can have an impact on the physical data, with considerations for redundancy, inconsistency, and currency levels. Integration is also especially important because it can require integration of the data models for each of the data marts as well. If dimensional modeling were being used, the integration might take place at the dimension level. Perhaps there could be a more global model that contains the dimensions for the organization. Then when data marts, or multiple instances of a data warehouse, are implemented, the dimensions used could be subsets of those in the global model. This would enable easier integration and consistency in the implementation.
· Data marts can be dependent or independent. In the previous consideration we addressed dependent data marts with their need for integration. Independent data marts are basically smaller-scope, stand-alone data warehouses. In this case the data models can also be independent, but you must understand that this type of implementation can result in data redundancy, inconsistency, and varying currency levels.
The key message of the life cycle is the iterative nature of data warehouse development. This, more than anything else, distinguishes the life cycle of a data warehouse project from other development projects. Whereas all projects have some degree of iteration, data warehouse projects take iteration to the extreme to enable fast delivery of portions of a warehouse. Thus portions of a data warehouse can be delivered while others are still being developed. In most cases, providing the user with some data warehouse function generates immediate benefits. Delivery of a data warehouse is not typically an all-or-nothing proposition. Because the emphasis of this post is on modeling for the data warehouse, we have left out discussion of infrastructure acquisition. Although this would certainly be part of any typical data warehouse effort, it does not directly impact the modeling process.

 9. Requirements Gathering 

         The traditional development cycle focuses on automating the process, making it faster and more efficient. The data warehouse development cycle focuses on facilitating the analysis that will change the process to make it more effective. Efficiency measures how much effort is required to meet a goal. Effectiveness measures how well a goal is being met against a set of expectations. The requirements identified at this point in the development cycle are used to build the data warehouse model. But the requirements of an organization change over time, and what is true one day is no longer valid the next. How, then, do you know when you have successfully identified the user's requirements? Although there is no definitive test, we propose that if your requirements address the following questions, you probably have enough information to begin modeling:
· Who (people, groups, organizations) is of interest to the user?
· What (functions) is the user trying to analyze?
· Why does the user need the data?
· When (for what point in time) does the data need to be recorded?
· Where (geographically, organizationally) do relevant processes occur?
· How do we measure the performance or state of the functions being analyzed?
There are many methods for deriving business requirements. In general, these methods can be placed in one of two categories: source-driven requirements gathering and user-driven requirements gathering.

 10. Source-Driven Requirements Gathering 

             Source-driven requirements gathering, as the name implies, is a method based on defining the requirements by using the source data in production operational systems. This is done by analyzing an ER model of the source data, if one is available, or the actual physical record layouts, and then selecting the data elements deemed to be of interest.

 11. User-Driven Requirements Gathering 

            User-driven requirements gathering is a method based on defining the requirements by investigating the functions the users perform. This is usually done through a series of meetings and/or interviews with users. The major advantage of this approach is that the focus is on providing what is needed, rather than what is available. In general, this approach has a smaller scope than the source-driven approach. Therefore, it generally produces a useful data warehouse in a shorter timespan.

 12. Data Warehouse Modeling Techniques 

            Data warehouse modeling is the process of building a model for the data that is to be stored in the data warehouse. The model produced is an abstract model, and in this sense, it is a representation of reality, or at least a part of the reality that the data warehouse is assumed to support. When considered like this, data warehouse modeling seems to resemble traditional database modeling, which most of us are familiar with in the context of database development for operational applications (OLTP database development). This resemblance should be considered with great care, however, because there are a number of significant differences between data warehouse modeling and OLTP database modeling. These differences affect not only the modeling process but also the modeling techniques to be used.

 13. Selecting a Modeling Tool 

               Modeling for data warehousing is significantly different from modeling for operational systems. In data warehousing, quality and content are more important than retrieval response time. Structure and understanding of the data, for access and analysis by business users, is a basic criterion in modeling for data warehousing, whereas operational systems are more oriented toward use by software specialists for creating applications. Data warehousing is also more concerned with data transformation, aggregation, subsetting, controlling, and other process-oriented tasks that are typically not of concern in an operational system. The data warehouse data model also requires information about both the source data that will be used as input and how that data will be transformed and flow to the target data warehouse databases. Thus, data modeling tools for data warehousing have significantly different requirements from those used for traditional data modeling for operational systems. In this section we outline some of the functions that are important for data modeling tools to support modeling for a data warehouse. The key functions we cover are: diagram notation for both ER models and dimensional models, reverse engineering, forward engineering, source-to-target mapping of data, data dictionary, and reporting. We conclude with a list of modeling tools.

 14. Populating the Data Warehouse 

Populating is the process of getting the source data from operational and external systems into the data warehouse and data marts. The data is captured from the operational and external systems, transformed into a usable format for the data warehouse, and finally loaded into the data warehouse or the data mart. Populating can affect the data model, and the data model can affect the populating process.
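As a rough illustration of capture, transform, and load, here is a minimal Java sketch. The file name operational_sales.csv and its three-column layout (date, store, amount) are hypothetical, not from the text above: raw operational records are read, malformed rows are filtered out, and amounts are aggregated to the target grain before being "loaded" into an in-memory structure standing in for a warehouse table.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PopulateSketch {
    public static void main(String[] args) throws Exception {
        // Extract: raw records, e.g. "2024-01-05,STORE-10,899.97"
        List<String> rawRecords = Files.readAllLines(Path.of("operational_sales.csv"));

        // Transform + Load: skip malformed rows, aggregate amount per day.
        Map<String, Double> dailyTotals = new HashMap<>();
        for (String line : rawRecords) {
            String[] fields = line.split(",");
            if (fields.length != 3) continue;               // basic data cleansing
            String day = fields[0].trim();
            double amount = Double.parseDouble(fields[2].trim());
            dailyTotals.merge(day, amount, Double::sum);    // summarized target grain
        }

        dailyTotals.forEach((day, total) ->
            System.out.println(day + " -> " + total));
    }
}

In a real population process the "load" step would write to the warehouse database and the transformation rules would come from the data warehouse model, which is exactly why the model and the populating process influence each other.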

Hadoop Introduction

Hadoop Introduction 


Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.
A Hadoop application works in an environment that provides distributed storage and computation across clusters of computers.
Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Hadoop Architecture
Hadoop has two major layers, namely:
(a) Processing/Computation layer (MapReduce), and
(b) Storage layer (Hadoop Distributed File System).

MapReduce
MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data (multi-terabyte datasets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
The MapReduce program runs on Hadoop which is an Apache open-source framework.
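The core idea, independent of the framework, can be sketched in a few lines of plain Java (this uses Java streams only, not the Hadoop API; the sample lines are made up): each line is "mapped" to individual words, and the counts are then "reduced" per word. A full Hadoop job for the same problem appears later in the MapReduce section of this post.

import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceIdea {
    public static void main(String[] args) {
        String[] lines = { "big data needs big clusters", "data data data" };

        Map<String, Long> wordCounts = Arrays.stream(lines)
            // "map" phase: split each line into words
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            // "reduce" phase: group identical words and count them
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        wordCounts.forEach((word, count) -> System.out.println(word + " -> " + count));
    }
}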

Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware.
It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
It is highly fault-tolerant and is designed to be deployed on low-cost hardware.
It provides high throughput access to application data and is suitable for applications having large datasets.
Apart from the above-mentioned two core components, the Hadoop framework also includes the following two modules:
1. Hadoop Common: These are Java libraries and utilities required by other Hadoop modules.
2. Hadoop YARN: This is a framework for job scheduling and cluster resource management.

How Does Hadoop Work?
It is quite expensive to build bigger servers with heavy configurations to handle large-scale processing. As an alternative, you can tie together many single-CPU commodity computers into a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide much higher throughput. Moreover, this is cheaper than one high-end server.
So the first motivational factor behind using Hadoop is that it runs across clustered, low-cost machines. Hadoop runs code across a cluster of computers.

This process includes the following core tasks that Hadoop performs:

1. Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB).
2. These files are then distributed across various cluster nodes for further processing.
3. HDFS, being on top of the local file system, supervises the processing.
4. Blocks are replicated for handling hardware failure.
5. Checking that the code was executed successfully.
6. Performing the sort that takes place between the map and reduce stages.
7. Sending the sorted data to a certain computer.
8. Writing the debugging logs for each job.

Advantages of Hadoop

1. Hadoop framework allows the user to quickly write and test distributed systems.
2. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.
3. Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA), rather Hadoop library itself has been designed to detect and handle failures at the application layer.
4. Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption.
5. Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms, since it is Java based.
Hadoop Installation
Hadoop is supported by the GNU/Linux platform and its flavors. Therefore, we have to install a Linux operating system to set up the Hadoop environment.
In case you have an OS other than Linux, you can install VirtualBox and run Linux inside it.
Pre-installation Setup
Before installing Hadoop into the Linux environment, we need to set up Linux using SSH (Secure Shell), which typically means creating a dedicated Hadoop user and configuring passwordless (key-based) SSH login for it.

Downloading Hadoop
Download and extract Hadoop 2.4.1 from the Apache Software Foundation using the following commands:

$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mkdir -p hadoop
# mv hadoop-2.4.1/* hadoop/
# exit

Modes of Hadoop Operation
Once you have downloaded Hadoop, you can operate your Hadoop cluster in one of the three supported modes:
1. Local/Standalone Mode: After downloading Hadoop onto your system, by default it is configured in standalone mode and can be run as a single Java process.
2. Pseudo-Distributed Mode: This is a distributed simulation on a single machine. Each Hadoop daemon, such as HDFS, YARN, and MapReduce, runs as a separate Java process. This mode is useful for development.
3. Fully Distributed Mode: This mode is fully distributed, with a minimum of two or more machines forming a cluster. We will come across this mode in detail in later sections.

Installing Hadoop in Standalone Mode
Here we will discuss the installation of Hadoop 2.4.1 in standalone mode.
There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.
Setting Up Hadoop
You can set the Hadoop environment variables by appending the following commands to the ~/.bashrc file:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

Before proceeding further, you need to make sure that Hadoop is working fine. Just issue the following command:

$ hadoop version
If everything is fine with your setup, then you should see the following result:

Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4

This means your Hadoop standalone-mode setup is working fine. By default, Hadoop is configured to run in a non-distributed mode on a single machine.
HDFS Overview
The Hadoop Distributed File System was developed using a distributed file system design. It runs on commodity hardware.
Unlike other distributed systems, HDFS is highly fault-tolerant and designed for low-cost hardware.
HDFS holds a very large amount of data and provides easier access. To store such huge data, the files are stored across multiple machines.
These files are stored in redundant fashion to rescue the system from possible data losses in case of failure.

HDFS also makes applications available for parallel processing.

Features of HDFS
1. It is suitable for distributed storage and processing.
2. Hadoop provides a command interface to interact with HDFS.
3. The built-in servers of namenode and datanode help users easily check the status of the cluster.
4. Streaming access to file system data.
5. HDFS provides file permissions and authentication.

HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can be run on commodity hardware. The system having the namenode acts as the master server, and it does the following tasks:
1. Manages the file system namespace.
2. Regulates client’s access to files.
3. It also executes file system operations such as renaming, closing, and opening files and directories.
Datanode
The datanode is commodity hardware having the GNU/Linux operating system and the datanode software.
For every node (Commodity hardware/System) in a cluster, there will be a datanode. These nodes manage the data storage of their system.
1. Datanodes perform read-write operations on the file systems, as per client request.
2. They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. The file in a file system is divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks.
In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.
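A small Java sketch of this (it assumes the Hadoop client libraries are on the classpath, a reachable HDFS configured via core-site.xml/hdfs-site.xml, and a hypothetical file path) shows how a file's length, block size, and replication factor can be inspected, and how the number of blocks follows from them:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // reads cluster config files
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/hadoop/sample.txt");   // hypothetical HDFS path
        FileStatus status = fs.getFileStatus(file);

        long blockSize = status.getBlockSize();
        long length = status.getLen();
        long blocks = (length + blockSize - 1) / blockSize;   // ceiling division

        System.out.println("File length (bytes) : " + length);
        System.out.println("Block size (bytes)  : " + blockSize);
        System.out.println("Number of blocks    : " + blocks);
        System.out.println("Replication factor  : " + status.getReplication());
        fs.close();
    }
}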
Goals of HDFS
Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failure of components is frequent.

Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.

Huge datasets: HDFS should have hundreds of nodes per cluster to manage the applications having huge datasets.

Hardware at data: A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.

What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing based on Java.
The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
Secondly, reduce task, which takes the output from a map as an input and combines those data tuples into a smaller set of tuples.
As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes.
Under the MapReduce model, the data processing primitives are called mappers and reducers.
Decomposing a data processing application into mappers and reducers is sometimes nontrivial.
But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change.
This simple scalability is what has attracted many programmers to use the MapReduce model.

The Algorithm

1. Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
2. A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
A) Map stage: The map or mapper’s job is to process the input data. Generally the input data is in the form of file or directory and is stored in the Hadoop file system (HDFS).
The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
B) Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that comes from the mapper.
After processing, it produces a new set of output, which will be stored in the HDFS.

3. During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
4. The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
5. Most of the computing takes place on nodes with data on local disks that reduces the network traffic.
6. After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
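Putting the map and reduce stages together, here is a minimal sketch along the lines of the classic WordCount job on the Hadoop 2.x (org.apache.hadoop.mapreduce) API; the input and output HDFS paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: each input line is split into words; emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: all counts for the same word arrive together; sum them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With the standalone setup from the installation section above, a jar containing this class can be submitted with something like: hadoop jar wordcount.jar WordCount <input path> <output path>.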
Inputs and Outputs (Java Perspective)
The MapReduce framework operates on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.

Input and output types of a MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).

Terminology
1. PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.
2. Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.
3. NamedNode - Node that manages the Hadoop Distributed File System (HDFS).
4. DataNode - Node where data is presented in advance before any processing takes place.
5. MasterNode - Node where JobTracker runs and which accepts job requests from clients.
6. SlaveNode - Node where Map and Reduce program runs.
7. JobTracker - Schedules jobs and tracks the assigned jobs with the Task Tracker.
8. Task Tracker - Tracks the task and reports status to JobTracker.
9. Job - A program is an execution of a Mapper and Reducer across a dataset.
10. Task - An execution of a Mapper or a Reducer on a slice of data.
11. Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.

Hadoop Distributions
Hadoop distributions aim to resolve version incompatibilities.

• A distribution vendor will:
– Integration-test a set of Hadoop products
– Package Hadoop products in various installation formats (Linux packages, tarballs, etc.)
– Possibly provide additional scripts to execute Hadoop
– Possibly backport features and bug fixes made by Apache
• Typically vendors will employ Hadoop committers, so the bugs they find will make it into Apache's repository.

Distribution Vendors

• Cloudera Distribution for Hadoop (CDH)
• MapR Distribution
• Hortonworks Data Platform (HDP)
• Apache BigTop Distribution
• Greenplum HD Data Computing Appliance

Cloudera Distribution for Hadoop (CDH)
• Cloudera has taken the lead on providing a Hadoop distribution – Cloudera is affecting the Hadoop ecosystem in the same way Red Hat popularized Linux in enterprise circles.
• Most popular distribution – http://www.cloudera.com/hadoop – 100% open source.
• Cloudera employs a large percentage of core Hadoop committers.
• CDH is provided in various formats – Linux packages, virtual machine images, and tarballs.
• CDH integrates the majority of popular Hadoop products – HDFS, MapReduce, HBase, Hive, Mahout, Oozie, Pig, Sqoop, Whirr, ZooKeeper, Flume.
• CDH4 is used in this class.

Supported Operating Systems
• Each distribution will support its own list of operating systems (OS).
• Commonly supported OS – Red Hat Enterprise Linux – CentOS – Oracle Linux – Ubuntu – SUSE Linux Enterprise Server.
• Please see the vendor's documentation for supported OS and versions – Supported operating systems for CDH4: https://ccp.cloudera.com/display/CDH4DOC/Before+You+Install+CDH4+on+a+Cluster#BeforeYouInstallCDH4onaCluster-SupportedOperatingSystemsforCDH4
Thanks for reading my article. Any queries? Comment below.

Artificial Intelligent-IV

Artificial Intelligent-IV Hello ,                So    we have go forward to learn new about Artificial Intelligent S...