Blogs

Docker

By David Lee

In June, an open source technology called Docker received a strong endorsement from Google engineer Eric Brewer, who said that no developer technology had taken off this fast since the rise of the Ruby on Rails framework. Docker is a container that packages network services and isolates tasks on a server so that they cannot interfere with one another. More importantly, a container can easily be moved to a different server and deployed there without heavy effort.

Developers have a vision of cloud computing in which the Internet is treated as one giant computer providing unlimited computing resources. In practice it is not so simple: the same service is hard to run across different platforms and hosts. Virtual machines offer a solution, but they require deploying an image of an entire operating system. By comparison, Docker provides an extremely lightweight way to deploy services more quickly and conveniently.

Eric Brewer is one of Google's elite engineers. In the mid-1990s, as a professor at the University of California, Berkeley, Brewer built Inktomi, the first web search engine to run on a vast network of cheap machines rather than on one enormously powerful, and enormously expensive, server. Over the next two decades, companies like Google, Amazon, and Facebook took the philosophy behind Brewer's CAP theorem to an extreme. “He is the grandfather of all the technologies that run inside Google,” says Craig McLuckie, a longtime product manager for Google’s cloud services.  read more »

SCSS

By Juyun Wang

Sass (Syntactically Awesome Stylesheets) is a CSS3 extension language that comes in two syntaxes: Sass (the indented syntax) and SCSS (Sassy CSS). The former does not use braces or semicolons; it relies on indentation instead. The latter looks much like ordinary CSS3 and does use braces and semicolons. There is no difference in functionality between the two. Which syntax to choose is purely a matter of personal preference, but note that some features are written differently in each, so refer to the official documentation before writing.

Sass adds features such as nested rules, variables, mixins, selector inheritance, etc. Below are some examples of SCSS features:

1. Variables
SCSS lets you define your own variables to hold values such as color codes, font styles, and widths. In the past, if you wanted to change a specific color, you had to search for every occurrence of its value and change each one. Now you only need to modify the variable's value.  read more »

CSS3 Gradient Buttons

By June Huang

The gradient property in CSS3 allows us to display smooth transitions of colors on page elements without the use of images or JavaScript. This property is now supported by major browsers like Internet Explorer, Firefox, Chrome, and Safari. Below I will demonstrate how to use this property to create gradient buttons.

Button CSS:

.button {
    display: inline-block;
    padding: 6px 22px;
    -webkit-box-shadow: 1px 1px 0px 0px #FFF6AD;
    -moz-box-shadow: 1px 1px 0px 0px #FFF6AD;
    box-shadow: 1px 1px 0px 0px #FFF6AD;
    -moz-border-radius: 5px;
    -webkit-border-radius: 5px;
    border-radius: 5px;
    border: 1px solid #FFCC00;
    color: #333333;
    font: 14px Arial;
    text-shadow: 1px 1px 0px #FFB745;
}  read more »

Google URL Shortener

By Juyun Wang

The Google URL Shortener is a service that takes long URLs and squeezes them into fewer characters to make links that are easier to share or email to friends. Google provides the URL Shortener API for free, with a daily limit of 1,000,000 queries. We can use the API to interact with the service programmatically and develop applications that use simple HTTP methods to store, share, and manage goo.gl short URLs. Below is an example of using the Google URL Shortener API with PHP:

1. Get your API key : https://code.google.com/apis/console
2. Shorten a long URL

To send and receive the JSON data, we use PHP's cURL functions.

define ( 'API_KEY', 'AIzaSyB8NLUnMd4Vnhg7ee3FN0EMMXmC5cZlwLA' );
define ( 'API_URL', 'https://www.googleapis.com/urlshortener/v1/url' );

if (isset ( $_REQUEST ['url'] ) && ! empty ( $_REQUEST ['url'] ))
{

$longUrl = $_REQUEST ['url'];

// Create the data
$postData = array (
'longUrl' => $longUrl,
'key' => API_KEY
);

// Encoded into JSON
$jsonData = json_encode ( $postData );

// Initialize the cURL session
$ch = curl_init ();
curl_setopt ( $ch, CURLOPT_URL, API_URL );
curl_setopt ( $ch, CURLOPT_POST, 1 );
curl_setopt ( $ch, CURLOPT_HTTPHEADER, array ( 'Content-type:application/json' ) );
curl_setopt ( $ch, CURLOPT_POSTFIELDS, $jsonData );
curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, 1 );
curl_setopt ( $ch, CURLOPT_SSL_VERIFYPEER, 0 );

// Execute the cURL session
$result = curl_exec ( $ch );

// Close the cURL session
curl_close ( $ch );
$shortUrl = json_decode ( $result );
echo $shortUrl->id;

}

Using this code, we can create a simple URL shortener page.

You can also call the API to expand any goo.gl short URL, or to look up a short URL's analytics. For details, refer to the official Google URL Shortener API website: https://developers.google.com/url-shortener/?hl=zh-TW
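
As a rough sketch of the expand call: it is just an HTTP GET against the same endpoint, with the short URL passed as a query parameter. The shortUrl parameter and the longUrl field in the response are taken from the API documentation, and the example short URL below is only a placeholder, so treat this as an approximation rather than a tested client.

// Expand a goo.gl short URL back to its long form (reuses API_URL and API_KEY from above).
$shortUrl = 'http://goo.gl/fbsS'; // placeholder short URL
$requestUrl = API_URL . '?shortUrl=' . urlencode ( $shortUrl ) . '&key=' . API_KEY;

$ch = curl_init ();
curl_setopt ( $ch, CURLOPT_URL, $requestUrl );
curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, 1 );
curl_setopt ( $ch, CURLOPT_SSL_VERIFYPEER, 0 );
$result = curl_exec ( $ch );
curl_close ( $ch );

$response = json_decode ( $result );
echo $response->longUrl; // the original long URL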

Cookies

By June Huang

Cookies, in short, are pieces of information that are stored on your computer when you visit a website. Websites use cookies to keep track of your activities on the site, for example your login state. When you browse through or revisit the site, its cookie data is sent back to the website so that activity-related information can be displayed to you. Below I shall provide some examples of using cookies and discuss the privacy concerns surrounding them.

Websites can use cookies to track user activity, save preferences, and gather information about the user. The following are some examples of where cookies are used: e-commerce websites like Amazon use cookies to remember the items that customers put in their shopping carts, so customers do not necessarily have to log in before they can shop on the site. Cookies can store customizations such as language, localization settings, and interface layout for the user's needs and convenience. Cookies are also commonly used for personalized advertising, for example in advertisements on Google and Facebook: information about your interests is gathered from recent searches and the links you click, and once advertisers have this information they can show advertisements that appeal to you, so you are more tempted to click on them.
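
Assuming a PHP back end, setting and reading a cookie looks roughly like this; the cookie name and the thirty-day lifetime are made-up values for illustration.

// Set a cookie named "lang" that expires in 30 days.
setcookie ( 'lang', 'en-US', time () + 30 * 24 * 60 * 60, '/' );

// On a later request the browser sends the cookie back,
// and PHP exposes it through the $_COOKIE superglobal.
if (isset ( $_COOKIE ['lang'] )) {
    echo 'Preferred language: ' . $_COOKIE ['lang'];
} else {
    echo 'No language preference saved yet.';
}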

Although cookies are simply text files and can be deleted by the user at any time, there are privacy issues associated with some of their uses. Most people are not aware that cookies are used at all, or that their personal data is being collected, and even those who do know have no control over what third-party websites do with their information. Browsers offer cookie settings that let users allow or disable cookies for specific websites or for all websites. Disabling cookies for all websites may not be a good idea, since some websites require cookies in order to function. Lastly, remember that cookies are not shared between different browsers, so you will have to edit the cookie settings individually in each browser.

References:
[1] HTTP cookie. (2012, September 17). In Wikipedia, The Free Encyclopedia. Retrieved 11:14, September 19, 2012, from http://en.wikipedia.org/w/index.php?title=HTTP_cookie&oldid=513075176
[2] How to Enable Cookies. Amazon: Help. Retrieved 16:55, September 19, 2012, from http://www.amazon.com/gp/help/customer/display.html?ie=UTF8&nodeId=200156940
[3] Advertising privacy FAQ. Google: Policies & Principles. Retrieved 17:16, September 19, 2012, from http://www.google.com/policies/privacy/ads/
[4] Cookies, Pixels, and Similar Technologies. Facebook: Help Center. Retrieved 17:48, September 19, 2012, from http://www.facebook.com/help/cookies

What's new in Ext JS 4?

By Ruby Lin

Ext JS 4 uses the MVC (Model-View-Controller) architecture. MVC helps organize your code, makes it easier to maintain, reduces the amount of code you have to write, and improves efficiency. It is a good fit for large projects.

Here is the folder structure:

application_name
    app
        controller
            class1.js
        model
            class1.js
        store
            class1.js
        view
            Grid.js
            Panel.js
    data
        class1.json
    app.js
    index.html

app.js - The Application class contains global settings for your application (such as the app's name) and maintains references to all of the models, views, and controllers used by the app.

controller - listens for events (usually fired by views) and takes some action in response.

model - defines a Model for each type of real-world object that we want to represent in the system.

store - is configured with the Model it holds and the Proxy it uses to load and save its data.

view - usually defined as a subclass of an Ext JS component (such as a grid or panel).

Ext JS 4 also lets you load any number of records into a grid without paging. In the past, loading a large number of records could cause an out-of-memory error. The new grid uses a virtualized scrolling system to handle potentially infinite data sets without any impact on client-side performance.

There is more: multiple versions running side by side, a split DOM for high performance, more fantastic charts, and so on.

More details are available on the official website: http://docs.sencha.com/ext-js/4-0/

Valgrind: the memory error detector

By David Lee

The C and C++ programming languages provide powerful memory manipulation through pointers, giving programmers efficient and flexible control over low-level memory. However, dynamic memory errors are among the toughest bugs to track down, and almost every programmer has suffered a segmentation fault or a memory leak. These errors appear only at run time and cannot be detected by the compiler, so they consume a lot of development time. Fortunately, there are many memory profiling tools that help programmers find such bugs effectively. Valgrind, the topic of this article, is one of these convenient tools.

Valgrind is an open source dynamic analysis framework that can detect memory, cache, and threading bugs, profile program performance (including cache and branch-prediction behavior), and even host external plugin tools for more detailed testing. This article introduces only Valgrind's memory error detection functionality.

First, of course, we need a program to analyze. Suppose we write the following source code (cited from the Valgrind website) and name it “test.c”.  read more »

Login with Facebook and retrieve user identities

By Hank Chen

Facebook, often abbreviated as FB, is a social networking website, and many websites integrate with FB for promotion. How do you integrate FB into your own website? The steps are outlined below.

1. Your API options:
The official Facebook API can be divided into two parts: one is web programming, implemented in JavaScript or PHP; the other is mobile programming for iOS or Android. Because FB has become so popular, APIs have also been developed for it on many other platforms, such as Silverlight, Flash, .NET, and Java. This article uses JavaScript as its example.

2. Before developing:
2.1. Sign up for a Facebook account.
2.2. Log in to your Facebook account and register as a developer. You can find more information at the following URL:
http://sofree.cc/fb-app-1/

3. Start creating your Facebook app:
3.1. After completing Step 2, click the “Create New App” button to get your App ID and App Secret, as shown in the red box in the screenshot. Then fill out the “Basic Info” form, shown in the blue box. For example, our test website is “me.cellpoint.com”, so we enter “me.cellpoint.com” in the “App Domain” field and “http://me.cellpoint.com” in “URL”. Finally, click “Save” to complete this step.  read more »

Libgtop

By David Lee

How do we get resource usage on a Linux system, such as memory and CPU utilization, while a process is running? We can read the /proc/<process id>/stat file, or we can use the “top” command in the shell; however, both approaches require extra effort, because the file or the command output has to be parsed before we can use it. Here is another way to obtain resource usage for the whole system or for a specific process: Libgtop, an open source C library.

Libgtop is a library from the GNOME project, used to implement the “top” functionality of the desktop environment. It depends on GLib, another GNOME library. The latest version of Libgtop is 2.28. Note that GLib 2.6.0 and Intltool 0.35.0 or later must be installed before installing Libgtop.

In general, CPU utilization is calculated from the time the CPU spends in different modes, usually divided into user mode, nice mode, system (kernel) mode, and idle mode. We can use the Libgtop API to get the CPU time (in clock ticks) spent in each mode since system boot. For example, the source code below calculates the overall CPU utilization.

#include <unistd.h>          /* sleep() */
#include <glibtop.h>
#include <glibtop/cpu.h>

double cpu_rate;
long dt, du, dn, ds;
glibtop_cpu cpu_begin, cpu_end;

glibtop_init();                       /* initialize the library before the first call */
glibtop_get_cpu(&cpu_begin);          /* first sample */
sleep(1);
glibtop_get_cpu(&cpu_end);            /* second sample, one second later */

dt = cpu_end.total - cpu_begin.total;
du = cpu_end.user  - cpu_begin.user;
dn = cpu_end.nice  - cpu_begin.nice;
ds = cpu_end.sys   - cpu_begin.sys;
cpu_rate = 100.0 * (du + dn + ds) / dt;

Note that we need the clock tick counts at two different points in time, so glibtop_get_cpu is called twice. Monitoring memory utilization, on the other hand, is much simpler:

#include <glibtop.h>
#include <glibtop/mem.h>

double mem_rate;
glibtop_mem memory;
glibtop_get_mem(&memory);                        /* sample current memory usage */
mem_rate = 100.0 * memory.used / memory.total;   /* used memory as a percentage */

A variety of resources can be monitored with Libgtop. In addition to the system-wide CPU and memory utilization described above, it covers the CPU and memory usage of a specific process, swap, file systems, network interfaces, and so on. The detailed API and data structures are documented on GNOME's official website: http://developer.gnome.org/libgtop/

Introduction of Google File System

By David Lee

Why does Google dominate the search engine market? One important reason is the excellent performance of its underlying file system. Google designed a unique distributed file system, known as the Google File System (GFS), to meet its huge storage demands. Google did not release GFS as open source software, but it has published some technical details, including an official paper.

There are two main differences between GFS and a traditional distributed file system. First, component failures are the norm rather than the exception. Failures can be caused by application bugs, operating system bugs, human error, and even hardware or network problems. Since even expensive disk hardware cannot rule out all failures, Google simply builds its storage machines from multiple inexpensive commodity components and defends against failures by integrating constant monitoring, error detection, fault tolerance, and automatic recovery into GFS.

Second, most files are mutated by appending new data rather than by overwriting or removing existing data. Once written, data usually only needs to be read, not rewritten. Most read operations are “large streaming reads”, where individual operations typically read hundreds of KBs, and more commonly 1 MB or more. Note that the system stores a modest number of large files, each typically 100 MB or larger. GFS supports small files, but is not optimized for them.

The architecture of GFS resembles a supernode (the master) with distributed nodes (chunkservers). The actual data is stored on chunkservers, which report their state to the master periodically. When a client wants to read a file, it asks the master about the relevant chunkservers, and the master responds with the location of a chunkserver that can serve the request. The client can then request the chunk data directly from that chunkserver.

GFS supports the huge data volume and traffic of the Google search engine. BigTable, a database system used by a number of Google applications such as Gmail, Google Maps, YouTube, and other cloud services, is also built on GFS. We can say that GFS is a killer technology of the cloud generation.

More details are available in the GFS paper: http://labs.google.com/papers/gfs.html

Semaphore functions in PHP

By Ruby Lin

A semaphore is a variable or abstract data type that provides a simple but useful abstraction for controlling access to a common resource by multiple processes in a parallel programming environment.

A useful way to think of a semaphore is as a record of how many units of a particular resource are available, coupled with operations to safely (i.e. without race conditions) adjust that record as units are required or become free, and if necessary wait until a unit of the resource becomes available.

Semaphores are a useful tool in the prevention of race conditions and deadlocks; however, their use is by no means a guarantee that a program is free from these problems. Semaphores which allow an arbitrary resource count are called counting semaphores, whilst semaphores which are restricted to the values 0 and 1 (or locked/unlocked, unavailable/available) are called binary semaphores.

The following are semaphore functions in PHP:
int ftok (string $pathname, string $proj) - Convert a pathname and a project identifier to a System V IPC key.

bool sem_acquire (resource $sem_identifier) - Acquire a semaphore.

resource sem_get (int $key [,int $max_acquire = 1 [,int $perm = 0666 [,int $auto_release = 1]]]) - Get a semaphore id.

bool sem_release (resource $sem_identifier) - Release a semaphore.

bool sem_remove (resource $sem_identifier) - Remove a semaphore.

resource shm_attach (int $key [, int $memsize [, int $perm]]) - Creates or opens a shared memory segment.

bool shm_detach (resource $shm_identifier) - Disconnects from shared memory segment.

mixed shm_get_var (resource $shm_identifier, int $variable_key) - Returns a variable from shared memory.

bool shm_has_var (resource $shm_identifier, int $variable_key) - Check whether a specific entry exists.

bool shm_put_var (resource $shm_identifier, int $variable_key, mixed $variable) - Inserts or updates a variable in shared memory.

bool shm_remove_var (resource $shm_identifier, int $variable_key) - Removes a variable from shared memory.

bool shm_remove (resource $shm_identifier) - Removes shared memory from Unix systems.
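
Putting a few of these functions together, a minimal sketch looks like the following; it assumes the sysvsem and sysvshm extensions are enabled, and the key file, permissions, and variable key are arbitrary illustrative values.

// Derive a System V IPC key from an existing file and a one-character project id.
$key = ftok ( __FILE__, 'a' );

// Get (or create) a semaphore that at most one process may hold at a time.
$sem = sem_get ( $key, 1, 0666, 1 );

if (sem_acquire ( $sem )) {
    // Critical section: store and read a value in shared memory.
    $shm = shm_attach ( $key, 1024, 0666 );
    shm_put_var ( $shm, 1, 'hello from process ' . getmypid () );
    echo shm_get_var ( $shm, 1 ), "\n";
    shm_detach ( $shm );

    sem_release ( $sem );
}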

http://php.net/manual/en/book.sem.php

GNU libextractor

By Shawn Lin

Introduction

GNU libextractor is GNU’s library for extracting meta data from files. Meta data includes format information (such as mime type, image dimensions, color depth, recording frequency), content descriptions (such as document title or document description) and copyright information (such as license, author and contributors). Currently, libextractor supports the following formats: HTML, PDF, PS, OLE2 (DOC, XLS, PPT), OpenOffice (sxw), StarOffice (sdw), DVI, MAN, FLAC, MP3 (ID3v1 and ID3v2), NSF(E) (NES music), SID (C64 music), OGG, WAV, EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), ZIP, ELF, S3M (Scream Tracker 3), XM (eXtended Module), IT (Impulse Tracker), FLV, REAL, RIFF (AVI), MPEG, QT and ASF. Also, various additional MIME types are detected.

Libextractor is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License. GNU libextractor uses plugins to handle various file formats. Technically a plugin can support multiple file formats; however, most plugins only support one particular format. By default, GNU libextractor will use all plugins that are available and found in the plugin installation directory. Applications can request the use of only specific plugins or the exclusion of certain plugins.  read more »

What is memcache?

By HH Tu

Today I will introduce a useful technique for reducing database load: Memcache. It is a distributed memory caching system, and we can build highly efficient cloud systems with it. The basic concept is to use a key-based structure to store and fetch data in memory. The original idea comes from Brad Fitzpatrick, who used it to speed up LiveJournal.com in 2003. Many websites use this method: LiveJournal, Wikipedia, Flickr, Twitter, YouTube, Digg, WordPress.com, and so on. It can cut out most of the database load time; the database is only hit, with better resource utilization, when a Memcache miss happens. It provides a key-based, distributed memory object cache, but authentication has to be handled by the application itself.

It is good to cache frequently used information to reduce repeated retrieval. The simplest example is browsing the web: most website content is downloaded into a local folder so that the same site loads faster the next time you visit. Memcache uses the same idea. It takes part of a machine's memory for fast access, can be deployed and reached from anywhere over a network, and you can add as much cache as you want (given enough memory). Even better, it presents all the caches as a single pool, which means you can combine the memory of several computers and use it together. All operations should run in O(1) time.  read more »
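
As a rough sketch of the read-through caching pattern, assuming the PHP Memcached extension and a memcached server on 127.0.0.1:11211 (fetchUserFromDatabase is a hypothetical helper standing in for a real database query):

// Connect to a memcached server (host and port are illustrative).
$cache = new Memcached ();
$cache->addServer ( '127.0.0.1', 11211 );

$key = 'user:42';
$user = $cache->get ( $key );

if ($user === false) {
    // Cache miss: load from the database, then cache the result for 5 minutes.
    $user = fetchUserFromDatabase ( 42 ); // hypothetical helper
    $cache->set ( $key, $user, 300 );
}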

What is Node.js?

By Paul Chien

JavaScript has traditionally only run in the web browser, but recently there has been considerable interest in bringing it to the server side as well, thanks to the CommonJS project. Other server-side JavaScript environments include Jaxer and Narwhal. However, Node.js is a bit different from these solutions, because it is event-based rather than thread based. Web servers like Apache that are used to serve PHP and other CGI scripts are thread based because they spawn a system thread for every incoming request. While this is fine for many applications, the thread based model does not scale well with many long-lived connections like you would need in order to serve real-time applications like Friendfeed or Google Wave.

Node.js uses an event loop instead of threads, and is able to scale to millions of concurrent connections. It takes advantage of the fact that servers spend most of their time waiting for I/O operations, like reading a file from a hard drive, accessing an external web service or waiting for a file to finish being uploaded, because these operations are much slower than in-memory operations. Every I/O operation in Node.js is asynchronous, meaning that the server can continue to process incoming requests while the I/O operation is taking place. JavaScript is extremely well suited to event-based programming because it has anonymous functions and closures which make defining inline callbacks a cinch, and JavaScript developers already know how to program in this way. This event-based model makes Node.js very fast, and makes scaling real-time applications very easy.


LDAP

By David Lee

Consider two different problems. First, a huge organization has thousands of members, many departments, and many IT resources; how does it maintain an up-to-date, accessible online address book? Second, an MIS staff member has to maintain separate sets of usernames and passwords for a number of different systems (Linux login, Apache, Samba, the mail service, and so on); how can this work be made easier? These two problems seem unrelated, but they can be served by the same solution: LDAP (the Lightweight Directory Access Protocol).

LDAP is a protocol for accessing an online directory service and is based on X.500. It omits many of the complicated details of the X.500 protocol to become a flexible, lightweight application protocol that runs over IP networks. For the first problem above, LDAP's flexible design lets us catalog different types of resources into a distributed online database. For the second, it provides a standardized interface that different applications can refer to, so integrating the configuration of those applications becomes easy.
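
For instance, assuming PHP's LDAP extension and a directory reachable at ldap.example.com (the server name, base DN, and password below are all made up), an application could authenticate a user and look up an address-book entry roughly like this:

// Connect to the directory and bind (authenticate) as a user.
$conn = ldap_connect ( 'ldap://ldap.example.com' ); // hypothetical server
ldap_set_option ( $conn, LDAP_OPT_PROTOCOL_VERSION, 3 );

$dn = 'uid=dlee,ou=People,dc=example,dc=com'; // hypothetical DN
if (ldap_bind ( $conn, $dn, 'secret' )) {
    // Search the address book for the same user and print a couple of attributes.
    $result = ldap_search ( $conn, 'ou=People,dc=example,dc=com', '(uid=dlee)' );
    $entries = ldap_get_entries ( $conn, $result );
    echo $entries [0] ['cn'] [0] . ' / ' . $entries [0] ['mail'] [0];
}
ldap_unbind ( $conn );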

From a macro perspective, LDAP organizes data into a tree structure called the DIT (Directory Information Tree). A DIT can be split into many subtrees, each of which can be stored on a different LDAP server to achieve a distributed architecture. Each record in the DIT is identified by a unique distinguished name (DN); like an absolute path in an ordinary file system, the DN identifies the record's location in the DIT.  read more »

Web Application Frameworks

By June Huang

Due to the growing use of the Web and web services, web sites in the Web 2.0 era no longer serve only static content. Site content has become dynamic so that users can perform real-time tasks such as checking and sending mail. The scale of our web projects grows vast, and they become complex to maintain as new features are continually added.

Web application frameworks provide a software architectural model that helps us organize and manage the different components of our web application. They also provide useful libraries, for example for accessing the database, rendering templates, and managing sessions.

Many web application frameworks use a Model-View-Controller (MVC) architecture that defines the logical components of the web application. The model, view, and controller are explained below:

Model
The application model is used to handle the data of the system. In other words, it includes the data and functions that are used to manipulate the data. Controllers and views obtain and change data with the model.

View
The view is the rendered component of the application that the user sees, in other words, the user interface. The user interacts with the application through the user interface.

Controller
The controller handles requests from the user and returns responses. It obtains the required data from the model, prepares it into a suitable format, inserts the data into the view, and renders the view for the user.

A typical request to the server happens as follows: The user interacts with the user interface and a request is sent to the server. The main controller handles the request by determining the appropriate delegate controller and passes control to that controller. The delegate controller interacts with the model to gather or update data for the view, renders the view, and returns control to the main controller. The main controller responds with the rendered view. The cycle repeats when the user interacts with the user interface and sends a new request.
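
A stripped-down sketch of that cycle in plain PHP may make the roles concrete; every class and function name here is hypothetical, and a real framework would add routing, templates, and error handling:

// Front controller: decide which controller action handles the request.
$action = isset ( $_GET ['action'] ) ? $_GET ['action'] : 'list';

// Model: owns the data and the functions that manipulate it.
class ArticleModel {
    public function all() {
        // A real model would query the database here.
        return array ( array ( 'title' => 'Hello MVC' ) );
    }
}

// View: turns data into the HTML the user sees.
function renderArticleList($articles) {
    $html = '<ul>';
    foreach ( $articles as $a ) {
        $html .= '<li>' . htmlspecialchars ( $a ['title'] ) . '</li>';
    }
    return $html . '</ul>';
}

// Controller: gets data from the model and hands it to the view.
class ArticleController {
    public function listAction() {
        $model = new ArticleModel ();
        return renderArticleList ( $model->all () );
    }
}

$controller = new ArticleController ();
echo ($action == 'list') ? $controller->listAction () : 'Unknown action';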


References:
[1] Web application framework. (2011, May 28). In Wikipedia, The Free Encyclopedia. Retrieved 15:23, May 30, 2011, from http://en.wikipedia.org/w/index.php?title=Web_application_framework&oldid=431373642
[2] Model–view–controller. (2011, May 26). In Wikipedia, The Free Encyclopedia. Retrieved 17:12, May 30, 2011, from http://en.wikipedia.org/w/index.php?title=Model%E2%80%93view%E2%80%93controller&oldid=430946706

Protocol Buffers

By Shawn Lin

 Introduction

  • A flexible, efficient, automated mechanism for serializing structured data.
  • Think XML, but smaller, faster, and simpler.
  • You use special generated source code to easily write and read your structured data.
  • You can update your data structure without breaking deployed programs that are compiled against the "old" format.

Why not just use XML?
Protocol buffers have many advantages over XML for serializing structured data. Protocol buffers:

* are simpler
* are 3 to 10 times smaller
* are 20 to 100 times faster
* are less ambiguous
* generate data access classes that are easier to use programmatically  read more »

CodeIgniter 2.0.2 Released

By Ruby Lin

There are many PHP frameworks available today; some of the most popular include the Zend Framework, CakePHP, Symfony, CodeIgniter, Seagull, and Yii. These frameworks bring a number of benefits to your PHP development, for example:

1. MVC(Model-View-Controller) architecture

2. Separate PHP from HTML

3. User-friendly URL namespaces

4. Rapid development

These frameworks have their own positives and negatives. Each programer has a different style and different priorities when it comes to adopting a tool kit to use when building apps. CodeIgniter is an open source web application framework that helps you write incredible PHP programs, and it is well-known for following features:

1. a small footprint

2. exceptional performance

3. ease-of-use

4. clear, thorough documentation

5. nearly zero configuration

6. no command line

7. no large-scale monolithic library

CodeIgniter attracts me because it is easy to understand and easy to extend, and it ships with a number of helpers, libraries, and plug-ins you can use. All the tools you need are in one little package, and if that is not enough, you can create your own libraries. CodeIgniter also includes some security tools. For both users and developers, security is a key concern, and Cross-Site Scripting (XSS) is one of the most common application-layer web attacks. CodeIgniter comes with an XSS prevention filter that can either run automatically, filtering all POST and COOKIE data it encounters, or be run on a per-item basis.
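
Based on my reading of the CodeIgniter 2.x documentation, the two modes look roughly like this (check the docs for your own version before relying on it):

// application/config/config.php - filter all POST and COOKIE data automatically:
$config ['global_xss_filtering'] = TRUE;

// Or filter on a per-item basis inside a controller:
class Comments extends CI_Controller {
    public function store() {
        // Passing TRUE as the second argument runs the XSS filter on this field.
        $comment = $this->input->post ( 'comment', TRUE );

        // xss_clean() can also be applied to any value directly.
        $comment = $this->security->xss_clean ( $comment );

        echo $comment;
    }
}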

CodeIgniter 2.0.2 has now been released. It is a security maintenance release; the fix patches a small vulnerability in the cross-site scripting filter.

http://codeigniter.com/

What is Machine Learning?

By HH Tu

Nowadays, if a programmer wants to solve a word-parsing problem, he writes a program to solve it. First of all, he reads in a file and writes some instructions to parse it; the program then collects the useful information and outputs it. This is simple, but unfortunately it cannot be the only way to solve every problem in the world. Humans can easily tell whether an e-mail is spam or ham, but it is not easy to find a workable algorithm to do it.

Spam mails vary widely and are thus very difficult to identify; even the human brain cannot remember or recognize every possible case. It would be great if we could rely on computers to collect data, automatically extract useful information, order the results the way we want, and even learn on their own to give us predictions. The point is that we do not have a direct algorithm, but we do have data.

Assume we have thousands of clients around the world and receive tens of millions of e-mails every day. If we want to determine whether an e-mail is spam or ham, we can look at previous mails and make an approximate prediction. With traditional statistical analysis this takes a great deal of time and money. Furthermore, spam behavior changes over time, and the kinds of mail differ across locations around the world, so a fixed traditional rule will eventually fail. From another perspective, if we knew whether an e-mail was sent by a spammer to broadcast advertisements or by a general manager to issue an order, we could handle it easily; you could write instructions to quickly filter it out or keep it, but obtaining that information is not an easy job.  read more »

Pyramid Introduction

By Paul Chien

Pyramid is a general, open source, Python web application development framework. Its primary goal is to make it easier for a developer to create web applications. The type of application being created could be a spreadsheet, a corporate intranet, or a social networking platform; Pyramid’s generality enables it to be used to build an unconstrained variety of web applications.
The first release of Pyramid’s predecessor (named repoze.bfg) was made in July of 2008. We have worked hard to ensure that Pyramid continues to follow the design and engineering principles that we consider to be the core characteristics of a successful framework:

Simplicity
Pyramid takes a “pay only for what you eat” approach. This means that you can get results even if you have only a partial understanding of Pyramid. It doesn’t force you to use any particular technology to produce an application, and we try to keep the core set of concepts that you need to understand to a minimum.

Minimalism
Pyramid concentrates on providing fast, high-quality solutions to the fundamental problems of creating a web application: the mapping of URLs to code, templating, security and serving static assets. We consider these to be the core activities that are common to nearly all web applications.

Documentation
Pyramid’s minimalism means that it is relatively easy for us to maintain extensive and up-to-date documentation. It is our goal that no aspect of Pyramid remains undocumented.

Speed
Pyramid is designed to provide noticeably fast execution for common tasks such as templating and simple response generation. Although the “hardware is cheap” mantra may appear to offer a ready solution to speed problems, the limits of this approach become painfully evident when one finds him or herself responsible for managing a great many machines.

Reliability
Pyramid is developed conservatively and tested exhaustively. Where Pyramid source code is concerned, our motto is: “If it ain’t tested, it’s broke”. Every release of Pyramid has 100% statement coverage via unit tests.

Openness
As with Python, the Pyramid software is distributed under a permissive open source license.


http://docs.pylonsproject.org/projects/pyramid/1.0/narr/introduction.html

Web Crawlers - Crawling Policies

By June Huang

Continuing from my last blog entry on web crawlers, let me now give a more detailed explanation of how web crawlers traverse the Web. Web crawlers use a combination of policies to determine their crawling behavior; these include a selection policy, a revisit policy, a politeness policy, and a parallelization policy. I shall discuss each of them below.

As only a fraction of the Web can be downloaded, a web crawler must use a selection policy to determine which resources are worth downloading; this is more useful than downloading a random portion of the Web. An example of a selection policy is Google's PageRank policy, where the importance of a page is determined by the links to and from that page. Other selection policies are based on the context of the page or on the resources' MIME types.

Web crawlers use a revisit policy to estimate the cost associated with holding an outdated copy of a resource; the goal is to minimize this cost. This matters because resources on the Web are continually created, updated, or deleted, all within the time it takes a crawler to finish a pass through the Web, and it is undesirable for the search engine to return an outdated copy. The cost of revisiting a page is based on freshness and age, where freshness measures whether the local copy matches the current resource and age measures how long ago the local copy was updated.

The politeness policy ensures that a site's performance is not heavily affected while the web crawler downloads a portion of the site; the server could otherwise be overloaded, since it has to handle requests from the site's visitors as well as from the crawler. Proposed solutions include introducing an interval between requests so the crawler does not overload the server, and the robots exclusion protocol, which lets administrators indicate which portions of the site crawlers should not access.
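
To make the politeness idea concrete, here is a tiny PHP sketch; the host, the ten-second interval, and the Disallow rule are placeholder values, and a real crawler would parse robots.txt properly rather than with a simple string search:

// Honor a robots.txt Disallow rule and pause between requests.
$base = 'http://www.example.com'; // placeholder host
$delay = 10; // seconds between requests (arbitrary)
$robots = @file_get_contents ( $base . '/robots.txt' );
$disallow = ($robots !== false && strpos ( $robots, 'Disallow: /private/' ) !== false);

foreach ( array ( '/index.html', '/private/report.html' ) as $path ) {
    if ($disallow && strpos ( $path, '/private/' ) === 0) {
        continue; // excluded by the site administrator
    }
    echo "fetching $base$path\n"; // a real crawler would download the page here
    sleep ( $delay ); // politeness interval between requests
}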

Parallelization policies are used to coordinate multiple web crawlers crawling the same Web space. The goal is to maximize the download rate of resources while keeping the crawlers from repeatedly downloading the same pages.

[1] Web crawler. (2011, February 22). In Wikipedia, The Free Encyclopedia. Retrieved 16:24, March 4, 2011, from http://en.wikipedia.org/w/index.php?title=Web_crawler&oldid=415343979

Design Patterns in JavaScript

By Paul Chien

 The fact that JavaScript is so expressive allows you to be very creative in how design patterns are applied to your code. There are three main reasons why you would want to use design patterns in JavaScript:

  1. Maintainability: Design patterns help to keep your modules more loosely coupled. This makes it easier to refactor your code and swap out different modules. It also makes it easier to work in large teams and to collaborate with other programmers.
  2. Communication: Design patterns provide a common vocabulary for dealing with different types of objects. They give programmers shorthand for describing how their systems work. Instead of long explanations, you can just say, “It uses the factory pattern.” The fact that a particular pattern has a name means you can discuss it at a high level, without having to get into the details.
  3. Performance: Some of the patterns we cover in this book are optimization patterns. They can drastically improve the speed at which your program runs and reduce the amount of code you need to transmit to the client. The flyweight and proxy patterns are the most important examples of this.

There are two reasons why you might not want to use design patterns:

  1. Complexity: Maintainability often comes at a cost, and that cost is that your code may be more complex and less likely to be understood by novice programmers.
  2. Performance: While some patterns improve performance, most of them add a slight performance overhead to your code. Depending on the specific demands of your project, this overhead may range from unnoticeable to completely unacceptable.

Implementing patterns is the easy part; knowing which one to use (and when) is the hard part. Applying design patterns to your code without knowing the specific reasons for doing so can be dangerous. Make an effort to ensure that the pattern you select is the most appropriate and won’t degrade performance below acceptable limits.

[1] Ross Harmes and Dustin Diaz (2008). Pro JavaScript Design Patterns

Parallel programming language Erlang!

By Shawn Lin

Telecommunication companies like Nortel Networks and T-Mobile develop their systems with Erlang to achieve concurrency and fault tolerance. Beyond concurrency and fault tolerance, multi-core and Hyper-Threading (HT) processor environments are also very well suited to the Erlang language.

Erlang solves one of the most pressing problems facing developers today: how to write reliable, concurrent, high-performance systems. It's used worldwide by companies who need to produce reliable, efficient, and scalable applications.

Moore's Law is the observation that the amount you can do on a single chip doubles every two years, but Moore's Law is taking a detour. Rather than producing faster and faster processors, companies such as Intel and AMD are producing multi-core devices: single chips containing two, four, or more processors. If your programs aren't concurrent, they'll only run on a single processor at a time. Your users will think that your code is slow.

Erlang is a programming language designed for building highly parallel, distributed, fault-tolerant systems. It has been used commercially for many years to build massive fault-tolerant systems that run for years with minimal failures.

Erlang programs run seamlessly on multi-core computers: this means your Erlang program should run a lot faster on a quad-core processor than on a single core processor, all without you having to change a line of code.

Developing systems with Erlang has the following benefits:

  • Write a program once, move it to a multi-core environment, and it naturally runs faster (possibly even with linear speedup: n cores, n times faster).
  • You can write fault-tolerant systems that restart themselves after a crash.
  • You can write "hot-swappable" systems whose code can be upgraded while it is running, without suspending it.
  • The resulting programs are remarkably concise.


Erlang ships with Mnesia, a database management system (DBMS). Mnesia is tightly integrated with the language and can be accessed very quickly, and it can replicate data across a number of separate nodes to provide fault-tolerant operation.

In addition to Mnesia, you will almost always use the OTP libraries when developing systems with Erlang. OTP is a set of Erlang libraries and open source programs that help Erlang programs become industrial-grade applications. OTP is Erlang's source of power; with OTP it is quite easy to write a solid server.

http://www.erlang.org/doc/

Web Crawlers

By June Huang

Looking up information on the Internet has become a daily task for many of us. Thanks to the invention of search engines, it is not laborious to do. Search engines are convenient to use as they produce immediate results from countless sources. From Web pages to images and videos, we are able to search through almost everything anyone can ever find in the Web. To be able to return results, search engines first make use of a computer program called a web crawler that explores the resources on the Web. Web crawlers look at the pages’ contents and store information about the page so that when the user requests something, the search engine can find related resources and return them to the user. In this article I shall give a brief introduction of how search engines manage and find what we are interested in.

To begin with, the web crawler is given a list of URLs. The crawler visits a page and identifies keywords and links. It then determines which pieces of information are worth adding or updating. The web crawler will then download a portion of that page and index some metadata, for example the page’s URL, for future searches. The newly found links are then added to the list of URLs for the crawler to continue exploring.
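
To make this loop concrete, here is a very small PHP sketch of a single crawl step, fetching one page and collecting its links; the seed URL is a placeholder, and a real crawler would add politeness delays, robots.txt checks, and duplicate detection:

// One step of a toy crawler: fetch a page and collect the links it contains.
$queue = array ( 'http://www.example.com/' ); // seed URL (placeholder)
$url = array_shift ( $queue );

$ch = curl_init ( $url );
curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, 1 );
curl_setopt ( $ch, CURLOPT_FOLLOWLOCATION, 1 );
$html = curl_exec ( $ch );
curl_close ( $ch );

if ($html !== false) {
    $doc = new DOMDocument ();
    @$doc->loadHTML ( $html ); // @ silences warnings on messy real-world HTML
    foreach ( $doc->getElementsByTagName ( 'a' ) as $link ) {
        $href = $link->getAttribute ( 'href' );
        if (strpos ( $href, 'http' ) === 0) {
            $queue [] = $href; // remember it for a later visit
        }
    }
}
echo count ( $queue ) . " URLs waiting to be crawled\n";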

Web crawlers have to select which pages to visit because there is a virtually unlimited number of pages on the Internet, and pages are constantly being added, modified, or deleted. Policies are used to determine whether a page is worth visiting, as it is impractical to visit every single page on the Web, possibly multiple times, to check for updates. An example of such a policy is Google's PageRank policy, which weighs the importance of a page based on the links to the page and the PageRank of the linking pages. The number of pages that link to a specific page reflects that page's importance and therefore contributes to its PageRank; the higher the PageRank, the more the page is worth indexing. Distributed web crawling is also used to share the URLs to be explored and the page downloads among machines, so as to optimize the crawl through the Web.

References:
[1] Web crawler. (2010, December 22). In Wikipedia, The Free Encyclopedia. Retrieved 11:21, December 29, 2010, from http://en.wikipedia.org/w/index.php?title=Web_crawler&oldid=403711331
[2] PageRank. (2011, January 2). In Wikipedia, The Free Encyclopedia. Retrieved 11:15, January 6, 2011, from http://en.wikipedia.org/w/index.php?title=PageRank&oldid=405547279

Drupal Introduction

By Ricky Wu

Drupal is one of the best Content Management Systems (CMS). It is written in PHP and requires a MySQL database. Its basic installation can be easily turned into many different types of web sites - from simple web logs to large online communities.

Here is a list of the Drupal benefits:

  • Easy to install;
  • Easy to use - no programming knowledge needed;
  • Lots of features, including Search Engine Friendly (SEF) URLs, categories, a search function, and many more;
  • Lots of modules to extend your site's functionality;
  • Flexibility - you can easily turn your Drupal installation into a forum, blog, wiki, or many other types of web sites;
  • Free to use and open source - you can freely install Drupal and modify the source code to fit your needs;
  • Lots of users and a large community - it is easy to find solutions to your problems.

By enabling and configuring individual modules, an administrator can design a unique site which can be used for knowledge management, web publishing, community interaction purposes, etc.
Here are some typical Drupal usages:

  • Content management - Via a simple, browser-based interface, members can publish stories, blogs, polls, images, forums, etc. Administrators can easily customize the design of their Drupal installation.
  • The Drupal classification system allows hierarchical ordering, cross-indexing of posts and multiple category sets for most content types. Access to content is controlled through administrator-defined user roles. A search option is also available.
  • Weblog - A single installation can be configured as an individual personal weblog site or multiple individual weblogs. Drupal supports the Blogger API, provides RSS feeds for each individual blog and can be set to ping weblog directories when new content is posted on the home page.
  • Discussion-based community - A Drupal web site can be successfully used as a discussion forum. Comment boards, attached to most content types, make it simple for members to discuss new posts. Administrators can control whether content and comments are posted without approval, with administrator approval or through community moderation. With the built-in news aggregator, communities can subscribe to and then discuss content from other sites.
  • Collaboration - Used for managing the construction of Drupal, the project module is suitable for supporting other open source software projects. The wiki-like collaborative book module includes version control, making it simple for a group to create, revise and maintain documentation or any other type of text.

Drupal is a powerful, developer-friendly tool for building complex sites. Like most powerful tools, it requires some expertise and experience to operate, and it is not especially friendly to beginners.

Reference:

Drupal - Official Website