Binglong's space

Random notes on computer, phone, and life

YUV/YCbCr Chroma Subsampling Ratio

Posted by binglongx on November 15, 2015

A color image is often chroma subsampled to reduce the bit rate, by exploiting that the human visual system is less sensitive to the resolution of color information than to that of luminance. To perform chroma subsampling, the image is first conceptually converted into Y’CbCr (or YUV) color space, then the CbCr (or UV) channels are spatially subsampled to reduce the data rate. The luma channel (Y) normally is not subsampled and retains the full resolution.

There are many different ways to perform chroma subsampling. For example, you may often hear YUV 4:2:2 or YUV 4:2:0, or sometimes just simply YUV422 or YUV420. What do they mean?

The subsampling ratio is denoted as YUV J:a:b. In the full-resolution image, consider a neighborhood of Jx2 pixels (J pixels wide and 2 pixels high), also called a super pixel or macro pixel. The Y channel is not subsampled, so it receives all Jx2 samples. In most cases, J is 4. The U and V channels are always subsampled at the same rate, so it is sufficient to consider either one. The number of horizontal samples that the U/V channel keeps in a row of the super pixel after subsampling is a. The number of rows in the subsampled U/V super pixel is either 1 or 2; however, this is not what b means. Officially, b is the number of changes of chroma samples from the first row to the second row of the subsampled U/V super pixel. See the figure below for an illustration (example is YUV 4:4:4).


However, I find it much easier to remember b as the number of samples in the second row of the subsampled U/V super pixel, with 0 meaning there is no second row. With that, the illustration changes a bit:


Please note that a and b do NOT mean the relative sampling rates of U and V channels with respect to the Y channel. Since U and V channels always share the same sampling method, it is not necessary to do so. Rather, a and b tell the horizontal and vertical sampling rates, although the meaning of b is a bit wacky.

Below is the list of existing YUV subsampling schemes, showing the subsampled Y, U and V channels:


Among them, YUV444, YUV422 and YUV420 are used quite often. YUV440, YUV410 and YUV411 are rarely used.

In summary, in YUV J:a:b

  • J = 4 for most YUV chroma subsampling
  • a = number of horizontal samples in U/V for a Jx2 neighborhood.
    • a = J (= 4) : full horizontal resolution in U/V
    • a = J/2 (= 2): half horizontal resolution in U/V
    • a = J/4 (= 1): quarter horizontal resolution in U/V
  • b = number of samples in U/V in the second row of a Jx2 neighborhood.
    • b = a: full vertical resolution in U/V
    • b = 0: half vertical resolution in U/V (no second row)

The total number of samples in a Jx2 neighborhood:

  • Before subsampling: A = Jx2x3 (assuming 3-channel RGB or YUV full resolution)
  • Y = Jx2
  • U = V = a+b
  • Total = Y + U + V = 2x(J+a+b)
  • Compressed data rate = Total/A = (J+a+b)/(3xJ), for example:
    • YUV444 : 100%
    • YUV422 : 66.7%
    • YUV420 : 50%
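
The arithmetic above can be sketched in a few lines of C++. This is only an illustration under the simplified reading of b used in this post (b is either a or 0); the names ChromaSubsampling, samplesPerBlock, dataRate, chromaWidth and chromaHeight are made up for this sketch and do not come from any library.

```cpp
// Sketch of the J:a:b arithmetic; all names here are illustrative only.
struct ChromaSubsampling {
    int J;  // width of the Jx2 neighborhood (usually 4)
    int a;  // U/V samples in the first row
    int b;  // U/V samples in the second row (0 means no second row)
};

// Samples in one Jx2 neighborhood after subsampling: Y + U + V = 2x(J+a+b).
constexpr int samplesPerBlock(ChromaSubsampling s) {
    return 2 * s.J + 2 * (s.a + s.b);
}

// Data rate relative to the full-resolution 3-channel image: (J+a+b)/(3xJ).
constexpr double dataRate(ChromaSubsampling s) {
    return static_cast<double>(s.J + s.a + s.b) / (3.0 * s.J);
}

// Size of the subsampled U (or V) plane for a width x height image,
// assuming width is a multiple of J and height is even.
constexpr int chromaWidth(ChromaSubsampling s, int width) {
    return width * s.a / s.J;
}
constexpr int chromaHeight(ChromaSubsampling s, int height) {
    return s.b == 0 ? height / 2 : height;
}
```

For example, with ChromaSubsampling{4, 2, 0} (YUV420) and a 1920x1080 image, chromaWidth and chromaHeight give a 960x540 plane for each of U and V, and dataRate gives 0.5.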

Please note, the YUV subsampling ratio specifies only the sampling ratio. It does NOT dictate the memory layout of the resulting Y/U/V samples. For each YUV-J:a:b subsampling ratio, there may be various planar, interleaved, and mixed planar-interleaved YUV layouts (i.e., image formats). For example, both the NV12 and NV21 image formats use the YUV420 subsampling ratio. For details of individual YUV image formats, please consult YUV pixel formats at

Posted in C++, Smart Phone


Posted by binglongx on October 9, 2015

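Reference 1: US embassy/consulate interview-waiver visa, China CITIC Bank application drop-off and document delivery (the "Apply for a U.S. Visa" website has been improved quite a bit, but the basic process is still the same)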



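07/27/2015: Contacted my parents in China and asked them to have photos taken that meet the visa requirements.
08/01/2015: Photos taken; the photo studio emailed me the digital copies.
08/18/2015: Finished filling out the DS-160 online (using the photos above).
08/28/2015: Went through the process on the visa application website, printed the visa fee payment slip (CGI/MRV), and sent it back home as an image via WeChat.
09/01/2015: Paid the application fee at the local China CITIC Bank branch; showing the payment slip in WeChat on the phone was enough.
09/05/2015: Went back to the visa application website, which already showed the payment as completed; printed the document submission letter (Dropbox confirmation) to PDF.
09/18/2015: After some trouble, my parents in China printed the DS-160 confirmation page, the invitation letter to the visa officer, and the document submission letter.
09/25/2015: Submitted the application at the local China CITIC Bank branch. New policy: the visa fee receipt is no longer required. Only these are submitted: DS-160 confirmation page, invitation letter, document submission letter, passport, photo.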


U.S. Department of State Consular Electronic Application Center (to check the visa status):

09/29/2015: No status.
09/30/2015: Ready. Your case is open and ready for your interview, fingerprints, and required documents. If you have already had your interview, please check your status after two business days. If no interview was required, please check back in two business days for the status of your application.
10/02/2015: Administrative Processing. Your visa case is currently undergoing necessary administrative processing. This processing can take several weeks. Please follow any instructions provided by the Consular Officer at the time of your interview. If further information is needed, you will be contacted. If your visa application is approved, it will be processed and mailed/available within two business days.
10/06/2015: Issued. Your visa is in final processing. If you have not received it in more than 10 working days, please see the webpage for contact information of the embassy or consulate where you submitted your application.

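US Travel Docs (to check the passport status):

09/28/2015: There is no status for the passport number you entered
09/29/2015: Your passport is still at the consulate
10/07/2015: The passport has been collected from the consulate and delivery is being arranged
10/09/2015: Your passport is ready for pickup

10/09/2015: The local China CITIC Bank branch called my parents to let them know that the passports with the issued visas had arrived from Guangzhou and were ready for pickup.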


Posted in Life

C++ Concurrency Memo

Posted by binglongx on October 3, 2015

[From C++ Concurrency In Action, Anthony Williams]

  • Invariant: A statement that is always true about a particular data structure; consistency. E.g., the member variable contains the number of elements in the list.
  • Race condition: Anything where the outcome depends on the relative ordering of execution of operations on two or more threads. The threads race to perform their respective operations.
    • Benign race condition: All possible outcomes are acceptable in such a race condition.
    • Problematic race condition: A race condition that leads to broken invariants. When we talk about race conditions, we usually mean problematic race conditions.
    • Data race: A specific race condition that arises because of concurrent modification to a single object. A data race causes undefined behavior.
  • Memory model
    • Structural aspect: how things are laid out in memory
      • Object: A region of storage to represent data in memory.
      • Memory location: Either a scalar type object or a sequence of adjacent bit fields.
      • Object and memory location relationship
        • Every variable (scalar, bitfield, struct/class, etc.) is an object. An object may contain sub-objects (e.g., class vs. members).
        • Every object occupies at least one memory location.
        • A variable of fundamental type (int, char, pointer, etc.) is exactly one memory location, whatever its size.
        • Adjacent bit fields are part of the same memory location.
    • Concurrency aspect: it’s all about memory location.
      • If two or more threads access separate memory locations, there is no problem.
      • If two or more threads access the same memory location, you need to be careful.
        • If all threads are only reading the memory location, there is no problem.
        • If any thread is modifying the memory location, there is potential for a race condition.
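
As a small illustration of the last point (my own sketch, not taken from the book), the code below has two threads modify the same memory location. With a plain unprotected int this would be a data race and therefore undefined behavior; making the memory location a std::atomic<int>, or guarding a plain int with a std::mutex, removes the race. The function names countWithAtomic and countWithMutex are invented for this example.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Two threads increment one shared counter 'iterations' times each.
// The counter is a std::atomic<int>, so concurrent modification of the
// same memory location is well defined.
int countWithAtomic(int iterations) {
    assert(iterations >= 0);
    std::atomic<int> counter{0};
    auto work = [&counter, iterations] {
        for (int i = 0; i < iterations; ++i)
            ++counter;  // atomic read-modify-write
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    return counter.load();
}

// Same work, but with a plain int protected by a mutex.
int countWithMutex(int iterations) {
    assert(iterations >= 0);
    int counter = 0;
    std::mutex m;
    auto work = [&counter, &m, iterations] {
        for (int i = 0; i < iterations; ++i) {
            std::lock_guard<std::mutex> lock(m);  // one thread at a time
            ++counter;
        }
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    return counter;
}
```

Both functions should return exactly twice the iteration count. On some toolchains the program needs to be built with thread support enabled (for example -pthread with GCC).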

(to be updated)

Posted in C++

Threaded Pipe Connection in Plumbing

Posted by binglongx on October 3, 2015

When buying a shower head, shower arm, hose with threaded end, or connector, at least these aspects are important:

  • Material. In most cases, both sides of the connection should use the same material.
  • Gender. This is obvious, because only a female-male pair would make a connection.
  • Size. The size is the nominal inside diameter (ID) of the pipe that the part works with. It is nominal because, for a given nominal size, the actual outside diameter (OD) is fixed while the actual inside diameter varies with the pipe wall thickness (i.e., schedule). See Nominal Pipe Size for more details. The only important thing to remember here is that the nominal sizes of both sides of a connection should match. For example, a 1/2-in shower head will connect to a 1/2-in shower arm, but not to a 3/4-in shower arm. Most residential showers and sink faucets are 1/2 inch in size.
  • Thread standard. For a threaded connection, there is more than one standard, so both sides of the connection need to use compatible standards.

The two major thread standards are:

  • NPT (National Pipe Tapered). NPT has tapered threads, so the joint tightens itself as it is screwed together. It is able to provide mechanical joining and fluid sealing simultaneously. Often pipe tape is applied to the threads before connection for better sealing. NPT is often used for more or less permanent connections, for example, a shower head to a shower arm, because it needs to be tightened to some extent to avoid leaking.
  • IPS (Iron Pipe Straight). IPS only provides mechanical joining. The fluid sealing is often provided by a rubber washer. When the threaded connection is tightened, the two sides sandwich the washer in the middle to create a sealed water path without leaking. IPS is often used for quick connections where using tools would be inconvenient, for example connecting a garden hose to an outdoor faucet.


Posted in Home

Upgrading Android Studio Breaks Old Project

Posted by binglongx on September 28, 2015

I had an old Android Studio 0.5.2 project that built a simple Android beat maker app, Simple Metronome. I needed to modify the app to use a wider bpm range. However, I had since upgraded Android Studio to version 1.3.2. After loading the old project, it immediately showed errors.

One of the errors was with app\build.gradle in the line "runProguard false":

    …Could not find method runProguard() for arguments [false]…

It seemed that Gradle in the newer Android Studio no longer has the runProguard() method. Per this StackOverflow thread, I modified the line to “minifyEnabled false” and it fixed the problem.

Then Android Studio was not happy with lines like “<item name="actionModeShareDrawable">” in certain intermediate values.xml files:

    …No resource found that matches the given name: attr ‘android:actionModeShareDrawable’…

According to StackOverflow, this was caused by AppCompat requiring API 21 of the build tools. Following the suggestion, I modified app\build.gradle to have:

    compileSdkVersion 21
    buildToolsVersion '21.0.1'

There were other minor errors in the middle, but Android Studio 1.3.2 was nice enough to show clickable links in the Messages window, for example, to install a new Gradle plug-in or new version of build tool. I followed the links and the problems quickly went away.

After that I was able to build the app with wider bpm range without problems.

I also noticed that “Make Project” does not generate the apk file in Android Studio 1.3.2. I have to click “Run” to get the apk file built under app\build\outputs\apk.

Posted in Android

shared_ptr Constructor Deletes Raw Object At Failure

Posted by binglongx on September 13, 2015

A Simple Example

Consider this example:

#include <iostream>
#include <memory>

void foo()
{
    std::shared_ptr<int> sp;

    // create an object
    int* p = new int(42);

    // try to let sp own the object
    try
    {
        sp = std::shared_ptr<int>(p);
    }
    catch(std::bad_alloc& e)
    {
        // failed: out of memory, sp does not own the object
        std::cout << *p << std::endl;
        delete p;
        return;
    }

    // success: now sp owns the object
    // do our real work below
    *sp = 55;
    std::cout << *sp << std::endl; // "55"
    // RAII sp dtor deletes p
}

int main()
{
    foo();
    return 0;
}

At Line 14 (the assignment sp = std::shared_ptr<int>(p)), the code tries to construct a temporary shared_ptr to own the object. If that succeeds, sp owns the object and RAII takes care of deleting it in the end. However, the first shared_ptr for a raw object needs to allocate a control block to hold the reference count and other information (such as the deleter), so there is a chance that construction fails, in which case the constructor throws std::bad_alloc. Line 20 (delete p in the catch block) tries to delete the raw object if the shared_ptr cannot be constructed.

The code appears reasonable, but in fact it is buggy. If std::bad_alloc is thrown, Lines 19 and 20 (the dereference of p and delete p in the catch block) cause undefined behavior, most likely a crash. The reason is that the shared_ptr constructor deletes p before throwing the exception if it cannot complete the construction.

This is a bit weird. Normally, as good practice, if you cannot do something you try not to change anything, i.e., you keep the state as if you had never been called, similar to the strong guarantee of exception safety. By that logic, the shared_ptr constructor should not delete the object pointed to by p.

I think the reason that this overload of shared_ptr constructor chooses to delete p upon failure, is to support this use case:

    std::shared_ptr<int> sp(new int);

In the example above, if the shared_ptr constructor did not delete the object upon failure, the object would be leaked. In fact, this is the preferred usage of shared_ptr. Holding an object in a raw pointer and then initializing a shared_ptr with that raw pointer is discouraged.

In the first example, even though Lines 19 and 20 are buggy, execution is very unlikely to reach them. If you run the code, you will find that it almost always just prints “55”. The reason is that there is only a very slim chance that shared_ptr actually fails to allocate the tiny control block object (only a few bytes).

A Better Example

The example below can demonstrate the bug.

#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Yoo
{
    Yoo(const char* s) : str(new std::string(s)) {}
    ~Yoo() { delete str; str = nullptr; }
    std::string* str;
};

const size_t K = 1024;
const size_t SZ = 16 * K;    // some big size

struct YooDeleter
{
    void operator()(Yoo* p) { delete p; }
    char dummy[SZ];    // make the deleter big
};

void bar()
{
    std::shared_ptr<Yoo> sp;

    Yoo* p = new Yoo("Hello");

    // try to exhaust heap memory.
    std::vector<std::unique_ptr<YooDeleter>> holder;
    holder.reserve(2048 * K); // 2M entries, big enough
    for (;;)
    {
        try
        {
            std::unique_ptr<YooDeleter> block(new YooDeleter);
            holder.push_back(std::move(block));  // keep the block allocated
        }
        catch (...)
        {
            break;  // cannot allocate one more block
        }
    }
    std::cout << "Heap used: " << holder.size() << " x " << SZ / K << "KB" << std::endl;

    try
    {
        YooDeleter x;
        std::cout << "Creating shared_ptr..." << std::endl;
        sp = std::shared_ptr<Yoo>(p, x);
    }
    catch (std::bad_alloc& e)
    {
        // cannot construct sp: out of memory
        std::cout << "Out of memory..." << std::endl;
        std::cout << "I am about to crash..." << std::endl;
        std::cout << p->str->c_str() << std::endl; // undefined behavior, mostly crash
        delete p;   // will not even get here!
    }
    catch (...)
    {
        std::cout << "Unknown error..." << std::endl;
    }

    // it will never be here!
    *sp->str = "Bye";
    std::cout << sp->str->c_str() << std::endl; // "Bye"
}

int main()
{
    bar();
    return 0;
}

To demonstrate the crash, Yoo has a raw pointer member. Whenever a Yoo object is valid, the pointer points to a string object. Once the Yoo object is destroyed, the pointer is nullptr, and any attempt to dereference it results in undefined behavior, most likely a crash. YooDeleter is a dumb deleter for Yoo objects. Additionally, YooDeleter is big and takes 16KB of memory. When a shared_ptr is initially created for a raw Yoo object with a YooDeleter object, the control block of the shared_ptr will be at least as big as YooDeleter, i.e., 16KB+ of memory.

We need to manually create a condition in which shared_ptr fails to allocate the memory for the control block. The best way to achieve that is to prove that allocating that much memory fails. So we try to exhaust the heap memory by repeatedly allocating YooDeleter objects until one more cannot be allocated. At this point, we are sure that the shared_ptr constructor call will fail internally.

The code above has been tested on Windows:


As a 32-bit Windows application, the code exhausted the heap with about 126K of 16KB blocks, i.e., about 2GB memory.

The same code in a second test environment is able to get 1GB out of the heap before it crashes and shows the bug:


Notice that the 16KB size of YooDeleter was carefully chosen.

If I change it to 32KB, Visual Studio 2013 has no problem with it (in fact I tried 512KB and so on as well) and still exposes the code problem:


The second test environment, however, has a problem with the larger size. It crashes at the shared_ptr construction call without throwing std::bad_alloc or any other exception, and emits Runtime error signal 11. I am not sure whether its C++14 toolchain is broken or the run-time environment has some problem.



shared_ptr construction takes over the object from the raw pointer, no matter whether the construction succeeds or fails. Even attempting to construct a shared_ptr means that the raw pointer loses ownership of the object. If the construction fails due to lack of memory, the object is deleted for good, since the owning shared_ptr never comes to life. Always try to follow this usage to avoid confusion:

    std::shared_ptr<T> sp(new T(...));

Or, better, if you do not use custom deleter, use std::make_shared for less typing and better memory efficiency (where the control block is allocated with the managed object as a single piece of memory):

    auto sp = std::make_shared<T>(...); // ... : the T constructor parameters
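
The ownership rule can also be observed without exhausting the heap. The sketch below is my own illustration, not one of the tests above. It uses the shared_ptr constructor overload that takes a deleter and an allocator, and passes a made-up allocator (FailingAllocator is hypothetical, not a standard type) whose allocate always throws std::bad_alloc, so the control block allocation is guaranteed to fail. The C++ standard specifies that the deleter is called on the pointer if this constructor throws, so the deleter should already have run by the time the exception is caught.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>

// Hypothetical allocator for this demo only: every allocation fails.
template <typename T>
struct FailingAllocator {
    using value_type = T;
    FailingAllocator() = default;
    template <typename U>
    FailingAllocator(const FailingAllocator<U>&) noexcept {}
    T* allocate(std::size_t) { throw std::bad_alloc(); }
    void deallocate(T*, std::size_t) noexcept {}
};
template <typename T, typename U>
bool operator==(const FailingAllocator<T>&, const FailingAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const FailingAllocator<T>&, const FailingAllocator<U>&) { return false; }

// Returns true if the deleter ran on p even though the shared_ptr
// construction failed with std::bad_alloc.
bool deleterRunsOnFailedConstruction() {
    bool deleted = false;
    bool threw = false;
    int* p = new int(42);
    try {
        std::shared_ptr<int> sp(p,
                                [&deleted](int* q) { deleted = true; delete q; },
                                FailingAllocator<int>());
    } catch (const std::bad_alloc&) {
        threw = true;  // p must not be dereferenced or deleted here
    }
    assert(threw);
    return deleted;
}
```

If deleterRunsOnFailedConstruction() returns true, the object was already destroyed inside the failed constructor call, which is exactly why touching p in the catch block is a bug.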

Posted in C++

Compile-Time Calculation, constexpr and Variadic Template

Posted by binglongx on September 10, 2015

The Factorial Example

Recursive function

Consider the factorial example:

#include <iostream>

int factorial(int n)
{
    if (n == 0)
        return 1;  // base case
    return n*factorial(n-1);
}

int main()
{
    std::cout << factorial(4) << std::endl;  // "24"
    return 0;
}

The recursive function calculates the factorial at run-time. Notice the base case where recursion can end.

Class template

Sometimes, we may need to calculate things at compile-time. The compiler calculates the value, so there is no run-time cost. Template meta-programming can perform compile-time calculations. For example, this class template calculates factorial at compile-time:

#include <iostream>

template<int n>
struct Factorial {
    static const int value = n*Factorial<n-1>::value;
};

// base case
template<>
struct Factorial<0> {
    static const int value = 1;
};

int main()
{
    std::cout << Factorial<4>::value << std::endl; // "24"
    return 0;
}

Function template

Since the factorial calculation does not involve type calculation, it is also possible to use function template to calculate it at compile time:

#include <iostream>

template<int n>
int factorial() { return n * factorial<n-1>(); }

// base case
template<>
int factorial<0>() { return 1; }

int main()
{
    std::cout << factorial<4>() << std::endl; // "24"
    return 0;
}

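Notice that both need to use specialization to make a base case such that the compile-time recursion can end. With the class template, the value 24 is calculated by the compiler, and the factorial takes no CPU cycles at run-time. With the function template, the recursion is expanded at compile-time into a fixed chain of calls, which an optimizing compiler typically folds into the constant 24 as well.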

constexpr function

The class template and function template approaches have a drawback: they cannot do run-time calculation.

    int a = rand();      // a: run-time value
    Factorial<a>::value; // compile error: a is not constant expression
    factorial<a>();      // compile error: a is not constant expression

If we want the run-time calculation of factorial, we have to put back the recursive factorial function in the first example.

In C++11, constexpr can be used to create a function that can be calculated at both compile-time and run-time. This is a constexpr factorial:

#include <iostream>
#include <cstdlib>    // rand

constexpr int factorial(int n)
{
    return (n==0)? 1 : n*factorial(n-1);
}

int main()
{
    constexpr int a = factorial(4);
    std::cout << a << std::endl;  // "24"
    int b = rand();
    if( b<0 )
        b = 0;
    else if( b>10 )
        b = 10;
    std::cout << "factorial of "<<b<<" : "<<factorial(b) << std::endl;
        // "factorial of 10 : 3628800"
    return 0;
}

Notice that the same function can be “called” at both compile-time and run-time. Now we only need to maintain the sole constexpr factorial function.
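
One way to confirm that a call is really evaluated by the compiler is to use it where a constant expression is required. A minimal sketch, reusing the constexpr factorial from above (the names kSlots and slots are only for illustration):

```cpp
// C++11 constexpr factorial, same as above.
constexpr int factorial(int n)
{
    return (n == 0) ? 1 : n * factorial(n - 1);
}

// These only compile if factorial(...) can be evaluated at compile-time.
static_assert(factorial(0) == 1, "base case");
static_assert(factorial(4) == 24, "4! == 24");
static_assert(factorial(10) == 3628800, "10! == 3628800");

// A constexpr result can also be used as an array bound.
constexpr int kSlots = factorial(3);
int slots[kSlots] = {};  // array of 6 ints
```

If factorial were an ordinary (non-constexpr) function, the static_assert lines and the array declaration would be rejected by the compiler.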

The Max Example

The constexpr factorial function is beautiful, not only because it can be calculated both by the compiler and at run-time, but also because its base case is embedded in the function itself, just like the run-time recursive function. This means that it does not need a class template specialization or function template specialization to serve as the base case. Having the specializations incurs extra constructs to maintain and poses additional burden.

Let’s consider a different example: a max function that takes an arbitrary number of parameters.

We will skip the run-time version of max, because it is trivial to write such a function to take a container or a range of two iterators, perform pair-wise comparison and return the result.

Variadic template

Here is a variadic template version of max:

#include <iostream>

// base case
template<typename T>
const T& max(const T& a) { return a; }

template<typename T, typename... Ts>
const T& max(const T& a, const T& b, const Ts&... rest)
{
    const auto& m = max(b, rest...);
    return a < m ? m : a;
}

int main()
{
    int m1 = max(1);
    int m2 = max(1,2);
    int m3 = max(1,2,3);
    int m4 = max(1,2,3,4);
    std::cout << m1 << " " << m2 << " " << m3 << " " << m4 << std::endl; // "1 2 3 4"
    return 0;
}

The variadic templated function is pretty cool because it can take any number of arguments and remains type-safe.

constexpr variadic template

But again, the result of the max function above is not constexpr, therefore not compile-time known:

    constexpr int m = max(1,2,3,4); // compile error
    float a[m];

So we can try to create a constexpr function, from which we would also like to remove the base case specialization:

template<typename T, typename... Ts>
constexpr const T& max(const T& a, const Ts&... rest)
{
    if (sizeof...(Ts) == 0)
        return a;
    else
    {
        const auto& m = max(rest...); // no matching function for call to 'max()'
        return a < m ? m : a;
    }
}

int main()
{
    constexpr int m1 = max(1);
    return 0;
}

The code, however, does not compile. The reason is that, when sizeof...(Ts) is 0, the compiler sees max(rest...) as a max() call, and there is no match for such a call. In fact, because the variadic template function implementation relies on this recursive call technique, it seems that it would eventually arrive at a function body that contains a call not matching the function template itself. Basically, we have to have a function overload or function template specialization to be the base case.

This is the correct constexpr max implementation:

#include <iostream>
#include <cstdlib>  // rand

// base case
template<typename T>
constexpr const T& max(const T& a) { return a; }

template<typename T, typename... Ts>
constexpr const T& max(const T& a, const T& b, const Ts&... rest)
{
    const auto& m = max(b, rest...);
    return a < m ? m : a;
}

int main()
{
    //int a = max(1, 3.14); // compiler error
    //constexpr int m0 = max(); // compile error: no matching call to max()
    constexpr int m1 = max(1);
    constexpr int m2 = max(1,2);
    constexpr int m3 = max(1,2,3);
    constexpr int m4 = max(1,2,3,4);
    std::cout << m1 << " " << m2 << " " << m3 << " " << m4 << std::endl; // "1 2 3 4"
    int b = rand();
    //constexpr int c = max(b); // b is not usable in constant expression
    int c = max(b, b+1, b+42);
    std::cout << c-b << std::endl;    // "42"
    return 0;
}

As you can see, the constexpr max function call can be calculated by the compiler for constexpr m1 through m4, as well as at run-time for c. At the same time, because it is a variadic template function, you can pass as many arguments as you want (obviously you cannot pass 0 arguments to max). The max function would also complain if your arguments are not of the same type.
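
For readers on newer compilers: C++17 added if constexpr, which discards the untaken branch when the template is instantiated, so the empty max() call is never generated and the base case can live inside the function after all. The following is only a sketch assuming a C++17 compiler; the name max_of is chosen here just to avoid clashing with the max above.

```cpp
#include <type_traits>

template<typename T, typename... Ts>
constexpr const T& max_of(const T& a, const Ts&... rest)
{
    // keep the same-type requirement of the two-overload version
    static_assert((std::is_same_v<T, Ts> && ...),
                  "all arguments must have the same type");
    if constexpr (sizeof...(rest) == 0) {
        return a;                      // base case
    } else {
        const T& m = max_of(rest...);  // only instantiated when rest is non-empty
        return a < m ? m : a;
    }
}
```

With this version, constexpr int m4 = max_of(1,2,3,4); is evaluated by the compiler, and max_of(1, 3.14) is rejected by the static_assert.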


Just some interesting exploration of compile-time calculation, constexpr and variadic template.

Posted in C++

Build and Run OpenCL Application with AMD APP SDK and Installable Client Driver Loader through HelloWorld Sample

Posted by binglongx on September 3, 2015

The installation of AMD OpenCL driver was discussed in a previous post. Here let’s explore a bit more on how the AMD OpenCL HelloWorld sample project is built and run.

Include Header File

HelloWorld.cpp includes the OpenCL header file CL/cl.h. The project's Additional Include Directories setting shows that it comes from $(AMDAPPSDKROOT)/include:


As part of the AMD APP SDK installation, the environment variable AMDAPPSDKROOT is set to the directory where the SDK is installed, C:\Program Files (x86)\AMD APP SDK\3.0 on my PC:


And it’s not surprising that we can find the include subdirectory there, and CL/cl.h in it:


This is the header file that contains the OpenCL API functions, such as clGetPlatformIDs and others.

Link Library File

Eventually, the function and symbol names from the cl.h header file should be found by the linker when the executable is built.

The relevant additional library directory is $(AMDAPPSDKROOT)/lib:


And the library file to be linked is OpenCL.lib:


In fact, AMD APP SDK provides library files for both 32-bit (x86) and 64-bit (x86-64) Windows:


The library file OpenCL.lib is however very small, only 28KB:


This probably means that OpenCL.lib is an import library rather than a static library. The LIB tool confirms this:


No .obj files are listed and only OpenCL.dll is referred to in OpenCL.lib.

Dynamic Link Library File

It is clear that HelloWorld.exe will need the dynamic link library OpenCL.dll to run. Go to the directory where HelloWorld.exe is built:


We do not find OpenCL.dll here. Dragging the 32-bit HelloWorld.exe into the 32-bit Dependency Walker window reveals that HelloWorld.exe does depend on OPENCL.DLL:


Dependency Walker also shows that OpenCL.DLL is from C:\Windows\System32. But 64-bit Dependency Walker shows that 64-bit HelloWorld.exe also depends on a 64-bit OpenCL.dll from the same C:\Windows\System32 directory:


If we check carefully, we can see that the two OpenCL.dll files are different: the 32-bit version is 58KB, and the 64-bit version is 64KB. But they have the same filename and cannot exist in the same directory!

VoidTools Everything search tool shows that the OpenCL.dll at C:\Windows\System32 is 64KB, but OpenCL.dll under C:\Windows\SysWOW64 is 58KB:


It turns out that the 32-bit version of Dependency Walker sees a faked C:\Windows\System32 directory, which in fact is C:\Windows\SysWOW64 on Windows 7 x64. This is how 64-bit Windows tricks 32-bit applications into believing that they are running in a real 32-bit environment. The WOW64 DLLs are automatically seen by 32-bit applications as system-wide available DLLs.

Nevertheless, two versions of OpenCL.dll are installed to Windows system directories (32-bit and 64-bit) by OpenCL driver installation, and therefore available to all applications.

Installable Client Driver Loader

Everything seems fine so far. HelloWorld is able to find the header file, link to the import library, and run with OpenCL.dll installed at the system directory.

But checking the OpenCL.dll file with Dependency Walker reveals something that does not make sense:

  • OpenCL.dll only depends on ADVAPI32.DLL and KERNEL32.DLL, both of which are Windows system DLLs. This suggests that the bulk of the OpenCL implementation lives in OpenCL.dll itself, not in other DLLs it depends on.
  • But OpenCL.dll is very small: both versions are no more than 64KB. It’s very unlikely that AMD is able to pack their OpenCL implementation into such a tiny binary file.

If we check the Details tab of the OpenCL.dll properties, it shows something interesting:


The DLL is not provided by AMD! Rather, it is built by Khronos, the standardization organization of OpenCL (and OpenGL and others). Obviously, AMD’s OpenCL implementation is not in this DLL.

However the product name, Khronos OpenCL ICD, gives the hint about the nature of this DLL file. ICD stands for Installable Client Driver. This is a mechanism to allow multiple OpenCL implementations from different vendors to co-exist in the same system. The mechanism has a few parts:

  • ICD Loader. This is a “well-known” proxy that the user OpenCL application talks to. In our case, it is OpenCL.dll provided by Khronos.
    • ICDL implements all OpenCL API functions. So the user does not need to know the actual OpenCL implementation and needs only to talk to OpenCL.dll for all OpenCL business.
    • ICDL forwards actual OpenCL API calls to actual vendor implementations.
  • ICD Vendor Libraries, i.e., ICDs. They are the actual OpenCL implementation libraries from vendors. There can be multiple OpenCL libraries from different vendors in the system.
  • ICD Loader Vendor Discovery. The discovery has two parts:
    • Vendor enumeration. The vendors need to register their libraries to the system. This is where the ICD Loader finds all vendor libraries:
      • On Windows, vendor library DLL file paths can be found as names in Windows registry key HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors with DWORD value 0.
      • On Linux, each vendor library should drop a text file in the directory /etc/OpenCL/vendors, where the text file should have only one line with the shared object file path.
      • On Android, although pretty much Linux, ICD is only available with OpenCL 2.0 and later (see source).
    • Adding libraries. For each enumerated vendor library:
      • ICD Loader dynamically loads the library through LoadLibrary/dlopen;
      • ICD Loader queries for symbols for clIcdGetPlatformIDsKHR, clGetPlatformInfo and clGetExtensionFunctionAddress in the dynamically loaded library through GetProcAddress/dlsym;
      • ICD Loader calls clIcdGetPlatformIDsKHR, clGetPlatformInfo and clGetExtensionFunctionAddress to get the available platforms, their information and the extension function addresses, and makes sure the vendor library is ICD compliant.
      • If any of the steps above fails, the vendor library is considered not ICD compliant, and ignored by ICD Loader.

How ICDL Works

The OpenCL user application always calls clGetPlatformIDs first before it makes any other OpenCL calls. The implementation of clGetPlatformIDs in ICD Loader performs the discovery step described above, and possibly returns the aggregated platform_ids from different ICD vendors. For any ICD compliant driver, the returned platform_id object must have a dispatch member:

typedef struct _cl_platform_id* cl_platform_id;  // cl.h

struct _cl_platform_id    // in vendor implementation
{
    struct _cl_icd_dispatch *dispatch;
    // ... remainder of internal data
};


The definition of _cl_icd_dispatch, which contains function pointers to all OpenCL API functions, is provided by Khronos to its members. It is not public, but is similar to this:

struct _cl_icd_dispatch
{
    CL_API_ENTRY cl_int (CL_API_CALL *clGetPlatformIDs)(
        cl_uint          num_entries,
        cl_platform_id * platforms,
        cl_uint *        num_platforms) CL_API_SUFFIX__VERSION_1_0;

    CL_API_ENTRY cl_int (CL_API_CALL *clGetPlatformInfo)(
        cl_platform_id   platform,
        cl_platform_info param_name,
        size_t           param_value_size,
        void *           param_value,
        size_t *         param_value_size_ret) CL_API_SUFFIX__VERSION_1_0;

    /* ...continues... */
};


In fact, the struct _cl_icd_dispatch* dispatch member is in every OpenCL object in addition to cl_platform_id. Therefore, almost every OpenCL API function implementation in ICD Loader is a straightforward redirection similar to this:

cl_abc clXYZ(cl_object_type obj, ...) {
    return obj->dispatch->clXYZ(obj, ...);
}


Here, cl_object_type can be any OpenCL object type, for example cl_platform_id, cl_device_id, cl_context and so on, while clXYZ is a public OpenCL function. This is possible because ICD requires every OpenCL object to have the dispatch field.

In a rough C++ analogy:

  • _cl_icd_dispatch is an interface that contains all the OpenCL API functions as virtual methods.
  • Each OpenCL object implements the _cl_icd_dispatch interface.
  • ICD Loader calls the corresponding method in the interface, therefore the implementation of the virtual function from the vendor is called.
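The redirection can be mimicked in plain C with toy types. The sketch below is only an illustration under my own assumptions: `toy_dispatch`, `toy_object`, `toy_get_vendor_id` and the two "vendors" are made-up names, not part of OpenCL or the real _cl_icd_dispatch. It shows how one loader-side function reaches different vendor implementations purely through the dispatch pointer stored in the object.

```c
#include <assert.h>
#include <stddef.h>

struct toy_object;

/* Toy counterpart of _cl_icd_dispatch: a table of function pointers. */
struct toy_dispatch {
    int (*get_vendor_id)(const struct toy_object *obj);
};

/* Toy counterpart of an OpenCL object: the dispatch pointer comes first,
   followed by vendor-private data. */
struct toy_object {
    const struct toy_dispatch *dispatch;
    int payload;
};

/* Two hypothetical vendor implementations of the same API function. */
static int vendor_a_get_vendor_id(const struct toy_object *obj)
{
    return 100 + obj->payload;
}

static int vendor_b_get_vendor_id(const struct toy_object *obj)
{
    return 200 + obj->payload;
}

/* Each vendor library owns one dispatch table. */
static const struct toy_dispatch vendor_a_table = { vendor_a_get_vendor_id };
static const struct toy_dispatch vendor_b_table = { vendor_b_get_vendor_id };

/* Loader-side function: knows nothing about vendors, only forwards the
   call through the table found in the object itself. */
static int toy_get_vendor_id(const struct toy_object *obj)
{
    assert(obj != NULL && obj->dispatch != NULL);
    return obj->dispatch->get_vendor_id(obj);
}
```

An object created by vendor A carries &vendor_a_table, so toy_get_vendor_id lands in vendor A's code; the same call on a vendor B object lands in vendor B's code. This is the hand-written equivalent of a C++ virtual call.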

AMD OpenCL Implementation

On 64-bit Windows, you can run either the 64-bit version of Registry Editor (C:\Windows\regedit.exe) or the 32-bit version (C:\Windows\SysWOW64\regedit.exe). They show exactly the same registry information.

This is the OpenCL ICD registration under the native 64-bit Windows registry key HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors. The 64-bit OpenCL.dll finds entries here.


The reflected registry key for 32-bit applications on 64-bit Windows is HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Khronos\OpenCL\Vendors. The 32-bit OpenCL.dll finds entries here.


Notice that neither of the Vendors entries above has a full path to the ICD DLL. Basically, it means those DLL files are installed under the Windows system directories. As the Everything search tool shows, the AMD OpenCL 64-bit driver amdocl64.dll is under the C:\Windows\System32 directory, and the 32-bit driver amdocl.dll is under C:\Windows\SysWOW64. They are registered in the corresponding registry key for ICD Loader to discover.
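For readers who prefer code to screenshots, here is a rough C sketch of reading those Vendors keys. It is my own illustration, not how OpenCL.dll is actually written: `vendors_subkey`, `entry_has_directory` and `list_vendor_entries` are hypothetical helper names, the sketch assumes it runs as a 64-bit process, and the registry part compiles only on Windows (RegOpenKeyExA, RegEnumValueA and RegCloseKey are the standard Win32 registry calls).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Subkey under HKEY_LOCAL_MACHINE as seen from a 64-bit process.
   use_32bit_view selects the Wow6432Node key mentioned above. */
static const char *vendors_subkey(int use_32bit_view)
{
    return use_32bit_view
        ? "SOFTWARE\\Wow6432Node\\Khronos\\OpenCL\\Vendors"
        : "SOFTWARE\\Khronos\\OpenCL\\Vendors";
}

/* Returns 1 if the entry carries a directory part, 0 if it is a bare
   file name such as amdocl64.dll. */
static int entry_has_directory(const char *value_name)
{
    assert(value_name != NULL);
    return strchr(value_name, '\\') != NULL ||
           strchr(value_name, '/') != NULL;
}

#ifdef _WIN32
/* Prints each value name under the Vendors key and returns how many
   were found, or -1 if the key cannot be opened. Names longer than
   the buffer simply end the loop in this sketch. */
static int list_vendor_entries(int use_32bit_view)
{
    HKEY key;
    DWORD index = 0;
    int count = 0;

    if (RegOpenKeyExA(HKEY_LOCAL_MACHINE, vendors_subkey(use_32bit_view),
                      0, KEY_READ, &key) != ERROR_SUCCESS)
        return -1;

    for (;;) {
        char name[1024];
        DWORD name_len = (DWORD)sizeof name;
        if (RegEnumValueA(key, index++, name, &name_len,
                          NULL, NULL, NULL, NULL) != ERROR_SUCCESS)
            break;
        printf("%s (%s)\n", name,
               entry_has_directory(name) ? "full path" : "file name only");
        ++count;
    }
    RegCloseKey(key);
    return count;
}
#endif
```

On the machine described here, calling list_vendor_entries(0) and list_vendor_entries(1) would be expected to report the two AMD entries as bare file names, matching what Registry Editor shows.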



As expected, the actual AMD OpenCL implementation DLL files are much bigger than the ICD Loader: 47MB for 64-bit and 39MB for 32-bit! A bit more detail about the AMD OpenCL implementation:


Dependency Walker shows that it depends on some Windows system DLLs and OpenGL. It exposes OpenCL functions such as clBuildProgram, and some AMD proprietary functions such as aclWriteToMem.



This blog post walks through the header and import library files used in building the AMD HelloWorld OpenCL sample project, and explores the Khronos ICD Loader OpenCL.dll as well as the AMD OpenCL implementation drivers on Windows when running the sample.

Posted in OpenCL | 1 Comment »

Update Driver of AMD Radeon HD 6470M in HP EliteBook 8460p for OpenCL

Posted by binglongx on September 1, 2015


I would like to try out OpenCL on my laptop. The HP EliteBook 8460p Windows 7 x64 laptop came with an old factory graphics driver for the AMD Radeon HD 6470M chip. The driver was too old and did not support OpenCL. According to the guideline, I need to:

AMD Driver Installers

The first driver installer that I downloaded was the latest one, amd-catalyst-15.7.1-without-dotnet45-win7-64bit.exe (238MB). Although the package appeared too big, it seemed to install successfully without any error message. But when I checked the installation log, there were 4 or 5 obscure error entries. When I ran the installed AMD Catalyst Control Center, it said “Catalyst Control Center cannot be started”:


This was a bad sign. From my past experience with AMD video driver installers, I suspected it had gone wrong again. After I installed AMD APP SDK and built the HelloWorld OpenCL example in Visual Studio, the example crashed at the call to clGetPlatformIDs. It was obvious that the OpenCL driver was not installed correctly either.

GPU Caps Viewer is a very popular tool to check the GPU and driver software status. So I downloaded the latest GPU Caps Viewer 1.23.0 and wanted to check around. However, GPU Caps Viewer also crashed at launch. Obviously it did not expect that calls to functions like clGetPlatformIDs would crash.

So I had to go back to Windows Programs and Features, run AMD Catalyst Install Manager to uninstall all AMD driver software components, and try to reinstall them. While the uninstallation was ongoing, all the displays went blank (I had two external monitors connected to the dock, and the laptop was on the dock). After I had waited for 5 minutes and the monitors did not seem to come back on, I had to hold the power button of the laptop for 10 seconds or so to force it to shut down. When the laptop booted up, it showed an awful 640×480 VGA desktop. Worse, the laptop did not detect my external displays on the dock!

Now I decided to download a different driver installer, amd-catalyst-15.7.1-with-dotnet45-win7-64bit.exe (286MB), hoping for better luck. The bulky thing again seemed to install successfully, but the Catalyst Control Center still did not run, and the screen was still VGA. Basically it did not install successfully. I even thought that maybe HP’s driver, rather than AMD’s driver, would work, and downloaded HP’s corresponding driver sp64287.exe. Installing it did not change anything, and my external displays were still gone.

System Restore

In the end I thought maybe I should quit trying OpenCL on the PC. I wanted my full resolution screen and big external displays back! The last resort was Windows System Restore. I wanted to restore to a restore point from before all this mess. Luckily I had one such restore point. After restoration and reboot, my default laptop screen was at full resolution, but the two external displays were still not detected. System Restore reported that the restoration was not completed successfully, probably due to anti-virus software interference. But this is my work laptop and I could not turn off the anti-virus protection due to lack of security privileges.

I decided to chat with IT for help, to either install the AMD driver properly, or restore to the previous restore point correctly. I was amazed at how today’s IT can work remotely on my computer, but obviously they did not have much experience with this situation.

Manual Driver Update

While IT was remotely working on my laptop, I suddenly wanted to have another try. In Device Manager, under Display adapters, there was only one entry, Standard VGA Graphics Adapter. This was obviously not correct. I right-clicked on the entry and chose Update Driver Software; from there I wanted to manually install the drivers rather than use AMD’s installer.

So I first used 7-Zip to open the downloaded driver installer amd-catalyst-15.7.1-with-dotnet45-win7-64bit.exe. Although it is an executable file, it is in fact a compressed archive of many driver files. 7-Zip revealed that the file contains quite a few directories and files. Skipping the directory with garbled characters (probably 7-Zip’s problem), I extracted the directory with Bin64, Config, Images, and Packages to my hard drive. The drivers seemed to be under the Packages\Drivers directory.

Now I came back to Update Driver Software, and chose Browse my computer for driver software. Then I chose the parent directory with the files extracted from the AMD installer. After a few minutes, the driver was installed. Upon reboot, the two external displays came back on, and the screen resolutions were full!

Device Manager shows that the Display adapter is AMD Radeon HD 7400M Series. It is a bit off from 6470M, but it seems to work fine. At least it is not the lame Standard VGA Graphics Adapter.


I do not have the fancy Catalyst Control Center software installed, but that’s not a big deal.

Better yet, when I run GPU Caps Viewer, it does not crash any more:


More details on OpenCL support of AMD Radeon HD 6470M:


This means that not only the graphics driver but also the OpenCL 1.2 driver was installed correctly. Now the AMD APP OpenCL HelloWorld example code runs:



If you cannot install the AMD graphics driver correctly using AMD’s installer, try extracting the driver files from the installer with 7-Zip and manually updating the driver in Device Manager. This installs the OpenCL drivers as well.

Posted in Computer and Internet | 2 Comments »


Posted by binglongx on August 18, 2015





  • Digital photo
  • Information for filling out the form


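The applicant should use a digital photo taken within the last six months; for the detailed requirements, see the photo requirements on the U.S. Department of State website. If the photo fails only on its size or the position of the head, that website has a free online photo tool: click “START PHOTO TOOL” on the right side of the page, and you may be able to crop the photo into one that meets the requirements.

Once you have the photo, you can test on the DS-160 website whether it will be formally accepted. Go to the DS-160 nonimmigrant visa application web page and, under “Get Started”, select the city where you will apply for the visa. A “Test Photo” link then appears on the page. Click that link to upload your photo and see whether it meets the requirements. If it does not, prepare a new photo and try again.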



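  • Personal information: name (including the Chinese telegraph code), date and place of birth, national ID number, home address, phone number, passport information, etc. Materials to prepare: national ID card, passport, telegraph code of your name.
  • Travel information: nature of the trip, destination, schedule, where you will stay, who pays for the trip, travel companions, dates of past entries to and exits from the US, previous US visa, visa refusal history, etc. Materials to prepare: rough itinerary, history of the last 5 entries to and exits from the US, the previous visa page in your passport.
  • US contact information: name, organization, relationship, address, phone number, email.
  • Family information: parents' names and birthdays, names and US status of immediate relatives in the US, spouse information. Materials to prepare: have the birthdays of the applicant's parents ready in advance.
  • Work and education information: occupation, employer, high school or vocational school, military service, etc.
  • Security and background information: whether you have been involved in terrorism, drug trafficking, persecution, money laundering, espionage and so on. Unless you are quite a character, you will generally just answer NO to all of them.
  • Application location: the city of the US embassy or consulate where the interview takes place.
  • Form preparer: name, address and relationship of the person who filled out the form on your behalf.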

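Here is a DS-160 form that I filled out. All personal information in it has been removed, but you can use it to see which fields need to be filled in. This PDF file was converted and edited with Microsoft Word, so the formatting is not very pretty; please bear with it. If you can fill in all the content by hand first, filling out the form online will be very quick.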



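Go to the DS-160 nonimmigrant visa application web page and, under “Get Started”, select the city where you will apply for the visa (for interview-waiver applications delivered through CITIC Bank, choose Guangzhou).

To start a new application form, choose “START AN APPLICATION”. The page displays an Application ID and asks you to set a question and answer, so that if the session is interrupted you can still use the Application ID and the answer to retrieve the form and continue filling it out.

After that, you fill in a large amount of information page by page. Near the end, you are asked to upload your photo. Finally, confirm and submit. After submission, you can choose “Print Confirmation”; the printed confirmation page is the sheet of paper you hand to the visa officer. The Confirmation Number barcode is the Application ID, usually ten alphanumeric characters, of the form AA00AAA00.

After printing the confirmation page, do not exit right away. Choose “Print Application” to print the complete DS-160 form information. Keep this complete form; it is very useful as a reference the next time you fill out a DS-160 for a visa application. The form I shared above was obtained this way.

If two or more people apply for visas together, you can choose “Create a Family Application” at this point. This starts a new DS-160 form with a new Application ID/Confirmation Number, except that some parts of the form are filled in automatically with the content of the previous form, which saves some effort.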



Posted in Life | 2 Comments »

