KDE and internal storage

Looking at KDE4 and related technologies, a few things about data storage struck me. Not that I know anything about it, but that does not usually prevent me from trying to be clever about it.

Several parts need to index, cache or store data.

Apps

  • Strigi indexes your hard disk. It is just an index and holds no real data
  • Akonadi caches your data, especially email/contacts/calendar items, and in the future probably also instant messaging logs, notes and others
  • Nepomuk handles and stores tags on files and other data
  • Amarok indexes your music collection and stores related info, including statistics

There are probably many others, but these are the ones I noticed.

How do these apps store their things

As the smart reader has guessed, all of these use different storage methods.

Strigi

Indexes your data and stores it in a CLucene on-disk format. If you remove your Strigi indexes, all you lose is the CPU time used to generate them.
Strigi is not a KDE thing, but the KDE libraries depend on it.

Akonadi

Caches your PIM data and makes it easily available to applications. Akonadi does not itself keep the data. The Akonadi people evaluated SQLite but found it insufficient. They also evaluated MySQL/Embedded but had issues with that too, so they ended up using a full MySQL server.
Akonadi is not a KDE thing, but it originates from KDE and is used by the KDE PIM libraries.

Nepomuk

Handles and stores metadata and tags on files, emails and other data. The Nepomuk data is all in RDF format. Nepomuk uses Soprano for storage, and Soprano has pluggable backends.
One Soprano backend, probably the most used, is the Redland backend, which is too slow for effective Nepomuk usage, but is a C++ thing. Redland can internally use Berkeley DB or MySQL. Soprano uses it with the Berkeley DB format.
The other Soprano backend uses Sesame2, a Java RDF store. Quite a few people are against this because it is Java. Internally it seems to use its own on-disk format, but it is much more efficient than the Redland/BDB backend and is the recommended one to use.

Amarok

Indexes and plays your music and keeps statistics on it. The Amarok people evaluated SQLite and found it not good enough for the job. They are going with MySQL/Embedded for now.

Overview

All in all, we have a CLucene index, a full MySQL server, a Berkeley DB or custom format, and MySQL/Embedded in use. I wonder how much communication there has been between these projects regarding the choice of storage.

Wacky ideas

Having said at the beginning that I don’t know enough about what I am talking about, I also feel well suited to make suggestions for the future.

What if …

  • Amarok hooked into the Akonadi MySQL database process
  • Amarok skipped the concept of databases and let Strigi index the music and stored the metadata and statistics in Nepomuk
  • Soprano Redland backend hooked into the Akonadi MySQL database instead of a BerkeleyDB file

That’s it for now. Next up, maybe: interesting places to use Plasma.

24 comments on “KDE and internal storage”
  1. Maki says:

    Would be great if there was a unified Qt/KDE database solution. Something like Phonon and WebKit

  2. Ian Monroe says:

    There was some communication between the Amarok and Akonadi projects. Akonadi has used MySQL for a while now, so it was mostly on the “why did you make the decisions that you made” level. Akonadi’s use of MySQL made the decision to go with MySQL Embedded more attractive because it means we can do something like “Amarok hooked into the Akonadi MySQL database process” pretty easily in the future.

    A small API (probably using D-Bus on the backend to communicate connection details) to enable any KDE application to trivially create MySQL databases would be pretty powerful.

    “Amarok skipped the concept of databases and let Strigi index the music and stored the metadata and statistics in Nepomuk”. There has been some work on this front (Amarok even packages some Strigi scanners), but the big problem is that stream-based parsers like Strigi may never be as good as Taglib, or at least it’s a difficult challenge.

    Daniel Winter did a GSoC project this summer creating a Nepomuk backend for Amarok. It’s scheduled for a KDE 4.2 release.

  3. Nikos says:

    I agree this at least does not look elegant. Are there substantial reasons for not using a common MySQL instance, since it is already used by one framework and can meet the requirements of all? Sincerely, I’m not trying to flame, it’s just curiosity.

  4. Thomas Georgiou says:

    I think some work has been done on amarok using nepomuk to manage its collections, but it was not stable enough yet. It might have been deferred to 2.1.

  5. mike says:

    *** Just a user, so devs please be nice ***

    The source for SQLite has been included into DigiKam. I also wonder why so many different methods have been used, i.e. SQLite, MySQL Embedded and a full MySQL server… I mean… if the whole server is there, why don’t all of the programs use it? Wouldn’t that reduce code size and the memory required to run all these apps when run together?

  6. Probably KDE should have something like Phonon for its data storage.

  7. Jos van den Oever says:

    @Ian Monroe
    You can also write non-streaming analyzers for Strigi. The only downside to those is that they only work on real files. Taglib only works on files for which you have a block device file descriptor. So you lose no power when using Strigi. You would use a throughanalyzer to write this type of analyzer.

    @Ritesh Ray Sarraf
    :-)
    That would be nice for simple things. There already is QSql to abstract away the database access. It still leaves the SQL incompatibilities though. Something like Hibernate would be really nice. It is an ORM framework for Java.

  8. pvandewyngaerde says:

    what if the next-gen filesystem could do this all transparently?

  9. Lee says:

    I agree that all of these should use a common layer. For most app uses, Nepomuk seems like a good solution, with the DB choice behind that. However, I’d also like to see Qt have DB-bound widgets, like VB does, where you can just drop a table into your app visually, and attach it to a DB table or query. This is necessary for RAD, and, of course, a fairly basic feature these days.

  10. DanielW says:

    Ok, a few things to add/correct here:

    In a default KDE 4.1 installation as shipped by KDE, it works this way:

    Strigi:
    It uses Nepomuk for storage. It has no CLucene index of its own.

    Nepomuk:
    It uses, as you said, either Redland or Sesame2 as RDF storage. But it also has a CLucene full-text index, and Nepomuk stores the data coming from Strigi in both.

    And for Amarok:
    As Ian said, I am working on a Nepomuk backend in Amarok and a service in Nepomuk which will use Strigi’s data together with Taglib to get the music data into Nepomuk. When using this there is no extra need to do a collection scan in Amarok. (But, well, it will be slower than the MySQL Embedded solution. So everyone has to decide what is more important for them: speed, or integration and extra features.)

    And redland using mysql:
    I have not tried that. But I expect it to be slower than Sesame2. And also redland does not provide all SPARQL features. For example the OPTIONAL keyword is not supported with redland. I do not think that using mysql for redland will change that.

  11. Anon says:

    The Amarok and Akonadi people have evaluated SQLite and decided against using it. Any concrete reason? Or where can we find the “discussion” about this decision?

    (I’m considering using SQLite in a personal project, and it would be great to know what its cons are.)

  12. hvralpha says:

    Great idea to integrate all the applications into a strong open backend like mysql. This makes export to other applications and systems simple and quick to search all types of data. Will also accelerate the KDE development process. Brilliant idea.

  13. @Anon

    Amarok has actually used SQLite for a very long time (in the 1.x series). There are a few reasons why we are switching to MySQL Embedded.

    1. To get rid of the complexities of having multiple database backends. Previously we maintained support for both SQLite and external MySQL and Postgres servers. MySQL Embedded will allow us to supply a default, configuration-free database for most users, and allow power users to connect to an external MySQL server without any overhead in most of our code. This should make both users and developers happy! :-)

    2. SQLite runs very well up to a certain point, but once your database grows too large, MySQL Embedded is significantly faster. Personally, I don’t have that large a local collection, but when using the Jamendo service (which creates a very large database) the speed difference is _very_ significant.

    – Nikolaj

  14. Miguel says:

    So in my desktop I’ll have to run a MySQL server in order to access my own locally stored contact? :(

    Couldn’t it at least have been database server agnostic? Design to an interface, not to an implementation.

  15. Kevin Krammer says:

    @Miguel: it is, since it is using Qt’s database abstraction layer and only very few DB specific parts.

    I.e. some people are working on (or have already completed) the bits for supporting Postgres

    http://techbase.kde.org/Projects/PIM/Akonadi#Which_DBMS_does_Akonadi_use.3F

    @Anon
    http://techbase.kde.org/Projects/PIM/Akonadi#Why_not_use_sqlite.3F

    @Lee
    Qt3 has explicit database widgets, even in designer, and Qt4 has database bound models for its model/view framework

  16. Kanenas says:

    About mysql vs sqlite:
    In my experience SQLite outperforms MySQL when one knows what he is doing. SQLite out of the box is quite conservative and errs on the safe side. Slightly tweaking SQLite can make large differences in speed.

    Please do a little research before judging SQLite. MySQL sacrifices a lot of safety to be fast (notice the difference in the number of fsyncs these databases do out of the box). By sacrificing a little safety, SQLite can outperform MySQL.

  17. Ian Monroe says:

    @Kanenas I wouldn’t doubt this really. But we aren’t database experts, we don’t know how to optimize our queries outside of adding a few indexes, and we certainly don’t know how to hack SQLite. Given that many distros have Amarok using the system SQLite, I wonder if it’s even possible.

  18. Ian Monroe says:

    @Kevin don’t you see what we’re losing from supporting multiple backends? It lessens the potential for a nice desktop-wide database: Amarok is never going to support both MySQL and PostgreSQL, we don’t want to support multiple SQL dialects.

    And you really get nothing in return. Any user only needs one backend at a time. It’s like what Amarok figured out with multiple sound engines in 1.2-1.3: they give headaches but few benefits.

  19. Ian Monroe says:

    @Miguel: both Akonadi and Amarok will start the server for you. You’ll hardly know it’s there. :)

  20. jstaniek says:

    @Jos van den Oever
    “There already is QSql to abstract away the database access. It still leaves the sql incompatibilities though. Something like hibernate would be really nice.”

    There’s KexiDB, and now Predicate (in development)…

  21. b0b says:

    When using SQLite, apart from using indexes and writing optimized queries, here are 2 easy optimizations that will make it faster:

    – enclose as many statements as you can in a BEGIN; … COMMIT; block
    – use “PRAGMA SYNCHRONOUS=OFF” to delay writes to disk.
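
The two tips above can be sketched with Python’s built-in sqlite3 module. This is only an illustration under assumptions: the `tracks` table, the `bulk_insert` helper and the file name are made up for the example and are not taken from Amarok or any other project mentioned here. Note that with synchronous set to OFF a power loss or OS crash can corrupt the database, so it is a trade of safety for speed.

```python
import os
import sqlite3
import tempfile


def bulk_insert(db_path, rows):
    """Insert rows in one explicit transaction with synchronous=OFF.

    Hypothetical helper for illustration; the table layout is assumed.
    """
    # isolation_level=None puts the sqlite3 module in autocommit mode,
    # so the BEGIN/COMMIT below are the only transaction boundaries.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        # Trade durability for speed: SQLite stops waiting for data
        # to be flushed to disk after each commit.
        conn.execute("PRAGMA synchronous = OFF")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS tracks (title TEXT, plays INTEGER)"
        )
        # One transaction around all inserts instead of one per row.
        conn.execute("BEGIN")
        conn.executemany(
            "INSERT INTO tracks (title, plays) VALUES (?, ?)", rows
        )
        conn.execute("COMMIT")
        return conn.execute("SELECT COUNT(*) FROM tracks").fetchone()[0]
    finally:
        conn.close()


db_file = os.path.join(tempfile.mkdtemp(), "collection.db")
track_count = bulk_insert(db_file, [("track %d" % i, 0) for i in range(1000)])
```

Without the explicit transaction, each INSERT would be committed on its own, which is usually where most of the time goes on a default SQLite setup.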

  22. kanenas says:

    To add to the optimizations that b0b said:

    Choose a better page size when creating the database (default 1024 is too small)
    PRAGMA page_size = 4096 (or 8192)
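
A minimal sketch of setting the page size from Python’s built-in sqlite3 module, assuming a brand-new database file. The `create_with_page_size` helper and the `tracks` table are hypothetical names used only for this example. The pragma has to be issued before the first table is created; on an existing database it only takes effect after a VACUUM. Newer SQLite releases (3.12.0 and later) already default to 4096.

```python
import os
import sqlite3
import tempfile


def create_with_page_size(db_path, page_size=4096):
    """Create a new SQLite database file using the given page size.

    Hypothetical helper for illustration; the table layout is assumed.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Must run before any table exists in the new database file.
        conn.execute("PRAGMA page_size = %d" % page_size)
        conn.execute("CREATE TABLE tracks (title TEXT, plays INTEGER)")
        conn.commit()
        # Read back the page size the database actually uses.
        return conn.execute("PRAGMA page_size").fetchone()[0]
    finally:
        conn.close()


new_db = os.path.join(tempfile.mkdtemp(), "paged.db")
effective_page_size = create_with_page_size(new_db, 8192)
```

Whether 4096 or 8192 is faster depends on the filesystem and workload, so it is worth measuring with your own data.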

  23. Jim says:

    I personally think a full SQL server should be a gimmy in a modern OS. MySQL is common enough, respected enough, and really isn’t all that terrible to have loaded all the time for the typical desktop capable system.

    I have a music collection large enough that it makes Amarok painful if used with a sqlite backend. So, I use MySQL with Amarok instead. I don’t notice its footprint much at all. I do notice Amarok and the rest of my system being more responsive when Amarok goes about switching tracks and selecting new tracks to add to a dynamic playlist.

    It would be cool to see more applications using a full SQL database and taking advantage of some of the sophisticated performance-enhancing features that come for free in a more complete database engine (such as key caching).

  24. aexl says:

    What if…

    amarok stored its metadata in nepomuk AND made its great tagging frontend usable to tag ANY content. Imagine Amarok tag&search for openoffice metadata!