Database Forum / General DB Topics / DB Theory / September 2008
Non-text database theory
Rune Allnor - 05 Sep 2008 10:26 GMT
Hi all.
This might be off topic for this group; if so please direct me to a more appropriate group.
I have 20 years of programming experience (hobby / personal scale) and am getting my feet wet with databases for the first time. The project at hand needs a database to handle large amounts of data. The data are measured by sonar and amount to hundreds of GB, so one would prefer to save the data in some binary format to save time on the text <-> binary conversions.
The textbooks I have found on database theory solely deal with text data, i.e. data that are stored as tables in text files, which I suppose is OK for educational purposes.
1) Where can I find material on 'real-life' databases which deal with the storage and handling of binary data?
2) Are there database implementations which are better suited for my application than others? I would like to keep the application platform independent, and use C++ as my programming language.
Rune
jefftyzzer - 05 Sep 2008 18:24 GMT
> Hi all.
> [quoted text clipped - 25 lines]
>
> Rune
Hmmm...there's virtually no limit to the kinds of data a modern RDBMS can store, particularly with the extended type capabilities that came along with the object-relational wave of the last decade. The RMoD certainly doesn't circumscribe (data) types.
Anyway, although it sounds like your textbook is using textual attributes in its examples, RDBMSs are quite capable of efficiently storing and allowing you to manipulate binary data. Are you speaking of sonar *images* here, or some other, more fine-grained, measurement?
As to recommended books, I think for what you're working on, stepping away from theory books (not in general, mind you!) and looking for books that are specific to the RDBMS you're working with (which is, by the way, what?) would take you farther on this specific question.
For books that are more theory-oriented, perhaps _Databases, Types and the Relational Model_ by Date would be of interest to you.
Regards,
--Jeff
Rune Allnor - 05 Sep 2008 20:12 GMT
> > Hi all.
> [quoted text clipped - 30 lines]
> along with the object-relational wave of the last decade. The RMoD
> certainly doesn't circumscribe (data) types.
I'm a total neophyte on the subject; acronyms are foreign to me. RDBMS = Relational DataBase Management System...? RMoD = ?
> Anyway, although it sounds like your textbook is using textual
> attributes in its examples, RDBMSs are quite capable of efficiently
> storing and allowing you to manipulate binary data. Are you speaking
> of sonar *images* here, or some other, more fine-grained, measurement?
It's anything and everything. Lots of data, measurements and information flowing all over the place; keeping track is a full-time job. Literally.
> As to recommended books, I think for what you're working on, stepping
> away from theory books (not in general, mind you!) and looking for
> books that are specific to the RDBMS you're working with (which is, by
> the way, what?) would take you farther on this specific question.
Just looking at options for now. I know what needs to be done; the question is whether there are good or bad ways of doing it.
> For books that are more theory-oriented, perhaps _Databases, Types and
> the Relational Model_ by Date would be of interest to you.
Thanks.
Rune
jefftyzzer - 05 Sep 2008 20:21 GMT
> > > Hi all.
> [quoted text clipped - 58 lines]
>
> Rune
Yep, you got "RDBMS" right, and "RMoD" = the Relational Model of Data, the body of theory collectively undergirding RDBMSs (note that the vendors' fidelity to the model varies, but that's a story for another day).
--Jeff
Ed Prochak - 22 Sep 2008 21:05 GMT
> > > Hi all.
> [quoted text clipped - 58 lines]
>
> Rune
For specific products I was going to suggest PROGRESS, but it looks like they abandoned their database product. (I guess that happens when you haven't looked at a product for over 10 years.)
Still, it may be useful to search some of the vendor sites to see if they have something that does what you want.
HTH,
ed
Tim X - 06 Sep 2008 02:53 GMT
> Hi all.
> [quoted text clipped - 23 lines]
> platform
> independent, and use C++ as my programming language.
Many databases have the concept of a 'blob' (binary large object), which you could use. However, in most cases it isn't going to gain you much.
The data storage and retrieval aspects of a database are only part of the benefits of a DBMS. The real power comes from the ability to retrieve sets of data based on various criteria or attributes. However, with binary data, there is often little in the way of attributes that can be easily identified in the data itself; after all, it's just sequences of 1s and 0s. In fact, storing binary data in the database can actually complicate things because, more often than not, you will use other stand-alone applications to process the data. If it's in the database, you will now need to create some interface between the database and the application that processes the data. This could be as easy as having the database dump the data into a disk file that the application can then read, but then what has the database actually given you?
In most cases, however, you do have meta information about the data. This could be things like the date and time the data was obtained, the location, interesting characteristics, data size etc. This is the data I would store in the database, together with information about where the file is stored in the filesystem. The database could be responsible for generating unique filenames, which is very useful if you have lots of them, as you don't have to think about it and you can use names that are less user friendly, such as just sequential numbers. The DB might even manage a special filesystem hierarchy, grouping files into directories based on certain meta data attributes.
This would give you the best of both worlds: you can obtain from the database lists of data files that meet certain characteristics (e.g. all data from a particular location, date or time) and, at the same time, use other data processing applications on the data directly at the filesystem level, without the additional DBMS layer (assuming the processing doesn't change meta information stored in the database).
The other advantage of this approach is that you won't need one of the larger commercial databases, such as Oracle or DB2. In fact, you could probably use things like SQLite, MySQL or even Berkeley DB hashes.
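The arrangement described above (meta data in a small database, raw data in ordinary files) might look roughly like the following. This is only a sketch under stated assumptions: it uses SQLite through Python's standard sqlite3 module purely to keep the example short, and the table name `recording`, its columns, and the date-based directory layout are illustrative choices that do not come from the thread.

```python
import sqlite3
from pathlib import Path


def open_catalog(db_path):
    """Open (or create) the meta data catalog. Table and column names are illustrative."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS recording (
            id          INTEGER PRIMARY KEY,  -- also used to derive the file name
            recorded_at TEXT NOT NULL,        -- ISO 8601 timestamp
            location    TEXT NOT NULL,
            size_bytes  INTEGER NOT NULL,
            rel_path    TEXT UNIQUE           -- path relative to the data root
        )
        """
    )
    return conn


def store_recording(conn, data_root, payload, recorded_at, location):
    """Write the raw bytes to disk and record only the meta data in the database."""
    cur = conn.execute(
        "INSERT INTO recording (recorded_at, location, size_bytes) VALUES (?, ?, ?)",
        (recorded_at, location, len(payload)),
    )
    rec_id = cur.lastrowid
    # The database picks the name: a sequential number, grouped into one directory per date.
    rel_path = f"{recorded_at[:10]}/{rec_id:08d}.bin"
    target = Path(data_root) / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    conn.execute("UPDATE recording SET rel_path = ? WHERE id = ?", (rel_path, rec_id))
    conn.commit()
    return rel_path


def find_by_location(conn, location):
    """Return the relative paths of all recordings from one location."""
    rows = conn.execute(
        "SELECT rel_path FROM recording WHERE location = ? ORDER BY id", (location,)
    )
    return [row[0] for row in rows]
```

A processing tool can then ask the catalog for a list of paths and open the files directly, without going through the database for the bulk data.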
HTH
Tim
 Signature tcross (at) rapttech dot com dot au
Rune Allnor - 06 Sep 2008 08:50 GMT ...
> In most cases however, you do have meta information about the data. This
> could be things like the date and time the data was obtained, the
[quoted text clipped - 6 lines]
> even manage a special filesystem hierarchy, grouping files into
> directories based on certain meta data attributes.
Your description matches what I want; I am not sufficiently familiar with the terminology to realize that what I was asking for was not a database as such.
This must have been done thousands of times already. I don't want to invent wheels, so is there a description around on how to do these things? One question which immediately comes to mind is how to protect the logged files from being tampered with.
> This would give you the best of both worlds in that you can obtain lists
> of data files from the database that represent data that meet certain
[quoted text clipped - 7 lines]
> larger commercial databases, such as Oracle or DB2. In fact, you could
> probably use things like sql lite, mysql or even Berkley DB hashes.
Ah. Just what I want. Thanks for clarifying the big picture.
Rune
Seun Osewa - 06 Sep 2008 17:28 GMT
> This must have been done thousands of times already. I don't want to
> invent wheels, so is there a description around on how to do these
> things? One question which immediately comes to mind is how to
> protect
> the logged files from being tampered with.
This is not a database question. Some ideas:
- Password protect the computer on which they are stored.
- Encrypt them with OpenSSL and keep the keys yourself.
- Hash the files (SHA1) so you'll know when they've been changed.
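The hashing idea could be sketched as follows with Python's standard hashlib module. The function names are hypothetical. The default here is SHA-256 rather than the SHA1 suggested above, which is a deliberate substitution; passing `algorithm="sha1"` gives the SHA1 variant. The file is read in chunks on the assumption that individual sonar files may be too large to load into memory.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so that large files never sit in memory at once."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def is_unchanged(path, expected_digest, algorithm="sha256"):
    """Compare the file's current digest with the one recorded when it was logged."""
    return file_digest(path, algorithm) == expected_digest
```

A digest only reveals tampering if the recorded value is itself kept somewhere the person modifying the file cannot also change, for example in the meta data database under separate access control.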
Tim X - 07 Sep 2008 04:12 GMT
> ...
>> In most cases however, you do have meta information about the data. This
[quoted text clipped - 17 lines]
> protect
> the logged files from being tampered with.
The answer depends on the OS you're on. For example, if we are talking about Linux or one of the other members of the *nix family, I would probably handle this by creating a specific user and group for the application. You can then control access via normal OS access controls, such as adding users to the group, using umask to ensure file/directory permissions are set appropriately, etc. Under Windows and other platforms you have similar functionality, but I'm not familiar enough with Windows to give a detailed description.
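On a *nix system, the umask part of this could look something like the sketch below, written with Python's standard os module. It assumes a POSIX filesystem; the function names and the chosen modes (0o027 for the umask, 0o440 for finished files) are illustrative assumptions. Assigning the files to the application's group (for example with chgrp) is left out because it needs a real group on the machine.

```python
import os
import stat


def write_group_readable(path, payload):
    """Create a file the owner can write, the group can read, and others cannot touch."""
    old_mask = os.umask(0o027)  # strip group-write and every 'other' bit from new files
    try:
        # O_EXCL refuses to overwrite an existing recording.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o666)
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
    finally:
        os.umask(old_mask)  # restore the caller's umask


def make_read_only(path):
    """Drop write permission once a recording is complete; owner and group may still read."""
    os.chmod(path, 0o440)


def mode_of(path):
    """Return just the permission bits of a file, e.g. 0o640."""
    return stat.S_IMODE(os.stat(path).st_mode)
```

Note that the umask is a per-process setting, so a multi-threaded application would more likely set it once at start-up than change it around each write.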
An important consideration when working out how to lay everything out is backup and restore. If you have lots of data in lots of files, you will want to make sure they are set out in a way that makes adding new data straightforward and that also makes it easy to do backups. How you approach this depends on how much the data changes, the total amount of data, and what backup facilities you have available.
The design of the database to manage the meta data will depend on what meta data you have. However, this is really just the application of good database design principles.
If your database supports database constraints, such as foreign key constraints, check constraints, not null constraints etc, then use them. Some argue these are bad because they restrict your ability to make changes in the future. I think this is rubbish and a sign of a lack of real analysis and design.
Use the datatypes that best match your 'natural' data and how it is to be used. Be wary of data that uses the word number in it; it may not be a number. For example, I can't count the number of systems I've worked on where the original design used a number field for something like a staff number or reference number. These sorts of numbers are often best represented by character types because it's not unusual for them to have leading 0s, which are significant in the sense that they are part of the data. However, if you define the data as a number type, you generally can't have leading zeros. Number types should only be required when you plan to use them in numeric/math type operations.
Try to use the data size that best fits your data model. I often see poor database design where a column has been defined to be as large as possible. Again, this is often done in the misguided belief that it adds flexibility. However, doing so also means that bad data can get into the system. For example, if you know all values in column A should never be larger than 10 characters, then define it to be that large. Then when something tries to insert a larger value, you know that either the value is bogus or there has been some change in your domain and you now need to increase the size of that field. The point is, you are alerted either to the fact that something is doing something it shouldn't or to an error in your underlying data model.
An important point with databases is that the old maxim of GIGO is fundamental (Garbage In Garbage Out). Any database related application is only as good as the quality of the data it manages.
No matter how flash, useful or sophisticated your application, if the data is unreliable, the application is unreliable.
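As an illustration of the advice on constraints, data types and sizes, here is a minimal sketch using SQLite through Python's standard sqlite3 module. The tables, column names, the six-digit staff number format and the 10-character limit on `location` are all made-up assumptions for the example. SQLite does not enforce declared column lengths, so the size limit is written as a CHECK constraint; other databases would typically use a sized character type instead.

```python
import sqlite3


def make_schema(conn):
    """Create two illustrative tables whose constraints reject bad rows at insert time."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
    conn.executescript(
        """
        CREATE TABLE operator (
            -- A staff 'number' stored as text, so leading zeros survive.
            staff_no TEXT PRIMARY KEY NOT NULL
                     CHECK (length(staff_no) = 6 AND staff_no NOT GLOB '*[^0-9]*')
        );
        CREATE TABLE recording (
            id         INTEGER PRIMARY KEY,
            staff_no   TEXT NOT NULL REFERENCES operator(staff_no),
            location   TEXT NOT NULL CHECK (length(location) <= 10),
            size_bytes INTEGER NOT NULL CHECK (size_bytes > 0)
        );
        """
    )


def try_insert(conn, sql, params):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

With this schema, a recording that refers to an unknown staff number, has an over-long location, or has a non-positive size is rejected by the database itself rather than by application code.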
While analysis and modeling are important, it's also important to actually get something up and running. I'm a big believer in doing prototypes. No matter how much analysis, planning and design you do, there will always be things you discover or realise during the implementation that just were not obvious in the planning/design stage. Just trying to do the implementation teaches you a lot about your problem domain that won't be obvious from reading or thinking alone. Identify the core functionality you want to address. Keep it simple and avoid the temptation to add additional functionality (note it down for later, but move on). Keep it really simple and try to solve your key problem first. Add bells and whistles later when you're more comfortable with the problem domain and have a better understanding of it. Try to get something out as quickly as possible and if others are going to use it, get them to start playing with it and get feedback.
HTH
Tim
 Signature tcross (at) rapttech dot com dot au
Volker Hetzer - 11 Sep 2008 20:01 GMT
Rune Allnor wrote:
> Hi all.
> [quoted text clipped - 10 lines]
> prefer to save the data on some binary format to save time on the
> text <-> binary conversions.
Sounds like modeling isn't the big thing in your application. (Might play a role though.) Offhand I can think of three standard applications that store massive amounts of binary data, with a bit of meta stuff around it:
- Pornographic sites have to serve huge amounts of imagery and videos. You might look down your nose at it, but in terms of design and technology they are state of the art in private enterprise.
- Radio telescopes process and filter even greater amounts of data, much like your sonar data but orders of magnitude more.
- The storage and processing facilities of film studios are geared to storage, retrieval and shifting of data around to various processing facilities.
> The textbooks I have found on database theory solely deal with text
> data, i.e. data that are stored as tables in text files, which I
[quoted text clipped - 4 lines]
> the
> storage and handling of binary data?
As others here have already said, most databases can store blobs either internally or (transparently) in a file system. You still access them through the database, but this allows you, for instance, to store the meta data locally and all the binary stuff on a network drive on a file server or large storage area network. It gets more "real life" if you read the database-specific documentation. This, for instance, is for Oracle 11g: http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28393/toc.htm
> 2) Are there database implementations which are better suited for my
> application than others? I would like to keep the application
> platform
> independent, and use C++ as my programming language.
I'm not sure about database independence; I don't think BLOB access has been standardized. But this would be just a couple of classes with some database-dependent innards and a generic interface. Normally BLOB access means that you have to either read out a stream or fetch the data in packets, and all BLOB-capable databases pack this functionality in one shape or other, which you can easily repackage for generic access. As for C++ and platform independence, I'm not sure about that. Most databases offer
- the generic interface (ODBC, OLE DB, ADO.NET) for the platform,
- the database-specific but platform-independent interface (OC(C)I, libmysqlclient), or
- Java connectivity as a platform- and database-independent but language-specific interface.
You decide.
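The "couple of classes with a generic interface" idea might be sketched like this. Python and SQLite are used only to keep the example short and self-contained; the class names `BlobStore` and `SqliteBlobStore`, the table `blob_store`, and the packet-by-`substr()` approach are assumptions made for illustration. A real driver would more likely use its database's own streaming or LOB API inside the subclass.

```python
import sqlite3
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """Generic interface; each database gets its own subclass (names are hypothetical)."""

    @abstractmethod
    def put(self, blob_id, payload): ...

    @abstractmethod
    def read_chunks(self, blob_id, chunk_size): ...


class SqliteBlobStore(BlobStore):
    """One database-dependent implementation, here for SQLite."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS blob_store (id TEXT PRIMARY KEY, data BLOB NOT NULL)"
        )

    def put(self, blob_id, payload):
        self.conn.execute(
            "INSERT INTO blob_store (id, data) VALUES (?, ?)", (blob_id, payload)
        )
        self.conn.commit()

    def read_chunks(self, blob_id, chunk_size):
        """Yield the BLOB in packets of at most chunk_size bytes."""
        offset = 1  # SQLite's substr() counts from 1; on a BLOB it counts bytes
        while True:
            row = self.conn.execute(
                "SELECT substr(data, ?, ?) FROM blob_store WHERE id = ?",
                (offset, chunk_size, blob_id),
            ).fetchone()
            if row is None or not row[0]:
                break  # unknown id, or we have read past the end of the BLOB
            yield row[0]
            offset += chunk_size
```

Application code then depends only on `BlobStore`, and switching databases means writing another subclass with the same two methods.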
As for storing the meta data too in a hierarchical structure, I think it's worth investigating the meshed approach of the entity relationship model. The tree is a special case of it but you'll find soon that an ERM allows you to model your data more precisely and gives you more powerful retrieval possibilities.
So, without knowing the slightest thing about the technical environment your solution has to operate in, nothing about the kind of queries that are run, nothing about required performance or reliability, and not much about security, I'd recommend some kind of database that allows you to separate blob storage and meta data storage.
For tamper-proofing the whole thing, securing the database and file server would be a start. Everything else (database accounts, roles, grants, audit trails, encryption and so on) depends greatly on your application and its users. (How many? How often do they change? Etc.)
Lots of Greetings! Volker
 Signature For email replies, please substitute the obvious.
Evan Keel - 13 Sep 2008 17:42 GMT
> Hi all.
> [quoted text clipped - 25 lines]
>
> Rune
You will be fine. Find a copy of "Handbook of Relational Database Design" (Fleming, von Halle); old school but relevant.
Evan