Safety/Security and Extensibility/Scalability in Software System Design and Architecture

2018-12-25 21:55:00 UTC+8
3313 words
17 min read


The article discusses the importance of security/safety and extensibility/scalability in software architecture. It explains the definitions, comparisons, and scenarios related to these quality attributes, as well as strategies and tactics to improve each attribute. The article also provides examples of how these attributes can be affected by various events and attacks. It concludes by discussing strategies and tactics to enhance each attribute, their benefits and penalties, and the challenges they may face.

Powered by Azure AI Language Service


Security/safety and extensibility/scalability are two pairs of quality attributes that are of great importance in software architecture. This article discusses their definitions and comparisons, followed by their respective general scenarios, which are complemented with typical concrete scenarios to give the reader a clearer and more concrete picture. Finally, some strategies and tactics to improve each attribute are given, each with a description, benefits and penalties.

This is the first assignment for the Software Architecture course at NJUSE.

Definitions and Comparisons

In this part, the definitions of the selected pairs of quality attributes (security/safety and extensibility/scalability), together with the relationships and differences within each pair, are analyzed and presented with examples.


A "safe" system is one in which the harm from accidental mishaps to the system itself can be minimized or avoided. A "secure" system is one in which important properties of the system (like integrity, access, accountability, availability and confidentiality) can still be maintained under intentional attacks.

In other words, "safety" means the ability to reduce the risk of and harm from unintentional mishaps to the system’s stakeholders and valuable assets, whereas "security" means the ability to reduce the risk of and harm from intentional attacks.

Both attributes require a system to preserve its important properties and minimize harm, but in the face of different types of incidents.

"Safety" focuses on unintentional incidents, i.e. incidents that are not meant to cause damage to the system. For example, a "safe" social media system should still be functional if one of its servers is disconnected from the Internet because of a mistake during construction work (like cutting a critical wire); an artificial satellite should stay intact and keep its major components operational when hit by strong cosmic rays, though some degree of slowdown is tolerable; in a safe system, important data (like financial transactions) won’t be unrecoverably lost if a disk containing the data malfunctions unexpectedly.

"Security", on the other hand, is about intentional attacks, i.e. attacks that target the system from the very beginning. For example, in a secure system, no plaintext passwords should be compromised when hackers attack the database; a medical system should hold out long enough under terrorist cyber attacks to get professional assistance from law enforcement, since a breakdown of such a system might cause a disaster; if a hacker has gained access to the system illegally, the system should be able to detect their presence, revoke their privileges, and then report the vulnerability that was exploited as soon as possible.

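The "no plaintext passwords" example can be made concrete with a minimal sketch. It assumes passwords are stored as salted PBKDF2 digests using Python's standard library; the function names and the iteration count are illustrative assumptions, not something prescribed by this article.

```python
import hashlib
import hmac
import os

# Assumed work factor for illustration; tune it for the actual hardware.
ITERATIONS = 200_000


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest). Only these two values are stored, never the password."""
    salt = os.urandom(16)  # a fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Recompute the digest and compare it in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)
```

With this layout, an attacker who dumps the database obtains only salts and digests, so recovering a password still requires a costly guessing attack per account.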

A system always grows over time, but in two directions: vertically or horizontally. A system might need to implement more functions than originally anticipated or change an implementation that has already been made (vertically); a system might also be required to process more requests in the future without excessive changes to the system itself (horizontally).

Both extensibility and scalability focus on the growth of a system, but from different perspectives. Extensibility is the ability to extend vertically: that is, to add "extensions" (like new features or modifications to existing modules) without too many changes or unexpected impacts. Scalability, on the other hand, is the ability and potential to scale horizontally: i.e. the ability to effectively handle a growing or shrinking amount of work using the existing system, or with minimal changes to it.

For example, an extensible frontend project usually means that a new page (to meet newly derived business requirements) can be added easily without a deep dive into existing code. It might also mean that the style of an existing common component can be changed in one place and take effect for all of its occurrences. An extensible system with complex dataflow should be able to integrate a new data processing module without an overhaul of the whole dataflow.
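One possible way to get the "new data processing module without an overhaul" property is a small registry of processing stages. The sketch below is only an illustration under that assumption; `register`, `run_pipeline` and the stage names are hypothetical.

```python
from typing import Callable, Dict, List

Processor = Callable[[List[float]], List[float]]

# Hypothetical registry: the pipeline only knows stage names, not implementations.
_REGISTRY: Dict[str, Processor] = {}


def register(name: str) -> Callable[[Processor], Processor]:
    """Decorator that adds a processing stage to the registry under `name`."""
    def decorator(func: Processor) -> Processor:
        _REGISTRY[name] = func
        return func
    return decorator


def run_pipeline(data: List[float], stages: List[str]) -> List[float]:
    """Apply the named stages in order; this function never changes when stages are added."""
    for stage in stages:
        data = _REGISTRY[stage](data)
    return data


@register("drop_negatives")
def drop_negatives(values: List[float]) -> List[float]:
    return [v for v in values if v >= 0]


@register("double")
def double(values: List[float]) -> List[float]:
    return [v * 2 for v in values]
```

Adding a new stage means writing one more decorated function; `run_pipeline` and the existing stages stay untouched.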

As for scalability, existing cloud service providers (Microsoft Azure, AWS etc.) all provide scalable infrastructure that can gradually accommodate more demand as more companies migrate their services to cloud-based platforms for better performance, maintainability and cost. A new computing concept, serverless or function computing, is gaining ground in the cloud service territory because of its "infinite scalability", which means a service can adapt to handle whatever number of requests it is actually facing, freeing developers from worrying about infrastructure and codebase changes as the number of requests increases.

General and Concrete Scenarios

In this part, the scenario-based analysis method is applied to the aforementioned pairs of quality attributes to create one general scenario and two concrete scenarios for each attribute.


General scenario for safety

| Portion of Scenario | Possible Values |
| --- | --- |
| Source | Internal or external to the system |
| Stimulus | Events that are not intended to cause damage but will affect the system |
| Artifact | System or one or more modules of the system |
| Environment | A portion of the system might be influenced by the event |
| Response | Detect the event: detect the event’s occurrence; analyze the affected modules; notify related entities |
| | Avoid or minimize the damage: disable or isolate affected modules; deploy backup assets to restore functionality; switch into a degraded mode; be offline during repair; restore to normal after the damage is fixed |
| | Avoid future occurrence: find the vulnerabilities; report the vulnerabilities; fix the vulnerabilities |
| Response Measure | Time to detect the event; time to notify the related entities; time to mask affected modules; time to restore functionality; time to repair critical modules; estimated damage from the accident; time to find the vulnerabilities; time to fix the vulnerabilities |


Concrete scenarios for safety

| Scenario | Portion of Scenario | Value |
| --- | --- | --- |
| 1 | Source | A construction team unrelated to the system |
| | Stimulus | Cuts a network wire that connects the system to the Internet by mistake |
| | Artifact | Network module |
| | Environment | The affected network module is using the wire when the event occurs |
| | Response | The module detects the event and switches to another router, bypassing the broken wire |
| | Response Measure | Detection and reaction take only 30 s before the system is back to normal, during which throughput drops by 5%. |
| 2 | Stimulus | Unexpected earthquake |
| | Artifact | A disaster control system for a nuclear power plant |
| | Environment | The earthquake damages containers and causes radioactive materials to leak. |
| | Response | Detect the leak, initiate the early emergency process (like shutting down the related reactors), notify authorities |
| | Response Measure | Detection, initiation and notification take only 3 min. Leaked materials are kept under control and won’t cause a severe biohazard. |
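The first concrete scenario (switching to another route when a wire is cut) might look roughly like the following sketch. It assumes each route is a callable that raises `ConnectionError` when its link is down; `send_with_failover`, `broken_primary` and `backup` are hypothetical names used only for illustration.

```python
from typing import Callable, Sequence

Route = Callable[[bytes], str]


class AllRoutesFailed(Exception):
    """Raised when every configured route has failed."""


def send_with_failover(payload: bytes, routes: Sequence[Route]) -> str:
    """Try each route in order and return the first successful result."""
    failures = 0
    for route in routes:
        try:
            return route(payload)
        except ConnectionError:
            failures += 1  # detection: this route is unusable, move to the next one
    raise AllRoutesFailed(f"{failures} route(s) failed")


def broken_primary(payload: bytes) -> str:
    # Stands in for the link whose wire was cut.
    raise ConnectionError("wire cut")


def backup(payload: bytes) -> str:
    # Stands in for the redundant route.
    return f"sent {len(payload)} bytes via backup"
```

For example, `send_with_failover(b"ping", [broken_primary, backup])` falls through to the backup route, which is the redundancy the scenario relies on.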


General scenario for security

| Portion of Scenario | Possible Values |
| --- | --- |
| Source | Internal or external to the system |
| Stimulus | Intentional attacks on the system |
| Artifact | System or one or more components |
| Environment | The system does not foresee the attack |
| Response | Detect the attack: detect the attack’s occurrence; analyze the damage; report the attack |
| | Maintain the properties: disable compromised modules; deploy backup assets; lock critical modules and data; remove attackers from the system; switch to degraded mode; restore to normal after the attack is resolved |
| | Avoid future occurrence: find the vulnerabilities; report the vulnerabilities; fix the vulnerabilities |
| Response Measure | Time to detect the attack; time to switch to degraded mode; time to secure critical modules and data; time to remove or block the attackers; time to deploy backup assets; estimated damage of the attack; time to find the vulnerabilities; time to fix the vulnerabilities |


Concrete scenarios for security

| Scenario | Portion of Scenario | Value |
| --- | --- | --- |
| 1 | Source | A malicious hacker team |
| | Stimulus | A well-coordinated, massive DDoS attack |
| | Artifact | Some part of the system that can barely withstand the attack |
| | Environment | The hackers have initiated a massive DDoS attack on the system |
| | Response | Detect the attack; detect and block attack sources; disable compromised modules; deploy backup server resources; notify the security team. |
| | Response Measure | The attack is detected and reported within 30 s. 80% of attack sources are blocked after 30 minutes. The system is back to normal within 1 hour. |
| 2 | Source | A lone-wolf hacker |
| | Stimulus | An unauthorized entry into a critical database |
| | Artifact | A database that contains critical and confidential data |
| | Environment | The database is operational |
| | Response | Detect the entry; block the entry; report the event to the authorities; report the vulnerabilities the intruder used. |
| | Response Measure | Detection, blocking and reporting take 15 seconds and no data is leaked. The vulnerabilities are fixed within 1 day. |
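For the DDoS scenario, "detect and block attack sources" is often approximated by per-source rate limiting. The sketch below assumes a sliding-window limit keyed by source address; the class name, the parameters and the explicit `now` argument (passed in so the logic can be checked without real clocks) are illustrative assumptions.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per source within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # source -> timestamps of accepted requests

    def allow(self, source: str, now: float) -> bool:
        hits = self._hits[source]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # this source exceeded its budget: block the request
        hits.append(now)
        return True
```

A real deployment would usually enforce this at the network edge rather than in application code; the sketch only shows the counting logic.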


General scenario for extensibility

| Portion of Scenario | Possible Values |
| --- | --- |
| Source | End user, developers, requestor |
| Stimulus | A directive to add/delete/modify functionality of the existing system |
| Artifact | Code, data, interfaces, components, resources, configurations… |
| Environment | Business analysis time, runtime, compile time, build time, initiation time, design time, test time |
| Response | Understand the extension; design the extension; make the extension; test the extension; deploy the extension |
| Response Measure | Time and material cost of communicating, understanding, designing, making, testing and deploying the system extension |
| | Time and material cost of reeducating users or other stakeholders after the system extension |
| | Other affected modules that were not originally anticipated |
| | Potential time and material cost of newly introduced defects |


Concrete scenarios for extensibility

| Scenario | Portion of Scenario | Value |
| --- | --- | --- |
| 1 | Stimulus | Wish to add new functionality to an existing website |
| | Environment | Design time, business analysis time |
| | Response | Understand, design, make, test and deploy the new requirement |
| | Response Measure | All changes are deployed in 3 days but introduce 70 bugs, which take 3 more days to resolve. New tutorials are written to teach end users about the new functionality. |
| 2 | Stimulus | Wish to change the provider of a service that depends on third-party services |
| | Environment | Design time, runtime, test time |
| | Response | Design, make, test and deploy the new requirement |
| | Response Measure | All changes are made in 1 day. Only 1 module is affected, and no new defects are introduced. |


General scenario for scalability

| Portion of Scenario | Possible Values |
| --- | --- |
| Source | End user, developers, requestor |
| Stimulus | The need to use the existing system, with minimal change, to handle a different number of requests from what it was originally designed for |
| Artifact | System or one or more components in the system |
| Environment | System’s operation mode |
| Response | Evaluate possibilities and potential changes; make the change, if necessary; process requests |
| Response Measure | Latencies at different levels of load; maximum and minimum number of requests; the improvement brought by scaling; the cost of expanding or shrinking the scale; the cost of changing the existing system to adapt; the cost of resolving defects and interference when scaling |


Concrete scenarios for scalability

| Scenario | Portion of Scenario | Value |
| --- | --- | --- |
| 1 | Stimulus | Wish to handle 3 times more requests per unit of time than originally planned |
| | Artifact | The whole system |
| | Environment | The system has not gone online yet. |
| | Response | The system is evaluated as capable of accommodating the increase in requests, so no changes need to be made. |
| | Response Measure | The maximum number of requests is 5 times more than planned, and latency increases by only 10% after the 3 times increase. No additional cost is needed. |
| 2 | Source | Developer, end user |
| | Stimulus | The need to improve the calculation precision of a complex algorithm within the original time |
| | Artifact | The calculating module |
| | Environment | The system is already operational. |
| | Response | 800 more CPUs are added to the mainframe, as the evaluation indicates; no other changes are required. |
| | Response Measure | The process doesn’t interfere with normal operation. No cost or change other than the CPUs is needed. The precision improvement increases sales of the system by 30%. |

Strategies and Tactics

In this part, strategies and tactics to improve each quality attribute (QA) are presented, along with their benefits and their penalties for other attributes.


Strategies and tactics for safety

| Strategy | Tactic | Description | Benefits | Penalties |
| --- | --- | --- | --- | --- |
| Avoid | Redundancy | Introduce redundant assets into the system | Increases robustness and security; avoids a single point of failure | Increases cost; complicates the architecture design |
| | Avoid risky design | Avoid architecture designs that have a high possibility of causing problems in the future | Reduces the possibility of problems occurring | Rules out decisions that could be beneficial from other perspectives |
| Detect | Designated monitoring system | A complete and separate system that constantly monitors the critical aspects of the system | Provides accurate, complete data and error reports in time without interfering with the original system | Increases cost; a new point of failure to keep watch on |
| | Heartbeat | The system sends a signal at a fixed time interval to report its status | Easier to implement and integrate; detects events in time | May affect system performance |
| Handle | Degrade | Limit the system’s functionality to limit the potential damage | Maintains basic functionality while the problem is handled | Affects user experience during degradation |
| | Disable affected modules | Disable the affected modules completely, fix them and then go back to normal | Completely avoids further damage and allows a quick fix | Might cause a complete breakdown of a function |
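The heartbeat tactic listed above can be sketched as a small monitor that records the last signal from each component and reports the ones that have gone silent. This is a minimal illustration; `HeartbeatMonitor`, its methods and the explicit `now` argument are assumptions made for the example.

```python
class HeartbeatMonitor:
    """Track the last heartbeat per component and flag the ones that went silent."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}  # component name -> time of its last heartbeat

    def beat(self, component: str, now: float) -> None:
        """Called by a component at every interval to report that it is alive."""
        self._last_seen[component] = now

    def missing(self, now: float) -> list[str]:
        """Return components whose last heartbeat is older than the timeout."""
        return sorted(
            name
            for name, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

The components named by `missing` are candidates for the "Handle" tactics, such as degrading or disabling the affected module.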


Strategies and tactics for security

| Strategy | Tactic | Description | Benefits | Penalties |
| --- | --- | --- | --- | --- |
| Avoid | Test | Find and fix as many vulnerabilities as possible before putting the system into use | Avoids later, and usually greater, damage from a security breach at runtime | Increases development time and material cost |
| | Simplify design | Simplify the architecture design to avoid vulnerabilities that come with unnecessary components | Avoids vulnerabilities and saves resources | Might be negative for other QAs like modifiability and extensibility |
| | Add security strategies | Add stricter security methods (like 2-step auth) to protect the system from being hacked | Increases the cost of hacking, reducing the hacker’s benefit and interest | Increases complexity; negative impact on usability and efficiency |
| Detect | Log | Log all entries to protected areas | Easy to integrate and implement | Might not be effective against well-prepared attacks; too many entries to check |
| | Report | Report suspicious and abnormal operations | Reduces the work of checking all the logs | Some operations might be mistakenly ignored |
| Handle | Isolate or shut down compromised modules | Isolate or shut down the compromised modules to limit the damage | Completely avoids further damage | Might cause a complete breakdown of a function |
| | Delete critical data | Delete critical and confidential data, if backed up, to avoid data leaks | Avoids data leaks | Not applicable if no backup is available |
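The "Log" and "Report" tactics above work together: every access attempt is recorded, and only the suspicious ones are surfaced. The sketch below assumes a very simple rule (too many denied attempts per user); `AuditLog`, `AccessEntry` and the threshold are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessEntry:
    user: str
    resource: str
    granted: bool


class AuditLog:
    """Record every access attempt and report users with many denied attempts."""

    def __init__(self) -> None:
        self._entries = []

    def record(self, user: str, resource: str, granted: bool) -> None:
        # Log tactic: every entry is kept, whether it was granted or not.
        self._entries.append(AccessEntry(user, resource, granted))

    def entries(self) -> tuple:
        return tuple(self._entries)

    def suspicious_users(self, max_denied: int) -> list[str]:
        # Report tactic: surface only users above the assumed threshold.
        denied = Counter(e.user for e in self._entries if not e.granted)
        return sorted(user for user, count in denied.items() if count > max_denied)
```

A fixed threshold like this also shows the penalty named in the table: an attacker who stays under it is not reported.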


Strategies and tactics for extensibility

| Strategy | Tactic | Description | Benefits | Penalties |
| --- | --- | --- | --- | --- |
| Improve inner architecture | Split by function | Split a large system by function so that each function can run individually | Adding or modifying a function won’t affect existing ones | Needs careful design; might not be the most efficient and performant |
| | Constant refactoring | Constantly refactor the architecture as development goes on, instead of relying on an unrealistic “perfect” architecture | A good balance between cost and quality within a development cycle | High skill requirements for developers and teams |
| Improve outer interface design | Expose only necessary interfaces | Expose only the necessary APIs | Increases implementation flexibility; reduces interface changes; improves security | Reduces flexibility of usage; hard to determine the “necessity” of an interface |
| | Do one thing, do it well | An interface should focus on one small piece of work and do it well | Improves usability, implementation flexibility and interoperability; also helps scalability | Needs careful design |
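The two interface tactics above can be illustrated with a narrow interface that exposes a single operation, so that a provider can be swapped without touching callers. This is a sketch under assumed names (`Notifier`, `EmailNotifier`, `SmsNotifier`, `notify_all`); delivery is simulated rather than performed.

```python
from typing import Iterable, Protocol


class Notifier(Protocol):
    """The only operation callers may rely on."""

    def send(self, recipient: str, message: str) -> bool: ...


class EmailNotifier:
    def send(self, recipient: str, message: str) -> bool:
        return self._deliver(f"To: {recipient}\n\n{message}")

    def _deliver(self, raw: str) -> bool:
        # Private detail (simulated): free to change without affecting callers.
        return len(raw) > 0


class SmsNotifier:
    MAX_LENGTH = 160  # assumed message limit for this illustration

    def send(self, recipient: str, message: str) -> bool:
        if not recipient.startswith("+"):
            return False  # reject numbers without a country prefix
        self._last_body = message[: self.MAX_LENGTH]
        return True


def notify_all(notifier: Notifier, recipients: Iterable[str], message: str) -> int:
    """Return how many recipients were notified; works with any Notifier."""
    return sum(notifier.send(recipient, message) for recipient in recipients)
```

Because `notify_all` depends only on `send`, replacing `EmailNotifier` with `SmsNotifier` affects a single module, which is the outcome described in the second concrete extensibility scenario.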


Strategies and tactics for scalability

| Strategy | Tactic | Description | Benefits | Penalties |
| --- | --- | --- | --- | --- |
| Split | Split by responsibility | Split a system by responsibility into different layers (data access, calculation, view etc.) | Each layer can be optimized for its own characteristics; easy to scale each layer accordingly | More complicated architecture design; more time and material cost |
| | Partition database | Partition the database so that the pressure on it can be “divided and conquered” | More throughput and scalability from the database | Complicated architecture design; not always applicable; inappropriate partitioning may lower performance |
| Make use of cache | Deliver static contents from cheap sources | Split static contents out of the dynamic parts and deliver them from cheaper and more scalable sources (like a CDN) | Reduces server pressure and makes the most of precious computing resources | Data synchronization might be a problem |
| | Use in-memory database as cache | Use an in-memory database (like Redis) to avoid frequent access to the actual database | Reduces access to the database; improves performance and responsiveness | A new layer to worry about; more complicated architecture design |
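The in-memory cache tactic above is commonly implemented with a cache-aside pattern. In the sketch below a plain dictionary stands in for an in-memory database such as Redis, and `load_from_db` stands in for the real database query; both, along with the class name, are assumptions made for illustration.

```python
from typing import Callable, Dict


class CachedStore:
    """Cache-aside: look in the cache first, fall back to the database on a miss."""

    def __init__(self, load_from_db: Callable[[str], str]) -> None:
        self._load = load_from_db
        self._cache: Dict[str, str] = {}  # stand-in for an in-memory database
        self.db_reads = 0  # counts how often the real database was hit

    def get(self, key: str) -> str:
        if key in self._cache:
            return self._cache[key]
        self.db_reads += 1
        value = self._load(key)
        self._cache[key] = value
        return value

    def invalidate(self, key: str) -> None:
        """Must be called when the underlying data changes, or reads go stale."""
        self._cache.pop(key, None)
```

Repeated reads of the same key reach the database only once. The need to call `invalidate` after an update is the synchronization cost that the table lists as a penalty of caching.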

